This lesson is about Materialized View and Secondary Indexes. What are they and when should you use them? It includes examples, limitations, best practices, and an explanation of how it works under the hood.
My name is Moreno I’m from the Solution Architects team most of you guys if you’re on slack we probably had one or more iterations you know so there are some points I would like to go into before we go into materialized views and indexes so what I tell you guys all the time when we’re on slack usually people I see on the development phase or going through Q&A always the first designing principle for you know data modeling at ScyllaDB is start with your queries right so don’t try to come up with the data model and then okay let’s see how I’m going to insert this data how I’m going to query no you start with the queries and then go the other you know start your data modeling the second thing is most times when there’s a lot of compartmentalization you know in a company I see that the DBA guys are going crazy because the developers are not talking to them and then they start seeing problems with the database and and then we tell them hey apparently you have a problem with a large partition for example they go to the developers and you know all hell breaks loose so by all means make sure you guys are talking between each other and third of course test test test make sure you’re using you know user profile on cassandra stress or you’re using your own application on a development environment so you know you’re doing things properly and you can detect problems before they happen in production
I know everybody here knows what I’m talking about it is not new but more times than I would like in the field it goes back to one of those problems because people didn’t do one those steps so our VP of engineering last year when he delivered his presentation at the summit you know his key point was we are trying to make ScyllaDB boring and of course we are talking about after it is in production in your application is running before that I want you guys to have fun with ScyllaDB and part of that is going through you know the hands-on exercises
I’m going to cover a lot during this talk you have a pretty good idea of what’s going on what is this all about but you only understand that if you go through the hands-on exercises and I put a lot of my time there to make sure it’s good and it’s covering all the use cases so this is the agenda for for today materalized views secondary index filtering filtering with indexes and what should I use so going to the definition a view is a table containing a copy of the results of some query performed on a base table the main takeaway is a materialized view is a table it’s not you know something that it’s attached to your table it’s a different table so these are the use cases so basically you know denormalization because everybody that used Cassandra used ScyllaDB sometimes you know you have to denormalize you have to create two tables that have some overlay, overlapping data so materialized views and secondary indexes are above all a convenience you know especially for developers so but again try not to forget that they are convenient but you should know what you’re doing and that’s the reason we are having this session so you know what are the shortcomings and or the good use cases for that
So different sorting order and precomputed queries so to apply restrictions to the base table values so here’s a base table example so you have a base table the name of the buildings is the partition key the cities the clustering key and all the others are regular keys you just had Tzach’s session I’m assuming everybody knows what I’m talking about when we say partition key clustering key and regular columns right so on the first query that you see there select from building where building is that name it’s going to work if you try to query by country which is the third column which is a regular column it’s going to error out it’s going to tell you if you want to do this use allow filtering so you have to add the key words allow filtering to the query but then we are basically telling you that performance will be unpredictable because it’s not indexed right so sometimes it would be a full table scan ScyllaDB is going to read all your records and then apply whatever rule you put there like for example country but then we can create a materialized view so and by the way we are going to go into the syntax a little later but that’s the idea there
So first you have a base table and then we created a materialized view called building by country and we made the country the partition key for that table for that materialized view so then what you can do instead of querying your base table for the country you do that on the materialized view so it’s pretty simple it’s pretty intuitive there wasn’t a single instance where I explained it and people said oh I cannot understand it’s pretty straightforward right this is the syntax what I did is for every example that we are putting here especially for syntax I’m putting a link to the documentation because it doesn’t make sense to put all the different commands and options it’s a lot of stuff
This is the basics this is for a subset of the data sometimes people do that so let’s say you have one city that you want to put on a materialized view I don’t know the reason I’m not the developer and the applications not mine usually it’s the other way around it’s you guys that tell me why you’re doing things on a certain way but it’s a valid valid use case so here we are making the height a clustering key so now we can apply restriction or changes of heights and use aggregates or different ordering and by the way one thing that I did here is you see that the examples are not consistent so before I was talking about building by country now I’m talking about building by city is just that I don’t want you guys to get like into the bad habit of just you know following the example so I’m trying to play a little bit with them this is the example that Tzach discussed before for the pet clinic and you can see that the base table has pet_chip_id and the partition key and time has the clustering key and then we create a materialized view and this is going to be in a different order because now we are making heart rate clustering key and it’s before time so when you create that materialized view the ordering will be different even though you kind of have the same data right okay
Some limitations we have on materialized views so there must be a one-to-one correspondence between view rows and base rows if you are to omit the base table primary key then we have no way of knowing how they are related right so if you if you’re creating a materialized view you necessarily have to include the partition key from the base table and you can use one of the regular columns as part of your clustering key or your partition key this is the the insert path so the write path when you have a materialized view so again we have a base table when you insert a record on the base table and ScyllaDB sees that there’s a materialized view created it’s automatically going to create a record on the materialized view so that’s the convenience you don’t need to have your application writing to two different tables because when you write to the base table it’s going to write to the materialized view and by the way if you create the materialized view after you already have a base table because this case let’s say the table is empty right but if you already have like a million records on the base table when you create a materialized view it’s going to populate the materialized view so this is another thing to have in mind that if you’re creating a materialized view do it off hours you know plan ahead because it might take a while depending on how much data do you have on the base table and then on the read path this drives me nuts instead of reading from the base table your application is going straight to the materialized view so it’s one hop you just go to your materialized view you’re good
The secondary indexes are different so that’s in by the way that’s the main difference between materialized views and indexes indexes will use materialized views underlined but your queries are made on the base table and with materialized views you go straight to the materialized view pretty simple right okay so this is this is how a write happens so you’re writing to to the base replica and then it’s going to create an update or an insert on the materialized view right let me get there so all of that is done asynchronously so it means that if your cluster is overloaded it might be that the base table is updated or the record is created but the view table it’s not because if the cluster is overloaded it’s going to prioritize base table writes and it’s going to do the view updates later so and that’s exactly to avoid impacting the the base table availability the consistency level is not enforced in terms of let’s say you have consistency level quorum on a three node cluster so you need two acts before it says that that records is recorded right so that is just for the base table it’s not going to wait for to achieve consistency level for the view updates
So again this is to talk a little bit about the read before write pattern that is in a materialized view so let’s say we have this base table and this is the simplest it can get right so it’s a partition key a clustering key in a regular column with some value right when you create a materialized view you are making that that value your partition key so that’s the most simple example that I can give you so let’s say we have one record with you know 0 1 and 8 so the 8 V there it’s going to be the partition key on the materialized view right so when I update that value and I’m changing from 8 to 10 it means that in the materialized view the partition key eight won’t exist anymore so I have to delete it or tombstone it and then insert the value 10 right why we do that it’s reading from the base table before it tombstones and create a new value and we have a lock there so because if you’re having concurrent updates you don’t want to you know lose your updates to the materialized view so again that case that I mentioned for the read before write is because of concurrent updates and so we make sure you know concurrent updates are handled with a lock basically when you create a materialized view it’s going to build that materialized view if you already have data on the base table of course it’s resilient to node restarts as I told you it’s another table so you don’t have to you know overthink it it’s just a different table we are going to build it for you and there are we’re mentioning too but there are three ways to see the progress right so you can query the system_distributed.view_build _status you can use nodetool viewbuildstatus but I keep telling people just look at the monitoring it’s there I prefer to look at the monitoring because you know you don’t have to to run the command every five minutes you just put on the screen and you know go make some coffee
The exercises that I mentioned there on this link so it’s basically a gist on my github and you see that there’s you know creating docker.MD this is for you to create a docker container to you know make sure you have a ScyllaDB running on your then there’s some exercises for the base table they’re kind of nice even for for people with some experience because um there’s some tricky questions there and some I try to make it interesting and then there’s the materialized views exercises we are covering all the use cases that I could think of if you have a better one that I didn’t think about by all means send me a message you can make comments there on gist just let me know and I’ll add it but I used you know different partition clustering keys you know using different regular columns different sorting order everything I could think of I put an exercise there try not to copy and paste you know try to do it yourself and then use my examples there just make sure you know you did it right if if there’s something you’re not able to follow or it’s too hard by all means let me know but I think it’s pretty easy to follow I I tested with you know some of my friends and it’s fine so global secondary indexes again an index is a table containing a copy of the key of a base table with a single partition key corresponding to the index column again when we say it like that it’s kind of what but let’s let’s keep going you know it’ll be clear
So here we have a base table and we are creating not a materialized view we’re creating an index but Moreno every index has an underlying materialized view yes correct but again you remember that previously when we were looking at materialized views your application when it was querying the data it needed to be aware that there was a materialized view and query straight to the materialized view in this case when you create an index then the application doesn’t need to be aware because for all purposes it’s querying the in the base table right and then ScyllaDB just do an implicit drawing in terms of it’s going to look to the index it’s going to get the data from the base table and we’ll give you back the information so this is an example for city so in this case we have just a partition key and then we create an index which the city is the partition key pretty simple both of those queries should work if you have the index and notice that you know you’re querying buildings their, not building by city you could if you want you can query the underlying materialized view but I don’t see the point because it’s just easier you know to let your application to think it’s querying the base table, syntax I already provide the link on the previous one it’s all one document again as I mentioned when you create an index it’s creating an underlying materialized view this is the right path so when you insert into buildings it’s going same thing that we saw for the materialized view right it’s going to go to the base table
ScyllaDB will create the corresponding record on the materialized view but when we go to the read case it’s going to, when you query for buildings and you restricting by city ScyllaDB detects okay I have an index for this column so it’s going first the materialized view to find that particular record using the city restriction and then it goes to the base table it doesn’t implicit join and gives you back your information again we were talking about global indexes and I want you to take a very good look at this right so because we are indexing a regular column now we don’t have anymore the same locality right because it’s it’s going to be a different token a different token range because you’re using city so it will probably reside in a different node so every time you are using a global index you know you’re going to have two hops so this is going to have some overhead in performance right so because of these two hops then how we can make the query a single hop right so there’s two things to that and here’s now we are talking about local secondary indexes so local secondary indexes are indexes that contain the original partition key it’s as simple as that that’s the difference
Global one you’re using any column or any regular column and you’re not including the original partition key for local secondary index you’re including that it’s that simple if it’s using the same the same partition will have the same token if it has the same token it’s going to be the same node the same shard you guys know the drill so in this case by the way so you have a coordinator right, your query is hitting the coordinator let’s say it’s node seven and then both of those queries that I mentioned before the one that goes to the materialized view and the one that goes to the base table are in the same node in this case node 2 right but there’s a way it can be even better if yeah if you’re using a token aware driver the Java one or the Go one because then your coordinator and your replicas are on the same node so it gets much better than if you’re not using the shard aware driver let me go back here again you know this is the example we are doing right now it’s a restaurant you’re creating an index on you still use location but you’re adding dish_type so that’s the difference keep that in mind you’re still you are keeping the partition key from the base table that’s it it’s not try not to overcomplicate it filtering okay that’s error that I mentioned before everybody who ever tried querying for something ordered in a primary key saw this error at least once, right? so filtering so query restrictions that may need filtering non key fields just as a previous example parts of primary key there are not prefixes partition keys with something other than with equality relations so when you’re trying to you know do ranges and stuff and clustering keys before range restriction and then by other conditions our implementation it’s called coordinator side filtering it retrieves all has if restrictions do not exist and then the results are filtered and the mismatch rows are dropped what what do I mean by that so let’s say it didn’t apply any restrictions on primary key for example what is is it going to do it’s going to read the entire table but if you apply the restriction to primary keys let’s say you’re using an in clause with you know ten different partition keys it’s going to fetch those ones and then it’s going to apply the other conditions
So if you’re restricting by the partition key in some other condition it’s going to do that but first it’s going to fetch everything if you didn’t put a good restriction okay so in this example we have restriction by clustering key and by two other columns right so first it’s going to get all the rows using c2 and then it’s going to filter the rows that match the V 1 and V 2 equality so let me just go here and this is where I think it gets interesting because and this is kind of the hard part for me and that’s the reason I put some exercises about that to you but here’s how we do let’s say you have one base table you have two materialized views that you created you have two global indexes and two local indexes all on the same base table okay when are you going to query the materialized view anyone when you query the materialized view if you if you query the base table you’re never going to the materialized view so that’s the first rule the second rule is if if you have a partition key it’s going to try first the local secondary indexes and out of multiple local secondary index it’s going to pick the first one in your schema so let’s say you for you created two local secondary index you know local secondary index a local secondary index B if you can have values for both of them it’s always going to A
So keep in mind that materialized views indexes filtering they are not silver bullets I talked about this in the beginning let me talk again the first design principle in NOSQL is start from your queries and then design your data model the reason and you know trying to emphasize this is because let let me give you a horror story from the field so we have a customer in Brazil someone from a vendor or something got access to the database and they were trying to to create some reports so what the some brilliant guy there did was to create a materialized view which the country code was the partition key and they had like 40 different countries and then you can imagine what it did to the partitions on that materialized view right, and then it was impacting production that guy shouldn’t even have access to the database but it was horrible we suffered for like three days until thanks to our large partition detector in this low query detector we were able to pinpoint the IP from the machine and they someone paid a visit to a guy I don’t think it was pleasant but to keep in mind ok this is a convenience I want to use global index local secondary index materialized views keep in mind that it’s the same design principles if you don’t have enough cardinality in your partition key you have a problem if your query is not selective enough you’re going to have a problem ok so keep in mind do I have enough cardinality and selectivity how big are my partitions going to be how many rows and how could I design my data model to be more efficient for my queries so these are some principles based on what I just mentioned you know so always keep in mind cardinality and selectivity I think this this is going to be more clear when you see Sarna’s presentation about performance he will show you that sometimes it’s okay to use allow filtering and you don’t need a secondary index depending on how you’re querying but I don’t want to give more spoilers on on his presentation