A walkthrough of the Basic Data Modeling Lab – part 2. You can find and run the lab in the Basic Data Modeling Hands-on Lab topic.
That brings me to
the clustering key.
Okay, so in the example we just saw,
we had
one column for the primary key,
and that column was the partition key.
So we did not define the clustering key.
If you remember, the clustering key is optional, it’s
not mandatory like the partition key.
And if you remember when I talked about data
modeling in NoSQL in general,
I don’t just think about the data.
I think at the very beginning about
the queries that I want to perform.
So this is what I’m going to do in this part of the lab.
Let’s say that
the query that I want to run is: I want to select the heart rate
of a specific pet, but I want to define a time range.
So I want to define between one time and another,
say, in the last 3 minutes or in a specific time range.
And of course, this relates to the problem
that we just had: we want each
pet to have more than one heart rate reading.
Right, so we don’t want the data to be overwritten.
So, what are we going to do in this case?
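In CQL, the query I have in mind might look roughly like this, using the heartrate table from the first part of the lab; the chip ID and the timestamps here are just placeholder values:

```cql
-- Heart rates for one pet within a time window. Note that in
-- the original table, 'time' is a regular column, not part of
-- the primary key.
SELECT heart_rate FROM heartrate
WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23
  AND time >= '2023-05-01 12:00:00'
  AND time <= '2023-05-01 12:03:00';
```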
Let me just try to run this query.
And as you can see,
if I try to run this query,
I get this error message.
And the error message tells me that
it’s actually not really an error, it’s more of a warning
that tells me, listen, you can run this query, but
the performance is going to be unpredictable
because what is actually going
to happen is that we’re going to run a full table scan.
Now, imagine a cluster with hundreds of nodes and
a few petabytes of data.
You can understand
that we probably don’t want to run a full table scan.
So our data modeling is not so great in this case.
We can, by the way, override this warning:
if we add ALLOW FILTERING
to the command, then it’s going to run,
and it will perform a full table scan.
But again, in most cases we don’t want to do that.
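For reference, this is roughly what the override looks like; same placeholder values as before:

```cql
-- ALLOW FILTERING forces the query to run anyway;
-- ScyllaDB will scan the whole table to answer it.
SELECT heart_rate FROM heartrate
WHERE time >= '2023-05-01 12:00:00'
  AND time <= '2023-05-01 12:03:00'
ALLOW FILTERING;
```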
So we need to
use a different data
model and define our table differently.
And let’s do that.
I’m going to create the table “heartrate_v2”.
You can see it here.
It has the same columns, pet_chip_id, time
and heart_rate.
However,
this time I define the primary key
to have two columns: the pet_chip_id and the time.
Okay. In this case
the pet_chip_id
would be the partition key just as it was before.
But we also define the clustering key to be the time.
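The definition would be along these lines, with the column types guessed from part 1 of the lab:

```cql
CREATE TABLE heartrate_v2 (
    pet_chip_id uuid,
    time timestamp,
    heart_rate int,
    -- first column = partition key, second = clustering key
    PRIMARY KEY (pet_chip_id, time)
);
```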
Okay.
And if you remember, we saw what that looks like
in the diagram in the slides, and maybe I’ll show you
that again in just a bit.
Now, after we have this defined,
I’m going to
insert some rows into my newly created table,
so I’m inserting some rows.
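The inserts might look something like this; the chip ID, times, and readings are placeholder values:

```cql
-- Same partition key, different clustering key values:
-- these become three separate rows, nothing is overwritten.
INSERT INTO heartrate_v2 (pet_chip_id, time, heart_rate)
VALUES (123e4567-e89b-12d3-a456-426655440b23, '2023-05-01 12:00:00', 95);
INSERT INTO heartrate_v2 (pet_chip_id, time, heart_rate)
VALUES (123e4567-e89b-12d3-a456-426655440b23, '2023-05-01 12:01:00', 98);
INSERT INTO heartrate_v2 (pet_chip_id, time, heart_rate)
VALUES (123e4567-e89b-12d3-a456-426655440b23, '2023-05-01 12:02:00', 101);
```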
And again,
when we have a primary key
that has multiple columns, it’s
also called a “compound
primary key” or a “composite primary key”.
Basically, it means a primary key
that has more than one column.
The first part is the partition key, and the second part
is the clustering key.
Okay, so I already explained this.
The partition key determines
on which replica nodes we’re going to
place the data. And the clustering key,
by the way, is also called a “sort key”
in some other databases, like DynamoDB.
We inherited the name
“clustering key” from Apache Cassandra.
Maybe “sort key” is a better name.
But anyway, it’s called the “clustering key”
in ScyllaDB and Cassandra, and it determines
how the rows are sorted within a given partition.
Okay, and again, the clustering key
is optional, it can be zero or more columns.
So now that we have this table defined with the partition key
and the clustering key,
we inserted some data and we can query according to the time.
This is the query that we wanted to run.
And I ran a query, reading data
from this newly created table.
I provided the partition key,
which is the pet_chip_id and I gave a time range.
Okay, so between one time and another, I want all the data.
And I can see that the query succeeded,
and I got three rows in this time range.
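Against the new table, the range query might look like this; placeholder values again:

```cql
-- Partition key given as an equality, clustering key as a
-- range: the database goes straight to the right partition
-- and reads a contiguous slice of sorted rows.
SELECT * FROM heartrate_v2
WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23
  AND time >= '2023-05-01 12:00:00'
  AND time <= '2023-05-01 12:03:00';
```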
Okay, so by doing this, by changing our data model,
we solved two problems:
one is the performance problem, we don’t want to do
a full table scan, we want to have efficient queries.
And the second is that we can now have multiple values
for a single partition key.
Maybe another thing to keep in mind:
I mentioned that for every query
we need to provide the partition key,
and the partition key can be more than one column.
So in this case it was just pet_chip_id,
but it can be more than one column,
and we need to provide in the query
all the columns that are defined in the partition key.
For the clustering key, again, it’s optional, so we can
define a clustering key,
but we don’t necessarily need to provide
the clustering key as part of the query.
And again, we can read the data.
So I performed another query just to see that
the data was not overwritten this time.
Okay,
so let’s see another example;
I mentioned that both the partition key
and the clustering key can have more than one column.
Let’s see an example of that: so I define a new table
heartrate_v3.
Again, I define
the three columns: pet_chip_id, time and heart_rate,
but I add another column, which is the pet_name.
And this time I define the primary key
to have two columns in the partition key,
pet_chip_id and time, and the clustering key to be pet_name.
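In CQL, a composite partition key is written with an extra pair of parentheses; a sketch, with column types guessed from the earlier tables:

```cql
CREATE TABLE heartrate_v3 (
    pet_chip_id uuid,
    time timestamp,
    heart_rate int,
    pet_name text,
    -- the inner parentheses make (pet_chip_id, time) the
    -- partition key; pet_name is the clustering key
    PRIMARY KEY ((pet_chip_id, time), pet_name)
);
```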
Now I insert some data into the newly created table
just as I did before.
And like I just said,
if I try to perform a query
and I only provide the pet_chip_id column,
it’s going to fail.
Okay, so
we get the same error message.
In order for ScyllaDB
to know where this row exists,
on which nodes and in which partition, it needs
to have the entire partition key, which is defined as two
different columns in this table: pet_chip_id and time.
If I provide only the pet_chip_id, I
get this error message.
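The failing query, for reference; placeholder chip ID:

```cql
-- Only half of the composite partition key is restricted,
-- so ScyllaDB cannot locate the partition and warns us.
SELECT * FROM heartrate_v3
WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23;
```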
Again, I can override this warning by adding ALLOW FILTERING,
but that will do a full table scan,
and in most cases it’s not recommended.
So now if I try again,
but this time I also provide
the time column,
and the clustering key,
I perform this query and I can see that now it succeeds.
And this query is very efficient
because the database knows where my data is located.
It has the partition key and also
the clustering key.
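The succeeding query would look something like this; the chip ID, time, and name are placeholders:

```cql
-- The full partition key (both columns) plus the clustering
-- key: the database knows exactly where the row lives.
SELECT * FROM heartrate_v3
WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23
  AND time = '2023-05-01 12:00:00'
  AND pet_name = 'Duke';
```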
Another example,
let me move forward
a bit faster with this because
I want to cover some more material.
So I define a table “heartrate_v4”
and this time we’re going to have
the pet_chip_id as
the partition key.
And the clustering key
is going to have two columns: pet_name and heart_rate.
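A sketch of that definition, again with guessed column types:

```cql
CREATE TABLE heartrate_v4 (
    pet_chip_id uuid,
    time timestamp,
    heart_rate int,
    pet_name text,
    -- one partition key column, two clustering key columns;
    -- within a partition, rows sort by pet_name, then heart_rate
    PRIMARY KEY (pet_chip_id, pet_name, heart_rate)
);
```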
And now when I insert some data,
I can query
according to the clustering key columns that are provided.
Keep in mind that if we choose to use
the clustering key columns, they have to be provided
in the same order as they were defined.
So I execute the query.
I provide the partition key and the clustering key,
and I can see that this succeeds.
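Such a query might look like this; placeholder values as before:

```cql
-- Clustering columns must be restricted in definition order:
-- pet_name before heart_rate.
SELECT * FROM heartrate_v4
WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23
  AND pet_name = 'Duke'
  AND heart_rate = 100;
```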
A few points, let’s see that I covered everything.
So again, if there’s more than one column
in the clustering key, we have to use the same order.
However, it’s not mandatory to provide
the clustering key – it’s optional.
So the partition key
is mandatory, and it has to be at least one column.