A walkthrough of the Basic Data Modeling Lab – part 3. You can find and run the lab in the Basic Data Modeling Hands-on Lab topic.
Okay, maybe one last thing
relating to the clustering key,
and that would be the sorting order.
So say we’re sorting according to time,
and the clustering key is defined as time.
By default,
ScyllaDB would order the data in an ascending order,
so going from the earliest time to the latest time.
However, it’s also possible
to use a descending order instead.
Okay, so again,
we think about the query when we create our data model;
we think about the application and how we’re going to use it.
And let’s look at an example.
So say
we have the same use case,
but we know that in our application
we are usually interested in
the measurements from the last minute.
Okay, so our dog has a sensor that’s measuring the heart rate,
recording a reading, say, every second,
and that creates a lot of data.
But in our application,
we usually want to see just the last minute.
So in that case, we’re interested in a descending order
because
if you remember the diagram, let’s see if I have it here,
so this diagram, I hope you can see it.
So we have our partition key and then we have our
rows which are ordered according to the time.
So if we’re using a descending order,
it would be very easy from the partition key
to get to the first, say, three rows,
because this is the way they’re ordered in the node.
Let me go back
here.
Okay.
So again, the use case is that we have a pet
and we are interested in the last few reads.
Let me define this table.
Here we define the partition key as pet_chip_id and
the clustering key as time, just like we saw before.
But we define the clustering order to be descending
as opposed to the default, which is ascending.
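As a rough sketch, assuming the same columns as in the earlier examples (the table name, here heartrate_v5, and the column types are illustrative and may differ from the lab), the definition looks something like this:

    CREATE TABLE heartrate_v5 (
        pet_chip_id uuid,
        time timestamp,
        heart_rate int,
        PRIMARY KEY (pet_chip_id, time)
    ) WITH CLUSTERING ORDER BY (time DESC);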
Now, if I insert some rows like we did before,
I can query
using this query
and by using “LIMIT 3”,
I would get the last three rows.
Okay, so I provide the pet_chip_id,
and in a very efficient way I can see the last three rows
for this given partition.
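Roughly, with the table sketched above, the insert and the LIMIT query could look like this (the UUID, timestamp, and heart rate values are just illustrative):

    INSERT INTO heartrate_v5 (pet_chip_id, time, heart_rate)
    VALUES (123e4567-e89b-12d3-a456-426655440b23, '2023-05-14 07:26:48', 87);

    SELECT * FROM heartrate_v5
    WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23
    LIMIT 3;

Because of the descending clustering order, LIMIT 3 returns the three most recent measurements without any sorting at read time.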
Okay.
Okay, let’s see if there are any questions.
There is a good question from Joshua Michelle: “Can
you stream the data changes from Scylla?” Okay, so
there is an option to stream data changes.
There’s a feature called “CDC”;
it’s not in the scope of this talk, but if you’re interested
in that, we have a lesson with a hands-on example
on ScyllaDB University where you can see how it’s done.
So the answer is yes, it is possible.
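For reference, and assuming the table sketched earlier, CDC is enabled per table with a table property, roughly like this (see the CDC lesson on ScyllaDB University for the details):

    ALTER TABLE heartrate_v5 WITH cdc = {'enabled': true};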
There’s one that just popped in from Abner:
“Is it possible to define consistency at key space table
level, or does each client have to specify for each query?”
So it’s possible to define it, and there is a default.
You can check the default in the documentation;
I think for the CQL shell it’s ONE, but I’m not sure.
So there is a default value for all the queries,
and then you can either change that default
or define it per query,
which is also known as tunable consistency.
So the answer is yes, it’s possible.
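For example, in cqlsh the session-level consistency can be checked or changed like this (drivers also let you set it per statement):

    CONSISTENCY;          -- show the current consistency level
    CONSISTENCY QUORUM;   -- change it for the rest of this cqlsh session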
Okay, let me move forward.
I hope we have some time for more questions at the end.
So we saw that the clustering order
can be ascending or descending,
and you have to think about that
when you think about the queries in your application.
Okay, so
I showed quite a few examples, but
why is it actually important
to select the primary key, the partition key,
and the clustering key correctly?
Why do we want to do that?
So let’s define a new table “heartrate_v6”;
it has three columns and the partition key
is defined as pet_chip_id.
We do that
and insert some rows.
And this is an example we saw before: we have
just the partition key, with no clustering key defined.
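As a sketch, assuming the same column names and types as before, heartrate_v6 might be defined like this:

    CREATE TABLE heartrate_v6 (
        pet_chip_id uuid,
        time timestamp,
        heart_rate int,
        PRIMARY KEY (pet_chip_id)
    );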
So we have three partitions
right now in our database.
And if we try to query according to the time,
which is not the partition key and not the clustering key,
we’re going to get the same warning
or error message that we saw before.
This query is inefficient. Why?
Because the time
doesn’t tell our database where our data is located.
And consider the case where we have many nodes and
lots of data:
it would take quite a while to scan through the whole table,
and it would be inefficient.
So if we know that we’re going to query according to the time,
we want the time to be part of the primary key.
We can override this with ALLOW FILTERING, as I showed before.
And if we
do provide the partition
key in the query, then the query is efficient.
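To illustrate the difference with the table above (the values are illustrative):

    -- Inefficient: time is not part of the primary key, so this requires a full scan
    SELECT * FROM heartrate_v6
    WHERE time = '2023-05-14 07:26:48' ALLOW FILTERING;

    -- Efficient: the partition key tells ScyllaDB exactly where the data lives
    SELECT * FROM heartrate_v6
    WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23;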