The partition key is one or more columns responsible for distributing data across the nodes. It determines which node stores a given row.
Every table must have a partition key.
The clustering key is responsible for sorting the rows within a partition. It can be zero or more columns.
Transcript
Okay, so that brings me to the primary key.
A primary key
has two parts:
one is the partition key and the second is the clustering key.
For each table we have to define a partition key,
which has to be at least one column.
It can be one column, or it can be more.
A clustering key is optional.
And if we define a clustering key, the clustering key
can be one or more columns.
Okay.
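The rule just described can be written down as a small Python sketch. This is only an illustration of the rule, not the database's own schema machinery; the `PrimaryKey` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrimaryKey:
    """Hypothetical model of a primary key: a partition key plus an optional clustering key."""

    partition_key: Tuple[str, ...]          # required: one or more columns
    clustering_key: Tuple[str, ...] = ()    # optional: zero or more columns

    def __post_init__(self) -> None:
        # Every table must define a partition key with at least one column.
        if len(self.partition_key) < 1:
            raise ValueError("a partition key needs at least one column")

    @property
    def columns(self) -> Tuple[str, ...]:
        """All primary key columns: partition key columns first, then clustering key columns."""
        return self.partition_key + self.clustering_key


if __name__ == "__main__":
    # Partition key only (no clustering key).
    print(PrimaryKey(partition_key=("pet_chip_id",)).columns)
    # Partition key plus one clustering column.
    print(PrimaryKey(("pet_chip_id",), ("time",)).columns)
```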
I mentioned that
the partition key has to be defined for every table and, very importantly,
for every query that we perform,
we need to provide the partition key.
And if we go back to this slide, then you can understand why.
Without the partition key, the database would not know where our data is located.
So it requires the partition key in order to know
which nodes are responsible for the data.
If we don’t provide the partition key, the database would not know
and it would have to do a full table scan,
which basically means scanning all the nodes to look for the data.
And of course, that’s not very efficient.
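The difference between the two kinds of query can be sketched with a toy model in Python. This is a simplification under stated assumptions, not how the database is implemented: the three-node cluster is hypothetical, an MD5 hash modulo the node count stands in for whatever partitioner the database really uses, and `NODES`, `node_for`, and `ToyCluster` are illustrative names.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key: str) -> str:
    """Map a partition key to a node using a stable hash (toy stand-in for a real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


class ToyCluster:
    def __init__(self) -> None:
        self.storage = {node: [] for node in NODES}

    def insert(self, row: dict) -> None:
        # The partition key alone decides which node stores the row.
        self.storage[node_for(row["pet_chip_id"])].append(row)

    def query_by_partition_key(self, pet_chip_id: str):
        # With the partition key we can compute the owning node and ask only that node.
        node = node_for(pet_chip_id)
        rows = [r for r in self.storage[node] if r["pet_chip_id"] == pet_chip_id]
        return rows, 1  # (matching rows, number of nodes contacted)

    def query_without_partition_key(self, heart_rate: int):
        # Without the partition key we cannot tell where the data lives,
        # so every node has to be scanned.
        rows = [r for node in NODES for r in self.storage[node] if r["heart_rate"] == heart_rate]
        return rows, len(NODES)


if __name__ == "__main__":
    cluster = ToyCluster()
    cluster.insert({"pet_chip_id": "chip-1", "time": 1, "heart_rate": 80})
    cluster.insert({"pet_chip_id": "chip-2", "time": 1, "heart_rate": 95})
    print(cluster.query_by_partition_key("chip-1"))
    print(cluster.query_without_partition_key(95))
```

In this sketch the keyed query contacts one node regardless of cluster size, while the scan contacts all of them, which is the inefficiency described above.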
Okay, so in this example,
you can see the same columns are defined: pet_chip_id, time and heart_rate.
However, here the primary key
consists of two different columns: pet_chip_id, which would be the partition
key; and time, which would be the clustering key.
Now, the way that it works, and I’m going to show you this in a diagram
in just a second, is that the partition key determines
the partition for the row, and within that partition,
rows are ordered according to the clustering key.
Okay, so let’s see what that looks like.
You can see here we have different partitions.
Each one is identified by a pet_chip_id, and for a given pet_chip_id, like this one,
we have multiple rows, and those rows are ordered
according to the clustering key, which I believe was “time” in this case.
So it would be ordered according to the time.
We can have multiple rows and they are sorted.
And that means that if we want to access a specific row
and we provide both the partition key and the clustering key,
that query is very efficient, because the database knows where our data is.
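A minimal Python sketch of this layout, assuming an in-memory table where each partition is a list kept sorted by the clustering key. The `ToyTable` class is hypothetical and only mirrors the idea from the diagram; a real database stores its data differently. Times are plain integers here for brevity.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy table: pet_chip_id (partition key) -> rows sorted by time (clustering key)."""

    def __init__(self) -> None:
        self.partitions = defaultdict(list)  # each value is a sorted list of (time, heart_rate)

    def insert(self, pet_chip_id: str, time: int, heart_rate: int) -> None:
        rows = self.partitions[pet_chip_id]
        i = bisect.bisect_left(rows, (time,))
        if i < len(rows) and rows[i][0] == time:
            # Same primary key: overwrite, so the key still identifies a single row.
            rows[i] = (time, heart_rate)
        else:
            # Insert at the position that keeps the partition sorted by time.
            rows.insert(i, (time, heart_rate))

    def get(self, pet_chip_id: str, time: int):
        # The partition key picks the partition; the clustering key is then
        # located by binary search because rows in the partition are sorted.
        rows = self.partitions.get(pet_chip_id, [])
        i = bisect.bisect_left(rows, (time,))
        if i < len(rows) and rows[i][0] == time:
            return rows[i][1]
        return None


if __name__ == "__main__":
    table = ToyTable()
    table.insert("chip-1", 30, 80)
    table.insert("chip-1", 10, 70)
    table.insert("chip-1", 20, 75)
    print(table.partitions["chip-1"])  # rows come back ordered by time
    print(table.get("chip-1", 20))
```

Rows are inserted out of order on purpose: the partition still ends up ordered by `time`, and a lookup with both keys goes to one partition and one position within it.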