Importance of the Clustering Key

5 min to complete

Importance of the Clustering Key

What happens if our main query for the pet’s heart rate is by pet and time? We could then define the table as follows (we already created this table):

CREATE TABLE heartrate_v2 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id, time)
);

In this case, the partition key is the pet_chip_id, and the clustering key is the time. Each pet_chip_id (the partition key) would be a partition, that is, existing on a single node. Within that partition, the data would be stored according to the clustering key, which is “time.”

For example:

pet_chip_id time heart_rate
268e074a-a801-476c-8db5-276eb2283b03 2011-02-03 04:05:00 81
fead97e9-4d77-40c9-ba15-c45478542e20 2011-02-03 04:05:05 80
fead97e9-4d77-40c9-ba15-c45478542e20 2011-02-03 04:05:10 89
fead97e9-4d77-40c9-ba15-c45478542e20 …. ….
fead97e9-4d77-40c9-ba15-c45478542e20 2011-12-17 09:21:00 84
47045afb-fd11-44c6-9d0f-82428434e887 2011-02-03 04:05:00 83

Since a Partition Key and a Clustering Key are defined, each partition would possibly have more than one row. A row would be defined by the Partition Key and the Clustering Key. A single partition may have many rows, which would be sorted according to the Clustering Key:

 

Now it would be very efficient to query what was the heart rate of a specific pet at a given time (or time interval). Since you already inserted data into this table in the
Primary Key, Partition Key, Clustering Key – Part One part, e
xecute the query:

SELECT * from heartrate_v2 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time >='2019-03-04 07:01:00' AND time <='2019-03-04 07:02:00';

So what happens here:

The partition key (pet_chip_id) tells us exactly in which node and where on that node, our data is stored. So we’d know where in the cluster we can find our data. Now, In that partition, since the data is ordered according to the time, we could very quickly scan to the given time interval (the clustering key) and return the records.

Let’s see what happens if we don’t define a clustering key. Recall the previously created table heartrate_v1:


CREATE TABLE heartrate_v1 (
    pet_chip_id  uuid,
    time timestamp,
    heart_rate int,
    PRIMARY KEY (pet_chip_id)
);

This table only has the pet_chip_id as the partition key and no clustering key.

Execute the same query, on table heartrate_v1:

SELECT * from heartrate_v1 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time >='2019-03-04 07:01:00' AND time <='2019-03-04 07:02:00'; 

This query fails with the message:

InvalidRequest: Error from server: code=2200 [Invalid query] message=”Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING”

As we saw before, since the clustering key is not defined, performing this query, with ALLOW FILTERING, would be inefficient and possibly very slow. In this example, since we hardly have any data, it won’t matter, but think of a real-world case where a partition might have thousands of rows or more. 

We would still know on which partition the data exists, because we have the partition key, but we would have to do a linear scan of the entire data on that partition to find the specific row we are looking for.

fa-angle-up