Importance of Primary Key Selection

5 min to complete

Importance of Primary Key Selection

Look at the previously defined table:

CREATE TABLE heartrate_v1 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id)
);

Insert some more data:

INSERT INTO heartrate_v1(pet_chip_id, time, heart_rate) VALUES (268e074a-a801-476c-8db5-276eb2283b03, '2011-02-03 04:05:00', 81);
INSERT INTO heartrate_v1(pet_chip_id, time, heart_rate) VALUES (fead97e9-4d77-40c9-ba15-c45478542e20, '2011-02-03 04:05:05', 80);
INSERT INTO heartrate_v1(pet_chip_id, time, heart_rate) VALUES (47045afb-fd11-44c6-9d0f-82428434e887, '2011-02-03 04:05:10', 83);

After adding some data, the table includes the following:

pet_chip_id time heart_rate
268e074a-a801-476c-8db5-276eb2283b03 2011-02-03 04:05:00 81
fead97e9-4d77-40c9-ba15-c45478542e20 2011-02-03 04:05:05 80
47045afb-fd11-44c6-9d0f-82428434e887 2011-02-03 04:05:10 83

As we only defined the pet_chip_id as the Primary Key (which in this case is also the Partition Key), this is how the data would be stored in Scylla:

Let’s see what happens when we query with and without the partition key in terms of performance.

Connect to the previously created node:

docker exec -it scyllaU cqlsh

And go back to using our keyspace:

use pets_clinic;

Now, execute the query without specifying the partition key:

SELECT * FROM heartrate_v1 WHERE time='2019-03-04 07:01:999';

We get an error message:

InvalidRequest: Error from server: code=2200 [Invalid query] message=”Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING”

Since no partition key is specified, this query would perform a full cluster scan. This is slow and inefficient.

If we use any SELECT query without including the partition key, we’re going to be doing a full cluster scan. This is blocked by default. Allow Filtering bypasses Scylla restrictions on inefficient scans which can potentially create a significant load on the system. It’s easy to see why this is very inefficient and slow. Think of a production cluster with many nodes and with billions of rows.

Now let’s try the query but this time with the partition key:

SELECT * FROM heartrate_v1 WHERE pet_chip_id = 268e074a-a801-476c-8db5-276eb2283b03;

This query runs fast and returns the answer almost immediately, as we specified the partition key. Scylla knows exactly which node contains the data by hashing the ID (which is the partition key). This is quick and efficient.

fa-angle-up