Denormalization

3 min to complete

Denormalization

So what happens if we have a query that requires data from more than one table?

In the relational database world, we would use a Join statement. That is, we would combine rows from two different tables, using a related field, usually ID to get the information. 

However, Joins are not supported in Scylla (or Cassandra for that matter). What do we do then?

One option would be to do a client-side Join. This is not recommended as it is very slow and essentially requires iterating through all the information.

A better option is to denormalize. 

Note: Another option is to use Materialized Views. In essence, Materialized Views builds a new table with the indexed column as a partition key, and a user-chosen subset of columns as values. Instead of manually creating additional tables or “views” with the same data, this feature enables the Scylla server to do it automatically, safely and efficiently. In other words, the application just updates the base table, and the additional materialized-view tables gets updated automatically as needed.

If you remember at the start of the lesson we said that whereas in relational databases, data modeling is all about the data entities and the relationship between these entities, in Scylla data modeling it’s about the queries and the application. We think about that first, and base the data model around that. 

Denormalization means that we create a new table, with duplicate data, in order to answer specific queries which we would perform with Join in the relational world. You can think about this, as performing Join on write. When writing the data we duplicate it. Since writes are fast, we don’t pay a hefty price for this. Then, when we want to perform the read, we have the data right there waiting for us.

Let’s look at an example. Say we have two data entities: pets and vaccinations. 

In a relational data model we might have the following:

pets:

id name year_born
12 Rocky 2019
42 Rex 2017

vaccinations:

id pet_id vaccination
1 12 Canine Hepatitis
2 12 Heartworm
3 42 Canine Hepatitis

Now, in a relational database, if we want to get the vaccinations for a specific pet name we would perform a Join:

SELECT vaccination 
FROM pets 
JOIN vaccinations
ON pets.id = vaccinations.pet_id
WHERE name = 'Rocky'

That would result in:

id name creation_year id pet_id vaccination
12 Rocky 2019 1 12 Canine Hepatitis
12 Rocky 2019 2 12 Heartworm

In Scylla, to answer the same query (get vaccinations for a specific pet by name), we would denormalize and create an additional table: vaccinations_by_pet.

CREATE TABLE vaccinations_by_pet (
    vaccination_id uuid,
    petid uuid,
    pet_name text,
    vaccination text,
    PRIMARY KEY (pet_name, vaccination_id)
);

In vaccinations_by_pet the partition key is the pet_name and clustering key is the vaccination_id. 

Now, since the partition key is the pet_name, we can easily query the data and get all the vaccinations for that pet. By creating this additional table, we get predictable, efficient query performance when performing the read. 

To report this post you need to login first.