Denormalization

3 min to complete

Denormalization

So what happens if we have a query that requires data from more than one table?

In the relational database world, we would use a Join statement. We would combine rows from two different tables, using a related field, usually, ID, to get the information. 

However, Joins are not supported in ScyllaDB (or Cassandra, for that matter). What do we do then?

One option would be to do a client-side Join, and this is not recommended as it is very slow and requires iterating through all the information.

A better option is to denormalize. 

Note: Another option is to use Materialized Views. Materialized Views essentially build a new table with the indexed column as a partition key and a user-chosen subset of columns as values. Instead of manually creating additional tables or “views” with the same data, this feature enables the ScyllaDB server to do it automatically, safely, and efficiently. In other words, the application updates the base table, and the additional materialized-view tables gets updated automatically as needed.

If you remember, at the start of the lesson, we said that whereas in relational databases, data modeling is all about the data entities and the relationship between these entities, in ScyllaDB data modeling, it’s about the queries and the application. We think about that first and base the data model around that. 

Denormalization means creating a new table with duplicate data to answer specific queries, which we would perform with Join in the relational world. You can think about this, as performing Join on write. When writing the data, we duplicate it. Since writes are fast, we don’t pay a hefty price for this. Then, when we want to perform the read, we have the data right there waiting for us.

Let’s look at an example. Say we have two data entities: pets and vaccinations. 

In a relational data model, we might have the following:

pets:

id name year_born
12 Rocky 2019
42 Rex 2017

vaccinations:

id pet_id vaccination
1 12 Canine Hepatitis
2 12 Heartworm
3 42 Canine Hepatitis

Now, in a relational database, if we want to get the vaccinations for a specific pet name, we would perform a Join:

SELECT vaccination 
FROM pets 
JOIN vaccinations
ON pets.id = vaccinations.pet_id
WHERE name = 'Rocky'

That would result in:

id name year_born id pet_id vaccination
12 Rocky 2019 1 12 Canine Hepatitis
12 Rocky 2019 2 12 Heartworm

In ScyllaDB, to answer the same query (get vaccinations for a specific pet by name), we would denormalize and create an additional table: vaccinations_by_pet.

CREATE TABLE vaccinations_by_pet (
    vaccination_id uuid,
    pet_id uuid,
    pet_name text,
    vaccination text,
    PRIMARY KEY (pet_name, vaccination_id)
);

In vaccinations_by_pet, the partition key is the pet_name, and the clustering key is the vaccination_id. 

Since the partition key is the pet_name, we can easily query the data and get all the vaccinations for that pet. By creating this additional table, we get predictable, efficient query performance when performing the read. 

fa-angle-up