Scylla University Live | July 28 & 29 Free Virtual Training Event
Learn More >

Materialized Views and Indexes Hands-On Lab 2

13 min to complete

Materialized Views and Indexes Hands-On Lab 2
 
In this lab you’ll see Global and Local Secondary indexes in action, with example use cases and an explanation of when to use each.
Start by creating a Docker container with Scylla. This tutorial was created with version 4.2.1.
docker run --name scylla-si -d scylladb/scylla:4.2.1 --smp 2 --memory 4G

Open 3 terminals (#1 for base table, #2 for global index, #3 for local index

BASE TABLE – On terminal #1

Open a terminal to the container, run the CQL Shell, and create a Keyspace:

docker exec -ti scylla-si /bin/bash
cqlsh
CREATE KEYSPACE restaurant_chain
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Keep in mind that SimpleStrategy should not be used in production. Learn more about this in the Replication Factor lesson.

Next, create a Table and insert some data.

USE restaurant_chain;
CREATE TABLE restaurant_chain.menus (
location text,
name text,
dish_type text,
price float,
PRIMARY KEY (location, name));
INSERT INTO menus (location, name, price, dish_type) VALUES ('Reykjavik', 'hakarl', 16, 'cold Icelandic starter');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Reykjavik', 'svid', 21, 'hot Icelandic main dish');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Warsaw', 'sour rye soup', 7, 'Polish soup');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Warsaw', 'sorrel soup', 5, 'Polish soup');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Cracow', 'beef tripe soup', 6, 'Polish soup');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Warsaw', 'pork jelly', 8, 'cold Polish starter');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Ho Chi Minh', 'bun mam', 8, 'Vietnamese soup');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Da Lat', 'banh mi', 5, 'Vietnamese breakfast');
INSERT INTO menus (location, name, price, dish_type) VALUES ('Ho Chi Minh', 'goi cuon', 6, 'Vietnamese hot starter');

Now let’s run some queries:

SELECT * from restaurant_chain.menus;

But wait! This is a full table scan. This could have been a problem if we didn’t have very little data. We should always restrict our queries using partition keys.

SELECT * FROM menus where location = 'Warsaw';

If we had a lot of data this would perform very well.

But what if we wanted to query by other fields? Let’s try.

SELECT * from menus where dish_type = 'Polish soup';

Oops!

How about other fields?

SELECT * from menus where location = 'Warsaw' and dish_type = 'Polish soup';

We get the same error.

If we add “ALLOW FILTERING” to the above queries they would work. But just like our first query, because we are querying regular columns it would be a full table scan – VERY INEFFICIENT!
Indexes to the rescue!

Let’s take a look at our current schema and sstables. It will help us compare and contrast and further our understanding, later on.

DESC KEYSPACE restaurant_chain;
exit
nodetool flush
cd /var/lib/scylla/data/restaurant_chain/
ls -l */*

GLOBAL INDEX – On terminal #2

The data model in Scylla partitions data between cluster nodes using a partition key, which is defined in the database schema. This is an efficient way to look up rows because you can find the node hosting the row by hashing the partition key.

However, this also means that finding a row using a non-partition key requires a full table scan which is inefficient.

Global Secondary indexes (also called “Secondary indexes”) are a mechanism in Scylla which allows efficient searches on non-partition keys by creating an index. They are indexes created on columns other than the entire partition key, where each secondary index indexes one specific column. A secondary index can index a column used in the partition key in the case of a composite partition key.  With global indexing, a materialized view is created for each index. This Materialized View has the indexed column as a partition key and primary key (partition key and clustering keys) of the indexed row as clustering keys.

Next, create a global index on column dish_type:

docker exec -ti scylla-si /bin/bash
cqlsh
USE restaurant_chain;
CREATE INDEX ON menus(dish_type);

And we can see the created global index and the underlying Materialized View:

DESC SCHEMA;

And now we can query based on the newly created index:

SELECT * FROM menus WHERE dish_type = 'Polish soup';

With the index created, we can execute the query without further issues and receive all entries that describe a Polish soup, from all locations.

This is what happens when the query is executed:

  • The user provides query details to the coordinator node (1)
  • An indexing subquery (2) is used to fetch all matching base keys from the materialized view.
  • The coordinator uses the resulting base key set to request appropriate rows from the base table (3).

The index is global; underneath it stores base primary keys in a table, where the indexed column acts as a partition key. The key allows the index to scale properly. To avoid pitfalls, schema designers must remember that the same best practices for primary keys apply to secondary indexes columns, for example: avoid creating an index on a low cardinality column. Also, note that indexed data will be stored on a node that serves the partition key token of an indexed table. That node may not necessarily be the same node that stores base table data.

As before, let’s take a look at the schema and sstables, we will use this later on:

exit
nodetool flush
cd /var/lib/scylla/data/restaurant_chain/
ls -l */*

LOCAL INDEX – On terminal #3

docker exec -ti scylla-si /bin/bash
cqlsh
USE restaurant_chain;

Let’s consider indexing dish_type again, but this time in a local index. Our use case is being able to ask for all dishes of a given type for one specific location. Let’s start by creating a local index. By creating a local index we instruct the database to use the same partition key columns as a base:

CREATE INDEX ON menus((location),dish_type);

Now we can see the created local index and the underlying Materialized View:

DESC SCHEMA;

And now we can use the query:

SELECT * FROM menus WHERE location = 'Warsaw' and dish_type = 'Polish soup';

Another option would have been to use a global index for this, tith the global index, dish_type acting as the partition key of the index table. That would mean that even though our query would have contacted the node responsible for rows in Warsaw location, the indexed data could have been found on a node that handles dish_type = ‘Polish soup’ partition key, possibly a different node, creating the possibility of inter-node communication, which adds to query latency.

Using local indexes makes this query very efficient. The indexing table’s partition key is explicitly the same as the base, which ensures that both the extracting keys from the index and fetching the corresponding base rows happens on the same node. That’s much faster than a global query, which may involve fetching rows from other nodes.

 

 


(1) The query is translated to an indexing subquery and a base table query (2)&(3). However, both the base table and the underlying materialized view have the same partition keys for corresponding rows. That means that their data resides on the same node; there’s no third replica that stores the indexing information.

Notice that using a token-aware driver would have provided further performance benefits, read more about this here.

As before, let’s take a look at the schema and sstables, we will use this later on:

exit
nodetool flush
cd /var/lib/scylla/data/restaurant_chain/
ls -l */*

Inspecting the sstables for a better understanding

NOTE: The directories and files below are hypothetical. You should get the correct names from the ls -l */* (the last command ran in each terminal).

We will use sstabledump to converts the SSTable into a readable (JSON) file format, and inspect the results.

# BASE TABLE – On terminal #1

sstabledump menus-76bce980e9e711e9b4a6000000000001/mc-2-big-Data.db

# GLOBAL INDEX – On terminal #2

sstabledump menus_dish_type_idx_index-11248001e9e811e9b4a6000000000001/mc-2-big-Data.db

# LOCAL INDEX – On terminal #3

sstabledump menus_dish_type_idx_1_index-bac41d51e9e811e9b4a6000000000001/mc-2-big-Data.db

If you’d like to further investigate what happens when using the above queries with secondary indexes, try turning TRACING on and executing the queries again.

fa-angle-up