Materialized Views and Indexes Hands-On Lab 2

13 min to complete

Materialized Views and Indexes Hands-On Lab 2

In this lab, you’ll see Global and Local Secondary indexes in action, with example use cases and an explanation of when to use each.

Note: As noted in the previous lab, in addition to the instructions provided in this document, which allow you to run the lab on a machine with Docker, you can find this lab in the Katacoda learning environment here. The Katacoda environment provides an interactive virtual machine where you can execute all the commands directly from your browser without the need to configure anything.

Start by creating a Docker container with Scylla. This tutorial was created with version 4.3:

docker run --name scylla-si -d scylladb/scylla:4.3.0 --smp 2 --memory 4G

Open 3 terminals (#1 for base table, #2 for global index, #3 for local index

BASE TABLE – On terminal #1

Open a terminal, run the CQL Shell, and create a Keyspace:

docker exec -ti scylla-si /bin/bash
cqlsh
CREATE KEYSPACE restaurant_chain
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Keep in mind that SimpleStrategy should not be used in production. Learn more about this in the Replication Factor lesson.

Next, create a Table and insert some data.

USE restaurant_chain;
CREATE TABLE restaurant_chain.menus (
city text,
name text,
dish_type text,
price float,
PRIMARY KEY (city, name));
INSERT INTO menus (city, name, price, dish_type) VALUES ('Reykjavik', 'hakarl', 16, 'cold Icelandic starter');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Reykjavik', 'svid', 21, 'hot Icelandic main dish');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Warsaw', 'sour rye soup', 7, 'Polish soup');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Warsaw', 'sorrel soup', 5, 'Polish soup');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Cracow', 'beef tripe soup', 6, 'Polish soup');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Warsaw', 'pork jelly', 8, 'cold Polish starter');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Ho Chi Minh', 'bun mam', 8, 'Vietnamese soup');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Da Lat', 'banh mi', 5, 'Vietnamese breakfast');
INSERT INTO menus (city, name, price, dish_type) VALUES ('Ho Chi Minh', 'goi cuon', 6, 'Vietnamese hot starter');

Now let’s run some queries:

SELECT * from restaurant_chain.menus;

But wait! This is a full table scan. This could have been a problem if we didn’t have very little data. We should always restrict our queries using partition keys.

SELECT * FROM menus where city = 'Warsaw';

If we had a lot of data, this would perform very well.

But what if we wanted to query by other fields? Let’s try.

SELECT * from menus where dish_type = 'Polish soup';

Oops!

How about other fields?

SELECT * from menus where city = 'Warsaw' and dish_type = 'Polish soup';

We get the same error.

If we add “ALLOW FILTERING” to the above queries, they will work. But just like our first query, because we are querying regular columns, it would be a full table scan – VERY INEFFICIENT!
Indexes to the rescue!

Let’s take a look at our current schema:

DESC KEYSPACE restaurant_chain;

GLOBAL INDEX – On terminal #2

The data model in Scylla partitions data between cluster nodes using a partition key, which is defined in the database schema. This is an efficient way to lookup rows because you can find the node hosting the row by hashing the partition key.

However, this also means that finding a row using a non-partition key requires a full table scan which is inefficient.

Global Secondary indexes (also called “Secondary indexes”) are a mechanism in Scylla which allows efficient searches on non-partition keys by creating an index. They are indexes created on columns other than the entire partition key, where each secondary index indexes one specific column. A secondary index can index a column used in the partition key in the case of a composite partition key.  With global indexing, a materialized view is created for each index. This Materialized View has the indexed column as a partition key and primary key (partition key and clustering keys) of the indexed row as clustering keys.

Next, create a global index on column dish_type:

CREATE INDEX ON menus(dish_type);

And we can see the created global index and the underlying Materialized View:

DESC SCHEMA;

And now we can query based on the newly created index:

SELECT * FROM menus WHERE dish_type = 'Polish soup';

With the index created, we can execute the query without further issues and receive all entries that describe a Polish soup from all cities.

This is what happens when the query is executed:

  • The user provides query details to the coordinator node (1)
  • An indexing subquery (2) is used to fetch all matching base keys from the materialized view.
  • The coordinator uses the resulting base key set to request appropriate rows from the base table (3).

The index is global; underneath, it stores base primary keys in a table, where the indexed column acts as a partition key. The key allows the index to scale properly. To avoid pitfalls, schema designers must remember that the same best practices for primary keys apply to secondary indexes columns. For example: avoid creating an index on a low cardinality column. Also, note that indexed data will be stored on a node that serves the partition key token of an indexed table. That node may not necessarily be the same node that stores base table data.

LOCAL INDEX – On terminal #3

Let’s consider indexing dish_type again, but this time in a local index. Our use case is asking for all dishes of a given type for one specific city. Let’s start by creating a local index. By creating a local index, we instruct the database to use the same partition key columns as a base:

CREATE INDEX ON menus((city),dish_type);

Now we can see the created local index and the underlying Materialized View:

DESC SCHEMA;

And now we can use the query:

SELECT * FROM menus WHERE city = 'Warsaw' and dish_type = 'Polish soup';

Another option would have been to use a global index for this, with the global index, dish_type acting as the partition key of the index table. That would mean that even though our query would have contacted the node responsible for rows in Warsaw, the indexed data could have been found on a node that handles dish_type = ‘Polish soup’ partition key, possibly a different node, creating the possibility of inter-node communication, which adds to query latency.

Using local indexes makes this query very efficient. The indexing table’s partition key is explicitly the same as the base, which ensures that both the extracting keys from the index and fetching the corresponding base rows happens on the same node. That’s much faster than a global query, which may involve fetching rows from other nodes.

 

 


(1) The query is translated to an indexing subquery and a base table query (2)&(3). However, both the base table and the underlying materialized view have the same partition keys for corresponding rows. That means that their data resides on the same node; there’s no third replica that stores the indexing information.

Notice that using a token-aware driver would have provided further performance benefits. Read more about this here.

Summary

To summarize, Indexing is a useful tool that provides more types of queries on your tables. In principle, columns we wish to be queryable should be declared when the table is created, as part of a table’s primary key. Secondary Indexing is a neat way of making other columns queryable, but it comes with a cost of additional storage space and processing power to maintain the secondary index data coherent with the primary index information.

If you’d like to investigate further what happens when using the above queries with secondary indexes, try turning TRACING on and executing the queries again.

fa-angle-up