In Scylla (and Apache Cassandra), data is divided into partitions, rows, and values, which can be identified by a partition key. Sometimes the application needs to find a value by the value of another column. Doing this efficiently without scanning all of the partitions requires indexing, the focus of this lesson.
There are three indexing options available in Scylla: Materialized Views, Global Secondary Indexes, and Local Secondary Indexes. They are all covered in this lesson, along with comparing them, examples of when to use each, quizzes, and hands-on labs.
By the end of this lesson, you’ll have an understanding of the different index types in Scylla, how to use them, and when to use each one. You’ll also gain some hands-on experience from creating and using these indexes in the labs.
To understand indexing in Scylla, it helps to understand that it’s possible to “denormalize” without using indexing but rather by having the application maintain two or more views and two or more separate tables with the same data but under a different partition key. Every time the application would want to write data, it would need to write to all tables, and reads would be done directly (and efficiently) from the desired table. However, ensuring any level of consistency between the data in the two or more views requires complex application logic, resulting in slower performance.
Scylla’s indexing feature moves this complexity out of the application and into the servers. The implementation is faster (fewer round trips to the applications) and more reliable. This approach makes it much easier for applications to begin using multiple views of their data. This requires the application to declare the additional views or indexes (we’ll see how later on). From that point onward, every update to the original table (known as the “base table”) automatically updates the additional view tables as well.
Keep in mind that Materialized Views, Global, and Local Secondary Indexes are real tables and thus take up storage space.
Materialized Views (MV) are a global index. When a new MV is declared, a new table is created and is distributed to the different nodes using the standard table distribution mechanisms.
The new MV table can have a different primary key from the base table, allowing for faster searches on a different set of columns.
It’s scalable, just like the base table. Once created, the MV table is updated automatically every time the base table is updated. This update is only triggered when the base table gets updated, and it is not possible to directly update the MV table individually. Each Materialized View is a set of rows and columns that correspond to rows present in the base table specified in the Materialized View’s SELECT statement.
Reads from a Materialized View are just as fast as regular reads from a table and just as scalable. But as expected, updates to a table with Materialized Views are slower than regular updates since these updates need to update both the original table and the Materialized View table while ensuring the consistency of both updates. Performing these updates strictly from the application without any server help would have been even slower.
In Scylla, unlike Apache Cassandra, both Global and Local Secondary Indexes are implemented using Materialized Views under the hood.
Global Secondary Indexes (also called “Secondary Indexes”) are another mechanism in Scylla which allows efficient searches on non-partition keys by creating an index. Rather than creating an index on the entire partition key, this index is created on specific columns. Each Secondary Index indexes one specific column. In cases where you are using a composite (compound) partition key, a secondary index can index the column. Secondary indexes are transparent to the application. Queries have access to all the columns in the table, and indexes can be added or removed on the fly without changing the application.
Updates therefore can be more efficient with Secondary Indexes than with Materialized Views because only changes to the primary key and indexed column cause an update in the Secondary Index view.
What’s more, the size of an index is proportional to the size of the indexed data. As data in Scylla is distributed across multiple nodes, it’s impractical to store the whole index on a single node, as it limits the size of the index to the capacity of a single node, not the capacity of the entire cluster.
Hence the name Global Secondary Indexes. With global indexing, a Materialized View is created for each index. This Materialized View has the indexed column as a partition key, and it also stores the base table primary key. This means that it’s possible to query by the indexed column. Under the hood, Scylla will query the MV, get the base table primary key, and then fetch the requested column.
Global Secondary indexes provide a further advantage: it’s possible to use the indexed column’s value to find the corresponding index table row in the cluster, so reads are scalable. Note, however, that with this approach, writes are slower than with local indexing (described below) because of the overhead required to keep the indexed view up to date.
Local Secondary Indexes are an enhancement to Global Secondary Indexes, which allow Scylla to optimize workloads where the partition key of the base table and the index are the same key.
Like their global counterparts, Scylla’s local indexes are based on Materialized Views. The subtle difference lies in the primary key; local indexes share the base partition key, ensuring that their data will be colocated with base rows.
When using a Token Aware Driver, the same node is likely the coordinator, and the query does not require any inter-node communication.
Scylla’s superior performance often makes it acceptable for the user to use advanced but slower features like Materialized Views. This helps to improve the application’s data consistency and speed up its development.
You can learn more about these topics in Scylla Documentation: Materialized Views, Local Secondary Indexes, and Global Secondary Indexes. Two additional and useful references are this blog post and this one.