Introduction
Welcome to the How to Write Better Apps section! Today we’re going to talk about filtering and denormalization of data.
In this class, the goal is to introduce a data modeling problem, present different solutions to it, and discuss the pros and cons of each.
Let’s recap our last session: previously we did a simple modeling of tweets and a timeline query, and this time we’ll focus only on the timeline feature.
Our data modeling strategy is to partition the timeline by username, and cluster it by created_at in descending order. This approach ensures that whenever we query the timeline table, we’ll get the most recent tweets added to that particular timeline.
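As a sketch, the timeline table described above might look like this in CQL (every column except username, created_at, and liked is an illustrative assumption, not the exact schema from the lesson):

```cql
-- Timeline partitioned by username, clustered by created_at DESC,
-- so reads return the most recent tweets first.
CREATE TABLE timeline (
    username   text,
    created_at timestamp,
    tweet_id   timeuuid,   -- assumed column
    content    text,       -- assumed column
    liked      boolean,
    PRIMARY KEY (username, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
```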
But now, we’d like to enhance our application with a timeline feature that filters a user’s timeline by liked tweets and retrieves the first fifty items.
However, we’re facing a limitation due to our current data modeling approach.
The lack of a liked column inside the partition or clustering keys prevents us from executing this query. Attempting to do so will result in an error.
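Concretely, assuming a timeline table partitioned by username where liked is a regular column, the filtered query would be rejected (the username 'joe' is just a placeholder):

```cql
-- liked is neither a partition nor a clustering key, so ScyllaDB
-- rejects this with an error like "Cannot execute this query as it
-- might involve data filtering [...] use ALLOW FILTERING".
SELECT * FROM timeline
WHERE username = 'joe'
  AND liked = true
LIMIT 50;
```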
Let’s say that the user we’re working with has fifty thousand rows in their timeline partition, and that we’re running a single ScyllaDB node for the sake of simplicity.
First approach: Client Side Filtering
Let’s start with the first approach: client-side filtering! We’re going to query all the tweets and filter them on the client side. Simple, isn’t it?
Here, we observe the query processor accessing the database and storage, and reading all 50,000 rows.
However, given the low probability (0.1%) of a user liking a tweet, this approach processes a lot of unnecessary data.
All of that data is then shipped to the client, which loops over it to keep only the rows where liked = true.
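In CQL, the client-side approach boils down to fetching the whole partition and doing the rest in application code (a sketch; 'joe' is a placeholder username):

```cql
-- Reads all ~50,000 rows of the partition over the network;
-- the application then loops over them, keeping only rows with
-- liked = true and stopping after the first fifty.
SELECT * FROM timeline WHERE username = 'joe';
```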
Hmm… Yeah, it’s a solution, but a really bad one: depending on how much data you have inside the partition, it can slow down your whole cluster and your app.
Second approach: Server Side Filtering
Now the second solution: the Server Side Filtering.
The introduction of ALLOW FILTERING offers a server-side solution, enabling ScyllaDB to filter rows before transmitting them, mitigating the waste we observed in the first solution.
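The same query from before succeeds once we opt in explicitly (a sketch, with 'joe' as a placeholder username):

```cql
-- ScyllaDB scans the partition server-side and ships back
-- only the rows matching liked = true.
SELECT * FROM timeline
WHERE username = 'joe'
  AND liked = true
LIMIT 50
ALLOW FILTERING;
```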
And yet, the use of ALLOW FILTERING raises concerns, as indicated by the gauges: it signals potential inefficiency.
Implementing ALLOW FILTERING is still a valid solution, but depending on the size of your partition it may bring performance issues, so it should be avoided whenever possible.
The rule of thumb here is: use it when you need a large portion of the partition’s data and the read frequency is rare (since it performs a full scan of the partition, which can hurt its performance).
Third approach: Secondary Indexes (SI)
The third solution is to create a secondary index on the timeline table, with “liked” as the index’s partition key, and to query based on it. It’s supposed to work, right?
Under the hood, the secondary index creates a materialized view that uses the specified column as its partition key and appends the original partition and clustering keys at the end.
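Creating and using the index could look like this (the index name and username are assumptions):

```cql
-- A global secondary index on liked; ScyllaDB backs it with a
-- materialized view keyed by liked, with the base table's
-- partition and clustering keys appended.
CREATE INDEX timeline_liked_idx ON timeline (liked);

-- With the partition key plus the indexed column restricted,
-- the query no longer needs ALLOW FILTERING.
SELECT * FROM timeline
WHERE username = 'joe'
  AND liked = true
LIMIT 50;
```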
It’s also worth mentioning the effort involved in querying through a secondary index: your query first lands on a coordinator, which identifies the secondary index.
The coordinator then asks the materialized view table for the liked tweets and issues a final query to bring back the data, filtering on the server side so it fits our needs.
While secondary indexes offer a solution, it’s crucial to understand when to use them. Their primary aim is to make it easy to query by a single column with high cardinality, which is not our case here, since “liked” can only be true or false.
This approach can introduce performance issues in the short term, so make sure you find the right use case for it.
Fourth approach: Local Secondary Indexes (LSI)
The fourth solution is to create a local secondary index on the timeline table, adding “liked” as a clustering key while keeping the partition key we set previously.
Under the hood, the local secondary index creates a materialized view that keeps the original partition key and adds a new clustering key of your choice, which suits our case much better.
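In ScyllaDB, a local secondary index is declared by wrapping the base partition key in an extra pair of parentheses (the index name is an assumption):

```cql
-- The ((username)) part keeps the index data co-located with each
-- base partition instead of scattering it across the cluster.
CREATE INDEX timeline_liked_local_idx ON timeline ((username), liked);
```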
The querying effort is almost the same as with secondary indexes; however, a local secondary index combines the partition key with the indexed column, giving a more selective lookup that makes our querying faster.
If you stop to think about it, a local secondary index suits our needs better, since we go directly to the partition we need, together with the new clustering key.
Fifth approach: Materialized Views (MV)
And the last solution: a materialized view with a composite partition key. Instead of relying on indexes over clustering keys, we can simply add the column to our partition key, which previously was only “username” and now becomes (“username”, “liked”).
Since we know that our query will always specify liked = true or false, it’s easy to attach this column to the partition key alongside the username.
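A sketch of that materialized view (the view name and username are assumptions):

```cql
-- Composite partition key (username, liked): each query now reads
-- exactly one partition, already sorted by created_at DESC.
CREATE MATERIALIZED VIEW timeline_by_liked AS
    SELECT * FROM timeline
    WHERE username IS NOT NULL
      AND liked IS NOT NULL
      AND created_at IS NOT NULL
    PRIMARY KEY ((username, liked), created_at)
    WITH CLUSTERING ORDER BY (created_at DESC);

SELECT * FROM timeline_by_liked
WHERE username = 'joe' AND liked = true
LIMIT 50;
```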
With that, we bring the best solution to the table, since our goal here is efficient read queries.
Wrapping up: Read and Write Path
Let’s do a quick wrap up on this lesson: when to use each case. Here we have two paths: the Read and the Write path.
If your read frequency is rare and your cardinality is low, it’s OK to use ALLOW FILTERING to read data; this is the “only read” path. However, if your cardinality is higher, you’ll jump into the write path, where you need to decide your priorities for the replication of your data.
See you in the next lesson!