An example of application enhancement and a few options for implementing it, including client-side filtering, server-side filtering, secondary indexes, materialized views, and denormalizing data. Covers the pros and cons of each possibility and when and if to use it.
But now we’re going to enhance the application, like it always happens so in this example
what we’re going to have the enhancements do is select the
for the user the fifty-first
tweets he’s interested in or he liked and this is a query we try to do so
we’re selecting for specific user the ones he liked and what are the our
solutions, so the first solution is to do client side processing or filtering
so on the right hand side we have the query processor
memtable cache and SSTable that is a single ScyllaDB node for the sake of
simplicity I have a single ScyllaDB node on the right hand side and we’re reading
from memtable cache or SSTable 50,000 rows, if the probability of a user liking
a tweet is 0.1% to get to the fifth first fifty liked tweets we need to read
around 50,000 rows but we are not processing any of that we’re
shipping 50,000 rows to the client and only interested in fifty of
them it’s a lot of waste so the first solution is not great we return a lot of
data and we’re reading a lot of data off the disk and we are interested only in
0.1 percent of them, is there another solution? And the answer is – yes, we can
use allow filtering, allow filtering is server-side filtering so we can have
ScyllaDB do the filtering before sending it back to the client, the rule of thumb
that you need to remember is that allow filtering should be used with large
parts of the data are returned, is 0.1% large? – well it’s debatable
it’s not, it’s maybe too small so in step 4 of the application what we did is we
added allow filtering and limited it to 50 queries and ran the query and when we
run the query we have two gauges alarming us, the left one at 90%
is showing us that one out of ten queries is using allow
filtering so 10% of our queries are using allow filtering, the graph below
shows us the number of queries so we have here around 8, 7 to 8 queries a
second running allow filtering, on the right hand side we have our goodness
metrics and we are seeing there that the gauge shows us 0%, so we’re reading a
lot of rows and returning only a fraction of it close to 0% back
to the client okay if you look down the graph below
provide us the numbers so we have here the rows read – 580 in my sample or in this
run, out of that we matched a single one and we dropped 579 of those okay
so when I’m using the metrics I can see how well my application is interacting
with ScyllaDB and now well I’m expecting it to work, and going back to the graph we
have the query processor pulling 50,000 rows filtering out of those and sending
back only 50 okay so solution 2 is server-side filtering
it’s reading a lot but at least it’s returning only what the client is
interesting with it which is 50
Are there other solutions? So ScyllaDBs 3.0 adds secondary indexes and secondary
indexes are way to add an index on a column and then find rows that are
related to that but the rule of thumb is that secondary indexes should be used
with high cardinality and in our case that’s not true
the liked column has two values liked or didn’t like so it’s very very bad it’s
worse than any other option so we’re not even going to try it out is there
another option and their answer is – yes. ScyllaDB 3.0 adds materialized views and a
materialized view can be thought of as an additional table that it’s
replicating data from an existing table and keeping them in sync that additional
table can have, which we refer to it as a view, can have all the data part of
it but the important part is that it can have a different primary key and for our
sample that we’re doing for our enhancement it means that we want to add
the liked attribute to our primary key and if we deploy that
we get an optimal performance okay we’re reading 50 rows from
memtable cache or SSTable, okay again using the materialized view and returning the
50 back okay so it’s no overhead for processing these query, so we had four
different solutions okay and the question is, it’s clear that solution 2
and solution 4 as the better ones but which one should I pick and the answer
it depends, it really depends on how many queries are going to be run if 50% of
your traffic is going to be this, then materials view is probably the correct
thing, if it’s very low percentage then it may be okay to do server-side
filtering so there’s no way to know that in advance, the overheads
in materialized view have to do with the additional storage, okay so when I’m
writing the data I’m replicating it so that’s additional cost