So, out of global indexes, local indexes, and filtering. Which one to pick, what’s the best option? Here are some tips. Filtering is by far the cheapest option when it comes to storage, as it consumes 0 additional bytes. It’s also the simplest one to use, as it requires no DDL management like creating an index. Then, we should consider how selective will our queries be. If we expect to fetch the majority of the rows often, filtering is most likely still the best alternative. Otherwise, indexing should be considered. If our queries tend to ask for a single partition, then that’s the target use case for local indexes. Otherwise, if queries are usually multi-partition, global indexes may be the way to go. The best advice I can give is to benchmark first, pick later.
Transcript
Out of global indexes, local indexes and
filtering, which one to pick?. Which fits best for our use case?
Here are some tips, so first of all filtering is by far the cheapest option when it
comes to storage overhead because it simply doesn’t have an
additional storage footprint it consumes zero additional bytes, it’s also very simple to use
because it requires no DDL management it doesn’t require creation of an index
you don’t need to manage it later, then if we know if we can expect our queries to be
with low selectivity which means that we’re interested in majority of the rows
anyway filtering may still be the best option if our queries are usually highly
selective then it probably makes sense to consider indexes in favor of
filtering. So now if most of the queries fit the use case for local indexes which
means that majority of our queries are single partition only, then we should
consider local indexes because there are simply better optimized for this use case
they’re faster, on the other hand if most of our queries are generic there are
multi partition or they could involve a full table scan if we filter, then we
may consider a global index which is more generic so it allows for more
use cases, well if you can’t use the local index then that’s what’s left
to pick, but overall the best advice I can give you is to benchmark
first and pick later
I created a very quick local set up of three ScyllaDB nodes and fed it with
100,000 rows and then bombarded this installation with various distributions
and the various configurations of tests and wanted to compare how good filtering
global index and local index fits for a particular query and a particular
configuration and distribution.. I have three start benchmarks and
every single one favours different option just to let everyone know that
none of this is a silver bullet, sometimes filtering is the best
sometimes global index is the best and sometimes it’s the local index
In out first example, local index is a clear winner and that’s because our query that
you can see written in blue is simply a natural use case for a local index so
filtering took quite a long time because it needed to fetch rather big partition
and then simply filter all of the rows just to find a single one that we
were interested, a global index did better but it was still worse than local
index because it likely involved some internet communication it needed to
fetch index rows from one node and then base table rows from
another so we paid for it with some round-trip time the local index was
simply the best fit here. Now, take a look at the other case
this time the query changed slightly,. I’m not longer interested in fetching a
single partition but instead wanted to
run the query and don’t care which base partition it is and in this case
filtering did even worse than before and that’s because it needed to perform a full
table scan so it couldn’t be optimized for a single partition as before thus we
paid some of the price global index performed pretty much exactly as before
because it doesn’t really care what was the base partition before, so it’s still
grossing quite a nice time and local index is simply disqualified of
this competition because we didn’t specify a single partition to
run on, so we weren’t able to use local index at all. That’s why a global index
was a winner here. And now a third case, now the distribution first of all changed
slightly, we’re back to querying a single partition but this time, each time
the query was run we wanted different base partitions and different index
column values and the quite surprising thing is that filtering which pretty
much fetched the whole partition and then filtered through the rows was
actually the fastest here, global index is far, far behind and local index is
quite similar and that’s mainly because in this particular case filtering simply
read the whole partition sequentially then filtered over it and then it was
nicely cached so it could be served pretty fast and on the other hand indexing both
global indexes and local indexes needed to perform a long series of random reads
and it’s seen in the results.