6 min to complete
Filtering. What’s that? Why is it useful? Sometimes we want to be able to query by different columns, but we’re not interested in creating secondary indexes. Filtering is one more way of allowing such queries. The mechanism is really simple – the coordinator will fetch *all* of the results specified by the key restrictions, and then filter out rows that do not match the rest of the restrictions. But, there’s a caveat. Filtering can be very performance-heavy, it can even result in fetching all rows from the table and then filter out just a few rows. Because of that, queries that involve filtering must be explicitly allowed to do so, by appending ALLOW FILTERING keyword to each such query.
Despite the performance impact, filtering isn’t always that bad. On the contrary, it’s often very useful and concise. It’s important to know when to use it.
So, what makes some queries better candidates for filtering than the others? The magic metrics is called selectivity. High selectivity means that we’re interested in only a small fraction of all the rows. Low selectivity is the opposite – the query result is not very selective – it accepts the majority of the fetched rows.
Low selectivity queries are much better candidates for filtering. The logic is simple – if we need to fetch the majority of the rows anyway, filtering some of them later will not be costly. Highly selective queries are, contrarily, not great for filtering – if we were to fetch 10 million rows from our replicas and use all this disk and network I/O just to drop all of them but one – that makes the query very inefficient.
Filtering – so what’s filtering and why is it useful?
Sometimes you want to be able to query by
different columns as was the case with indexes but we’re not interested in
creating secondary indexes, that might be the case for example if because we don’t
want extra storage overhead that indexes bring with them and also filtering is
quite more potent and generic, because with filtering we can add any number of
restrictions, any types of restrictions that it can all be filtered, while indexes have
much more specific use case for for queries. Again, we have
a query which puts some restrictions on columns which are not a
part of the key and we got this error message again saying that this
performance might be unpredictable and it is so because when filtering it might
even be the case that the coordinator would need to fetch all rows from the
table and then filter out some of them and hence if we want to use
filtering we need to explicitly allow the driver to do so by appending
ALLOW FILTERING keyword to each such a query.. So as I already mentioned a query that
uses filtering may put a strain on the performance of the whole cluster not
even one node, the whole cluster because it might end up in a full table scan
In this simple example we’re just interested in one row after all but
Scylla will first fetch all the rows from the whole table and then filter only the
one with this matching name but it’s not always the case the filtering isn’t
always that bad and on the contrary it’s often very useful and concise
In this example we specify both partition and and regular column restrictions and in
this case Scylla will know that it needs to fetch only a single partition from
one node and then filter the rest, filter all the rows returning only the one that
we’re interested in. This first of all is usually orders of magnitude more
efficient than a full table scan and secondly it can be quite efficient
overall especially if our partitions are not large, if the partition is smaller than
filtering all of its rows, just to find some, it can be totally acceptable
What makes some queries better candidates for filtering than the others?
One of the matrix is called selectivity and high selectivity
means that out of all rows we’re interested in a very small fraction
of the total, the great example is only one row out of all
on the contrary low selectivity queries mean that out of all rows that we need
to fetch we are interested in majority of them or majority is actually not a
good word, in the large part because 70% is definitely a large part but for certain
use cases even one or five percent can be large enough part to use filtering.
If we know and can predict what the selectivity will be it’s clear that low
selectivity queries are a natural candidate for filtering. The logic is
simple, if we need to fetch the majority of rows anyway
dropping just some of them is not going to put a strain on the system and on the
other hand if our query is highly selective, imagine that we need to fetch
ten billion rows and filter them all just to return a single one then it’s
obviously a performance problem.. Filtering goes well with indexing as we
also saw in one of the previous slides, with this example query that
contains both an indexed column restriction, let’s assume that we have an
index on the name column as before and there’s a regular column restriction that
doesn’t use any indexing, Scylla is perfectly capable of first using an
index to fetch all the columns that match the index restriction and
then filter out only the rest.