Filtering In Depth

6 min to complete

Filtering. What’s that? Why is it useful? Sometimes we want to be able to query by different columns, but we’re not interested in creating secondary indexes. Filtering is one more way of allowing such queries. The mechanism is really simple – the coordinator will fetch *all* of the results specified by the key restrictions, and then filter out rows that do not match the rest of the restrictions. But, there’s a caveat. Filtering can be very performance-heavy, it can even result in fetching all rows from the table and then filter out just a few rows. Because of that, queries that involve filtering must be explicitly allowed to do so, by appending ALLOW FILTERING keyword to each such query.
Despite the performance impact, filtering isn’t always that bad. On the contrary, it’s often very useful and concise. It’s important to know when to use it.
So, what makes some queries better candidates for filtering than the others? The magic metrics is called selectivity. High selectivity means that we’re interested in only a small fraction of all the rows. Low selectivity is the opposite – the query result is not very selective – it accepts the majority of the fetched rows.
Low selectivity queries are much better candidates for filtering. The logic is simple – if we need to fetch the majority of the rows anyway, filtering some of them later will not be costly. Highly selective queries are, contrarily, not great for filtering – if we were to fetch 10 million rows from our replicas and use all this disk and network I/O just to drop all of them but one – that makes the query very inefficient.

 

Transcript
fa-angle-up