What are Hot Partitions, how are they created, how can you identify them, and how can you avoid them?
All right.
I saw someone asking about hot partitions,
so let's move on to them now.
Hot partitions are essentially a problematic data
access pattern
that introduces contention on a specific shard in our cluster.
An uneven access pattern, for example, means
some shards get queried far more often than others.
In the field, we have seen situations where
the application doesn't impose any limits on the client side
and allows tenants to potentially spam a given key.
For example, think about bots in a messaging app
frequently spamming messages in a single channel.
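To make this concrete, here is a minimal sketch, assuming a hypothetical chat.messages table and a local node at 127.0.0.1: every message for a channel shares that channel's partition key, so a bot hammering one channel concentrates all of that load on the replicas (and shards) owning that single partition.

```python
# Minimal sketch (hypothetical keyspace/table names) of a data model where a
# spamming bot creates a hot partition: all writes for one channel land on the
# same partition, and therefore on the same replica shards.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # assumed contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS chat
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS chat.messages (
        channel_id uuid,        -- partition key: one partition per channel
        message_id timeuuid,    -- clustering key: messages ordered by time
        author     text,
        body       text,
        PRIMARY KEY (channel_id, message_id)
    )
""")

insert = session.prepare(
    "INSERT INTO chat.messages (channel_id, message_id, author, body) "
    "VALUES (?, now(), ?, ?)")

# A misbehaving bot hammering a single channel_id: every one of these writes
# hits the exact same partition, i.e. the same three replicas and shards.
hot_channel = uuid.uuid4()
for i in range(100_000):
    session.execute(insert, (hot_channel, "bot", "spam message %d" % i))
```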
All right.
Last but not least,
hot partitions may also be introduced
by erratic client-side configurations,
for example in the form of "retry storms".
That is, a client querying a specific piece of data
sends the request to the database,
and the client times out before the database does.
While the database is still processing
the original query, the client retries the query,
and that's essentially what a retry storm is:
the client keeps retrying a query
and doesn't wait for the database
to actually answer back
from its previous attempts, okay?
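As a rough illustration (the timeout value and contact points below are assumptions, not recommendations), one way to keep a client from fueling a retry storm is to make sure its request timeout is at least as large as the server-side timeout, so it doesn't abandon and re-send queries the database is still working on:

```python
# Sketch of a saner client-side configuration using the Python driver:
# keep the driver's request timeout at or above the server-side timeout so
# the client does not give up and re-send a query the server is still
# processing, which is exactly what feeds a retry storm.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

profile = ExecutionProfile(
    request_timeout=10.0,   # seconds; assumed value, keep >= the server timeout
)

cluster = Cluster(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],              # assumed contact points
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()

# Any retries layered on top of this (application loops, sidecars, etc.)
# should be bounded and use backoff rather than immediately re-sending.
```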
Now, the ScyllaDB monitoring stack makes it very easy
to find hot partitions within a cluster.
For example, in the presented monitoring image,
we can see that shard 20
is overwhelmed with reads.
This becomes very clear if we look at the
"Active sstable reads" panel: there are three shards
with over 100 concurrent reads each,
and we can also see that those same three shards
have some reads failing.
Okay.
On a different monitoring snapshot, note
that there are exactly three shards with higher utilization,
which correlates with the replication
factor of three configured for the keyspace in question.
Here, we can see that the affected shard, shard 7,
is under a much higher load.
Okay.
You can see over here, on the right-hand side load panel,
that shard 7 across the three replicas
is bottlenecked
at essentially 100% CPU
utilization.
As I have previously mentioned,
hot partitions are a general data access problem
which should be remediated on your side.
The steps to finding them generally involve using
the nodetool toppartitions command
to gather a sample of the partitions being hit over
a period of time on the affected nodes.
You may also use tracing, such as probabilistic
tracing, in order to analyze which queries are hitting
which shards, and then act from there.
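On top of nodetool toppartitions and probabilistic tracing, once you have a candidate hot key you can also trace a single suspect query from the client side. This is only a sketch against the hypothetical chat.messages table from earlier, using the Python driver's per-query tracing:

```python
# Trace one suspect query and print where the time was spent; a single
# replica/shard doing nearly all the work is a strong hot-partition signal.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
suspect_channel = uuid.UUID(int=0)    # placeholder: the candidate hot key

result = session.execute(
    "SELECT * FROM chat.messages WHERE channel_id = %s LIMIT 100",
    (suspect_channel,),
    trace=True,
)
trace = result.get_query_trace()
print("coordinator:", trace.coordinator, "duration:", trace.duration)
for event in trace.events:
    print(event.source_elapsed, event.source, event.description)
```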
Low-cardinality indexes and views: pretty much
everyone eventually tries this at some point,
only to find out later that it was a terrible idea.
As I explained, indexes and materialized views
are essentially implemented just like regular tables
that you cannot directly write into,
and just as you wouldn't select a low-cardinality key
for a table, you should not select
a low-cardinality column as the partition key for your view.
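For illustration only, here is what that anti-pattern can look like, assuming a hypothetical app.users table: a materialized view whose partition key is the low-cardinality status column, which funnels most of the base table into a handful of huge, hot partitions.

```python
# Anti-pattern sketch (hypothetical keyspace/table): do NOT key a view on a
# low-cardinality column such as "status".
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS app.users (
        user_id    uuid PRIMARY KEY,   -- high cardinality: fine as a partition key
        status     text,               -- 'active' / 'inactive' / 'suspended'
        country    text,
        last_login timestamp
    )
""")

# Almost every user is 'active' or 'inactive', so this view collapses the
# whole table into a couple of giant partitions that every write must update.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS app.users_by_status AS
        SELECT * FROM app.users
        WHERE status IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (status, user_id)
""")
```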
In particular,
most of the time we see people trying to make ScyllaDB
run queries
that it isn't really a good fit for,
such as filtering out users who have been inactive
since last year, or retrieving all tenants
by country or state, and Tzach even mentioned this earlier.
We definitely
agree that there may be some situations
where you need to run these queries.
So if you really, really, really
must run those queries, then
you should still avoid
creating the view in question and
instead run what we call
"efficient full table scans", if possible.
We have a very good article from Avi Kivity,
our CTO, explaining the overall logic,
but it essentially involves
sending multiple queries in parallel
to multiple nodes at once and then aggregating the results
on the client side.
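Here is a rough sketch of that idea, not Avi's exact implementation, assuming the hypothetical app.users table from the earlier example and the Murmur3 partitioner: split the token ring into many sub-ranges, query them in parallel, and filter or aggregate on the client side.

```python
# Sketch of an "efficient full table scan": issue token-range queries in
# parallel and aggregate client-side instead of creating a low-cardinality
# view or relying on ALLOW FILTERING.
from concurrent.futures import ThreadPoolExecutor
from cassandra.cluster import Cluster

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 partitioner token bounds
SUBRANGES = 512                            # assumed; tune to cluster/table size

session = Cluster(["127.0.0.1"]).connect()
stmt = session.prepare(
    "SELECT user_id, country, last_login FROM app.users "
    "WHERE token(user_id) >= ? AND token(user_id) <= ?")

def scan_range(bounds):
    lo, hi = bounds
    # Client-side filtering/aggregation happens here (placeholder predicate).
    return [row for row in session.execute(stmt, (lo, hi))
            if row.country == "US"]

step = (MAX_TOKEN - MIN_TOKEN) // SUBRANGES
ranges = [(MIN_TOKEN + i * step,
           MIN_TOKEN + (i + 1) * step - 1 if i < SUBRANGES - 1 else MAX_TOKEN)
          for i in range(SUBRANGES)]

with ThreadPoolExecutor(max_workers=16) as pool:   # assumed parallelism
    matching = [row for chunk in pool.map(scan_range, ranges) for row in chunk]
```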
All right.
Creating
a low-cardinality view is in fact a nice way
to win the lottery and introduce hotspots, data imbalances
and large partitions all at once in your cluster.
The table I have created shows
approximately how many large partitions you will end up
creating depending on the type of data
you choose as either a base table key or as
a view key.
A boolean column,
for example, has essentially two possible values,
which means that you will end up with two large partitions.
If you decide to restrict by country,
then there are a total of 195 countries in the world,
but it is very easy to imagine that countries such
as the US, China and India are going
to become large partitions, for example.
And if we decide to key a view on something such
as status, which may have values
such as active, inactive, suspended and so on,
it is easy
to conclude in advance that the majority of writes
will end up under a few specific statuses.
Okay.
So be very careful when selecting
the key for your views, and also when selecting
which columns you are going
to create an index on top of.
And the last part of "how to shoot ourselves
in the foot": tombstones.
When you delete a record in a database with a log-structured
merge-tree storage engine such as ScyllaDB,
you must know that your deletes are actually writes.
A tombstone is a write marker that tells the database
that your data should eventually be deleted from disk.
On top of that, there are different types of tombstones,
since when we run delete operations
there are essentially several ways to delete data
from the cluster.
For example, we have cell-level tombstones,
which delete a single column.
There are range tombstones,
which delete a whole range of information,
such as, for example, all records since last week.
We have row
tombstones, which delete an entire row.
And we have partition tombstones, which delete an
entire partition.
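As a quick sketch, here is what each of those delete shapes looks like against the hypothetical chat.messages table from earlier (the key values are placeholders):

```python
# Each DELETE below produces a different kind of tombstone.
import time
import uuid
from cassandra.cluster import Cluster
from cassandra.util import min_uuid_from_time

session = Cluster(["127.0.0.1"]).connect()
channel = uuid.UUID(int=0)   # placeholder partition key
msg = uuid.uuid1()           # placeholder timeuuid clustering key

# Cell-level tombstone: only the 'body' column of one row is deleted.
session.execute(
    "DELETE body FROM chat.messages WHERE channel_id = %s AND message_id = %s",
    (channel, msg))

# Row tombstone: one entire row is deleted.
session.execute(
    "DELETE FROM chat.messages WHERE channel_id = %s AND message_id = %s",
    (channel, msg))

# Range tombstone: everything in the partition older than one week.
week_ago = min_uuid_from_time(time.time() - 7 * 86400)
session.execute(
    "DELETE FROM chat.messages WHERE channel_id = %s AND message_id < %s",
    (channel, week_ago))

# Partition tombstone: the whole partition at once; the cheapest kind for the
# read path, which is why full-partition deletes are preferred.
session.execute(
    "DELETE FROM chat.messages WHERE channel_id = %s",
    (channel,))
```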
Okay. Yes.
Kevin said that there is also TTL, but
we can discuss TTL later on in the experts panel.
But yeah, very good.
Gotcha, Kevin, thanks.
When deleting data in a database such as ScyllaDB,
especially if you have a delete-heavy use case,
you should always prefer to delete entire partitions,
as those deletes are much more performant
from the database's perspective, okay?
Otherwise, if you have a use case that deletes a lot and creates
many tombstones, such as trying
to implement a durable queue on top of ScyllaDB,
you may see elevated latencies later on, on your read
path.
In order to understand why tombstones may slow down
your read path, let’s revisit the ScyllaDB write path again.
As you remember, data is written to an in-memory
data structure known as the “memtable”.
The memtable eventually gets flushed to disk
and results in an sstable, which is how your data gets persisted.
This cycle repeats and over time
these sstables will accumulate, right?
As sstables accumulate, this introduces a problem:
the more sstables a partition is spread across, the more
sstables a read needs to scan through
in order to retrieve the data we are looking for,
which effectively increases the read latency.
Then, with tombstones, what happens
is that the database may end up scanning
through a large amount of data that shouldn't be returned back
to the client
because it has been deleted.
And by the time
it manages to fetch only the live rows you need,
it may have already
spent too much time walking through a large tombstone run,
and your latencies will be bad.
Over time,
ScyllaDB will
merge your sstables via a process known as "compaction",
which will improve the latencies.
However, depending on how many tombstones
are present in the output, latency may still suffer,
for example if you still have a large tombstone run.
For example, in the following slide
I have taken a snapshot
of a trace where I tried to scan through a partition
filled with tombstones and containing a single live row.
We can see that
we have one partition, a single row was returned,
and there are around 9 million range tombstones.
The time it took for this data to be retrieved
by the database was 6 seconds, okay?
Down below, as an extreme example,
the partition in question was simply unreadable,
and the database would time out whenever trying to read from it.