Different methods for identifying issues, using monitoring as well as other techniques.
All right.
So this brings us to the last section of our presentation.
I’ll try to be quick.
Let’s talk about some of the ways to prevent
or diagnose some of these mistakes, and trust me,
you may end up running into them, just as many have before you.
I did too, unfortunately.
All right.
So, large partitions, large
cells, large collections: how do you diagnose them?
ScyllaDB has several system tables which record
whenever a large partition, cell, or collection is found.
We even have a “large partition hunting guide”,
which I have added as a reference
to guide you if you ever run into this problem.
In the picture, you can see an example of the output
of a populated large partitions table:
you have the keyspace name, the table name,
the partition size that was recorded, the
offending partition key,
how many rows are in that partition, and so on.
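Just as a sketch, and assuming your version exposes the same table name shown on the slide, you can query that data yourself from cqlsh:

  -- inspect recorded large partitions (the exact column set may vary by ScyllaDB version)
  SELECT * FROM system.large_partitions;

Similar system tables exist for the large cell and collection cases, so the same kind of query applies to them as well.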
All right.
Hot partitions.
As we spoke about previously, there are multiple ways to
identify and address hot partitions.
If you are in doubt about which partition may be causing
you problems, run the nodetool toppartitions command against
one of the affected nodes: it simply samples which keys
are most frequently hit during your sampling period.
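For example, a minimal invocation might look like this (the keyspace and table names are just placeholders, and I’m assuming the usual nodetool argument order of keyspace, table, and sampling duration in milliseconds):

  nodetool toppartitions my_keyspace my_table 5000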
If hot partitions are often a problem for you,
ScyllaDB has a “per partition rate limit” feature,
which allows you to specify a maximum rate per partition,
after which the database will simply reject
any queries that hit that same partition.
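As a sketch, and assuming the table-level schema option shown here, setting the limit could look like this (the table name and the numbers are only illustrative):

  ALTER TABLE my_keyspace.my_table
  WITH per_partition_rate_limit = {
      'max_reads_per_second': 100,
      'max_writes_per_second': 100
  };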
And remember that retry storms may cause hot partitions.
So ensure that your client-side timeouts are always, always
higher than the server-side timeouts,
to avoid clients retrying queries
before the server has had a chance to process them.
Okay.
Oh, by the way, down below here you can see what the output
of nodetool toppartitions looks like.
All right.
Hot shards.
These are per-shard imbalances which may introduce
contention in your cluster.
In the monitoring, identify whether the affected shards
are on the coordinator or the replica side.
If the imbalance is just on the coordinator side,
then it likely
means that your drivers are not configured correctly.
However, if it affects the replica side,
then it means that there is a data access pattern problem
that you should review.
Depending on how large the imbalance is,
you may configure ScyllaDB to shed requests
past a specific concurrency limit.
ScyllaDB will shed any queries that hit that specific shard
past the number you specify.
Think about this limit
as something like the DynamoDB concurrency limiter.
If you try to send many queries concurrently,
DynamoDB will simply push back on
your client and say “hey, you are pushing too fast.
Stop”.
So that’s essentially what this feature is all about.
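Just to illustrate, here is how that limit might be configured; I’m assuming the scylla.yaml option name max_concurrent_requests_per_shard, so double-check the exact knob for your ScyllaDB version:

  # scylla.yaml (assumed option name; requests beyond this per-shard
  # concurrency are shed instead of being queued up)
  max_concurrent_requests_per_shard: 5000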
And of course, make use of tracing,
which can be user-defined or probabilistic.
We also have what we call “slow
query logging” and many other tracing options
that you can try, whichever is the best fit for your
particular problem.
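As a quick sketch of the user-defined flavor (the query itself is just a placeholder), you can trace a single statement straight from cqlsh:

  -- trace one query interactively from cqlsh
  TRACING ON;
  SELECT * FROM my_keyspace.my_table WHERE id = 1234;
  TRACING OFF;

For the probabilistic flavor, nodetool settraceprobability lets you sample a fraction of all queries, for example a value of 0.01 to trace roughly one percent of them.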
And lastly, ensure that your tombstone
eviction is efficient to avoid slowing down your read path,
especially when your use case heavily relies on deletes.
Everything starts with selecting the right compaction strategy
for your use case and reviewing your delete access patterns.
Remember, and this is very important,
deleting a partition is much more performant
than deleting a row or a cell, for example.
Be sure to
check out our “repair-based tombstone garbage collection” feature,
which allows you
to tell the database how to evict tombstones.
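A minimal sketch of enabling it per table, assuming the tombstone_gc schema option and an illustrative table name:

  ALTER TABLE my_keyspace.my_table
  WITH tombstone_gc = {'mode': 'repair'};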
More recently, in ScyllaDB Open Source 5.2
we introduced a concept known as “empty replica pages”,
which allows ScyllaDB
to hint to the driver that it should wait for a longer period of time
before timing out while the database is scanning through
a large tombstone run.
Okay, so empty replica pages
are not going to improve latency,
but they are going to prevent queries from timing out.