This lesson provides an overview of advanced monitoring techniques and how to troubleshoot ScyllaDB issues. It goes over the tools and techniques used for monitoring and explains the healthy state of the system and how to get started with troubleshooting an issue.
How to Monitor the System
A large part of this talk will be how to debug an issue. Many times I'm asked to look at a system, or someone else is asked to look at a system, and we rely on a lot of gut feelings. I'll try to brain dump those gut feelings into something cohesive enough for you to use, so that hopefully you can catch some issues on your own.

So, how to monitor: the number one tool that we use every day is ScyllaDB Monitoring. If you don't have ScyllaDB Monitoring, install ScyllaDB Monitoring, or we won't talk to you; that's probably the number one tool. And it's mainly because that's the only way we find our way inside the system: ScyllaDB exposes a huge amount of metrics, hundreds of thousands of metrics, and ScyllaDB Monitoring is the only way to see what's going on inside the system. The next tool, of course, is the ScyllaDB logs, and when nothing else works we use Linux tooling.
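As a rough illustration of where those numbers come from, here is a minimal sketch that pulls the raw Prometheus metrics straight from one node. It assumes the default ScyllaDB metrics port (9180) and a made-up node address; in practice the ScyllaDB Monitoring stack scrapes these endpoints for you.

```python
# Minimal sketch: fetch ScyllaDB's raw Prometheus metrics from a single node.
# The node address is a placeholder; 9180 is the default metrics port.
import urllib.request

NODE = "10.0.0.1"  # assumption: replace with one of your nodes
URL = f"http://{NODE}:9180/metrics"

with urllib.request.urlopen(URL, timeout=5) as resp:
    text = resp.read().decode()

# Count and peek at the exposed samples to get a feel for the volume of metrics.
samples = [line for line in text.splitlines() if line and not line.startswith("#")]
print(f"{len(samples)} metric samples exposed by {NODE}")
print("\n".join(samples[:10]))
```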
Now, some techniques. The first one starts at development time: finding out why a system is not working, or why it's not working as well as it should, or as well as it did yesterday, has to do with debugging your code. When you debug your code you have unit tests and so forth, but you can also use the monitoring to look at the system while you're developing the code. The second thing is that you need to know the history. Usually when I'm asked to look at a system, I ask when it stopped working and since when they have had the monitoring running; I scroll back, look at the history of the system, and try to find out what changed, or at what point things started going wrong. Knowing how the system operates when it works well is key to finding what broke. Last, if you have background processes running, backups, repairs and so forth, be aware of them: they will affect the way the system operates, and if you're sensitive to latency you may need to schedule them accordingly.
Next, proactive monitoring. Proactive monitoring is something we use in the cloud: basically, we try to catch the system not working, or catch an issue, before the issue happens or the customer notices. Dan talked about ScyllaDB Manager, and a CQL ping is one of the ways that we monitor the system. The other part is the Alertmanager, which is part of the ScyllaDB Monitoring stack. There are default alerts set up and you can add your own; we do that in the cloud, where we specify alerts on latencies for customers and so forth, to notice when latency starts building up before there is a big issue.
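As an illustration, here is a minimal sketch of what a CQL-level ping can look like with the Python driver. The node address, the timeout and the choice of query are assumptions; the point is that the probe exercises the whole CQL path rather than just checking that a port is open.

```python
# Minimal CQL "ping" sketch: connect and run a trivial query against system.local.
# The contact point and timeout are placeholders for illustration.
from cassandra.cluster import Cluster

def cql_ping(contact_point: str) -> bool:
    """Return True if a trivial CQL query succeeds against the node."""
    cluster = Cluster([contact_point])
    try:
        session = cluster.connect()
        # A trivial read that still goes through the CQL front end and coordinator.
        session.execute("SELECT release_version FROM system.local", timeout=2)
        return True
    except Exception:
        return False
    finally:
        cluster.shutdown()

if not cql_ping("10.0.0.1"):  # assumption: one of your nodes
    print("CQL ping failed - this is where an alert would fire")
```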
To understand a bit of what I'm going to talk about, there are three or four slides about architecture, so hopefully this will be very fast. First, we're talking about a cluster, and since it's a cluster, the client sends a single operation to a coordinator node, which may or may not be a replica, and the coordinator then talks to other nodes as well. We need to take this into account when we're debugging or looking at the system.
Next, and I'm not sure if you've seen this in this level of detail before, there are multiple components to ScyllaDB. First there is the client, your application. Next there is a CQL front end that processes your CQL requests. Then we hit something that is called the storage proxy; it's an internal name, but it's basically the coordinator of your requests. So when your CQL request, an insert or a query, is parsed by the CQL layer, it's transferred to the storage proxy, which finds out what the token is and what the replicas are, meaning which nodes we need to talk to. Below the storage proxy there is the database, which is the local part that saves the information, and below the database there is a cache that we are very proud of. Then there is the persistency layer, the SSTables. The compaction manager talks to the database to understand whether compactions are keeping up and whether we're building up a huge amount of memtables that need to be flushed to disk, and it looks at the SSTables as well. Aside from that, there is a gossip component that makes sure the cluster is healthy and working, and of course there is repair, which is a background process that talks to multiple nodes and streams data. Those are the components; I'll talk about some of them in the following slides.
There are different task categories in ScyllaDB, and most of them are background tasks. The foreground ones are your requests, so read and write requests. In the background there are also write and read requests: if I'm doing a write with a consistency level of QUORUM, I'll wait for two replicas to respond (assuming a replication factor of three), and then there will be a background write to the remaining replica happening in the background. Yet that background write may become a foreground write, and that happens if ScyllaDB detects it's not keeping up: ScyllaDB will automatically shift your request to effectively work with consistency level ALL, waiting for all the replicas to respond, if it detects it isn't keeping up at the required speed. So we have a background process that, from your standpoint, turned into a foreground process.
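For reference, this is roughly what such a QUORUM write looks like from the client side with the Python driver. The keyspace, table and values are hypothetical; the background write to the remaining replica is handled entirely inside ScyllaDB, not by this code.

```python
# Sketch: a write at consistency level QUORUM (hypothetical keyspace and table).
# With RF=3 the coordinator answers after two replicas acknowledge; the third
# replica is completed as a background write inside ScyllaDB.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])           # assumption: a reachable node
session = cluster.connect("my_keyspace")  # hypothetical keyspace

insert = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ("alice", "Alice"))

cluster.shutdown()
```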
The same happens with reads. Reading with consistency level ONE may trigger a background process, and that is read repair: with a read repair chance, we go to all the replicas, get the information, and try to repair if we find inconsistencies between the replicas. And if we did a read with QUORUM and there was an inconsistency between two replicas, that read will become a foreground read: we'll wait for all the replicas, fix the inconsistency, and only then return a response. So that is another background process that became a foreground process.
while I’m writing into the database we’re accumulating all of that in
something that is called a mem table and then we need to flush it to the desk
ScyllaDB keeps multiple mem tables one that is accumulating data and another one
that it’s being flushed to the disk now if we’re not flushing in fast enough to
the disk then basically writing into the mem table becomes basically a foreground
process because we’re slowing down your writes so we reallocate if we’re
writing basically a row or partition or bunch of memory to the disk then we’ll
only allow a right to enter another memtable so again it’s a background
process usually mem table flushing but it became something that affects our
foreground workload and that translates to the system being slower what’s happening
with the system and so forth commit log the same I’m flashing commit logs I’m
not handling flash of commit logs fast enough then it will influence the write
speed and it may back pressure towards the client compactions the same and then
they’re streaming and repair streaming and repair are always background it
cannot be a foreground process so a steady state healthy system what is
So, what makes up a steady-state, healthy system? It's a very long list, but basically, in my view, it means that all nodes are up and running: we don't have a node being down and we don't have a node being added. It's not that ScyllaDB doesn't operate in such cases; it operates very well and we test it. But the fact that a node is down, or that we're adding a node, causes additional work in the cluster, and that is something that we need to detect.
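A quick way to sanity-check the "all nodes are up" part from the client side is to ask the driver which hosts it currently marks as up. This is only a sketch with a made-up contact point; nodetool and the monitoring dashboards remain the authoritative view.

```python
# Sketch: list the nodes the driver knows about and whether it marks them as up.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # assumption: a reachable node
session = cluster.connect()

for host in cluster.metadata.all_hosts():
    state = "UP" if host.is_up else "DOWN"
    print(f"{host.address:>15}  dc={host.datacenter}  {state}")

cluster.shutdown()
```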
Next, the clients are driving traffic. You're writing your application and it seems simple, insert into the database, read from the database, and so forth, but it's actually a bit more complex, and for some of the faces here I know of cases where it wasn't as simple as it sounds. One thing is connection balance: if you're using the ScyllaDB shard-aware drivers you're safe; if you're not, you may hit something, and I'll talk about it. Then there's the amount of traffic: it's not enough that you connected to all the shards, you can't use a single connection. If you are using a single connection, then what you did really didn't help, because you're still sending all the traffic to a single shard in the system. The queries are the same story: if I'm using the shard-aware driver and sending a lot of traffic, but only over a single connection, or to a single node that I'm sending all my batches to, that doesn't work either; it will affect the latencies of that node.
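As a sketch of what connecting properly can look like with the Python driver: the scylla-driver fork is shard-aware out of the box, and a token-aware load-balancing policy keeps requests going to the right replicas. The contact points and the data-center name are assumptions.

```python
# Sketch: connect with token-aware load balancing so traffic is spread across
# nodes (and, with the shard-aware scylla-driver fork, across shards).
# Contact points and the data-center name are hypothetical.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")  # assumption: your DC name
    )
)

cluster = Cluster(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # several contact points, not just one
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
```

Prepared statements give the driver the routing information it needs for token awareness, so prefer them, and avoid funnelling all your batches through a single node.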
Next, requests for partitions and rows should be balanced. An example of an imbalanced state is 50% of the read requests hitting a single partition; that is known as a hot partition. You're creating the data model, you know your application, you're controlling the requests, and the question is how they affect the system. There are three more items that are similar: there can be a large partition, a large row, and a large cell, and all of those have an effect on the system.
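ScyllaDB records offenders above the configured size thresholds in system tables, so you can check for them directly. Here is a sketch that lists recorded large partitions; it assumes a ScyllaDB version that populates system.large_partitions, and note that the table is local to each node, so in practice you would check every node or use the monitoring dashboards.

```python
# Sketch: list entries ScyllaDB has recorded in system.large_partitions.
# Assumes a version that populates this (node-local) table.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # assumption: one of your nodes
session = cluster.connect()

rows = session.execute("SELECT * FROM system.large_partitions")
for row in rows:
    # Each row describes one oversized partition seen during flush or compaction.
    print(row)

cluster.shutdown()
```

There are similar tables for large rows and large cells.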
So basically, a steady-state, healthy system has to do with all the nodes being up, your client application doing what you think it's supposed to do, sending similar traffic to all the nodes and all the shards for all operations, and your data model and your requests not creating hot partitions or hot shards, with all the data pretty much balanced. If you have an imbalance, that's a cause for something to break.