What are some common monitoring issues and what can we do if the monitoring dashboards show no info?
The lesson also covers the CQL Optimizations dashboard and common CQL and client imbalance issues.
How to debug? Let’s start. Usually, when we start, we look
first at the Monitoring Overview. The Monitoring Overview is a very large
dashboard, and we throw more at the dashboard, so I admit I’m not keeping up,
but the most important part is on the left-hand side: are all nodes up? Is
everything green? Next is to look at the alerts. We’re adding alerts for things
that you should be aware of and that pinpoint the
problem in the system. Look at the alerts; they can already tell you what is broken
or what you need to check. So that is easy. Everything was green? Next is the
errors. We added a dashboard called Errors, and it
accumulates errors coming from different parts of the system. This is the error dashboard, and it needs to get better,
just to be clear. I’ll talk about the items that are included.
Coordinator-side errors are the read unavailable, write unavailable,
and range unavailable errors, and they have to do with how many nodes are available
for requests. If I’m doing a CL=ALL request and one node is down, that will
translate to an unavailable error. It’s bad to do a CL=ALL
operation on ScyllaDB; you’re setting yourself up for failure. Replica-side
errors (local reader, local writer, or AIO error) basically mean the replica is
not able to read, write, or do any I/O to the disk. That means that you need to
check that single node. I’ll talk about what it means to
check that single node. And last, there are C++ exceptions, which is something that
we usually look at: if a single node is irregularly high, or the
system didn’t have any C++ exceptions and started to have them, then it may
signal that something has broken.
It may mean that you need to look at the logs, but I can’t translate this to
something more than that at this point. The next question you should ask
yourself is: has the application changed? If the application has changed, then my
own recommendation is to look at the CQL Optimization dashboard. We’re
trying to extract information about what your application is doing towards ScyllaDB
and provide you the inputs. For example: you changed the application, somebody
started to build strings and send them directly to ScyllaDB, he’s using Java and not
gocql, which does it automatically, and you ended up with non-prepared
statements and everything is broken. I won’t dwell on this. The other
items that are marked: you’re not using paged CQL reads. Now, if you have reads
that are very small, it may not be an issue, but when you have a large
partition of hundreds of thousands of rows and you’re returning all of that back,
everything will dissolve. ScyllaDB will run out of memory: it’s trying to build
a huge response to a query that you just sent. So that is a way to set yourself up
for failure, unless you know what you’re doing. Not using a token-aware driver:
you’re adding latency for no reason, so change that. Reverse CQL reads are not
as optimized as regular reads. Again, if you have large partitions, trying to read
them backwards will end up with you crying and trying to figure out why.
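To make the paging point above concrete, here is a minimal Python sketch. It does not talk to ScyllaDB or use any driver API; `fetch_pages` and the 250,000-row partition are hypothetical stand-ins, used only to show that with paging the largest chunk held at once is bounded by the page size rather than by the partition size.

```python
from typing import Iterator, List, Sequence


def fetch_pages(rows: Sequence[int], page_size: int) -> Iterator[List[int]]:
    """Yield the rows of a (simulated) partition in pages of at most page_size rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(rows), page_size):
        # Only one page is materialized at a time.
        yield list(rows[start:start + page_size])


# Hypothetical large partition: hundreds of thousands of rows.
partition = range(250_000)

largest_page = 0
total_rows = 0
for page in fetch_pages(partition, page_size=5_000):
    largest_page = max(largest_page, len(page))
    total_rows += len(page)

print(f"rows read: {total_rows}, largest chunk in memory: {largest_page}")
```

Without paging, the equivalent of `list(partition)` would have to be built as a single response, which is the situation described above.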
ALLOW FILTERING: again, you need to understand that this
happens on the coordinator node, so we’re getting the information off the disk and
then applying the filtering. Can it be done better or not? You need to decide,
but we’re providing you the inputs on how well you’re filtering. If you’re
reading a million rows and returning one, is that correct or not? You need to
know, but it’s an indication. Everything was good?
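The "how well you're filtering" indication can be thought of as a simple ratio of rows returned to rows read. The sketch below is only an illustration of that arithmetic: the function names and the 1% threshold are assumptions made for the example, not names or values taken from the dashboard.

```python
def filtering_selectivity(rows_read: int, rows_returned: int) -> float:
    """Fraction of the rows read from disk that survive the filter."""
    if rows_read < 0 or rows_returned < 0:
        raise ValueError("row counts must be non-negative")
    if rows_returned > rows_read:
        raise ValueError("cannot return more rows than were read")
    if rows_read == 0:
        return 1.0  # nothing was read, so nothing was wasted
    return rows_returned / rows_read


def looks_wasteful(rows_read: int, rows_returned: int, threshold: float = 0.01) -> bool:
    """Flag a query when fewer than `threshold` of the rows read are returned.

    The default threshold is an arbitrary value chosen for this sketch.
    """
    return filtering_selectivity(rows_read, rows_returned) < threshold


# The case from the talk: read a million rows, return one.
print(filtering_selectivity(1_000_000, 1))
print(looks_wasteful(1_000_000, 1))
```

Whether a low ratio is acceptable still depends on the application; the number only tells you where to look.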
The next question is: do we have a client imbalance? This is very well answered
by the CQL coordinator side. The CQL coordinator side has a section on the
client connections and then the distribution of different CQL requests
across nodes and across shards. One thing to note is that you need to look at the
instance and the shard view. ScyllaDB is sharded. The fact that you have
the same number of connections to all the instances doesn’t mean that you have
connections to all the shards. A single shard, usually shard zero, can handle a lot of
connections while everybody else is resting, and then it will be the
bottleneck of the system. So use the shard view.
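A small Python sketch of why the instance view alone can hide this. The connection counts, node names, and the "twice the mean" rule are made up for illustration; the point is only that equal per-instance totals can coexist with one overloaded shard.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ShardKey = Tuple[str, int]  # (instance, shard number)


def per_instance_totals(conns: Dict[ShardKey, int]) -> Dict[str, int]:
    """Sum connection counts per instance (what the instance view shows)."""
    totals: Dict[str, int] = defaultdict(int)
    for (instance, _shard), count in conns.items():
        totals[instance] += count
    return dict(totals)


def hot_shards(conns: Dict[ShardKey, int], factor: float = 2.0) -> List[ShardKey]:
    """Return shards holding more than `factor` times the mean per-shard count."""
    if not conns:
        return []
    mean = sum(conns.values()) / len(conns)
    return sorted(key for key, count in conns.items() if count > factor * mean)


# Hypothetical sample: two instances with four shards each.
connections = {
    ("node1", 0): 40, ("node1", 1): 0, ("node1", 2): 0, ("node1", 3): 0,
    ("node2", 0): 10, ("node2", 1): 10, ("node2", 2): 10, ("node2", 3): 10,
}

print(per_instance_totals(connections))  # both instances show the same total
print(hot_shards(connections))           # only the shard view exposes node1, shard 0
```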
CQL read requests and so forth: you need to see that this is balanced. The
best example is CQL batches. Usually you have some background process; it starts
running sometimes, then sends huge batches, and it hits only a single
shard, and then everything slows down to that shard handling the batch requests
coming in. It needs to parse them, at times merge them, then distribute them to
all the other shards. It’s imbalanced.