An overview of the monitoring stack, the main dashboards, different deployment options, important metrics to track, alerts, and how they work. The lesson also covers common problems and how to diagnose them.
Monitoring. All I want to explain here is how the monitoring works, so please pay attention. The monitoring is easy to use: follow our instructions to install and deploy it and it takes ten minutes tops if you're going slowly; if you're just copying and pasting, it can be two minutes. So you have ScyllaDB running on the nodes. By default, ScyllaDB exposes its metrics on port 9180, and port 9100 is for node_exporter. ScyllaDB exports the ScyllaDB metrics and node_exporter exports the OS metrics, so there are two sets of metrics, on two ports, on every node.
Then, our monitoring solution is comprised of Prometheus, Alertmanager, and Grafana, all accessible through your browser. Usually Prometheus runs on port 9090 and Grafana on port 3000. Prometheus is constantly querying both of those ports on each of the nodes; every time Prometheus reaches out, it gets the metrics and saves them, because Prometheus is a time-series database.
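To give you an idea, a Prometheus scrape configuration for those two ports might look something like this (the IPs are placeholders; the monitoring stack generates an equivalent configuration for you):

```yaml
# Sketch of a Prometheus scrape configuration for the two endpoints above.
scrape_configs:
  - job_name: scylla            # ScyllaDB metrics, port 9180
    static_configs:
      - targets: ['10.0.0.1:9180', '10.0.0.2:9180']
  - job_name: node_exporter     # OS-level metrics, port 9100
    static_configs:
      - targets: ['10.0.0.1:9100', '10.0.0.2:9100']
```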
After the metrics are in Prometheus... because we have a gazillion metrics, I kid you not. I'll show you a little bit on the next slide, but we have more metrics than you're ever going to use; it's better to have the metrics and not need them than to need them and not have them, right? So Prometheus stores all of that, and then there are the dashboards we created in Grafana, and they're beautiful, they're useful, they're the best thing. When you go to Grafana you will see about six different dashboards; I think I have a note here with all their names. Yes: Overview, Detailed, CPU Metrics, OS Metrics, IO, and ScyllaDB CQL.
When you go to our Grafana, you'll see all those selectors at the top. Usually, the way you look at the monitoring is that you start very high level: you can look at the entire cluster, then select just one DC out of that cluster, then one node out of that DC, and you can even drill down to the shard level. That's usually what we do when we're troubleshooting problems, and I recommend you do the same. Even if you're not troubleshooting anything, by all means look at ScyllaDB while you're doing things. If you're ingesting data, take a look at the monitoring; it's going to tell you a lot about ScyllaDB's inner workings. You'll see how CPU is being used, how memory is being used, how I/O is happening, how often compactions are kicking in. So by all means, use it.
This is what the monitoring configuration file looks like: it's basically a list of nodes for one cluster in a particular DC, and that's it. We even have a script that will generate this file for you: you just list the cluster name, the DC, and the nodes, and it generates the YAML for you, so you don't get into trouble, because, come on, it can get messy, right? If you've dealt with YAML, you know what I'm talking about.
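For illustration, a node-list file of that shape might look roughly like this (cluster name, DC, and addresses are placeholders):

```yaml
# Sketch of the per-cluster, per-DC node list consumed by the monitoring stack.
- targets:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3
  labels:
    cluster: my-cluster
    dc: dc1
```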
Our monitoring can be run on Docker, and that's what I recommend you use, because it's just so easy. But if you already have your own Prometheus and your own Grafana, and you're already using them to monitor other things in your company, it's pretty simple: you just add the nodes to Prometheus and run the load-grafana script that comes with our monitoring, and it will load all the dashboards into your Grafana.
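As a rough sketch, assuming the start script and flags from the versions of the monitoring stack I've used (start-all.sh with -s for the server list and -d for the Prometheus data directory), bringing up the Docker-based stack looks something like this:

```bash
# Sketch of starting the Docker-based monitoring stack; script name and
# flags are assumptions based on the bundled start-all.sh.
git clone https://github.com/scylladb/scylla-monitoring.git
cd scylla-monitoring
./start-all.sh -s scylla_servers.yml -d /path/to/prometheus_data
```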
ScyllaDB metrics: remember I mentioned we have a gazillion metrics? This is like page one out of a billion, I don't know. But if you go to any ScyllaDB node, on port 9180 at /metrics, this is what you're going to see. The metrics themselves are there, but we also have help text and the type of each metric; in this case, this one is a counter, and it's the total number of sent messages. So if you need one of the metrics that isn't in a dashboard for some reason, this is how you find out what a particular metric is. Or maybe you're doing some reverse engineering: you went into our Grafana, you saw the name of a metric, and you want to know what it is; you just look there.
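For example, fetching the metrics page by hand looks something like this (the metric name and help text here are illustrative; exact names vary by ScyllaDB version):

```bash
# Pull the first few lines of the ScyllaDB metrics page from one node.
NODE=10.0.0.1           # placeholder address
curl -s http://$NODE:9180/metrics | head
# Typical output shape:
#   # HELP scylla_transport_requests_served Total number of served requests
#   # TYPE scylla_transport_requests_served counter
#   scylla_transport_requests_served{shard="0"} 12345
```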
OK, this is an example of Alertmanager; remember that I mentioned it earlier. Alertmanager is a plug-in for Prometheus, and this is basically how you set up an alert. There's a file for this (what's the name of it... the rules config, yes). You can use any of the Prometheus metrics to generate an alert, and the threshold is up to you: some people want to alert on one value, some people on a different one. So you set here which metric you're using, the threshold you want, the severity of the alert, and how frequently it should evaluate against Prometheus; the description is what will show up in the integration. So you have Alertmanager integrated with Prometheus, watching the metrics.
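For instance, a Prometheus alerting rule of the kind described above might look like this (the metric, threshold, and labels are examples; pick whatever matters to you):

```yaml
# Sketch of a Prometheus alerting rule: expression, hold time, severity,
# and a description that the receiver will display.
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0             # the metric/expression being watched
        for: 1m                   # how long it must hold before firing
        labels:
          severity: critical      # severity of the alert
        annotations:
          description: "{{ $labels.instance }} has been down for more than 1 minute."
```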
You can configure Alertmanager to send alerts with pretty much anything: email, PagerDuty, Slack, Telegram, you name it. Those are the integrations available for Alertmanager. And the reason we already ship some alerts with our stack is just so that you have some examples: you can look at what's there and create your own based on it. The rule_config.yml is the one I showed before, and the word I was looking for is the receiver, which is what you integrate with Slack, PagerDuty, and so on. We have an example here for emails; this is what it looks like, and that's the email we receive.
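A minimal sketch of an Alertmanager configuration with an email receiver (addresses and SMTP host are placeholders; Slack, PagerDuty, and the rest are added the same way as their own receiver blocks):

```yaml
# Sketch of an Alertmanager route plus one email receiver.
route:
  receiver: team-email
receivers:
  - name: team-email
    email_configs:
      - to: oncall@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
```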
Common problems: generally, problems with the monitoring come down to ports or to things not running. So the first thing you should do, if you're running on Docker, is run docker ps -a, for example, and check whether the containers are up and running. If you're running on your own stack, by all means check that Prometheus is running, that Grafana is running, and that all the ports are open between the monitoring and the nodes (those two ports I mentioned at the beginning) and among the monitoring components themselves: Prometheus is up on port 9090, Grafana on 3000, and so on and so forth. Make sure all the ports are reachable. Troubleshooting is pretty easy; usually it's just a matter of looking at netstat and looking at the firewall rules. Some more examples I showed you: ScyllaDB on port 9180 for the ScyllaDB metrics, and the same thing for node_exporter but on a different port, as I mentioned at the beginning.
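A few quick checks along those lines (addresses and ports adjusted to your environment):

```bash
# Are the monitoring containers running?
docker ps -a

# Are the exporters reachable from the monitoring host?
NODE=10.0.0.1                                   # placeholder ScyllaDB node
curl -s http://$NODE:9180/metrics | head        # ScyllaDB metrics
curl -s http://$NODE:9100/metrics | head        # node_exporter (OS) metrics

# Are Prometheus and Grafana listening locally?
curl -s http://localhost:9090/-/healthy         # Prometheus health endpoint
ss -ltn | grep -E ':9090|:3000'                 # open listening ports
```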
Keep your monitoring stack up to date, because we're constantly improving the dashboards. When people have problems, and a problem is really hard to troubleshoot, and we feel we're missing a panel on a dashboard, we're going to add it. So by all means keep the monitoring up to date; it's going to benefit you. Specify a data directory for Prometheus, because Prometheus saves everything as files: make sure you specify a specific mount point or file system for the Prometheus data, because if in the future you're upgrading Prometheus or moving it to a different machine, you'll know exactly where those files are, so it's easy to move and easy to migrate.
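If you run your own Prometheus, a sketch of what that looks like (the path is a placeholder for your dedicated mount point):

```bash
# Point Prometheus at a dedicated data directory so it is easy to find,
# back up, and migrate later.
prometheus --config.file=/etc/prometheus/prometheus.yml \
           --storage.tsdb.path=/var/lib/prometheus/data
```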
This is what I mentioned before: always look at the monitoring; it doesn't matter whether you have a problem. Look at your workloads, because then you can predict whether six months from now you're going to run out of disk, or whether maybe three nodes won't be enough. So always keep an eye on the monitoring, and create alerts for the things that are important for your application. Don't create alerts for things that don't matter to you; create alerts only for the really important things. Let's say you're running tight on storage: put an alert on storage. If you don't care about latency, don't create an alert for latency; otherwise it's just annoying, and when an important alert fires, you're going to miss it.
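As an example of that storage alert, something along these lines would work with node_exporter's filesystem metrics (metric names assume a reasonably recent node_exporter; the mount point and threshold are placeholders):

```yaml
# Sketch: warn when the ScyllaDB data filesystem drops below 20% free.
groups:
  - name: storage
    rules:
      - alert: DiskAlmostFull
        expr: >-
          node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"}
          / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} < 0.20
        for: 15m
        labels:
          severity: warning
        annotations:
          description: "Less than 20% disk space left on {{ $labels.instance }}."
```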
We will ask you for monitoring data. Every time you come to me and say, "Moreno, I have a problem," if it's not something basic (like what I mentioned: let's look at the YAML file, maybe there's a misconfiguration there), if it's a serious problem, a performance problem, the first thing I'm going to ask you for is your monitoring data. There are two ways you can give me that. You can take screenshots; that's okay, I will look, but then I cannot manipulate the data. The best way is this: remember the data directory I mentioned for Prometheus? Get those files and send them to me, because then I can replay the data on my own computer, and then I can drill down and do lots of things with the data that I can't do with screenshots. And this is a good experiment: whenever you're using the monitoring, try to get the monitoring data, put it on your laptop, and try to replay it. It's fun.
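A rough sketch of that replay, assuming the Prometheus versions are compatible (hostnames and paths are placeholders):

```bash
# Copy the Prometheus data directory from the monitoring host and replay it
# with a local Prometheus instance.
scp -r monitoring-host:/var/lib/prometheus/data ./prometheus_data
prometheus --config.file=prometheus.yml \
           --storage.tsdb.path=./prometheus_data
```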