This lesson covers Scylla Monitoring. Scylla Monitoring is a full stack for monitoring a Scylla cluster and for alerting. The stack contains open-source tools, including Prometheus and Grafana, as well as custom Scylla dashboards and tooling.
Some of the topics covered in the lesson are:
- An overview of the monitoring stack, the main dashboards, different deployment options, important metrics to track, alerts, and how they work. Also covers common problems and how to diagnose them.
- Advanced monitoring techniques and how to troubleshoot Scylla issues. It goes over the tools and techniques used for monitoring and explains the healthy state of the system and how to get started with troubleshooting an issue.
- What are some common monitoring issues and what can we do if the monitoring dashboards show no info? Also covers the CQL Optimizations dashboard and common CQL and client imbalance issues.
- What should we do if a replica becomes unbalanced? Covers the latencies dashboard, common issues that can cause high latencies in a node, the cache replica dashboard, and the MV (materialized view) and memory replica views.
- In the case of replica imbalance, the common issues are hot partitions, large partitions, and issues related to the specific node. The node-specific issues include CPU, I/O, and OS-related issues. These issues are explained, including how to debug them.
- What are stalls and what do they mean? What can be done about them? A stall occurs when a specific task runs continuously on the CPU without letting other operations run. Possible causes of stalls are a Scylla bug, the data model used, and the OS.
- What does a healthy system state look like in terms of memory? Includes examples of common memory issues such as large allocations and bad_allocs, and what can be done to diagnose and solve these issues.
- What do we read from and write to disk? Goes into detail about the different disk-related tasks, how they are performed, what some common issues are, and how these issues can be resolved.
- Common I/O task patterns, storage I/O profiling, disk scheduling, relevant views in the monitoring dashboard, and related metrics.
The lesson also includes quizzes and a hands-on lab, in which you will install the monitoring stack and explore some of the important dashboards.
You can learn more about Scylla Monitoring in the documentation.