13 min to complete
In a previous lesson, we expanded the Mutant Monitoring System across multiple datacenters and learned about the available consistency levels. With our infrastructure in place, we can move on to important concepts like monitoring. Krishna, the manager of the Fire and Ice Mutant division, set fire to an entire datacenter after getting mad at Deris, an in-memory mutant. Since DevOps hadn’t configured Scylla monitoring, no metrics were sent that could have triggered an alarm.
After this event, Division 3 decided that it was essential to accurately monitor all of the Scylla nodes so we can examine important details such as performance, latency, node availability, and more. In this lesson, we will use the Scylla Monitoring Stack to monitor our Scylla cluster.
The Scylla Monitoring Stack runs in Docker and consists of Prometheus and Grafana containers. Prometheus is an open-source system monitoring and alerting toolkit and is used to gather metrics from the Scylla cluster. Grafana is an open-source analytics platform that will allow us to visualize the data from Prometheus in graphs. Scylla Monitoring will be able to communicate with the Scylla cluster in this exercise because all of the components will be on the same virtual network.
With Scylla Monitoring, Division 3 can monitor internal database metrics such as load, throughput, latency, reads, writes, cache hits and misses, and more. Linux metrics are also recorded, such as disk activity, utilization, and networking. To get started, we will first need to bring up our Scylla cluster.
Starting the Scylla Cluster
For this lesson, we require a multi-datacenter cluster. If you just completed either the Multi-datacenter Replication or the Multi-datacenter Consistency Levels lesson, you can skip to the next section, “Configuring and Starting Scylla Monitoring”. Otherwise:
Follow this procedure to remove previous clusters and set up a new Scylla cluster.
Change to the mms directory (if it is not already your working directory):
To bring up the second datacenter, run the docker-compose utility and reference the docker-compose-dc2.yml file:
docker-compose -f docker-compose-dc2.yml up -d
After about 60 seconds, you should be able to see DC1 and DC2 when running the “nodetool status” command:
docker exec -it mms_scylla-node1_1 nodetool status
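Rather than reading the table by eye, the check can be scripted. The sketch below assumes the usual nodetool status layout, in which every node row starts with a two-letter state code such as UN (Up/Normal) or DN (Down/Normal). The function names count_up_nodes and wait_for_up_nodes are hypothetical helpers, not part of nodetool or Scylla.

```shell
# count_up_nodes: read `nodetool status` output on stdin and print the number
# of node rows whose state code is UN (Up/Normal).
# Assumption: node rows begin with the two-letter state code.
count_up_nodes() {
  awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# wait_for_up_nodes: poll node1 until at least <expected> nodes are UN.
# Usage: wait_for_up_nodes <expected> [<tries>]   (5 seconds between tries)
# Note: -it is dropped from docker exec because the output is piped.
wait_for_up_nodes() {
  expected=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    up=$(docker exec mms_scylla-node1_1 nodetool status 2>/dev/null | count_up_nodes)
    if [ "$up" -ge "$expected" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  return 1
}
```

For the six-node cluster used in this lesson, `wait_for_up_nodes 6` would return once both datacenters have fully joined, or fail after the retries run out.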
Now that the cluster is up and running, we can configure and start the Scylla Monitoring stack.
Configuring and Starting Scylla Monitoring
Please note that in production, you should follow the official Scylla Monitoring documentation.
Download the latest release to the mms directory and change into the directory you just downloaded (its name may differ depending on the version):
Next, we need to edit the configuration files to make Scylla Monitoring work with the Mutant Monitoring System. Rename the file prometheus/scylla_servers.example.yml to prometheus/scylla_servers.yml and edit it according to your node IP addresses as follows:
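For orientation, the file is a list of target groups in the format Prometheus uses for file-based service discovery: each group has a list of targets and a set of labels. The sketch below generates such groups from a list of addresses. The helper name target_group, the label values (cluster1, DC1, DC2), and the example addresses are placeholders for illustration; port 9180 is the port on which Scylla exposes its Prometheus metrics by default. Compare the output with the example file shipped in your release before using it.

```shell
# target_group: print one target group for scylla_servers.yml.
# Usage: target_group <cluster-label> <dc-label> <ip> [<ip> ...]
# Assumption: each node exports metrics on Scylla's default port 9180.
target_group() {
  cluster=$1
  dc=$2
  shift 2
  printf '%s\n' "- targets:"
  for ip in "$@"; do
    printf '  - %s:9180\n' "$ip"
  done
  printf '%s\n' "  labels:"
  printf '    cluster: %s\n' "$cluster"
  printf '    dc: %s\n' "$dc"
}

# Hypothetical usage with placeholder addresses; substitute your own node IPs:
#   {
#     target_group cluster1 DC1 172.20.0.2 172.20.0.3 172.20.0.4
#     target_group cluster1 DC2 172.20.0.5 172.20.0.6 172.20.0.7
#   } > prometheus/scylla_servers.yml
```

Grouping the nodes by datacenter lets the dashboards filter and aggregate on the dc label.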
Create the data_dir directory:
Now rename the file scylla_manager_servers.example.yml to scylla_manager_servers.yml.
With the configuration files in place, we can start Scylla Monitoring with the following command:
./start-all.sh -v 3.0 -d data_dir -D "--network=mms_web"
Notice that we pass the Docker network as a parameter so that the IP addresses of the Scylla node containers are reachable from the Prometheus container. This might not be required in production.
The Docker containers should be up and running now. Access Scylla Monitoring by browsing to http://127.0.0.1:3000/. Port 3000 is the default Grafana port.
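If you prefer to confirm from the command line that Grafana is answering before opening the browser, Grafana's HTTP API has a health endpoint at /api/health that returns a small JSON document. The sketch below checks that document for a healthy database status. The helper name grafana_db_ok is hypothetical, and the check assumes the response contains a "database": "ok" field, as it does in the Grafana releases we are aware of.

```shell
# grafana_db_ok: read the JSON body of Grafana's /api/health endpoint on stdin
# and succeed only if it reports "database": "ok".
# Assumption: the response carries a "database" field with the value "ok"
# when Grafana is healthy.
grafana_db_ok() {
  grep -q '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Hypothetical usage once the containers are running:
#   curl -s http://127.0.0.1:3000/api/health | grafana_db_ok && echo "Grafana is up"
```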
To view the metrics, click on the Home drop-down button and then on the version number (3.0 in our case) and Scylla Overview Metrics.
Keep in mind that the metrics we see are for our demonstration cluster. In production, the data would look different.
From this dashboard, you will be able to see useful information such as Total Requests, Load information, Disk Activity, and more. The problem now is that the graphs do not look that interesting because there is no activity on the cluster. To make the graphs pretty, let’s run a load generator.
Using the cassandra-stress Tool to Generate Load
The cassandra-stress tool is used for benchmarking and load testing Scylla and Cassandra clusters. In our case, we will use it to write data to our cluster so that we can see monitoring metrics. To get started, run the following command after replacing the IP address with the IP address of one of the nodes.
docker exec -it mms_scylla-node1_1 cassandra-stress write duration=5m cl=one -mode native cql3 -node 172.20.0.3 &
This will run write operations for five minutes. You can now switch back to the Grafana dashboard, and the graphs should be more interesting and colorful than before. After a few minutes, they will be even better, and we can begin to dive into Scylla Monitoring and see useful information about how MMS is running.
Since we just want to see the monitoring dashboard populated with data, this simple command will suffice. With cassandra-stress, it is also possible to test specific data models and to model real workloads. More information is available in the cassandra-stress documentation.
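The address passed to -node in the command above has to match one of your own nodes. One way to look it up is to pull the Address column out of the nodetool status output. The sketch below assumes that node rows start with a two-letter state code (UN, DN, and so on) followed by the address; node_addresses is a hypothetical helper name.

```shell
# node_addresses: read `nodetool status` output on stdin and print the Address
# column of every node row.
# Assumption: node rows start with a state code such as UN or DN, with the
# address in the second column.
node_addresses() {
  awk '$1 ~ /^[UD][NLJM]$/ { print $2 }'
}

# Hypothetical usage: take the first address and hand it to cassandra-stress.
# -it is dropped from the first docker exec because its output is piped.
#   NODE_IP=$(docker exec mms_scylla-node1_1 nodetool status | node_addresses | head -n 1)
#   docker exec -it mms_scylla-node1_1 cassandra-stress write duration=5m cl=one -mode native cql3 -node "$NODE_IP"
```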
Scylla Monitoring also shows useful information on a per-node basis. To view that dashboard, click on the current dashboard drop-down button, followed by Scylla Per-Server Disk I/O 3.0 (or relevant version number).
From here, you can view node-specific details individually or as a cluster. To select which nodes to monitor, click on the node box at the top of the screen and choose ALL or however many nodes you want to view.
Taking Down a Node
Now let’s observe what happens on the dashboard when a node is taken down from the cluster with the following command:
docker pause mms_scylla-node4_1
After a few seconds, we will see that there are still a total of six nodes, and one is unreachable.
To bring that node back online, run the following command:
docker unpause mms_scylla-node4_1
After a few seconds, there should be no unreachable nodes on the dashboard.
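The dashboard's view can be cross-checked from the command line by asking a healthy node which peers it considers down. The sketch below assumes that nodetool status marks such rows with a state code beginning with D (for example DN); down_nodes is a hypothetical helper name.

```shell
# down_nodes: read `nodetool status` output on stdin and print the address of
# every node whose state code starts with D (Down).
# Assumption: node rows start with the state code, followed by the address.
down_nodes() {
  awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}

# Hypothetical usage from a node that is still running:
#   docker exec mms_scylla-node1_1 nodetool status | down_nodes
```

While node4 is paused, the output should list its address; after the unpause it should be empty again.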
We are now ready at Division 3 to monitor our Mutant Monitoring System using Scylla Monitoring. If a node goes down or performance becomes an issue, we will be able to see that quickly from the dashboard. In one of the next lessons, we will discuss how to back up and restore the mutant data. Please remain safe out there!