We Are Under Attack!

At Division 3, our mutant data centers are experiencing more and more cyberattacks by evil mutants, and we sometimes suffer downtime and lose track of our IoT sensors. We must now plan for disaster scenarios so that we know we can survive an attack. In this lesson, we will walk through a node failure scenario and learn about consistency levels.

Environment Setup

If you completed the previous lesson, you can skip to the section “Simulating the Attack,” as you already have the Mutant Monitoring System up and running with the Mutant Catalog keyspace populated with data.

Otherwise, follow this procedure to set up a Scylla cluster. Once the cluster is up, we’ll create the catalog keyspace and populate it with data.
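As a minimal sketch, assuming the lesson’s compose files live in the scylladb/scylla-code-samples repository on GitHub and that the default docker-compose.yml in the mms directory brings up the three-node cluster, the setup looks roughly like this:

git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/mms
docker-compose up -d

Wait until all three nodes report UN in nodetool status before continuing.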

The first task is to create the keyspace for the catalog.

docker exec -it mms_scylla-node1_1 cqlsh
CREATE KEYSPACE catalog WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy','DC1' : 3};

Now that the keyspace is created, it is time to create the table.

use catalog;

CREATE TABLE mutant_data (
   first_name text,
   last_name text,
   address text,
   picture_location text,
   PRIMARY KEY((first_name, last_name)));

Now let’s add a few mutants to the catalog with the following statements:

insert into mutant_data ("first_name","last_name","address","picture_location") VALUES ('Bob','Loblaw','1313 Mockingbird Lane', 'http://www.facebook.com/bobloblaw');
insert into mutant_data ("first_name","last_name","address","picture_location") VALUES ('Bob','Zemuda','1202 Coffman Lane', 'http://www.facebook.com/bzemuda');
insert into mutant_data ("first_name","last_name","address","picture_location") VALUES ('Jim','Jeffries','1211 Hollywood Lane', 'http://www.facebook.com/jeffries');

Simulating the Attack

Let’s use the nodetool command to examine the status of the nodes in our cluster. If you are still in the cqlsh, exit:

exit
docker exec -it mms_scylla-node1_1 nodetool status

We can see that all three nodes are currently up and running because their status is UN (Up/Normal). Now let’s take node 3 down by pausing its container:

docker pause mms_scylla-node3_1

Wait about 30 seconds, then use nodetool to recheck the status of the cluster:

docker exec -it mms_scylla-node1_1 nodetool status

We can now see that node 3 is reported as DN (Down/Normal). The data is safe for now because we created the keyspace with a Replication Factor of three, which places a replica of every row on each of the three nodes.
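If you want to double-check the replication settings, you can describe the keyspace; this only reads schema metadata, so it works even with a node down:

docker exec -it mms_scylla-node1_1 cqlsh -e "DESCRIBE KEYSPACE catalog"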

We should still be able to run queries on Scylla:

docker exec -it mms_scylla-node1_1 cqlsh
select * from catalog.mutant_data;

All of the Mutant data is still there and accessible even though there are only two replicas remaining out of three (RF=3).

Consistency Levels

The data is still accessible because the Consistency Level is still being met. The Consistency Level (CL) determines how many replicas in a cluster must acknowledge a read or write operation before it is considered successful. QUORUM is the default Consistency Level: the read or write request succeeds once a majority of the replicas respond. QUORUM is calculated as (RF / 2) + 1 using integer division, where RF is the Replication Factor, so with a Replication Factor of 3 only two replicas need to respond.
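For example, with RF=3 the quorum is (3 / 2) + 1 = 2, so losing one of the three replicas still leaves enough to satisfy QUORUM. From the cqlsh prompt you can display the session’s current Consistency Level at any time; the exercises below change it with CONSISTENCY ONE and CONSISTENCY ALL:

CONSISTENCY;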

Let’s test how consistency levels work with the cqlsh client. In this exercise, we will write data to the cluster using a Consistency Level of ONE, which means the request succeeds as soon as a single replica responds.

CONSISTENCY ONE;
insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Jobs','1 Apple Road', 'http://www.facebook.com/jobs') ;
select * from catalog.mutant_data;

The query was successful!

Now let’s test using a Consistency Level of ALL. This means that all of the replicas must respond to the read or write request; otherwise, the request will fail.

CONSISTENCY ALL;
insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Wozniak','2 Apple Road', 'http://www.facebook.com/woz') ;
select * from catalog.mutant_data;

Both queries fail with a “NoHostAvailable” error because only two of the three nodes are online and the Replication Factor (RF) is 3, so CONSISTENCY ALL cannot be satisfied.
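As an optional sanity check, you can drop back to QUORUM and rerun the query; with two of the three replicas still available, it should succeed again:

CONSISTENCY QUORUM;
select * from catalog.mutant_data;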

Adding a Node Back and Repairing the Cluster

You can find more information about replacing a dead node in a cluster in the Scylla documentation.

We need to get the cluster back into a healthy state with three replicas by adding a new node. Exit cqlsh:

exit

Change directory to “scylla-code-samples/mms” if it is not already your working directory:

cd scylla-code-samples/mms

Next, we need to get the IP address of the node that is down:

docker inspect mms_scylla-node3_1 | grep IPAddress

Run the command below to add replace_address_first_boot to scylla-replace-node.yaml. Replace the IP address with the correct IP of mms_scylla-node3_1:

echo "replace_address_first_boot: 172.20.0.3" >> ./scylla/scylla-replace-node.yaml

Now create and start the replacing node:

docker-compose -f docker-compose-day3.yml up -d

After the replacing node joins the cluster, it will bootstrap data from the other nodes. This can take about two minutes. Once the bootstrap is complete, we will see the replacing node in the cluster in place of the node that is down, and all three nodes will be up and running:

docker exec -it mms_scylla-node1_1 nodetool status
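While waiting for the bootstrap to finish, you can optionally follow the replacing node’s logs. The service name below (scylla-node4) is an assumption; use whatever name docker-compose-day3.yml gives the replacing node:

docker-compose -f docker-compose-day3.yml logs -f scylla-node4

Once nodetool status shows three nodes in the UN state, continue to the next step.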

Great, now we have three nodes. Is everything ok now? Of course not! We need to run a repair to make sure the data on the replacing node is in sync with the other nodes in the cluster.

docker exec -it mms_scylla-node1_1 nodetool repair

When the repair finishes, the nodes will be in sync and the data will be protected against additional failures. Since we no longer need the paused container named mms_scylla-node3_1, we can delete it:

docker rm -f mms_scylla-node3_1
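Now that all three replicas are back and repaired, the CONSISTENCY ALL query that failed earlier should succeed. This is an optional way to verify the recovery:

docker exec -it mms_scylla-node1_1 cqlsh
CONSISTENCY ALL;
select * from catalog.mutant_data;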

Conclusion

We are now more knowledgeable about recovering from node failure scenarios and should be able to recover from a real attack by the mutants. With a Consistency Level of QUORUM in a three-node Scylla cluster, we can afford to lose one of the three nodes and still access our data. Please be safe out there as we continue to track the mutants and evolve our Monitoring System.
