It covers token ranges, how data is partitioned in ScyllaDB (and Apache Cassandra, for that matter), data replication, Consistency Level (also called CL), and working with data centers.
The way it works is that each node is responsible for a range of tokens. This is a bit of a simplification, but in this example we can see that the first node is responsible for the token range 1 to 400, and so on. The partitioner hashes the partition key of each piece of data to a token, so ScyllaDB knows which node is responsible for that data.
And by the way, I’m showing you how this works under the hood. The system does all of this automatically; as a user performing a query, you don’t have to worry about it or do anything. But it is important to understand that this is how ScyllaDB works internally.
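As a rough illustration of what the partitioner does under the hood, you can ask for a partition key’s token directly in CQL; the table and column names here are hypothetical:

    SELECT token(user_id), user_id FROM users;   -- token() returns the value the partitioner computed

The returned token determines which node’s range the row falls into.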
So I mentioned data is replicated, and this is determined by a setting called the “Replication Factor”, also abbreviated as RF. The replication factor is the number of nodes that hold a copy of each piece of data.
Replication is done automatically, and it’s configured per keyspace. A keyspace is a container that holds multiple tables, and all the tables in a keyspace inherit its replication factor.
Here in the example, you see a command to create a keyspace with a replication factor of three.
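The command looks something like this; the keyspace name is illustrative, and NetworkTopologyStrategy is the generally recommended replication strategy:

    CREATE KEYSPACE mykeyspace
        WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};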
Now, if we have a replication factor of three, it means that each piece of data is replicated to three different nodes. So we’re going to hold three copies of each piece of data if the replication factor is three.
And that means that even if one of the replica nodes
goes down, we can still serve requests.
Let me stop for a second and see
if there are any questions in the chat window.
I’m not going to be able to answer all of them.
So there’s one question: if one of the three nodes goes down, will the database still be available?
So that depends, but if we have a replication factor of three and one of the nodes goes down, we can still access the data, because we have two other replicas that hold a copy of it. So the answer is yes, the data will still be available.
Another question is: can the partition include secondary keys in addition to the primary key? The terms here are a bit different. A primary key is made up of the partition key and, optionally, a clustering key. I’ll talk about that just a bit in this talk, but it’s covered quite extensively in the next talk. So if that’s interesting to you, stick around for the next talk in the essentials track, where I’ll cover the primary key, partition key, and clustering key quite a bit.
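As a quick illustration, here’s a hypothetical table whose primary key consists of a partition key and a clustering key:

    CREATE TABLE heartrate (
        pet_chip_id uuid,     -- partition key: determines which nodes store the row
        time timestamp,       -- clustering key: orders rows within a partition
        heart_rate int,
        PRIMARY KEY (pet_chip_id, time)
    );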
So I want to move forward.
I have quite a lot of material.
You can save your questions for the Q&A or for the
expert panel.
Great.
So I talked about the replication factor.
Another important concept is the consistency level or CL.
CL is the number of replica nodes that must acknowledge a read or write request. Some examples of possible values: ONE means just one replica node needs to acknowledge the request; QUORUM means a majority of the replica nodes need to acknowledge it (with a replication factor of three, that’s two nodes); ALL means all replica nodes need to acknowledge the read or write request.
And this is tunable: the consistency level can be set per operation. So maybe in your specific application you have a use case where writes don’t need very high consistency, and you can set their consistency level to ONE, but the reads require higher consistency, so you set the consistency level for reads to be higher. Let me show a quick example of this.
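In cqlsh, the CONSISTENCY command sets the level used for the statements that follow (drivers typically let you set it per statement as well). The table here is the hypothetical one from before:

    CONSISTENCY ONE;       -- a single replica acknowledgment is enough for this write
    INSERT INTO heartrate (pet_chip_id, time, heart_rate)
        VALUES (123e4567-e89b-12d3-a456-426614174000, '2023-01-01 08:00:00', 90);

    CONSISTENCY QUORUM;    -- this read needs a majority of replicas (two out of three with RF 3)
    SELECT * FROM heartrate
        WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426614174000;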
Data centers: ScyllaDB is topology aware and built to work with multiple data centers. It’s also rack aware. There are a few reasons to use multiple data centers and to be topology aware. In this example, we have a cluster that spans two data centers, say one in the U.S. and one in Asia. It would make sense to serve queries from users located in the U.S. out of the U.S. data center, just because it’s closer, and for performance reasons it’s going to be quicker to serve those requests.
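As a sketch, a keyspace replicated across both data centers could be defined like this; the keyspace and data center names are illustrative and have to match the names configured in your cluster:

    CREATE KEYSPACE mykeyspace
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3, 'asia': 3};

Clients in each region can then use a consistency level such as LOCAL_QUORUM, so requests are acknowledged by replicas in the local data center.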
Another reason to use multiple data centers is, again, high availability. Even if a specific rack or an entire data center goes down, say there is a disaster in the U.S. data center, our system would still be up and running, and we would serve requests from the Asia data center. Once the U.S. data center recovers and comes back up, the Asia data center propagates the data back to the data center that was down.
This is handled by anti-entropy mechanisms, which deal with repair. If you’re interested in that, Tzach is going to cover it in his talk about ScyllaDB power users, which is the third talk in the advanced track. So if you’re interested in anti-entropy mechanisms, repair, tombstones, and more advanced topics, check out that talk.
One more benefit of using multiple data centers is compliance. Maybe we have a compliance requirement that all the data for users in a specific geographical location, say the U.S., stays within the U.S. In that case, we can keep their data in the U.S. data center and comply with those requirements.
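As a sketch, that can be as simple as replicating the relevant keyspace only to the U.S. data center (again, the keyspace and data center names are illustrative):

    CREATE KEYSPACE us_user_data
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};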