This lesson covers ScyllaDB Architecture and Token Ring Architecture, with an emphasis on how they affect ScyllaDB-specific drivers. You’ll learn about the ring node architecture, Vnodes, the hash function, ScyllaDB’s shard-per-core architecture, and how ScyllaDB drivers work compared to Cassandra’s. Before starting this course, it’s recommended you take the ScyllaDB Essentials course.
Let’s get started. Let me introduce you to the Cassandra and ScyllaDB token ring architectures; we’ll see what they have in common, but also what ScyllaDB has done that makes it special and worth a dedicated Python driver.
The first thing to know is that a Cassandra or ScyllaDB cluster is a collection of nodes, or instances, that can be visualized as a ring. All the nodes in this ring should be homogeneous, following a shared-nothing approach. This means that no node in this topology is special: any Cassandra or ScyllaDB node on the ring has no special role. There is no primary or secondary; they all do the same thing.
This ring is called a token ring: the positions of the nodes on the ring define token ranges. Going clockwise on the ring, the range preceding a node is the token range, or partition, that this node is responsible for. A partition is simply a subset of the data stored on a node; in CQL, the Cassandra Query Language, a partition appears as a group of sorted rows and is the unit of access for queried data. This data is usually replicated across nodes thanks to a setting called the replication factor.
The replication factor defines how data is replicated across nodes. For example, a replication factor of 2 means that a given token range, or partition, will be stored on two nodes. That is the case here, where partitions 1 and 2 are stored on node X; partition 2 is also stored on node Y, while partition 1 is also stored on node Z. If we were to lose node X, we could still read partition 1’s data from node Z. This is how high availability is achieved, and why Cassandra and ScyllaDB favor availability and partition tolerance: they are AP systems in terms of the CAP theorem.
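To make this concrete, here is a minimal Python sketch of clockwise replica placement on a token ring. The tokens and node names are made up, and the placement shown is the simple “owner plus the next distinct nodes clockwise” scheme; it is an illustration, not driver code.

```python
from bisect import bisect_left

# A toy three-node ring. Each node owns the token range that precedes
# its position when walking the ring clockwise.
ring = sorted([(-6148914691236517206, "X"), (0, "Y"), (6148914691236517205, "Z")])
tokens = [token for token, _ in ring]

def replicas(token, replication_factor):
    """Collect `replication_factor` distinct nodes, walking clockwise."""
    start = bisect_left(tokens, token)  # first node at or after the token
    nodes = []
    i = start
    while len(nodes) < replication_factor:
        node = ring[i % len(ring)][1]
        if node not in nodes:
            nodes.append(node)
        i += 1
    return nodes

# With a replication factor of 2, losing the first replica still
# leaves the data readable from the next node on the ring.
print(replicas(-42, replication_factor=2))  # ['Y', 'Z']
```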
This kind of token ring architecture is sensitive to how data is distributed among the nodes. Queries should, in theory, be evenly distributed between nodes, but we could get an imbalance of data and query load in the above scenario, where we store data in three big partitions and each node holds one range from the previous node and one from the next. If one of those partitions were to grow larger than the others, the nodes holding it would receive more queries and become overloaded. To counter this effect, which we call a hot node or hot partition, we need to add more variance to the partition-to-node allocation, and this is done using what are called “Virtual Nodes”.
Instead of placing physical nodes on the ring, we place many virtual instances of them, called virtual nodes (Vnodes). A virtual node represents a contiguous range of tokens owned by a single node, so it is just a smaller slice of a partition, shuffled between nodes. A physical node may be assigned multiple, non-contiguous virtual nodes; if you remember the preceding slide, each node’s range was contiguous, whereas virtual nodes allow for a non-contiguous assignment. The default, for both Cassandra and ScyllaDB, is to split a node into 256 virtual nodes on the token ring. If you look at how the partitions are now distributed among the nodes, you see that there is more variance, which ends up distributing queries better.
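A short sketch of the idea, with made-up node names: give each physical node many random token positions instead of a single one, and the ring alternates between owners instead of being carved into three big contiguous arcs.

```python
import random

random.seed(7)  # deterministic output for the example
NUM_TOKENS = 256  # the Cassandra/ScyllaDB default mentioned above

def place_vnodes(nodes, num_tokens=NUM_TOKENS):
    """Assign `num_tokens` random ring positions (Vnodes) to each node."""
    positions = [
        (random.randint(-2**63, 2**63 - 1), node)
        for node in nodes
        for _ in range(num_tokens)
    ]
    return sorted(positions)  # ring order is token order

ring = place_vnodes(["X", "Y", "Z"])
# Walking the ring now alternates between owners, so no single node
# owns one huge contiguous arc of the token space:
print([node for _, node in ring[:10]])
```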
That is it for Cassandra’s data distribution, but ScyllaDB goes one step further: on each ScyllaDB node, the tokens of a Vnode are further distributed among the CPU cores of the node, which are called shards. This means that the data stored on a ScyllaDB cluster is not only bound to a node, but can be traced down to one of its CPU cores. This is a really interesting architecture and low-level design, and it is the feature that we will leverage later on with the ScyllaDB shard-aware Python driver; I will explain how. Now that we understand how data is stored and distributed on the cluster, let’s see how it is queried by clients.
At the physical layer, a partition is a unit of data stored on a node and is identified by a partition key. You can relate a partition key to a primary key in the CQL world: a partition key is the primary means of looking up the set of rows that comprise a partition, and it serves both to identify the node in the cluster that stores a given partition and to distribute the data across the nodes of the cluster. The partitioner, or partition hash function, applied to the partition key, determines where in the cluster the data is stored. So you take the value of the partition key (in this case, the column named id), you apply the partitioner hash function to it, which by default on both Cassandra and ScyllaDB is MurmurHash3, and this gives you a token. A token is just a number that is placed on the token ring, and from where it lands on the ring you find out which node is responsible for this data. It is as simple as that.
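Here is the hash → token → node chain as a Python sketch. Real clusters hash with MurmurHash3, as mentioned above; to keep the example dependency-free, a truncated MD5 stands in for it here, which is purely illustrative.

```python
import hashlib
from bisect import bisect_left

def toy_token(partition_key):
    """Stand-in partitioner: map a key to a signed 64-bit token.
    Real clusters use MurmurHash3 for this step."""
    digest = hashlib.md5(partition_key).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Same toy ring as before: each node owns the range preceding its token.
ring = sorted([(-6148914691236517206, "X"), (0, "Y"), (6148914691236517205, "Z")])
tokens = [token for token, _ in ring]

def node_for(partition_key):
    token = toy_token(partition_key)
    # The owner is the first node at or after the token, wrapping around.
    return ring[bisect_left(tokens, token) % len(ring)][1]

print(node_for(b"12345"))  # the id column's value is the partition key
```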
Okay, let’s recap. In Cassandra, the hash of the partition key gives you a token telling you which node has the data that you are looking for. We can see this as a shard-per-node architecture, because from the token you get to a node. In ScyllaDB, the same hash of the same partition key gives you the same token, but the token not only tells you which node has the data, it also tells you which CPU core in that node is responsible for handling it: this is called a shard-per-core architecture. So Cassandra is shard-per-node, while ScyllaDB is shard-per-core.
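The token-to-shard mapping is deterministic, which is what a shard-aware driver exploits. Below is a Python sketch of the mapping as described in ScyllaDB’s shard-aware driver write-ups; nr_shards and ignore_msb_bits are per-node parameters a driver learns when it connects (12 being the usual default for murmur3_partitioner_ignore_msb_bits), so treat this as an outline rather than a reference implementation.

```python
MASK_64 = (1 << 64) - 1

def shard_of(token, nr_shards, ignore_msb_bits=12):
    """Map a signed 64-bit token to the CPU core (shard) that owns it."""
    biased = (token + (1 << 63)) & MASK_64          # move into unsigned space
    biased = (biased << ignore_msb_bits) & MASK_64  # discard the top bits
    return (biased * nr_shards) >> 64               # scale into [0, nr_shards)

# The same token always lands on the same core of the owning node,
# so a shard-aware driver can route a query straight to that core.
print(shard_of(-42, nr_shards=8))
```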
*This lesson was written with the help of Alexys Jacob. Alexys, also known to the developer community across social media as @ultrabug, is the CTO at Numberly, a Source-Available contributor, Gentoo Linux developer, and PSF contributing member. The lesson is based on his talk at the EuroPython 2020 conference. Thank you, Alexys!