An overview of data modeling terms like Cluster, Keyspace, Table, Partition, Row, Column, and Primary Key.
We’re going to
start with an overview of data modeling.
So I’ll start with the very basics,
relational data modeling versus NoSQL data modeling.
A lot of people have some background in relational data
modeling, and this is a very high level overview.
But typically the way we create our data model in relational databases
is that we look at our data: at the entities
and the relationships between those entities.
We create an entity diagram.
From that, we derive an actual model and a schema.
And after that we think about
our actual application and the queries
that derive from our schema.
So that’s the way that it’s typically done in a relational database. In NoSQL,
we have to get used to a thought process that’s a bit different.
So at the very beginning we have to think not just about our data
and the entities, but also about our application.
And that’s important to understand.
We think about our application and our queries, including
the required performance, consistency levels and availability,
at the very beginning of the data modeling process.
Okay, so we think about the application,
then afterwards or in parallel, we think about our data
just like we did before: what are the entities
and what are the relationships between those entities.
And from that we derive our model.
Okay, so keep that in mind if you’re coming from
a relational database and you’re used to data modeling with relational
databases.
Okay.
Some terms that I’m going to be using throughout this talk,
I just want to make sure we define them and that we’re on the same page.
So a cluster is a collection of nodes.
A node is the ScyllaDB software running on a
server, on a machine.
In production, a cluster can have anywhere from three nodes
all the way to hundreds of nodes.
ScyllaDB is a distributed database that works on multiple nodes.
In a cluster you can define one or more keyspaces.
One of the important
attributes of the keyspace is
the replication factor, which I talked about in my previous talk.
The replication factor, abbreviated as RF, determines
how many copies we’re going to hold of each piece of data.
Many clusters in production have only one keyspace,
but sometimes, for different reasons, we might define multiple keyspaces.
One reason can be that we want to have tables with different replication factors.
Still, it’s quite common
to see clusters that have just one keyspace.
A keyspace holds one or more tables, typically a few tables,
and each table
defines the columns that are in that table.
It also defines the primary key,
which I will talk about in just a bit.
And a table consists of partitions,
rows and columns, and a row basically includes
key-value pairs, where each pair is a column name and its value.
And I’ll show this in a diagram in just a bit.
Hopefully that will make it clearer.
Okay, so a keyspace. Here
we have an example of creating a keyspace, similar to what we saw before.
This is a keyspace called “Pets_Clinic” and it simply defines
a replication factor of three.
So every table within this keyspace is going to inherit the replication factor
defined in the keyspace, and all the tables in this keyspace
are going to replicate their data to three
different nodes.
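
As a rough sketch, a keyspace definition like the one described could look as follows in CQL. The replication strategy class is an assumption, since the talk only mentions the replication factor of three. The name is written in lowercase because unquoted CQL identifiers are case-insensitive.

```cql
-- Sketch only: the strategy class below is assumed, the talk only states RF = 3.
-- With a replication factor of 3, each piece of data is stored on three nodes.
CREATE KEYSPACE pets_clinic
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'replication_factor': 3
  };
```

Every table created inside pets_clinic would then inherit this replication factor.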
A table is defined within the context of the keyspace,
and in this simple example, we see a table called “heartrate_v1”,
and that defines three different columns: “pet_chip_id”, “time” and “heart_rate”.
For each column it also defines the type, “integer”, “uuid” and so on,
and it defines the primary key.
The primary key is one column or more,
and in this case it’s pet_chip_id.
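
A table definition along these lines might look like the following sketch. The column names and the primary key come from the talk; the exact types chosen for “time” and “heart_rate” are assumptions, since the talk only names “integer” and “uuid”.

```cql
-- Sketch only: the types for time and heart_rate are assumed.
-- The primary key here is the single column pet_chip_id.
CREATE TABLE pets_clinic.heartrate_v1 (
    pet_chip_id uuid,
    time        timestamp,
    heart_rate  int,
    PRIMARY KEY (pet_chip_id)
);
```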
Okay, so what would this look like
on the node?
So we defined the table to have three columns: the pet_chip_id,
which is the primary key, also called the partition key, in this case;
a column for time and a column for heart_rate.
And you can see that we have three partitions
or three rows, in this case.
If you remember from my previous talk, the partition key determines
how Scylla partitions the data according to the consistent hash function,
and then it knows which replica node is responsible for a given row.
Okay, so we can see that we have the partition key, and the partition key
is connected to the other columns,
“time” and “heart_rate”, and each one has a value.
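
To make this concrete, here is a small sketch that writes three rows like the ones on the slide and reads them back. It assumes the heartrate_v1 table exists in the pets_clinic keyspace with uuid, timestamp and int columns; all the values are made up for illustration.

```cql
-- Hypothetical sample data: three different pet_chip_id values,
-- so three partitions with one row each.
INSERT INTO pets_clinic.heartrate_v1 (pet_chip_id, time, heart_rate)
  VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:00', 100);
INSERT INTO pets_clinic.heartrate_v1 (pet_chip_id, time, heart_rate)
  VALUES (123e4567-e89b-12d3-a456-426655440b24, '2019-03-04 07:01:10', 95);
INSERT INTO pets_clinic.heartrate_v1 (pet_chip_id, time, heart_rate)
  VALUES (123e4567-e89b-12d3-a456-426655440b25, '2019-03-04 07:01:20', 110);

-- Read one partition by its partition key.
SELECT pet_chip_id, time, heart_rate
  FROM pets_clinic.heartrate_v1
  WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23;
```

Because pet_chip_id is the whole primary key, writing again with the same pet_chip_id would overwrite the existing row rather than add a new one.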
And this is a slide from my previous talk.
For those that missed it, the way it works is that when Scylla
receives a read or a write request, it
applies a consistent hash function to the partition key,
“pet_chip_id” in our case, and according to the
hash that it gets, it knows which replica nodes
are responsible for this specific row.
And if the replication factor is three, then it means that
there are going to be three copies of this data
and ScyllaDB would know
which nodes are responsible for the data according to the partition key.
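
One way to look at this hash from the client side is the built-in CQL token() function, which returns the partitioner's hash of the partition key. The query below is a sketch that assumes the heartrate_v1 table from this talk exists in a keyspace named pets_clinic and already holds some rows.

```cql
-- token(pet_chip_id) is the hash value used to decide
-- which replica nodes own each partition.
SELECT pet_chip_id, token(pet_chip_id), time, heart_rate
  FROM pets_clinic.heartrate_v1;
```

Rows with the same pet_chip_id always produce the same token, which is why they always land on the same set of replica nodes.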