An overview of data modeling best practices, tips, common mistakes and general guidelines
Let’s get started
by reviewing some of the data modeling guidelines.
So in general, as you have seen,
NoSQL data modeling tends to be very simple.
However, its simplicity can often lead
to several hiccups that may come back
to haunt you, either immediately or later on.
When you are doing data modeling, it is therefore
important to consider edge cases and
to understand and know
your access patterns very well.
Failing to do so may easily
introduce performance issues in your distributed database
later on, such as imbalances
and hotspots.
When doing NoSQL data modeling, you will always want to follow
a query-driven design approach
rather than the traditional entity-relationship model
commonly seen in relational databases.
We think about the queries we need to run first,
and then we move on to the schema.
When you get to the schema part, assuming
you have successfully mapped out your queries,
you will end up with the proper primary key selection.
The primary key determines how your data will get
distributed across the cluster, as Tzach explained
in his last presentation.
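To make this concrete, here is a minimal sketch of query-driven design using the Python cassandra-driver, which also works against ScyllaDB. The keyspace, table, and column names are hypothetical, chosen for an assumed “all readings for one device on one day” access pattern.

```python
# Hypothetical example: the query we need ("all readings for one device on
# one day, ordered by time") drives the table layout and the primary key.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumption: a locally running node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# (device_id, day) is the partition key: it decides which replicas own the
# data. ts is the clustering column: it orders rows within the partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        device_id uuid,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((device_id, day), ts)
    )
""")
```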
At this
point, remember to select a primary key,
specifically a partition key, with high enough cardinality
to ensure
not only that your data gets distributed evenly,
but also that the load and processing
get spread across all the nodes of your cluster.
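As a rough illustration of the cardinality point, compare the hypothetical table above with a low-cardinality alternative; the names are made up for the example.

```python
# Anti-pattern: 'day' alone has very low cardinality (a few hundred distinct
# values per year), so a whole day's worth of writes and reads lands on the
# same few replicas and shards, no matter how many nodes the cluster has.
low_cardinality_schema = """
CREATE TABLE demo.readings_by_day (
    day   date,
    ts    timestamp,
    value double,
    PRIMARY KEY (day, ts)
)
"""
# Contrast this with demo.readings above, whose (device_id, day) partition
# key has enough cardinality to spread both data and load across the cluster.
```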
We then have to avoid bad access patterns,
being very cautious about queries that do not follow
the restrictions defined by our original primary key.
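As a hedged illustration of what “not following the primary key” means in practice, here are two queries against the hypothetical demo.readings table from the earlier sketch.

```python
# Anti-pattern: this query ignores the partition key, so every node and every
# partition may have to be scanned; CQL only accepts it with ALLOW FILTERING,
# which is usually a red flag.
bad_query = """
SELECT * FROM demo.readings
WHERE value > 100
ALLOW FILTERING
"""

# Respecting the primary key: restricting the full partition key (plus,
# optionally, the clustering column) lets the coordinator route the request
# straight to the replicas and shard that own the data.
good_query = """
SELECT * FROM demo.readings
WHERE device_id = ? AND day = ? AND ts >= ?
"""
```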
And believe it or not, there are many people out there
who overlook monitoring
the state of their databases.
I am not going to spend too much time
talking about how monitoring is a great asset
for your database's overall observability, as we hosted
a masterclass last year specifically around that topic.
So please, please
ensure that you install and configure the monitoring stack
available for our distributed database.
Going forward through this presentation,
we are going to see several examples of how
the ScyllaDB monitoring stack can be very handy
for diagnosing some of the mistakes that we are going
to be discussing.
Okay. Failure to
follow data modeling and access
pattern guidelines
will introduce what are known as imbalances
to your cluster, which may eventually introduce
bottlenecks as your workload grows.
From the Scylla monitoring stack we can see that
the panels responsible for generating this graph
are on the coordinator side,
which essentially means queries
as they are sent by the client, your application.
For example, in this slide we can see a three-node cluster
that is seemingly unbalanced:
the green node at the bottom receives less traffic than the other two.
This typically means that either
you have clients configured with an incorrect load
balancing policy, or
you have a hot partition or hot shard, which we are going
to see in the upcoming sections of this presentation.
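As one hedged example of the client-side case: if the driver is left with a naive round-robin policy or pinned to a single contact point, switching to a token-aware, datacenter-aware policy usually restores the balance. The datacenter name and addresses below are assumptions.

```python
# Sketch: a token-aware, DC-aware load balancing policy with the Python
# cassandra-driver, so requests are routed to the replicas that own the data
# and spread evenly across coordinators.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1"))
)
cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
```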
Here we have a similar dashboard,
but now showing the replica side.
The replica side
demonstrates how queries sent from
your application get balanced among the replicas in your cluster.
For example, writes will always be replicated
to the number of nodes
specified by our replication factor,
whereas reads will be sent to the number of replicas
required by our consistency level.
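To make the distinction concrete, here is a small sketch reusing the hypothetical demo keyspace and readings table from the earlier examples: the replication factor is a property of the keyspace, while the consistency level is chosen per request.

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Replication factor: every write to this keyspace is stored on 3 replicas,
# regardless of the consistency level used by the client.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Consistency level: how many of those replicas must acknowledge a write, or
# answer a read, before the request is considered successful.
read_stmt = session.prepare(
    "SELECT * FROM demo.readings WHERE device_id = ? AND day = ?"
)
read_stmt.consistency_level = ConsistencyLevel.QUORUM  # 2 out of 3 replicas
```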
In general, here
we can see the same repeating pattern
of one of the nodes receiving less traffic than the others,
from which we can conclude that there is a problem
with the access patterns from an application perspective,
because this panel is essentially showing us
the replica side of requests.
A lot of people often ask
how evenly balanced your traffic should be.
And while it’s true that for some use cases,
it may be hard
or nearly impossible to achieve a perfect distribution,
this is ideally what you should be striving for.
For example, in this slide we can see queries
being evenly balanced
and distributed across all the nodes of your cluster,
ensuring that your queries and the overall performance
of the cluster
run smoothly.
And finally, we get to the last imbalance type,
related to data distribution.
The monitoring data gathered here demonstrates
that we have a 90-node cluster where three of the nodes
hold much more data than the others.
This is a strong indication
that the cluster in question has large partitions.
Given
that the imbalance affects precisely three nodes
and that we know, well,
I know, because I worked on that specific situation,
that this matches the user's replication factor,
this is yet another indicator of large partitions.
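As a hedged aside, one way to confirm such a suspicion on ScyllaDB is to query the node-local system.large_partitions table, which ScyllaDB populates during compaction when a partition crosses the configured size threshold; this assumes a reasonably recent ScyllaDB version.

```python
# Sketch: listing partitions that ScyllaDB has flagged as large on the node
# we are connected to. The table is node-local, so in practice you would run
# this against each affected node (or rely on the monitoring dashboards).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumption: one of the affected nodes
session = cluster.connect()

for row in session.execute("SELECT * FROM system.large_partitions"):
    print(row)
```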
Ashak asked a question: “How can we avoid hot partitions?”
We are going to be discussing
that in the upcoming slides.
Stay tuned.
So these are just some examples of what imbalances look like.
In general, you should strive to avoid them
in order to prevent performance problems in the long run.
If you decide to ignore this advice,
then you may get to a point where your data may become
entirely unreadable.
And if you still decide
to ignore it and insist on making these mistakes,
creating larger and larger imbalances,
unpredictable problems may arise.