What are large partitions and large collections? Common mistakes and solutions.
So this brings us to the most exciting
part of our presentation,
which involves things that you should not be doing,
but, believe it or not, we very often see people
stumbling on these problems over and over and over.
Before we move on, I would like to share with you
a GitHub link, a GitHub repository that I created,
which reproduces
each of the problems that we are going to be discussing here.
Okay.
You will find here simple code showing
how to reproduce hot and large partitions,
large collections, large tombstone runs,
and so on.
So you can play around and observe the results
along with the ScyllaDB monitoring stack.
We aren’t going to be running
any of these samples during the rest of the presentation.
Some of these scripts, in particular, may take
a considerable amount of time to load the data.
But if you are interested in
knowing more, just download it and run it on top of ScyllaDB.
Okay, moving on.
Let’s start with large partitions.
So large partitions are essentially partitions
which grow too big in size, up to a point
where they start introducing performance problems across their
replicas.
One of the questions that we often get, and trust me,
I probably hear it shouted out
at least once a month or so, is how large a partition
needs to be in order for it to be considered large.
Well, I’m going to answer the question,
but just so you understand
that large partitions are a recurring topic.
We keep a world record for large partitions,
and this is a post that I have brought here for you,
posted by our CEO Dor Laor.
A user recently broke our previous record of
a 150-gigabyte partition
and wrote a 2.5-terabyte
single partition into ScyllaDB.
Now, 2.5 terabytes or even
150 gigabytes are, of course,
extreme examples, right?
And you should definitely be running away
from these numbers.
But we still have no
definitive answer on how large partitions can or should be.
So how large is actually acceptable?
Well, as with many things in life,
the answer is that it depends.
For example, the larger your payload gets,
the higher the latency tends to be.
Your payload takes more server-side processing time
for serialization and deserialization,
and also incurs a higher network data transmission overhead.
Some use cases' payload sizes may be much larger than others',
in such a way that what would be a large
partition for one is just reality for another.
For example,
I have worked with a Web3 blockchain company
which would store several blockchain transactions
as blobs under a single key,
and every key could easily get past megabytes in size.
Okay, another aspect
to consider is how you will be reading
from these partitions most of the time.
For example, a time-series type of use case
will typically have a timestamp clustering component,
and reading from a specific time window will retrieve
much less data than if you were to scan
the whole partition.
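To make that concrete, here is a minimal sketch of what such a time-series table and a time-window query might look like in CQL. The table and column names are illustrative, not taken from the repository:

CREATE TABLE sensor_readings (
    sensor_id uuid,
    ts        timestamp,
    reading   double,
    PRIMARY KEY (sensor_id, ts)          -- sensor_id: partition key, ts: clustering key
) WITH CLUSTERING ORDER BY (ts DESC);

-- Reads only a slice of the partition instead of scanning it whole:
SELECT ts, reading
  FROM sensor_readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
   AND ts >= '2024-01-01' AND ts < '2024-01-02';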
Okay.
I have put together the following table in order
to illustrate the impact of large partitions
across different payload sizes,
such as one, two and four kilobytes.
Obviously, the larger the payload gets under the same
row count, the larger the partition is going to be.
However, if your use case frequently requires
scanning partitions as a whole, then you will want to be aware
that ScyllaDB cuts off pages at one megabyte to prevent
the system from potentially running out of memory.
Therefore, for a payload size
of four kilobytes and 10,000 rows, for example,
you would need at least 40 pages to be retrieved
in order to scan the partition fully with a single query.
This may not seem like a big deal at first, but over time,
as you scale,
it may impact your overall client-side latency.
Okay.
But there’s more to it when it comes to large partitions.
For example,
if we take a look at the ScyllaDB write path,
we can see that when we write data
to the database, it will essentially be stored
in two places: the “commit log”
and an in-memory data structure called the “memtable”.
The commit log is a
write-ahead log, which we never really read from,
except when there is a server crash or
interruption.
The memtable, as it lives in memory,
will eventually get full, and in order to free up memory
space ScyllaDB has a process known as a “memtable
flush”, which results in an “SSTable”.
And this is essentially how your data gets persisted.
Okay, but you may be wondering by now: “Felipe,
this is too confusing.
What do SSTables have to do with large partitions?”
That’s a very good question.
Well, it happens that SSTables have specific components
that need to be held in memory when your database starts.
This happens in order to ensure that reads are always efficient
and to minimize wasted storage disk I/O
when the database is looking up the data,
such as when you run a simple SELECT query.
When you have extremely large partitions, such as
the 2.5-terabyte example we
have spoken about, these SSTable components
will introduce heavy memory pressure,
therefore shrinking the database's room for caching.
For example, if we take a look
at the monitoring data on the screen, we can see that
there are some shards in our database
not only with an imbalance but with higher memory
utilization when compared to the others.
And eventually one of these shards'
utilization abruptly drops.
That’s the moment when we removed and evicted
a large partition from one of the affected nodes,
resulting in a much better balance, as you can see. Okay.
All right, so enough of large partitions.
Let’s now switch to a different data type
which, guys, almost everyone gets wrong.
Collections: so,
let me go back to our documentation, and
our documentation says that
collections are meant for storing/denormalizing
a relatively small amount of data.
Yet we see people struggling with them
far more often than we would like.
Collections are stored in a single cell which can make
serialization/deserialization extremely expensive.
When you make use of collections, you can define
whether the field in question is frozen or non-frozen.
A frozen collection can only be written as a whole.
You cannot append or remove elements from it.
A non-frozen collection, on the other hand,
can be appended to.
And that’s exactly the type of collection that most people
always get wrong.
I’m sorry.
To make things worse, you can even have what we call
nested collections, such as a map
which contains another map and this one may include a list.
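Just to make this concrete, here is a minimal CQL sketch of the difference, with made-up table and column names (not the exact ones from my examples):

CREATE TABLE user_profile (
    user_id     uuid PRIMARY KEY,
    tags_frozen frozen<map<text, text>>,        -- frozen: can only be overwritten as a whole
    tags        map<text, text>,                -- non-frozen: individual elements can be added or removed
    nested      map<text, frozen<list<text>>>   -- a nested collection: the inner collection must be frozen
);

-- Overwriting the frozen collection (the whole cell is replaced):
UPDATE user_profile SET tags_frozen = {'tier': 'gold'}
 WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Appending to the non-frozen collection (the pattern that bites people as the collection grows):
UPDATE user_profile SET tags = tags + {'tier': 'gold'}
 WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;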
So, collections are
an extremely useful data type, but
when used incorrectly, they will introduce performance
problems much sooner than large partitions do.
Okay.
The answer to the question of “How
large can a collection be?” is actually: not much.
For example, if we create a simple key-value table,
as demonstrated on this slide, where our key is a sensor ID
and our value is a collection of samples recorded over time,
as soon as we start ingesting data,
our performance is going to be suboptimal.
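The schema on that slide looks roughly like this; a sketch with illustrative names, where every new sample is appended into a single non-frozen map:

CREATE TABLE sensor_samples (
    sensor_id uuid PRIMARY KEY,
    samples   map<timestamp, float>    -- non-frozen map holding every sample ever recorded
);

-- Every new sample is appended to the same, ever-growing collection cell:
UPDATE sensor_samples
   SET samples = samples + {'2024-01-01 00:00:00+0000': 23.5}
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;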
The following monitoring snapshot is
exactly what happens when you try to append
several items to a collection all at once.
We can see that
we have a peak in
the throughput on the left-hand
side, and right after the peak
our throughput starts to decrease.
And similarly, as our throughput is decreasing,
our latency is essentially
increasing to terrible numbers, and we even get to a point
where our P99 write latency gets up
above one second, which is just terrible, especially
if you’re using a database such as ScyllaDB,
which is oriented toward performance and low latency.
Cinthia asked: “Do these results apply to databases
beyond ScyllaDB?” Yes, if you try to
append several elements to a collection in Cassandra,
you will run into the very same situation.
Okay,
moving on.
But okay, let’s pause
here and ask: “How come a performance-oriented
database like ScyllaDB gets
a P99
on the order of seconds?” Well, one of the things that I did is
I asked one of our ScyllaDB engineers
about what I was seeing in the monitoring.
And this is the explanation that he gave us: in ScyllaDB,
collection cells are stored in memory as sorted vectors.
When you append an element to a collection,
the act of appending requires
merging two collections:
the old collection, which was already in memory,
and the new one which you are just appending.
This means that when you append an element,
it has a cost which is proportional
to the size of the entire collection.
Which is why
you should always keep your collections small.
However,
he also mentioned that if collections were implemented
on top of trees instead of vectors,
it would improve the performance of that particular operation.
But trees would make small collections less efficient.
So essentially
it’s not an easy engineering problem.
So in general, ensure that you keep collections
small, as small as possible.
Okay.
All right.
If we consider the previous suboptimal data modeling
in the schema that I have demonstrated
a few slides ago,
the solution to it would be
to move the timestamp component into a clustering key,
as we are seeing here, and to transform your map
into a frozen collection.
By making the timestamp a clustering key,
we no longer need to append data to the collection.
With this very simple change
we have greatly improved the performance of the use case.
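Sketching that improved model in CQL, again with illustrative names rather than the exact ones from the slide:

CREATE TABLE sensor_samples_v2 (
    sensor_id uuid,
    ts        timestamp,
    samples   frozen<map<text, float>>,   -- frozen: written once as a whole, never appended to
    PRIMARY KEY (sensor_id, ts)           -- the timestamp is promoted to a clustering key
);

-- Each write inserts a new row instead of growing one huge collection cell:
INSERT INTO sensor_samples_v2 (sensor_id, ts, samples)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2024-01-01 00:00:00+0000', {'temperature': 23.5});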
Kevin asked: “What does ScyllaDB recommend as the max
partition size?” Very good question, Kevin.
I somewhat answered it already, but just for you to understand:
there are theoretical limits,
but we do not impose any limit ourselves.
We just ask you to be cautious of
large partitions.
You have to be very cautious.
And, as I explained a few slides back,
what counts as a large partition will highly depend on your use case,
on your latency expectations, and so on.
Okay. But in general,
yeah, that’s what it would depend on.