In this lesson you’ll learn some best practices related to data modeling, including partition sizing, partition distribution, prepared statements, caching, parallelism, retries, batch operations, and more.
Let’s talk a little bit about data: how you should model it, partition it, and look at it in order to maximize your success. First, something we don’t see often, but every now and then we do see it, and if you take away one thing from this lesson, let it be this: prepare your statements. There’s no excuse not to do it. The only situation in which it’s okay not to prepare a statement is an ad-hoc CQL query that you run once from the shell and never again. But if you have queries that run periodically, it doesn’t matter whether you run them once a second or a hundred times a second or a billion times a second: prepare them. The impact of not preparing is very visible. A lot of times you’ll see a CPU bottleneck at only a hundred queries a second; once you prepare, it improves your latency, your throughput, essentially everything.
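To make that concrete, here is a minimal sketch of why preparing helps. The `FakeServer` class and its method names are hypothetical stand-ins, not a real driver API; they only simulate the per-execution parse cost. With the actual Python driver (which works with ScyllaDB), the equivalent pattern is `prepared = session.prepare(query)` followed by `session.execute(prepared, params)`.

```python
# Sketch: a prepared statement is parsed once and then executed many times
# with different bound values; an unprepared statement pays the parse cost
# on every execution. All names here are hypothetical.

class FakeServer:
    def __init__(self):
        self.parse_count = 0          # how many times we paid the parse cost

    def parse(self, query: str):
        self.parse_count += 1         # expensive: tokenize, validate, plan
        return ("plan-for", query)    # stand-in for a query plan

    def execute_unprepared(self, query: str, params):
        plan = self.parse(query)      # parsed again on EVERY call
        return (plan, params)

    def prepare(self, query: str):
        return self.parse(query)      # parsed exactly once

    def execute_prepared(self, plan, params):
        return (plan, params)         # no parsing, just bind and run

server = FakeServer()
query = "SELECT value FROM t WHERE pk = ? AND ck = ?"

# Unprepared: 1000 executions -> 1000 parses.
for i in range(1000):
    server.execute_unprepared(query, (i, 0))
print(server.parse_count)             # 1000

# Prepared: 1000 executions -> 1 parse.
server.parse_count = 0
plan = server.prepare(query)
for i in range(1000):
    server.execute_prepared(plan, (i, 0))
print(server.parse_count)             # 1
```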
Partition sizing is a question we get a lot: how big can my partitions be? Large partitions are almost not a problem anymore. We inherited some of this concern from Cassandra, where it used to be the case that large partitions were the end of the world. We’ve gotten much better over the years, and as of ScyllaDB 3.1 even repairs can operate at the row level, which means inside the partition, so big partitions aren’t the problem they used to be.
There is still one problem that comes from large partitions, though, and whether it affects you really depends on your query. Consider the two example queries. The first is essentially “out of this partition, give me one row.” We’ve gotten pretty efficient at that; it’s almost a solved problem. There are still some rough edges here and there, but it’s not too much of a concern. The real problem comes if your query looks like the second one: “give me the entire partition.” Why? Because as your partition grows, the time to complete that query grows as well. So if you tell me “I’m doing 2,000 queries a second and there’s this one shard that doesn’t seem to be able to keep up, it bottlenecks,” the answer may be that your query reads the entire partition, which is obviously more expensive than the others. Just pay attention to that.
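A tiny sketch of why the two query shapes behave so differently as the partition grows. The table and query shapes in the comments are hypothetical examples, and the model below just counts rows touched; it is not how ScyllaDB actually reads data.

```python
# Roughly:
#   query 1: SELECT * FROM t WHERE pk = ? AND ck = ?   -- one row
#   query 2: SELECT * FROM t WHERE pk = ?              -- whole partition
# The partition is modeled as a sorted list of rows keyed by clustering key.

import bisect

def rows_touched_single(partition, ck):
    # Index lookup: cost is O(log n) regardless of partition size.
    i = bisect.bisect_left(partition, ck)
    return 1 if i < len(partition) and partition[i] == ck else 0

def rows_touched_full_scan(partition):
    # Reading the whole partition: cost grows linearly with its size.
    return len(partition)

small = list(range(100))
large = list(range(1_000_000))

print(rows_touched_single(small, 42), rows_touched_single(large, 42))  # 1 1
print(rows_touched_full_scan(small), rows_touched_full_scan(large))    # 100 1000000
```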
Large partitions do go wide: our record, I think, is a partition of 130 gigabytes, which is pretty impressive. The customer who had it didn’t even notice at first. Of course, at some point you do notice, and the reason they noticed was exactly that they started running queries like the second one. All of a sudden it’s “hey, why are my queries taking however many seconds to complete?” Well, because your partition is 130 gigabytes. You’ve got to move away from that.
Bad distribution is not a unique ScyllaDB issue. What is bad distribution? It’s essentially when you query some partitions much more than others. In ScyllaDB, though, you tend to see this issue sooner. Why? This is especially relevant for folks coming from Cassandra: Cassandra shards data at the node level, so if you have a hot partition you might not even notice it, because an entire node is handling it. Maybe with a hundred nodes you would notice, but if you started with three nodes and a replication factor of three, it’s effectively impossible to spot a hot-partition issue; you just don’t notice. Again, the problem doesn’t start because you’re using ScyllaDB, but you’re going to see it right away, because we shard at the CPU level, so you’ll see one CPU that is essentially bottlenecking. If you ask me, this is actually a good thing: it shows you the problems in your data model sooner, and you want to find those problems as soon as possible. But pay attention to it. It’s important to have a partition key with very high cardinality, a key with a lot of values, so that they distribute well. And also make sure your traffic spreads across those values: maybe you have a billion partition keys but query just one of them. You’re going to find these issues, and ScyllaDB is going to show them to you very early in the process.
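A small sketch of why per-CPU sharding surfaces a hot partition so quickly: each partition key maps to one shard (one CPU), so if all traffic targets a single key, one shard takes all the load while the others idle. The hash-to-shard mapping below is illustrative, not ScyllaDB’s real token-to-shard algorithm.

```python
# Simulate routing requests to CPU shards by hashing the partition key.

import hashlib

NUM_SHARDS = 8

def shard_of(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def load_per_shard(keys):
    load = [0] * NUM_SHARDS
    for k in keys:
        load[shard_of(k)] += 1
    return load

# Well-distributed traffic: many distinct, high-cardinality keys.
balanced = load_per_shard(f"user-{i}" for i in range(80_000))

# Hot-partition traffic: every request hits the same key.
hot = load_per_shard("user-42" for _ in range(80_000))

print(max(balanced) / min(balanced))  # ~1.0: all CPUs share the work
print(hot)                            # one shard has 80000, the rest have 0
```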
Another hint on how to be successful with ScyllaDB: keep track of our blogs. We’re always getting this kind of information out. We usually find it at customer sites, but we want to make sure the entire community benefits from it, and one recent post came out of exactly one of those experiments. Also, understand how the cache works. You don’t really have to understand the cache to use the database, but if you do, it obviously gives you an edge. One example from that blog that I’d like to highlight: our cache is essentially an LRU, which means that when a row was last touched determines which rows get discarded. And when you’re doing range queries there’s an interesting pattern (I invite you to read the blog): the cache tries to cache the ranges you’ve queried before, and it can cache open-ended ranges, but if you ask for a specific time point, it caches just what you asked for. So if you run a range query whose right bound is always the current time, whatever “now” is in your system, then the next time you ask, “now” is a few microseconds in the future, the cache doesn’t have that last piece of information, and we need to go to storage again to fetch it. If you have patterns like this, try to do a scan without an end bound. For a time-series access pattern it’s better to say “give me all rows starting from this point in time” than to bound both sides, if the right side of your boundary is always moving.
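The cache behavior described above can be sketched with a toy range cache (this is an illustration, not ScyllaDB’s actual cache implementation): a query whose right bound is always “now” produces a slightly different range each time, so its tail is never cached, while an open-ended range is cached once and reused.

```python
# Toy model: the cache remembers the right edge of the range it has read.

import math

class RangeCache:
    def __init__(self):
        self.cached_until = None   # right edge of the cached range (None = empty)
        self.storage_reads = 0

    def read_range(self, start, end):
        # end=math.inf means "no right bound": cache the open-ended range.
        if self.cached_until is None or end > self.cached_until:
            self.storage_reads += 1          # must go to storage for the tail
            self.cached_until = end
        # else: fully served from cache

# Pattern A: WHERE ts >= t0 AND ts <= now  -- the right bound keeps moving.
bounded = RangeCache()
for now in range(100):                       # "now" advances on each query
    bounded.read_range(0, now)
print(bounded.storage_reads)                 # 100: every query misses

# Pattern B: WHERE ts >= t0  -- open-ended, cached after the first read.
unbounded = RangeCache()
for _ in range(100):
    unbounded.read_range(0, math.inf)
print(unbounded.storage_reads)               # 1: first query populates the cache
```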
Lastly, parallelism. Parallelism is important: a lot of times we see people failing to saturate their ScyllaDB boxes simply because they don’t drive enough parallelism. We thrive on it, so throw parallelism at us. However, there is such a thing as too much. Unbounded parallelism is asking for trouble, because at some point the cluster won’t be able to handle it, and when I say unbounded I don’t mean the mathematical concept of infinity. We’ve heard before “we have limited parallelism, we have this thread pool, so at any given time only so many requests can be in flight to the database”: if that bound is huge, it’s essentially infinite. When we say bounded, we mean a limit that actually makes sense for the cluster. There are ways to pick one; we’re about to publish an article very soon, so I won’t go into details, but there are theoretical ways to derive, from your throughput requirements and your latency requirements, what a good concurrency is.
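The article isn’t quoted here, but one standard way to reason about a sensible bound from throughput and latency targets, consistent with what’s described above, is Little’s law (in-flight requests = throughput × latency). The function name and numbers below are illustrative assumptions, not a published formula.

```python
# Sketch: derive a concurrency bound from throughput and latency targets
# using Little's law: L = lambda * W.

def good_concurrency(target_throughput_per_s: float, expected_latency_s: float) -> int:
    # At least one request in flight, otherwise nothing ever runs.
    return max(1, round(target_throughput_per_s * expected_latency_s))

# e.g. 100,000 reads/s at ~2 ms mean latency -> ~200 requests in flight.
print(good_concurrency(100_000, 0.002))   # 200
```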
Also, be careful with retries. A lot of times we see people say: “my requests are timing out, so I retry. My server times out at ten seconds but my client times out at one second, so I just retry again, because maybe it will be successful this time.” It won’t be, because in that situation, by throwing more requests at the cluster, you’re actually increasing the parallelism even further. If your timeout happened just because you were unlucky and hit a network glitch, that’s fine; otherwise you’re just adding to the trouble.
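A minimal sketch of a retry policy that avoids the trap above, under two assumptions: the client-side timeout should not be shorter than the server-side one (otherwise the server is still working on attempt one when attempt two arrives), and retries should be bounded with backoff so timeouts under load don’t multiply the load. The names and numbers are illustrative, not a real driver’s retry policy.

```python
# Bounded retries with exponential backoff and a client timeout that is
# at least as long as the server timeout.

import random
import time

SERVER_TIMEOUT_S = 10.0

def execute_with_retries(op, max_attempts=3, client_timeout_s=SERVER_TIMEOUT_S):
    assert client_timeout_s >= SERVER_TIMEOUT_S, \
        "client timeout shorter than server timeout duplicates in-flight work"
    for attempt in range(max_attempts):
        try:
            return op()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise                      # bounded: give up, don't pile on
            # Exponential backoff with jitter: give the cluster room to drain.
            time.sleep(min(1.0, 0.01 * 2 ** attempt) * random.random())

# A flaky operation that times out twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "ok"

print(execute_with_retries(flaky))   # ok (after 2 timeouts and backoffs)
```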
Finally, batching writes: should I batch? Usually not. It’s fine to batch statements that go to the same partition key, but usually when you batch, the statements go to different keys, and that’s not a good pattern. This is the same as in Cassandra, and it’s simply because the batch can’t be sent to just the node that has the data.
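A sketch of why multi-partition batches are an anti-pattern: each partition key is owned by some replica node, so a batch whose statements share one key lands on one owner, while a batch spanning many keys forces the coordinator to fan out across the cluster. The key-to-node mapping below is illustrative, not the real token ring.

```python
# Map partition keys to owner nodes by hashing, then count how many nodes
# a batch touches.

import hashlib

NODES = 6

def owner_node(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NODES

def nodes_involved(batch_keys):
    return {owner_node(k) for k in batch_keys}

# Good: a batch of 50 writes to the SAME partition -> one owner node.
same_key = nodes_involved("sensor-7" for _ in range(50))
print(len(same_key))     # 1

# Bad: a batch of 50 writes to 50 DIFFERENT partitions -> the coordinator
# must fan out to many nodes.
many_keys = nodes_involved(f"sensor-{i}" for i in range(50))
print(len(many_keys))    # typically most of the 6 nodes
```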