In this example, we will use ScyllaDB Enterprise (for its extra isolation feature).
Three workloads are defined:
| Workload Name | Type | Timeout | Relative resources |
|---------------|------|---------|---------------------|
| OLTP1 | Interactive | 20ms | 60% (600 shares) |
| OLTP2 | Interactive | 2s | 30% (300 shares) |
| OLAP | Batch | 20s | 10% (100 shares) |
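In CQL, these three workloads map to service level definitions roughly like the following sketch; the exact parameter syntax may vary a little between ScyllaDB Enterprise versions:

```cql
-- Sketch: service levels matching the table above (shares are Enterprise-only).
CREATE SERVICE LEVEL OLTP1 WITH shares = 600 AND timeout = 20ms AND workload_type = 'interactive';
CREATE SERVICE LEVEL OLTP2 WITH shares = 300 AND timeout = 2s  AND workload_type = 'interactive';
CREATE SERVICE LEVEL OLAP  WITH shares = 100 AND timeout = 20s AND workload_type = 'batch';
```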
Now I'll show some examples from the CQL terminal. Let's say we have this scenario: two different real-time, or interactive, workloads, which means the requests are generated independently and can arrive with very high concurrency. That's why they have timeout definitions: 20 milliseconds for OLTP1 and 2 seconds for OLTP2. For OLAP we have a large computation that can tolerate 20-second timeouts. We want these workloads to be isolated from each other, or at least relatively isolated, so we defined the number of shares, which more or less implies the ratio between them.
I'll say in advance that you cannot expect this ratio to show up in the latencies, or anything like that, because much of the time the resource conflicts are intermixed. There is no clear-cut computation of which latency a given service level will get. What is guaranteed is that each service level will get this amount of resources if it needs it for its computations.
One more example: if OLTP1 is doing very intensive work, I/O work for example, maybe it will require 70%, and then you'll see the latencies drop.
I'll quickly go over how this looks in the terminal. At first we have only the cassandra role, because we enabled authentication and authorization, and then we create the roles. We named them after the service levels, but that doesn't have to be the case.
The superuser status given here is only to avoid having to grant each of those roles permissions on specific tables; for real use cases it should be done the right way. The roles shouldn't get superuser permissions, and they should be authorized only for the tables they actually need to access.
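For reference, the role setup in the demo looks roughly like this. The role names and passwords below are placeholders, and granting SUPERUSER is only the demo shortcut mentioned above:

```cql
-- Sketch of the demo role setup (names and passwords are placeholders).
-- SUPERUSER here is a demo shortcut; do not do this in production.
CREATE ROLE oltp1 WITH PASSWORD = 'oltp1-pass' AND LOGIN = true AND SUPERUSER = true;
CREATE ROLE oltp2 WITH PASSWORD = 'oltp2-pass' AND LOGIN = true AND SUPERUSER = true;
CREATE ROLE olap  WITH PASSWORD = 'olap-pass'  AND LOGIN = true AND SUPERUSER = true;

-- Production-style alternative: no SUPERUSER, explicit grants on the keyspaces
-- or tables each role actually needs (demo_ks is a hypothetical keyspace).
-- GRANT SELECT ON KEYSPACE demo_ks TO oltp1;
-- GRANT MODIFY ON KEYSPACE demo_ks TO oltp1;
```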
Now, about the service levels: at first we don't have any service levels. There is a command to see which service levels are defined in the system, and as can be seen, this is the Enterprise version, because we have shares at the end of the output. Then we create the service levels. Again, the names don't really matter; they're just for convenience so we can follow along. We defined them with the workload type and the parameters mentioned in the previous slides,
but notice that I also created a service level called "default", which actually exists by default even though you cannot see it, because every workload that doesn't have a service level attached will eventually get it. This service level is defined with 1000 shares, and the reason it is defined with so many shares is that it is the equivalent of our open-source behavior: this sets the ratio between it and all of the other operations.
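As a sketch, the listing command and the explicit "default" service level described above look roughly like this:

```cql
-- See which service levels are defined; in Enterprise the output also shows shares.
LIST ALL SERVICE LEVELS;

-- An explicit "default" service level with 1000 shares, mirroring the implicit one
-- that workloads without an attached service level fall back to.
CREATE SERVICE LEVEL "default" WITH shares = 1000;
```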
Piotr, these 600 shares apply to every node individually, but you cannot have each node define a different number of shares; this is a common configuration for the cluster. Shares only express the relative importance of the workload when sharing the resources in the presence of conflicts.
Then we attach the service levels to the actual roles. Here is a quick query that can be run to see which role is attached to which service level.
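Roughly, the attach step and the lookup look like this (role names as in the sketch above):

```cql
-- Attach each service level to the role that should run under it.
ATTACH SERVICE LEVEL OLTP1 TO oltp1;
ATTACH SERVICE LEVEL OLTP2 TO oltp2;
ATTACH SERVICE LEVEL OLAP  TO olap;

-- Show which service level is attached to which role.
LIST ALL ATTACHED SERVICE LEVELS;
```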
Here we can see that the configuration took effect, and that's basically it.
After all of this, you can start running workloads on your cluster, using the right usernames and passwords of course, and enjoy the isolation.
As far as the demo goes, I ran some workloads in advance and defined some shares, so I want to show you how it looks.
First of all, you can notice that I've put a lot of pressure on the cluster. For production it's not recommended to run at such high utilization as a default; of course Scylla can tolerate peaks, and as you can see it can even tolerate a sustained workload like this, but it's not recommended. By the way, this is a custom dashboard for our monitoring, so you won't have all of these metrics readily available, but I will share the dashboard I've created as part of the course auxiliary materials.
What we can see here is the "default" service level. I don't know if everyone can read it; I'll try to enlarge it a little bit. You can see that the "default" service level naturally doesn't have any operations, because I didn't put any load on it, on purpose. And you can see several different workloads, similar to what we defined in this example.
Eventually, the cluster crunched through 186 kops/s. The way it is configured, each of the OLTP workloads has very high concurrency, 1,600 threads each, and tries to reach 88 kops/s. The OLAP workload doesn't have a capped rate; it just has a lot fewer threads, something like 240, if I'm not mistaken.
And as you can see here, it's a lot jaggier. It's jaggier because what it gets is only the vacant resources: when more resources are free, due to whatever circumstances, it can push in more operations, and when they aren't, it pushes fewer. Of course it tries to push as many as it can, but we throttle it.
You can see quite nicely that the workloads trying to reach 88 kops/s are getting it, and they are quite stable relative to the OLAP one. One more thing to notice is that OLTP2 is also a lot jaggier; maybe I'll try to enlarge it more. OLTP2 is a lot jaggier than OLTP1, which is quite flat, because OLTP1 has the most importance. The peaks we are seeing here are not because this workload is greedy; it's because it's catching up. When there is a resource conflict it gets starved relative to OLTP1, and then, when it gets a chance, it tries to catch up. That is why you are seeing this jaggedness, while OLTP1 remains the most isolated in that sense: it is almost not affected at all by what is happening on the cluster.
One more measure we can look at is the actual latencies. You can see that OLTP1 always, or almost always, gets the lowest latencies. This is a consequence of the shares, but you cannot expect, for example, the same ratio as in the shares. And you can see that OLAP, which gets the fewest shares, gets the worst latencies. These are different measures for different latency percentiles; the one we are used to looking at is the 99th percentile, because in Scylla we aim in general to lower the 99th percentile. What we can see here is that the workloads at least keep their relative importance, and again, the more important the workload, the less jaggy and more predictable its latency.
The same goes for reads. One more thing to see is how the cluster divides the shares.
This is just a measure from our schedulers, the I/O and CPU schedulers, showing that OLAP, for example, gets 100 shares, OLTP2 gets 300, and OLTP1 gets 600. By the way, the reason this is shown per instance relates a little to the question from earlier: it's an asynchronous system, so each instance finds out about changes in the shares at a different point in time, and this is a way to track it.
The last thing I would like to show you is the cluster configuration. I would also answer... Okay, Eliran, sorry for the interruption here. We have about one minute, so let's wrap up, okay? Okay.
In that case, I will just do it real quick. I elevated OLAP to the same number of shares as OLTP2. We probably won't see the full effect if we only have one minute, but we can see that Scylla adjusts the number of shares. I will show it.
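The change I applied is essentially a shares update on the OLAP service level, roughly:

```cql
-- Raise OLAP to the same number of shares as OLTP2; nodes pick up the change
-- asynchronously, which is the ramp visible on the dashboard.
ALTER SERVICE LEVEL OLAP WITH shares = 300;
```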
In a few seconds you will see a jump in the shares, and I'll try to refresh the dashboard real quick.
Here it is.
You can see that at some point some of the nodes start to give 300 shares to OLAP, and eventually all of them will get to 300 shares. There it is. What will happen next, I'll just say, is that you will probably see the throughput of OLAP get a little higher. And we can already see that OLTP2, because it is now equally important, is starting to dive: the throughput it can push is no longer 88 kops/s, because they are now competing for resources.
I will be happy to answer more questions later, and I can also share the actual link to this monitoring if someone wants to view it; I will probably close it about an hour after the presentation or so. So there it is.