How is Workload Prioritization implemented in ScyllaDB vs. other solutions? An explanation of schedulers and controllers, an example of how it works, and some performance information.
So what do we do, traditionally? We just divide and conquer; this is actually the common solution. We either divide in time, which means we run OLAP only during off-peak hours. This has the disadvantage that you cannot anticipate when the OLAP job will actually finish, so it can leak into the mission-critical hours, and you have to take into account that OLTP transactions arriving during this off-peak period will still suffer the high latencies we demonstrated earlier. Or we divide in space, with a multi-DC setup: one DC per workload. That works, but to put it in numbers, the bottom line is that there is waste. The OLTP server wastes resources, because it has to be underutilized in order to deliver low latencies, and the OLAP server only runs periodically, so it uses 100% of its resources for a limited period of time but has to stay on all the time in order to keep in sync. Not to mention the increased maintenance cost and complexity.
So this is why we came up with workload prioritization, which is our approach to solving this problem. Workload prioritization is aimed at minimizing the inter-workload impact you saw in the earlier slides. How do we do that? We actually started by solving a different yet similar problem: back in the day we needed to balance our background and foreground jobs, and this is how the scheduler was created. The concept of a scheduler in ScyllaDB works with shares.
Shares are a measure of ownership of some resource, be it memory, CPU, network, or whatever, relative to the other shareholders. So shares are relative, and the scheduler only aims to maintain the ratio of usage between those shareholders; it kicks in only when there is a conflict. For example, if the ratio here is 1 to 1, it means that for every work unit the blue shareholder gets processed, the red one gets one too, if it needs it of course. Here the ratio is 2 to 1, so for each red work unit the blue one gets two, and this is how it actually works. This slide shows the same 2-to-1 ratio with a different number of shares, just to emphasize that the actual amount doesn't matter; only the ratio matters when you use those shares. So we added the scheduler, the little scheduler that runs around and tries to decide which tasks to take. Here the ratio is 2 to 1, so it goes to the red queue for one task, and for each task it takes from the red queue it will take two tasks from the blue one. Of course, that only works if both of them need to use the resource.
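Just to make the share arithmetic concrete, here is a minimal toy sketch in Python. It is nothing like ScyllaDB's real scheduler, which is part of Seastar and runs per shard in C++, and all names and numbers are made up; it only demonstrates the two properties described above: the ratio of shares, not their absolute values, drives the ordering, and an idle shareholder leaves the whole resource to the others.

```python
class ShareScheduler:
    """Toy shares-based scheduler. Each queue accumulates 'virtual time'
    inversely proportional to its shares; the runnable queue that is
    furthest behind runs next. Only the ratio of shares matters, and a
    queue with few shares still gets the whole resource when it is the
    only one with work to do."""

    def __init__(self, shares):
        self.shares = shares                          # e.g. {"blue": 200, "red": 100}
        self.vtime = {name: 0.0 for name in shares}   # accumulated usage
        self.queues = {name: [] for name in shares}   # pending tasks

    def submit(self, name, task):
        self.queues[name].append(task)

    def next_task(self):
        runnable = [n for n in self.queues if self.queues[n]]
        if not runnable:
            return None
        name = min(runnable, key=lambda n: self.vtime[n])
        self.vtime[name] += 1.0 / self.shares[name]   # charge one work unit
        return self.queues[name].pop(0)


# 200:100 is the same ratio as 2:1, so the absolute numbers don't matter.
sched = ShareScheduler({"blue": 200, "red": 100})
for i in range(4):
    sched.submit("blue", f"blue-{i}")
    sched.submit("red", f"red-{i}")

# Under contention, roughly two blue tasks are processed per red task.
print([sched.next_task() for _ in range(6)])
```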
So, the highlights of the schedulers, most of which we actually use in our solution. First, the shares really are all there is to it: you only need to know about the shares in order to configure them, you don't need to know the inner workings. Second, they only kick in when there is a resource conflict, so even the shareholder with the smallest number of shares will get the whole resource if no one else needs it; the scheduler is geared towards utilization. Third, they opt for ratios: they don't give any specific guarantee of a constant SLA like throughput, they only maintain the ratio of usage. And they are dynamic, which is very important, at least for balancing background and foreground work: you can change the shares in real time and the scheduler will just adjust to the new situation. What it actually does is limit the impact of one shareholder on another, or at least
make it more predictable. Just as a side note, what we ended up doing for the background vs. foreground work was to create another layer on top of those schedulers, called controllers. A controller is just a feedback loop: it samples some statistics and decides whether a background job's shares should be increased, decreased, or left unchanged. A small example is compaction vs. online operations. If disk usage grows too fast, compaction never gets to do its job and we eventually run out of disk space, so the controller increases compaction's shares. This has a double effect: on one side, compaction does more work per time unit, because it now has more shares; on the other side, writes slow down, because compaction now has more shares relative to them.
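As a side illustration of that feedback loop (the real controllers run inside ScyllaDB against its internal statistics; the hooks and thresholds below are purely hypothetical), the idea is roughly:

```python
import time

# Hypothetical hooks, for illustration only; in ScyllaDB the controller samples
# internal metrics and adjusts the shares of the compaction scheduling group.
def get_disk_usage_bytes():
    return 0

def set_compaction_shares(shares):
    pass

def compaction_controller(poll_interval_s=1.0, min_shares=50, max_shares=1000):
    """Toy feedback loop: if disk usage is growing, raise compaction's shares
    (compaction then does more work per time unit, and writes slow down
    relative to it); if usage is shrinking, hand shares back to foreground work."""
    shares = min_shares
    previous = get_disk_usage_bytes()
    while True:
        time.sleep(poll_interval_s)
        current = get_disk_usage_bytes()
        if current > previous:        # usage growing: compaction falling behind
            shares = min(max_shares, shares * 2)
        elif current < previous:      # compaction catching up: back off
            shares = max(min_shares, shares // 2)
        previous = current
        set_compaction_shares(shares)
```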
So, going back to workload prioritization: what does this solution give us over the traditional ones? Better system utilization: a server will not sit idle if there is work it can do. It is super easy to set up, because you only need to know about the shares. And it is dynamically adjustable, which means you can do trial and error: just monitor the system and see whether the shares you've set are good enough for you.
This solution has three building blocks. The first is the schedulers we already discussed. The second is the problem of serialization. Just as an example: when two nodes need to talk to each other, they go through some kind of connection, and if two workloads go through the same connection, even if they are prioritized you can get head-of-line blocking. If OLAP generates a lot of work items that happen to be in flight, then when OLTP gets its query in, it has to wait for the connection to become free before it can send its data, which kind of defeats the purpose. So we had to parallelize all those points in order for workload prioritization to work, and we are still doing it, so it will keep getting better over time.
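A back-of-the-envelope illustration of why this matters (all numbers are made up):

```python
# Toy model of head-of-line blocking on a single in-order connection.
olap_items_in_flight = 1000   # OLAP work items already queued on the connection
send_time_ms = 1              # time to send one item

# Shared connection: the OLTP query waits behind every in-flight OLAP item.
shared_latency_ms = (olap_items_in_flight + 1) * send_time_ms

# Dedicated (parallelized) connection: the OLTP query only pays its own cost.
dedicated_latency_ms = 1 * send_time_ms

print(shared_latency_ms, dedicated_latency_ms)   # 1001 vs 1
```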
The third building block is the ability to classify different operations into different workloads, and we do that using the username you use to set up the connection to our servers.
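Conceptually (this is a sketch, not ScyllaDB's actual configuration interface; the usernames and share values below are made up), the classification boils down to a lookup keyed by the authenticated user of the connection:

```python
# Hypothetical mapping from the connection's username to the shares its
# queries run under; the names and values are made up for illustration.
WORKLOAD_SHARES = {
    "oltp_user": 800,   # latency-sensitive workload, most of the resources
    "olap_user": 200,   # analytics workload, lower priority
}

def classify_connection(username, default_shares=1000):
    """Pick the shares for the work arriving on a newly authenticated connection."""
    return WORKLOAD_SHARES.get(username, default_shares)

print(classify_connection("olap_user"))   # 200
```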