Partitioners, tasks, splits, and executors: how do you dimension your Spark cluster properly?
The ScyllaDB (or Cassandra) partitioner takes the partition key of a CQL row and uses it to determine which node in the cluster should host that data.
The partitioner generates a token that maps directly into the token range of the cluster; each node is responsible for a specific section of the full token range.
The capacity required for the Spark cluster depends on the size of the ScyllaDB cluster.
This is a good rule of thumb:
- Allocate 1 Spark CPU core for every 2 ScyllaDB cores;
- Allocate 2GB of RAM for every Spark CPU core.
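As a quick illustration, the rule of thumb above can be turned into a back-of-the-envelope calculation. This is just a sketch; the function name is made up, and the numbers simply encode the two ratios stated above:

```python
def spark_sizing(scylla_cores: int) -> tuple[int, int]:
    """Rule of thumb: 1 Spark CPU core per 2 ScyllaDB cores, 2 GB RAM per Spark core."""
    spark_cores = max(1, scylla_cores // 2)  # 1 Spark core for every 2 ScyllaDB cores
    ram_gb = spark_cores * 2                 # 2 GB of RAM for every Spark core
    return spark_cores, ram_gb

# Example: a 3-node ScyllaDB cluster with 8 cores per node (24 cores total)
print(spark_sizing(24))  # (12, 24): 12 Spark cores and 24 GB of RAM
```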
So once you are connected, I would like to explain a little about how to look at your data and what happens underneath, because even with Spark there is no magic: there are principles common to most NoSQL databases, and you need to obey those principles and work with them.

The first and most important principle is that your data is always partitioned according to a partitioner. Spark in general does not have the same notion of a partition as Cassandra, and that is where the connector helps: it ships with its own Cassandra partitioner. This partitioner takes all your tokens and divides them into token ranges, and each token range has a certain size; that size determines the size of the actual task. So by controlling these parameters you control the partitioner and how big the tasks will be. Once you have these token ranges, the task that owns a token range is the smallest unit of work distributed to a Spark executor, which runs queries scoped to those token ranges. This way you control the mapping of Spark partitions to ScyllaDB partitions, and that is exactly what you want: the same partitions on both sides, with no extra mapping or conversion in between, so that ideally Spark has direct access to ScyllaDB.
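For example, the size of each Spark partition (and therefore each task) is driven by the connector's input split size. Here is a sketch of how that might be set when submitting a job; the host name and jar are placeholders, and the property was spelled `spark.cassandra.input.split.size_in_mb` in older connector versions and `spark.cassandra.input.split.sizeInMB` in newer ones, so check the documentation for your version:

```shell
# Sketch: ask the connector to aim for roughly 64 MB of data per Spark
# partition, so each task covers a smaller slice of the token range.
spark-submit \
  --conf spark.cassandra.connection.host=scylla-node1 \
  --conf spark.cassandra.input.split.sizeInMB=64 \
  my-job.jar
```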
I mentioned that you can control all of this, and again there are parameters for it. Those parameters differ a little between versions, so you will have the joy of reading the documentation. In general it works like this: there is a split count, or a split size, and it determines the number of tasks that will eventually be created out of all the token ranges in your Cassandra or ScyllaDB cluster. The bigger those token ranges are, the longer it takes to process the corresponding task. So keep the trade-off in mind: the more splits you have, the more tasks you get; the tasks will be fairly small and finish quickly, but every task carries overhead. Each executor has to fetch its tasks from the master (the driver), and every task has some preamble and postamble, and that takes time.
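To make the trade-off concrete, here is a small sketch in plain arithmetic. The 3x multiplier is just an illustrative heuristic for "several tasks per core", not a rule from the talk:

```python
import math

def num_tasks(table_size_mb: int, split_size_mb: int) -> int:
    """Smaller splits mean more, shorter tasks (each with scheduling overhead)."""
    return math.ceil(table_size_mb / split_size_mb)

def enough_tasks(tasks: int, executors: int, cores_per_executor: int) -> bool:
    """Heuristic: want several tasks per core so no executor sits idle at the end."""
    return tasks >= 3 * executors * cores_per_executor

t = num_tasks(100_000, 512)      # ~100 GB table with 512 MB splits -> 196 tasks
print(t, enough_tasks(t, 4, 4))  # 196 tasks for 16 cores -> True
```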
Then you need to consider that Spark is a cluster with a certain number of executors. As you can see on the slide, you can configure how many executors you want for the job and how much memory each executor gets, and besides that you can control the parallelism using concurrent reads. Those three parameters determine how fast your application will actually run. Together they give you a certain throughput in tasks per second, and you generally want to be in a shape where all your executors are doing something: no executor sitting idle, and no executor grinding through one very big task while the others finished earlier because their tasks had little data. To avoid that, you want a split count that is fairly big, so you have more tasks; then even if the token range behind a task holds a lot of data, it will still be processed within a reasonable time and you do not have to wait for
much time.

Another important thing to remember from this slide is that in the latest DataStax version
of the driver, there is a way to configure connections per executor, meaning how many connections your executor opens to the actual nodes. The default is one, which is fine for Cassandra, but if you are connected to ScyllaDB it is not so great, because you are not leveraging the parallelism. What is good about the ScyllaDB architecture is that you can really leverage all the CPUs in parallel and avoid context switching between CPUs. The ideal way to avoid that switching is to connect directly to the CPU that owns your data, which is what the shard-aware driver does; with it you do not have to specify connections per executor. With Cassandra you do not have that comfort, so you simply create more connections per executor, and then more CPUs and more ScyllaDB resources can act as coordinators. In a random fashion you may of course get lucky and hit the right CPU directly with your query, but the point is that when you increase this parameter you run your queries against all the CPUs, and that is what you want: you always want to keep the ScyllaDB CPUs busy, not just a single one or a few of them, but ideally all of them.
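As a sketch, the three Spark knobs plus the connections-per-executor setting might look like this on the command line. The property name depends on the connector version: older versions used `spark.cassandra.connection.connections_per_executor_max`, while newer ones split it into `spark.cassandra.connection.localConnectionsPerExecutor` and `spark.cassandra.connection.remoteConnectionsPerExecutor`; the jar name is a placeholder:

```shell
# Sketch: more connections per executor spreads queries across more
# ScyllaDB CPUs (a shard-aware driver makes this tuning unnecessary).
spark-submit \
  --num-executors 4 \
  --executor-memory 8G \
  --conf spark.cassandra.connection.connections_per_executor_max=8 \
  my-job.jar
```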
You also have to keep this in mind, because the distribution of the tasks plays a big role here; I will explain it later, so
now we know how to connect to the Spark cluster, and you know that these input parameters determine your efficiency and parallelism: how fast the job will run and how effectively your resources will be used. But if your Spark cluster is simply too small, you will of course not get this performance; if it is too big, you are paying for resources you are not utilizing. So how big should the cluster be? We have a rule of thumb, and it is a good start; you can of course add more resources, for example if you have more tasks. Basically, for every two ScyllaDB cores in your cluster, use one Spark CPU core. If you do not mind, you can even go one to one, which I think is the ideal case, but start with what I am suggesting. Likewise, it is a good start to allocate 2 GB of memory for every Spark CPU core. Those numbers are the minimum, and
of course, if your partitions are bigger, you will need more. There is always a ratio between how much data your application or database holds and how much memory and how many CPUs the database uses, and the same rule basically applies to Spark: if you want to read all that data from the database, you should ideally have at least the same amount of resources, or at least the minimum I am showing on the slide. Then it will be very easy for you to go and process the data. Of course, if you have data that needs a lot of processing power, for example collections, then yes, you will need more Spark CPUs; if your data contains huge blobs, then yes, you will need more memory. It is always like that: you just need to think about your data and reflect it in the minimal requirements for your Spark cluster.
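Pulling the whole section together, the sizing logic might be sketched like this. Only the 2:1 core ratio and the 2 GB per core come from the rule of thumb above; the doubling factors for CPU-heavy or memory-heavy data are purely illustrative:

```python
def size_spark_cluster(scylla_cores: int,
                       heavy_cpu: bool = False,     # e.g. lots of collections to process
                       heavy_memory: bool = False   # e.g. huge blobs in the rows
                       ) -> dict:
    cores = max(1, scylla_cores // 2)  # rule of thumb: 1 Spark core per 2 ScyllaDB cores
    ram_gb = cores * 2                 # rule of thumb: 2 GB per Spark core
    if heavy_cpu:
        cores *= 2                     # illustrative bump, not a stated rule
    if heavy_memory:
        ram_gb *= 2                    # illustrative bump, not a stated rule
    return {"spark_cores": cores, "ram_gb": ram_gb}

print(size_spark_cluster(24))                     # {'spark_cores': 12, 'ram_gb': 24}
print(size_spark_cluster(24, heavy_memory=True))  # {'spark_cores': 12, 'ram_gb': 48}
```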