Another item we are constantly asked about is whether or not to collocate. If you look at other systems, they will tell you to put your Spark cluster together with your data cluster, and yes, there are “efficiencies” in that: you manage fewer nodes, spinning the system up and down is simpler, and you don’t have to duplicate your control plane. However, there is a cost to that, definitely with a system like Scylla. To make it work, don’t forget that each of those Spark nodes has executors, and you have to define how many cores on each of your servers go to an executor.
For example, say we have a 16-core server: I can give 8 cores to Scylla and 8 cores to the Spark executors, splitting that node half and half. Is it efficient? Sometimes it is. If you have a constant analytic workload running against that specific cluster, it might be worthwhile to scale up the node and collocate Spark together with Scylla.
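As a rough sketch of what that half-and-half split could look like on the Spark side (the property names are standard Spark settings, but the values and the idea of pinning Scylla to its half with `--smp 8` are only an illustration, not a recommendation from this talk):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative 16-core node split half and half:
// Scylla started with something like `--smp 8` keeps 8 cores,
// and the Spark executors on the node are capped at the other 8.
val spark = SparkSession.builder()
  .appName("collocated-analytics")
  .config("spark.executor.cores", "8") // cores per executor
  .config("spark.cores.max", "8")      // total cores the app may claim (standalone mode)
  .getOrCreate()
```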
However, in our experience the workloads that run constantly are in many cases fairly small, so the benefit of collocating may not be significant. We do recommend separating the Spark cluster from the Scylla cluster: you’ll have less to tune, you can have a leaner Scylla cluster that gives you better performance, and you can tear down or scale your Spark cluster up and down, so the dynamics of actually using Spark become more efficient. Also, Spark is a Java program at the end of the day and Scylla is not, so if you do collocate and have to manage memory there, for example setting up some kind of off-heap memory, you have to take Scylla’s usage of that memory into account to prevent any collisions.
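A minimal sketch of what that memory accounting might look like, assuming a collocated node where Scylla’s share is capped separately (for example with its `--memory` flag); the sizes here are made up for illustration:

```scala
import org.apache.spark.SparkConf

// Hypothetical memory budget on a collocated node: heap plus off-heap
// on the Spark side has to fit alongside whatever Scylla is allowed to use.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")           // executor JVM heap
  .set("spark.memory.offHeap.enabled", "true")  // allow Spark to use off-heap memory
  .set("spark.memory.offHeap.size", "4g")       // off-heap budget, outside the heap
```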
Now for fine-tuning, as I said before. One thing I didn’t mention earlier is parallelism: there is a setting for how many parallel tasks you want each executor to run. The default, if I remember correctly, is 1 out of the box; you can increase it to the number of cores you have on the Scylla side to create more and more parallel connections. Also, reduce the split size from 64 megabytes to 1. There is a setting inside the Cassandra connector for the maximum number of connections to open; the default is derived from the number of executors you have, and we recommend increasing it to the number of cores (or more) that you have on the Scylla side, to open more connections. By the way, without any workload that number is one, so the connector will open a single connection between each of your Spark executors and your Scylla nodes.
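As a sketch of those two connector knobs, using the DataStax Spark Cassandra Connector 2.x property names (names and defaults vary between connector versions, and the values below are only examples):

```scala
import org.apache.spark.SparkConf

// Read-path tuning discussed above (connector 2.x property names).
val conf = new SparkConf()
  // Smaller splits mean more Spark partitions and more parallel scans.
  .set("spark.cassandra.input.split.size_in_mb", "1")
  // Let each executor open more connections, e.g. roughly the number
  // of cores on the Scylla side instead of the calculated default.
  .set("spark.cassandra.connection.connections_per_executor_max", "16")
```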
Next are the concurrent writes, the number of in-flight write batches you can have; the default is 5, and again it might make sense to increase that number if you have a very heavy write load. And there are the concurrent reads for each of those connections; the default there is 512, and again it makes sense to increase it if you need a very high read workload.
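And a sketch of the two concurrency settings just mentioned; the defaults are the ones cited above, and the increased values are only placeholders:

```scala
import org.apache.spark.SparkConf

// Write and read concurrency for the Cassandra/Scylla connector.
val conf = new SparkConf()
  // In-flight write batches (default 5); raise for heavy write loads.
  .set("spark.cassandra.output.concurrent.writes", "10")
  // Concurrent reads (default 512); raise for heavy read workloads.
  .set("spark.cassandra.concurrent.reads", "1024")
```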
To conclude: Scylla does enable you to run analytic workloads, and if you saw Glauber’s presentation this morning, we’re actually going to keep improving these processes more and more, so that the analytic user will be able to benefit from a different path to the data and, as was said, scan those tables in a more efficient way. From our perspective, if you have questions about the most efficient way to deploy your Spark cluster alongside your Scylla cluster, talk to us. We have found several use cases that differ from one another, and you can get better performance with small tunings in the connector or by replacing the Java driver underneath it. Resource management is the key to a performant cluster: time after time we see that if your resource management is correct and you size things correctly on both the Spark side and the Scylla side, you’ll be happy with the results.