What are some key metrics to check when processing data using Spark? Also links with more info and examples.
It’s highly recommended to set up ScyllaDB Monitoring and use its dashboards and advisors. Some things to watch are the load, the load per shard, latencies (big partitions?), and coordination (switch to a shard-aware driver?).
Also, use the Spark executors and stages views to understand how well tasks are distributed, and enable the Spark history server.
So now we have a well-managed cluster, we know how to connect to it, we know how to tune the number of tasks, how big the tasks should be, and how they should be processed. But how do we know that our application runs well? I think the key part in data processing is the metrics. The first thing is that you should always set up ScyllaDB Monitoring; there is a dedicated session about it, I think later today, so go and check out ScyllaDB Monitoring and play with its dashboards. There are various advisors there that will show you how your application behaves and whether it is doing anything unexpected.
You can start watching the load, and you can watch some I/O metrics, for example. The tricky part here is what I mentioned before: you always want to keep all your CPUs busy. How do you do that? A common mistake is that people run their application without configuring it well, or with some older connector that doesn’t set up the partitions properly. You then end up in the rather funny situation where all your executors start targeting the same or a similar token range that lives on the same node, so they overload, say, three nodes out of your whole ScyllaDB cluster, or only three shards across all the nodes are busy. This happens simply because the tasks were not shuffled at the beginning and are processed in order, and if they are in order there is a very high probability that they keep hitting the same partition or the same token range, and hitting the same token range means hitting the same shard. You make that one shard busy while all the other shards sit idle, and you are bottlenecked on that shard.
In ScyllaDB Monitoring you will then see the load per shard: it will be very high, and only a few shards will be working. That is not what you want; you want to see all the shards working. Go and fix your application: validate that you are setting up your tasks properly, that you have enough executors and enough parallelism, and of course a shuffle service so you can take advantage of them. Otherwise you will be bottlenecked on a few CPUs, and that is definitely not efficient.
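As a rough illustration of those knobs, here is a minimal sketch assuming the DataStax Spark Cassandra Connector; the keyspace and table names are placeholders and the sizing numbers are only examples to adapt to your cluster.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("scylla-scan")
  // Enough executors and cores so that many token ranges are read at once,
  // spreading the load across all ScyllaDB shards instead of a few.
  .config("spark.executor.instances", "6")
  .config("spark.executor.cores", "4")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events")) // placeholders
  .load()

// If the input arrives ordered by token range, repartitioning shuffles the
// work so concurrent tasks do not all target neighbouring ranges that live
// on the same node or shard.
val shuffled = df.repartition(spark.sparkContext.defaultParallelism)
```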
The other important thing is to watch the latencies. For example, you may have some bigger partitions, or partitions with more data, and you will see that in the latency panels of the monitoring. At the same time you will see it in the Spark executors and Spark master views as well, because a few tasks will be stuck while all the other tasks are already done. That means you need to split those token ranges into smaller parts and add more parallelism and more executors, so that no single split takes that long to get done.
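With the Spark Cassandra Connector, one way to get smaller token-range splits is to lower the input split size; a hedged sketch, with the value chosen purely as an example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("smaller-splits")
  // Smaller input splits => more, shorter token-range tasks, so one heavy
  // range no longer dominates the runtime (the default differs per
  // connector version; 32 MB here is just an example).
  .config("spark.cassandra.input.split.sizeInMB", "32")
  .getOrCreate()
```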
The other important part, when you use the DataStax driver, is the coordination, because sometimes the coordination between the CPUs can also take time. This is likely to be a problem if you are running an analytics job, so lots of select queries or lots of joins, for example. In such a case it might be wise to try your application with the shard-aware driver. It is an in-place replacement: you just get rid of the DataStax driver, comment it out in your dependencies, and use the shard-aware driver instead. Try it with that, and I think you will see that the latencies are better, because there is no coordination between CPUs, so you save CPU work.
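If you build with sbt, the swap might look roughly like this; the group/artifact IDs and versions below are from memory, so double-check them against the ScyllaDB driver documentation.

```scala
// build.sbt sketch: replace the DataStax Java driver pulled in transitively
// by the connector with ScyllaDB's shard-aware fork, which keeps the same
// packages and API. Coordinates and versions are illustrative only.
libraryDependencies ++= Seq(
  ("com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0")
    .exclude("com.datastax.oss", "java-driver-core"),
  "com.scylladb" % "java-driver-core" % "4.14.1.0"
)
```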
And just so I’m not only praising ScyllaDB Monitoring: Spark itself, when you start it, has a UI, and lots of people don’t know about that. Spark also has very good logging; it can eventually be tuned to give you more details, but I’ve found the default logging is good enough for most things. In the UI you can very easily see the executors, you can very easily see the stages, you can see how the tasks are being processed and how many are still pending, and you will be able to understand how well the whole engine works. You will see whether a task is stuck, and you will see whether a task fails and gets retried, which likely means there is a timeout somewhere, either on the ScyllaDB side or on the processing side, and that timeout makes the task processing fail. You need to go and understand that, and you can do it from the logs and the Spark UI very easily.
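To keep that UI available after a job finishes, you can turn on event logging and point the Spark history server at the same directory; a minimal sketch, where the HDFS path is just a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Event logs let the history server rebuild the executors/stages views
// after the application has exited.
val spark = SparkSession.builder()
  .appName("scylla-job")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-events")
  .getOrCreate()

// The history server itself is started separately (sbin/start-history-server.sh)
// with spark.history.fs.logDirectory pointing at the same directory.
```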
So, I know we only have about an hour, and I’m quite sure I could go deeper; we could spend the whole afternoon here, but that’s not the point. I’ve prepared a set of links that you should go and have a look at; they will give you a better understanding of the whole infrastructure, how it’s connected, and what it actually means when a certain task does something. You should always start with the first link, which is the reference documentation, and look at how things are implemented. From my experience, Scala is a very implicit language: lots of things that people assume happen underneath don’t necessarily happen that way. Even with Scala you can go and observe it, and you can start debugging it. You can even debug the Spark application itself; it’s not a big deal, you just add JDWP debugging and run it (see the sketch after this paragraph), and then you can step through even the implicit behaviour of Scala. In most cases it works well, but if you don’t understand a function and use it with an inappropriate data set, it can take double the time compared to a different approach, and the fix might be a simple one-line change in the Scala code. So go read the documentation; I think it will help you if you understand it well.
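A hedged sketch of what adding JDWP debugging can look like: enabling the debug agent on the executors via a standard Spark setting. The port is just a conventional choice, and the driver-side equivalent has to be passed at submit time (e.g. via --conf), because the driver JVM is already running by the time this code executes.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("debuggable-job")
  // Each executor JVM listens for a remote debugger on port 5005;
  // suspend=n so the executors keep working even before one attaches.
  .config("spark.executor.extraJavaOptions",
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
  .getOrCreate()
```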
Lots of parameters in Spark get configured from various sources. Not many people know this, but in the end the Spark configuration is merged from the command-line options, from the application’s own options, and from global options in Spark itself; if you are using a managed Spark cluster, there will be configuration options set there as well.
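Per the Spark documentation, values set on the SparkConf in the application take precedence over spark-submit flags, which in turn take precedence over spark-defaults.conf. A small sketch of checking what actually took effect:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("config-precedence")
  // This value overrides whatever --conf or spark-defaults.conf say.
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()

// Print the merged configuration the application is really running with.
spark.conf.getAll.foreach { case (k, v) => println(s"$k = $v") }
```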
Go and understand those configuration parameters; they will help you. I saw cases where, for example, there was a shared Spark cluster running on Hadoop, the other applications were killing the nodes, and then the application itself had lots of failed tasks. You want to avoid that; you want to be somewhat dynamic, and you can set up various blacklisting or dynamic executor provisioning to scale your pool of executors up and down as needed (see the sketch below).
Those things are quite important to understand. At the same time, Spark is very knobby: sometimes you need to turn and play with lots of knobs until you get the configuration right, and unfortunately I don’t have really good advice for that; just go and try it yourself, see how it behaves, do some analysis of that behaviour, and think about how to tune the configuration to make it faster. Generally it is about what I said: the number of tasks, how many executors there are, and how big the executors are. Those settings will most likely get your application working well.
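A sketch of the "be dynamic" settings mentioned above, for a shared YARN/Hadoop cluster. The exact flag names vary by Spark version: the blacklist settings shown here were renamed to spark.excludeOnFailure.* in Spark 3.1+.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shared-cluster-job")
  // Scale the executor fleet up and down with the workload.
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  // Dynamic allocation on YARN needs the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  // Stop scheduling tasks on nodes that keep failing them.
  .config("spark.blacklist.enabled", "true")
  .getOrCreate()
```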
We also have some good materials about the ScyllaDB Migrator. The ScyllaDB Migrator is basically a proof of concept for a full table scan; I will talk about it a little later, but if you want to deep dive into it, there are two blog posts there, and I think they are very good reading. They will give you some insight into how to approach jobs that really need to read all the data from a ScyllaDB database. Then we also have very good material, again in blogs, but there are also materials in ScyllaDB University. The blogs talk about the older version, the 2.4 version of the Datasets and DataFrames, so be sure to check them out; they go very deep and explain how the Datasets are created, how the token ranges are approached, how the queries are built, and all these things are very well described there.