What is Apache Spark and Why use it? What is the Spark Connector and how to use it to connect to a ScyllaDB cluster?
Spark is a unified analytics engine for large-scale data processing. It has built-in support for divide-and-conquer processing and lets you write applications quickly in Java, Scala, Python, R, and SQL. It is well tested with ScyllaDB. The Spark Connector transparently translates Spark data units to the NoSQL data model; Cassandra and ScyllaDB use the same approach.
Let’s start with Spark. So what is Spark? Apache Spark is an analytics engine designed specifically to help you process large-scale data. The software usually runs not on a single machine but on a cluster, and in that cluster there is always a master. The master knows about all the workers and keeps track of the relationships between them. Using this master you can start your application from a so-called driver. The driver can be a separate node, but it doesn’t have to be; it can be the same node as the master, or you can run it from any other node in the cluster, because it essentially just spawns the application through the master onto the workers.

This topology supports optimal divide and conquer: Spark takes your data, divides it into a number of tasks, and delegates those tasks to the respective executors that run on the workers. The executors do the actual processing of your data. At the end there is an application phase where the data can be returned to the master for post-processing, if your job does something like that; if not, most of the execution really happens on the executor nodes.
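To make that concrete, here is a minimal sketch (not from the talk; the master URL and application name are placeholder assumptions) of a driver program that attaches to a Spark master and lets the executors do the work:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the driver attaches to the cluster master; the work it submits is
// split into tasks that run on the executors. Master URL and app name are
// placeholders, not values from the talk.
val spark = SparkSession.builder()
  .appName("my-scylla-job")
  .master("spark://spark-master:7077")
  .getOrCreate()

// This count is computed in parallel on the executors and returned to the driver.
println(spark.range(0, 1000000L).count())

spark.stop()
```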
Apache Spark itself is written mostly in Scala and runs on the JVM, and it has very good support for, I would even say, any JVM-supported language. The most commonly used languages that people write applications in are Java, Scala, and Python, and more recently there are R and SQL as well.
Apache Spark is well tested with ScyllaDB; we have lots of users and customers that use Spark with ScyllaDB. And why do they use it? That’s a big question: why would you use Spark? The common use cases are, for example: you have a huge database and you need to do some housekeeping, and you can run that housekeeping job very well on Spark. You want to run some queries and analytics on top of your data; Spark is a perfect engine for that, because it can do the work quickly and in parallel and aggregate the data for you, so you don’t need to write any special framework code, everything is already there, and if you leverage Scala and functional programming it is a great place to do it. If you want to do table joins, you can do those on Spark as well, and here I would suggest a bit more memory on the executors. And of course, if you want to do some data enrichment, transformations, or ETL, you can write that on Spark too. Basically, any task or job that you can divide into small pieces, that’s what Spark is for.
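As a small illustration of the analytics, join, and enrichment use cases, here is a hedged sketch in Scala; the `orders` and `customers` data are made up purely for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("use-case-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data; in practice these would come from ScyllaDB tables.
val orders    = Seq((1, 10.0), (1, 5.0), (2, 7.5)).toDF("customer_id", "amount")
val customers = Seq((1, "Ada"), (2, "Bob")).toDF("customer_id", "name")

// Aggregation runs in parallel on the executors.
val revenue = orders.groupBy("customer_id").agg(sum("amount").as("total"))

// Join / enrichment; for large joins, give the executors more memory.
revenue.join(customers, Seq("customer_id")).show()
```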
Spark itself, however, doesn’t know how to connect to ScyllaDB, so you have to teach it that, and for Spark to connect to ScyllaDB you use a thing called the Spark connector. ScyllaDB is a Cassandra-compatible database, so basically we use the same approach, and the Datastax driver works well; it’s a very good start, so feel free to use it. We also have a thing in progress called the ScyllaDB shard-aware driver. It’s not officially out yet; it will be available at the link I mentioned on the slide. The big difference between the Datastax version and our version is shard awareness: it changes how the data is accessed, and that is what we are trying to optimize. With shard awareness, your application, at the query routing level, knows exactly which CPU on which node it has to go to for the data, while with the Datastax version the application only knows which node or which replica it has to go to. With ScyllaDB it goes all the way down to the CPU, and that will save you some latency.
Let’s have a look at how you can connect to ScyllaDB. What works well, and what we have tested very well, is a Spark cluster, ideally on version 2.4. Now I can hear all of you saying: okay, but the latest versions are 3.0 and 3.1. Yes, we support 3.0 and 3.1 as well, but those versions are quite recent and there is not much mileage on them yet, so if you want something stable, go for a Spark 2.4 cluster. With respect to that, when you work with Spark 2.4 you should always check the correct versions; for example, I wouldn’t recommend using Spark connector 2.2 with a Spark 2.4 cluster. There is always an alignment between the Spark version, the connector version, the JDK version, and, underneath, a certain driver version as well. You should pay attention to that, because it makes a big difference: you don’t want components that are not ready to work with, say, the latest Spark, and if you use newer components you of course don’t want to run an old Spark; you need to align all these components properly. All of this compatibility is documented at the link I have on the slide, so you can go and check the version compatibility against your versions. Be sure to do that, because it will save you lots of time and you will avoid lots of trouble.
What is good about the Datastax builds is that they are published in Maven, so you can easily include them in your application’s dependencies.
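For example, with sbt the dependency can be declared roughly like this; the version shown is only illustrative, so check the compatibility matrix for the one matching your Spark and JDK versions:

```scala
// build.sbt (sketch): the connector is published on Maven Central.
// The version below is illustrative; align it with your Spark/JDK versions.
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.5.2"
```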
Unfortunately, the shard-aware drivers are not there yet. Once they are put into the official ScyllaDB repo, you will be able to consume them from the official Maven sources as well, but for now you unfortunately have to build them manually. There are some versions available; I also pointed to a link there, which points to my fork that has the shard awareness. You can take it, build the jar from it, use that jar locally and manually, and see whether it helps you or not. As I said, it’s perfectly fine to start with the Datastax drivers, and then, if for example your application is very read-heavy and you want very good latency for reads, it will make sense to move to the shard-aware driver, and you should certainly try it. It will be worth the hassle of having to build the jar yourself.
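If you do build the jar yourself, one way to try it out (a sketch; the path is hypothetical) is to hand the local assembly to Spark via the standard `spark.jars` setting so both the driver and the executors pick it up:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: point spark.jars at the locally built shard-aware connector assembly.
// The path below is a placeholder; use wherever your build put the jar.
val spark = SparkSession.builder()
  .appName("shard-aware-test")
  .config("spark.jars", "/path/to/spark-cassandra-connector-assembly.jar")
  .getOrCreate()
```

Passing the jar with `--jars` on `spark-submit` achieves the same thing.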
Once you have fulfilled those prerequisites, so basically all your versions are aligned, you can go to the next step, which is configuring all the correct input parameters and having the correct imports in your application. These two steps are also very important. There is a plethora, I think maybe even more than 50, of input parameters you can specify, and for the older 2.4 connector version you will likely want to check them all and figure out whether you want to use them or whether the default is okay for your use case, and then you just go and set them like this.
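For instance, a minimal sketch of that style of configuration might look like the following; the host, credentials, and values are placeholders, not the ones from the slides:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch of the 2.4-style configuration: plain spark.cassandra.* keys.
// Host, credentials, and values are placeholders.
val conf = new SparkConf()
  .setAppName("scylla-job")
  .set("spark.cassandra.connection.host", "scylla-node-1")
  .set("spark.cassandra.connection.port", "9042")
  .set("spark.cassandra.auth.username", "scylla_user")
  .set("spark.cassandra.auth.password", "secret")
  .set("spark.cassandra.output.concurrent.writes", "10")

val spark = SparkSession.builder().config(conf).getOrCreate()
```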
In the newer versions, for example Spark 3.0 and 3.1, there is already a notion of catalogs and newer dataset APIs, and there the configuration doesn’t have to be as complicated. You still need to specify those configuration parameters, but in a different way than what you see on this slide. That’s why I said it’s very important which version you pick, because your mileage and your experience will vary; it won’t be a big difference, but there will be a variation in how you use the Spark connector.
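To give a feel for that newer style, here is a rough sketch of the Spark 3.x catalog registration; the catalog name and contact point are placeholders, and the full set of options still comes from the parameter reference:

```scala
// Sketch: register a Cassandra/ScyllaDB catalog once, then refer to tables
// as scylla.<keyspace>.<table>. Host and catalog name are placeholders.
spark.conf.set("spark.sql.catalog.scylla",
  "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.conf.set("spark.sql.catalog.scylla.spark.cassandra.connection.host", "scylla-node-1")
```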
Once you have the Spark connector you, of course, connect to a Spark master, and then you can connect to your table and use any DataFrame or Dataset operation on top of the Spark context, or you can even go and directly create a session; you can see that later in the examples.
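A minimal sketch of that flow (the host, keyspace, and table names are placeholders made up for the example) could look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("scylla-read")
  .config("spark.cassandra.connection.host", "scylla-node-1") // placeholder contact point
  .getOrCreate()

// Load a ScyllaDB table through the connector's data source...
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // hypothetical names
  .load()

// ...and use ordinary DataFrame/Dataset operations on it.
df.printSchema()
df.show(10)
```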
I didn’t want to put all the example code into the slides, because it just doesn’t make sense to write example code there; I think a very good reference and a very good starting point is the documentation in the repository itself. For example, if you want to go for a 2.4 or 2.5 connector, you can go to the first link on this slide, where there is a reference of all the parameters: the connection parameters, read parameters, and write parameters. You can go through them all and see whether it makes sense to change them. The examples of how to connect are also very well documented, and you should read them for the appropriate version: if you work with Spark 3.0, look at the Spark 3.0 connector; if you work with Spark 3.1, look at the Spark 3.1 connector. Those documentation pages will definitely differ, so make sure you are always looking at the correct version. At the second link, the quick start, there are follow-up documents, and you can just click through them at the end of every single file and go to the next one, or list the documentation directory and go from 0.1 to 3… until the last one.
They will explain and show you how to leverage all the appropriate ways to connect to a ScyllaDB cluster. The example I showed you on the previous slide is something that works well and is well defined, but as I said, with newer Spark versions it can be more convenient.
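As a hedged illustration of that more convenient style (catalog, keyspace, and table names are placeholders, reusing the catalog registration sketched earlier), once the catalog is registered you can query tables directly with Spark SQL:

```scala
// Assumes the "scylla" catalog was registered as in the earlier sketch.
spark.sql("SELECT * FROM scylla.my_ks.my_table").show(10)
```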