Hopefully you heard from Glauber this morning: analytics is a very, very popular workload that runs on top of Scylla. We see it time after time, people using Scylla in conjunction with an analytics workload, and in most cases it involves Spark. Why Spark? Well, Spark is a very common framework out there in the market that people like to use; it comes out of the Hadoop ecosystem, and I'm going to talk about it a little bit. But first I want to give some background, for people who have never touched either Scylla or Cassandra, on how things work.
If you look at Scylla's token architecture, we basically take the partitions, the actual data, and shard it. The actual sharding happens, as you've heard before, per core. More than that, we take the partition keys and hash them; the hashing happens in order to make sure that we distribute the data across the cluster in a very even way. We use a hash function called murmur3 to make sure that every node gets an equal number of partitions, which prevents hotspots. That's a great architecture from that perspective, because it guarantees that the load on each one of those nodes is going to be equal and even. This is a simplified picture of how things work, because we've omitted the notion of vnodes here; people who come from the Cassandra and Scylla worlds know that there's an additional layer of resharding of the information inside the cluster, and it can affect how the layout looks.
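To make that even-distribution idea concrete, here is a minimal sketch in Scala. It hashes a batch of partition keys and counts how many land on each of a few hypothetical nodes. Two assumptions to flag: Scala's standard-library MurmurHash3 is a different murmur3 variant than the 64-bit one Scylla and Cassandra actually use, and the four-node ring with evenly split token ranges is made up for illustration (no vnodes).

    import scala.util.hashing.MurmurHash3

    object TokenDistribution {
      def main(args: Array[String]): Unit = {
        // Hypothetical 4-node ring: each node owns a quarter of the 32-bit token space.
        // Real Scylla uses 64-bit murmur3 tokens and many vnode ranges per node.
        val nodes = 4
        val rangeSize = (Int.MaxValue.toLong - Int.MinValue.toLong + 1) / nodes

        def ownerOf(partitionKey: String): Int = {
          val token = MurmurHash3.stringHash(partitionKey)         // hash the partition key
          ((token.toLong - Int.MinValue.toLong) / rangeSize).toInt // map token -> node index
        }

        // Hash 100,000 synthetic partition keys and count how many each node owns.
        val counts = (1 to 100000)
          .map(i => ownerOf(s"user-$i"))
          .groupBy(identity)
          .map { case (node, keys) => node -> keys.size }

        counts.toSeq.sorted.foreach { case (node, n) =>
          println(s"node $node owns $n partitions") // counts come out roughly equal
        }
      }
    }

Running it should print four counts close to 25,000 each, which is exactly the hotspot-prevention property the even hash distribution gives the cluster.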
So let's dive in a little bit more and translate that notion of tokens into how entries look in Scylla. I'm going to take a partition key, and it's going to be hashed and pushed onto one node; let's say for now that my replication factor is 1. If I replicate more than that, I will have the same partition on other nodes as well. The one who actually hashes the information is my driver, the actual driver that writes the data: it knows the partition key, it knows the hashing algorithm, so it takes the hashing algorithm, creates the new hash, and pushes the write to the right node. In this example, again, this entry would go to node X.
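Here is a rough sketch of what that token-aware routing looks like, under simplifying assumptions: the node names and ring tokens are invented, there are no vnodes, and replicas are chosen SimpleStrategy-style, i.e. the owner plus the next nodes clockwise on the ring.

    object TokenAwareRouting {
      // Hypothetical ring: each node is identified by the highest token it owns.
      case class Node(name: String, ringToken: Long)

      val ring: Vector[Node] = Vector(
        Node("nodeA", -4611686018427387904L),
        Node("nodeB", 0L),
        Node("nodeC", 4611686018427387904L),
        Node("nodeD", Long.MaxValue)
      )

      // A key's token belongs to the first node whose token is >= the key's token,
      // wrapping around the ring; the other replicas are the RF - 1 nodes after it.
      def replicasFor(token: Long, rf: Int): Seq[Node] = {
        val owner = ring.indexWhere(_.ringToken >= token) match {
          case -1 => 0 // wrap around the ring
          case i  => i
        }
        (0 until rf).map(i => ring((owner + i) % ring.size))
      }

      def main(args: Array[String]): Unit = {
        val token = 123456789L // pretend this came from hashing a partition key
        println(replicasFor(token, rf = 1).map(_.name)) // RF = 1: just the owner
        println(replicasFor(token, rf = 3).map(_.name)) // RF = 3: owner + two more replicas
      }
    }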
Why is it relevant to us? It's relevant because we shard the data in a manner that is different from how Spark does it. There are differences between the hash functions Scylla and Spark use, and the main consequence is in how Spark consumes the data out of Scylla.
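To illustrate that mismatch: Spark's default HashPartitioner places a key by its JVM hashCode modulo the partition count, which has nothing to do with Scylla's murmur3 token ranges. A tiny sketch (the key value and partition count are arbitrary):

    import org.apache.spark.HashPartitioner

    object PartitionerMismatch {
      def main(args: Array[String]): Unit = {
        // Spark side: key.hashCode modulo 8 decides which Spark partition gets the key.
        val sparkPartitioner = new HashPartitioner(8)
        val key = "user-42"
        println(s"Spark partition: ${sparkPartitioner.getPartition(key)}")

        // Scylla side: the same key is placed by its murmur3 token on the token ring,
        // a completely different mapping, so Spark partitions and Scylla partitions
        // do not line up with each other by default.
      }
    }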
So let me explain a little bit for those who have never used Spark before. Spark is, again, a distributed system. It has something called the driver program, which is your “main function”; it takes your functions and distributes them across multiple executors. Each executor has its own tasks and its own caching and memory settings, and is part of this cluster. The memory is organized into something called a resilient distributed dataset (RDD), and those RDDs are stored on each one of those nodes. The RDDs are “the equivalent” of the partitions we're familiar with from Scylla, but there's a big, big difference between them.
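A minimal sketch of that structure, assuming a local Spark setup (the app name, master URL, and data are placeholders):

    import org.apache.spark.sql.SparkSession

    object RddBasics {
      def main(args: Array[String]): Unit = {
        // The driver program: builds the SparkSession and coordinates the executors.
        val spark = SparkSession.builder()
          .appName("rdd-basics")
          .master("local[4]") // 4 local worker threads standing in for executors
          .getOrCreate()
        val sc = spark.sparkContext

        // An RDD split into 8 partitions, spread across the executors.
        val rdd = sc.parallelize(1 to 1000, numSlices = 8)
        println(s"partitions: ${rdd.getNumPartitions}")

        spark.stop()
      }
    }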
One more notion here: Spark is actually a lazy system. That means it will read the data and consume it, but it will never actually do anything on top of the data unless you apply something like a transformation or an aggregation to the information. It records the work to be done, and the actual execution happens just a second before it tries to write the data out to the cluster.
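A short sketch of that laziness (the numbers and operations are placeholders):

    import org.apache.spark.sql.SparkSession

    object LazyEvaluation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lazy-evaluation")
          .master("local[2]")
          .getOrCreate()
        val sc = spark.sparkContext

        val numbers = sc.parallelize(1 to 1000000)

        // Transformations are lazy: nothing has executed yet, Spark has only
        // recorded the lineage of work to perform.
        val doubled = numbers.map(_ * 2)
        val evens   = doubled.filter(_ % 4 == 0)

        // The action is what finally triggers execution of the whole chain.
        println(s"count: ${evens.count()}")

        spark.stop()
      }
    }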
So what does that look like? Let's say this is my Spark cluster, and it has RDDs. It's going to try to read the partition information into the different RDDs; remember, the RDDs are also distributed across the different nodes. So I'm going to have multiple partitions from Scylla written into multiple Spark nodes, the executors
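themselves. A minimal sketch of that read path using the Spark Cassandra Connector, which also speaks to Scylla; the contact point, keyspace, and table names are assumptions:

    import org.apache.spark.sql.SparkSession
    import com.datastax.spark.connector._

    object ReadFromScylla {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("read-from-scylla")
          .master("local[4]")
          .config("spark.cassandra.connection.host", "10.0.0.1") // a Scylla node
          .getOrCreate()
        val sc = spark.sparkContext

        // Each Spark partition reads a slice of Scylla's token ranges, so Scylla
        // partitions end up spread across the executors' RDDs.
        val rows = sc.cassandraTable("my_keyspace", "my_table")
        println(s"rows: ${rows.count()}")

        spark.stop()
      }
    }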
One thing we have to concentrate on here is that when Spark was written, it was written for a Hadoop-style file system, where the actual execution unit sits on top of the data: each worker stores the information that resides on that specific node and reads from it. The idea is to minimize traffic on the network and make things as efficient as possible.