Spark – Effective Big Data and Spark Cluster Size

10 Min to complete

Partitioners, tasks, splits, and executors: how do you size your Spark cluster properly?
The ScyllaDB (or Cassandra) partitioner takes the partition key of a CQL row and uses it to determine which node in the cluster should host that data.
The partitioner generates a Token that directly maps to the TokenRange of the Cluster. Each node in the Cluster is responsible for a specific section of the full TokenRange.
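To make the key-to-token-to-node mapping concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not ScyllaDB's implementation: the hash is a stand-in (a truncated MD5 rather than the real partitioner hash), each node owns a single contiguous slice of the token range (real clusters usually split ownership into many smaller ranges), and the names `token_for`, `build_ring`, and `owner_of` are hypothetical.

```python
import bisect
import hashlib

# Assumed token space: signed 64-bit integers.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(partition_key: str) -> int:
    """Map a partition key to a token.

    Stand-in hash for illustration only: the first 8 bytes of an MD5
    digest, read as a signed 64-bit integer.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def build_ring(nodes):
    """Give each node one contiguous, roughly equal slice of the token range.

    Returns a list of (upper_bound_token, node) pairs sorted by token.
    """
    width = 2**64 // len(nodes)
    ring = []
    for i, node in enumerate(nodes):
        is_last = i == len(nodes) - 1
        upper = MAX_TOKEN if is_last else MIN_TOKEN + (i + 1) * width - 1
        ring.append((upper, node))
    return ring


def owner_of(token: int, ring) -> str:
    """Return the node whose slice contains the token."""
    uppers = [upper for upper, _ in ring]
    # First slice whose upper bound is >= token.
    idx = bisect.bisect_left(uppers, token)
    return ring[idx][1]


if __name__ == "__main__":
    ring = build_ring(["node-a", "node-b", "node-c"])
    for key in ("user:1", "user:2", "user:3"):
        token = token_for(key)
        print(key, token, owner_of(token, ring))
```

The same key always produces the same token, so any client can work out which node holds a row without asking the cluster first.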
The capacity required for the Spark cluster depends on the size of the ScyllaDB cluster.
A good rule of thumb:

  • Allocate 1 Spark CPU core for every 2 ScyllaDB cores;
  • Allocate 2 GB of RAM for every Spark CPU core.
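
The rule of thumb above can be turned into a small calculation. The sketch below is only an illustration: the function name `spark_sizing` and its parameters are hypothetical, and rounding an odd core count up is an assumption that the rule itself does not state.

```python
import math


def spark_sizing(scylla_nodes: int, cores_per_scylla_node: int):
    """Estimate Spark capacity from the size of the ScyllaDB cluster.

    Applies the rule of thumb: 1 Spark core per 2 ScyllaDB cores,
    and 2 GB of RAM per Spark core.
    Returns (spark_cores, spark_ram_gb).
    """
    scylla_cores = scylla_nodes * cores_per_scylla_node
    # Assumption: round up when the ScyllaDB core count is odd.
    spark_cores = math.ceil(scylla_cores / 2)
    spark_ram_gb = spark_cores * 2
    return spark_cores, spark_ram_gb


if __name__ == "__main__":
    cores, ram_gb = spark_sizing(scylla_nodes=3, cores_per_scylla_node=16)
    print(f"Spark cores: {cores}, Spark RAM: {ram_gb} GB")
```

For example, a 3-node ScyllaDB cluster with 16 cores per node has 48 cores in total, which suggests about 24 Spark cores and 48 GB of Spark RAM. Treat the result as a starting point and adjust it after measuring the actual workload.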