Using Spark with Scylla

Whether you run on-premises hardware or cloud-based infrastructure, Scylla offers performance, scalability, and durability for your data.

In Scylla, data is stored in a wide-column, table-like format that is efficient for transactional workloads, and in many cases we see Scylla used for OLTP.

But what about analytics workloads?

By using Spark together with Scylla, users can run analytics workloads on the data already stored in their transactional system.
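
To make this concrete, here is a minimal sketch of reading a Scylla table into a Spark DataFrame. It uses the Spark Cassandra Connector, which speaks CQL and therefore also works with Scylla; the contact point, keyspace, and table names are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ScyllaReadExample {
  def main(args: Array[String]): Unit = {
    // Point the Spark Cassandra Connector at a Scylla node (hypothetical address).
    val spark = SparkSession.builder()
      .appName("scylla-analytics-example")
      .config("spark.cassandra.connection.host", "10.0.0.1")
      .getOrCreate()

    // Load a (hypothetical) keyspace/table as a DataFrame through the connector.
    val purchases = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "catalog", "table" -> "purchases"))
      .load()

    // Run an analytical aggregation over the transactional data.
    purchases.groupBy("customer_id")
      .count()
      .show()

    spark.stop()
  }
}
```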

This lesson will cover:

  • An overview of Scylla, Spark, and how they can work together; Scylla and analytics workloads
  • Scylla token architecture, data distribution, hashing, and nodes
  • Spark intro: the driver program, RDDs, and data distribution 
  • Considerations for writing and reading data using Spark and Scylla
  • What happens when writing data, and what are the different configurable variables? (see the sketch after this list)
  • How is data read from Scylla using Spark?
  • Should Spark be colocated with Scylla?
  • What are some best practices and considerations for configuring Spark to work with Scylla? 
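
As a rough illustration of the write path and a few of the configurable variables mentioned above, the sketch below writes a small DataFrame to a Scylla table and sets some connector tuning properties. The host, keyspace, table, schema, and values are assumptions for illustration, and property names can vary between connector versions.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ScyllaWriteTuningExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scylla-write-tuning-example")
      .config("spark.cassandra.connection.host", "10.0.0.1")       // hypothetical Scylla contact point
      // Example tuning knobs (names/values are illustrative and version-dependent):
      .config("spark.cassandra.output.concurrent.writes", "5")     // write batches executed in parallel per task
      .config("spark.cassandra.output.batch.size.rows", "auto")    // rows grouped into a single batch
      .config("spark.cassandra.input.split.sizeInMB", "64")        // target size of each Spark partition on read
      .getOrCreate()

    import spark.implicits._

    // A small example DataFrame; assumes a matching table already exists in Scylla.
    val purchases = Seq(
      ("c1", "sku-1", 19.99),
      ("c2", "sku-2", 5.49)
    ).toDF("customer_id", "sku", "price")

    // Append the rows to the Scylla table through the connector.
    purchases.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "catalog", "table" -> "purchases"))
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```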

You can read more about Hooking up Spark and Scylla in this four-part blog series. This blog post covers a real-world use case. The documentation demonstrates a simple Scylla-Spark integration example.