Join us at ScyllaDB University Live, instructor-led, online, hands-on, training sessions | September 24
Register now

ScyllaDB and Spark Lab

11 Min to complete

This demo shows how to run a simple Spark 3 job to add more information to empty columns (enrich) to a simple existing partially populated ScyllaDB 4.4 table.
The logic is very simple. The app will count letters in the first and last name columns and write the result into a new column if it’s not populated yet.

Prerequisites and Environment Setup:

Read and run the following steps carefully. All of the steps are required. If any step is skipped or fails, the entire app will not run correctly.
The example below works well and has been tested with Fedora 34. For other Debian systems, replace the dnf/yum commands with their appropriate apt alternative.
Make sure you have docker installed. If you don’t follow the steps here: https://docs.docker.com/engine/install/.

If you haven’t done so yet, download the example from git:
git clone https://github.com/scylladb/scylla-code-samples.git

Run the following script, which will prepare and start a ScyllaDB one node cluster with proper port forwarding:

cd spark3-scylla4-demo
./start-scylla-container.sh

Now you should have ScyllaDB 4.4.3 listening on port 9044. Verify this by using:

docker ps

Make sure you have OpenJDK 11 installed in /usr/lib/jvm/java-11. If it’s located elsewhere, fix the paths in all the *.sh scripts in the spark3-scylla4-demo directory.

sudo dnf -y install java-11-openjdk-devel

Get Spark 3.1. You can find more info about the installation here: https://spark.apache.org/downloads.html

wget https://apache.miloslavbrada.cz/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

unzip it locally and symlink spark3 dir to it, e.g.

tar xzvf spark-3.1.2-bin-hadoop3.2.tgz
ln -s spark-3.1.2-bin-hadoop3.2 spark3

If you don’t have cqlsh installed, you can get a local version of cqlsh from scylla cqlsh development repository in 2 easy steps:

sudo dnf -y install git
git clone https://github.com/scylladb/scylla-tools-java.git

Make sure you have python2 or a later version installed.

Check if the ScyllaDB container really runs after the above:

./scylla-tools-java/bin/cqlsh $HOSTNAME 9044

Make sure you have the latest sbt (a build tool) to build scala projects: https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html.

Running the demo:

Populate the database with a sample keyspace and a sample table. Initially the first and last name and the count of letters in those names will only appear for one row. The rest of the rows will be “enriched” later:

./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -f samplespark3.cql

Check the contents of the table after loading the data. Note the null values in the “lettersinname” column:

./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "select * from myOwnKeyspace.sparseTable"

Next, you’re going to build the project.
Start by checking the java version:

java --version

It should be OpenJDK 11.
Next, build:

sbt assembly

Verify that you have the jar built:

ls -la target/scala-2.12/spark3-scylla4-example-assembly-0.1.jar

Now, start spark3:

./start-spark3.sh

The UI should be listening on $HOSTNAME:8080 (or any bigger free port, e.g., 8081)
Submit the app:

./submit_job_spark3.sh

Ideally, close to the end before “SparkUI: Stopped Spark web UI” you will see:
Accumulator: “Changed Row Count” is set to: 4

Accumulator (see spark accumulator) is used to get this metric. That means the job changed a total of four rows,  based on the sample of pre-populated data used.

And that means the first run was successful and updated (or enriched) the rows that didn’t have the letter count info.

Now, check the contents of the table again. This time the “lettersinname” column is populated.

./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "select * from myOwnKeyspace.sparseTable"

Try and run the job again after already processing the data:

This time it will not update rows that already have non-null values for the “lettersinname” column, and you will see:

Accumulator: “Changed Row Count” is set to: 0

So the job just validates all rows and figures out there are no rows to change.

Extra commands:

To trash the keyspace easily:

./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "drop keyspace myOwnKeyspace"

To show current rows:

./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "select * from myOwnKeyspace.sparseTable"

If you want to run the commands from the app interactively, use Spark REPL:

./spark3-shell.sh

 

fa-angle-up