11 Min to complete
This demo shows how to run a simple Spark 3 job to add more information to empty columns (enrich) to a simple existing partially populated ScyllaDB 4.4 table.
The logic is very simple. The app will count letters in the first and last name columns and write the result into a new column if it’s not populated yet.
Prerequisites and Environment Setup:
Read and run the following steps carefully. All of the steps are required. If any step is skipped or fails, the entire app will not run correctly.
The example below works well and has been tested with Fedora 34. For other Debian systems, replace the dnf/yum commands with their appropriate apt alternative.
Make sure you have docker installed. If you don’t follow the steps here: https://docs.docker.com/engine/install/.
If you haven’t done so yet, download the example from git:
git clone https://github.com/scylladb/scylla-code-samples.git
Run the following script, which will prepare and start a ScyllaDB one node cluster with proper port forwarding:
cd spark3-scylla4-demo
./start-scylla-container.sh
Now you should have ScyllaDB 4.4.3 listening on port 9044. Verify this by using:
docker ps
Make sure you have OpenJDK 11 installed in /usr/lib/jvm/java-11. If it’s located elsewhere, fix the paths in all the *.sh scripts in the spark3-scylla4-demo directory.
sudo dnf -y install java-11-openjdk-devel
Get Spark 3.1. You can find more info about the installation here: https://spark.apache.org/downloads.html
wget https://apache.miloslavbrada.cz/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
unzip it locally and symlink spark3 dir to it, e.g.
tar xzvf spark-3.1.2-bin-hadoop3.2.tgz
ln -s spark-3.1.2-bin-hadoop3.2 spark3
If you don’t have cqlsh installed, you can get a local version of cqlsh from scylla cqlsh development repository in 2 easy steps:
sudo dnf -y install git
git clone https://github.com/scylladb/scylla-tools-java.git
Make sure you have python2 or a later version installed.
Check if the ScyllaDB container really runs after the above:
./scylla-tools-java/bin/cqlsh $HOSTNAME 9044
Make sure you have the latest sbt (a build tool) to build scala projects: https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html.
Running the demo:
Populate the database with a sample keyspace and a sample table. Initially the first and last name and the count of letters in those names will only appear for one row. The rest of the rows will be “enriched” later:
./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -f samplespark3.cql
Check the contents of the table after loading the data. Note the null values in the “lettersinname” column:
./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "select * from myOwnKeyspace.sparseTable"
Next, you’re going to build the project.
Start by checking the java version:
java --version
It should be OpenJDK 11.
Next, build:
sbt assembly
Verify that you have the jar built:
ls -la target/scala-2.12/spark3-scylla4-example-assembly-0.1.jar
Now, start spark3:
./start-spark3.sh
The UI should be listening on $HOSTNAME:8080 (or any bigger free port, e.g., 8081)
Submit the app:
./submit_job_spark3.sh
Ideally, close to the end before “SparkUI: Stopped Spark web UI” you will see:
Accumulator: “Changed Row Count” is set to: 4
Accumulator (see spark accumulator) is used to get this metric. That means the job changed a total of four rows, based on the sample of pre-populated data used.
And that means the first run was successful and updated (or enriched) the rows that didn’t have the letter count info.
Now, check the contents of the table again. This time the “lettersinname” column is populated.
./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "select * from myOwnKeyspace.sparseTable"
Try and run the job again after already processing the data:
This time it will not update rows that already have non-null values for the “lettersinname” column, and you will see:
Accumulator: “Changed Row Count” is set to: 0
So the job just validates all rows and figures out there are no rows to change.
Extra commands:
To trash the keyspace easily:
./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "drop keyspace myOwnKeyspace"
To show current rows:
./scylla-tools-java/bin/cqlsh $HOSTNAME 9044 -e "select * from myOwnKeyspace.sparseTable"
If you want to run the commands from the app interactively, use Spark REPL:
./spark3-shell.sh