An overview of the ScyllaDB Migrator, which migrates data using Spark. Covers basic considerations, settings and configuration.
All right.
So let’s talk about ScyllaDB Migrator.
So the migrator is essentially an open source tool which runs on top of Apache
Spark and can be used to migrate from both Apache Cassandra
and DynamoDB as a source to ScyllaDB Cloud.
One particular feature of the migrator is that it's highly resilient to failures,
because as it is running its migration job, it records savepoints
which it can later restore to resume transfers that failed for whatever reason.
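To make that concrete, here's a minimal sketch of the savepoints stanza in the migrator's config file. The path and interval are placeholder values chosen for illustration; check the example config in the repository for the exact keys your version supports.

```yaml
# Sketch of savepoint settings (placeholder values).
# The migrator periodically records which token ranges it has already
# transferred, so a restarted job can skip them and resume where it left off.
savepoints:
  path: /tmp/savepoints   # directory where savepoint files are written
  intervalSeconds: 300    # how often a new savepoint is recorded
```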
The migrator also performs extremely well and can easily parallelize its work,
given that it's built on top of Apache Spark: you simply need to add more
Apache Spark workers should you need to process faster.
With that said, be very careful about
how many workers you are going to use to run your migrator,
especially if you are going to run an online migration, because most people
want to migrate as fast as possible and they want to do so in an online way.
And then you go and you add, I don't know, 20 workers to your Apache
Spark deployment, and then you tell the migrator to use all those 20 workers.
What typically happens is that you add so many workers
that when they start hitting your source database,
you eventually start facing performance problems in your application.
So migration is typically about finding the right balance
between application performance and migration speed.
So be very careful
about how many workers are going to run your migration.
But you can, theoretically
speaking, add as many workers as you want;
as long as your application can tolerate
the overhead of another process doing a full table scan,
you should be good to go.
So on top of these features, the migrator can also preserve
the "writetime" and "TTL" attributes of your columns, which is particularly useful
if you are migrating from Apache Cassandra to ScyllaDB.
And we are going to discuss that a little bit later.
And of course the migrator can also rename columns,
which is very useful if, for example,
you want to make data modeling changes as part of your migration.
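As a quick illustration, a column rename is expressed in the migrator's config file roughly like the sketch below; the column names here are made up for the example.

```yaml
# Sketch of a column rename (illustrative column names).
# Data read from "user_id" in the source lands in "account_id" in the target.
renames:
  - from: user_id     # column name in the source table
    to: account_id    # column name in the target table
```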
Alright.
Alissio is asking – it's not
very related to the talk here, but is there any cost
comparison between DynamoDB and ScyllaDB Cloud?
And I'm going to ask you to send Alissio
our benchmark versus DynamoDB;
there is a price-performance versus latency
comparison in it, okay?
Let’s move on and continue talking about the migrator.
So the migrator currently requires Spark 2.4 and JDK 8
in order to be built, and yes, we know that's
slightly outdated. We do have a branch
to support Spark 3.1.
But it's not GA yet, although since it's open source,
you can grab the link at the bottom and play around with it,
and you can even submit patches
if something is not working for you or if you would like to improve something.
Alright.
Regardless, for this session we will be using Spark 2.4.
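For reference, launching the migrator on a Spark 2.4 cluster typically looks something like the sketch below; the master hostname and file paths are placeholders, so double-check the exact invocation against the README in the repository.

```bash
# Sketch of submitting the migrator job to a Spark cluster
# (hostname and paths are placeholders).
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://<spark-master-host>:7077 \
  --conf spark.scylla.config=/path/to/config.yaml \
  /path/to/scylla-migrator-assembly.jar
```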
So, the migrator also supports many tunable parameters,
as you can see here on the screen.
This is just a small fraction of all the parameters we support.
For example, we have the "splitCount", which is a very important one
and essentially tells the migrator
how many tasks it should split the migration job into.
And why is this parameter important?
It's important because it makes sense
to reduce this value for very small tables, such as those below one gigabyte, for example.
But as your table grows,
such as beyond 100 gigabytes, you may want,
or you may need, to increase this value to parallelize the work further.
All right.
And you can also control the number of connections
which the migrator is going to use to connect to the source and target
databases, the page size it uses when reading,
and, to answer an earlier question,
you can also control the consistency level.
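Roughly, those knobs live in the source section of the config file. The sketch below uses illustrative values, and the exact key names may vary between migrator versions, so verify them against the example config in the repository.

```yaml
# Sketch of the tunables discussed above (illustrative values).
source:
  splitCount: 256                 # how many tasks the migration is split into
  connections: 8                  # connections opened against the source
  fetchSize: 1000                 # page size used when reading
  consistencyLevel: LOCAL_QUORUM  # consistency level for reads
```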
All right.
So that's an example of a configuration file for the Scylla migrator.
It's essentially a very simple .yaml file
where you are going to specify the source database
and then the target database,
and you are going to specify how to connect to it,
your credentials, the keyspace (if that concept exists)
and the table you want to migrate, the "splitCount" which we spoke about,
among other options.
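To sketch the overall shape of that file for a Cassandra-to-ScyllaDB migration: all hosts, credentials and names below are placeholders, and keys may differ slightly between versions, so treat the example config in the repository as the reference.

```yaml
# Sketch of a config.yaml (placeholder hosts, credentials and names).
source:
  type: cassandra
  host: cassandra.example.com
  port: 9042
  credentials:
    username: <user>
    password: <pass>
  keyspace: my_keyspace      # the keyspace, where that concept exists
  table: my_table            # the table to migrate
  splitCount: 256            # as discussed above
target:
  type: scylla
  host: scylla.example.com
  port: 9042
  credentials:
    username: <user>
    password: <pass>
  keyspace: my_keyspace
  table: my_table
renames: []                  # optional column renames, as shown earlier
```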
All right.
So the overall idea of these slides is just for us to see
how simple and easy it is for you to get started.
And if you go to our GitHub, you will see that
there is an example .yaml file which you can refer to,
with all the possible options that exist.
And of course, since the migrator runs on top of Apache Spark,
you can definitely point your browser to the Spark
UI (this is a screenshot I took)
in order to view your job's progress, and you can check your workers'
standard output, the standard logs and so on.
So that's essentially how you get some sort of visibility.
It's pretty much bundled in Spark.