An overview of the process of migrating to ScyllaDB. Includes offline migration, online migration, common steps, different strategies and when to use each one.
Felipe already spoke about this: you first have to decide what type of migration you want. I think most people prefer online migration, so we will focus on that and make sure you understand how to do it effectively.
The common steps are always the schema migration and adjustments to the schema, because even between Cassandra and Scylla there can be small differences in how the schema is defined.
And at the end you of course have to validate the data, which is something I'll talk about as well.
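As a rough idea of what a basic validation can look like, here is a sketch that compares row counts between the source and the target over CQL; the hosts, keyspace and table names are placeholders, and a full-scan COUNT is only practical as a spot check on smaller tables.

```python
# A minimal validation sketch: compare row counts per table on both clusters.
# Hosts, keyspace and table names are placeholders; COUNT(*) is a full scan,
# so use a dedicated validation job for large tables.
from cassandra.cluster import Cluster

source = Cluster(["cassandra-host"]).connect("my_keyspace")
target = Cluster(["scylla-host"]).connect("my_keyspace")

for table in ["users", "orders"]:
    src_count = source.execute(f"SELECT COUNT(*) FROM {table}").one()[0]
    dst_count = target.execute(f"SELECT COUNT(*) FROM {table}").one()[0]
    status = "OK" if src_count == dst_count else "MISMATCH"
    print(f"{table}: source={src_count} target={dst_count} {status}")
```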
It's very hard to decide when to use which strategy, so we have a small diagram here to help you. Generally, it boils down to what kind of access you have. If you have full access to your source cluster, then the ideal and fastest way is to take the SSTables and just load them into the target cluster.
However, if you are copying the data from a different kind of source, I don't know, from DynamoDB, or from something that only gives you CQL access, then you have to go a different route: either use the Scylla Migrator and create a Spark cluster, which I would recommend for bigger clusters and big data, or go with CQL COPY or DSBulk, tools that give you a very easy way to migrate small databases.
I don't really have a hard rule of thumb, but the break-even point is maybe around a hundred gigabytes. If it's less than a hundred gigabytes, you can get away with DSBulk or CQL COPY. If it's more than a hundred gigabytes and you only have CQL access, then I would definitely go for a Spark cluster and let Spark migrate the data.
Other hints that can help you decide when to use each strategy: check the size of the replicas, or how much data you want to transfer, and definitely check the schema restrictions. Both John and Felipe already mentioned it, so I'm just going to repeat it: some data types are special and need special handling. For example, DSBulk doesn't support copying collections or counters.
And the reason is very simple: you cannot really inject them using a timestamp, so you will create out-of-order writes, as Felipe explained, because the writes coming from the live traffic might get overwritten by the data you are side-loading.
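To make the timestamp point concrete, here is a sketch of a CQL-level copy that preserves the original write time of a regular column, assuming a hypothetical users(id, name) table and placeholder hosts; collections and counters have no per-cell timestamp you can inject like this, which is exactly why they need special handling.

```python
# A sketch of copying rows while preserving the original write timestamp,
# so that newer writes arriving on the target are not overwritten by the
# historical data being side-loaded. Table and hosts are placeholders.
from cassandra.cluster import Cluster

src = Cluster(["cassandra-host"]).connect("my_keyspace")
dst = Cluster(["scylla-host"]).connect("my_keyspace")

insert = dst.prepare(
    "INSERT INTO users (id, name) VALUES (?, ?) USING TIMESTAMP ?"
)

for row in src.execute("SELECT id, name, WRITETIME(name) AS ts FROM users"):
    # Reusing the source timestamp keeps last-write-wins semantics intact;
    # collections and counters have no such per-cell timestamp to carry over,
    # which is why tools like DSBulk cannot copy them this way.
    dst.execute(insert, (row.id, row.name, row.ts))
```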
Another crucial part is to always think about the complexity of the method, because sometimes you can achieve the same thing with a different method and do it faster. This boils down to thinking about how much data you are shuffling around the cluster. A good rule that I generally use: I come to the cluster, look at it, pick a method, and then think it through. Okay, if I have my data here, I need to copy the SSTables to some temporary storage, then from this temporary storage to a loader, then the loader has to load them, convert them to CQL and write them out over CQL. That means you are transferring and converting your data at least three times.
Then the question is: can it be done simpler? Can it be done in a way where there is just this one Spark cluster, and this Spark cluster reads on the left side and writes on the right side, which means just one pass over the whole database? This will tell you which method you want to use: if I have to copy all the SSTables across three different places just to load them, then it's easier to set up the Spark cluster, pay its price and set up the networking, rather than copy the SSTables around.
But sometimes you don't have this luxury, for example you don't have the option to create the networking between those two clusters, and in that case you have to go through some other means, I don't know, storing the data on S3, then getting it out of the cloud to some other storage and loading it from there.
That's why it is a crucial part to think about how much data, and how many times, you will have to load and move around. And then of course, in the end it's all about the time limits and cost/budget limits, because sometimes you don't have the budget to run a Spark cluster and it's just easier to use existing infrastructure: don't spend money on new infrastructure, but use existing bandwidth, existing loaders or existing storage if you have it.
Sometimes we will get, for example, preconfigured backups of your cluster, and if the target cluster can connect to those backups, then why not? You just go and fetch the data directly from the backup; it saves you time and you don't have to go and build a Spark cluster.
I just want to repeat what Felipe said, so it really sticks. The offline migration is pretty easy in terms of the timeline, how the events happen one after another, and you can do it in a straightforward way, if I put it like that. But at the same time you have downtime, and sometimes the downtime can be small, as Felipe said, in his counter example it was really a matter of a few seconds, but most of the time you want to avoid long downtimes.
The live migration is better from the point of view that it gives you more flexibility, and what I think is most important is that you always have a way to get back. With dual writes, the live migration gives you the flexibility to always go back, and if something happens that you couldn't expect, for example if there is some weird noise outside your house, you can still save the situation and save the day, because you still have the old cluster running, you can troubleshoot whatever problem has happened, and you even have time for the validation.
In the previous case, when the validation took a long time, it came out of your downtime in a sense, but here the validation can go on and you have enough time to decide if you are okay to switch or not.
And basically you can just pull the plug once you’re ready.
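As an illustration of the dual-write idea, here is a minimal application-side sketch that mirrors every write to both clusters during the migration window; the hosts, keyspace, table and error handling policy are placeholder assumptions, not a prescribed implementation.

```python
# A minimal dual-write sketch: the application keeps writing to the old
# cluster (still the source of truth) and mirrors every write to the new
# one. Hosts, keyspace and the statement are placeholders.
from cassandra.cluster import Cluster

old_session = Cluster(["cassandra-host"]).connect("my_keyspace")
new_session = Cluster(["scylla-host"]).connect("my_keyspace")

INSERT = "INSERT INTO users (id, name) VALUES (%s, %s)"

def dual_write(user_id, name):
    old_session.execute(INSERT, (user_id, name))  # old cluster stays authoritative
    try:
        new_session.execute(INSERT, (user_id, name))
    except Exception:
        # A failure on the new cluster must not break the application;
        # log it and rely on the bulk load plus validation to reconcile.
        pass
```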
Of course, the disadvantage is the cost, because you are keeping both clusters running in parallel, and sometimes this might conflict with your budget requirements.