Covers CQL-based tools used for migration, including cqlsh COPY, DSBulk, loading from CSV dumps, the ScyllaDB Spark Migrator, Lambda functions, and sstableloader.
Let’s talk about CQL-based tools.
The simplest migration tool is one you already have built into your solution,
whether it’s Cassandra or Scylla, because it’s “cqlsh”,
and cqlsh has the special commands “COPY TO” and “COPY FROM”.
But they don’t preserve timestamps.
So you can dump your database and then load it,
but you won’t be able to preserve the timestamps,
and once you start dual-writes you can easily end up
with out-of-order updates overwriting newer data.
You can avoid that, but only by being careful in your client
not to overwrite existing data, and implementing such a thing
basically means implementing compare-and-set,
and you don’t want to go there, because it would be very slow.
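Just to make that concrete, here is a minimal sketch of what such a compare-and-set guard could look like, assuming the Python driver (cassandra-driver) and a hypothetical table my_ks.users; every write becomes a lightweight transaction, which is exactly why it gets slow:

    from cassandra.cluster import Cluster

    # Hypothetical target: keyspace my_ks with a table users(id int PRIMARY KEY, name text).
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_ks")

    # IF NOT EXISTS turns every insert into a lightweight transaction (a Paxos round),
    # so the loader never overwrites a row the live application already wrote.
    insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?) IF NOT EXISTS")

    def load_row(row_id, name):
        result = session.execute(insert, (row_id, name))
        # The first column of an LWT result is the [applied] flag;
        # False means the row already existed and was left untouched.
        return result.one()[0]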
Then there is DSBulk.
DSBulk is a tool from DataStax, and if some people remember
Brian Hess’s old cassandra-loader tool, this is basically a rewrite of it,
a Java version of COPY TO/FROM, but it’s improved a lot
and it gives you an option to preserve timestamps and TTLs.
I don’t think that works for the special types or for collections, though,
so be careful there too. It’s also not suited for every use case,
just for the smaller ones, so as I said, stick to small data sizes with this one,
because with bigger ones it might not be fast enough.
I tried to write down what I just said:
the slide lists the pros and cons of these approaches, so have a look at them.
Those dumps, or loading from a dump,
can actually be useful if you’re coming from a different database.
Some people ask about MySQL and PostgreSQL and databases like that,
and for those, if you don’t want to write your own DataFrame support
for the Spark migrator, it’s usually easier to write something like that yourself
and just load the data from CSV.
But as I said, be very, very careful with the timestamps
and the special data types, because you might have problems with them.
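If you go that route, one thing that helps with the timestamp problem is to carry the original modification time in the dump and apply it explicitly. Here is a minimal sketch, assuming the Python driver, a hypothetical CSV with an updated_at_micros column, and a hypothetical target table my_ks.users:

    import csv
    from cassandra.cluster import Cluster

    # Hypothetical target: my_ks.users(id int PRIMARY KEY, name text);
    # hypothetical dump columns: id, name, updated_at_micros (microseconds since the epoch).
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_ks")

    # Binding USING TIMESTAMP keeps the original modification time, so later
    # dual-writes from the live application still win on conflict as expected.
    insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?) USING TIMESTAMP ?")

    with open("users_dump.csv", newline="") as f:
        for row in csv.DictReader(f):
            session.execute(insert, (int(row["id"]), row["name"], int(row["updated_at_micros"])))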
The Scylla Spark Migrator is actually quite interesting too,
and the reason is that it works on the CQL level,
so you have to have both the source and the target cluster
reachable from the same place, or a route between them,
and it also supports resumes.
So if you run the migration and something happens,
you don’t have to start from scratch,
you can just restart from a resume point, but it’s harder to set up.
I put some links there, you can look at them,
and there is a demo where you can see how hard it is
to set up a Spark cluster on your laptop.
And I think if you have a decent enough laptop
with 16 gigs of memory, then you should be
able to do this demo and play around.
What I’ve also seen is that people write Lambda functions,
and that is certainly a very nice option.
It basically works by converting whatever the source database gives you
into CQL queries, and I’ve seen people do this with Lambdas from DynamoDB, for example.
They were migrating from DynamoDB and they just started replicating
the same queries from Dynamo to Scylla with a Lambda function.
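As an illustration only, here is a minimal sketch of such a function, assuming a DynamoDB Streams trigger with a NEW_IMAGE view, a hypothetical target table my_ks.users, and the Python driver packaged with the Lambda:

    from cassandra.cluster import Cluster

    # Hypothetical target: my_ks.users(id text PRIMARY KEY, name text).
    # The session is created outside the handler so it is reused across invocations.
    cluster = Cluster(["scylla.internal.example.com"])
    session = cluster.connect("my_ks")

    insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
    delete = session.prepare("DELETE FROM users WHERE id = ?")

    def handler(event, context):
        # DynamoDB Streams deliver records as event["Records"], with typed
        # attributes like {"S": "value"} inside Keys and NewImage.
        for record in event["Records"]:
            key = record["dynamodb"]["Keys"]["id"]["S"]
            if record["eventName"] == "REMOVE":
                session.execute(delete, (key,))
            else:  # INSERT or MODIFY
                image = record["dynamodb"]["NewImage"]
                session.execute(insert, (key, image["name"]["S"]))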
And the last one I should mention,
though I don’t think anyone should use it because there are better methods,
is sstableloader. It basically converts SSTables back into CQL.
So if you remember my rule about how many times you want to push the data,
this is, I think, the worst option, because it just takes the SSTables
and converts them into queries, so you’re basically doubling
the copying of the data, copying every replica from the source.