The basic strategies for the migration process. Covers streaming, dual writes, out-of-order writes, load-and-stream, and what happens with TTL’d data.
Now that we understand the basics
and all the considerations
we need to make during the migration process, let’s
analyze how some companies have actually migrated their data.
From now on, all the scenarios that we are going to discuss
here are real-world migrations that may be relevant
to some of you, and I hope they are.
So the first migration that we are going to
study is a large media streaming company that decided to switch
from DynamoDB to ScyllaDB in order to reduce costs.
One notable aspect of their use case is that they had an ingestion process
which would overwrite their entire dataset on a daily basis.
As a result, there was no requirement
to actually forklift their data from one database to another.
All they did was configure their ingestion job
to write to ScyllaDB in addition to DynamoDB.
And as soon as that job kicked in, data was
stored in both databases.
Another good aspect of that migration
is the fact that DynamoDB’s data modeling
and ScyllaDB’s data modeling are very similar,
which greatly simplified the data modeling process.
Unlike, for example, when people switch
from a document store or a relational database to NoSQL,
which is when you have to sit down
and start mapping out how things are actually going to work
and how you will translate your queries to satisfy your application needs.
And as I mentioned earlier, most of the time when we migrate
from one technology to another, there will always be
some changes that we need to make.
After all, even though the data modeling
in DynamoDB and ScyllaDB is similar,
they are not the same database, and
their features and internal workings
are very different from one another.
Some of the concerns
that this media streaming
company had during their migration were related
to how ScyllaDB handled out-of-order writes,
how they would implement record versioning,
and how efficient data compression would be.
These were all valid concerns, which is why the main
lesson I want you to take from this migration
specifically is the need for you to understand the differences
between your source and your target databases.
As we explore these differences,
you may eventually stumble upon room for improvement,
which is exactly what happened here.
The use case in question was very
susceptible to out of order writes.
And before we actually
understand how we addressed this, let’s understand
what an out-of-order write means.
So let’s forget for a second
the fact that the migration was from DynamoDB
to ScyllaDB and
switch our focus to Cassandra.
So one approach that we use to migrate
from Cassandra or ScyllaDB to another ScyllaDB
cluster is to perform what we call dual-writes.
It means that the client writes
to both data stores at the same time,
just like you’re seeing in this slide over here.
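The dual-write idea can be sketched in a few lines. This is a toy model, not a real driver integration: plain dicts stand in for the two database sessions, and the key name is made up for illustration.

```python
import time

class DualWriter:
    """Writes every record to both the source and the target store.

    In a real migration the two stores would be Cassandra/ScyllaDB
    sessions; plain dicts stand in for them here.
    """

    def __init__(self, source, target):
        self.source = source
        self.target = target

    def write(self, key, value):
        # Use a single timestamp for both writes so that later
        # conflict resolution treats them as the same logical update.
        ts = time.time_ns()
        self.source[key] = (value, ts)
        self.target[key] = (value, ts)

source, target = {}, {}
writer = DualWriter(source, target)
writer.write("user:1", {"name": "Ana"})

# Both stores now hold the same value for the key.
assert source["user:1"][0] == target["user:1"][0] == {"name": "Ana"}
```

In practice the two writes would go over the network and could fail independently, which is exactly why a backfill step (like the Spark job discussed next) is still needed.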
Then what comes next is that we
plug in one of our migration tools,
such as the ScyllaDB Migrator,
to start reading data from the source database
and writing it to the destination one.
And this essentially brings us to a concern:
consider a situation in which the Spark job at the bottom
reads some data from the source database
and next, in a very short period of time,
the client at the top writes an update to that same data.
The client writes the data to the target database
and the Spark job writes it after.
My question to you guys is, which record wins?
The write by Spark or the write by the client?
Does anyone want to guess the answer in the chat?
All right, let’s move on.
So the example I just gave you is essentially an example
of what an out of order write stands for.
And essentially in the example
I gave before,
the data written by the client at the top
would have been the one correctly persisted,
because the CQL protocol, the one which ScyllaDB uses,
allows one to manipulate the timestamp of your writes.
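The “which write wins” question comes down to last-write-wins conflict resolution: the cell with the highest write timestamp persists, regardless of the order in which the writes arrive. Here is a minimal sketch of that rule, using a toy store rather than the real ScyllaDB internals:

```python
class LwwStore:
    """Toy last-write-wins store: a cell is only overwritten when the
    incoming write carries a newer timestamp, regardless of the order
    in which the writes arrive."""

    def __init__(self):
        self.cells = {}  # key -> (value, timestamp)

    def write(self, key, value, timestamp):
        current = self.cells.get(key)
        if current is None or timestamp > current[1]:
            self.cells[key] = (value, timestamp)

    def read(self, key):
        return self.cells[key][0]

store = LwwStore()

# The client's update carries timestamp 200 and arrives first...
store.write("item", "client-update", timestamp=200)
# ...then the Spark migrator replays the stale value it read earlier,
# carrying that value's older original timestamp of 100.
store.write("item", "stale-migrated-value", timestamp=100)

# The client's newer write wins even though it arrived first.
assert store.read("item") == "client-update"
```

This only holds if the migrator preserves the original timestamps of the rows it copies, which is exactly the capability discussed next.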
All right, so what’s an out of order write?
So I have added Martin Fowler’s definition of it where he says
“an out of order event is one that’s received late,
sufficiently late that you’ve already processed
the events that should have been processed
after the out of order event was received”.
And yes, a lot of people got the answer correctly,
so thank you very much for this. Now,
it is important to note that this capability
of manipulating the timestamp of a write is
a ScyllaDB characteristic that’s not available in DynamoDB.
Which brings us to the question, how was this company
handling their out of order writes in DynamoDB?
Well, they were using DynamoDB’s condition expressions
which are very similar to lightweight
transactions in Apache Cassandra.
The drawback of condition expressions in DynamoDB
is that such operations are much more expensive,
both from a performance and from a cost perspective,
than regular non-conditional writes.
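A condition expression is essentially a compare-and-set: the write only applies if a predicate over the current item holds, which requires reading and coordinating on the item before writing. The sketch below is a simplified Python model of those semantics, not the DynamoDB API itself, and the key and version scheme are made up for illustration:

```python
def conditional_put(table, key, item, condition):
    """Toy compare-and-set write: apply the item only if the predicate
    holds for the currently stored item (None if absent).

    Real DynamoDB condition expressions and Cassandra lightweight
    transactions need an extra read/coordination round, which is what
    makes them slower and costlier than plain unconditional writes."""
    current = table.get(key)
    if condition(current):
        table[key] = item
        return True
    return False

table = {}

# First insert succeeds because the item does not exist yet
# (the analogue of an attribute_not_exists-style condition).
assert conditional_put(table, "pk1", {"version": 1},
                       lambda cur: cur is None)

# A newer version passes the version check and is applied...
assert conditional_put(table, "pk1", {"version": 2},
                       lambda cur: cur is None or cur["version"] < 2)
# ...but an out-of-order replay of the older version is rejected.
assert not conditional_put(table, "pk1", {"version": 1},
                           lambda cur: cur is None or cur["version"] < 1)
assert table["pk1"] == {"version": 2}
```

The extra read per write is the cost this company was paying in DynamoDB, and later reintroduced on ScyllaDB as read-before-write.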
And how did they try to circumvent
the out of order write using ScyllaDB?
Well, they implemented a read-before-write prior
to every write, which effectively caused
their reads to spike. So what we did,
after we engaged with them and analyzed their situation,
was improve their application and database performance
considerably by simply manipulating the
timestamp of their writes, which is the same approach
that another customer of ours, Zillow, for example,
is using to handle out-of-order events.
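In CQL, manipulating the write timestamp is done with the `USING TIMESTAMP` clause: when a backfill job copies a row, it can write it with the row’s original `WRITETIME` from the source, so any fresher client update automatically wins under last-write-wins. A sketch of what such a statement could look like (the keyspace, table, and column names are made up; the driver calls are only shown as comments since this snippet just builds the statement):

```python
def backfill_statement(keyspace, table, columns):
    """Build a CQL INSERT that preserves the source row's original
    write timestamp via USING TIMESTAMP. The keyspace, table, and
    columns here are illustrative, not from the actual migration."""
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return (
        f"INSERT INTO {keyspace}.{table} ({cols}) "
        f"VALUES ({placeholders}) USING TIMESTAMP ?"
    )

stmt = backfill_statement("media", "titles", ["id", "name"])
assert stmt == ("INSERT INTO media.titles (id, name) "
                "VALUES (?, ?) USING TIMESTAMP ?")

# With a real driver you would then do something along the lines of:
#   prepared = session.prepare(stmt)
#   session.execute(prepared, (row.id, row.name, source_writetime))
# where source_writetime is the WRITETIME(...) read from the source row.
```

Because the backfilled rows carry their old timestamps, no read-before-write is needed: stale replays simply lose to newer client writes.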
Now, the next real-world example I brought for us to take a look
at today is a ScyllaDB-to-ScyllaDB one.
It’s essentially an engagement platform company
that decided to migrate from their on-premises
deployment to ScyllaDB Cloud, our managed solution,
and asked for our assistance migrating their data over.
Now, one particular detail of this migration is the fact
that they used TTL to control their data expiration.
As this was a ScyllaDB-to-ScyllaDB migration,
we pretty much
didn’t need to go through any data modeling changes,
which greatly simplified the process. But
even though we had initially suggested
carrying out an online type of migration,
they ultimately decided against it
and took the offline route instead.
As a result, among the drawbacks of running an offline
migration, we can highlight the window for data loss,
which actually corresponds to the actual duration
of the migration activity,
plus the fact that it is rather a manual process.
You need to snapshot your nodes, copy
every snapshot over, and finally load
the snapshots into the target system.
Finally, as they decided against dual-writing,
that meant that after they switched over their clients, there would be
pretty much no turning back without actually losing data.
Those risks were explained upfront,
but they decided that these risks didn’t outweigh
the benefits and simplicity of doing it in an offline way.
So prior to the actual production
migration, we tested each step of the activity
in such a way that the actual time frame
susceptible to data loss was properly understood.
It is also worth mentioning that, depending
on your actual requirements, you may not even have to deal
with data loss if you follow this approach,
but only with a temporary data inconsistency:
you may repeat the process after you switch
your application clients to the new cluster,
which would effectively copy
over any data
that wasn’t captured as part of your initial snapshots.
However, they decided that this was not needed
for the actual migration subject of our discussion here.
Now, before we understand how we actually carried out
this offline migration, let’s see what a typical migration
covering time-to-live (TTL) data would look like.
So the first step would be for us to configure
our application clients for dual writes,
while the clients would be reading
only from the existing source of truth. Then, eventually,
the TTL would expire all the pre-migration data
on your existing source of truth,
and at this point
we may switch the reads to the new target database.
And that’s it, all the data would be in sync.
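The reason this works is that once one full TTL period has elapsed under dual writes, every record still alive on the source has necessarily also been written to the target. Here is a toy timeline of that argument; the TTL value, keys, and clock are all made up for illustration:

```python
TTL = 100  # seconds; illustrative value

class TtlStore:
    """Toy store where every record expires TTL seconds after its write."""

    def __init__(self):
        self.records = {}  # key -> (value, write_time)

    def write(self, key, value, now):
        self.records[key] = (value, now)

    def live_keys(self, now):
        # A record is live while less than TTL seconds have elapsed.
        return {k for k, (_, t) in self.records.items() if now - t < TTL}

source, target = TtlStore(), TtlStore()

# A record written before the migration exists only on the source.
source.write("old", "pre-migration", now=0)

# Dual writes begin at t=10: everything new lands on both stores.
for t, key in [(10, "a"), (50, "b"), (90, "c")]:
    source.write(key, "dual", now=t)
    target.write(key, "dual", now=t)

# By t=120, more than one full TTL after dual writes began, every
# record that predates dual-writing has expired, so both stores
# agree on the live set and reads can be switched over safely.
now = 120
assert source.live_keys(now) == target.live_keys(now) == {"b", "c"}
```

Note that `"a"` has also expired by t=120 on both stores, which is fine: the point is that the live sets match, not that every dual-written key survives.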
Now let’s analyze how the migration actually went.
First, the most obvious difference
here is the fact that the client was reading and writing
only against the existing source of truth.
With the application still running, the customer took
an online snapshot of their data across all nodes.
The resulting snapshots were transferred to the target
cluster, which reloaded the data using “load-and-stream”.
Now, for those of you
who are already familiar with Apache Cassandra,
load-and-stream is a ScyllaDB extension
of the old-fashioned nodetool “refresh” command.
Rather than simply loading the data on the node
and discarding the tokens the node is not a replica
for, load-and-stream will actually stream that data
to the other cluster members, therefore greatly simplifying
the migration process.
And after the load-and-stream process completed,
the clients simply switched over the reads and writes to the
new source of truth.