A more in-depth overview of the migration process, covering offline migration, online (or live) migration, and how to implement each.
Alright, so you are probably here because you want to watch a migration
class, and that's exactly what we are deep diving into in this one.
Migration is typically a very simple process, which, simply put,
involves reading data from a source system
and then writing it to a destination system.
You don't have too much room to change that process,
but there are some well-known gotchas which you may end up running into
if you are not aware of some specifics.
And before we actually go into those specifics,
let's just start from the basics.
In general, we can pretty much break down migrations into two distinct types.
First, we have the online migration, which naturally carries
increased complexity,
due to the fact that we need to coordinate between two systems
while our production traffic is running at the same time. And
for obvious reasons, doing an online
migration is typically the preference of most organizations nowadays.
After all, for many real-time use cases, having an outage
is definitely unacceptable from a business perspective.
In contrast, we have the offline migration,
which is much simpler than the online one,
but it carries the major drawback
of requiring an outage in order for it to be accomplished.
Okay?
So regardless of which strategy you choose,
there will always be common steps between these two.
First, you need to create and adjust the schema
in your new database; then comes the actual data migration phase,
where you move your data from one system to another;
and then the last step, data validation,
which you are pretty much free to experiment with,
but which typically involves comparing your source and destination databases,
in order to ensure that they are in sync and, of course,
that no data loss actually happened.
All right, so as it’s simpler, let’s
review how an offline migration works.
Here, your application is currently writing to and reading
from your source database, as it pretty much always has been.
And the first step you have to take, in order to prepare
for your migration, is to actually migrate the
schema.
This step involves defining,
in your target database,
any necessary tables and keyspaces, as well as making the necessary adjustments
should you be migrating between different technologies.
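To make this concrete, here is a minimal sketch of what migrating a schema to a CQL-compatible target such as ScyllaDB might look like, using the Python cassandra-driver. The host, keyspace, table, and replication settings are all invented for the example.

```python
# Minimal schema-migration sketch, assuming a CQL-compatible target
# such as ScyllaDB and the Python "cassandra-driver" package.
# Hostname, keyspace, table, and replication are illustrative only.
from cassandra.cluster import Cluster

# Connect to the *target* cluster (host is an assumption for this sketch).
target = Cluster(contact_points=["target-node-1"]).connect()

# Recreate the keyspace and tables that exist on the source, adjusting
# replication and data types as needed for the new technology.
target.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
target.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders (
        user_id   uuid,
        order_id  timeuuid,
        total     decimal,
        PRIMARY KEY (user_id, order_id)
    )
""")
```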
An important thing to highlight for ScyllaDB specifically:
your data modeling, your schema migration,
is what is essentially going to dictate whether your ScyllaDB deployment
is going to be successful or whether it's going to become a catastrophic failure.
All right?
But we are not going to deep dive into data modeling in this session,
because there are other concurrent sessions ongoing,
and there is a lot of content around data modeling in Scylla University.
But do consider spending a good amount of time thinking about how you will
end up with your data model.
All right, so after you migrate your schema,
the next step is for you to shut down your application, if you so require,
and then move your data to your target database.
Considering you have stopped the application, it is expected that once
the process of "forklifting" the data completes,
you will have both databases in sync.
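As an illustration, here is a heavily simplified "forklift" sketch, assuming both systems speak CQL. Hosts, keyspace, and table are invented, and a real migration would normally use a purpose-built tool (such as the Scylla Migrator) and parallel token-range scans rather than a single-threaded loop.

```python
# Simplified forklift sketch: full scan of the source table, row-by-row
# inserts into the target. All names are assumptions for this example.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

source = Cluster(["source-node-1"]).connect("shop")
target = Cluster(["target-node-1"]).connect("shop")

insert = target.prepare(
    "INSERT INTO orders (user_id, order_id, total) VALUES (?, ?, ?)"
)

# The driver pages through results automatically as we iterate.
scan = SimpleStatement("SELECT user_id, order_id, total FROM orders",
                       fetch_size=1000)
for row in source.execute(scan):
    target.execute(insert, (row.user_id, row.order_id, row.total))
```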
And then, if you so wish, you can run
data validation steps, such as, you know, comparing a few records
or running some CI/CD pipelines against both the source
and target databases. And once you are comfortable,
you simply point your application to the new target database.
And that's pretty much it.
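For instance, a very basic spot-check validation could look like the following sketch; the sample size, hosts, and table are assumptions, and dedicated comparison tools can do this exhaustively.

```python
# Spot-check validation sketch: sample some rows from the source and
# confirm the target returns identical data. Names are assumptions.
from cassandra.cluster import Cluster

source = Cluster(["source-node-1"]).connect("shop")
target = Cluster(["target-node-1"]).connect("shop")

sample = source.execute(
    "SELECT user_id, order_id, total FROM orders LIMIT 100")
lookup = target.prepare(
    "SELECT total FROM orders WHERE user_id = ? AND order_id = ?")

mismatches = 0
for row in sample:
    match = target.execute(lookup, (row.user_id, row.order_id)).one()
    if match is None or match.total != row.total:
        mismatches += 1
print(f"{mismatches} mismatched rows in the sample")
```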
Now, in contrast to an offline
migration, let's see how an online migration differs.
Here, we start at the same point as the offline one,
where the application is writing to and reading from the source database.
In addition to migrating the schema, it is also important
that we start capturing all events in our source database.
This will typically involve,
you know, enabling some sort of change data capture (CDC).
But, as we are now going to see, in some databases
such as Apache Cassandra, most of the time you can simply perform
dual writes from your application, without
making use of another streaming
system in order to capture changes.
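A dual-write sketch, under the same assumptions as before, might look like this; note that a real implementation also needs a policy for what happens when one of the two writes fails (retries, a reconciliation queue, and so on).

```python
# Dual-write sketch: during the online migration, the application
# mirrors every mutation to both clusters. Names are assumptions.
from cassandra.cluster import Cluster

source = Cluster(["source-node-1"]).connect("shop")
target = Cluster(["target-node-1"]).connect("shop")

src_insert = source.prepare(
    "INSERT INTO orders (user_id, order_id, total) VALUES (?, ?, ?)")
tgt_insert = target.prepare(
    "INSERT INTO orders (user_id, order_id, total) VALUES (?, ?, ?)")

def save_order(user_id, order_id, total):
    # Write to the source first (still the system of record), then
    # mirror the same mutation to the target.
    source.execute(src_insert, (user_id, order_id, total))
    target.execute(tgt_insert, (user_id, order_id, total))
```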
So, the next step, after we have loaded the schema and enabled
change data capture for our source database, is to actually "forklift"
the existing data. And after that is complete,
you will start consuming all the changes that happened from your CDC-based system.
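Sketching that consumption step generically: the helper below, read_cdc_batch, is hypothetical and stands in for whatever your CDC mechanism actually exposes (a Kafka topic, CDC log tables, and so on); the point is simply that each captured change gets re-applied, in order, on the target.

```python
# Generic CDC drain sketch. `read_cdc_batch` is a *hypothetical*
# placeholder for your actual change feed; only the apply loop matters.
from cassandra.cluster import Cluster

target = Cluster(["target-node-1"]).connect("shop")
apply_change = target.prepare(
    "INSERT INTO orders (user_id, order_id, total) VALUES (?, ?, ?)")

def read_cdc_batch():
    """Hypothetical: yields change events captured on the source."""
    return []  # placeholder so the sketch is self-contained

for event in read_cdc_batch():
    # Re-apply each captured mutation on the target, in order.
    target.execute(apply_change,
                   (event.user_id, event.order_id, event.total))
```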
Here it's worth mentioning, well, we already mentioned that you don't need
to do that with Apache Cassandra. And
right after you consume all the changes,
you can then run any data validation steps, if you so wish.
And finally, you will switch your application over to the new database.
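In practice that final switch is often just a configuration change, along these lines (hostnames invented):

```python
# Cutover sketch: the application's contact points move from the
# source cluster to the target cluster.
from cassandra.cluster import Cluster

# Before: session = Cluster(["source-node-1"]).connect("shop")
session = Cluster(["target-node-1"]).connect("shop")  # after cutover
```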