This session is about migrating to ScyllaDB. It covers a migration overview, online versus offline migrations, the important steps in a migration, using Kafka for migration, application changes, and the Spark Migrator.
We will talk about the migration overview, what an online versus offline migration is, and the common steps. We will then dive into the online migration and its options, and into migrating the existing data, which is called "forklifting", and the options for handling that. Let's get started. So you have a Cassandra database and you have decided to move to ScyllaDB. The compatibility is there in terms of SSTables, drivers, and so on, so your client drivers, CQL, and nodetool commands will all work the same way they did with Cassandra. But assuming you do not want to start with ScyllaDB from scratch as a clean database, you will need to perform some sort of database migration. The first thing to decide is whether you want a live (online) migration or an offline migration; in both cases you will need to migrate your data from Cassandra to ScyllaDB. Now let's look a little into both and understand how they work. During an offline migration, neither the legacy database (Cassandra in this case) nor ScyllaDB is operational. The offline migration strategy is easier on the operator, as it lets you stop the legacy database at some point and restart from that same point in time
in the new ScyllaDB system. However, because both clusters are down, there is downtime: users will not be able to connect to the system during the migration process, and that is usually a constraint few organizations are willing to accept. Hence the online migration, the second strategy. Contrary to the offline migration, in an online migration both the legacy and the ScyllaDB deployments are fully operational and can serve traffic. There are steps common to both the online and the offline migrations, and we will dive into them. So, offline migration: let's look at a timeline. There are no dual writes, so we have writes and reads going to the old database. We migrate the schema (we will show how to do that in a second): you can use the DESCRIBE CQL command to dump your schema into a .cql file, which you can then use to move it from Cassandra to ScyllaDB. You dump the schema and make some adjustments, very minor ones; the needed adjustments are documented in this link, and a minimal sketch of the commands follows.
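As an illustration, here is a minimal sketch of the schema dump and load, assuming placeholder node addresses and file names; the exact adjustments you need to make to the dumped schema are the ones documented in the link above:

```
# Dump the schema from the Cassandra cluster into a .cql file
cqlsh <cassandra_node_ip> -e "DESC SCHEMA" > orig_schema.cql

# Make the minor syntax adjustments by hand, then load the result into ScyllaDB
cp orig_schema.cql adjusted_schema.cql   # edit adjusted_schema.cql as needed
cqlsh <scylla_node_ip> --file adjusted_schema.cql
```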
There are a few minor syntax differences between ScyllaDB and Cassandra, and then you can use the cqlsh --file command to populate that schema in your ScyllaDB cluster. As I mentioned, there are a few minor differences you might need to tweak, which is why there is an original and an adjusted schema. After we have migrated the schema (we are still talking about the offline migration flow), we need to forklift the existing data, moving it from Cassandra to ScyllaDB; there are various ways to do that. After forklifting the existing data, the databases are in sync. Now we get to a validation phase, where we make sure the information is on the new ScyllaDB cluster properly, and then we can fade out the old database and start serving the writes and reads from the new database. That is the offline migration. Online migration is how migrations are usually done, roughly 95 percent of the time. First of all, you need to think about dual writes: updating your application logic to send writes to both clusters. You have the Cassandra cluster, you have your ScyllaDB cluster, and you have populated the schema on the ScyllaDB cluster. Now, before you do anything else, you need to start doing dual writes to both clusters to take care of your real-time traffic. You need to be able to compare the results and log any inconsistencies, and you need to make sure you are using client-side timestamps.
Okay, you don't want some NTP servers going crazy so that your entries in ScyllaDB end up with different timestamps than they had in Cassandra, so we need to use client-side timestamps. By the way, that is the default in the latest versions of most drivers, so make sure you are using the latest drivers; that is a general recommendation. Each database writer should let you stop and start writing to either database at any point in time, and you will be able to apply the same logic to the dual reads later in the process. Now let's look at the timeline of a live migration. Writes and reads are going to the old database; we migrate the schema to the new database; now we start the dual writes. During that period we are writing to the new database while still writing to the old database, and we start handling the forklifting of the existing data, which is the second part of this session. At this point the databases are in sync, and we move to a validation phase, which consists of dual reads: all the reads are still sent to the old database as before, but we also start reading from the new database, while still relying on the legacy database as the source of truth until we are confident in the reads from the new database. Once we are confident and the validation phase is considered successful, we can start fading out the old database and consider ScyllaDB the source of truth.
Okay, let's say you are already using Kafka and you have a Kafka layer in front of your application layer or the database. In that case it is recommended to buffer the writes while the migration is in progress, meaning the writes to the Cassandra cluster stay intact, but the writes destined for the ScyllaDB cluster get buffered in the Kafka cluster, and after the migration is completed you replay all the buffered data into ScyllaDB. You should increase the TTL to add buffer time while the migration is in progress or about to start, and you can use the Kafka Connect framework with the Cassandra connector to write the data directly into ScyllaDB, or use KSQL if you want to run a couple of transformations on your data, though that is still experimental. So Kafka is a good way of doing that streaming. Another option is to do the dual writes, as I mentioned earlier. In the example we showed, we took the code and simply duplicated the writes.append call so that we now write to DB1 and DB2; it is basically the same and it is not hard, and we have more detailed examples in our documentation. A sketch of the idea follows.
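Here is a minimal sketch of the dual-write idea in Python, assuming the Cassandra/ScyllaDB Python driver and hypothetical contact points and a hypothetical users table; the point is the shared client-side timestamp and the per-cluster error handling, not the exact API of your application:

```python
import time
from cassandra.cluster import Cluster  # Python driver for Cassandra/ScyllaDB

# Hypothetical contact points -- replace with your own clusters and keyspace
cassandra = Cluster(["10.0.0.1"]).connect("my_keyspace")
scylla = Cluster(["10.0.1.1"]).connect("my_keyspace")

# USING TIMESTAMP lets us attach the same client-side timestamp to both writes
INSERT = "INSERT INTO users (id, name) VALUES (%s, %s) USING TIMESTAMP %s"

def dual_write(user_id, name):
    ts = int(time.time() * 1_000_000)  # one timestamp (microseconds) shared by both clusters
    errors = {}
    for label, session in (("cassandra", cassandra), ("scylla", scylla)):
        try:
            session.execute(INSERT, (user_id, name, ts))
        except Exception as exc:
            errors[label] = exc  # log the inconsistency; decide per cluster whether to retry
    return errors
```

Because each write is wrapped separately, you can stop or start writing to either database independently, and the same wrapping can later be applied to the dual reads.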
Now let's talk about how to migrate your existing data; everything so far was about the real-time traffic. We have a lot of historical data we want to migrate, and there are a few options: CQL COPY, SSTableloader, and the Spark Migrator. An important thing to mention, as written at the bottom: when performing an online migration, always choose the write strategy that preserves timestamps, unless all your keys are unique. You don't want your historical data to get registered with new timestamps in the new database; that would just mess everything up. A little bit about how these work under the hood: each of these tools does the data migration differently. The ScyllaDB Spark Migrator works natively with CQL; it actually reads your data with CQL and writes it with CQL statements to the new database. SSTableloader takes the SSTable files, which are the on-disk format of the data stored in Cassandra, parses them, and writes CQL statements to the new database. And there is the CQL COPY command, which lets you use arbitrary data files in CSV format: it scans through the CSV file and inserts all that data. Here we have an example of a way you can migrate without doing any forklifting at all: let's say you have highly volatile data with a very low TTL, not a TTL of years but a TTL of 10 days or so.
Okay, you can establish the dual writes as I explained earlier, keep them running until the last record in your legacy database has expired by the TTL, turn off the dual writes, and just phase out the old database. In that case you know you have all the needed information in the new database, because everything else has already been TTL'ed out of the old database, so there is no need to keep pushing it to the new one. This is just a small example, a small corner case. Now let's talk about the three options I mentioned, starting with CQL COPY. SSTableloader is the most commonly used and a powerful tool, but sometimes you only have a small amount of historical data to migrate, something like two million rows or less, or you need to work from CSV files, for example if you are not moving from Cassandra to ScyllaDB but from MongoDB or some other database, and you cannot just move the SSTable files or read the data with CQL; you need to dump the data somewhere, so you create a CSV dump and that can be used. The COPY command is used to export or import small amounts of data in CSV format: COPY TO exports the data from a table into a CSV file, and COPY FROM imports the data from a CSV file into an existing table. Each row in your data will be written as a line in the CSV file; you can determine the delimiter and how to parse it when you upload the data; you can drop columns you don't want, so you don't need to dump all the information; and you can do a few more things.
Let's talk about the things you can do with COPY. File size matters: it is not going to be trivial to import huge CSV data dumps, so if you have a big data set, split it into multiple CSV files and make sure each CSV file has less than two million rows. And again, sometimes this is the only available strategy, because you are not migrating from a cluster that supports CQL and SSTables; the other tools need CQL and/or the SSTable format, which is what Cassandra provides, and that is why most of the migrations discussed here start from Cassandra. You can also skip unwanted columns and select which columns you want to import. As I mentioned, formatting is important and may differ between CSV files, so you need to know what your formatting looks like, and that is why you may need to specify some of these flags in the CQL COPY command. Use a separate host to do the loading; don't do it locally on a ScyllaDB node, as it would be extra overhead on a node that is already digesting a lot of the information you are loading. Use separate nodes, and even better, use multiple separate nodes in parallel to stream and import the data using CQL COPY. There are a few other options: if HEADER equals false and no column names are specified, the fields are imported in deterministic order. There is a HELP for this command, and you can read there about all the other options you have, such as batching, TTL, CHUNKSIZE, and so on.
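For illustration, a minimal sketch of COPY TO and COPY FROM, assuming a hypothetical my_keyspace.users table and file name; the WITH options shown are just a couple of those listed by HELP COPY:

```
-- On the source side: export three columns to a CSV file, with a header row
COPY my_keyspace.users (id, name, email) TO 'users.csv' WITH HEADER = TRUE;

-- On the ScyllaDB side: import the same CSV into an existing table,
-- tuning the batching with CHUNKSIZE
COPY my_keyspace.users (id, name, email) FROM 'users.csv'
  WITH HEADER = TRUE AND CHUNKSIZE = 50;
```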
To summarize the pros and cons: the COPY command is a very simple way of migrating, and a very simple command to use; the CSV format is transparent and easy to validate; the destination schema can have fewer columns than the original; it can be tweaked; and CSV is supported by plenty of databases and tools, so you can export data from various sources and then use the COPY command on the CSV file to migrate to ScyllaDB. And of course it is compatible with Cassandra, ScyllaDB, and ScyllaDB Cloud. The cons: it is not made for huge data sets, and timestamps are not preserved, so if your use case is timestamp sensitive, this is probably not the way to go. Next, SSTableloader. First of all, as written at the bottom, you must use the ScyllaDB SSTableloader in order to migrate between Cassandra and ScyllaDB. Both of them ship an SSTableloader as part of their tools, but you need to use the ScyllaDB SSTableloader to migrate to ScyllaDB; sometimes people don't, and they hit issues.
Ok, the usual process is: you create a snapshot on each of the Cassandra nodes, and you run SSTableloader from each of the Cassandra nodes, or preferably from an intermediate node, again so as not to load your Cassandra nodes, which are currently your live real-time database; you can use mount points or whatever works. You use SSTableloader to read the SSTable files from Cassandra and write them into ScyllaDB. There is the "-t" option for throttling: if you see that something is choking, either the Cassandra side because you are doing a lot of reads from it, or the ScyllaDB side, use throttling to limit the rate of the SSTableloaders. And, as always, best practice (I mentioned this for CQL COPY as well): run several SSTableloaders in parallel to utilize all the ScyllaDB nodes. ScyllaDB has a shard-per-core architecture, and you want many connections opened to ScyllaDB by the SSTableloaders; using only one, two, three, or four SSTableloaders might not fully utilize ScyllaDB, which at this point is an empty database that you want to ingest all the information as quickly as possible, so think about eight, ten, or twelve SSTableloaders.
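A rough sketch of that flow, with placeholder addresses, keyspace, table, and paths; the snapshot directory layout on your Cassandra nodes will differ, and the exact flags are documented for the ScyllaDB SSTableloader:

```
# On each Cassandra node: flush memtables and take a snapshot of the keyspace
nodetool flush my_keyspace
nodetool snapshot -t migration my_keyspace

# From an intermediate host (with the snapshot copied or mounted), stream one
# table's SSTables into ScyllaDB; -d gives a ScyllaDB contact point and
# -t throttles the transfer rate so neither side chokes
sstableloader -d <scylla_node_ip> -t 100 /path/to/snapshots/my_keyspace/users/
```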
We also recommend starting with one keyspace and its underlying SSTable files, and only after you have done that keyspace, moving on to the next one. Any migration should start small and then go big: don't go for a full-blown migration while you are still learning the tool; it's better to do something small, a specific keyspace or even a specific table, see that you understand how to use the tool, and then go for the full-blown migration. Things that are important during the forklifting: there should not be any schema updates on your source database. SSTableloader has support for simple column renames; column names can be a big factor in the metadata you are storing, and if every column has a very long name the metadata can be very big, so sometimes it is better to shorten the column names. Simple column renames are supported in SSTableloader, and Spark is even better in that respect. As I mentioned earlier, assuming RF=3 you end up with nine copies of the data until compaction takes care of it, so you need to be able to accommodate that. As for failure handling: each loading job is per keyspace or table, so if SSTableloader failed, you will need to repeat the loading job for that SSTable; since you are loading the same data, and because of the replication factor, compaction will take care of any duplication, so don't worry about that. What happens if an Apache Cassandra node fails during the migration? If the failed node was one you were loading SSTables from, then that SSTableloader job will also fail, because we can't read from a node that is down. If you are using a replication factor larger than one, which is usually the case, your data will also exist on the other replicas, on the Cassandra nodes that are still up.
Hence you can continue with your SSTable loading from all the other Cassandra nodes, and once completed, all the data should still end up in ScyllaDB. What should I do if one of the ScyllaDB nodes fails? If one of the destination ScyllaDB nodes failed and it is one of the nodes you were loading SSTables to, then that SSTableloader job will fail and you will need to restart it. And how do you roll back and start from scratch? You stop the dual writes to ScyllaDB and the ScyllaDB service, use CQL to truncate all the data that was already loaded into ScyllaDB, start the dual writes again, take new snapshots on your source Cassandra nodes, and then start the SSTableloader jobs again from the new snapshot location.
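As a rough sketch of that rollback, with hypothetical keyspace, table, snapshot tag, and paths (the dual-write toggle lives in your application, so it only appears as a comment):

```
# 1. Stop the dual writes to ScyllaDB in the application, then clear what was loaded
cqlsh <scylla_node_ip> -e "TRUNCATE my_keyspace.users;"

# 2. Re-enable the dual writes, then take fresh snapshots on the Cassandra nodes
nodetool snapshot -t migration2 my_keyspace

# 3. Point SSTableloader at the new snapshot location and start the load again
sstableloader -d <scylla_node_ip> /path/to/snapshots/migration2/my_keyspace/users/
```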
Okay, the ScyllaDB Spark Migrator. The Spark Migrator is a very cool tool: it is highly resilient to failures, it will retry failed reads and writes, and it saves savepoints, which means it knows how to roll back to a savepoint and continue from there if it fails at some point. It has essentially unlimited streaming power; it depends only on how many nodes, or how big, the Spark cluster you spin up is, but Spark is a very powerful tool and you should take advantage of that. There is no interruption to the workload for either the Cassandra or the ScyllaDB cluster; they keep running side by side, accepting the new data traffic into both clusters, and there is a lot of flexibility to change the data model. Using the ScyllaDB Migrator can help when you need to make changes to the data model; there is even an option to use it as an ETL tool, taking a data column and enriching it as you store it in ScyllaDB. All of these are great advantages, and the most important one is this: with solutions like SSTableloader, if you remember, we were copying every replica, so we were copying the data three times; but when you use Spark, you transfer only a single copy of the data, because it actually reads the data with CQL statements and then writes those CQL statements to ScyllaDB. It is basically doing a full table scan, so you get only a single copy of the data, like an actual application reading the data once and writing it to ScyllaDB. You can also make sure that no tombstones are transferred when using the Spark Migrator; tombstones can take up a huge amount of storage space, and you usually don't want them transferred.
So you can have the tombstones not transferred when you use the Spark Migrator, and this way you will be transferring less data, saving space, saving bandwidth on the network, and reducing your data transfer costs. The Spark Migrator is very simple to install: you just need the standard Spark stack, a Java JRE and JDK, and SBT. There is a configuration file you need to edit; there is a link here on how to get that done. Connections are created from the Spark jobs to both the ScyllaDB and Cassandra clusters, so size your Spark cluster based on the amount of data you have to transfer from the Cassandra nodes to ScyllaDB. It is recommended, if possible, that you run a compaction on your Cassandra cluster before you start the migration; it will accelerate the read process, as there will be fewer tombstones. These are part of the settings you need to take into consideration: we care about the Spark and the Cassandra Spark connector versions you use, and you need to use these versions or higher to take advantage of the latest improvements. The number of connections your Spark workers open towards the Cassandra and ScyllaDB clusters should be proportional to the number of cores you have in your ScyllaDB or Cassandra deployment.
So make sure the connections are not set too low, otherwise you won't have a connection to each shard; with the ScyllaDB architecture we want at least one, or even a few, connections per shard, so make sure you know how many cores your ScyllaDB nodes have and adjust accordingly. You basically need to define a source and a destination keyspace and the tables you want to migrate. To make the transfer more efficient, we ask you to set the split count, which determines how many partitions are used for the Spark execution; ideally this should be the total number of cores in the Spark cluster. Everything I am saying here is also explained in the link, so you don't have to memorize everything.
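For illustration, a minimal sketch of what the Migrator configuration might look like, with assumed hosts, keyspace, and table names; the exact field names and the full set of options are in the linked documentation:

```yaml
source:
  type: cassandra
  host: <cassandra_node_ip>
  port: 9042
  keyspace: my_keyspace
  table: users
  splitCount: 256        # ideally the total number of cores in the Spark cluster
  connections: 8         # proportional to the cores on the source side
target:
  type: scylla
  host: <scylla_node_ip>
  port: 9042
  keyspace: my_keyspace
  table: users
  connections: 16        # at least one connection per ScyllaDB shard
savepoints:
  path: /app/savepoints  # where the migrator records token ranges already transferred
  intervalSeconds: 300
```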
That is roughly how a ScyllaDB Spark Migrator configuration file looks. If you have worked with Spark before, you know it has very good logs and a UI. For example, in this log we can see that the Spark Migrator created a savepoint, along with the token ranges it added to that savepoint, so it knows that if it needs to go back to that savepoint, these are the token ranges it will need to address again. And there is the UI, of course, where you can see the progress of the jobs: running, failed, succeeded; it is always good to have a UI. Takeaways on online migration: we basically showed you that you can do a migration without downtime, and you should not be scared of migrations. We have had plenty of successful migrations, and there will be sessions giving testimony to that, for example by a guy who migrated to ScyllaDB Cloud while he was on vacation; when he came back, ScyllaDB Cloud was up and running. So, if you have an existing Kafka infrastructure, use the Kafka stream replay option; for all other scenarios, make the small changes to the application to do the dual writes, as I mentioned. For the existing data migration: if your source database is a SQL database, MongoDB, or whatever, use the COPY command; if you have access to the original SSTable files, use SSTableloader; but if you want the fully flexible streaming solution and can afford the extra load, the ScyllaDB Spark Migrator is the most recommended tool.
General best practices when doing migrations, not necessarily from Cassandra or to ScyllaDB: back up your current database. Clean up any stale, old, TTL'ed data in your existing database that you don't want to migrate to the new one, to save disk space and time. Plan for it; this can take many hours or even days for a large data set. Test the migration on a small data set, start small and then go big, and of course test the rollback procedure, so you don't have to find out how it works in real time when something bad happens. Have a monitoring stack in place for both your source and your destination database; monitor them, see what they are going through, because you don't want to choke either of them. Have validation points to make sure the data is coherent. Before fading out your original database, make sure there are no live connections to it and all your clients are already connected to the new database. Make sure that all the users inside your company
(by users we mean the application developers) know about this and are not starting to implement a new feature or a new data model change in the schema during the migration; these things happen. And of course, last and most importantly, get the ScyllaDB team involved. We want to help; make us a partner in the migration. We can offer a lot of advice, we can escort you through the migration, and in some cases we can even do it for you. Thank you.