An overview of Change Data Capture (CDC) in ScyllaDB. CDC is a feature that allows you to not only query the current state of a database table but also query the history of all changes made to the table. The topics discussed include what CDC is, some common use cases, what it does, and an overview of how it works.
What is CDC, what is Change Data Capture? It's some sort of record of the modifications that happen to one or more tables in the database. We want to know when we write, we want to know when we delete, and obviously this could be done with triggers in an application or whatever, but the key feature is that we want to be able to consume this asynchronously. We want the consumer to be able to somehow batch, divide and conquer, and approach this from a data standpoint more than an application standpoint. And most importantly, this is one of the key features in ScyllaDB in 2020; it's one of the major things we're working on, it's going to be great, and you all have to try it and buy it.
So, the use cases. Some of the prominent ones: fraud detection, where you want to analyze transactions in some sort of batch manner to see if credit cards are being used from many places at the same time, or whatever; or you want to plug it into your Kafka pipeline, because you want some sort of analysis done on a transaction level and not on an aggregated level. Other use cases are data duplication, where you want to mirror your database, which means having another ScyllaDB somewhere that is not coupled via a shared DC, or you want to replicate the database. Even more important, you want the data to be somehow transformed, transaction by transaction, into some other storage medium. Or you have some other use case; you tell me, you probably know better.
How does CDC in ScyllaDB work? Well, how will CDC in ScyllaDB work? It's enabled per table, because you don't want to pay for stuff that you don't need; you want it as isolated as it can be. We have it at the granularity of a row, so modifications are considered at the CQL row level, the clustering row level. We optionally want to read pre-image data: we want to show the current state, if data exists for a row, and we want to limit it to the columns that are affected by the change. And we want to add a log entry for the modification, where we optionally include the pre-image, we want the changes that happened, and optionally we also want the post-image; or we want one of the three, or all three, etcetera. It's all going to be optional, it's all going to be controlled by you.
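
As a rough sketch of what enabling this could look like from the Python driver (the cdc option names such as 'preimage' and 'postimage', and the keyspace and table names, are assumptions for illustration, not confirmed syntax from the talk):

    from cassandra.cluster import Cluster  # works with the Scylla/Cassandra Python driver

    # Assumes a ScyllaDB node on localhost and an existing keyspace named 'ks'.
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('ks')

    # CDC is enabled per table, and pre-image / post-image capture is opt-in,
    # so you only pay for the parts you actually need.
    session.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            id    uuid PRIMARY KEY,
            item  text,
            price double
        ) WITH cdc = {'enabled': true, 'preimage': true, 'postimage': true}
    """)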
So how do we implement this? Well, CDC in ScyllaDB is just another table. It's enabled per table, and the log is going to be another data table, stored distributed in the cluster, sharing all the same properties you're used to with your normal data. It's ordered by the timestamp of the original operation and the sequence of changes associated with whatever you were doing via CQL, because, again, a CQL operation can break down into more than one modification, or you're doing a batch update, et cetera. You have columns for the pre-image, and you have the delta records for the change, so every column records information about how you modified it and what TTL you put on it.
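
Because the log is just another table, consuming it is just reading rows. A minimal sketch, reusing the session from above and assuming the log is exposed next to the base table under a name like orders_scylla_cdc_log with cdc$-prefixed metadata columns (both details are assumptions, not taken from the talk):

    # Read recent entries from the CDC log; each row is one modification.
    rows = session.execute("""
        SELECT "cdc$stream_id", "cdc$time", "cdc$operation", "cdc$ttl",
               id, item, price
        FROM orders_scylla_cdc_log
        LIMIT 100
    """)

    for row in rows:
        # cdc$time is derived from the write timestamp, so entries follow the
        # order of the original operations; cdc$operation says what kind of
        # change it was, and the remaining columns carry the delta and the TTL.
        print(row)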
The topology of this log, the CDC log, is meant to match the original data, so you get the same data distribution and you get the same consistency, potentially depending on options, as for your normal data writes. So it's synchronized with your writes; it shares the same properties, it shares the same distribution, all to minimize the overhead of adding the log, but also to make sure that the log and the actual data end up matching each other as closely as possible. But again, some of this is going to be optional: we're going to allow you to use different consistency levels, although that is a special use case, so be careful.
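
To make the "synchronized with your writes" part concrete, here is a sketch of an ordinary write against the hypothetical orders table from earlier; the idea is that the matching log entry is produced as part of this write, at the same consistency level and recording the TTL used (table and values are made up):

    from uuid import uuid4
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    # The CDC log entry for this write is generated alongside the base-table
    # write, so it shares the consistency level and notes the TTL of the write.
    insert = SimpleStatement(
        "INSERT INTO orders (id, item, price) VALUES (%s, %s, %s) USING TTL 600",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert, (uuid4(), 'keyboard', 49.99))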
And everything is going to be transient: CDC data typically represents a time window back in time that you can consume and look at. The default survival time is 24 hours, but again, it's up to you. This is of course to ensure that if this log builds up because you're not consuming it, it's not going to kill your database; it's going to add to it, but it's not going to kill it.
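
A sketch of tuning that window, assuming the survival time is exposed as a 'ttl' option in seconds on the same cdc map (again an assumption, not confirmed syntax):

    # Shrink the log retention window from the 24-hour default to one hour,
    # for example if the consumer drains the log frequently. Unconsumed entries
    # simply expire, so an unread log cannot grow without bound.
    session.execute("""
        ALTER TABLE orders
        WITH cdc = {'enabled': true, 'ttl': 3600}
    """)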
The downsides of this approach are that we're going to add a read before the write if you want pre-image data, or a read after the write if you want the post-image. The log shares the same consistency as everything else in ScyllaDB, which is eventually consistent, and it represents the change as viewed by the client, because, again, we are a distributed database; there is no single truth of a change of data. So it's all based on how the coordinator, or the client talking to the coordinator, views the state and the change. And it's the change, not what happened afterwards. So for example, depending on your consistency level, if you lost a node at some point, what you view from the data might not be what you last wrote, because you lost the only replica that held that data, or you can't reach the consistency level needed to get that value back. That is not represented in the log; the log represents what happened. And obviously you could get partial logs in the case of severe node crashes; there's no way around it.