What are tombstones and why are they important?
Data written to ScyllaDB gets persisted to SSTables. Since SSTables are immutable, the data can’t actually be removed when a delete is performed. Instead, a marker (also called a “tombstone”) is written to indicate the value’s new status. When compaction occurs, the data will be expunged completely and the corresponding disk space recovered.
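As a rough illustration of the summary above, here is a toy Python model of immutable tables, a delete that only writes a marker, and a compaction that finally drops the data. The `ToyStore` class and the `TOMBSTONE` sentinel are hypothetical names invented for this sketch; it is not how ScyllaDB is implemented, and it ignores timestamps and gc_grace_seconds entirely.

```python
from typing import Optional

TOMBSTONE = object()  # hypothetical sentinel standing in for a deletion marker


class ToyStore:
    """Toy append-only store: every write produces a new immutable table."""

    def __init__(self) -> None:
        self.sstables: list = []  # oldest first; tables are never modified

    def write(self, key: str, value: str) -> None:
        self.sstables.append({key: value})

    def delete(self, key: str) -> None:
        # A delete is just another write: it records a marker, nothing is erased.
        self.sstables.append({key: TOMBSTONE})

    def read(self, key: str) -> Optional[str]:
        # The newest table wins; a tombstone hides any older value.
        for table in reversed(self.sstables):
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self) -> None:
        # Merge all tables, keep the newest entry per key, drop deleted keys.
        # Simplification: a real system keeps tombstones for a grace period.
        merged: dict = {}
        for table in self.sstables:
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [live] if live else []


store = ToyStore()
store.write("a", "x")
store.delete("a")
tables_before = len(store.sstables)  # the value and the marker both exist
store.compact()
tables_after = len(store.sstables)   # both are gone after compaction
```

Before compaction the deleted value still occupies space next to its marker; only the merge step removes both.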
Tombstones, everyone’s favorite subject
What is a tombstone? It’s the way we delete data in ScyllaDB, and it follows from the asynchronous nature of what we do. Whenever you delete data, you don’t really delete the data; you just mark that row for deletion. You just write the fact that this row is deleted. You don’t go there and delete the data. Why? Because then you start having a coordination problem among all the replicas. So a deletion (and a TTL, for that matter, which is just a deletion that happens automatically) is just a write. There is no difference between a deletion and a write: a write writes a value, and a deletion writes the fact that this data shouldn’t be returned to the client.

Tombstones do disappear, because if you never get rid of them your disk space never truly goes down. I mean, it does go down, because if you delete a 100k blob the deletion itself is very small, but at some point you can have a lot of deletions that use a lot of disk space. So at some point I do want to get rid of those deletions. This is controlled by gc_grace_seconds in your table properties. When you create a table you’re gonna see gc_grace_seconds. The default for that is ten days, which means that after 10 days I can delete the tombstones that were created 10 days ago. So tombstones are going to be in your cluster for at least 10 days. I don’t have to delete them right at that point, because they are deleted asynchronously. But if I’m doing something that would trigger a tombstone deletion in less than 10 days, assuming the default, I’ll leave the tombstone be; I’m not going to touch it. If I’m doing it after 10 days, I can get rid of it, because it’s been 10 days.
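The age rule described here can be sketched in a few lines of Python. The function name `can_purge_tombstone` and the example timestamps are assumptions made for illustration; a real compaction has to check more than the tombstone's age, so treat this as the age check only.

```python
DEFAULT_GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # ten days, the default mentioned above


def can_purge_tombstone(deletion_time: int, now: int,
                        gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """Return True once the tombstone is at least gc_grace_seconds old."""
    return now - deletion_time >= gc_grace_seconds


deleted_at = 1_000_000  # hypothetical deletion time, in seconds
nine_days_later = deleted_at + 9 * 86_400
eleven_days_later = deleted_at + 11 * 86_400

print(can_purge_tombstone(deleted_at, nine_days_later))    # False: leave it be
print(can_purge_tombstone(deleted_at, eleven_days_later))  # True: may be dropped
```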
Before I tell you why gc_grace_seconds exists, let’s look at the biggest problem we have with tombstones. How expensive do you guys think this query will be? The query is the following: select one column (and let’s not imagine that this column is super big; no, it’s a simple integer column), select column from table limit 1. What limit 1 tells me is that maybe you have 10 billion rows, just give me the first. Whatever you have there, if it’s empty, it’s empty, but this query will return either 0 rows or 1 row, because I’m limited to 1 row. But to get you a live row I might have to scan a lot of dead rows, which are the tombstones. This slide is also designed to show how good of an artist I am; see, there’s my tombstone representation.

The example of ten seconds means you’ve got a lot of tombstones, but if you think about a workload that’s deleting things constantly, it’s quite possible. So if I fire this query, select column from table limit 1, and I find myself in this situation, I have to go scanning my partition until I find one live row, and then I can return it to you. So maybe it’s gonna be fast, but maybe it won’t, because I have a lot of tombstones. So I’ve got to get rid of tombstones.
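To make the cost of that limit 1 query concrete, here is a small Python sketch that scans a partition until it finds the first live row and counts how many rows it had to look at. The `(is_tombstone, value)` row layout and the partition contents are made up for the illustration and do not reflect ScyllaDB's storage format.

```python
def select_limit_1(partition: list) -> tuple:
    """Scan rows in order; return (first live value or None, rows examined)."""
    examined = 0
    for is_tombstone, value in partition:
        examined += 1
        if not is_tombstone:
            return value, examined
    return None, examined


# Hypothetical partition: 100,000 deleted rows followed by a single live row.
partition = [(True, None)] * 100_000 + [(False, 42)]
value, examined = select_limit_1(partition)
print(value, examined)  # 42 100001
```

The query returns one small integer, but the work done is proportional to the number of tombstones sitting in front of it.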
Performance definitely can be impacted by the presence of tombstones. How to avoid this? We need to make sure that we are taking action to expire those tombstones. The biggest thing that gets rid of tombstones is compaction, so you need to make sure that compactions have the time to run. If your cluster, for instance, is not properly sized, maybe you’re too tight, compactions don’t have enough resources to run, they’re gonna lag behind, and you’re not gonna be able to get rid of tombstones. In that case you expand the cluster.

The question here is: why do I have this problem? I should be able to get rid of this tombstone immediately, right? Just get rid of it. Why do I have to keep it for ten days? The reason is a problem called data resurrection, and let me walk you through why this happens. In this example (I can’t see from this screen, so I’ll look in here) we have three nodes, so a replication factor of three. The first node has a = x, node 3 has the same thing, and this was written at timestamp t1. The other node has timestamp t0 and a = y. So this is one of these scenarios in which node 2, for whatever reason, did not get the write. Why? I don’t know, but it doesn’t have the write. If I’m doing quorum reads on that data, I will always return the correct data. We discussed that already, right? It doesn’t matter which two nodes they pick: one of them will have the data, and then I would do read repair and return you the correct data.
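The claim that any quorum returns the correct value can be checked with a short Python sketch. It models the three replicas from the example and resolves each pair by picking the highest timestamp. The `Cell` type and the `reconcile` helper are hypothetical names, and the integers 0 and 1 stand in for t0 and t1.

```python
from itertools import combinations
from typing import NamedTuple


class Cell(NamedTuple):
    value: str
    timestamp: int


def reconcile(responses: list) -> Cell:
    """Last write wins: return the cell with the highest timestamp."""
    return max(responses, key=lambda cell: cell.timestamp)


# State from the example: nodes 1 and 3 have a = x at t1, node 2 still has a = y at t0.
replicas = {1: Cell("x", 1), 2: Cell("y", 0), 3: Cell("x", 1)}

# With a replication factor of 3, a quorum is 2 replicas; try every possible pair.
results = {pair: reconcile([replicas[n] for n in pair]).value
           for pair in combinations(sorted(replicas), 2)}
print(results)  # {(1, 2): 'x', (1, 3): 'x', (2, 3): 'x'}
```

Every pair contains at least one node that saw the newer write, so the stale value on node 2 never wins.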
Now let’s say I have tombstones. Remember, tombstones are just writes. In this case the correct answer if you query this row is null, because you deleted this data. If I query, for example, node 1 and node 2, node 2 has a = y but node 1 has a = null, a tombstone with timestamp t2. It was written later, so I know this write is newer than the other write. The correct answer here is null.

What happens now if I were to get rid of the tombstones? Who wins between timestamp t2 and timestamp t0? Timestamp t2, so the tombstone wins and the reply is null. But who wins between timestamp t0 and void? Timestamp t0, because this situation is indistinguishable from the situation in which you never deleted, you just wrote it, and only node 2 has the data. So if I get rid of the tombstones, I can’t know that this data was deleted. If you happen to query while the tombstones are still there, you will have the correct value. Then the tombstones are deleted, you query one second later, and you end up with this. And now what is the response? The response is y. Well, it could be in this case. I mean, if you query at quorum and you happen to be lucky and query nodes 1 and 3, you’re gonna have the correct answer. But assuming node 2 is part of the replica set, you’re gonna get your result back. It’s gonna come back to life.

So that’s why gc_grace_seconds exists and is set by default to 10 days. Before those 10 days are up you have to repair your cluster to make sure that this doesn’t happen. Let’s say, for instance, that your repair takes one day to run and you run it every week: that’s less than gc_grace_seconds, so you’re fine.
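The resurrection scenario can be replayed in a few lines of Python. This is a simplified model, assuming last-write-wins on integer timestamps (0 for t0, 2 for t2); the `Cell` type and the `quorum_read` helper are hypothetical and exist only for this sketch.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: Optional[str]  # None means "deleted", i.e. a tombstone
    timestamp: int


def quorum_read(replicas: dict, nodes: tuple) -> Optional[str]:
    """Last write wins among the queried replicas that hold anything for the row."""
    cells = [replicas[n] for n in nodes if replicas[n] is not None]
    if not cells:
        return None
    return max(cells, key=lambda c: c.timestamp).value


# Nodes 1 and 3 hold the tombstone (t2); node 2 missed the delete and has a = y (t0).
replicas = {1: Cell(None, 2), 2: Cell("y", 0), 3: Cell(None, 2)}
before = quorum_read(replicas, (1, 2))

# Purge the tombstones without repairing node 2 first: nodes 1 and 3 now hold nothing.
purged = {n: (None if c.value is None else c) for n, c in replicas.items()}
after_with_node2 = quorum_read(purged, (1, 2))
after_lucky = quorum_read(purged, (1, 3))

print(before, after_with_node2, after_lucky)  # None y None
```

While the tombstone exists, the read involving node 2 still answers null. Once it is purged, the same read returns y, and only the lucky pair of nodes 1 and 3 still gives the correct answer.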
This situation happens if node 2 failed the write for some reason. We try to write to all replicas, but maybe one of them fails. That’s why we need repairs in the first place. In this example we did write the tombstone, or rather we issued a delete request to all of the nodes. However, for some reason node 2 (not nodetool, node 2) did not get it. Why? I don’t know. Maybe you had a network issue, maybe the node was overloaded, maybe the node crashed and while we were bringing it up that’s when this delete happened. So we’re gonna attempt it, but the consistency level was quorum. What does that mean? As long as two nodes reply, I consider the write successful. Two nodes replied, I considered the write successful, and node 2 lost the write. It doesn’t have the write. Maybe it lost the disk and you had to replace it and the data is not there. It doesn’t have to be there.

So the biggest difference between a normal write and a deletion in this case is that the normal write has a timestamp that can be used to tell who has the newer data. But if you do a deletion, at some point you’re gonna delete the data and get rid of the tombstone. Then you find yourself in the situation in which one node has data with a timestamp and the other one has nothing, and in that case your data can come back. Which is why, before gc_grace_seconds elapses, you repair your cluster.
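One way to sanity-check a repair schedule against gc_grace_seconds is sketched below in Python. The rule used here (repair interval plus repair duration must be shorter than gc_grace_seconds) is a conservative assumption of this sketch rather than an official formula, and `repair_schedule_is_safe` is a hypothetical helper name.

```python
DAY = 24 * 60 * 60  # seconds in a day


def repair_schedule_is_safe(repair_interval: int, repair_duration: int,
                            gc_grace_seconds: int = 10 * DAY) -> bool:
    """Assumed rule: consecutive repairs must complete within gc_grace_seconds."""
    return repair_interval + repair_duration < gc_grace_seconds


# The example from the talk: a one-day repair run every week, default gc_grace_seconds.
weekly = repair_schedule_is_safe(repair_interval=7 * DAY, repair_duration=1 * DAY)

# A hypothetical schedule that is too sparse for the default.
every_ten_days = repair_schedule_is_safe(repair_interval=10 * DAY, repair_duration=1 * DAY)

print(weekly, every_ten_days)  # True False
```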