This lesson is about ScyllaDB Compaction Strategies. It covers the ScyllaDB Write Path, Reads, SSTables, why compaction is needed, and the basics of how compaction works.
Alright, so as an introduction, let's take a general overview of the way we store data in ScyllaDB and the way we compact it; we'll talk about log-structured storage and then the fundamentals of compaction.
So first, what is log-structured storage and how do we use it for writes? We start with changes to the data coming down the wire, usually over CQL. The first thing we do is record them in memory, in what we call MemTables. We also store them in a write-ahead log, which we call the commit log, but that's a little out of scope for this presentation, so I won't talk about it here. The next interesting step is that we flush the data from the MemTables down into SSTables, and updates to the data accumulate over time across these SSTables. Say we write A to some cell. It will be recorded into one SSTable, and then at a later time we write B; now we have the write of A in one SSTable and the write of B in another SSTable, and we need to take that into account.
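To make this concrete, here's a minimal Python sketch of the write path (assumed names and layout for illustration, not ScyllaDB's actual C++ internals): a mutable in-memory MemTable absorbs writes and is flushed into immutable, sorted SSTables, so successive writes to the same cell land in different SSTables.

```python
import time

class MemTable:
    """Mutable in-memory buffer; absorbs writes until it is flushed."""
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value)

    def write(self, key, value):
        self.cells[key] = (time.time(), value)

    def flush(self):
        # Emit (key, timestamp, value) triples sorted by key:
        # SSTables are Sorted Strings Tables, written once, never rewritten.
        sstable = tuple(sorted((k, ts, v) for k, (ts, v) in self.cells.items()))
        self.cells = {}
        return sstable

memtable = MemTable()
sstables = []

memtable.write("x", "A")
sstables.append(memtable.flush())  # the write of A lands in one SSTable

memtable.write("x", "B")
sstables.append(memtable.flush())  # the later write of B lands in another

# Both versions of cell "x" now sit on disk; reads must reconcile them.
```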
Now, having several versions of the same cell stored on disk is called space amplification: the same logical cell takes up room in multiple SSTables.
The SSTables themselves are immutable, meaning they are written once and never rewritten. They contain the changes to the data, which we also call mutations. Another aspect of SSTables is that they are sorted, and this is where they get their name from: Sorted Strings Tables. In addition to the actual data, they also contain metadata, like index, statistics, and filter data, that allows us to query the data efficiently.
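As an illustration of those components (an assumed layout, not ScyllaDB's on-disk format), an SSTable can be pictured like this:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)  # frozen=True mirrors SSTable immutability
class SSTable:
    # The mutations themselves, as (key, timestamp, value), sorted by key --
    # the "Sorted Strings" in Sorted Strings Table.
    data: Tuple[tuple, ...]
    # Sparse index: a sample of keys and their offsets, for efficient seeks.
    index: Tuple[tuple, ...]
    # Statistics, e.g. the key range, used to skip irrelevant SSTables.
    min_key: str
    max_key: str
    # Filter data: in practice a Bloom filter over the keys
    # (approximated in the read sketch below).
    key_filter: frozenset
```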
Now, remember as a note that there is no static view of the database. Unlike, say, relational databases, which keep the actual information in tables on disk, we only keep the changes to the data. That means reading or querying the data requires reading all the relevant SSTables that may contain pieces of, or updates to, the data. We then apply the live mutations, those that are not TTL-expired or overwritten by other mutations, onto our cache, and that's how we get the final form of the cells, rows, partitions, and so on.
Now, we use the Bloom filter to locate the relevant SSTables. This consolidation process, consolidating these mutations, is pretty expensive, because we need to access the disk several times and do the consolidation in memory. This is called read amplification, because we need to read multiple SSTables in order to form a single row.
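Here's a hedged Python sketch of that read-side consolidation, reusing the illustrative (key, timestamp, value) layout from the write-path sketch above; might_contain stands in for the Bloom filter, which may give false positives but never false negatives.

```python
def might_contain(sstable, key):
    # Stand-in for the Bloom filter metadata: a cheap membership test
    # that lets us skip SSTables that definitely lack the key.
    return any(k == key for k, _, _ in sstable)

def read_cell(sstables, key):
    candidates = []
    for sstable in sstables:             # read amplification: we may touch
        if might_contain(sstable, key):  # several SSTables for one cell
            candidates += [(ts, v) for k, ts, v in sstable if k == key]
    if not candidates:
        return None
    return max(candidates)[1]            # the newest mutation wins

sstables = [
    (("x", 1.0, "A"),),  # older write of cell "x"
    (("x", 2.0, "B"),),  # newer write of cell "x"
]
assert read_cell(sstables, "x") == "B"
```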
To summarize, why is compaction needed at all?
First, to reiterate: the SSTables are immutable, and we can't just keep writing updates to disk, or the disk will simply fill up. We have to get rid of obsolete data, meaning data that was explicitly deleted, that expired via TTL, or that was just overwritten; we still need to get rid of the overwritten data. So we need compaction to reduce space amplification. In addition, the data itself might be scattered around in different SSTables, and we would like to consolidate it to reduce read amplification.
So let's go over the fundamentals of compaction.
The first step for compaction is to select a set of SSTables to process.
The algorithm that selects the SSTables is part of the compaction strategy, so different compaction strategies may have different selection criteria. Compaction then reads these SSTables and writes the compacted output to a new SSTable. Remember, SSTables are immutable; we never rewrite them, so we have to write a new SSTable which is the compacted output of those input SSTables, and while doing so we eliminate all the overwritten, deleted, and expired data.
Eventually, when this output SSTable is sealed and safely stored, we can finally delete the input SSTables. Note that compaction requires temporary space, because we cannot delete the input SSTables until the output SSTable is safe on storage; so during compaction we will have both the input SSTables and the output SSTable on disk.
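Here's a minimal Python sketch of that cycle, under the same illustrative (key, timestamp, value) layout as above; which SSTables get selected is the strategy's business, so the sketch simply takes them as input.

```python
def compact(input_sstables):
    """Merge input SSTables into one new output SSTable."""
    newest = {}
    for sstable in input_sstables:
        for key, ts, value in sstable:
            # Keep only the newest version of each cell.
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    # The output is a brand-new, sorted, immutable SSTable.
    return tuple(sorted((k, ts, v) for k, (ts, v) in newest.items()))

sstables = [
    (("x", 1.0, "A"), ("y", 1.0, "C")),
    (("x", 2.0, "B"),),
]
selected = list(sstables)   # selection criteria come from the strategy
output = compact(selected)  # temporary space: inputs and output coexist
sstables = [output]         # inputs are deleted only once the output is safe
assert output == (("x", 2.0, "B"), ("y", 1.0, "C"))
```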
What can we eliminate during compaction? Let's go over a few examples.
Say one SSTable has a write of A into some cell and another SSTable has A', which is a newer version of the same cell. In this case we just have to keep A', the newest content for this cell.
If we only have one update to the cell, we just copy it onto the output SSTable. If we have data in a cell, C, and a !C, where !C represents a tombstone, a deletion of this data, we still need to copy the tombstone. That's pretty fundamental to compaction in Cassandra and in ScyllaDB: we can't just get rid of both the data and the tombstone, because this tombstone may need to delete other data that resides in other SSTables that we haven't compacted yet. So when can we finally get rid of tombstones?
Another example is just a tombstone on its own, which we still copy to the output. And the final example is !Z, where !Z represents a droppable tombstone, a tombstone which has already expired; in this case we can get rid of the tombstone and not copy it onto the output.
So, the tombstones are kept around for what we call the gc_grace_seconds configuration option; that gives them time to garbage collect other data, and when that period expires we can get rid of the tombstone.
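Putting the elimination rules together, here's an illustrative Python sketch of the per-cell decision during compaction; None stands for a tombstone, and gc_grace_seconds is modeled as a simple age check (an assumed representation, not ScyllaDB's internals).

```python
import time

GC_GRACE_SECONDS = 864000  # the default gc_grace_seconds: 10 days

def compact_cell(versions, now=None):
    """Decide what survives compaction for one cell.

    versions: list of (timestamp, value), where value None is a tombstone.
    Returns the surviving (timestamp, value), or None if nothing survives.
    """
    now = time.time() if now is None else now
    ts, value = max(versions)          # the newest version shadows the rest
    if value is None and now - ts > GC_GRACE_SECONDS:
        return None                    # droppable tombstone: drop it entirely
    return (ts, value)                 # live data, or a tombstone that may
                                       # still need to delete data elsewhere

# A' shadows A: only the newer write is copied to the output.
assert compact_cell([(1.0, "A"), (2.0, "A'")], now=10.0) == (2.0, "A'")
# A fresh tombstone (!C) still has to be kept, even alongside its data.
assert compact_cell([(1.0, "C"), (2.0, None)], now=10.0) == (2.0, None)
# An expired tombstone (!Z) past gc_grace_seconds is not copied at all.
assert compact_cell([(2.0, None)], now=2.0 + GC_GRACE_SECONDS + 1) is None
```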