This lesson is about the Incremental Compaction Strategy (ICS). It covers: what ICS is, SSTable runs, how it works, performance, and when it should be used.
The Incremental Compaction Strategy in a nutshell: we observed problems with the legacy compaction strategies. STCS (Size-Tiered Compaction Strategy) has high space amplification and relatively low write amplification, while LCS (Leveled Compaction Strategy) has high write amplification and low space amplification. We wanted to benefit from both approaches, so we borrowed the SSTable run concept from LCS and applied it to the size tiers of STCS. We mainly replaced the increasingly larger SSTables with increasingly longer SSTable runs, so each SSTable in STCS has a corresponding SSTable run in ICS.
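To make the "larger SSTables become longer runs" idea concrete, here is a minimal sketch in Python. The fragment size and the tier sizes are assumptions chosen for illustration only; they are not taken from the lesson.

```python
import math

# Assumed fragment size, for illustration only (not a value from the lesson).
FRAGMENT_SIZE_MB = 1000

def run_length(sstable_size_mb, fragment_size_mb=FRAGMENT_SIZE_MB):
    """Number of fragments in the run that stands in for one STCS SSTable."""
    return max(1, math.ceil(sstable_size_mb / fragment_size_mb))

# Hypothetical STCS tiers, each holding increasingly larger SSTables.
stcs_tier_sizes_mb = [1000, 4000, 16000, 64000]

# In ICS the data per tier is the same; only the number of fragments grows.
ics_run_lengths = [run_length(size) for size in stcs_tier_sizes_mb]
print(ics_run_lengths)
```

The point of the sketch is only that a tier's size shows up as run length, while every individual file stays small.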
Let's talk a bit about what SSTable runs are. First, they are an expansion of the SSTable concept. A run is sorted like a single SSTable, but it is composed of a sorted set of SSTables, and the SSTables in this set are non-overlapping: each one has its own range of partition keys. We call these individual SSTables fragments. So a run is equivalent to one large SSTable, just split into smaller SSTables. In this example, where we have a single SSTable consisting of these partition keys, it will be split into a run of SSTables, each one holding a subset of the keys of the single SSTable.

So how does ICS work with these runs? Remember that the fragments are disjoint: they have no shared keys, and they are sorted with respect to each other. That allows us to scan the runs that we are compacting together and compact them incrementally. For example, we take the first fragment from each of two runs and write an output fragment that is the compaction of these two fragments, and once that is done we can delete these input fragments, and so on and so forth. The important part is that we can eliminate the fragments as soon as we compact them.
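The idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual implementation: keys are integers, a fragment is a plain sorted list, "disk usage" is counted in keys, and the names `split_into_run` and `compact_incrementally` are hypothetical.

```python
import heapq

def split_into_run(sorted_keys, fragment_size):
    """Split one sorted SSTable (a list of keys) into a run of disjoint fragments."""
    return [sorted_keys[i:i + fragment_size]
            for i in range(0, len(sorted_keys), fragment_size)]

def compact_incrementally(run_a, run_b, fragment_size):
    """Merge two runs into one output run; return (output_run, peak_keys_on_disk)."""
    on_disk = sum(len(f) for f in run_a + run_b)  # input keys still stored
    peak = on_disk
    output_run, current = [], []

    def scan(run):
        nonlocal on_disk
        for fragment in run:
            yield from fragment
            on_disk -= len(fragment)  # fragment fully read: delete it right away

    last = None
    for key in heapq.merge(scan(run_a), scan(run_b)):
        if key == last:
            continue  # same key in both runs: keep a single copy
        last = key
        current.append(key)
        if len(current) == fragment_size:
            output_run.append(current)  # seal a full output fragment
            on_disk += len(current)
            peak = max(peak, on_disk)
            current = []
    if current:  # seal the final, possibly shorter, fragment
        output_run.append(current)
        on_disk += len(current)
        peak = max(peak, on_disk)
    return output_run, peak

run_a = split_into_run(list(range(0, 40, 2)), 4)  # even keys, 5 fragments
run_b = split_into_run(list(range(1, 40, 2)), 4)  # odd keys, 5 fragments
output_run, peak = compact_incrementally(run_a, run_b, 4)
total_input = sum(len(f) for f in run_a + run_b)
print(len(output_run), peak, total_input)
```

In this toy model the peak stays within a couple of fragments of the input size, because input fragments are dropped while output fragments are being sealed. A compaction that keeps all inputs until the whole output is written would temporarily need room for inputs plus output.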
Now, in contrast to STCS, this allows us to free up space gradually as we go. That's the main secret behind ICS.

Okay, so let me present a case study we did in our labs. The benchmark has three phases. In the first phase we just write 500 gigabytes of unique data, and when doing so we see the space amplification overhead of STCS, here in green, while the space amplification with ICS is barely noticeable; it's really in the noise.

The next phase rewrites about half of the data, 250 gigabytes, over and over again, and again we see the benefits of ICS, here in purple, versus STCS, which has these high peaks.

In the third and final phase we run a major compaction, which takes the data set back down to 500 gigabytes of unique data. We see a very high peak with STCS, which reaches almost 1.5 terabytes, almost three times the data size, while with ICS the space drops rapidly to the net size of the data.

So that's the bottom line for ICS: we improved space amplification and got rid of these high peaks, and that allows you to run at much higher utilization. We can recommend running at around 80% of your disk space, versus the recommendation to run at 50% with STCS.
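The arithmetic behind those numbers can be checked with a short Python sketch. The figures are the rounded ones quoted above (about 1.5 TB peak for 500 GB of data, and 50% versus 80% utilization); the helper names are hypothetical.

```python
def space_amplification(peak_disk_gb, net_data_gb):
    """Peak disk usage divided by the net size of the data."""
    return peak_disk_gb / net_data_gb

def required_disk_gb(net_data_gb, max_utilization_percent):
    """Disk size needed so that the net data fills at most the given percentage."""
    return net_data_gb * 100 / max_utilization_percent

# STCS peak during major compaction, using the rounded figure of 1.5 TB.
stcs_amp = space_amplification(1500, 500)

# Disk needed for 500 GB of net data under each recommendation.
disk_stcs_gb = required_disk_gb(500, 50)
disk_ics_gb = required_disk_gb(500, 80)

print(stcs_amp, disk_stcs_gb, disk_ics_gb)
```

Under these assumptions, the same 500 GB data set needs roughly 1000 GB of disk at 50% utilization and roughly 625 GB at 80%.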