This lesson is a deep dive into Size Tiered Compaction Strategy (STCS). It covers how STCS works, its write, read, and space amplification, an example, and considerations for when to use it.
What is Size Tiered Compaction Strategy? As the name implies, you need to choose which SSTables to compact together, and the idea is to find SSTables that are more or less the same size. So if I have one SSTable that is one gigabyte in size, another SSTable that is 900 megabytes, and a third SSTable that is 4 terabytes, I will pick the first two, because they are roughly the same size, and compact them together. The idea is that over the lifetime of the data it moves up through tiers: compactions in the lower tiers are very fast, while compactions in the higher tiers are expensive but happen much less frequently. When do they happen? When you have a certain number of SSTables in a tier, say four or five, depending on configuration. Once a tier reaches that number, you pick those SSTables and compact them together. Maybe they generate a zero-size SSTable because all the data was expired, or maybe they generate a bigger SSTable that moves up a tier and eventually starts a compaction in the next tier.
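To make the bucketing idea concrete, here is a minimal sketch of how SSTables could be grouped by similar size and a bucket picked for compaction. This is only an illustration of the technique, not ScyllaDB's actual implementation; the factor-of-two size window and the four-SSTable threshold are assumptions chosen for the example.

    # Illustrative sketch of size-tiered bucketing, not ScyllaDB's real code.
    # Assumption: SSTables within ~2x of a bucket's average size share a tier,
    # and a tier is compacted once it accumulates 4 SSTables.
    def bucket_by_size(sstable_sizes_gb, bucket_ratio=2.0):
        buckets = []  # each bucket is one tier: a list of SSTable sizes
        for size in sorted(sstable_sizes_gb):
            for bucket in buckets:
                avg = sum(bucket) / len(bucket)
                if avg / bucket_ratio <= size <= avg * bucket_ratio:
                    bucket.append(size)
                    break
            else:
                buckets.append([size])
        return buckets

    def pick_compaction_bucket(buckets, min_threshold=4):
        # Return the first tier that has enough SSTables to be worth compacting.
        return next((b for b in buckets if len(b) >= min_threshold), None)

    sizes_gb = [1.0, 0.9, 1.1, 0.95, 4096.0]  # four ~1 GB SSTables and one 4 TB SSTable
    print(pick_compaction_bucket(bucket_by_size(sizes_gb)))
    # -> [0.9, 0.95, 1.0, 1.1]: the similar-sized SSTables get compacted together,
    #    while the 4 TB SSTable is left alone in its own tier.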
Size Tiered Compaction Strategy is the default compaction strategy in ScyllaDB, and the one takeaway is that unless you are in a specialized use case, it is probably the best one for you. Part of my job today is to convince you of that: you may think some other compaction strategy is great for your use case, and for a fraction of you it will be, but a lot of people who think "size tiered isn't great for my use case because of this and that" can in fact use it. Size tiered is the default, and a lot of our customers who change it end up changing back, so I'll try to explain why that's the case.

So what is the write amplification for size tiered? Once more, write amplification is the number of times I have to rewrite the same data. For size tiered, it is the number of tiers through which my data has to be rewritten. That doesn't mean every piece of data will be rewritten that many times, but in terms of complexity analysis, if an SSTable was just written and I have four tiers, that data may have to be rewritten four times, because it needs to move all the way up to the last tier. So the write amplification is essentially log N, which is really good: you keep accumulating more and more data, but the write amplification only grows with the log of the amount of data. It does grow with the size of your data, but much more slowly than the data itself.

Why four tiers? ScyllaDB is sharded, so every shard has its own SSTable set, and when you think about what four tiers can hold, with the first tier being, say, one terabyte and the second tier ten terabytes, that's a lot of data. So four tiers is a reasonable cap to assume on the number of tiers; you can get more than that, but four is a reasonable assumption. It means that if you write data now, you may have to rewrite it four times, even if that data itself never needs to change. That's the write amplification.
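As a back-of-the-envelope illustration of why the write amplification stays roughly logarithmic, here is a small sketch. The one-terabyte starting tier and the roughly 10x growth per tier are assumptions taken from the example above, not fixed ScyllaDB parameters.

    import math

    # Rough illustration: if each tier holds about 10x more data than the tier
    # below it, the number of tiers (and so the STCS write amplification) grows
    # as the logarithm of the total data size. Numbers are illustrative only.
    first_tier_tb = 1.0       # assume the lowest tier holds about 1 TB
    growth_per_tier = 10.0    # assume each tier is ~10x the previous one

    for total_tb in (1, 10, 100, 1000):
        tiers = 1 + math.ceil(math.log(total_tb / first_tier_tb, growth_per_tier))
        print(f"{total_tb:>5} TB of data -> about {tiers} tiers -> data rewritten about {tiers} times")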
Now, the read amplification is very interesting, because a lot of people think the read amplification for size tiered is very bad since there are a lot of files, but that isn't necessarily true. What is the read amplification for size tiered if I have an append-only workload? If the workload is append-only, my data is all unique, so I can have a billion files and it doesn't matter: any given piece of data lives in exactly one of them, and the bloom filters we talked about make sure I'm not going to touch the others. It doesn't mean I'll access only one file, because bloom filters still give false positives, but the chance of a false positive is less than 1%. So if I have an append-only workload, one in which I'm always inserting new data (in real life this also covers workloads that insert mostly new data, but for the purposes of this explanation we count it as only new data), then the read amplification of size tiered isn't bad at all. It's actually great: the data is in one file, you read from that file, and whether there are a thousand files on disk doesn't matter, because the bloom filters filter them out for me.

If I have an update-heavy workload, that's different. With an update-heavy workload, if I have a hundred files, it may well be the case that I have to go into all hundred of them, because the bloom filters will tell me my data is there, and it really is there: maybe there's an update here and another there, or maybe, if I have a wide row, pieces of that partition are spread across many files.
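To put rough numbers on this, here is a quick expected-value calculation. The 1% bloom filter false positive rate comes from the explanation above; the file counts are made up for illustration.

    # Expected number of SSTables actually read per lookup, assuming a bloom
    # filter false positive rate of about 1% (illustrative, not measured).
    def expected_sstables_read(total_sstables, sstables_holding_key, fp_rate=0.01):
        misses = total_sstables - sstables_holding_key
        return sstables_holding_key + misses * fp_rate

    # Append-only workload: the key lives in exactly 1 of 100 SSTables.
    print(expected_sstables_read(100, 1))    # 1.99 -> about two SSTables touched
    # Update-heavy workload: pieces of the partition live in 50 of the 100 SSTables.
    print(expected_sstables_read(100, 50))   # 50.5 -> most of those files really are read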
Now, the biggest problem of size tiered compaction, the thing that really drives people away from it, is space amplification. What is space amplification? It comes from two places. The first, again in the case of overwrites, is how many copies of the same data I'm keeping, because I'm using that space more than once; this changes a little if I compact faster and manage to keep it in check. But the biggest problem with space amplification is the fact that while I'm compacting files, I cannot delete the original files until the compaction is done, especially because if the node crashes at that moment I still need the old files; I can't really produce partial results. More than that, until the compaction finishes I'm still reading from the old files, because the new file is not ready and cannot be read from yet. So space amplification is how much space I need in my system while I am compacting.

People say the space amplification for size tiered is horrible, which is true, but I would just like to remind you that this is the worst-case space amplification. A lot of you sometimes freak out: my disk is over 50% full, should I do something about it? Yes. Is it the most urgent issue for you? Probably not, because again this is the worst case. And even if you do hit that case, the compaction will fail and will be retried, so it's not necessarily the end of the world. It doesn't necessarily mean you have to act right now, but it does mean that at that point you need to expand your cluster.

Why 50%? Because in the worst case, and again that is the worst case, all of your data is being compacted: you are compacting all of your data into a single SSTable. Once more, if you're update-heavy, why would that hurt? You'd be merging a lot of overwrites into one, so in an update-heavy workload you don't hit this worst case, because you're expiring data and merging data. In the worst case you are append-only, all of your data is unique, you're moving your entire data set into a new file, and you have a single table. A single table is the worst case because if you have two tables you don't compact them at the same time; we actually guarantee that, per tier, only one table is compacted at a time. So in the worst case you have one table that you're completely rewriting to the output, with no overwrites, and in that case you do have to leave 50% of the disk free.
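To see where the usual advice of keeping roughly half the disk free comes from, here is a tiny worst-case calculation; the data set size is made up for illustration.

    # Worst case for STCS space amplification: an append-only, single-table workload
    # where all existing data is compacted into one new SSTable. The input SSTables
    # cannot be deleted until the output is complete. Numbers are illustrative.
    data_set_tb = 4.0                # live data on this node
    output_tb = data_set_tb          # append-only: nothing merges away or expires
    peak_disk_usage_tb = data_set_tb + output_tb
    print(f"peak usage during the compaction: {peak_disk_usage_tb} TB "
          f"({peak_disk_usage_tb / data_set_tb:.0%} of the live data)")
    # -> 8.0 TB, i.e. 200% of the data set, which is why ~50% of the disk should stay free.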
Now let's look at an example. In this example we're just running cassandra-stress, writing 30 million partitions, which is about nine gigabytes of data, at a constant write rate. The exact rate doesn't matter because we're going to be looking at disk space, and this is just one chart, so we can focus on what's happening. The chart shows size tiered over time. Because it's an append-only workload, the space never shrinks when you compact; that's exactly why we picked this workload. You can see that in the beginning you're compacting maybe two or three SSTables together and they're small; then you do another compaction, the fourth spike, which is a bit bigger; and then, in the worst case, the big peak, you are compacting all of your data from the input to the output. Once more, once the compaction is done your disk space comes back, so it's not gone forever, but while the compaction is going on you do need this buffer space.

So one case where people legitimately shy away from size tiered is: "I want to use more of my disk; I am disk-bound, and the amount of data is the thing I care about most, because I'm storing maybe a petabyte of data and that has a cost." That's a valid reason not to use size tiered.