The lesson provides a deep dive into Leveled Compaction Strategy (LCS): how it works, its write behavior, examples, performance considerations for when to use it, space and write amplification, and common misconceptions.
Now let's look at leveled compaction strategy. This is usually where most of the misconceptions come from, so that's where I would like to spend most of the time.
How does leveled compaction strategy work? When an SSTable is created, it is created in a special level that we call level zero. That's essentially whatever was in your memtable, now in your SSTable; nothing fancy.
It's not yet part of what we call the level invariant. Leveled compaction strategy has an invariant, and that invariant means the following. The number of SSTables is fairly big, so if you're using leveled compaction strategy, don't get scared by the number of SSTables. But there is a property: every SSTable in your system, if you're using leveled compaction strategy, has a fixed size of 160 megabytes. (If you have one partition that is bigger than 160 megabytes, then sure, that SSTable is bigger, but that's the granularity at which we are dividing things; let's set that case aside so we can understand the concepts.) Each SSTable is 160 megabytes, again per shard, etc. The next level, and we're going to keep having levels, is 10 times bigger than the previous level, and this is always true. That means the next level has ten times more SSTables than the level before, and those SSTables have a very interesting property: they are disjoint. If my tokens go from A to Z, this SSTable is guaranteed to have only A to C, and another SSTable is guaranteed to have, say, D to F, depending on how much data you have in your token range, but they are guaranteed to be disjoint. So it doesn't matter that I have a lot of SSTables; I don't even need bloom filters for them, because I know that my key cannot be in all the others.

So when I generate a new level, I pick an SSTable and compact it into the next level. But here lies the disadvantage: I now may have to rewrite the entire level to comply with this invariant. Every time I move data from one level to the other, I have to rewrite, not necessarily the entire level, but potentially a lot of it, and in the worst case the entire level. If I don't have to rewrite the entire level, the normal thing is to rewrite ten SSTables. Why? Because the next level is 10 times bigger, which means each of its SSTables covers ten times less of your token range. So if my level has A to F, in the next level that range is going to be spread, on average, over ten SSTables. Every time I move an SSTable to the next level, I will have to rewrite 10 SSTables, and of course you can have more promotions coming out of that, a chain of events that keeps moving up and up and up, but I am always going to have to rewrite ten SSTables in the next level.
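To make the invariant concrete, here is a minimal sketch (not ScyllaDB's actual code) using the numbers from above: a fixed 160 MB SSTable size and a fanout of 10, with tokens modeled as floats in [0, 1).

```python
SSTABLE_SIZE_MB = 160  # fixed size of every leveled SSTable (per the talk)
FANOUT = 10            # each level is 10x bigger than the previous one

def sstables_in_level(level: int) -> int:
    """Level 1 holds 10 SSTables, level 2 holds 100, and so on."""
    return FANOUT ** level

def token_range(level: int, index: int) -> tuple[float, float]:
    """SSTables within a level cover disjoint slices of the token range,
    so a key can live in at most one SSTable per level -- no bloom filter
    is needed to rule out the others."""
    width = 1.0 / sstables_in_level(level)
    return (index * width, (index + 1) * width)

# Promoting one SSTable from level N overlaps ~FANOUT SSTables in level N+1,
# all of which must be rewritten to keep the invariant.
for level in range(1, 5):
    size_gb = sstables_in_level(level) * SSTABLE_SIZE_MB / 1024
    print(f"L{level}: {sstables_in_level(level)} SSTables, {size_gb:.1f} GB")
```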
So again, as the figure with four levels shows, level zero is special: we wait for some SSTables to accumulate in level zero and promote them all together, again to reduce write amplification. But here's the key concept: in level zero, this invariant doesn't exist. If I have five SSTables in level zero, you may have to read from those five SSTables. That's very crucial.

What's the read amplification for leveled compaction strategy? The worst case is the number of levels, which we will once again assume to be four (maybe you have five, a lot of people have three, but four gets you a couple of terabytes of data per node), plus whatever you have in level zero. This is the part of leveled compaction strategy that I see a lot of people omitting, and that's where the misunderstandings come from. "Oh, my worst case is 4 SSTables." No, it's not: it's 4 SSTables plus whatever you have in level zero.
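As a quick back-of-the-envelope helper (a sketch, not anything from ScyllaDB's codebase), the worst-case read amplification is just the level count plus the L0 count:

```python
def worst_case_read_amp(levels: int, l0_sstables: int) -> int:
    """Worst-case number of SSTables a read may touch under LCS: one per
    leveled level (their ranges are disjoint) plus every SSTable sitting
    in level zero, which does not follow the invariant."""
    return levels + l0_sstables

print(worst_case_read_amp(levels=4, l0_sstables=0))  # 4 -- the number people quote
print(worst_case_read_amp(levels=4, l0_sstables=5))  # 9 -- what you actually get with 5 L0 SSTables
```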
Level zero doesn't follow the invariant. Now, the space amplification is great: 90% of the data is guaranteed to be in the last level. Plus level zero, always plus level zero. And to actually compact data, you don't need a percentage of your disk free; you need a fixed amount of space. Why? Because I'm only compacting one SSTable at a time, which is 160 megabytes, against 10 SSTables on the other side that are also 160 megabytes each. So the amount of data I'm holding is essentially 11 times 160 megabytes, and in the worst case you do this for all shards at once, so it's the number of shards times 11 times 160 megabytes. That's much, much better than 50%, unless your disk is so small that this fixed amount is itself around 50% of it, but if you have a disk like that, you're probably not here.
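To put that fixed amount in numbers, here is a hedged sketch; the 16-shard node is an assumed example, not a figure from the talk:

```python
SSTABLE_SIZE_MB = 160
FANOUT = 10

def compaction_headroom_mb(shards: int) -> int:
    """Temporary disk space LCS needs while compacting: per shard, one input
    SSTable plus up to FANOUT overlapping SSTables in the next level are
    held at once -- a fixed amount, not a percentage of the disk."""
    return shards * (1 + FANOUT) * SSTABLE_SIZE_MB

# e.g. an assumed 16-shard node needs ~27.5 GB of headroom, whatever the disk size
print(compaction_headroom_mb(shards=16) / 1024, "GB")  # 27.5 GB
```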
So the biggest problem with leveled compaction strategy is the write path. Why exactly? Because of all of those promotions. With size-tiered, when I have an SSTable and I compact it, I'm moving up one tier. But here, in the worst case, I'm going to have to rewrite data all the way up through the levels. Let's assume my level-zero SSTable didn't trigger a full rewrite of the entire data set and focus on the case in which just one key is going up. I will have to go up four levels, and every time I go up a level, I have a write amplification of 10: because of that one key, I'm rewriting 10 other SSTables that have nothing to do with my key. I take the SSTable that I have here, I move it to the next level, and as we discussed, that's 10 SSTables I have to rewrite. That means the write amplification for leveled compaction strategy is a factor of 40, which is 10 times bigger than size-tiered, which is usually a factor of 4. So this is 10 times worse than size-tiered from the point of view of write amplification: I'll have to keep rewriting my data.
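The factor of 40 falls straight out of the arithmetic; a one-line sketch:

```python
FANOUT = 10  # each promotion rewrites ~10 SSTables in the next level

def lcs_worst_case_write_amp(levels: int) -> int:
    """A key that climbs every level pays ~FANOUT rewrites per promotion."""
    return FANOUT * levels

print(lcs_worst_case_write_amp(levels=4))  # 40, vs. a typical factor of ~4 for size-tiered
```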
The key to understanding this: you have to have this invariant for the properties to hold. The invariant must be there, which means you always have to keep rewriting your data into a certain structure, and that is what allows you to have those nice properties of leveled compaction strategy. And you can see it in the picture: if you run the same thing that we ran in the previous example, but now with leveled compaction strategy, then every time you move up a level it's a constant amount of space that I need, level after level after level; I never really have the big spike over there.
So again, leveled compaction strategy is really, really great for space amplification. But look at the same experiment, which, as you're going to see, is not even the worst case (the worst case is 4 versus 40), where we tracked the total amount of disk writes that happened. In this experiment, the total amount of writes for size-tiered was 50 gigabytes, while the total amount of data ingested was around 9 gigabytes. So to write nine gigabytes of data in this experiment, I had to write 50 gigabytes to disk, including all the rewrites. Again, this is an append-only workload; with overwrites it would be different. Leveled compaction, in the same scenario: 111 gigabytes. And as I said, in the worst case it's 10 times worse. So the problem with leveled compaction is the write amplification.
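Dividing the measured disk writes by the ~9 GB of ingested data gives the observed amplification in this (non-worst-case) experiment:

```python
data_gb = 9          # logical data ingested in the experiment
stcs_writes_gb = 50  # total disk writes under size-tiered
lcs_writes_gb = 111  # total disk writes under leveled

print(f"STCS write amplification: {stcs_writes_gb / data_gb:.1f}x")  # ~5.6x
print(f"LCS  write amplification: {lcs_writes_gb / data_gb:.1f}x")   # ~12.3x
# Far from the 40x worst case, but still roughly twice as bad as size-tiered here.
```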
Update-heavy workloads are interesting, because with an update-heavy workload you very seldom go to the last level. You keep having this factor-of-10 thing per promotion, but maybe you can keep most of the data in L0 and L1: you get your L0 SSTable, you move it to L1, and then, because everything is an overwrite anyway, you might not need to go to the next level. It's hard to quantify, because it depends on the shape of your data, but if you have an update-heavy workload, maybe leveled compaction strategy can do the trick for you, because you might not have to keep going to the last level all the time. You still have this horrible worst-case write amplification, but you might never need to pay it.
Now, what are the common LCS misconceptions? The first one: leveled compaction strategy is good for read-mostly workloads. If it's read-only, actually, every compaction strategy is good: you just compact everything into a single SSTable beforehand and there you go, you have one SSTable. So "leveled compaction is good for read-mostly workloads" is false as stated, but there are questions we should ask to qualify it. When people talk about read-mostly, they usually come to us with a number: "oh, 90% reads." If I have 90% reads, is leveled compaction good for me? If I have 99% reads, is leveled compaction good for me? I don't know. Would you like to have 1% of the GDP of a country? Depends on the country: if I can have 1% of the GDP of the United States, sure, why not, but other countries might be less impressive. It's the same thing here. We're talking about percentages, but what really matters is the absolute amount of writes the disk is taking. And why is that?
Because the properties of leveled compaction only hold if the invariant is static: if you can get to that invariant, if the SSTables are in place and not changing, then you can have all those nice properties. But if you can't, you're going to have to read from all of your L0 SSTables, and they're going to start to pile up, because the invariant is in constant flux. The key to understanding this is that you're going to use a lot of disk bandwidth to keep writing and rewriting this data.

To put some numbers on this: let's say you are reading from 4 SSTables with leveled compaction strategy, but from 6 SSTables with size-tiered compaction strategy. Is leveled better? Not necessarily. If I'm still compacting all the time, I'm reading from 4 SSTables, but I'm using a lot of the disk bandwidth to do so, and I only have, say, one gigabyte per second left to do my reads. With size-tiered, because the write amplification is better, maybe I'm reading from 6 SSTables but I have 2 gigabytes per second to do so. So it depends.
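A toy model of that trade-off (the 4 KB read size and the bandwidth figures are illustrative assumptions, not measurements):

```python
def reads_per_second(spare_bandwidth_gb_s: float, sstables_per_read: int,
                     read_kb: float = 4.0) -> float:
    """Toy model: each logical read touches `sstables_per_read` SSTables,
    paying `read_kb` of disk traffic per SSTable, out of whatever bandwidth
    compaction has left over."""
    per_read_gb = sstables_per_read * read_kb / (1024 * 1024)
    return spare_bandwidth_gb_s / per_read_gb

# LCS: fewer SSTables per read, but compaction has eaten more of the bandwidth
print(f"LCS : {reads_per_second(1.0, sstables_per_read=4):,.0f} reads/s")  # ~65,536
# STCS: more SSTables per read, but twice the spare bandwidth
print(f"STCS: {reads_per_second(2.0, sstables_per_read=6):,.0f} reads/s")  # ~87,381
```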
You might be thinking that doesn't help much: "I still need to make a decision." The usual thing we tell users to make this decision: look at your monitoring. If you have compactions going on all the time with leveled compaction strategy, get out of leveled compaction strategy; leveled compaction strategy is good when you look at your monitoring and only every now and then you see compactions. Does that mean a 50-percent-write workload is ruled out? Not necessarily: if you don't have a lot of reads, 50 percent writes is also not a lot in absolute terms, and maybe you have overwrites, so your compactions do happen but they're always contained in level zero; that's fine too. So the percentage of reads is not the interesting thing to look at. What we should look at is: in my monitoring, am I compacting all the time or not? If you are, leveled compaction is probably not good for you, because you have those properties, but you are using a lot of disk bandwidth to guarantee them, when you could just not have the properties and use that bandwidth to do reads from more SSTables.
Now, the second misconception: "I want to use leveled compaction strategy because I'm interested in read latencies; I am fascinated with read latencies, and I like the fact that leveled compaction strategy guarantees I am going to read from at most 4 SSTables." That's also not true. First, because, as I said, it's 4 plus L0; and second, this is only true while the invariant is kept. If your invariant is in flux, you're going to be reading from more. If your invariant is so damaged, because you can't keep up, that you have, say, 32 SSTables in level zero, now you are much, much worse off than size-tiered, because you're going to be reading from those 4 SSTables in your current invariant plus the 32 SSTables accumulated in L0 that you could not compact because you don't have the disk bandwidth.
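Plugging that damaged-invariant case into the read-amplification sketch from earlier:

```python
# Reusing worst_case_read_amp from the sketch above:
print(worst_case_read_amp(levels=4, l0_sstables=32))  # 36 -- far worse than size-tiered's ~6
```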
That doesn't mean the properties people talk about for leveled compaction strategy don't hold; it doesn't mean they're not true. It means that people usually forget about the price you pay to get those invariants. And once more, the best way to determine this is to look at your monitoring: see if you're compacting all the time, and if you are, leveled compaction strategy is not good for you.

Overwrite workload: this example shows, again, space amplification, around 1.2 gigabytes, and you can see the properties of leveled compaction strategy versus size-tiered compaction strategy, with leveled compaction strategy once more in green. This is one example in which, because the workload is update-heavy, I don't have to keep going up to more and more levels all the time. So those are things for you to take into consideration as well.