Welcome, everybody. I'll be talking to you today about compaction strategies. This is the advanced track, so I do have a few slides at the beginning about the write path; last year we didn't have time. This presentation is, by the way, very similar to last year's, so if you were here last year for my compaction talk I don't think you'll learn much, because I'm a very good instructor and you already learned everything. So everybody knows what compactions are, right? In the interest of time, so we can cover everything, I'm not going to cover the write path or what compactions are. One thing that is very important for understanding the concepts we're going to talk about is bloom filters. Bloom filters are probabilistic filters that can be used to query whether or not a key exists in a file. Because they are probabilistic, they don't give you a definitive answer in both directions: if a bloom filter tells you the key is not in the file, you can be 100% sure that's true, so you can skip that file and you don't need to read it. But if it tells you yes, that could still be a false positive, and you do have to look into the file. That's the sense in which they're probabilistic: a "no" is definitive and lets you skip the file, a "yes" only means the key might be there. This is a very important read-path optimization.
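To make that concrete, here is a minimal, self-contained sketch of the idea in Python. It is not ScyllaDB's actual bloom filter implementation; the bit count, hash count, and key names are made up for illustration:

```python
# Minimal bloom-filter sketch (illustration only, not ScyllaDB's implementation).
# "Maybe present" can be a false positive; "definitely absent" is always true.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> the key is definitely not in the file, skip it.
        # True  -> the key may be in the file (possible false positive).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:999"))  # False (or, rarely, a false positive)
```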
Some of you are usually concerned: oh, what happens if I have a thousand files? That's not necessarily a problem, because the bloom filters reduce how many files you actually have to access. The fact that you have a thousand files doesn't mean you need to read a thousand files; you only read a file if it's at all possible that your key is there. So when we talk about the read path and read amplification, bloom filters are one of the concepts we're going to need. Again, the write path we're not going to cover: very quickly, we write into memtables, those memtables are flushed to sstables, and sstables are immutable, they never change, which means we need to compact them every now and then, depending on the structure. That's how it happens, and you all told me you know what compactions are.
Now, what's important when we talk about compaction, and what we're going to use today to classify and compare compaction strategies, comes down to three things. The first is write amplification. Write amplification is the number of times you have to rewrite the same data: I wrote data to disk, and now I have to rewrite that data again. Why do I have to write it again? Because it was in one sstable file and I'm compacting it into another sstable file, sstables being immutable. This is especially painful when the data didn't really need to be rewritten at all; it just happens to live in a file that I'm compacting. That's write amplification.
We also have read amplification, which is how many files I have to read when I would have been better off reading from only one. If my data is now split across two files, I have a read amplification factor of two, because I'd be better off if that data were all in a single file. And there's also space amplification. Again, sstables are immutable, so if you write something and then write the same thing again, you end up with two copies of it on disk until you compact them back down to one. Space amplification is exactly that: in this scenario you have a space amplification of two. All right, and the sstable merge itself is efficient.
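As a toy illustration of the three metrics (the 9 GB ingested / 50 GB written figures are borrowed from the experiment shown later in the talk; everything else is made up for the example):

```python
# Toy illustration of the three amplification metrics.

# Write amplification: total bytes written to disk / bytes of user data.
user_bytes_written = 9            # GB the application actually inserted
total_bytes_written = 50          # GB written to disk, including compaction rewrites
write_amplification = total_bytes_written / user_bytes_written    # ~5.6

# Read amplification: sstables touched per read vs. the ideal of one.
sstables_read_for_one_key = 2
read_amplification = sstables_read_for_one_key                    # 2

# Space amplification: bytes on disk / bytes of live (unique) data.
bytes_on_disk = 18                # GB, e.g. the same rows written twice, not yet compacted
live_data = 9                     # GB of unique data
space_amplification = bytes_on_disk / live_data                   # 2

print(write_amplification, read_amplification, space_amplification)
```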
One thing I'd like to highlight, and a lot of you using ScyllaDB have seen this to be true: we have the schedulers and the controllers, so compactions get their own time slice, and the scheduler controls whether you're compacting and for how long. One of the goals of compaction is to keep tail latencies low even while compactions are running, precisely because the schedulers and controllers isolate them. That doesn't mean, and this is a common misconception, that your latency won't go up at all: if you weren't doing anything and now you're doing something, your latency will go up. The goal is that it doesn't go up wildly or by too much.
We saw that compaction essentially means you grab a bunch of files and combine them into fewer files. It doesn't have to be one, by the way: the output of a compaction can be more than one file. But the goal is usually to reduce your state so that you have lower, and therefore better, read amplification in the future. So how do we choose those files? Because if you always compact all the files into one, you also increase write amplification by too much. Imagine that every time you have a new sstable you compact it against your entire set: your write amplification would be terrible. So the goal is essentially to reduce read amplification as much as you can while keeping write amplification under control.
That means every workload benefits from a different strategy for picking those files, and that's where compaction strategies come into play. ScyllaDB supports a fair number of compaction strategies, and that's essentially the goal of this presentation: understand what they do, because those are the criteria by which we'll judge them, and once you understand what they do, you'll understand which one is best for your use case. We're going to talk about the size-tiered compaction strategy, the time-window compaction strategy, the leveled compaction strategy, and finally one compaction strategy that is soon to debut in ScyllaDB Enterprise: the incremental compaction strategy. Its code has essentially been ready for a while, but we wanted to make sure some of its properties hold in scenarios where testing showed they weren't holding; we're almost there.
So what is the size-tiered compaction strategy? As the name implies, you need to choose which sstables to compact together, and the idea is: I'm going to try to find sstables that have more or less the same size. If I have an sstable that is one gigabyte, another that is 900 megabytes, and another that is 4 terabytes, I'll look at the first two, see that they're roughly the same size, and compact them together. The idea is that over the lifetime of the data it moves up through those tiers: some compactions are very fast because they're in the lower tiers, and some compactions in the higher tiers are very expensive but happen less frequently. Every now and then they do happen. When? When you have a certain number of sstables in a tier: say four or five, depending on configuration. Once you have that many sstables in a tier, you pick them and compact them together. Maybe they produce a zero-size sstable because all the data was expired, or maybe they produce a bigger sstable that moves up a tier and eventually triggers a compaction in the next tier.
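Here is a rough sketch of that selection logic, assuming simplified size grouping and a fixed threshold of four; the real strategy has more knobs, so treat this purely as an illustration:

```python
# Rough sketch of size-tiered bucketing (simplified; real thresholds/options differ).
def buckets_by_size(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sstables whose sizes are within ~50% of a bucket's average size."""
    buckets = []   # each bucket is a list of sstable sizes
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def pick_compaction(buckets, min_threshold=4):
    """Compact a bucket once it has accumulated enough similarly sized sstables."""
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None

sizes_gb = [1.0, 0.9, 1.1, 0.95, 4000.0]           # four ~1 GB sstables and one 4 TB sstable
print(pick_compaction(buckets_by_size(sizes_gb)))  # the four ~1 GB sstables get compacted together
```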
What is the write amplification of the size-tiered compaction strategy? Size-tiered is the default compaction strategy in ScyllaDB, and the one takeaway is that unless you're in a specialized use case, it's probably the best one for you. Part of my job today is actually to convince you of that: maybe you think some other compaction strategy is great for you, and for a fraction of you it will be, but a lot of people who think "size-tiered isn't great for my use case because of this and that" can in fact use it. It is the default, and a lot of our customers who change it end up changing back, so I'll try to explain why. So, what is the write amplification for size-tiered? Once more, write amplification is the number of times I have to rewrite the data, and for size-tiered it's the number of tiers through which I have to rewrite it. It doesn't mean I always have to, but when you're doing complexity analysis: if an sstable was just written and I have four tiers, I may have to rewrite that data four times, because it needs to be rewritten all the way up to the last tier. So it's essentially log N.
That's really good, because you're going to keep accumulating more and more data, yet the write amplification is still the log of the amount of data: it doesn't grow as fast as your data grows. It does grow with the size of your data, but not by much. ScyllaDB is sharded, so every shard has its own sstable set, and when you think about how much data four tiers can hold per shard, four is a reasonable cap to assume on the number of tiers. You can get more than that, but four is a reasonable assumption. It means that if you write data now, you may have to rewrite it four times, even if that data never actually needed to be rewritten. That's the write amplification.
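A back-of-the-envelope sketch of why the tier count, and hence the write amplification, grows only logarithmically; the per-shard data size, flush size, and fanout below are assumptions chosen to land on the talk's "four tiers" working figure:

```python
# Toy model: write amplification for size-tiered is roughly the number of tiers,
# and the number of tiers grows with the log of the data per shard.
def stcs_tiers(data_bytes, flushed_sstable_bytes, tier_fanout=4):
    """Each tier holds sstables roughly `tier_fanout` times bigger than the tier below."""
    tiers, tier_size = 1, flushed_sstable_bytes
    while tier_size < data_bytes:
        tier_size *= tier_fanout
        tiers += 1
    return tiers

GB = 1024 ** 3
# e.g. ~64 GB per shard (say a ~1 TB node split across 16 shards), flushing ~1 GB sstables:
print(stcs_tiers(64 * GB, 1 * GB))   # 4 -- matching the talk's "four tiers" assumption
```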
Now, the read amplification is very interesting, because a lot of people think the read amplification for size-tiered is very bad since there are a lot of files, but that isn't necessarily true. What is the read amplification for size-tiered if I have an append-only workload? If the workload is append-only, my data is all unique, so I can have a billion files and it doesn't matter: my data is in exactly one of them, and the bloom filters we talked about make sure I don't touch the rest. That doesn't mean I'll only ever access one file, because bloom filters still give false positives, but the chance of a false positive is below 1%. So if I have an append-only workload, one where I'm always inserting new data (in real life "mostly new data" is fine too, but we'll count it as only new data for the purposes of this explanation), then the read amplification of size-tiered isn't bad at all. It's actually great: my data is in one file, I read from that file, and whether there are a thousand files on disk doesn't matter, the bloom filters filter them out for me.
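As a quick sanity check of that claim, the expected number of sstables touched per point read in the append-only case is roughly one real hit plus a small false-positive tax per extra file (the 1% rate is the figure used in the talk):

```python
# Expected sstables touched per point read when every key lives in exactly one sstable
# (append-only): one real hit plus a ~1% bloom-filter false positive per other sstable.
def expected_sstables_read(num_sstables, false_positive_rate=0.01):
    return 1 + (num_sstables - 1) * false_positive_rate

print(expected_sstables_read(10))     # ~1.1
print(expected_sstables_read(1000))   # ~11 -- still nowhere near reading 1000 files
```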
Now, if I have an update-heavy workload, that's different, because if I have a hundred files it could well be that I have to go into all hundred of them: the bloom filters will tell me my data is there because it genuinely is, maybe an update here and there, or, if I have a wide row, pieces of the partition spread across many files. But the biggest problem of size-tiered compaction, the thing that does drive folks away from it, is space amplification. Where does the space amplification come from? From two places. The first, again in the case of overwrites, is how many copies of the data I'm holding, so I'm using the space twice; that changes a little if I compact faster and manage to keep it in check. The bigger problem, though, is that while I'm compacting files, I cannot delete the original files until the compaction is done, especially because if the node crashes at that moment I still need my old files, so I can't really do partial deletions. More than that, until the compaction is done I'm still reading from the old files, because the new file isn't ready and can't be read from.
So space amplification is about how much space I need to have in my system while I'm compacting. People say the space amplification for size-tiered is horrible, which is true, but I'd just like to remind you that this is the worst-case space amplification. A lot of you sometimes freak out, "my disk is filling up", but is this the most urgent issue for you? Probably not, because again this is the worst case, and even if you do hit it, the compaction will fail and be retried, so it isn't the end of the world. It doesn't necessarily mean you need to do something right now, but it does mean that at that point you need to expand your cluster. Why 50%? Because in the worst case, and it really is the worst case, all of your data is being compacted: you're compacting your entire data set into a single sstable. And once more, if you're update-heavy, why are you doing this? You're merging a lot of overwrites into one, so in an update-heavy workload you don't hit this worst case, because you're expiring and merging data. The worst case is when you are append-only, all of your data is unique, you're moving your entire data set into a new file, and you have a single table. It has to be a single table, because if you have two tables we don't compact them at the same time; we only run one compaction per tier of a given table at a time.
So in the worst case you have one table that you're completely rewriting to the output, with no overwrites, and then you do have to leave 50% of your disk free. Now let's look at an example. We're just running cassandra-stress, writing 30 million partitions, which is about nine gigabytes of data, at a constant write rate; the rate doesn't matter because we're looking at disk space, and with one chart we can focus on what's happening. This is size-tiered over time. Because it's an append-only workload, the space never shrinks when you compact; that's exactly why we picked this workload. You can see that at the beginning you're compacting maybe two or three small sstables together, then you do another compaction, the fourth spike there, which is a little bigger, and then in the worst case, the big peak, you're compacting all of your data from the input to the output. Once the compaction is done the disk space comes back, so it isn't gone forever, but while the compaction is running you do need that buffer space. So one case where people legitimately shy away from size-tiered is: I want to use more of my disk, I am disk-bound, the amount of data is the thing I care about most, maybe I'm storing a petabyte of data and that has a cost. That's a valid reason not to use size-tiered.
So now let's look at the leveled compaction strategy. This is where most of the misconceptions come from, so that's where I'd like to spend most of the time. How does leveled compaction work? When an sstable is created, it's created in a special level we call level zero: essentially, whatever was in your memtable is now in your sstable, nothing fancy, and it is not yet part of what we call the level invariant. Leveled compaction has an invariant, and that invariant means the following. The number of sstables is fairly big, so if you're using leveled compaction, don't get scared by the number of sstables, but there is a property: every sstable in your system, if you're using leveled compaction, has a fixed size of 160 megabytes (it can end up a bit bigger, but that's the granularity into which we divide things; let's put the edge cases aside so we can understand the concepts). Each sstable is 160 megabytes, again per shard and so on. Then the next level, and we keep adding levels, is 10 times bigger than the previous level, and that's always true. What that means is that the next level has ten times more sstables than the level before, and those sstables have a very interesting property: they are disjoint. If my tokens go from A to Z, maybe this sstable is guaranteed to only hold A to C, and the next sstable is guaranteed to hold, say, D to F, depending on how much data you have in your token range, but they are guaranteed to be disjoint.
What that means is that it doesn't matter that I have a lot of sstables; I don't even need bloom filters for them, because I know my key cannot be in any of the others. So when I feed a new sstable into a level, I pick it and compact it into the next level, but here lies the disadvantage: to comply with the invariant, I may now have to rewrite a large part of that level, in the worst case the entire level. Let's say I don't have to rewrite the entire level; the normal thing is to rewrite ten sstables. Why? Because each sstable in the next level covers ten times less of the token range. So if at my level I have A to F, in the next level that range is spread, on average, over ten sstables. Every time I move an sstable to the next level, I will have to rewrite about ten sstables there, and of course that can trigger more promotions out of that level, a chain of events moving up and up; but each promotion rewrites about ten sstables in the next level. Level zero is special: we wait for some sstables to accumulate in level zero and promote them all together, precisely to reduce write amplification.
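To visualize the invariant, here is a small sketch using the numbers from the talk (160 MB sstables, a 10x growth factor per level); the exact sizes are illustrative:

```python
# Shape of the leveled-compaction invariant (illustrative numbers from the talk).
SSTABLE_MB = 160      # every sstable outside L0 has this fixed target size
FANOUT = 10           # each level holds ~10x more data than the previous one

def level_capacity_sstables(level):
    """L1 holds ~10 sstables, L2 ~100, L3 ~1000, ... (L0 has no such invariant)."""
    return FANOUT ** level

for level in range(1, 5):
    count = level_capacity_sstables(level)
    print(f"L{level}: ~{count} disjoint sstables, ~{count * SSTABLE_MB / 1024:.1f} GB")

# Promoting one sstable from Ln to Ln+1 overlaps ~FANOUT sstables there,
# so a single promotion typically rewrites about 10 sstables in the next level.
```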
But a key point about level zero: the invariant doesn't exist there. If I have five sstables in level zero, you may have to read from all five of them. That's crucial. So what's the read amplification for leveled compaction? The worst case is the number of levels, which we'll once again assume to be four (maybe you have five, a lot of people have three, but four gets you a couple of terabytes of data per node), plus whatever you have in level zero. This is the part of leveled compaction that I see a lot of people omitting, and that's where the misunderstandings come from. "My worst case is 4 sstables." No, it's not: it's 4 sstables plus whatever you have in level zero, because level zero doesn't follow the invariant. Now, the space amplification is great: roughly one extra sstable's worth, plus level zero, always plus level zero. And to actually compact data you don't need a percentage of your disk free, you need a fixed amount of space. Why? Because I'm only compacting one sstable at a time, which is 160 megabytes, against about 10 sstables on the other side that are also 160 megabytes each.
So the space I'm holding during a compaction is essentially 11 times 160 megabytes, and in the worst case you're doing this on all shards at once, so it's the number of shards times 11 times 160. That's much, much better than 50%, unless your disk is so small that 50% of it is less than that, and if you have a disk like that you're probably not here. The biggest problem with leveled compaction is the write path. Why exactly? Because of all those promotions. With size-tiered, when I compact an sstable, I'm moving it up one tier. Here, in the worst case, I have to rewrite it all the way up through the levels. Assume my level-zero sstables didn't trigger a rewrite of the entire data set and focus on the case where just one key is going up: I'll have to go up four levels, and every time I go up a level I have a write amplification of 10, because for that one key I'm rewriting 10 other sstables that have nothing to do with it. I take the sstable I have here, move it to the next level, and as we discussed, that's 10 sstables I have to rewrite. That means the write amplification for leveled compaction is a factor of about 40, which is 10 times bigger than size-tiered, which is usually a factor of about 4.
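Putting the talk's back-of-the-envelope numbers side by side (these are rough worst-case figures under the four-level assumption, not guarantees):

```python
# Rough worst-case write amplification comparison, using the talk's assumptions.
LEVELS = 4          # levels (or tiers) assumed for a few TB of data per node
LCS_FANOUT = 10     # promoting one sstable rewrites ~10 sstables in the next level

lcs_write_amp = LEVELS * LCS_FANOUT    # ~40: every promotion drags ~10 sstables along
stcs_write_amp = LEVELS                # ~4: data is rewritten once per tier it climbs
print(lcs_write_amp, stcs_write_amp)   # 40 4
```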
So again, this is 10 times worse than size-tiered from the point of view of write amplification: I keep having to rewrite my data. The key to understanding this is that the nice properties only hold if the invariant holds, which means you constantly have to rewrite your data into a certain structure to get those nice properties of leveled compaction. And you can see it in the picture: if you run the same workload as in the previous example but with leveled compaction, every time you move up a level it's a constant amount of space you need, and you climb level by level; you never get the big spike. So leveled compaction really is great for space amplification. But if we look at the same experiment, which as you'll see isn't even the worst case (the worst case is 4 versus 40), we tracked the total amount of disk writes: with size-tiered the total amount written was 50 gigabytes, while the total amount of data ingested was around 9 gigabytes. So to write nine gigabytes of data in this experiment, the disk had to absorb 50 gigabytes, including all the rewrites.
Again, this is append-only, so little of that is due to overwrites; with overwrites it's different. But leveled compaction, in the same scenario, wrote 111 gigabytes, and as I said, in the worst case it's 10 times worse. So the problem with leveled compaction is write amplification. Update-heavy workloads are interesting, because with an update-heavy workload you very seldom reach the last level: you still have this factor-of-10 cost per promotion, but maybe you can keep most data in L0 and L1. You get your L0 sstable, move it to L1, and because everything is an overwrite anyway, you might not need to go to the next level. So for update-heavy workloads, and it's hard to quantify because it depends on the shape of your data, it may be that leveled compaction does the trick for you: each promotion has this horrible write amplification, but you might not need to keep going to the last level all the time. Now, what are the common LCS misconceptions?
First: "leveled compaction is good for read-mostly workloads." If it's read-only, actually, every compaction strategy is good: you just compact everything into a single sstable beforehand and there you go, you have one. "Leveled compaction is good for read-mostly workloads" is false as stated, but there are questions we should ask to qualify it, because when people talk about read-mostly, they usually come to us with a number: "if I have 99% reads, is leveled compaction good for me?" That's like asking, would you like to have 1% of the GDP of a country? Depends on the country: 1% of the GDP of the United States, sure, why not, but other countries might be less impressive. It's the same thing here. We're talking percentages, but what really matters is the amount of writes the disk is taking. Why? Because the properties of leveled compaction only hold if the invariant is stable: if the sstables stay in place, not constantly changing, then you get all those nice properties.
But if you can't, then you're going to have to read from all of your L0 sstables, and they're going to start piling up because the invariant is in constant flux. The key to understanding this is that you're going to use a lot of disk bandwidth to keep rewriting this data. To put some numbers on it: say you're reading from 4 sstables with leveled compaction but from 6 sstables with size-tiered. Is leveled better? Not necessarily, because if I'm compacting all the time, I'm reading from 4 sstables but using a lot of the disk bandwidth to maintain that, and I only have, say, one gigabyte per second left to serve my reads; with size-tiered, because the write amplification is better, maybe I'm reading from 6 sstables but I have 2 gigabytes per second to do it, because I'm spending much less bandwidth on compaction.
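One crude way to put those numbers together, purely as an illustrative proxy and not a real performance model, is to divide the bandwidth you have left by the number of files each read has to touch:

```python
# Very rough proxy, only meant to show that the tradeoff can flip; not a real model.
def effective_read_capacity(bandwidth_left_gb_s, sstables_per_read):
    """Leftover disk bandwidth spread across the files each read has to touch."""
    return bandwidth_left_gb_s / sstables_per_read

lcs = effective_read_capacity(bandwidth_left_gb_s=1.0, sstables_per_read=4)   # 0.25
stcs = effective_read_capacity(bandwidth_left_gb_s=2.0, sstables_per_read=6)  # ~0.33
print(lcs, stcs)  # size-tiered comes out ahead here despite touching more files
```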
So it depends. You might be thinking that doesn't help much, "I still need to make a decision." The usual advice we give users is: look at your monitoring. If you have compactions going on all the time with leveled compaction, get out of leveled compaction. Leveled compaction is good when you look at your monitoring and compactions only run every now and then. Does that mean it has to be a read-mostly workload? Maybe it's a 50% write workload, but if you don't have a lot of traffic overall, 50% writes still isn't a lot; or maybe you have overwrites, so your compactions do happen but everything stays in level zero; that's fine too. So the share of reads is not the interesting thing to look at. What we should look at is: in my monitoring, am I compacting all the time or not? If you are, leveled compaction is probably not good for you, because you have those nice properties but you're spending a lot of disk bandwidth to guarantee them, bandwidth you could instead spend reading from a few more sstables without the properties.
The second misconception: "I want to use leveled compaction because I care about read latencies, I'm fascinated by read latencies, and I like that leveled compaction guarantees I read from at most 4 sstables." That's also not true. First of all, as I said, it's 4 plus L0, and it's only true while the invariant is maintained. If your invariant is in flux, you're going to read from more; and if your invariant is so damaged that you can't keep up, and you have maybe 32 sstables sitting in level zero, you are now much, much worse off than size-tiered, because you're reading from those four sstables in your current invariant plus the 32 sstables accumulated in L0 that you couldn't compact because you don't have the disk bandwidth. That doesn't mean the properties people cite for leveled compaction aren't true; it means people usually forget about the price you pay to maintain those invariants. Once more, the best way to determine this is to look at your monitoring: if you're compacting all the time, leveled compaction is not good for you. Now, an overwrite workload.
This example again shows space amplification, about 1.2 gigabytes, and you can see the behavior of leveled compaction versus size-tiered, leveled once more in green. This is an example where, because it's update-heavy, I don't have to keep climbing to more and more levels all the time. Those are things for you to take into consideration as well. Now, the time-window compaction strategy. Memtables have a write time: whenever we have data in memory, we track the write time of that memtable. I don't remember whether it's the lowest write time in the set or the highest, but it's one of them; it's a property of the entire memtable, which depends on the data currently in it, a single number summarizing, as a min or a max, the write time of everything in the memtable. When you flush that to storage, the sstable inherits this property. The idea of the time-window compaction strategy is that you combine sstables by grouping together data that was written at more or less the same time.
And this is pretty interesting, because there are a lot of use cases in which people use ScyllaDB in a time-dependent way: maybe overwrites are only possible within a time window, or, even more common, the data has a TTL. The holy grail of time-window compaction is being able to drop an sstable outright. With the other compaction strategies, if all of your data is expired (imagine you have a twenty-five-zettabyte table and it's all TTL'd, all gone), you don't actually know that: you have to go read every key, and if it's expired you drop it, and if it's not you write it back to the new sstable. So with this gigantic amount of data, all expired, the other compaction strategies have to read a lot, and hopefully they won't write much into the new sstable because most of it is expired. The holy grail of time-window compaction is that, because all of the keys in an sstable were written in the same time window and you have a TTL, you can drop the sstable without even reading it; in the best case you drop the whole table's worth of data this way.
In our example, I can drop the entire sstable without reading anything, because I can see in the metadata that everything was written in a particular time window and I know we're past that window, so I drop it. If you are able to do this, time-window is the strategy for you. If you have to ask qualifying questions, "but what about if I...", then it's probably not, because then the opposite happens. I'll explain how it works, but if you have a single key that prevents an entire sstable from expiring, you're in trouble, because you're never going to touch that sstable again after its time window has passed. So time-window compaction is really great for cases where you can just delete an entire file because you know everything in it is gone, or for cases where, for instance, you're append-only, your data arrives in time order, you have sensors and so on, and you know you'll look at, at most, one file: say the time is part of your partition key, or even better, part of your clustering key, so you know "it's not in this time range, I don't have to touch this sstable." Those are the cases it's good for. So how do you use time-window compaction? I'm showing how to use it, unlike for the other strategies, because that brings all the important properties to light: you essentially specify the window size and the unit of the window size.
Say, for instance, four days: everything that happened within a span of four days gets compacted together. How? With size-tiered. During those four days, I keep doing size-tiered compaction within that window: as you keep adding data, I'm not just letting it accumulate, I'm running size-tiered; there is no difference between size-tiered and time-window for the current window. When the window gets old and ceases to be the current window, I do one final major compaction, compacting everything in it into a single sstable, even if there are 32 sstables, and then I never touch that window again. So now you understand what I meant: if you have one key that wasn't TTL'd, or that was inserted now but belongs to a time window that is old, you're never getting rid of it, because the whole point of time-window compaction is to never look into a time window that is old. It's going to be there forever.
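Here is a minimal sketch of the windowing idea, assuming a 4-day window and made-up helper names; the real strategy's options and metadata handling differ:

```python
# Sketch of time-window bucketing (simplified; real option names/behavior differ).
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=4)    # e.g. a compaction window of 4 days

def window_for(write_time):
    """All data written inside the same 4-day window ends up compacted together."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    index = (write_time - epoch) // WINDOW
    return epoch + index * WINDOW

def can_drop_whole_sstable(max_write_time, ttl, now):
    """If even the newest cell in the sstable is past its TTL, drop the file unread."""
    return max_write_time + ttl <= now

now = datetime.now(timezone.utc)
print(window_for(now))                                                            # start of the current window
print(can_drop_whole_sstable(now - timedelta(days=30), timedelta(days=7), now))   # True: drop without reading
```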
And that's where we start getting questions like: "I inserted one gigabyte of data and I have four petabytes of disk being used." It's because in each of your sstables there's one key preventing everything else from being deleted, and we're not going to touch it, because that's the whole point of time-window compaction. So the good pattern for it is, as I said, when all the data that belongs together is inserted in the same window. One good example: the partition key is the user, the clustering key is the insertion timestamp in seconds, and I query by time. Even in a case where I'm not TTLing anything, I know that when I query, I query for a time window, and we can also discard sstables based on the clustering key.
So I look at that sstable, just at the metadata, I don't even need to look at the bloom filters: I know I'm querying for a time window. But if you come up with a query like, say, the time window is one day, I have one year of data, time is the clustering key, and I decide to do a full scan over that year, I'm going to read from 365 sstables. And if you call me a liar because in a leap year it's 366, I'm not going to talk to you again; let's keep things simple, roughly 365 or 366 sstables. Again, that's a bad pattern: you'd be better off reading from a single sstable. If I run this query once a year, fine, but if it becomes a regular part of your workload, that's pretty bad in terms of read amplification: a single query that needs 365 sstables to serve is not great. So which strategy should you use? Just as a summary: size-tiered is good for write-only workloads if you care about bandwidth; leveled is also fine for write-mostly if you care about space; time-window is not really suited there. For overwrites, both size-tiered and leveled can be great, depending a little on how your overwrites behave: leveled can work well if your workload keeps things in L0 and L1; size-tiered with many updates is not great because you might have big read amplification, but with few updates it's great because the bloom filters save us. For well-behaved time series and the like,
time-window is usually the best one. Now let's spend the last ten minutes on the incremental compaction strategy. Incremental compaction is a new strategy that we've had working for quite a while; we're still in the last days of it. It's impressive how software has bugs, right? We should just have the government pass a law that bugs are not allowed. A lot of those bugs were of the kind where there's one scenario in which the guarantee isn't really kept, and we want to address that. It should be coming in a matter of months, and it's most likely going to be Enterprise-only. It used to be called the hybrid compaction strategy; we talked about it at the ScyllaDB Summit a couple of years back under that name. Why "hybrid"? The name doesn't really tell you much about the strategy, which is why we changed it, but it comes from being a hybrid between size-tiered and leveled compaction.
How does it work? The way it works is that we don't write a plain sstable anymore. When a memtable is flushed to an sstable, or a compaction produces output, instead of writing an sstable we write something called an sstable run. An sstable run is a generalization of the concept leveled compaction has of sets of sstables with disjoint ranges. So even when I'm writing the result of a memtable flush, if you're using incremental compaction, instead of writing one big sstable I write one big sstable run that is comprised of many small sstables, physically all the same size; logically, I call that a run, a logical entity. I keep doing size-tiered compaction: the logic of incremental compaction is the same as size-tiered, exactly the same. So, by the way, it won't help you in the update-heavy cases where leveled could keep everything in L0 and L1, because you still need to carry data all the way up to the last tier. It's exactly the same as size-tiered, except my input is now partitioned.
That means I can compact piecemeal. I don't have the 50% problem anymore, because the 50% space overhead of size-tiered comes from the fact that I have to compact my entire input sstable set into a new sstable set before I can release any of that space. With incremental compaction, and this is maybe obvious, when you compact sstables you walk them in token order, and because the sstables in a run are disjoint, I know this fragment here covers, say, A to F; so once I've compacted everything from A to F and sealed the corresponding output, I can discard those input fragments early. You essentially solve size-tiered's space amplification problem, because logically you still have a big sstable, but physically it's partitioned into small disjoint sstables. The temporary space overhead is actually even better than leveled's: with leveled you always have that factor of 11 times 160 megabytes, while here what you need is essentially 160 megabytes, the size of a fragment, times however many compactions you have in parallel. With size-tiered you can have parallel compactions for the same table if they're in different tiers, so in the worst case you're compacting, as we discussed, four tiers at once; leveled won't compact multiple levels of the same range in parallel, but it does rewrite 10 sstables at a time. So the amount of free disk space you need is actually better than with leveled compaction; the constant is better by roughly a factor of 2, and it's nowhere near size-tiered's 50%.
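A small sketch of the sstable-run idea and of why the temporary space becomes a small constant; the 160 MB fragment size matches the figure in the talk, everything else (function names, the two-fragments-in-flight estimate) is an assumption for illustration:

```python
# Sketch of the sstable-run idea behind incremental compaction (illustration only).
FRAGMENT_MB = 160   # a "run" is one logical sstable stored as many small disjoint fragments

def split_into_run(total_mb, fragment_mb=FRAGMENT_MB):
    """A large logical sstable becomes many fixed-size fragments with disjoint token ranges."""
    full, rest = divmod(total_mb, fragment_mb)
    return [fragment_mb] * full + ([rest] if rest else [])

def temporary_space_needed(parallel_compactions, fragments_in_flight=2, fragment_mb=FRAGMENT_MB):
    """Input fragments are deleted as soon as their token range is fully compacted, so the
    extra space is a small constant per in-flight compaction, not ~50% of the data set."""
    return parallel_compactions * fragments_in_flight * fragment_mb

print(len(split_into_run(10 * 1024)))   # 64 fragments for a 10 GB run
print(temporary_space_needed(4), "MB")  # ~1280 MB of headroom vs. ~50% of the data for plain size-tiered
```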
The write amplification is the same as size-tiered's, because when I move an sstable run up a tier I don't have to rewrite a whole level, and at the same time, because the fragments are disjoint, I can get rid of my input sstables sooner. The read amplification is like size-tiered's: sometimes it will be good, sometimes it won't. It's great for append-only workloads; for append-only it's essentially the best. For update-heavy it depends: maybe, if you're lucky with leveled compaction and you can keep everything in the lowest levels, leveled is better, but I wouldn't rely on luck; keep in mind that with leveled, every time you go up a level you pay a factor of 10, and here you don't have that factor of 10. So here is size-tiered versus incremental compaction, the same setup as example number one.
We're looking at space amplification here, and you can see (we could have included leveled; it's not the same as leveled, the spikes are different, but those are details) that it's much better: you never get the big spike. The goal of incremental compaction is to give you the benefits of size-tiered compaction without the cost of 50% of your disk space. So it's usually the right choice if you're saying: I want size-tiered, the other compaction strategies all had drawbacks for me, and the only thing keeping me away from size-tiered is that I need the extra disk headroom, I want to push disk utilization to 80 or 90 percent. Then incremental compaction is for you.