This lesson covers Time-Window Compaction Strategy (TWCS): how it works, examples of its behavior, performance considerations and when to use it, space and write amplification, and common misconceptions.
With Time-Window Compaction Strategy, every memtable has a write time. Whenever we have data in memory, we track the write time of that memtable. I don't remember whether it's the lowest write time in the set or the highest, but it's one of the two: a single number, a min or max, that summarizes the entire data set currently in the memtable. When that memtable is written to storage, the sstable inherits this property.

The idea of TWCS is that you combine sstables by grouping data that was written at more or less the same time. This is pretty interesting because a lot of ScyllaDB use cases are time-dependent in some way: maybe overwrites are only possible within a time window, or, even more commonly, the data has a TTL. The holy grail of TWCS is being able to drop an entire sstable. With the other compaction strategies, even if all of your data is expired, you don't know it. Just imagine you have a twenty-five-zettabyte table and it's all TTL'd, all expired: you still have to read every key, drop it if it's expired, and write it back to a new sstable if it's not. So with that gigantic amount of data you're going to read a lot, and hopefully not write much into the new sstable, because most of it is expired.

With TWCS, because all of the keys in an sstable were written in the same time window, if everything has a TTL I can drop the entire sstable without reading anything. From the metadata alone I can see that 100% of the data is expired: everything was written in a particular time window, and I know we're past that window, so I drop the file. If you are able to do this, TWCS is the strategy for you. If you have to ask qualifying questions ("but what about if I..."), then it's not for you, because then the opposite happens. I'll explain how it works below, but if you have a single key that is preventing an entire sstable from being expired, you're in trouble, because you're never going to touch that sstable again after its time window is gone. So TWCS is really great for cases where you can delete the entire file because you know everything in it is gone, or for append-only workloads, like sensor data, where you know how the data comes in. Ideally the time is part of your partition key, or even better part of your clustering key, so I can say: this sstable doesn't cover that time, I don't have to touch it.
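As a minimal sketch of the whole-file expiry check described above: if the newest write in an sstable plus its TTL is already in the past, every cell in the file is expired and the file can be dropped without reading a single key. The field names here are illustrative, not ScyllaDB's actual sstable metadata layout.

```python
from dataclasses import dataclass

@dataclass
class SSTableMeta:
    min_write_time: int   # earliest write timestamp in the file (epoch seconds)
    max_write_time: int   # latest write timestamp in the file
    ttl_seconds: int      # uniform TTL applied to every cell

def fully_expired(meta: SSTableMeta, now: int) -> bool:
    # If even the *newest* cell's TTL has elapsed, everything is expired:
    # the whole file can be deleted without reading any data.
    return meta.max_write_time + meta.ttl_seconds < now

old = SSTableMeta(min_write_time=0, max_write_time=100, ttl_seconds=50)
print(fully_expired(old, now=200))   # True: drop the file wholesale
print(fully_expired(old, now=120))   # False: some cells may still be live
```

This also makes the failure mode concrete: one cell written without a TTL (or with a much later timestamp) keeps `fully_expired` returning `False` for the entire file, forever.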
Those are the cases it's good for. So how do you use Time-Window Compaction Strategy? I'm showing how you use it, unlike with the other compaction strategies, because the configuration brings all the important properties to light. You usually say what your window size is and what its unit is. Say, for instance, four days: everything that happens in a span of four days gets compacted together, using Size-Tiered Compaction Strategy. During those four days, as you keep adding data, I'm not just going to leave it accumulating; I'll be running size-tiered compaction inside that window. For the current window, there is no difference between size-tiered and time-window: in the current window I am just doing size-tiered. When the window gets old and ceases to be the current window, I do one final major compaction, which compacts everything into a single sstable. Even if I have 32 sstables, it doesn't matter: I compact all of them into one, and I never touch this window again. Now you understand what I said earlier: if you have one key that was not TTL'd, or that was inserted now but belongs in a time window that is old, you're never getting rid of it.
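In CQL, a table using TWCS might be declared like this. The `compaction_window_unit` and `compaction_window_size` options are the real knobs exposed for `TimeWindowCompactionStrategy`; the sensor-data schema and the 7-day TTL are just an illustrative example.

```cql
CREATE TABLE sensor_data (
    sensor_id text,
    ts        timestamp,
    reading   double,
    PRIMARY KEY (sensor_id, ts)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 4
  }
  AND default_time_to_live = 604800;  -- 7 days: whole windows expire together
```

Setting the TTL at the table level, rather than per write, helps guarantee that every cell in a window expires together, which is exactly what makes the whole-file drop possible.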
The whole point of TWCS is not to look into a time window that is old, so that key is going to sit there forever. That's where we start getting questions like: "I inserted one gigabyte of data and I have four petabytes of disk being used." It happens because each of your sstables has one key that prevents everything else in it from being deleted, and TWCS is never going to touch those files again; that's the whole point of the strategy.

Good patterns for TWCS: again, as I said, all of the data is inserted in the same window, so data that belongs together goes together. One good example: the partition key is the user, the clustering key is the insertion time as a timestamp in seconds, and your queries are bounded by time. Even in this example where I'm not TTLing anything, I know that when I query, I query for a time window, and we can discard sstables based on the clustering key as well. I look at an sstable's metadata, without even needing to look at the bloom filters, and I know whether it covers the time window I'm querying for.
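The clustering-key pruning above can be sketched like this: each sstable's metadata records the min and max timestamps it covers, so a time-bounded query can skip whole files before consulting bloom filters or indexes. This is an illustrative model, not ScyllaDB's internal code.

```python
# Each closed TWCS window is a single sstable whose metadata bounds
# the timestamps it contains (names and layout are hypothetical).
sstables = [
    {"name": "window-day-1", "min_ts": 100, "max_ts": 199},
    {"name": "window-day-2", "min_ts": 200, "max_ts": 299},
    {"name": "window-day-3", "min_ts": 300, "max_ts": 399},
]

def sstables_for_range(tables, start, end):
    # Keep only the files whose [min_ts, max_ts] overlaps the query range;
    # everything else is discarded from metadata alone.
    return [t["name"] for t in tables
            if t["max_ts"] >= start and t["min_ts"] <= end]

print(sstables_for_range(sstables, 250, 260))  # only one file overlaps
```

A query bounded to a single day touches one file; the read amplification story below is what happens when the query range is not bounded.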
But suppose you come up with a query like this: say the time window is one day, I have one year of data, the time is the clustering key, and I decide to do a full scan over that year. I'm going to read from 365 sstables. (And if you call me a liar because a leap year has 366, I'm not going to talk to you again; let's keep things simple: 365 or 366 sstables.) Again, that's a bad pattern; you would be better off reading from a single sstable. If I run this query once a year, fine, but if it becomes part of your regular workload, that's going to be pretty bad in terms of read amplification: a single query that needs 365 sstables to serve. Not great.
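The read amplification here is easy to estimate, since each closed window is one sstable: a scan spanning a given range touches roughly one file per window. A quick back-of-the-envelope helper (the function name is mine, for illustration):

```python
import math

def sstables_touched(scan_days: int, window_days: int) -> int:
    # One sstable per closed window: a scan over `scan_days` crosses
    # about ceil(scan_days / window_days) windows.
    return math.ceil(scan_days / window_days)

print(sstables_touched(365, 1))   # 365: the full-year scan from the example
print(sstables_touched(365, 4))   # 92: a four-day window reduces it, but
                                  # also delays how soon whole files expire
```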