Repair, Tombstones and Scylla Manager

  • Some of the manager info is relevant for Manager version 1.4

Most of the presentation despite the title will be focusing on repairs and tombstones and the Scylla manager part and the end of it is pretty easy so here’s what we’re going to discuss today is like what are repairs why are they needed what are some scenarios in which they are not needed because we keep having those questions what is the famous tombstone issue you might have heard of it like what are tombstones and etc and then last thing how the Scylla manager in its current version helps you achieve those goals all right so repairs one of the things that people get confused a lot so especially like you’ve been hearing repairs repairs repairs and not necessarily had the session to explain what it is I personally find that, we did inherit that from Cassandra but I personally find the term not the best because we tend to think there is something wrong with my database and then it needs to be repaired so the first thing like when you run repair that doesn’t mean there is something wrong with the database more than that there is only a class, it does fix something but there is only a class of issues that are fixed by repair

So in part because of the name one of the things that we very commonly see is people saying well my database is low I’ll run repair to see if it gets faster it won’t get faster because that’s not the things that repairs are designed to do or maybe well something happened and whatever I think it’s wrong and then should I run a repair the logs are saying that I’m out of memory I’ll run a repair to repair that once more repair do not do this for you so what is it exactly that repair, if that’s not what repairs are what are they so whenever there is a write so this is the model you’ve probably seen this in the classes in which data modeling and the basic architecture of Scylla was being discussed every time there is a write and this is also something that a lot of people get it wrong every replica there is a replica for that key we receive that write so when you write something with consistency level one that does not mean that we are only write into one replica we are always writing to every replica

So you have replication factor of three in one datacenter replication factor of three in other datacenter we’re going to write to all six replicas for that key always so there is not there is no case in which Scylla would say I’m not gonna write you that node the node is down I cannot write you that node but we always try to write to those nodes but then there is a consistency level what the consistency level say is how many replicas have to respond before I consider this write successful and the consequence of that is that I cannot tell anything about what happened in the other replicas I don’t know maybe the write was successful maybe it was not successful so there isn’t something I can say the state really is unknown so if you write with quorums which is or local quorums which is how most people write usually gonna have two replicas that are guaranteed to have this piece of data why because the write return is success then I mean sure yeah my data is there but there is nothing you can say about the third replica most of the time is just fine because your network is without issues but maybe you had a network issue maybe the node was overloaded it didn’t reply you don’t know then reads so reads are different so this is writes reads are different reads will try to be effective and efficient so if you try if you say I will read with consistency level one that only means so there are exceptions of that like probability we will discuss that but that only means that we will try a single replica because I want my reads to be as fast as possible 

So you ask me for a consistency level one or local one I’m only gonna touch one replica what happens if I wrote with a quorum so I have a scenario with three nodes and then I have another node that doesn’t have I wrote with a quorum two nodes have the data and then what happens if I read with local one or one like anything can happen maybe I hit one of the replicas that has the data I’m fine I read the most up-to-date data maybe I read one of the replicas that didn’t return we’re not part of their quorum but he has the data anyway because most the time we will then my data will be correct but it could also be that I will hit the replica that happens not to have the write doesn’t have that data or it has older data for that key in that case this replica will return older or an inexisting data this is what repairs are designed to fix so repairs are designed to make sure that all replicas have the same copy of the data if you have an overwrite maybe overwrite scenario maybe all replicas have data but they’re the data in one of them is newer right and then repair will make sure that everybody has the most up-to-date copy is this a problem to read stale data or to read no data where you expect data I don’t know this is something that your application has to tell me 

Like if you cannot ever tolerate that there is a way to do it which is you write with quorum you read with quorum then you always consulting a majority it’s fine maybe you don’t care if you read stale data for a while because your application tolerates that then you can play with the consistency level, one way or another repair is  the tool that is going to make sure that all replicas are up to date right so in this example here I’m using I have replication factor 3 per datacenter right so I have two datacenters and I am writing with local quorum local quorum what does that mean I want a majority of the nodes in this data center to reply and then I’ll consider the write successful so in the first datacenter the datacenter that in which is sent to write two nodes are guaranteed to have the data because the local quorum returns success I don’t know which two but two of them have the data there is another node you can’t say anything about it maybe has the data maybe it doesn’t and this second datacenter maybe has the data maybe doesn’t and only after repair they’re all gonna have the same thing now there are three types of repair there’s like people call it repair but in reality there are three different processes that have repair in their name the first one the first of them is synchronous read repair synchronous read repair as the name says it happens at read time so let’s say that’s why by the way if you read and write with quorum you always have the most to date data because during a read if I contact two replicas 

So in the replication factor of three case my quorum reads will contact two replicas if I contact two replicas or three replicas or four or however many I contact and at that time I noticed that there is a mismatch between them I fix it and then I return it back to you so the synchronous read repair because it happens synchronously in the read path so again once more if you read this data with quorum there it was written with quorum even if you happen to hit a node that has the data and a node that doesn’t have the data at that point they will both have the data because synchronous will repair will have fix it there is another situation so situation number 2 in which synchronous read repairs happen and this is whenever I roll the dice it’s obviously not a physical dice but there is a chance of 1% if I’m not mistaken and I do this one the second one then I’ll talk to all replicas so chance of 1% and then you can configure like I’m not we’re not going into how you configure it but in the table properties you can configure the chance of this repair happening across datacenters and you can configure the chance of this repair happening within the datacenter so you can make it more often that it happens within the datacenter and you can make it less frequent that it runs cross datacenter the default is 0 for cross datacenters so that doesn’t ever happen across datacenters but let’s say 1% of your reads I would just say well this read I just decided that I will go read all replicas so what it does is like okay I roll the die and it tells me don’t do it that I don’t do it I just use the normal path if I see a difference how fix it if I don’t I’m not gonna do anything and I’ll just return to you

But for one percent or however number of the reads I roll the die and say okay for this one I will fix it and then I go read all the replicas in the local datacenter and if there is no difference even better if there is any difference at that point I’ll fix it however the third one is the asynchronous one we’ll talk about later when most people are talking about repairs they’re not talking about those two repairs a lot of people don’t even know those things exist because they’re happening internally to the database like the operator doesn’t need to be involved there’s nothing you have to do I mean the read comes to you and it’s already correct when you hear the word repair most of the time people are talking about asynchronous repair which is the one that you got to go and do and initiate and do something what is it that this repair does asynchronous repair so from now on when I say repair I mean asynchronous repair because we don’t know where the differences are so it could be that all keys are correct it could be the some keys are incorrect it could be that one percent of the keys are incorrect it could be the ten percent of the keys are incorrect I have no idea

So the only way to do it is to scan your entire database compare checksums between or among all of the the replicas because if you have more than two and then do the same thing if it’s wrong or stale or invalid or old I’ll fix it if not I’ll move to the next one and the reason it’s called asynchronous is not because you do it as the operator it’s because the read path is not affected by this right so the reads are happening and they don’t care I mean the reads don’t change the way the reads happen don’t change because you’re running asynchronously repair but in the background you have this process there is scanning the database talking to all the replicas and comparing checksums and every time there is a mismatch in the checksum I know that there is data in there that one of the replicas don’t have for have an older version or etc and at that point I’ll fix it do I need to run repairs that’s a question we hear a lot and the answer is yes so the simple answer is yes you do why for those reasons we discuss I mean actually this is one case this is one valid case where you need to be in a very bad state for this you become really noticeable but it could even happen to help you with performance if you if you have a lot of mismatches that you have to otherwise fix every time but the most important thing to do is make sure that all of your data is all the replicas are seeing the same data 

However do I really need to run repairs because some people don’t like doing it because maybe I don’t want to add the repair load into the cluster or whatever so there are scenarios in which you don’t need your run repair and we discuss them briefly I mean I don’t plan to delve too much into them the first one is if you really don’t care about consistency of your data that’s a trivial one like a lot of applications really don’t care they’re especially like a complex machine learning applications if you have a little bit of data that happens to be wrong like nothing bad is going to happen so I decide not to  run repair that’s fine if you run a single datacenter it is safe to skip repair if the consistency level for reads plus the consistency level for writes are bigger than the replication factor classical example quorum plus quorum which is always by definition bigger why because of what you just described like we’re always going to be finding a group of two or a majority and they will return the right data if there’s no deletes we’ll talk about that if you have deletes this breaks whenever you have deletes or TTL TTLs are deletes, they’re just deletes that you did not have to do because they were done for you but if you have deletes or TTLs you cannot skip your repairs we’ll talk about why later but if you don’t have deletes if you don’t have TTLs and you’re always doing a majority write plus majority you read it’s fine multi datacenter I’ll just skip it

Tombstones everyone’s favorite subject what is it tombstone so the way we delete data in Scylla that’s because of the asynchronous nature of what we do is that whenever you delete data you don’t really delete the data you just mark that row for deletion so you just write the fact that this row is deleted right you just don’t go there and delete the data why because you then you start having a coordination problem among all the replicas so deletion and a TTL for that matter is just a deletion that happen automatically it’s just the write there is no difference between a deletion and a write a write writes a value and a deletion writes the fact that this data shouldn’t be responded to the client tombstones disappear because at some point I mean if you never get rid of it then your disk space never truly goes down I mean it does go down because if you delete a 100k blob the deletion itself is very small but at some point you can have a lot of deletions that use a lot of disk space so at some point I do want to get rid of those deletions this is called gc_grace_seconds in your table properties so when you create a table you’re gonna see gc_grace_seconds default for that is ten days and that is that means that after 10 days I delete the tombstones that were created 10 days ago 

So tombstones are going to be in your cluster for at least 10 days I don’t have to delete it because they are deleted asynchronously but if I’m doing something that would trigger a tombstone deletion in less than 10 days assuming the default I’ll leave the tombstone be I’m not going to touch it but if if I’m doing it after 10 days I can get rid of it because it’s been 10 days I know before I tell you why this gc_grace_seconds exist let’s look at the biggest problem we have with tombstones how expensive you guys think this query will be so this query is the following select one column just imagine like let’s not go if maybe this column is super big no it’s a simple integer column select column from table limit 1 so what limit 1 tells me is that maybe you have 10 billion rows just give me the first right so whatever you have there if it’s empty it’s empty but this query will either return 0 rows or 1 row because I’m limited to 1 row but to get you a live row I might have to scan a lot of dead rows which are the tombstones this slide is also designed to show how good of an artist I am see there’s my tombstone representation the example of ten seconds it means you’ve got a lot of tombstones but if you think about a workload that’s deleting things very constantly it’s at all possible so if I fired this query this is select column from table limit one and I find myself in this situation I have to go scanning my partition until I find one live row and then I can return it to you 

So maybe it’s gonna be fast but maybe it won’t because I have a lot of tombstones right so I got to get a rid of tombstones so performance definitely can be impacted by the presence of tombstones how to avoid this we need to make sure that we are taking action to expire those tombstones the biggest thing that gets rid of tombstones are compaction so you need to make sure that compactions have the time to run and if your cluster for instance is not properly sized maybe you’re too tight compactions don’t have enough resources to run they gonna lag behind you’re not gonna be able to get rid of tombstones so in that case you expand the cluster the idea here is that why do I have this problem because like I should be able to get rid of this tombstone like immediately right just just get rid of it why do I have to keep it for ten days the problem with that is that there is one problem called data resurrection right and let me walk you through why this happens so in this example I can’t see from this screen so I’ll look in here we have three nodes so replication factor of three first node has a = x node 3 has the same thing and this was written in timestamp t1 the other node has timestamp t0 and a = y so this is one of these scenarios in which the node two for whatever reason did not get the write why I don’t know but it doesn’t have the write so if I’m doing quorum reads on that data I will always return the correct data we discussed that already right doesn’t matter which two nodes they pick one of them who have the data and then I would do read repair and I will return you the correct data 

Now let’s say I have tombstones now remember tombstones are just writes so in this case the correct answer if you query this row is null because you deleted this data if I query for example node a node 1 and node 2 node 2 has a = y but node 1 has a = null tombstone with timestamp t2 it was written later so I know uh this write is newer than the other write the correct answer here is null what happened now if I were to get rid of the tombstones who wins between timestamp t2 and timestamp t0, timestamp t2 so the tombstone will reply null but who wins between timestamp t0 and void, timestamp t0 because this situation is indistinguishable from the situation in which you never deleted you just wrote it and only node 2 has the data so if I get rid of the tombstones I can’t know that this data was deleted so if you query if you happen to query in this situation you will have the correct value and then the tombstones are deleted and then you query one second later and you end up with this and now what is the response the response is Y

Well it could be in this case I mean if you query a quorum and you happen to be lucky to query node one and three you’re gonna have the correct answer but assuming node 2 is part of the of the replica set you’re gonna get your result back it’s gonna come back to life so that’s why gc_grace_seconds exist in our set by default for 10 days before 10 days you have to repair your cluster to make sure that this doesn’t happen so let’s say for instance that your repair takes one day to run and then you run it every week then you can that’s less than grace seconds you’re fine this situations happens if this node 2 failed the write for some reason we try to write to all replicas but maybe one of them fail that’s why we need repairs in the first place so in this example we did write the tombstone or we issued a delete request to all of the nodes however for some reason node 2 not nodetool, node 2 did not get it why I don’t know maybe you had a network issue maybe the node was overloaded maybe the node crashed and while we were bringing it up that’s when this delete happened so we’re gonna attempt but because the consistency level was quorum what does that mean as long as two nodes reply I considered the write successful so two nodes replied I considered the write successful and node 2 lost the write it doesn’t have the write maybe lost the disk and you have to replace them and the data is not there it doesn’t have to be there so the biggest difference between a normal write and a deletion in this case is that the normal write has a timestamp that can be used and that’s the scenario that can be used to differentiate between like who has the data

But if you had do a deletion at some point you’re gonna delete the data and gonna get rid of the tombstone then you’re find yourself in the situation in which one node has a data with a timestamp and the other one has nothing and in that case your data can come back which is why before gc_grace_seconds repair your cluster let me talk about the Scylla manager then because a lot of questions are popping up I don’t know what were yours is but I’ll just quickly explain what the Scylla manager is I’m not gonna go into all of these slides and because that helps the things about repair too so the Scylla manager what it is is a tool to manage Scylla clusters in the same tool you can manage multiple Scylla clusters some people decide not to do that and have one manager per Scylla cluster maybe because you have different teams in your organization it doesn’t matter but you can in a single Scylla manager manage a lot of Scylla clusters 

If you have less fewer than five nodes Scylla manager is free you can download it use it for free if you have more than five nodes then it’s an enterprise only subscription by enterprise only it does work with Scylla open source if you choose to use it for whatever reason but it’s not free for you to use so essentially what it does is that this is changing so we with 2.0 that we are about to release very soon some of those things are changing so one of the things I say here is that there is no agent you don’t have to install any agent in the Scylla nodes with Scylla manager 2.0 we’re actually moving to an agent model and what replaced the need for agents is that Scylla manager would just SSH into your nodes and and kick repairs into that so you would just essentially have a private and public pairs of keys the Scylla manager would just automatically log into the nodes create tunnels and fire the repairs and just manage the whole process for you turns out there are many organizations in which it’s not that easy to run SSH you may have like two factor authentication etc 

We actually thought this is great because like you don’t need any extra agent any extra set up but we actually move into an agent model now because it makes easier in those in the simple scenario this is better but in the complex scenario in which you may have like very complex SSH in which you cannot log in without a password or you need two factor authentication this this gets a little bit harder so essentially the Scylla manager what it does when you kick out a repair on Scylla you need to go in each individual node first of all and trigger repair from that node and a cluster wide repair means you cycle through the entire cluster and the repair is sent as one block said repaired everything that’s in this node what the Scylla manager does is first of all breaks it into small pieces it chooses which of those small pieces to repair based on Scylla shard aware architecture so we will say oh this piece belongs to this shard doesn’t cross shards so we know repair this piece and then if this piece ended I will repair another piece that also belongs to this shard to keep utilization equal if you don’t take this into account you can still repair but your utilization is not gonna be equal among shard so it’s much it’s more efficient to run the repairs its resumable because if the node if there is any issue as I said you want to stop the repairs for whatever reason you you stop from this point because you know we track it in a database 

So essentially what happens is that you have a Scylla database that is tracking which ranges have I repaired already which ranges are there to repair which shards those ranges belongs to and manages the process of firing those small repairs so we usually you can control if your repairs are going faster or slower by how many ranges you repaired at the same time you can resume them you can stop it and they’re gonna be running as you asked in the background all the time so again this is the state of the Scylla manager today there’s if you go and install the Scylla manager today I do recommend to do it like I we have the hands-on lab there are people there they’re willing to help you can do it at your own pace as well but if you do it today that’s how it’s going to happen it’s all over SSH you don’t need to install anything into the Scylla nodes as we discussed there are advantages and disadvantages with that with Scylla manager 2.0 we’re moving away from this model 

But right now that’s exactly what you do we don’t need to obsess too much about the actual commands it’s all in the documentation but just to walk it through quickly essentially what you do is you set up the manager you set up the SSH agents that creates a key you don’t have to provide a key it creates a key that it’s used just for that and once you do that you essentially add a cluster to the manager so it’s as I said the manager operates with the granularity of a cluster you can add one two three and then you have tasks inside that and you essentially said you have the repair tasks that come by default you can delete the tasks you can create new tasks change the frequency in which the task happens but once you register the repair task it will happen automatically ever every let’s say five day seven days however much you want the Scylla monitor does manage the monitor does manage as well a little bit so you can see in our monitoring if you have failed segments you can see the progress of repair you can see for instance that this repair is so you can also track in our monitoring solution if you’re using the Scylla manager what’s the progress of your repairs 

Again as I said it is available for free for every Scylla user to use as long as you have at most five nodes if you have more than five nodes then it’s part of our enterprise offering and give it a try and it’s available for a CentOS Ubuntu as well even if you give it a try today people in the other room will be happy to help you just this is really just a summary table of contents that is about what I just said repairs in nodetool repair you can do it but they’re node local you need to go and and maybe gonna use cron maybe gonna start having issues that one repair hasn’t finished yet in one node and the other already started and it becomes hard to distribute the manager repair is global it looks at the cluster as a whole divided in small ranges and goes through them nodetool repair is an operation that you somehow have to go and do it the managers are recurrent as Tomer said nodetool repairs fire and forget you fire it and it’s a big operation if it fails you have to do it all again the manager repairs because it’s based on tokens you might fail this range but the whole operation is preserved no records kept if you do nodetool repair for the manager repair can actually go later and see how the repair was done you keep history for that it’s all kept in Scylla

Nodetool repair is not aware of our architecture so it doesn’t achieve equal utilization among units while the repair from the manager is shard aware nodetool repair the retry is you go and retry there is a retry policy for manager repair so overall I mean as I said it’s better it’s a better repair tool on top of everything else

To report this post you need to login first.