In this lesson you’ll learn about repair and why it is needed. A common misconception is that repairs are a way to fix “things” that are wrong with the database.
So what are repairs? ScyllaDB Repair is a process that runs in the background and synchronizes the data between nodes so that eventually, all the replicas hold the same data. Data stored on nodes can become inconsistent with other replicas over time, which is why repairs are a necessary part of database maintenance. Using ScyllaDB repair makes data on the node consistent with the other nodes in the cluster.
All right, so repairs are one of the things that people get confused about a
lot, especially if you’ve been hearing “repairs, repairs, repairs” and have not
necessarily had a session that explains what they are. We did inherit the term
from Cassandra, but I personally find it not the best, because we tend to think
“there is something wrong with my database and it needs to be repaired.” So,
the first thing: when you run repair, that doesn’t mean there is something
wrong with the database. More than that, repair does fix something, but there
is only one class of issues that is fixed by repair. In part because of the
name, one of the things we very commonly see is people saying, “Well, my
database is slow. I’ll run repair to see if it gets faster.” It won’t get
faster, because that’s not what repairs are designed to do. Or maybe,
“Something happened and I think it’s wrong, so should I run a repair?” Or, “The
logs are saying that I’m out of memory. I’ll run a repair to repair that.” Once
more, repair does not do this for you. So if that’s not what repairs are, what
are they? So, whenever there is a
write (this is the model; you’ve probably seen it in the classes in which data
modeling and the basic architecture of ScyllaDB were discussed), every time
there is a write, and this is also something that a lot of people get wrong,
every replica for that key will receive that write. So when you write
something with consistency level one, that does not mean that we are only
writing to one replica. We are always writing to every replica. If you have a
replication factor of three in one datacenter and a replication factor of
three in the other datacenter, we’re going to write to all six replicas for
that key, always. There is no case in which ScyllaDB would say, “I’m not going
to write to that node.” If the node is down, I cannot write to that node, but
we always try to write to those nodes. But then there is a consistency level.
What the consistency level says is how many replicas have to respond before I
consider this write successful. The consequence of that is that I cannot tell
anything about what happened on the other replicas. Maybe the write was
successful, maybe it was not. There isn’t something I can say; the state
really is unknown. So if you write with quorum or local quorum, which is how
most people write, you’re usually going to have two replicas that are
guaranteed to have this piece of data. Why? Because the write returned
success, so sure, my data is there. But there is nothing you can say about the
third replica. Most of the time it’s just fine, because your network is
without issues. But maybe you had a network issue, or maybe the node was
overloaded and didn’t reply. You don’t know. Then reads. So those were writes;
reads are different.
Reads will try to be effective and efficient. If you say, “I will read with
consistency level one,” that only means (there are exceptions to that, like
the probabilistic case, which we will discuss) that we will try a single
replica, because I want my reads to be as fast as possible. You ask me for
consistency level one or local one, so I’m only going to touch one replica.
What happens if I wrote with quorum? I have a scenario with three nodes: I
wrote with quorum, two nodes have the data, and then I have another node that
may not have it. What happens if I read with local one or one? Anything can
happen. Maybe I hit one of the replicas that has the data. I’m fine; I read
the most up-to-date data. Maybe I read the replica that didn’t respond and was
not part of the quorum, but it has the data anyway, because most of the time
it will, and then my data will be correct. But it could also be that I hit the
replica that happens not to have the write: it doesn’t have that data, or it
has older data for that key. In that case, this replica will return older data
or no data at all.
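The write and read behavior described above can be sketched as a small toy model. This is not ScyllaDB code: the `Replica` class and the `write` and `read` functions are hypothetical names made up for illustration, and a replica that is down simply misses the write. It only shows why a quorum write followed by a read from a single replica may return nothing, while a quorum read always overlaps the quorum write.

```python
# Toy model only: not ScyllaDB code. All names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Replica:
    up: bool = True
    data: dict = field(default_factory=dict)  # key -> (timestamp, value)


def write(replicas, key, value, timestamp, required_acks):
    """Send the write to every replica; succeed if enough of them acknowledge."""
    acks = 0
    for replica in replicas:      # the write always targets all replicas
        if replica.up:            # a down replica cannot receive it
            replica.data[key] = (timestamp, value)
            acks += 1
    return acks >= required_acks


def read(replicas, key, contacted):
    """Read only from the contacted replicas and return the newest value seen."""
    answers = [replicas[i].data[key] for i in contacted if key in replicas[i].data]
    if not answers:
        return None
    return max(answers)[1]        # the highest timestamp wins


replicas = [Replica(), Replica(), Replica(up=False)]  # RF = 3, one node down
quorum = len(replicas) // 2 + 1                       # 2 out of 3

ok = write(replicas, "k", "v1", timestamp=1, required_acks=quorum)
replicas[2].up = True             # the node comes back, still without "k"

print(ok)                                     # True: two replicas acknowledged
print(read(replicas, "k", contacted=[0]))     # v1: a replica that has the data
print(read(replicas, "k", contacted=[2]))     # None: the replica that missed it
print(read(replicas, "k", contacted=[1, 2]))  # v1: a quorum read overlaps the write
```

In this model the third replica stays without the data until something copies it over, which is the gap the rest of the lesson is about.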
This is what repairs are designed to fix. Repairs are designed to make sure
that all replicas have the same copy of the data. In an overwrite scenario,
maybe all replicas have data, but the data in one of them is newer, and then
repair will make sure that everybody has the most up-to-date copy. Is it a
problem to read stale data, or to read no data where you expect data? I don’t
know. This is something that your application has to tell me. If you cannot
ever tolerate that, there is a way to avoid it: you write with quorum and you
read with quorum. Then you are always consulting a majority, and it’s fine.
Maybe you don’t care if you read stale data for a while, because your
application tolerates that. Then you can play with the consistency level. One
way or another, repair is the tool that is going to make sure that all
replicas are up to date. So, in this example here,
I have a replication factor of 3 per datacenter. I have two datacenters, and I
am writing with local quorum. What does local quorum mean? I want a majority of
the nodes in this datacenter to reply, and then I’ll consider the write
successful. So in the first datacenter, the datacenter to which I sent the
write, two nodes are guaranteed to have the data, because the local quorum
write returned success. I don’t know which two, but two of them have the data.
There is another node you can’t say anything about: maybe it has the data,
maybe it doesn’t. And the second datacenter maybe has the data, maybe it
doesn’t. Only after repair are they all going to have the same thing. Now,
there are three types of repair. People call it repair, but in reality there
are three different processes that have repair in their name. The first of
them is synchronous read repair.
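To make the local quorum example above concrete, here is the arithmetic as a short sketch. The datacenter names `dc1` and `dc2` are made up for illustration; the only assumption is the usual majority rule, where a quorum of n replicas is n // 2 + 1.

```python
# Illustrative arithmetic only; the datacenter names are made up.
replication = {"dc1": 3, "dc2": 3}   # replication factor per datacenter


def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


total_replicas = sum(replication.values())       # every write targets all of them
local_quorum_acks = quorum(replication["dc1"])   # majority in the local datacenter
unknown_after_success = total_replicas - local_quorum_acks

print(total_replicas)          # 6
print(local_quorum_acks)       # 2
print(unknown_after_success)   # 4
```

The four replicas in an unknown state are the third node in the local datacenter plus all three nodes in the other datacenter.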
Synchronous read repair, as the name says, happens at read time. That’s why, by
the way, if you read and write with quorum you always have the most up-to-date
data. During a read, in the replication factor of three case, my quorum reads
will contact two replicas. If I contact two replicas, or three, or four, or
however many I contact, and at that time I notice that there is a mismatch
between them, I fix it and then I return the result back to you. So it’s
called synchronous read repair because it happens synchronously in the read
path. So, once more: if you read this data with quorum and it was written with
quorum, even if you happen to hit a node that has the data and a node that
doesn’t have the data, at that point they will both have the data, because
synchronous read repair will have fixed it. There is another situation,
situation number 2, in which
synchronous read repairs happen, and this is whenever I roll the dice. It’s
obviously not a physical die, but there is a chance, 1% if I’m not mistaken,
that I do this second one, and then I’ll talk to all replicas. So, a chance of
1%. We’re not going into how you configure it, but in the table properties you
can configure the chance of this repair happening across datacenters, and you
can configure the chance of it happening within the datacenter. So you can
make it happen more often within the datacenter and less frequently across
datacenters. The default is 0 for cross-datacenter, so by default that doesn’t
ever happen across datacenters, but