Let’s take a look under the hood
You probably heard at a previous ScyllaDB Summit (by the way, how many people here actually attended the previous summit? Quite a few) that ScyllaDB was going to use the Raft consensus protocol for lightweight transactions. Jumping ahead: we changed that decision. We are still working on Raft, and at the end I will look at how we are going to use Raft and for what, but first let me explain why we do not use Raft for lightweight transactions.

ScyllaDB has a shared-nothing architecture. This is the classical picture of the consistent hash ring: you have multiple nodes on the ring. You hash the partition key and get a token, and that token falls somewhere on the ring. You also hash the node ID and place it on the ring, and every node is responsible for a range of tokens on the consistent hash ring.
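As an illustration only (not ScyllaDB's actual hash function or token layout), a minimal sketch of how a token ring maps a partition key to the node that owns it might look like this:

```python
import hashlib
from bisect import bisect_right

RING_SIZE = 2**64

def token(value: str) -> int:
    # Illustrative hash only; ScyllaDB actually derives tokens with Murmur3.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self, node_ids):
        # Hash each node ID and place it on the ring.
        self.points = sorted((token(n), n) for n in node_ids)
        self.tokens = [t for t, _ in self.points]

    def owner(self, partition_key: str) -> str:
        # Hash the partition key to a token, then walk clockwise to the next node;
        # that node owns the token range the key falls into.
        t = token(partition_key)
        idx = bisect_right(self.tokens, t) % len(self.tokens)
        return self.points[idx][1]

ring = Ring(["node-1", "node-2", "node-3"])
print(ring.owner("user:42"))  # one of node-1..node-3, depending on the hashes
```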
Everybody knows that picture. Now we add VNodes. (How many people are not familiar with VNodes? So everybody is familiar with them.) The reason we add them is to even out the token ranges between physical nodes.

Let's look at it from the replication point of view. Every VNode has a primary replica responsible for it and a couple of secondaries, which are selected as a product of a hash function. If you have a lot of VNodes, you get a lot of combinations of primary and secondary replicas; let's call them replication groups, as academia does. So the more VNodes you have, the more replication groups you have.

There are two classes of protocols for isolation and consistency: leader-based protocols and leaderless protocols. Leader-based protocols basically say: let's elect a leader once, and this leader is going to decide which transaction goes first and which goes, say, second; it decides on the global order of transactions. Leaderless protocols say: let's select a leader for every transaction. The obvious advantage of a leader-based protocol is that you don't have to select a leader on every transaction, so you don't need that extra round of negotiation. The disadvantage is that you have to keep state: who the leader is, whether it is alive, and what the latest transaction is, for every leader.

Now, we have VNodes, so we have lots of replication groups. The number of replication groups is actually a binomial coefficient in the number of VNodes: for, say, 16 VNodes and a replication factor of 3 that is 560 replication groups, and for 2,560 VNodes it is already enormous. That is a lot of state to maintain.
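To put numbers on that, here is a tiny sketch of the combinatorics referred to above; the 16-VNode figure matches the 560 quoted in the talk, and the larger count is the same formula applied to a bigger ring:

```python
from math import comb

def replication_groups(vnodes: int, rf: int = 3) -> int:
    # Possible combinations of primary + secondary replicas ("replication groups")
    # if any rf-sized combination of VNodes can form one.
    return comb(vnodes, rf)

print(replication_groups(16))     # 560, the figure quoted above
print(replication_groups(2560))   # ~2.8 billion -- far too many leaders to track
```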
And let's not forget another property of ScyllaDB: unlike Cassandra, we don't stop at VNodes, we actually have CNodes. CNodes is the name for the way every VNode is partitioned within a node: because we have shards, we slice every VNode across all of the shards, so that every shard owns a piece of a VNode. And we slice it many, many times; not once, not twice, but four thousand times, so every VNode is very carefully sliced across shards. So, given the number of groups, if we directly applied the Raft paradigm to ScyllaDB, it's just not going to work: too much state to maintain with leaders. So we had to use a leaderless protocol. Leaderless protocols are more robust and don't need that state, but you pay for it with an extra round of negotiation, as I said. Let's take a look at how it works.
Paxos is a protocol that was invented for decision-making in distributed, unreliable networks. It was not invented for replication, but it is often used for it, and it can be used for replication. When you look at how it is used for replication, you can see that Paxos is actually an umbrella term: there are lots of adaptations and adjustments to make it work in a database environment. But one thing is certain: before you can actually choose a value, apply a new mutation, you need to negotiate who is going to do that. That is the first round of Paxos. Imagine you don't do it; then it may happen that one replica accepts one mutation and another replica accepts another, and this is irreversible: your history of changes splits and you can never reconcile it. So first you need to sort of lock everybody out, and this is done with a concept called ballots. The replica which acts on behalf of the client says: could you promise me to accept a value if I come up with one? And the other replicas reply; if there is no contention, they reply "OK, I will accept the value."

The second round of Paxos is: OK, I got the majority of responses, I know I locked everybody out, I'm going to propose a new value for this row. And if everything is good, no timeouts, no crashes, nothing like that, then the majority of replicas respond: OK, we accept that mutation; we accept your proposal for a new value, for a new version of the record.

Finally, the coordinator needs to know the outcome. This is like a distributed state machine: its state changes as soon as the majority of replicas accept the value, but when the state changes, it's not like everybody instantly knows it has changed. We need another round of responses; we need to collect responses to actually find this out. And when we find this out, the coordinator is responsible for saying: hey, I know that the decision is made, let's actually store the mutation in the base table. This is the Learn round. And since we have lightweight transactions, conditions and so on, we have an extra round where we actually need to read the record and check the conditions. These two rounds could be collapsed, and we are working on making sure that we actually collapse them into a single one.
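As a rough, coordinator-side sketch of the rounds just described (prepare/promise with ballots, the extra read for the condition check, propose/accept, and learn), with in-memory stand-ins for the replicas rather than ScyllaDB's actual implementation:

```python
import time
from dataclasses import dataclass, field

QUORUM = 2  # majority out of the 3 replicas in this sketch

@dataclass
class Replica:
    """In-memory stand-in for one replica's (normally persistent) Paxos state."""
    promised: int = 0        # highest ballot this replica has promised
    accepted: tuple = None   # (ballot, mutation) accepted but not yet learned
    row: dict = field(default_factory=dict)  # the base-table row itself

    def promise(self, ballot):
        # Round 1: promise to accept only values carrying a ballot >= this one.
        if ballot > self.promised:
            self.promised = ballot
            return True
        return False  # a higher ballot is already in flight (contention)

    def accept(self, ballot, mutation):
        # Round 2: accept the proposed new value if the promise still holds.
        if ballot >= self.promised:
            self.accepted = (ballot, mutation)
            return True
        return False

    def learn(self, ballot):
        # Learn round: the decision is made, store the mutation in the base table.
        if self.accepted and self.accepted[0] == ballot:
            self.row.update(self.accepted[1])
            self.accepted = None

def lwt_write(replicas, condition, mutation):
    ballot = time.monotonic_ns()  # stand-in for a time-based ballot

    # Round 1: prepare/promise -- "lock everybody out" under this ballot.
    if sum(r.promise(ballot) for r in replicas) < QUORUM:
        raise RuntimeError("lost the ballot race, retry with backoff")

    # Extra read round for LWT: fetch the row and check the IF condition.
    if not condition(replicas[0].row):   # a real system reads a quorum, not one replica
        return "not applied"

    # Round 2: propose/accept -- a majority must accept the new value.
    if sum(r.accept(ballot, mutation) for r in replicas) < QUORUM:
        raise RuntimeError("no majority accepted; the outcome is uncertain")

    # Learn round: the coordinator now knows the decision and spreads it.
    for r in replicas:
        r.learn(ballot)
    return "applied"

replicas = [Replica(), Replica(), Replica()]
print(lwt_write(replicas, lambda row: "id" not in row, {"id": 1}))  # applied
print(lwt_write(replicas, lambda row: "id" not in row, {"id": 2}))  # not applied
```

In the real system this per-row state is persisted rather than kept in memory, which is exactly the system-table state discussed a bit later.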
As you can see from this saga-style decision-making, there are some flaws in this approach. So what are they, and how do you live with them?

One is that this is quite expensive. We have seen that in benchmarks, we have seen it in diagrams: it's an expensive thing, so use it wisely, for the things that really require consistency, not for everything in your application. And we are going to work on reducing the number of rounds and improving efficiency.

The second issue is the starvation that I demonstrated with the benchmarks. There are actually solutions for that; one is called Paxos leases, and it reduces starvation quite a bit. I hope we are going to get to it quite soon, so this one is on us.

The third issue that you need to be aware of is the new kind of state you have to handle. With eventual consistency you can always retry; you know that if you supply the same mutation with the same timestamp... well, actually you can't always retry. Anyway, with Paxos there is extra uncertainty in the system. Naturally we expect a database to execute a query exactly once, and if there is an execution failure we expect that the query was not applied. But this is not always true: you can actually get a failure while the database has in fact stored the record. You got the failure afterwards; it's committed, it's written to the replicas, but you just didn't get the message back. With Paxos the chances of this occurring are much higher, and you need to be aware of it: you can get a timeout error, but the mutation is actually accepted and successfully committed. Many users are surprised by that. One thing we can do on the database side, so that these messages don't catch you out so often, is to provide you with good diagnostics when this happens, so you can actually understand where and what happened and make decisions accordingly.
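On the application side, one common way to cope with that uncertainty is to treat a timeout on a conditional write as "unknown" and re-read the row at SERIAL consistency to see whether it actually applied. A sketch using the Python driver follows; the keyspace, table, and column names are made up, and your retry policy will differ:

```python
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_ks")  # hypothetical keyspace and table below

insert = SimpleStatement(
    "INSERT INTO accounts (id, balance) VALUES (42, 100) IF NOT EXISTS",
    serial_consistency_level=ConsistencyLevel.SERIAL,
)

try:
    # For an LWT the first column of the result is the [applied] flag.
    applied = session.execute(insert).one()[0]
except WriteTimeout:
    # The Paxos round timed out: the write may still have been committed.
    # Re-read at SERIAL consistency to learn the actual outcome; a SERIAL read
    # also completes any in-flight Paxos decision for this partition.
    check = SimpleStatement(
        "SELECT id FROM accounts WHERE id = 42",
        consistency_level=ConsistencyLevel.SERIAL,
    )
    applied = session.execute(check).one() is not None

print("write applied:", applied)
```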
Finally, Paxos needs to persist its intermediate state somewhere. All of this is not just "hey, here is the ballot, here is the promise"; it is all persistent, so that if a node crashes or restarts, the state is preserved across restarts. We added a new system table to store this state. This system table has a TTL; the default TTL for expiration is about three hours, but you still need to account for it in capacity planning, because it stores essentially three hours' worth of in-flight transactions. We are also working on reducing the amount of state you have to store, by proactively removing data that has already been successfully applied from the table.
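As a back-of-the-envelope illustration only (the per-operation state size and throughput below are assumptions, not measured ScyllaDB numbers), the capacity impact of that three-hour window can be estimated like this:

```python
# Rough sizing of the Paxos system table under the default ~3 hour TTL.
lwt_per_second = 1_000        # assumed LWT rate, purely for illustration
state_bytes_per_op = 1_024    # assumed ballot/promise/proposal state per operation
ttl_seconds = 3 * 60 * 60     # the default expiration window mentioned above

retained = lwt_per_second * state_bytes_per_op * ttl_seconds
print(f"~{retained / 2**30:.1f} GiB of in-flight Paxos state")  # ~10.3 GiB
```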