
Lightweight Transactions Under the Hood

11 min to complete

So how is LWT implemented in ScyllaDB? Paxos was invented as an algorithm for achieving consensus on a single value over unreliable communication channels. Many parts of the algorithm are left to implementers, so it can be tailored to solving the problem of database replication. In ScyllaDB, the algorithm participants are replicas responsible for a given partition key. When a client suggests a change to the key (any modification statement can be represented as a partition mutation), a coordinator node acting on the client’s behalf ensures that the majority of replicas holding the key accept the change. Any node in the cluster can be a coordinator for some change. This is done in two steps: first, the majority of replicas responsible for the key make a promise to the coordinator to accept the change, if the coordinator decides to make it. This step is necessary to make sure that no two concurrent coordinators “split” the history, when some replicas accept changes from one coordinator, and others from another. Essentially it temporarily locks out other changes and allows them to happen one at a time. After the coordinator receives a majority of promises, it suggests a change. If the change is accepted by the majority, the algorithm achieved progress. The main caveat is that the protocol is very expensive: it’s 4 times more expensive than a usual write in terms of network latency, and a hundred times in terms of I/O since it incurs a read of the old row. By any measure, the network latency dominates I/O costs, but these costs should not be discounted either: fetching whole pages of an LSM tree can saturate I/O bandwidth way before the network bandwidth limit is reached. Some of the protocol RPCs could be collapsed, and work in ScyllaDB has begun to this end.