This session goes over the performance of LWT (lightweight transactions). It compares different use cases, with and without LWT, in terms of throughput and latency.
So let's look at performance. As I said, this is not great, but this is fair, this is pretty good, and this is obviously better than Cassandra. Let's take a look at it. We used a couple of setups for the performance tests. Jumping ahead: the performance is dominated by latency and by CPU availability, so it is both CPU bound and latency bound. This is the description of the first setup. It is within a single data center, so the latency is quite low. These are three rather small-to-medium nodes, but they do showcase the bottlenecks and the limits of lightweight transactions. Then we look at the multi-DC setup. We use concurrent connections to look at the throughput: how many statements can we get through with lightweight transactions? And please bear in mind that this is the first implementation, so
we are going to keep working on performance. We compare lightweight-transaction performance with eventual-consistency performance. What is interesting on this chart is hitting the scalability limit: with 20 clients we almost reached it, and with 40 clients we saturated it, at around 10,000 transactions per second; after that the curve becomes flat. If you look at the latency, it becomes worse, and the reason is that the CPU gets more and more saturated, so more and more statements wait in the queue before they even get to execution. We also see that the scalability of eventually consistent statements is much better. Obviously, this is a first approach to the target, and we are going to make sure the curve does not go flat that quickly. But bear in mind, as we will see when we look at the internals, that the work that needs to be done to ensure consistency is pretty much four times the work of eventual consistency, and on top of that a lightweight transaction has to do a read; remember,
I said a read is very expensive. So far this was the uncontended case; what if we try to hit the same key from multiple clients? Things are not shiny here either, and this chart is probably the number one reason why we still have the --experimental switch: if you hit the same key, you can only get to about 500 queries per second, and that is the max regardless of how much hardware we use.
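Why adding hardware or clients does not help here can be illustrated with a toy model (this is an analogy in the style of slotted contention models, not ScyllaDB's actual Paxos implementation): treat time as slots, assume each of the n clients proposes for the key in a slot with probability p, and assume a slot commits only when exactly one client proposes, since concurrent proposers preempt each other's ballots. The per-key throughput then has a hard ceiling no matter how large n gets:

```python
def commit_probability(n_clients: int, p: float) -> float:
    """Chance that exactly one of n clients proposes in a slot.

    In this toy model that is the only case in which the slot
    commits a transaction; two or more proposers preempt each other.
    """
    return n_clients * p * (1 - p) ** (n_clients - 1)

# Even with each client proposing at its optimal rate p = 1/n,
# the commit rate per slot tops out near 1/e (about 0.37):
for n in (1, 2, 10, 100):
    print(n, round(commit_probability(n, 1 / n), 3))
```

With the optimal proposal rate the ceiling stays around 0.37 commits per slot however many clients you add, which is the same qualitative behaviour as the flat 500-queries-per-second line on the chart.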
This has to do with the liveness of the Paxos protocol, which is one of the most elegant, but also one of the hardest to understand, protocols in distributed computing; I hope you will get to it. The latency is also not good (this is in milliseconds), so with a lot of clients we basically time out. I should say that industry and academia have solutions for this.
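One widely used remedy, sketched here as an assumption about how a client could react rather than a description of what the drivers do today, is to retry a contended transaction after randomized exponential backoff, so that competing clients spread out in time instead of colliding again:

```python
import random

def backoff_delays_ms(attempts: int, base_ms: float = 5.0,
                      cap_ms: float = 320.0, seed: int = 42) -> list[float]:
    """Full-jitter exponential backoff schedule for LWT retries.

    Before retry k the client sleeps a random time drawn from
    [0, min(cap_ms, base_ms * 2**k)]; the jitter is what breaks
    the lock-step collisions between competing clients.
    """
    rng = random.Random(seed)
    return [rng.uniform(0.0, min(cap_ms, base_ms * 2 ** k))
            for k in range(attempts)]

print([round(d, 1) for d in backoff_delays_ms(6)])
```

The base delay, cap, and retry count here are made-up illustration values; real clients would tune them against the observed contention.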
It is just a very fresh feature, and I am sure that in the upcoming months we
will fix that too. The second setup is multiple regions; the RTT is 20 milliseconds. What is good about this benchmark is that it demonstrates that you do not actually pay in CPU cost for using multiple regions; you pay in the time it takes to saturate the system. We need 1,500 clients to actually reach 10,000 queries per second on this setup, but we still reach the same 10,000 queries per second, and that is the important part.
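Those client counts are consistent with Little's law (concurrency equals throughput times latency). If 1,500 in-flight clients sustain 10,000 queries per second, the implied per-operation latency is about 150 ms, which is plausible when each lightweight transaction needs several round trips over a 20 ms RTT plus some queueing; in the low-latency single-DC setup the same throughput needs far fewer clients. A quick check:

```python
def implied_latency_s(clients: int, qps: float) -> float:
    # Little's law: L = lambda * W, so W = L / lambda.
    return clients / qps

def clients_needed(qps: float, latency_s: float) -> float:
    # ...and, the other way around, L = lambda * W.
    return qps * latency_s

print(implied_latency_s(1500, 10_000))  # 0.15 s per operation
print(clients_needed(10_000, 0.15))     # 1500.0 clients
```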
As for scalability, you can see that with 3,000 clients we can actually get to 100,000 eventually consistent statements per second, but only 10,000 lightweight transactions per second. The latency behaves in the expected way: it grows primarily because the CPU is saturated, so queries have to wait outside the system before they get into it. We added some
metrics to help monitor the usage of lightweight transactions; these metrics will soon appear on our standard Grafana dashboards. The important part is the metrics for the coordinator: there are histograms and counters for, basically, everything that can go wrong, and everything here is important: latency spikes, timeouts, unavailable errors, contention errors, unfinished_commits, and also condition_not_met, which means you are supplying a change that is not applied, so you are probably doing useless work. This is a screenshot of the metrics going up and down while we were torturing the cluster.
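As a usage example for these counters, assuming hypothetical counter readings named after the metrics above (the real metric names and the way you query them may differ), the condition_not_met counter lets you estimate how much of your LWT load is useless work:

```python
def wasted_work_ratio(applied: int, condition_not_met: int) -> float:
    """Fraction of conditional statements whose IF clause failed.

    These statements still paid the full lightweight-transaction
    cost (several round trips plus a read) but changed no data.
    """
    total = applied + condition_not_met
    return condition_not_met / total if total else 0.0

# Example counter readings, made up for illustration:
print(wasted_work_ratio(applied=9_000, condition_not_met=1_000))  # 0.1
```

A persistently high ratio suggests the application is issuing conditional updates whose conditions rarely hold, and is paying Paxos prices for no effect.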