Part two of the disk scheduling lesson covers common I/O task patterns, storage I/O profiling, disk scheduling, relevant views in the monitoring dashboard, and related metrics.
Memtable flushes, compaction, streaming and repair: those are usually 128 kilobytes in size, so those are large reads and writes; in ScyllaDB terms there are also large commit log writes. Commit logs are built from 32 megabyte segments, and we pre-allocate that space on disk, so we don't start writing to the disk before we know that we have the full 32 megabytes. That's basically been done to make sure that writing to the commit logs is more optimal: we don't need to go to the OS and ask for space, we save on that, so we only do the I/O to flush the data down, and the OS, or XFS, doesn't need to allocate additional space. We recycle the commit logs to go even further: once I've flushed all the memtables related to that commit log and they're now stored in sstables, we won't give that commit log back to the OS, we'll reuse that space again. That is done, again, to save on allocations and to save file system operations; overwriting existing space is much cheaper than allocating it and giving it back. So that's why nowadays you have commit logs and recycled commit logs. Recycled commit logs are basically junk: ScyllaDB doesn't read them if it crashes and boots up.
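To make the pre-allocation and recycling idea concrete, here is a minimal sketch in Python. It is not ScyllaDB's commit log code (that lives in C++ inside the database); the file names, helper names and the use of posix_fallocate are illustrative assumptions.

```python
import os

SEGMENT_SIZE = 32 * 1024 * 1024  # 32 MB segments, as described above

def allocate_segment(path):
    """Pre-allocate a full segment so later appends never ask the file system for space."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.posix_fallocate(fd, 0, SEGMENT_SIZE)  # reserve all 32 MB up front
    return fd

def recycle_segment(old_path, new_path):
    """Instead of freeing a fully flushed segment, rename it for reuse.

    The blocks stay allocated on disk; the next writer simply overwrites them.
    The old contents are junk and are never replayed on boot.
    """
    os.rename(old_path, new_path)
    return os.open(new_path, os.O_WRONLY)

# Example: create one segment, pretend its memtables were flushed, then reuse it.
fd = allocate_segment("commitlog-1.log")
os.close(fd)
fd = recycle_segment("commitlog-1.log", "commitlog-2.log")
os.close(fd)
```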
If we look at disk I/O task patterns, we can see that they are different: we have query reads that can be 4 kilobytes or larger, we have writes that are usually 128 kilobytes in size, and we have other reads, coming from compaction, streaming and repair, that are usually 128 kilobytes as well (summarized in the sketch below). I'm not talking about commit log reads or hinted handoff, because reading from those is not the regular case, so I'm skipping that.
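A compact summary of those patterns; the names and sizes below just restate what was said above, they are not an official ScyllaDB enumeration:

```python
# Typical request sizes per I/O pattern, as described in the talk.
TYPICAL_REQUEST_SIZE = {
    "query_read": 4 * 1024,          # 4 KB, can be larger
    "write": 128 * 1024,             # memtable flushes, commit log writes
    "background_read": 128 * 1024,   # compaction, streaming, repair
}
```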
Now, we are getting a machine from you, we have no idea what it is capable of doing, and we need to schedule operations. We need to know that we're not sending too much to the disk, or too little to the disk, so we need to know what the disk can basically do. So we have a process that runs during ScyllaDB setup, which is iotune. iotune basically benchmarks your disk and provides information: what is the read and write bandwidth, and what are the read and write IOPS your disk can do. For example, i3.metal write bandwidth is about 6 gigabytes per second and i3.metal read bandwidth is about 15 gigabytes per second, which is huge. So we want to use that information, that writes can only do 6 gigabytes per second while reads can do 15 gigabytes per second.
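For intuition about what such a benchmark measures, here is a very rough sketch; it is not iotune (the real tool uses direct I/O, high concurrency and a much more careful methodology), and the file name, sizes and request counts are placeholder values:

```python
import os, time

TEST_FILE = "iotune_sketch.bin"     # placeholder path
FILE_SIZE = 256 * 1024 * 1024       # 256 MB test area
READ_SIZE = 4 * 1024                # 4 KB random reads
WRITE_SIZE = 128 * 1024             # 128 KB sequential writes

def measure_write_bandwidth():
    """Sequential 128 KB writes; returns bytes/second (page cache not bypassed)."""
    buf = os.urandom(WRITE_SIZE)
    start = time.monotonic()
    with open(TEST_FILE, "wb") as f:
        for _ in range(FILE_SIZE // WRITE_SIZE):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    return FILE_SIZE / (time.monotonic() - start)

def measure_read_iops():
    """Random 4 KB reads; returns operations/second (again, cache effects apply)."""
    import random
    ops = 10_000
    with open(TEST_FILE, "rb") as f:
        start = time.monotonic()
        for _ in range(ops):
            f.seek(random.randrange(0, FILE_SIZE - READ_SIZE))
            f.read(READ_SIZE)
    return ops / (time.monotonic() - start)

if __name__ == "__main__":
    print("write bandwidth ~", measure_write_bandwidth() / 1e9, "GB/s")
    print("read IOPS       ~", measure_read_iops())
```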
Let's talk about the different profiles. We said that we have reads that are 4 kilobytes and we have larger reads; now, what's the difference, why does it matter? If we're profiling it on a disk, we can see that this specific disk handled around 150 concurrent requests at some average latency with 4 kilobyte requests, but if we're looking at 128 kilobyte requests we're down to 30 concurrent requests. So how many I/Os should I send to the disk? Should I tune to 30 requests going down in parallel to the disk, or should I tune to 150, given that you're also doing smaller reads and writes? Legacy ScyllaDB benchmarked only the 4 kilobyte case and tuned itself for it, and then basically we sent too much, and then we had to tell you to tune it down a bit. Nowadays we are much smarter.
Glauber is to blame for that, he wrote the code; it's perfect timing. So, for read I/Os, what we're basically doing is checking how many 4 kilobyte reads or writes we can do, but also what the bandwidth is, and the bandwidth is the counterpart: a disk doing 30 concurrent requests of 128 kilobytes is limited basically by its bandwidth. The same goes for write I/Os, how many writes we can do, but again, writes are usually 128 kilobytes in size.
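A hedged sketch of that reasoning: each request consumes capacity on both the IOPS axis and the bandwidth axis, and whichever it exhausts first is its effective cost. The numbers and the formula below are my illustration, not the actual ScyllaDB/Seastar scheduler code:

```python
# Disk capabilities as iotune might report them (illustrative values).
READ_IOPS = 400_000          # 4 KB requests the disk can complete per second
READ_BANDWIDTH = 3.0e9       # bytes per second for large reads

def request_cost(size_bytes):
    """Fraction of one second of disk capacity a single read consumes.

    Small requests are IOPS-bound (1/IOPS dominates); large requests are
    bandwidth-bound (size/bandwidth dominates). Taking the max means we never
    over-commit either axis.
    """
    return max(1.0 / READ_IOPS, size_bytes / READ_BANDWIDTH)

# How many requests of each size fit in one second of disk time:
for size in (4 * 1024, 128 * 1024):
    print(f"{size // 1024:>4} KB -> ~{int(1.0 / request_cost(size))} requests/s")
```

With these made-up numbers the same disk supports roughly 400,000 small reads per second but only about 23,000 reads of 128 kilobytes, which is exactly why tuning for the 4 kilobyte case alone ends up sending too much.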
And then we have the I/O scheduling: the scheduling basically has to do with queuing up requests and making sure that we have fairness between them. So how do we see how much we're doing, and how does it look? We have the Disk I/O dashboard, and there we can see the I/O operations, we can see the I/O bandwidth, and we can see the queue delay. I/O queue delay means that we have requests that are waiting, which is bad; that translates, again, to latencies.
We can also see the bandwidth we're doing, and if you divide the bandwidth by the number of I/O operations you get the average request size, and you'll see that it's totally different per class: query reads usually come out much smaller than commit log writes, flushes and so forth.
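For example, a tiny helper that turns those two dashboard numbers into an average request size (the inputs here are made up):

```python
def avg_request_size(bandwidth_bytes_per_sec, ops_per_sec):
    """Average request size implied by a bandwidth counter and an ops counter."""
    return bandwidth_bytes_per_sec / ops_per_sec

# 50 MB/s of query reads at 10,000 ops/s comes out around 5 KB per read,
# while 200 MB/s of flush writes at 1,600 ops/s is on the order of 128 KB per write.
print(avg_request_size(50e6, 10_000))   # ~5000 bytes
print(avg_request_size(200e6, 1_600))   # ~125000 bytes
```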
Now, if we're talking about advanced ScyllaDB monitoring: I talked about the query cache, I talked about the sstables, but you don't see them today on the dashboard. That's because we are leaving some for the future, and it's already hard enough for us as well.
But there are metrics. If you go to Prometheus you'll find them; those are the prefixes, and you can see that we have around 20 metrics for sstables and around 20 metrics for the query cache, and if you query them you'll also see a description of what each one means.
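If you want to list them yourself, here is a small sketch against the Prometheus HTTP API; the Prometheus address and the scylla_sstables_ and scylla_cache_ prefixes are assumptions, adjust them to your setup:

```python
import requests  # pip install requests

PROMETHEUS = "http://localhost:9090"                 # assumed Prometheus address
PREFIXES = ("scylla_sstables_", "scylla_cache_")     # assumed metric prefixes

# /api/v1/label/__name__/values returns every metric name Prometheus knows about.
names = requests.get(f"{PROMETHEUS}/api/v1/label/__name__/values").json()["data"]

for prefix in PREFIXES:
    matching = [n for n in names if n.startswith(prefix)]
    print(f"{prefix}: {len(matching)} metrics")
    for name in matching:
        print("  ", name)
```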
And since I'm talking about myths: I have a disk running at 90 percent utilization, does it mean that I have only 10 percent left? The answer is no. The disk has parallelism inside, so you can send 100 and you can send 100 plus. 100 percent tells you that you saturated one path, but it doesn't mean that you cannot saturate an additional path inside the disk. So when you're looking at ScyllaDB metrics versus iostat, 100 percent doesn't mean much; being under 100 means a lot, you're not maximizing everything, but if you're seeing that you're doing 110, that's fine, you may be able to do even more than that.
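A toy model of why a "fully utilized" disk can still take more work: utilization only says the device was busy with at least one request, while throughput keeps scaling until the internal parallelism runs out. The parallelism value and service time below are invented for illustration:

```python
INTERNAL_PARALLELISM = 8      # requests the device can service at once (invented)
SERVICE_TIME = 100e-6         # 100 microseconds per request (invented)

def model(queue_depth):
    """Return (utilization, throughput) for a given number of in-flight requests."""
    in_service = min(queue_depth, INTERNAL_PARALLELISM)
    utilization = 1.0 if queue_depth >= 1 else 0.0   # busy whenever anything is in flight
    throughput = in_service / SERVICE_TIME            # requests completed per second
    return utilization, throughput

for depth in (1, 4, 8, 16):
    util, tput = model(depth)
    # At depth 1 the disk already shows 100% utilization, yet throughput
    # keeps growing until all internal channels are busy.
    print(f"depth={depth:>2}  util={util:.0%}  throughput={tput:,.0f} req/s")
```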
OK, and the rest is homework for you guys.