This lesson explains what ScyllaDB reads from and writes to disk, and goes into detail about the different disk-related tasks: how they are performed, what some common issues are, and how these issues can be resolved.
At a high level, a ScyllaDB node has three resources: CPU, disk, and networking. The disk is usually an SSD or an NVMe device these days. If anybody is still on spinning disks, my condolences — there are users running ScyllaDB on them and it's fine, but ScyllaDB is not optimized for them; seek times are hard, and that's just the way it is.

The reason networking is special here is that we don't have a scheduler for networking today. We have a scheduler for the CPU that context-switches between the different tasks in ScyllaDB, and we have a scheduler for the SSD: we tag every I/O to the disk — whether it comes from compaction, repair, writes, or reads — so we know exactly what is going on. When we write to the network, however, we don't have our own scheduler; we rely on the network scheduler built into Linux. If one connection is used for ongoing reads and writes to a different node, and streaming uses a different connection, Linux will balance between the two. So it's not the complete Wild West — if you're streaming, not everything will break — but we are relying on Linux scheduling for networking.

When we talk about the items that we need to schedule, we have the commit log, memtables, compaction, queries, and streaming and repairs — the task categories that we talked about in the context of scheduling.
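The idea of tagging every I/O with the class it came from, and letting each class compete according to its weight, can be sketched as a tiny weighted scheduler. This is a minimal illustration, not ScyllaDB's actual implementation (the real I/O scheduler lives in the Seastar framework and is far more sophisticated); the class names and share values below are illustrative assumptions.

```python
# Minimal sketch of class-tagged I/O scheduling: every request is submitted
# under a class ("query", "compaction", ...), and each class has "shares"
# that determine how often it is served relative to the others.
# Hypothetical names and numbers; not ScyllaDB's real scheduler.
from collections import deque

class IOScheduler:
    def __init__(self, shares):
        # shares: dict mapping class name -> relative weight
        self.shares = shares
        self.queues = {name: deque() for name in shares}
        self.deficit = {name: 0.0 for name in shares}

    def submit(self, io_class, request):
        # Every I/O is tagged with the class it came from.
        self.queues[io_class].append(request)

    def next_request(self):
        # Serve the pending class with the largest accumulated credit,
        # so classes with more shares get dispatched proportionally more.
        pending = [c for c, q in self.queues.items() if q]
        if not pending:
            return None
        for c in pending:
            self.deficit[c] += self.shares[c]
        winner = max(pending, key=lambda c: self.deficit[c])
        self.deficit[winner] = 0.0
        return winner, self.queues[winner].popleft()
```

With shares of 100 for queries and 50 for compaction, queries get dispatched roughly twice as often while compaction still makes progress — which is the point of tagging I/O by origin.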
So what are we reading and writing? There are basically three kinds of items: sstables, the commit log, and hints. They are not exactly the same — their access patterns differ — and ScyllaDB has evolved to tune itself for each of them, to know how to work with each; I'll talk about that a bit later. For now, let's talk a bit about the read path and how it looks, with an example.
A replica shard is trying to read some data off the disk. The read request comes in, and it goes to the memtables and to the row cache. If the row cache doesn't have the information, the read goes down to the underlying sstables; in this example there are four of them. The grayed-out components — memtables, row cache, bloom filters, and summaries — are all items in memory. The blue components — the index and the data files — are on disk; these are the components that require an I/O to read.

Let's look at the example: we are reading partition 8, row 1, a row with three columns, A, B, and C. In the first case, the row cache does have all the information: we fetch from the memtable, we find the row cached in the row cache, and we return the merged result. That was simple — there was no I/O; we read from the memtable and from the row cache and everything was perfect.

In the next case we weren't lucky: the row isn't cached, so we may have to go to the disk. We go to the sstables and start with the bloom filters, and we see that partition 8 (p8) may exist in the first sstable, in the second, and in the third, but definitely does not exist in the last one. So we're down to three sstables that we need to continue reading from. A bloom filter gives us a definitive answer only in the negative: if it says the partition is not in an sstable, that's always correct; if it says the partition is in an sstable, that may be a false positive — the partition may not actually exist there.

Next we go to the summary file, and we search for where this partition may be located in the index file. Then we access the index — the index is on disk again — and we see that two of the sstables actually have partition 8, while the third one doesn't. So we're down to two sstables, but we had to do one I/O that was wasted, because the bloom filter gave us a false positive — it provided information, but it wasn't true. We then consult the compression mapping, which translates the positions coming from the index, find the information in the data file, read it, merge it, insert the merged result into the row cache, and return it back.
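The read path above can be sketched in a few lines. This is an illustrative model using in-memory dicts and sets — real sstable components live on disk, and the class and function names here are hypothetical, not ScyllaDB's:

```python
# Sketch of the single-partition read path described above:
# memtable -> row cache -> per-sstable bloom filter -> index -> data, merge.

class SSTable:
    def __init__(self, data, bloom_extra=frozenset()):
        self.data = data                      # partition -> {column: value}
        # A bloom filter may say "yes" for partitions it does not hold
        # (false positives), but never "no" for partitions it does hold.
        self.bloom = set(data) | set(bloom_extra)

    def may_contain(self, pk):
        return pk in self.bloom               # no false negatives

def read_partition(pk, memtable, row_cache, sstables):
    if pk in row_cache:
        return row_cache[pk]                  # best case: no disk I/O at all
    result = dict(memtable.get(pk, {}))       # memtable data is newest
    for sst in sstables:
        if not sst.may_contain(pk):
            continue                          # definitive "no": skip the I/O
        # A false positive costs us an index lookup that finds nothing.
        rows = sst.data.get(pk)
        if rows:
            for col, val in rows.items():
                result.setdefault(col, val)   # newer values win the merge
    row_cache[pk] = result                    # populate the cache on the way out
    return result
```

Note how the sstable with a bloom-filter false positive still costs a lookup that returns nothing, while the sstable whose filter says "no" is skipped without any I/O — exactly the trade-off described above.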
So, for the single-partition read path: index reads can be small or mid-sized, while data reads depend entirely on the size of your partitions or rows. If the partitions are large, we will be reading large chunks; if the partitions are small, we may be reading small chunks. As you sequentially read through a large partition, ScyllaDB will start doing readaheads. Readaheads are basically speculative reads from the disk: we read data off the disk that you may or may not end up using, and as you continue to read through a large partition, this reduces the latency.
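The readahead idea can be sketched as follows. This is a simplified model under assumed parameters — the chunk size, prefetch window, and sequential-detection heuristic are illustrative, not ScyllaDB's actual values or logic:

```python
# Sketch of speculative readahead: once sequential access is detected,
# prefetch the next chunks before they are requested, hiding disk latency.
# CHUNK and WINDOW are illustrative numbers.

CHUNK = 4096        # bytes per read
WINDOW = 2          # how many chunks to prefetch ahead

class ReadaheadFile:
    def __init__(self, raw_read):
        self.raw_read = raw_read        # raw_read(offset, size) -> bytes
        self.cache = {}                 # offset -> prefetched chunk
        self.last_offset = None

    def read(self, offset):
        sequential = (self.last_offset is not None
                      and offset == self.last_offset + CHUNK)
        self.last_offset = offset
        if sequential:
            # Sequential pattern detected: speculatively fetch ahead.
            for i in range(1, WINDOW + 1):
                ahead = offset + i * CHUNK
                if ahead not in self.cache:
                    self.cache[ahead] = self.raw_read(ahead, CHUNK)
        if offset in self.cache:
            return self.cache.pop(offset)   # served from the prefetch buffer
        return self.raw_read(offset, CHUNK)
```

After the second sequential read, later reads are served from the prefetch buffer — the disk work was already done speculatively, which is where the latency reduction comes from. The cost is that some prefetched chunks may never be used.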
Next: ScyllaDB caches the readers. Say I'm reading a huge partition that has hundreds of thousands of rows, and I'm reading that full partition. I return the results in chunks — in pages — because I cannot return all those 100,000 rows in a single response; that would kill ScyllaDB. So I return, say, 5,000 rows at a time, and ScyllaDB caches the readers between one request and the consecutive one: we already have buffers in memory that hold the next rows. ScyllaDB caches those readers and is able to reuse them.

So anybody who thinks they're smarter than ScyllaDB at doing paging is wrong. You won't be able to do it better; you're only hurting yourself by not being able to use the cached readers. If you submit a fresh request starting from a specific partition and row, it will go to the index, find out which sstables have that position, and start reading — dropping the readaheads and the buffers that we already have in memory. You cannot do paged reading better than ScyllaDB does; that's the moral of the story.
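The difference between resuming with the server's paging state and rolling your own paging can be sketched like this. This is a toy model under assumed names — the `Server`, the token scheme, and the page size are all hypothetical, standing in for ScyllaDB's cached readers:

```python
# Sketch of why server-side paging wins: a client that resumes with the
# opaque paging-state token reuses the cached reader (and its buffers),
# while a client that restarts each page forces a new reader every time.
import itertools

PAGE_SIZE = 3

class Server:
    def __init__(self, rows):
        self.rows = rows
        self.reader_cache = {}        # paging-state token -> reader position
        self.reader_opens = 0         # each open = index lookup + fresh buffers
        self._tokens = itertools.count()

    def query(self, paging_state=None, start=0):
        if paging_state is not None and paging_state in self.reader_cache:
            pos = self.reader_cache.pop(paging_state)   # reuse cached reader
        else:
            self.reader_opens += 1                      # cold start: new reader
            pos = start
        page = self.rows[pos:pos + PAGE_SIZE]
        pos += PAGE_SIZE
        if pos >= len(self.rows):
            return page, None                           # no more pages
        token = next(self._tokens)
        self.reader_cache[token] = pos                  # keep reader for next page
        return page, token
```

The well-behaved client opens one reader for the whole partition; the do-it-yourself client pays the cold-start cost on every page — which is exactly the "you're killing yourself" point above.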