This lesson is about ScyllaDB Monitoring, Manager, dashboards, how to troubleshoot and debug ScyllaDB, memory management, and some other advanced topics.
My name is Shlomi, I manage the development team, and I'll be talking about advanced monitoring. When I was asked to do this session, the request was: tell people a bit more about ScyllaDB internals and how it's built, so they'll be able to understand how to monitor ScyllaDB. Which is great, but it ended up as a slide deck of 140 slides, and it's all here.
We're not going to go through all 140 slides, we don't have the time, so I may skip a bit. I'm going to start with how to monitor the system, and a large part of the talk will be how to debug an issue. Many times I'm asked to look at a system, or someone else is asked to, and we rely on a lot of gut feelings. I've tried to brain-dump those gut feelings into something cohesive enough for you to use, so hopefully you can catch some issues on your own. So, how to monitor.
ScyllaDB Monitoring is the number one tool we use every day. If you don't have ScyllaDB Monitoring, install ScyllaDB Monitoring, or we won't talk to you. It's the number one tool mainly because it's the only way we can see inside the system.
ScyllaDB exposes a huge number of metrics, hundreds of thousands of them, and ScyllaDB Monitoring is how we find out what's going on inside the system. nodetool, of course, is legacy; there are the ScyllaDB logs; and when nothing else works we use Linux tooling. Some techniques. First, it starts at development time: finding out why a system is not working, or not working as well as it should, or not as well as it did yesterday, has a lot to do with debugging your code. You have unit tests and so forth, but you can also use monitoring to watch the system while you're developing the code. Second, you need to know your baseline. Usually when I'm asked to look at a system, I ask when it stopped working, and since they have the monitoring up, I scroll back, look at the history of the system, and try to find out what changed and at what point things started going wrong. Knowing how the system behaves when it works well is key to finding what broke.
Last, if you have background processes running, backups, repairs, and so forth, be aware of them. They will change the way the system behaves, and if you're sensitive to latency you may need to schedule them accordingly. Proactive monitoring: these are things we use in the cloud. Basically we try to catch an issue before it happens or before the customer notices. Dan talked about ScyllaDB Manager, and a CQL ping is one of the ways we monitor the system. The other part is Alertmanager, which is part of the ScyllaDB Monitoring stack. We ship default alerts and you can add your own; in the cloud we define alerts on latencies for customers and so forth, to notice when latency starts building up before there is a big issue.
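As an illustration of the CQL-ping idea, here is a minimal sketch of a liveness probe using the Python driver; the hosts, the timeout, and the choice of query against system.local are my assumptions for the example, not part of the talk.

```python
# Minimal CQL "ping" sketch (assumptions: the Python cassandra/scylla driver is
# installed and the contact points are reachable; tune timeouts and wire this
# into your own alerting).
from cassandra.cluster import Cluster

def cql_ping(contact_points, timeout=2.0):
    cluster = Cluster(contact_points=contact_points, connect_timeout=timeout)
    try:
        session = cluster.connect()
        # A trivial read that exercises the CQL path on the coordinator.
        session.execute("SELECT release_version FROM system.local")
        return True
    except Exception:
        return False
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    ok = cql_ping(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print("CQL ping:", "OK" if ok else "FAILED")
```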
In order to understand what I'm going to talk about, there are three or four slides about the architecture, so hopefully this will be very fast. One: we're talking about a cluster, which means the client sends an operation to a coordinator node, which may or may not be a replica, and that node then talks to other nodes as well. We need to take this into account when we're debugging or looking at the system. Next, and I'm not sure you've seen this in this level of detail, there are multiple components in ScyllaDB. First there is the client, your application. Next there is a CQL front end that processes your CQL requests. Then we hit something called the storage proxy; it's an internal name, but it's basically the coordinator of your requests. When a CQL request, an insert or a query, is parsed by the CQL layer, it's handed to the storage proxy, which figures out the token and the replicas, which nodes we need to talk to. Below the storage proxy there is the database, the local part that stores the information; below the database there is the cache, which we are very proud of; and then there is the persistency layer, the SSTables. The compaction manager talks to the database to check whether compactions are keeping up, whether we're building up a huge number of memtables that need to be flushed to disk, and it looks at the SSTables as well.
Aside from that, there is a gossip component that makes sure the cluster is healthy and working, and of course there is repair, which is a background process that talks to multiple nodes and streams data. Those are the components; I'll come back to some of them in later slides. There are different task categories in ScyllaDB, and most of them are background. The foreground ones are your requests, the read and write requests. In the background there are also read and write requests: if I'm doing a write at consistency level QUORUM, I wait for two replicas to respond, and the remaining write completes in the background. Yet that background write may become a foreground write, and that happens if ScyllaDB detects it's not keeping up: it will automatically shift your request to effectively wait for all the replicas to respond, not literally CL=ALL, if it detects it's not keeping up with the write rate.
So we have a background process that, from your standpoint, was turned into a foreground process. The same happens with reads. I'm reading with CL=ONE, but it may trigger a background process, and that is read repair. With read repair chance we go to all the replicas, get the information, and try to repair. And if we find inconsistencies between two replicas, say we did a read at QUORUM and there was a mismatch between two replicas, that read becomes a foreground read: we wait for all the replicas, fix the data, and only then return a response. That is a background process that became a foreground process.
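For reference, here is a small sketch of how consistency levels are set from the client side with the Python driver; the keyspace, table, and hosts are made up for the example.

```python
# Sketch: setting consistency levels per statement (Python driver; keyspace,
# table, and hosts are hypothetical).
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

# Write at QUORUM: the coordinator replies once two of three replicas ack; the
# third replica is written in the background (unless ScyllaDB decides it must
# apply backpressure because it is not keeping up).
write = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (42, "hello"))

# Read at ONE: fastest, but may trigger a background read repair if replicas
# disagree, and that repair can turn into foreground work.
read = SimpleStatement(
    "SELECT payload FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
row = session.execute(read, (42,)).one()
```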
Memtable flushing: while I'm writing into the database, we accumulate the writes in something called a memtable, and then we need to flush it to disk. ScyllaDB keeps multiple memtables, one that is accumulating data and another that is being flushed to disk. Now, if we're not flushing to disk fast enough, then writing into the memtable effectively becomes a foreground concern, because we start slowing down your writes: we only admit a write into a new memtable once we've evacuated memory by writing a row, a partition, a chunk of memory, to disk. So again, memtable flushing is usually a background process, but it can become something that affects the foreground workload, and that translates to the system being slower.
The commit log is the same: if I'm flushing commit log segments and not handling that fast enough, it will influence the write speed and may back-pressure towards the client. Compactions are the same. Then there are streaming and repair; streaming and repair are always background, they cannot become foreground processes. So what makes up a steady-state, healthy system? It's a long list, but in my view it means all nodes are up and running: we don't have a node down and we're not adding a node.
It's not that ScyllaDB doesn't operate in such cases; it operates very well and we test it. But the effect of a node being down, or of a node being added, is additional work in the cluster, and that is something we need to detect. Next, clients are driving traffic. You're writing your application and it seems simple, insert into the database, read from the database, but it's actually a bit more complex, and for some of the faces here I know of cases where it wasn't as simple as it sounds. One aspect is connection balance: if you're using the ScyllaDB shard-aware drivers you're safe; if not, you may hit something I'll talk about.
Then there's the amount of traffic. It's not enough to be connected to all the shards; you can't use a single connection. If you do, the shard-awareness didn't really help, you're still sending all the traffic to a single shard in the system. The queries are the same story: I'm using the shard-aware driver and sending a lot of traffic, but only on a single connection, or I'm sending all my batches to a single node; that doesn't work either, and it will affect the latencies of that node. Requests for partitions and rows should be balanced. An imbalanced state is, for example, 50% of the read requests hitting a single partition; that is known as a hot partition. You're creating the data model, you know your application, you're controlling the requests, and the question is how that affects the system.
There are three more items in the same family: a large partition, a large row, and a large cell, and all of them have an effect on the system. So basically a steady-state healthy system means all the nodes are up, your client application is doing what you think it's supposed to do, sending similar traffic to all the nodes, to all the shards, on all connections, and you have a data model whose requests don't create hot partitions or hot shards, with the data balanced pretty much evenly. If you have an imbalance, that's a likely cause for something breaking.
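As an illustration of the data-model side (my example, not from the talk), a common way to avoid hot and oversized partitions is to bucket the partition key, for instance by time; the schema below is hypothetical.

```python
# Hypothetical schema sketch: bucketing a time-series partition key so one
# device's data does not pile up into a single hot, ever-growing partition.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        device_id text,
        day       date,        -- bucket: one partition per device per day
        ts        timestamp,
        value     double,
        PRIMARY KEY ((device_id, day), ts)
    )
""")

insert = session.prepare(
    "INSERT INTO readings (device_id, day, ts, value) VALUES (?, ?, ?, ?)")

def insert_reading(device_id, ts, value):
    # The (device_id, day) composite partition key spreads one device's writes
    # across many partitions instead of concentrating them on one shard.
    session.execute(insert, (device_id, ts.date(), ts, value))

insert_reading("sensor-17", datetime.now(timezone.utc), 21.5)
```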
How to debug? Let's start. Usually the first thing we look at is the monitoring Overview dashboard. It's a very large dashboard and we keep adding to it, so I admit I'm not keeping up, but the most important part is on the left-hand side: are all nodes up, is everything green? Next, look at the alerts. We add alerts for things you should be aware of, things that pinpoint what the problem in the system is; looking at the alerts can already tell you what is broken or what you need to check. So that part is easy; say everything was green. Next come the errors. We added a dashboard called Errors, and it accumulates errors coming from different parts of the system. To be clear, this dashboard still needs to get better, but I'll talk about the items it includes. Coordinator-side errors are the read unavailable, write unavailable, and range unavailable errors, and those have to do with how many nodes are available for requests: if I'm doing a CL=ALL request and one node is down, that translates to an unavailable error. It's a bad idea to do CL=ALL operations on ScyllaDB; you're setting yourself up for failure.
Replica-side errors, a local read error, a local write error, or an I/O error, basically mean a replica is not able to read, write, or do I/O to the disk, and that means you need to check that single node; I'll talk about what checking a single node means. Last, there are C++ exceptions, which we usually look at if a single node is irregularly high, or if the system had no C++ exceptions and suddenly started to have them; that may signal that something has broken. It may mean you need to look at the logs, but I can't translate it into anything more specific than that at this point. The next question you should ask yourself is: has the application changed? If it has, my own recommendation is to look at the CQL Optimization dashboard. There we try to extract information about what your application is doing towards ScyllaDB and give you the inputs. For example: you changed the application, somebody started building query strings and sending them directly to ScyllaDB, he's using Java rather than gocql which prepares automatically, and you ended up with non-prepared statements and everything is broken.
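For completeness, here is a small sketch of that difference in the Python driver (the table is made up): prepared statements are parsed once and reused, while string-built statements are re-parsed on every request and show up as non-prepared on the dashboard.

```python
# Sketch: non-prepared vs prepared statements (Python driver; table name is
# hypothetical).
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

# Anti-pattern: building the query string per request. The server must parse
# it every time, and it shows up as a non-prepared statement.
def save_unprepared(user_id, name):
    session.execute(
        f"INSERT INTO users (user_id, name) VALUES ({user_id}, '{name}')")

# Better: prepare once, bind many times.
insert_user = session.prepare("INSERT INTO users (user_id, name) VALUES (?, ?)")

def save_prepared(user_id, name):
    session.execute(insert_user, (user_id, name))
```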
I won't dwell on this, but there are other items flagged there. You're not using paged CQL reads: if your reads are very small it may not be an issue, but when you hit a large partition with hundreds of thousands of rows and try to return it all in one response, everything will fall apart; ScyllaDB will run out of memory trying to build a huge response to the query you just sent. That's a way to set yourself up for failure, unless you know what you're doing. Not using a token-aware driver: you're adding latency for no reason, so change that. Reversed CQL reads are not as optimized as regular reads; again, if you have large partitions, trying to read them backwards will end with you crying and us trying to figure out why. ALLOW FILTERING: you need to understand that this happens on the coordinator node, so we get the data off the disk and then apply the filtering.
Can it be done better or not? You need to decide, but we give you the inputs on how well you're filtering. If you're reading a million rows and returning one, is that correct or not? You need to know, but it's an indication. Say everything was good; the next question is: do we have a client imbalance? This is answered very well by the CQL coordinator-side dashboard, which has a section on client connections and then the distribution of the different CQL requests across nodes and across shards. One thing to note is that you need to look at both the instance view and the shard view. ScyllaDB is sharded: the fact that you have the same number of connections to all the instances doesn't mean you have connections to all the shards. A single shard, usually shard zero, can end up handling all your connections while everybody else is resting, and then it will be the bottleneck of the system.
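To tie the token-aware and paging points together, here is a sketch with the Python driver; the hosts, keyspace, table (the hypothetical readings table from the earlier sketch), and fetch size are placeholders.

```python
# Sketch: token-aware load balancing plus paged reads (Python driver; hosts,
# keyspace, table, and fetch size are placeholders).
from datetime import date
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    # Route each request to a replica that owns the partition's token.
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()))

cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")
session.default_fetch_size = 1000   # page size: never materialize everything

query = session.prepare(
    "SELECT ts, value FROM readings WHERE device_id = ? AND day = ?")
count = 0
for row in session.execute(query, ("sensor-17", date(2019, 11, 6))):
    count += 1                      # iteration fetches further pages lazily
print("rows read:", count)
```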
So use the shard view: CQL read requests and so forth need to be balanced. The best example is CQL batches. Usually you have some background process that starts running at some point and sends huge batches, and they hit only a single shard; then everything slows down on the shard handling the incoming batch requests, because it needs to parse them, at times merge them, and then distribute them to all the other shards. That's an imbalance. Okay, so we've checked all of that and everything was green. Next we go to replica imbalance, and here life is a bit more complex, so we're going to talk about three or four things you can do; everything else falls outside that, but hopefully we'll be able to provide additional guidance in the future.
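As an illustration (my example, not from the talk), one way to keep batches from piling onto a single coordinator shard is to batch only rows that share a partition key and keep the batches small; a sketch, reusing the hypothetical readings schema:

```python
# Sketch: group rows by partition key and send one small unlogged batch per
# partition, instead of one huge multi-partition batch that a single
# coordinator shard has to parse and fan out.
from collections import defaultdict
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")
insert = session.prepare(
    "INSERT INTO readings (device_id, day, ts, value) VALUES (?, ?, ?, ?)")

def write_rows(rows, max_batch=50):
    # rows: iterable of (device_id, day, ts, value); (device_id, day) is the
    # partition key in this hypothetical schema.
    by_partition = defaultdict(list)
    for r in rows:
        by_partition[(r[0], r[1])].append(r)
    for partition_rows in by_partition.values():
        for i in range(0, len(partition_rows), max_batch):
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            for r in partition_rows[i:i + max_batch]:
                batch.add(insert, r)
            session.execute(batch)
```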
The first thing I go to and look at is the latencies. Those are exposed, and if I look at the shard view, especially at the history, and the system is a bit slower, the latencies a bit higher, I can pinpoint the time and the node where it started and then look at what's happening. When we talk about the latency overview, again we need to look at both the instance view and the shard view. Higher read latencies on a replication factor's worth of nodes or shards can indicate an imbalance in data access. We said before that a hot shard or a hot partition means you're sending requests to one specific shard in the system; how does that translate to latencies? If you're using the ScyllaDB shard-aware driver (if not, it's a bit more complex), the latencies of that shard on those three specific nodes will be higher than everything else in the system, and that's a very good indication you have a hot shard case.
Another option is an imbalance where at times you're accessing larger partitions, and then you need to go check that. If you have temporary, very high spikes in your latencies, usually on a single shard, not on all the shards and not on all the nodes, then you need to go to the single node check. That can be a stall; I'll talk about stalls a bit later, but it's a good indication of something happening on that specific node, and we need to check that specific node. And last, if you have higher read latencies on some shards, you may simply be sending more read requests to those specific shards; they may be doing the CQL processing, the foreground start of handling those requests before they get split up, and that's why they're a bit slower, so look at the traffic itself as well. Next, we have a detailed view in ScyllaDB Monitoring, and one section is the replica section. If we look at it, we can find specific items related to replica imbalance, and I'll talk about what you can find there. We have active sstable reads and queued sstable reads. Active sstable reads means how many sstables we are currently reading; in many cases it will be 0, in some systems it may be constant, but it should not be trending up and down or spiking on a specific node.
If it is spiking up on a specific node, it may mean that node is not keeping up with compactions, so we need to go check the single node that has the higher number of active sstable reads. Queued sstable reads translate to latencies: ScyllaDB starts queuing sstable reads after 100 concurrent reads going down to the disk, simply because we don't add more requests without getting responses for previous ones, or we would max out the capacity. Queued reads are an indication that we may be overloading the system, or that the specific node has an issue, maybe a stall or something else, or the disk is slower for some unknown reason, so we need to do the single node check. Writes currently blocked and the dirty counters for the commit log have to do with incoming writes being written to disk; if those counters are not zero, and usually they will be zero, we again need to do the single node check and see why that node is not able to flush the commit log or the memtables to disk. Read failed, write failed: again, go to the single node check. Write timeouts, if they're imbalanced: single node check again.
Detailed cache view for the replicas: the cache section on the detailed dashboard has a lot of information, and I'm not going to cover every item, just the first things to look at. There are metrics for reads with no misses, partition hits, and row hits, and if they're imbalanced, and I've seen imbalanced systems, you'll see a single shard that is much higher than everything else. If that's the case, in 99% of the cases it's a hot partition. Please note that it may be a hot partition due to a background process or queries internal to ScyllaDB, or even due to the driver polling the nodes, so it may not be your application at all; it can be a driver query causing it. Another interesting item is the total bytes: compare whether they are in line with the partition and row counts across the shards. If not, it means we don't have the same number of partitions but we do have the same amount of memory in the cache, which means some partition or row is larger, and in that case we should go to the large partition/row/cell check. Last, there are materialized view and memory sections for the replica; I'm going to talk about the memory one. In case of an imbalance you may see the LSA memory dropping; some of you have probably seen it, and it usually means ScyllaDB had to evacuate LSA memory.
LSA memory is what we use for memtables and cache. We evict that memory in order to build up queues, usually queues of ongoing items, and a queue building up is usually bad; a system in a stable state should not have a lot of queues building up. The majority of memory will normally be in LSA: if we're talking about eight gigabytes per shard, less than one gigabyte will be non-LSA memory, and it will stay roughly constant. If there is a sharp spike down or a large drop, something is wrong with that node and we need to go check why; usually it means we're building up queues somewhere, and that will translate into the system being slower, something not working. So from the replica imbalance we got three things to check, three possibilities: it may be a hot partition, a large partition, or a single node problem. Hot partition check: we've identified which nodes are suspected of having a hot partition, usually three replicas if we're talking about RF=3, and then we can use nodetool toppartitions and nodetool cfhistograms. nodetool toppartitions is something we added, but it has the drawback that you need to know which keyspace and table to look at, which is not that easy. The way I solve it when I need to find out what's going on is with nodetool cfhistograms, which gives statistics per keyspace and table, including access counts: I take two snapshots, subtract one from the other, find out which table is accessed the most, and then use toppartitions and see whether I get lucky.
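Here is a rough sketch of that snapshot-and-diff idea. It shells out to nodetool cfstats and assumes the output contains per-table "Keyspace", "Table", and "Local read count" lines; that format varies between versions, so treat the parsing as an assumption to adapt, not a guarantee.

```python
# Rough sketch: diff per-table read counts between two nodetool snapshots to
# find the busiest table. Assumes `nodetool cfstats` output contains
# "Keyspace :", "Table:" and "Local read count:" lines; adapt the regexes to
# your version's actual output.
import re
import subprocess
import time

def read_counts():
    out = subprocess.run(["nodetool", "cfstats"], capture_output=True,
                         text=True, check=True).stdout
    counts, keyspace, table = {}, None, None
    for line in out.splitlines():
        line = line.strip()
        if m := re.match(r"Keyspace\s*:\s*(\S+)", line):
            keyspace = m.group(1)
        elif m := re.match(r"Table(?: \(index\))?\s*:\s*(\S+)", line):
            table = m.group(1)
        elif m := re.match(r"Local read count\s*:\s*(\d+)", line):
            counts[(keyspace, table)] = int(m.group(1))
    return counts

before = read_counts()
time.sleep(60)                      # sample window
after = read_counts()
deltas = {k: after.get(k, 0) - before.get(k, 0) for k in after}
for (ks, tbl), d in sorted(deltas.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{ks}.{tbl}: {d} reads in the last minute")
```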
Large partitions, rows, and cells are computed at compaction time, and we added three tables to collect that information; it's available in master, or the next major version. The other thing you can do is search the ScyllaDB logs for them. The good news is that there is a background process whose output you can look at to see whether you've accumulated large partitions; the bad news is that it may be too late by then. Systems that are tuned for small partitions and low latencies, if they start seeing large partitions, it's probably too late: you didn't take that into account when you built your application and your data model, and now you need to figure out how to handle it.
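As an example of the table-based route, a small sketch that reads those reports from the system keyspace; the table names match recent ScyllaDB versions, but the exact columns and availability depend on your version, so check the schema first.

```python
# Sketch: inspect the large-partition/row/cell reports that compaction writes
# to the system keyspace (table names as in recent ScyllaDB versions; the
# exact schema may differ by version, so DESCRIBE them first).
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

for tbl in ("large_partitions", "large_rows", "large_cells"):
    print(f"--- system.{tbl} ---")
    try:
        for row in session.execute(f"SELECT * FROM system.{tbl} LIMIT 20"):
            print(row)
    except Exception as exc:   # the table may not exist on older versions
        print(f"not available: {exc}")
```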
The single node check is the most complex, because it can lead to multiple things. For me it starts with CPU, OS, and I/O checks, and I'll talk about those. Then I check the logs: I check for errors, stalls, large allocations, and bad_alloc, and I wish they weren't as common as they are, but they do happen. And last, we check the logs to see whether the OS has killed something for any reason. Someone talked about the JMX earlier: it is common to see the OOM killer, part of the OS, kill the ScyllaDB JMX process because the OS didn't have enough memory. It won't kill ScyllaDB itself; it finds that the JMX allocation grew too much, or that you allocated too much to it, and kills that. Or you may have disk errors, and then you need to replace a disk, or replace the instance if you're on AWS.
The CPU view, this is from monitoring 3.1, and the interesting part here is high time spent in task violations. What is a task violation? We talked about the different tasks we have in ScyllaDB: reads and writes, or statements as we call them, compactions, memtable flushes, streaming and repair, and so forth. ScyllaDB has a CPU scheduler, and every task quota we allow a different task group to execute. If a group doesn't have enough work we take another group and run it, but if it has enough it runs for a quota. If a task doesn't context-switch out, doesn't give up the CPU, and continues running past its quota, that's a task quota violation, and it may translate into latencies.
The task quota is very small, so usually you won't even see it, but if a violation is, for example, two milliseconds, you will see it in your latencies. If we're talking specifically about the statement group, the read and write requests, being the one violating, that may mean you have a large partition: if I have a huge partition that I need to merge, or read information from, or use the cache and update the schema on, it will hit me here, and I'll see a task violation in the statement group. The single node check at the OS level: I check the disk, am I running out of disk space? Here we can see whether a disk is running out of space, or whether we have an imbalanced state. An imbalanced state would again be a background process: if I'm running, for example, a backup, and it's doing a lot of I/O, reading from the disk and uploading to S3 or wherever, then I'm eating into the disk I/O ScyllaDB needs, and ScyllaDB won't behave the same; it's sending requests to the disk and not receiving the responses at the same pace. That's one example on the I/O side, and the same goes for compaction bandwidth and streaming. So we've talked about the hot partition, large partition, and single node checks, and that's unfortunately all the templated analysis I can give you. If you don't fall into any of these, there are other cases, and they're too big to cover at this point, so we are working to make this better, both for you and for us. nodetool toppartitions and the large-partition tables didn't exist; they came to exist because we needed to analyze these things and we needed to make them accessible to you, so you can analyze them on your own.
We'll add additional tools as time goes by. So what are stalls? As I said, the CPU scheduler in ScyllaDB allocates task quotas, and when we start running a specific function, it can call a lot of sub-functions, and those sub-functions call other sub-functions, and so on. When does it end? It ends in two cases: if the function calls a yield function, which basically says "I can keep running, but does anybody else need to run?", or if it does I/O, to the disk, to the network, whatever. In all other cases my code keeps running; if I have an endless loop, it keeps on running and nobody inside ScyllaDB will kill it. The model ScyllaDB uses is cooperative scheduling; I have some slides on it and I'm not sure we'll reach them, but that's the model: tasks have to relinquish the CPU, they have to give it up, or else they keep on running.
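To make the cooperative model concrete, here is an analogy in Python's asyncio, which is also a cooperative scheduler; this is not ScyllaDB code, just an illustration of why a task that never yields stalls everything else sharing the same reactor.

```python
# Analogy only: a coroutine that never awaits blocks every other task on the
# asyncio event loop, which is roughly what a reactor stall looks like.
import asyncio
import time

async def stalling_task(n):
    total = 0
    for i in range(n):
        total += i                   # tight loop, never yields: the "stall"
    return total

async def polite_task(n):
    total = 0
    for i in range(n):
        total += i
        if i % 10_000 == 0:
            await asyncio.sleep(0)   # voluntarily yield, cooperative style
    return total

async def heartbeat():
    for _ in range(5):
        print("heartbeat", time.monotonic())
        await asyncio.sleep(0.1)

async def main():
    # With stalling_task the heartbeats bunch up after the loop finishes;
    # swap in polite_task and they keep ticking every ~100 ms.
    await asyncio.gather(heartbeat(), stalling_task(20_000_000))

asyncio.run(main())
```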
So a stall means that something didn't give up the CPU. It may be something bad in our code, and it may also have to do specifically with your data model, and I'll give an example. We have an issue in ScyllaDB today where, if you ran an ALTER and you have a large partition inside the cache, we need to upgrade the schema of that partition inside the cache, and that happens when you read from it, and we have to do it at the partition level. So if I have a one-gigabyte partition and I altered the schema, nothing happens until I read from that partition; when I read, it upgrades the schema, but it has to do it across the whole partition, even if you're reading a single row. We can't indicate in the metrics that this is what's happening, but we can find out through a backtrace: a stall in the ScyllaDB logs basically translates into a backtrace in the code that we can read, and that you can read some of the time. Often you won't be able to do anything about it yourself, but at least it will tell you why it happened, and then you can decide what to do on your end. Stalls are recorded as that string in the log, and large stalls can lead to task quota violations and so forth. So what do you do with stalls? One, report them; it may be an enhancement or a bug we need to fix, and if we see more people reporting the same issue, we'll prioritize it. Two, it may be your data model: you saw that using blobs would be great and started writing blobs that are ten megabytes each, and that's not great; you may be able to work with blobs that are half a megabyte, and we'll be able to tell you that the stalls are caused by large cells, though as shown earlier, you have a way today to figure that out on your own as well. The other source can be the OS; we fix or work around OS bugs as well as we can, but there are cases where we'll tell you to update the kernel, that you're using an old version, that there is a patch to XFS, whatever. So the instructions you get when you post a stall can be: upgrade to a new version, this is fixed; change your data model, again, you're using large blobs; or "+1" an existing fix or enhancement request, and we'll take that into account. That is a stall. Memory management, large allocations, bad_allocs: this is a map of how ScyllaDB looks at memory, and we'll talk about a single shard. That is the total memory a single shard has, and it's broken up into LSA, which is memory ScyllaDB manages internally and which floats between memtables and cache, split between them.
So if I'm accumulating a lot of memtables, the cache will be smaller, and vice versa; it also means that if you have a large read workload the cache may grow. Then there are the standard allocations. Standard allocations are not supposed to be very large; usually they're for queues, for managing the software, for the lifetime of ScyllaDB. But again, if we have a queue building up, it eats memory away from the memtables and the cache, and the first thing to take the hit is the cache, because that's memory that can be relinquished; memtables cannot just be dropped, they need to be flushed to disk, so the cache is the first item that gets hit. Usually most of the memory is in LSA, cache or memtables: with 8 gigabytes per shard it's usual to see 7 gigabytes allocated to memtables and cache and 1 gigabyte or even less for everything else. When LSA memory drops, it means we had to evict items, and that translates to higher latencies: we need to read more from the disk to answer your queries, because you don't have the cache anymore. Large allocations: ScyllaDB manages, or tries to manage, memory in a good way, which means it breaks up blobs into smaller chunks and does a lot of other things, but large contiguous allocations are bad. They're bad because they take a lot of time and a lot of space, and to satisfy a large contiguous allocation we may need to free a lot of space until we get a large enough buffer to write into. So large allocations are bad, and ScyllaDB writes this information to the logs.
Again it's a backtrace, and it includes the size of the allocation. So what can you do with those? Report them; one thing is for sure, we'll take a look. And it can again be your application, in which case we can tell you what to change on your side: large blobs, again, are not great. If you disabled paging, you could have seen it in the CQL Optimization dashboard, but if we see large allocations around the area of building a result set, we know it's related, so if you disabled paging, that's the reason, and we can give you this information. So we can tell you to upgrade, to change your application, or you can tell us this is important to you and we'll take it into account when we schedule work. Bad allocs: large allocations are bad, but there is always something worse than bad, and bad_allocs are worse. A bad_alloc means ScyllaDB wasn't able to allocate the memory at all, and it throws. Now, bad_allocs can happen, and ScyllaDB is able to survive some of them; they're not great, but they can happen. It can be a temporary period of stress where we couldn't allocate, there was a bad_alloc, your client or application may have gotten an error back for a request it sent, and it will retry and everything will be fine. But in some cases it's not transient, and in some cases we cannot really keep working. So again, what you can do is always report the issue and we'll try to figure out why it happened. It's not perfect, I know, but in many cases we can tell you: don't build such huge batches, break your batches up into smaller ones; that reduces the pressure on memory, the allocations will be smaller, and you won't reach that point. That's one example.
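As an illustration of the "smaller pieces" advice (my example, not from the talk), one common pattern is to split a large blob into fixed-size chunks stored as separate rows; a sketch, with a hypothetical schema:

```python
# Sketch: store a large blob as ~512 KB chunks in separate rows of one
# partition instead of a single multi-megabyte cell (hypothetical schema).
from cassandra.cluster import Cluster

CHUNK = 512 * 1024

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")
session.execute("""
    CREATE TABLE IF NOT EXISTS blobs (
        blob_id  text,
        chunk_no int,
        data     blob,
        PRIMARY KEY (blob_id, chunk_no)
    )
""")
insert_chunk = session.prepare(
    "INSERT INTO blobs (blob_id, chunk_no, data) VALUES (?, ?, ?)")

def store_blob(blob_id, payload: bytes):
    for i in range(0, len(payload), CHUNK):
        session.execute(insert_chunk, (blob_id, i // CHUNK, payload[i:i + CHUNK]))

def load_blob(blob_id) -> bytes:
    rows = session.execute(
        "SELECT data FROM blobs WHERE blob_id = %s", (blob_id,))
    return b"".join(row.data for row in rows)   # chunks come back in order
```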
So again, you may be instructed to upgrade, to change your application, or to tell us it's important to you, because there is no other workaround. Scheduling: at a high level, a ScyllaDB node has three resources. There is the CPU; there is the disk, usually an SSD or NVMe these days (if anybody is on HDD, my condolences, ScyllaDB is not optimized for it; there are users running on it and it's fine, but seek times are hard, that's the way it is); and there is networking. The reason networking is singled out here is that we don't have a scheduler for networking today. We have a scheduler for the CPU that context-switches between the different tasks in ScyllaDB; we have a scheduler for the SSD, we tag every I/O to the disk with whether it comes from compaction, repair, writes, or reads, so we know what is going on. When we write to the network we don't have our own scheduler; we rely on the network scheduler built into Linux. So if I have one connection being used for ongoing reads and writes to a different node, and streaming uses a different connection, Linux will balance between the two. So it's not the complete Wild West; if you're streaming, not everything will break, we're relying on Linux scheduling for networking. The items we need to schedule are the commit log, memtables, compactions, queries, and streaming and repair, the task categories we talked about. Disk scheduling: what are we reading and writing? There are basically three kinds of files, sstables, commit logs, and hints, and the next slide covers what sstables, commit logs, and hints look like, because they're not exactly the same.
ScyllaDB has evolved to tune itself for each, to know how to work with each, and I'll talk about that a bit. I'm going to skip this slide and cover it here. Let's talk a bit about the read path and how it looks; I have an example here of a replica shard trying to read some data off the disk. A read request comes in, and it goes to the memtables and to the row cache; if the row cache doesn't have the information, it goes down to the underlying sstables, and in this example I have four sstables. The grayed-out items are in memory: memtables, row cache, bloom filter, and summary are all in memory. The blue components are on disk: the index and the data files are components that require an I/O to the disk. Let's look at the example: we're reading partition 8, row 1. We fetch information from the memtable and look in the row cache. The first case is that the row cache does have all the information: we're reading a row with three columns, A, B, and C; we find the information in the memtable, we find the row cached in the row cache, and we return the result. That was simple, there was no I/O.
We read from the memtable and the row cache and everything was perfect. Next, suppose we weren't lucky and the row isn't cached, so we're going to the sstables, and maybe to the disk. We start with the bloom filters, and we see that p8, partition 8, may exist in this sstable, this sstable, and this sstable, but does not exist in the last one, so we're down to three sstables to keep reading from. Next we go to the summary file and search for where this partition might be in the index file. Bloom filters give us a definitive answer only when the partition is not in an sstable; if they say the partition is there, that may be a false positive, meaning the partition doesn't actually exist in the sstable even though they said it does. So if they say no, they're correct; if they say yes, they may be wrong. We go to the summary file, find the location, and then access the index; the index is again on disk. Here we see that two sstables actually have partition 8 and the third one doesn't, so we're down to two sstables, but we had to do one I/O that was wasted, because the bloom filter gave us information that wasn't true. We go through the compression mapping, which translates the offsets coming from the index, and then we find the information in the data file, read it, merge it, insert it into the row cache, and return it. So that's the single-partition read path. In terms of I/O, the index reads can be small or mid-sized, and the data reads really depend on how big your partitions or rows are.
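Here is a purely illustrative model of that lookup order in Python; dicts stand in for memtables, the row cache, bloom filters, and sstable files, and none of it corresponds to real ScyllaDB code.

```python
# Illustrative model of the single-partition read path described above; dicts
# stand in for memtables, cache, bloom filters, and sstable files.

def might_contain(bloom, key):          # bloom filter: "no" is definitive,
    return key in bloom                 # "yes" may be a false positive

def read_partition(key, memtables, row_cache, sstables):
    fragments = [mt[key] for mt in memtables if key in mt]

    if key in row_cache:                # cache hit: no disk I/O at all
        fragments.append(row_cache[key])
        return merge(fragments)

    for sst in sstables:
        if not might_contain(sst["bloom"], key):
            continue
        # summary (memory) -> index (disk I/O) -> data file (disk I/O)
        if key not in sst["index"]:
            continue                    # the I/O was spent on a false positive
        fragments.append(sst["data"][sst["index"][key]])

    merged = merge(fragments)
    row_cache[key] = merged             # populate the cache for next time
    return merged

def merge(fragments):
    merged = {}
    for frag in fragments:              # later (newer) fragments win
        merged.update(frag)
    return merged
```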
If the partitions are large we'll be reading large chunks; if the partitions are small we may be reading small chunks. As you sequentially read through a large partition, ScyllaDB starts doing read-aheads, which are basically speculative reads from the disk: we read data off the disk that you may or may not end up using, and as you continue reading a large partition this reduces the latency. Next, ScyllaDB caches the readers. Say I'm reading a huge partition that has hundreds of thousands of rows, and I'm reading that full partition and returning it back in chunks, in pages, because I cannot return it all at once, that would kill ScyllaDB. ScyllaDB caches those readers, so between one request and the next we already have buffers in memory holding the next rows, and ScyllaDB is able to use them. So anybody who thinks they're smarter than ScyllaDB at doing paging is wrong; you won't be able to do it better, you're only hurting yourself by not being able to use the cached readers. If you submit a new request starting from a specific partition and row, it has to go to the index, find out which sstables contain it, start reading, and drop the read-aheads and the buffers we already had in memory. You cannot do paged reads better than ScyllaDB does; that's the moral of the story. Memtable flushes, compaction, streaming, and repair usually do I/O of around 128 kilobytes, so they are large reads and writes in ScyllaDB terms. Commit logs are built from 32-megabyte segments.
We pre-allocate that space on disk, so we don't start writing before we know we have the 32 megabytes; that's done to make writing to the commit log more optimal: we don't need to go to the OS and ask for space, we save on that, so we only do the I/O to flush the data down, and the OS, the XFS file system, doesn't need to allocate additional space. We go even further and recycle commit logs: once I've flushed all the memtables related to a commit log segment and they're stored in sstables, we don't give that segment back to the OS, we reuse the space, again to save on allocations and file system operations; reusing is much cheaper than freeing and allocating again. So that's why you'll see commit logs lying around nowadays, and recycled commit logs are basically junk: ScyllaDB doesn't read them if it crashes and boots up. If we look at the disk I/O patterns of the different tasks, we can see they're different: we have reads that can be 4 KB or larger; we have writes that are usually 128 kilobytes; and we have other reads coming from compaction, streaming, and repair. Commit log reads and hinted handoff are not the regular case, so I'm skipping them. Now, we're getting a machine from you, we have no idea what it's capable of, and we need to schedule operations: we need to know we're not sending too much or too little to the disk, so we need to know what the disk can actually do. We have a process that runs during ScyllaDB setup called I/O tune, and it basically benchmarks your disk and records what the read and write bandwidth is and what read and write IOPS your disk can do. For example, i3.metal write bandwidth is six gigabytes per second, but i3.metal read bandwidth is 15 gigabytes per second, which is huge, and we want to use the information that writes can only do six gigabytes while reads can do 15. Now let's talk about the different I/O profiles. We said we have reads of four kilobytes and larger reads and writes; why does it matter? If we profile a disk, we can see that this specific disk sustains around 150 concurrent 4-kilobyte requests at some average latency, but with 128-kilobyte requests we're down to about 30 concurrent requests. So how many I/Os should I issue to the disk: should I tune for 30 requests in parallel or for 150, given the mix of smaller and larger reads and writes? Legacy ScyllaDB benchmarked only the four-kilobyte case and tuned itself for it, and then basically we sent too much to the disk.
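A quick worked example of why bandwidth and IOPS have to be considered together; the i3.metal numbers are the ones quoted above, the rest is simple arithmetic.

```python
# Worked example: at large request sizes, bandwidth, not the 4 KB IOPS figure,
# caps the request rate. Bandwidth numbers are the i3.metal figures above.
write_bandwidth = 6 * 1024**3      # ~6 GB/s
read_bandwidth = 15 * 1024**3      # ~15 GB/s
large_request = 128 * 1024         # 128 KB writes (memtable flush, compaction)

max_large_writes_per_s = write_bandwidth / large_request
max_large_reads_per_s = read_bandwidth / large_request
print(f"128 KB writes/s sustainable: ~{max_large_writes_per_s:,.0f}")
print(f"128 KB reads/s sustainable:  ~{max_large_reads_per_s:,.0f}")
# ~49,000 writes/s and ~123,000 reads/s: far fewer than a 4 KB-only benchmark
# would suggest, which is why the scheduler tracks both IOPS and bandwidth.
```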
And then we had to tell you to tune it down a bit. Nowadays it's much smarter; Glauber is to blame for that, he wrote the code, perfect timing. So basically what we do is check how many four-kilobyte reads and writes we can do, but also what the bandwidth is, because the bandwidth is the counterpart: a disk doing 30 concurrent 128-kilobyte requests is limited by its bandwidth, and the same goes for writes, since writes are usually 128 kilobytes in size. And we have disk scheduling, which basically means queuing up requests and making sure there is fairness between them. How do we see how much we're using and how it looks? We have the disk I/O dashboard, where we can see the I/O operations, the I/O bandwidth, and the queue delay. I/O queue delay means we have requests that are waiting, which is bad; that again translates to latencies. We can also see the bandwidth we're doing, and if you divide the bandwidth by the IOPS you'll see the request sizes are totally different: usually query bandwidth divided by query IOPS is much smaller than it is for the commit log or memtable flushing and so forth. On advanced monitoring: I talked about the query cache and I talked about the SSTables, but you don't see them on the dashboards today. That's because we're leaving some things for the future, and it's already hard for us as well, but the metrics exist. If you go to Prometheus you'll find them; those are the prefixes, and you'll see we have around 20 metrics for sstables and around 20 for the query cache, and if you query them you'll also see a description of what each one means.
And a myth: I have a disk running at 90 percent utilization, does that mean I have only 10 percent left? The answer is no. Disks have high parallelism, so 100 percent utilization tells you that you've saturated one path, but it doesn't mean you cannot saturate an additional path inside the disk. So when you're looking at ScyllaDB metrics versus iostat: utilization under 100 percent means a lot, you're not maxing everything out; but if you're seeing 110 percent, that's fine, the disk may well be able to do more than that. And the rest is homework for you guys.