Cluster Management, Repair and Scylla Manager


*The manager part is relevant for Manager version 1.4


So, my name is Dan. I'm a field engineer and also part of the support group at Scylla. This is not going to be a very deep dive into the internals of things, but we'll be covering the tools we use for cluster management.

So basically these are the things that the Scylla stack consists of besides Scylla itself. What we have here is Scylla Monitoring, the entire system that monitors Scylla as it is running and provides us with graphs, stats and insights into what's happening with your cluster. We'll go over its components, how it's built and how it's deployed. We'll also be talking about Scylla Manager: where it's at right now, where it is going, how it is deployed, how it works, how it connects to a Scylla cluster, and so on.

Okay, so we start with the monitoring. The monitoring stack uses Prometheus for metrics collection. The Scylla nodes export Prometheus stats natively, so if you are familiar with Prometheus it's pretty easy to hook Scylla up to it whenever you need to, but our stack already comes with it. All the Scylla nodes also install the node exporter during installation, which is used for the stuff that is not Scylla: free disk space, CPU load, that sort of thing, which Scylla does not present directly to monitoring. On top of Prometheus itself there are the Prometheus Alertmanager and Grafana with our own dashboards, so you don't really need to do anything when you install it. It comes pre-configured with everything you need to monitor Scylla. Of course, since it's a known stack and lots of people use it, you can expand it and do whatever you want with it. It's also open source and available on GitHub.

So, what it looks like: we have the dashboards that come with Scylla. There is a slight difference between the Enterprise and open-source dashboards. The different dashboards let us drill down into specifics: some of them deal only with CQL queries, some give a general overview of the system, some deal with node status like CPU load and free memory, and there is also a dashboard for Scylla Manager.

We'll show that later on as well. Okay, so this is how we can filter views inside the dashboards. If you look at the top screenshot, it shows that we can split the views at different levels. We can look at the cluster level, and that's the first graph with just one line: that's the load for the whole cluster right now, aggregated across all the nodes. If you look at the second one, you can see this is per node, and in this particular case one node is actually getting hit by much more load than the others. This tells us that there is a problem with the current data distribution, or potentially something else. This is how you can drill down into specific problems when you have them.

The third one is per shard. Because of Scylla's specific architecture, which uses a shard per core, your chunks of data are designated to a specific CPU core on every node. If you drill down to this level, this is a specific node, the one that is overloaded, the one that has the blue graph in the middle screenshot. Here we can see that the load is more or less evenly distributed across the shards. So in this particular case the load was hitting one node much more than the others, but it wasn't shard specific.
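To make the three levels concrete, here is a minimal sketch of the same roll-up done by hand. It is not code from the monitoring stack; the sample numbers, the node names and the `aggregate` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical samples: (node, shard, requests per second).
SAMPLES = [
    ("node1", 0, 100), ("node1", 1, 110),
    ("node2", 0, 400), ("node2", 1, 390),
    ("node3", 0, 95), ("node3", 1, 105),
]


def aggregate(samples, level):
    """Sum the load at the requested level: 'cluster', 'node' or 'shard'."""
    totals = defaultdict(int)
    for node, shard, value in samples:
        if level == "cluster":
            key = "cluster"
        elif level == "node":
            key = node
        elif level == "shard":
            key = (node, shard)
        else:
            raise ValueError(f"unknown level: {level}")
        totals[key] += value
    return dict(totals)
```

With these numbers the per-node view shows node2 carrying far more load than the others, while its two shards are roughly balanced, which is the situation described for the screenshots.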

Okay, and as you can see at the top, you can switch between the views and they get filtered this way. You can also choose whether you want to see everything, or just per DC, or just per node.

Okay, so how do we install monitoring? First of all, the most common installation is Docker based. I think 100% of the installations that we have are Docker based, but there are options. You can also install it on bare metal or on whatever VM you have somewhere in Amazon or elsewhere. There is a way to install the whole stack without Docker, but it takes some work. The way we release the monitoring stack is on GitHub releases, so you just download the tar.gz and unpack it. The other way, which is less recommended because you might make mistakes there, but which gives you a bit more flexibility to play with it, is to download it from git. In that case you need to make sure that you switch to the right branch, because master is unstable. The branch will pretty much give you the same code as you get in a specific release.

Okay, so I'm not going to get too much into it, but this is showing how we configure monitoring. Right now it's just a single .yml file for the monitoring itself, called scylla_servers.yml, and it contains the targets that we want to monitor. It just has the IPs of the Scylla nodes, and then for each list of IPs we need to specify what it belongs to. With a single monitoring stack we can monitor multiple clusters, so it needs to know which cluster and which DC each node belongs to, so that we can drill down to per-DC views as well. It does not get this from nodetool, it gets it from this file, so this needs to be done. There is a script shipped with the monitoring stack that can do it for you by collecting the data from nodetool. It's called genconfig.py, and it's a little Python script. Also, if you're doing configuration management with Puppet, Ansible or whatever, you can probably extract this sort of data from the inventory file that you have.
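If you would rather generate the file from your own inventory, a small sketch could look like the following. The exact layout (a `targets` list plus `cluster` and `dc` labels) is an assumption modeled on Prometheus file-based target lists, and the `render_targets` helper and the inventory shape are hypothetical, so compare the output with the example file shipped with your monitoring release.

```python
def render_targets(clusters):
    """Render a targets file from {cluster_name: {dc_name: [ip, ...]}}.

    Assumed layout: one entry per (cluster, dc) pair, each with a list of
    node IPs and the labels used for the per-cluster and per-DC views.
    """
    lines = []
    for cluster, dcs in clusters.items():
        for dc, ips in dcs.items():
            lines.append("- targets:")
            for ip in ips:
                lines.append(f"    - {ip}")
            lines.append("  labels:")
            lines.append(f"    cluster: {cluster}")
            lines.append(f"    dc: {dc}")
    return "\n".join(lines) + "\n"


# Hypothetical inventory: one cluster with two data centers.
INVENTORY = {
    "cluster1": {
        "dc1": ["10.0.0.1", "10.0.0.2"],
        "dc2": ["10.0.1.1"],
    }
}
```

Writing `render_targets(INVENTORY)` to a file gives one block per data center, which is what the per-DC filtering relies on.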

Okay, also Manager 2.0: it's still not out, but it's pretty close to release. It will have a Consul API, and monitoring will be able to get the list of nodes from the manager if we have the manager installed before monitoring.

Okay, the Alertmanager. This is what its configuration looks like. This is very standard Prometheus; if you're familiar with it there will be no surprises here. This is an example of a disk-full alert. The expression basically says that if the available space at a specific mount point is less than 25%, it will send an alert. That is done in the rule_config.yml file. There is an example of it shipped with the whole stack and you can adjust it however you want. It contains routes, which represent a routing tree, so the most specific rule wins. It keeps matching true or false against the rules, drilling down until it reaches a specific one. There are inhibitions in place, so you can mute an alert if another alert comes in, for example, so you don't get alert fatigue. The receiver is basically where to send it to, and there are tons of integrations for Prometheus: you can send it to Slack, Telegram, email, whatever you want.
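The condition behind that disk-full alert is simple enough to spell out. This is only a sketch of the arithmetic, not the rule shipped with the stack; the function name and the way the two byte counts are obtained are assumptions.

```python
def disk_almost_full(available_bytes, size_bytes, threshold=0.25):
    """Return True when the available fraction of a mount point drops
    below the threshold (25% by default, as in the example alert)."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return available_bytes / size_bytes < threshold
```

On a live machine the two inputs could come from `shutil.disk_usage(path)`, whose `free` and `total` fields play the role of the available and total space that the node exporter reports.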

Okay, and this is what the configuration looks like. These are the rules, for example. There isn't really that much to look at here, besides the fact that this is the configuration file and right next to it is what the alert email would look like to you. Again, if you are familiar with Prometheus this is probably not going to be much of a surprise.

Okay, so let's switch to the manager. Before we get to the manager itself, and again, you're in the advanced track so this is probably something you already know quite well, we need to cover the rationale for having all this in place. We need to do repairs. This is what the manager does, this is the main function of the manager right now. The reason we have repairs is pretty simple. There is the CAP theorem, which says you can have consistency, availability or partition tolerance, and pick any two. A typical SQL database will have good consistency and availability but no partition tolerance, unless you have some plugins for sharding and play with stuff like Galera, but then again you have your own problems there. Databases like Scylla, Cassandra and a lot of the others in the same category basically don't stick to consistency too much in order to have good partition tolerance and availability. That doesn't mean that we don't have consistency at all. It means that consistency is eventual and not immediate, and it's tunable.

So you can pick exactly what kind of consistency you expect. This is the CL parameter you issue with the queries, so per query you can decide whether you want it to have high consistency, or just good enough consistency, or no consistency at all if you just do a CL of ONE. What happens during repair is that data is compared across all the replicas, and you need to have replicas if you want to have a multi-master cluster. Whenever there is an inconsistency, it takes whatever is the newest record, pushes that to all the other nodes and overwrites whatever exists there. So basically the most recent record always wins. Every insert in Scylla happens with a timestamp, which is something you can view, but not with a regular query. When you issue a regular select you will not see it, but everything has a timestamp, every row. So the row with the highest timestamp is the one that gets replicated during repair.
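The "most recent record wins" rule can be sketched in a few lines. This is a toy model and not how Scylla implements repair: each replica is reduced to a single `(value, timestamp)` pair, and the `repair_row` helper is hypothetical.

```python
def repair_row(replicas):
    """Reconcile one row across replicas using last-write-wins.

    `replicas` maps a node name to a (value, timestamp) tuple, or to None
    when that node has no copy of the row. After the repair every node
    holds the copy with the highest timestamp.
    """
    present = [cell for cell in replicas.values() if cell is not None]
    if not present:
        return dict(replicas)  # nothing to repair
    newest = max(present, key=lambda cell: cell[1])
    return {node: newest for node in replicas}
```

For example, if two nodes hold `("new", 20)` and a third still holds `("old", 10)`, all three end up with `("new", 20)`.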

So how can we do those repairs, how can we reach consistency? At write time, if we cannot write to all the replicas at once, there is hinted handoff. That basically means the other nodes will store the writes that were supposed to be sent to the node that wasn't able to answer on time, and they will stream them whenever that node becomes available. This is a good mechanism, but it's limited in terms of how much they can hold.

At read time there are read repairs. Say we read and we want good consistency, so we read with a consistency level of three. We get two answers with one row, and the third comes in from a node with a replica, but that one is older. We will return the answer from the two nodes that have the newer one, and then we will also fix the one on the third node.

And then there is the maintenance action, which is a full repair, and that is exactly what the manager does. It can be done at the whole-cluster level, it can be done at smaller levels, and we'll cover that. But this is something that is part of the maintenance, just like backups, that databases like Scylla and Cassandra and others with lower consistency require.

Okay, so Scylla Manager basically has two components. There is the manager itself, a daemon that exposes a REST API, and there is the tool to talk to the daemon. You can of course issue your own REST API calls, but there is a command called sctool that talks to the manager from the CLI. It's centrally managed. Usually it's installed together with the monitoring on the same node, just because the monitoring machine already has access to all the Scylla nodes and because nobody wants to designate an entire server just for the manager, but it's up to you. It is created specifically for Scylla, so it will probably not work for anything else, but the point is that it's shard aware: Scylla's shard-per-core architecture is taken into account by the manager.

It can be made highly available and it is stateless. It uses a Scylla database as a back end, which it installs locally, or it can connect to an existing Scylla database. Obviously it's not a good idea to use it with the same database it manages. So you either install another cluster, if you want it highly available, or just keep it to the small local single-replica setup, which is quite common. It's not a production database after all. And yes, you can use any Scylla cluster, but like I said, you don't want to create the chicken-and-egg situation where the manager is managing the database that it uses for its own back end.

Okay, right now the current version of the manager uses SSH to get API access, because most Scylla setups do not expose the API. The API is listening on a localhost port, so when you run nodetool you can run it locally, but typically you cannot run it against another host because it will just not be able to reach it. The manager does not use nodetool directly. It uses the API, the REST endpoint that nodetool connects to, but it simply SSHs into the host and then talks to the API locally. Of course, you can expose the API outside. That's not best practice from a security point of view, because the API is not secured, but it is possible, and then the manager can just talk to it directly. Like I said, it connects over SSH, so the connection is secure, and it deals with establishing all those connections, the key distribution, all of that. There are scripts inside the manager to do that. And the database it uses is the current latest Enterprise release.

Okay, just one thing I forgot to mention here: remember, there is a dashboard for Scylla Manager. The monitoring is integrated with the manager, so you can go into that dashboard and, if you have the manager configured as well, you'll be able to see the progress of the task that is currently running and everything that it does.

One more thing. As you can see, in the future Manager 2.0 will be dropping the SSH part, because it has proven to be a bit too complex to set up, especially in secure environments where you have your SSH daemon locked down really tight and you are unable to log in without, for example, Kerberos or a password entered every time. We took that into account when we were planning the next version. This is one of the main pain points for lots of customers, who had these problems because of their specific internal SSH configurations. So the next version will simply have an agent running on the Scylla node, and the manager will be talking to that agent instead of trying to establish SSH. That will be a secure, encrypted connection, but you will need to have the agent running on every node, just like we already have the node exporter running there, so that's pretty standard.

So, the deployment itself for the manager. Once you install the RPMs you just run scyllamgr_setup. It's a tiny little script that asks you just a couple of questions. Then you run scyllamgr_ssh_setup. You give it the current user that has access to the nodes and the path to that user's key, then the manager user to create, which will be used for SSH, and the key to push for that user. Then you give it a single IP of a node. It will go into that node with SSH using the existing user, in this case centos, discover the cluster using nodetool, pull out all the data, reach out to all the other nodes and establish connectivity everywhere, create the manager user, add the key to its authorized keys, and so on. Then it will just tell you that it connected to all of the nodes in the list.

Okay, and this is what happens when I want to add a cluster. sctool is the command, I already mentioned it, so: cluster add. Give it one host. There is no need to give it a list of hosts, it will discover the rest. Give it a name. This name does not necessarily have to be the cluster name within Scylla, it is just something to designate it inside the manager database. It will give it a UUID anyway, so you'll probably be using that UUID more often than the name. And that's it. If you have SSH in place and you didn't expose the API, then you also need to specify the SSH parameters. In the future SSH will just become redundant in this particular step, but right now we still use it. Okay, so if you run the status command after that, you'll get a nice little table printed out to your screen with the status of the cluster.

Okay, and this is what the monitoring looks like for the manager. You will have your Prometheus pulling data from the manager as well, and it will show the dashboard. Let me just open it here... oh, it does not open. Anyway, as you can see here, it shows us how many nodes there are and the current repair progress. It also shows the CQL health check, so it will keep pinging the CQL port and making sure that it's available on all the nodes. That's the next thing I'm going to talk about, soon enough.

So these are the types of deployment we have available. We have binaries for all the common distributions, and we also have it on Docker. You can install it directly from Docker, but that means you will need to install Scylla and connect it to it, so you link the two containers. If you install Scylla on Docker and the manager on Docker, you just link them together and they will work.

This presentation can be found on the University website once you log into Scylla University. If you follow this gist, it will guide you through setting up a full three-node Scylla cluster with Docker, how to set up the monitoring that works with it directly, and how to set up the manager to manage this cluster. So you will have three Scylla nodes, three containers for monitoring, and two more, one Scylla and one manager, running on your machine and pretty much presenting the entire stack. There are RPMs, there are debs for Debian and Ubuntu, and there is a Docker image on Docker Hub.

Okay, so the next version, like I already said, will have node-side agents, so we will not have to deal with SSH anymore. The manager will also do backups. It will be using, what was it called, rsync? Not rsync... sorry, yeah, rclone, that's the one. They gave a big presentation on it a few months ago in Poland. And if you have any requests for additional features, feel free to open them. We're always happy to get more.

So basically, right now the main reason for the manager is repairs, and what we really care about are those maintenance repairs that need to happen all the time. First of all, like I said, every row has a timestamp and the newest one wins. Another thing to know about Scylla, and I'm pretty sure it's been mentioned before since this is the advanced track, is the fact that we don't delete data directly. We replace the data with a tombstone when we issue a delete on the row, and then garbage collection, which is a periodic process, walks over all those tombstones and drops that data.

Okay, and this is one of the reasons we really need to make sure we do those maintenance repairs. There are more problems, but this is one of the more obvious and common ones. When we issue a delete, it gets propagated to all the relevant replicas. Say one of the nodes was not available during that propagation. It doesn't necessarily need to be down. It could be a network outage, or it was overloaded at the time and did not respond. Whatever the reason, it wasn't available. So out of three replicas we have two with a tombstone and a third where nothing happened. Later on garbage collection runs through that and the tombstone is removed, we don't have it anymore. Then the node that wasn't available comes back online and we issue a read. Most of the nodes don't have that row at all anymore, so there is no newer record with a newer timestamp that says "it's not here". But the third node actually has it, so it responds, and it has a timestamp which is newer than nothing, and we end up having this record propagated back by the repair to all the other nodes.

That means we really need to keep the repair period shorter than the garbage collection period to avoid this situation, because if garbage collection happens less often than repairs, this cannot happen. It's kind of a race condition, I guess. This is why the default garbage collection grace period is set to ten days. You can play with it; it's called gc_grace_seconds, and it is set per table. We do repairs on a weekly basis by default, so that leaves a three-day margin between the two.
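The race between repair and garbage collection can be played out in a toy simulation. This is a deliberately simplified model and not Scylla's actual compaction or repair logic: there is one row, one missed delete, one repair and one garbage collection pass, and the `deleted_row_resurrects` helper is hypothetical.

```python
def deleted_row_resurrects(repair_day, gc_grace_days=10):
    """Return True if a deleted row comes back after repair.

    On day 0 a delete (timestamp 2) reaches replicas a and b as a
    tombstone, while replica c misses it and keeps the live row
    (timestamp 1). One repair runs on `repair_day`; tombstones are
    purged on day `gc_grace_days`.
    """
    live = ("live", 1)
    tombstone = ("tombstone", 2)
    replicas = {"a": tombstone, "b": tombstone, "c": live}

    # Process the two events in chronological order (gc first on a tie).
    events = sorted([(repair_day, "repair"), (gc_grace_days, "gc")])
    for _, event in events:
        if event == "repair":
            present = [c for c in replicas.values() if c is not None]
            if present:
                newest = max(present, key=lambda cell: cell[1])
                replicas = {node: newest for node in replicas}
        else:
            # Garbage collection drops tombstones and leaves live rows alone.
            replicas = {
                node: None if cell is not None and cell[0] == "tombstone" else cell
                for node, cell in replicas.items()
            }
    return any(c is not None and c[0] == "live" for c in replicas.values())
```

With a weekly repair and a ten-day grace period the tombstone reaches the third replica before it is purged, so the row stays deleted. Push the repair out to day 14 and the old row wins, because by then there is nothing newer to compare it against.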

Okay, so we don't really need the manager to do repairs, because we had repairs long before we released the manager. Otherwise we wouldn't be able to operate, right? We could just do nodetool repair. The problem with nodetool repair is that you need to run it on each host specifically. It's an ad hoc operation and it's not monitored. You just issue it and pray for the best. There are no records for it, there is nothing. You just issue it and wait for it to somehow run through, and that's it. It is not shard aware, and there are no retries. If it fails, you just start over.

At the manager level, when you do a repair you can do it globally per cluster, and you can also drill down to specific parts of your cluster. It can be scheduled. Of course, with nodetool you can also do cron jobs and that sort of thing, but again you need to manage it somehow. Here you do it through the manager, in one central location. You can stop and start a repair. You can pause it: if the repair is very big and has been running for a long time, and you expect a very serious load on the cluster right now, you can just pause the repair until the load goes away and then continue. It's better to rerun it if there are lots of inserts at that point, but still, it's possible. The history is kept, so you know exactly how long those repairs took, what succeeded, what didn't, and so on. It's shard aware. It's done specifically for Scylla, so it's aware of the architecture. And you can play with the rules for the retries. I'm not going to go into that here, but it has lots of bells and whistles around it.

So, the jobs. This is the basic granularity of the jobs, but it can get pretty complex and there is even more that you can do with it. For the simple repair you just specify the cluster. This is what you do instead of going to every node with nodetool. You can do a recurrent repair, so you can schedule it, and there are several ways to schedule it. You can do a per-region repair, for example, so you only specify the DC. You can even repair only a specific keyspace, or even just a table. Say you know that one keyspace is getting hit on weekends and another only on Wednesdays, so you schedule the repairs for other days. It makes sense.

It also supports wildcards, so you can do an sctool repair and then a star for everything except us-east-1, as in my example at the bottom here. This needs to be in quotes, because otherwise the Linux shell will try to expand it itself, but the manager supports it. This syntax also applies to keyspaces and tables; these wildcards can work with anything. You can also specify just the specific hosts you want to repair, or even just the specific token ranges you want to repair. But again, you need to be very careful with these things, because you don't want to get into those zombie problems, resurrections and all of that stuff. So you need to make sure that your entire database gets repaired often enough to avoid these problems.
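To get a feel for how an "everything except one DC" filter behaves, here is a small sketch using Python's standard glob matching. It is an illustration only: the `!` prefix for exclusions, the rule that later patterns override earlier ones, and the `select` helper are assumptions, so check the sctool documentation for the exact syntax your manager version accepts.

```python
from fnmatch import fnmatchcase


def select(names, patterns):
    """Filter names with glob patterns; a leading '!' excludes matches.

    Patterns are applied in order, so a later pattern overrides an
    earlier one for the names it matches.
    """
    selected = []
    for name in names:
        keep = False
        for pattern in patterns:
            if pattern.startswith("!"):
                if fnmatchcase(name, pattern[1:]):
                    keep = False
            elif fnmatchcase(name, pattern):
                keep = True
        if keep:
            selected.append(name)
    return selected


# Hypothetical data center names.
DATACENTERS = ["us-east-1", "us-west-2", "eu-west-1"]
```

Here `select(DATACENTERS, ["*", "!us-east-1"])` keeps the two data centers other than us-east-1, and the same helper works on keyspace or table names.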

Okay, the other thing the manager does is the CQL health check. I already mentioned it before, but basically it's just another kind of health monitoring that we do through the manager. It monitors, on every node, whether the CQL port is open, available, and up and running, and it even keeps a graph of its response time, so that's another indicator of your current load. It's built on the manager, so it's already aware of the cluster topology. You don't need to define it like you would in the monitoring stack. And it's pretty lightweight. The connection doesn't behave like, for example, your typical cassandra-stress, which when you run it will go to one node, discover the entire cluster and start working against it. Here it will only connect to the single host, without loading anything else and without querying the system tables. It's secure: it does not actually log into CQL, it only probes it, and it reports both its availability and the latency of the response. So that was pretty much it.
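A very rough approximation of such a check is a timed TCP connect to the CQL port. This sketch is not the manager's implementation, which speaks to the CQL endpoint rather than just opening a socket; the `probe` helper and the default port 9042 are assumptions for illustration.

```python
import socket
import time


def probe(host, port=9042, timeout=1.0):
    """Try to open a TCP connection and time it.

    Returns (True, seconds) when the port accepts the connection and
    (False, None) when it is refused, unreachable or times out.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, None
```

Running `probe` against each node on a schedule and recording the latency gives the same two signals the talk describes: availability and response time.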
