K8 Kubernetes – Scylla Operator

Hi I’m Yannis I’m a Software Engineer at Arrikto And today we are going to talk about running Scylla on Kubernetes, more specifically using the Scylla operator how did this all start, why are we even talking about running Scylla on Kubernetes when I was doing my diploma thesis, my professor Vangelis asked me Kubernetes seems like good work load management platform and there are stateful software out there like Casandra, like Scylla which I discovered later that require specific expertise to manage so can we leverage Kubernetes in order to automate the management of stateful software in this particular case can we automate the management of Scylla, yes we can, this takes shape in the form of the Scylla operator

So what we will learn today, first of all, we’ll go through some core Kubernetes principles then we will go through the design of the Scylla operator and why we made this particular choices and why they might change in the future, in the meantime we’re going to be doing a hands-on practice with the Scylla operator in the playground environment and finally we will see what are the features of the Scylla operator and what is needed to run it with high performance in a production environment alright so let’s start with some core Concepts about kubernetes and why all of this theoretical stuff is good

I like to start with an example so I at it’s core what is Kubernetes, it’s just a thing that runs containers and nodes and to illustrate that let’s see how the container is running Kubernetes and cointeiners in Kubernetes are called pods, it works in the following way a user expresses their desired state which is in this case that I want a container with surrounds the Tomcat server and exposes port 80 to be running, the users write the desired state in Yaml in a pod object and then submit this subject to the Kubernetes API server, Kubernetes saves the object in its database, etcd and then the Kubernetes Scheduler which watches the API server and notices the new pod and says this pod is not scheduled but the user wants it to run so I can have the schedule it I decide on node on node 4, another instance for example you could you could make API call to the node and say “run this”, in Kubernetes and everything is expressed declaratively and everything is expressed and persisted in etcd that’s why the Scheduler edits the pod object and writes the Node name on the pod object so the desired state is this pod this container has run on node 4 then on each node runs a program called the kubelet, the kubelet also watches the Kubernetes API server. and the kubelet notices the pod says the user desires a container tomcat then it computes the real state of the system how does it do that, it talks to the container runtime docker underneath and says right now I don’t have a container tomcat running, what do I do to reconcile my desired stayed with the actual state I start a new container 

This is the logic of how Kubernetes works, right there’s this whole notion of declaratively expressing what you want to happen and then you have components Loops that notice this Desired state calculate the actual state of the system and then take action to reconcile the two now let’s see something a little bit more complex so well it is possible to run a stateful system on pods, you probably want to  use something a little bit more complex to do that so Kubernetes tells you if you want to run a stateful system like a database or something like that use a statefulSet, a StatefulSet it’s just a bunch of pods that comes with three guarantees first each pod is unique it has its own unique identity, second Kubernetes is carefree about how it handles this pods for example when you deploy statefulset or when you update it, it deploys and updates the pods one by one for example of this is useful for example in Scylla to maintain Quorum among the members and finally it gives its member its own persistent network identity and its own disk so pretty basic stuff but necessary for stateful system to work correctly and to better showcase that let’s see an example and work and step-by-step how the statefulset works so user writes statefulset object and expresses the desired state submits it to the Kubernetes API server which saves it to its DB and then a program the statefulset controller notices the user desired state and says the user wants three replicas how many do I have now? 0 ok starting new one and again the same the user wants three replicas, how many do I have? – the same how many does the user want 3 how many do I have 1 how many ready one I can create a new one and so on and so forth until I reach the steady state and the steady state is when the user desired state matches the real state of the system so here you may have noticed a pattern emerging and this is the controller pattern it is not new in Kubernetes, the controller systems use it since forever and Scylla also uses internally for their scheduler and stuff and the gist of this pattern is that everything in Kubernetes is expressed in objects, objects have two parts, they have a spec that is written by the user and this is the desired state of the system and they have a Status which is the real state of the system and there are programs called controllers which are infinite Loops that observe the desired state provided by the user calculate, the state of the system by talking to physical resources for example the kubelet talks to the container runtime or by seeing other Kubernetes object and finally they reconcile the desired state with the actual state and this is Kubernetes in a nutshell

This is everything that Kubernetes does, so the question is can we take this pattern and apply it to Scylla to manage it, this is where operator pattern emerges what is the operator. we express Scylla as a  Kubernetes object exactly how when you have the pod or the statefulset you have a Scylla cluster object which says that I want three members and two rack, one DC and things like that and then you have a new controller which sees the user desired state calculate real state of the system and reconciles the two, and this controller encapsulates domain-specific knowledge about Scylla, this controller knows how to best run Scylla so now that we have seen some core concepts of how Kubernetes works and how we intend to apply these Concepts in order to manage Scylla let’s see how we designed the operator so as we said we have Scylla cluster object that we say for example I want version 3 I wanted to use CPU set I want one data center, one rack our own controller which sees the desired state calculates actual state takes actions to reconcile the two, so let’s see how we map the concept of Scylla to the concept of Kubernetes 


In our design A Scylla member, I called member because there is an overload in the term “node” because there are physical nodes and the Scylla nodes so just call member a Scylla member is a Scylla node, please make this mental mapping in your head ok so a Scylla member is a Kubernetes pod, a container if you will, A Scylla rack is a statefulset , a Scylla datacenter is multiple statefulsets and we make this because we need racks to be separately scheduled in separate geographical locations and the Scylla cluster is the cluster object, our cluster Custom object so as a user you just write this that I want this cluster with this version, this datacenter, this racks this is an overview of our design and I want to keep four things here so first of all notice the operator pattern the user writes in cluster Custom object and our controller reconciles it then we have each rack inside the statefulset and each member as you will see actually has its own static IP address and we do that with the Kubernetes service object you don’t need to know too much about this we have a guarantee that each member will have an IP address that doesn’t change

So this greatly simplifies management operations and finally that we have included a sidecar in our design to aid us in various operations so the sidecar is there to setup config files install plugins and for future extensibility like backups and restores and finally we see the challenge when we’re trying to see what IP address to use to identify Scylla member. kubernetes tells you to not use IP addresses, kubernetes tells you to use DNS records permanent DNS records, but this doesn’t really work well for Scylla because internally it saves IP addresses so it complicates it and you have to do some bookkeeping instead of that we saw that there was this service object which guaranteed that a certain IP address would be always pointing to a certain pod so how does this work this actually works with IP tables under the hood but we don’t need to know that the operator takes care of all of that for you and we have sticky IPS for each pod of the Scylla cluster so let’s see how this is done so we so how it looks like it’s this is done so when the user like I did right now writes a Scylla cluster object and submits it to the Kubernetes API server the Scylla operator that watches the Kubernetes API server notices that new cluster and infinite loop starts that says ok I want one rack in Europe with 1B how many do I have 0 ok create a new statefulset for that rack again I want one rack in europe with 1B how many members does it have right now 0, ok, scale it up, it creates a member again I want one rack Europe was 1B how many members does it have 1 how many do I want one ok move past it second rack I want europe with 1C it doesn’t exist create it, again so on and so forth, so let’s no see how the operator does a scaling down so we are talking features of the Scylla operator we talk scale-up and deployment let’s talk about scaling down which is different it’s more difficult you have to be careful because if you don’t do this correctly you will lose data and Kubernetes doesn’t do this correctly by default so what does the user do, they Express the desired state 

They make the number of members from two they make it one, they submit this to the Kubernetes API server, saves it in HD and the Scylla operator notices and says how many members should rack europe-west 1C have one how many does it have two I have to remove a member how do I do that I need to tell the member Scylla europe-west 1C one this one the big one on the right that he hasn’t decommissioned it has to stream its data to the other nodes and has to leave, how do I do that similarly to the Kubernetes Scheduler that expresses this intent for something to happen by writing something persistently in etcd, we do the same here we write a label on Kubernetes object on this member’s service now the sidecar that is running alongside the Scylla member notices this label and starts the node to decommission on this member once this is started it will do all Scylla things stream data rebalance all of that and once it is over the sidecar will know, and since it knows, it will again notify the Scylla operator that the decommission is over and everything is safe to be removed so it changes the label from the decommission false to the decommission true the operator knows that it’s now safe to remove and to do that it scales down the statefulset and the statefulset just deletes it and that’s how we do a safe scaled down

Let me get back to the slides so we seen Scylla features we saw scale up, deploy, scale down monitoring so I’m wondering if there is much to say here we use standard Prometheus for metrics Graphana for visualization and there’s native support for Scylla there is some amazing work done in Graphana for the metrics and everything is so beautifully visualized and what the operator does is basically integrate with that already existing monitoring style and the thing that Henric did, now that he started working on the operator is the ability to do a live configuration change so let’s say that you have a Scylla cluster running and you need to change some values in Scylla Yaml or the rackdc.properties or something like that instead of going into its member and changing them you can just specify a configuration to use and the operator will take care of propagating it to all the members by a rolling update so this is something actually that a lot of users of the operator and ask for we talked about core Kubernetes concepts the design of the Scylla operator features of the Scylla operator

Let’s see what are some things that are required to run Scylla operator with high performance the first thing is CPU pinning and this is the classic image that is used in all presentations, so Scylla achieves high performance by using CPU pinning and making sure that each shard on it’s CPU doesn’t interfere with the other now to enable that in Kubernetes you have to enable an option cold “Static CPU management policy” and this must be enabled in the kubelet configuration file, so the good thing is that it is supported the bad thing is that many providers don’t enable it by default so I know that GKE doesn’t enable it, last time I checked so you have to do some manual work there to make sure that it is enabled the second thing you should take care of are disks, the disk quality is very important so you should always use local SSD disks with Scylla, Network Attached Storage was actually a common thing with Cassandra as well but while it’s fault tolerant it is very slow and Scylla handles all replications for you, so there’s really no reason to that support for persistent Kubernetes supports for persistent volumes for local storage it’s I would say medium well it’s not great but it’s not bad it’s somewhere in the middle

To report this post you need to login first.