Covers some common problems and how to solve them when using ScyllaDB and the Kubernetes Operator.
Some of the common problems are not looking at the correct resources (for example, forgetting the namespace), a lack of resources in the Kubernetes cluster, wrong node affinity, taints and tolerations, and wrong regions/availability zones.
Some common performance issues stem from the absence of cpuset and hostNetworking, the absence of monitoring, and general bad practices on the client side.
When it comes to troubleshooting, I get questions from users and from customers all the time, so I decided to include a brief session here on the most common problems that we see. First, in order to verify things on Kubernetes, remember to always specify ‘-n’ for the namespace. I constantly find myself trying to look at a ScyllaDB pod without specifying the namespace and then I cannot find that pod; it’s silly, but it happens a lot. You can also list all the resources for a particular namespace, so you can see that everything is there. Kubernetes will show you how many pods you have out of the number you requested, so if you asked for three it could show 3/3, meaning they are all there, and it will also tell you which of those pods are ready and which are not. The first time you deploy a cluster you will see one pod, then that pod becomes ready, then you see the second one being deployed, and so on.
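For example, assuming the cluster lives in a namespace called scylla (adjust to whatever namespace you used), a quick check could look like this:

    kubectl get pods -n scylla            # lists each ScyllaDB pod and whether it is Ready
    kubectl get statefulsets -n scylla    # READY shows e.g. 3/3 once all requested pods exist
    kubectl get all -n scylla             # everything created in that namespace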
Also, after you have all the resources there, one thing you want to do is describe the component. We talked about the StatefulSet that the Operator uses, so for example: kubectl describe statefulset, then the name of the StatefulSet (for example, the one for your ScyllaDB cluster) and, of course, ‘-n’ with the ScyllaDB namespace in this case. That will show you all the events that are happening on the Kubernetes side.
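A minimal sketch, assuming a namespace called scylla and a StatefulSet named scylla-cluster-us-east-1-us-east-1a (the real name depends on your cluster, datacenter and rack names, so list the StatefulSets first):

    kubectl get statefulsets -n scylla
    kubectl describe statefulset scylla-cluster-us-east-1-us-east-1a -n scylla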
Lastly, you can check the logs of any pod. In the case of the Operator pod and the ScyllaDB pods, we redirect all the logs not to a file but to stdout, so you can easily inspect them using kubectl logs.
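For example, assuming the ScyllaDB cluster runs in the scylla namespace and the Operator in scylla-operator (adjust the pod and namespace names to your setup):

    kubectl logs scylla-cluster-us-east-1-us-east-1a-0 -n scylla      # logs of one ScyllaDB pod
    kubectl logs deployment/scylla-operator -n scylla-operator        # logs of the Operator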
The most common problem that I see myself, and that my users and customers hit when using the Operator, is a lack of resources in the Kubernetes cluster. If you don’t have Kubernetes nodes with, say, 32 CPUs, 512 gigabytes of RAM and three terabytes of disk available, and you request those resources, Kubernetes is not going to create them out of nowhere. So first make sure that you have the Kubernetes resources available so you can deploy your ScyllaDB cluster.
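One way to sanity-check what your Kubernetes nodes can actually offer (the node name here is a placeholder, and kubectl top needs metrics-server installed):

    kubectl get nodes
    kubectl describe node <node-name> | grep -A 6 Allocatable   # CPU, memory and storage the node can hand out
    kubectl top nodes                                           # current usage per node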
The second most common problem is that you are using some sort of node affinity or taints and tolerations and it’s just wrong: maybe you are trying to deploy ScyllaDB pods on nodes that have taints which will not accept those pods.
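To see which taints a node carries, and therefore which tolerations your ScyllaDB pods would need, something like this works (the node name is a placeholder):

    kubectl describe node <node-name> | grep Taints
    kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints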
The last most common problem is that people sometimes just copy and paste the examples from the documentation. I think our examples use us-east on Google Kubernetes Engine, and then people deploy their Kubernetes cluster on us-west and it just doesn’t work. So by all means verify those three things before you come asking questions, but at any time we can help you investigate.
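For the region/zone mismatch specifically, the standard topology labels on the nodes tell you where they really are (assuming a reasonably recent Kubernetes version that sets the topology.kubernetes.io labels):

    kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone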
One last thing: if you find a real problem with the Operator, go to GitHub and open an issue; that helps us make it better, fix bugs and fix problems. Of course, we also accept feature requests, so if something is missing in our Operator, by all means let us know so we can properly assign resources and work on it.
Then there are the performance problems. Maybe you deploy the Operator, you deploy the ScyllaDB cluster, and when you test it and run some workloads you see that the performance is very poor or suboptimal. Usually it boils down to those problems with cpuset and hostNetworking; I mentioned that these should be the default, but of course that is not going to help you if you didn’t do your homework and set up your Kubernetes cluster properly.
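A quick way to double-check what the cluster was actually deployed with, assuming your Operator version exposes the ScyllaCluster resource and uses the cpuset and hostNetworking fields, and that the cluster is named scylla-cluster in the scylla namespace:

    kubectl get scyllacluster scylla-cluster -n scylla -o yaml | grep -E 'cpuset|hostNetworking'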
As I mentioned, there is also the absence of monitoring. Of course, not having monitoring will not by itself cause a performance problem, but without any visibility you are left with guesswork: you are guessing what might be wrong and you don’t know for sure. So again, let me emphasize: have monitoring in place so you can see what’s going on inside the cluster.
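For example, if you deployed the ScyllaDB Monitoring stack into a namespace called scylla-monitoring (the name is just an assumption here), at least confirm that Prometheus and Grafana are actually running before you start chasing a performance problem:

    kubectl get pods -n scylla-monitoring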
The last, and really the most common, performance issue that people see when using the Operator is simply bad practices on the client side. Maybe you have just one client pod with one CPU, you are bottlenecked on the client side, and ScyllaDB is barely using one percent of the cluster’s capacity.
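A rough sanity check, assuming metrics-server is installed and with the namespace names as placeholders: compare how busy the client pods are against the ScyllaDB pods. If the loaders are pinned at their CPU limit while ScyllaDB sits nearly idle, the bottleneck is on the client side:

    kubectl top pods -n <client-namespace>
    kubectl top pods -n scylla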