What are stalls and what do they mean? What can be done with stalls?
A stall happens when a specific task executes continuously on the CPU without letting other operations run. Possible causes for stalls are a ScyllaDB bug, the data model used, and the OS.
What are stalls? I’ve talked about them before. As I said, the CPU scheduler in
ScyllaDB allocates tasks for us. When we start, for example, to run a specific
function, it can have a lot of sub-functions that it calls, and if those
sub-functions call other sub-functions, then it will continue, and so forth.
Where will it end? It will end in two cases. It will end if the function calls a
yield function, which basically means “I can keep on running, but does anybody
else need to run?”, or it will end if I do I/O to the disk, to the network,
whatever. In all other cases my code will keep on running. So if I have an
endless loop, it will keep on running; nobody inside ScyllaDB will kill it. The
model that ScyllaDB uses is cooperative scheduling. I have some slides on it,
I’m not sure we’ll reach them, but that’s the model: tasks have to relinquish
the CPU. They have to give up, or else they’ll keep on running.
So when we’re talking about stalls, which
means that something didn’t give up the CPU, it may be something in our code
that is bad, and it may have to do specifically with your data model as well.
I’ll give an example here. We have an issue in ScyllaDB today: if you did an
ALTER and you have a large partition inside the cache, we need to upgrade the
schema of that partition inside the cache. It will happen when you read from it,
and we need to do it at the partition level. So if I have a one-gigabyte
partition and I altered the schema, then until I read that partition in the
cache, nothing will happen. When I read it, it will upgrade the schema, but it
needs to do it across the whole partition, even if you’re reading a single row.
We can’t indicate that this is what’s happening in the metrics, but we can find
out that this is happening through a backtrace. So a stall in the ScyllaDB logs
is translated, basically, to a backtrace in the code, okay, that we can read.
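The cooperative model described above can be sketched in a few lines. This is not ScyllaDB code: it is a minimal Python `asyncio` analogy, assuming that an event loop with explicit yield points behaves enough like the scheduler in the talk to show the effect. The names `heartbeat`, `upgrade_row`, `upgrade_partition` and `longest_gap` are hypothetical and exist only for this illustration. One task walks a whole "partition" row by row, and a second task measures how long it was kept off the CPU.

```python
import asyncio
import time


async def heartbeat(gaps, stop):
    """Record how long this task waits between consecutive turns on the loop."""
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(0)  # yield point: let other tasks run
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


def upgrade_row(row):
    """Hypothetical stand-in for per-row work; it only burns a little CPU."""
    return sum(range(200)) + row


async def upgrade_partition(rows, yield_every=None):
    """Walk every row of the partition, optionally yielding every N rows."""
    for i in range(rows):
        upgrade_row(i)
        if yield_every and i % yield_every == 0:
            await asyncio.sleep(0)  # give up the CPU voluntarily


async def longest_gap(rows, yield_every=None):
    """Return the longest time the heartbeat task was kept waiting."""
    gaps = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(gaps, stop))
    await asyncio.sleep(0)  # let the heartbeat take its first timestamp
    await upgrade_partition(rows, yield_every)
    stop.set()
    await hb
    return max(gaps)


if __name__ == "__main__":
    stalled = asyncio.run(longest_gap(200_000))
    cooperative = asyncio.run(longest_gap(200_000, yield_every=1_000))
    print(f"no yield points:   longest gap {stalled * 1000:.1f} ms")
    print(f"yield every 1,000: longest gap {cooperative * 1000:.1f} ms")
```

Without yield points, the longest gap grows with the size of the partition, which is the shape of the problem in the schema-upgrade example: the work is proportional to the whole partition even though the caller asked for one row.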
Most of the time, or some of the time, you won’t be able to do anything
about it, but at least it will tell you why it happened, and then you can decide
what to do on your end. So stalls are recorded in this string, and large stalls
can lead to task violations and so forth. So what do you do with stalls?
One, you report them. It may be an enhancement or a bug that we need to fix, and
if we see additional people telling us they have that issue, we’ll prioritize
it. Two, it may be your data model. So you saw that using blobs would be great,
and you started bringing in blobs that are ten megabytes of memory. That’s not
great, okay? You may be able to work with blobs that are about half a megabyte
of memory. We’ll be able to tell you that the stalls are caused by large cells,
but if you go back, you have a way today to figure it out on your own as well.
The other source can be the OS. We are fixing bugs or working around bugs in the
OS as well as we can, but there are cases where we’ll tell you that you need to
update the kernel, or that you’re using an old version and there is a patch to
XFS, whatever. So the instructions you’ll get when you post a stall can be:
upgrade to a new version, this is fixed; or change your data model, again,
you’re using large blobs; or three, +1 to a fix or enhancement around this, and
we’ll take that into account. So that is a stall.
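One way to act on the "change your data model" advice for large blobs is to split each blob into smaller cells on the client side. The talk does not prescribe this approach, so treat the following as an assumption: a minimal Python sketch in which `CHUNK_SIZE`, `split_blob` and `join_chunks` are hypothetical names, and the 512 KiB cap simply mirrors the "about half a megabyte" figure mentioned above.

```python
# Hypothetical cap per cell, roughly the "half a megabyte" from the talk.
CHUNK_SIZE = 512 * 1024


def split_blob(blob, chunk_size=CHUNK_SIZE):
    """Split a blob into (chunk_index, chunk_bytes) pairs of bounded size."""
    return [
        (offset // chunk_size, blob[offset:offset + chunk_size])
        for offset in range(0, len(blob), chunk_size)
    ]


def join_chunks(chunks):
    """Reassemble the original blob from chunks arriving in any order."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return b"".join(chunk for _, chunk in ordered)


if __name__ == "__main__":
    blob = bytes(10 * 1024 * 1024)  # a ten-megabyte blob, as in the example
    chunks = split_blob(blob)
    print(len(chunks), "chunks, largest is", max(len(c) for _, c in chunks), "bytes")
```

Each pair could then be stored as its own row, for instance with the chunk index as a clustering column, so that no single cell is larger than the cap. Whether that fits depends on how the application reads the data.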