With our recent release of Container Stacks v2 into public beta, we're totally loving Kubernetes. But as with all love affairs, there are some bothersome aspects that we have to accept and work with. One such aspect is the inflexibility of the vanilla shutdown sequence provided by Kubernetes.
We're also prolific users of Sidekiq for the parts of our backend that are Ruby-based (we're running a bunch of other technologies, but we think Sidekiq is hands-down the best for running Ruby jobs). As with any background worker, Sidekiq is sensitive to its shutdown sequence, so we need more control over it.
This article provides a solution to achieve graceful shutdown of Sidekiq workers via the Kubernetes pod shutdown lifecycle.
There is a lot of documentation out there around the current Kubernetes pod shutdown sequence (see appendices for some starting points). NOTE: I say current as this information is only valid as of now... this might change (though I think at this point that is fairly unlikely). The current shutdown sequence looks like the following:
1. Pod marked as *terminating*
2. Optional: PreStop hook called synchronously
3. SIGTERM sent to container process if still present
4. Kubernetes waits up to *grace-period* for container to exit
5. SIGKILL sent to container process if still present
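To make steps 3 to 5 concrete, here is a minimal Ruby sketch that applies the same TERM, wait, KILL pattern to a local child process. It is only an illustration of the pattern, not how Kubernetes implements it: the helper names `stop_with_grace` and `spawn_worker` are made up for this example, the "terminating" mark and the PreStop hook are not modelled, and it assumes a Unix-like system where `fork` is available.

```ruby
# Send TERM, wait up to `grace` seconds for the process to exit,
# then fall back to KILL. Returns :terminated or :killed.
def stop_with_grace(pid, grace:, poll: 0.05)
  Process.kill("TERM", pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    # With WNOHANG, Process.wait returns nil while the child is still running.
    return :terminated if Process.wait(pid, Process::WNOHANG)
    sleep(poll)
  end
  Process.kill("KILL", pid)
  Process.wait(pid)
  :killed
end

# Hypothetical stand-in for a container process. It either exits cleanly
# on TERM or ignores TERM entirely.
def spawn_worker(ignore_term:)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    if ignore_term
      trap("TERM", "IGNORE")
    else
      trap("TERM") { exit(0) }
    end
    writer.puts("ready") # tell the parent that the handler is installed
    writer.close
    sleep
  end
  writer.close
  reader.gets # block until the child has installed its signal handler
  reader.close
  pid
end

puts stop_with_grace(spawn_worker(ignore_term: false), grace: 2)
puts stop_with_grace(spawn_worker(ignore_term: true), grace: 0.3)
```

A worker that handles TERM promptly never sees the KILL; one that ignores it is killed once the grace period runs out, which is exactly the situation a long-running Sidekiq job must avoid.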
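Kubernetes allows you to specify terminationGracePeriodSeconds in your spec (i.e. how long it will wait for shutdown before sending SIGKILL; the countdown starts when the pod is marked as terminating, so it includes any time spent in the PreStop hook). Unfortunately, Kubernetes doesn't allow you to specify the shutdown sequence itself.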
At Cloud 66 we were previously lucky enough to control the shutdown process via our own homegrown scheduler. This enabled us to expose the shutdown sequence to our users directly (in the form of USR1;1h;TERM;10s;KILL, for example). But now we need another solution.
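For readers unfamiliar with this kind of sequence string, the sketch below shows one way it could be broken into steps. The grammar is an assumption inferred from the single example above (semicolon-separated tokens, each either a signal name or a duration with an `s`, `m` or `h` suffix); the function name `parse_shutdown_sequence` is hypothetical and this is not the parser our scheduler used.

```ruby
# Seconds per unit suffix (assumed units: seconds, minutes, hours).
UNIT_SECONDS = { "s" => 1, "m" => 60, "h" => 3600 }.freeze

# Turn "USR1;1h;TERM;10s;KILL" into an ordered list of steps, where each
# step is either { signal: "NAME" } or { wait: seconds }.
def parse_shutdown_sequence(sequence)
  sequence.split(";").map do |token|
    token = token.strip
    if (match = token.match(/\A(\d+)([smh])\z/))
      { wait: match[1].to_i * UNIT_SECONDS.fetch(match[2]) }
    elsif Signal.list.key?(token)
      # Signal.list holds the signal names known to this platform, without the SIG prefix.
      { signal: token }
    else
      raise ArgumentError, "unrecognised step: #{token.inspect}"
    end
  end
end

parse_shutdown_sequence("USR1;1h;TERM;10s;KILL").each { |step| p step }
```

Read this way, the example means: send USR1, wait up to an hour, send TERM, wait up to ten seconds, then send KILL.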
Furthermore (and specific to Sidekiq), as we have some very long-running jobs (dependent on external resources), we want a long wait time, but we also want to terminate the workers as soon as they are no longer busy. So our ideal Sidekiq shutdown sequence looks like the following:
1. Send USR1 (or TSTP for Sidekiq > 5.0.0) to workers
2. Wait until they are no longer processing jobs
3. Send TERM
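The three steps above can be sketched in Ruby roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the Sidekiq process title has the form `sidekiq <version> <tag> [N of M busy]`, that a `ps` supporting `-o args= -p <pid>` is available in the container, and that the worker's PID is already known. The names `busy_count`, `wait_until_idle` and `graceful_sidekiq_shutdown` are made up for this example.

```ruby
# Extract the busy-thread count from a Sidekiq process title such as
# "sidekiq 5.0.4 myapp [3 of 25 busy]". Returns nil if the title does not match.
def busy_count(proc_title)
  match = proc_title.match(/\[(\d+) of \d+ busy\]/)
  match && match[1].to_i
end

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Poll busy_check (a callable returning the number of busy threads) until it
# reports zero or the timeout elapses. A nil reading counts as idle.
# Returns true if the worker went idle, false on timeout.
def wait_until_idle(busy_check:, timeout:, interval: 1)
  deadline = monotonic_now + timeout
  loop do
    return true if busy_check.call.to_i.zero?
    return false if monotonic_now >= deadline
    sleep(interval)
  end
end

# Step 1: quiet the worker. Step 2: wait for running jobs. Step 3: send TERM.
# Pass quiet_signal: "USR1" for older Sidekiq versions.
def graceful_sidekiq_shutdown(pid, timeout:, quiet_signal: "TSTP", interval: 1)
  pid = Integer(pid)
  Process.kill(quiet_signal, pid)
  busy_check = -> { busy_count(`ps -o args= -p #{pid}`) }
  idle = wait_until_idle(busy_check: busy_check, timeout: timeout, interval: interval)
  Process.kill("TERM", pid)
  idle
rescue Errno::ESRCH
  true # the process has already exited
end
```

Passing the busy check in as a callable keeps the waiting logic independent of how busyness is measured, so the `ps`-based check could be swapped for another source of the same number.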
Solution: Use a Pre-Stop Hook
Looking at the shutdown sequence above, you'll see that there is a PreStop hook point called during the sequence. There is more on this in the Kubernetes Container Lifecycle Hooks documentation. The salient bit of information is that Kubernetes will execute a command of your choosing at that hook point, and it will execute it synchronously, waiting for the command to complete before resuming the shutdown sequence.
Using this hook point, we can inject the graceful shutdown behaviour we want for our Sidekiq workers. And because we need this ourselves (and given that Sidekiq is Ruby-based), I put together the following Ruby script to do just that!
As the hook command executes in the context of your image, you'll need to include this script inside your image (simply put it in your source code if you're using Cloud 66 Skycap). Note that the script accepts the following arguments:
Usage: sidekiq_safe_shutdown.rb [options]
    -o, --output [ARG]     File-path or stdout (default: stdout)
    -t, --timeout [ARG]    Timeout in seconds (default: 120)
    -h, --help             Display this help
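As an illustration, an interface like the one above could be declared with Ruby's standard `optparse` library along these lines. This is a sketch of the option handling only; the function name `parse_options` and the shape of the returned hash are assumptions, and the defaults are taken from the usage text.

```ruby
require "optparse"

# Parse command-line arguments into an options hash, using the defaults
# shown in the usage text (output: stdout, timeout: 120 seconds).
def parse_options(argv)
  options = { output: "stdout", timeout: 120 }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: sidekiq_safe_shutdown.rb [options]"
    opts.on("-o", "--output [ARG]", "File-path or stdout (default: stdout)") do |value|
      options[:output] = value if value
    end
    # Integer makes optparse reject non-numeric timeouts.
    opts.on("-t", "--timeout [ARG]", Integer, "Timeout in seconds (default: 120)") do |value|
      options[:timeout] = value if value
    end
    opts.on("-h", "--help", "Display this help") do
      puts opts
      exit
    end
  end
  parser.parse(argv) # non-destructive: argv itself is left untouched
  options
end

p parse_options(["-t", "3600"])
```

Keeping the parsing in a function that takes `argv` makes it easy to exercise with different argument lists without touching the real `ARGV`.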
For the example below we're putting this script in our image at the path /tmp/sidekiq_safe_shutdown.rb.
And don't forget to make it executable with:
chmod +x /tmp/sidekiq_safe_shutdown.rb
Invoking via Kubernetes Manually
If you're running Kubes directly, then you'll need to manually modify your Pod spec to include terminationGracePeriodSeconds and invoke the PreStop hook:
spec:
  # with default timeout
  terminationGracePeriodSeconds: 15
  # or with specific timeout
  terminationGracePeriodSeconds: 3605

lifecycle:
  preStop:
    exec:
      # with default timeout
      command: ["/tmp/sidekiq_safe_shutdown.rb"]
      # or with specific timeout
      command: ["/tmp/sidekiq_safe_shutdown.rb", "-t", "3600"]
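To show where the two fragments sit relative to each other, the Ruby sketch below builds a minimal Pod manifest and prints it as YAML. The Pod name, container name and image are placeholders invented for this example, and the 5-second buffer is an assumption read off the 3600/3605 pairing above. The one firm point is the relationship between the numbers: the grace period has to be longer than the script timeout, because time spent in the PreStop hook counts against it.

```ruby
require "yaml"

SCRIPT_PATH = "/tmp/sidekiq_safe_shutdown.rb"

# Build a minimal Pod manifest whose grace period leaves `buffer` seconds
# of headroom on top of the PreStop script timeout.
def sidekiq_pod_manifest(timeout:, buffer: 5)
  {
    "apiVersion" => "v1",
    "kind" => "Pod",
    "metadata" => { "name" => "sidekiq-worker" }, # placeholder name
    "spec" => {
      # Pod-level setting: total time allowed for the hook plus shutdown.
      "terminationGracePeriodSeconds" => timeout + buffer,
      "containers" => [
        {
          "name" => "sidekiq",                        # placeholder name
          "image" => "example/sidekiq-worker:latest", # placeholder image
          # Container-level setting: the hook run before SIGTERM is sent.
          "lifecycle" => {
            "preStop" => {
              "exec" => { "command" => [SCRIPT_PATH, "-t", timeout.to_s] }
            }
          }
        }
      ]
    }
  }
end

puts sidekiq_pod_manifest(timeout: 3600).to_yaml
```

Note that terminationGracePeriodSeconds belongs to the Pod spec, while lifecycle belongs to each container entry.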
Invoking via Cloud 66
# with default timeout
pre_stop_command: /tmp/sidekiq_safe_shutdown.rb
stop_grace: 15s

# or with specific timeout
pre_stop_command: /tmp/sidekiq_safe_shutdown.rb -t 3600
stop_grace: 3605s
And that should be all you need - now when your Sidekiq workers shut down they will do so gracefully!
Appendices (Further Reading)