Workload Schedulability SLOs

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Workload Canary

Overview

This SLO measures the percentage of canary pods that time out before completing their full lifecycle. In the current implementation, the image for the pod is pulled from the built-in OpenShift registry.

The error rate is a general indicator of cluster and workload health.
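
To get a quick overview of the canary pods currently in flight, and whether any of them are stuck in Pending or ImagePullBackOff, they can be listed with the same label selector used in the debugging steps below; a minimal sketch:

kubectl get pods -A -l "scheduler-canary-controller.appuio.io/instance" -o wide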

There are two types of alerts that fire if we expect to miss the configured objective.

  • A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have a label severity: warning.

  • A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective. A page alert should have a label severity: critical and should page on-call.
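
To check which of the two alerts is currently firing, and with which severity label, you can query Alertmanager directly. This is a minimal sketch assuming the standard OpenShift monitoring stack in the openshift-monitoring namespace and that amtool is available in the Alertmanager container:

# List currently firing page alerts; use severity=warning for ticket alerts
kubectl -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093 severity=critical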

Steps for debugging

Unschedulable workloads often indicate resource exhaustion on a cluster but can have many root causes. Since the canary pod uses an image from the built-in OpenShift registry, the alert can be triggered by a misbehaving image registry.

First check the Events: section of the kubectl describe output.

kubectl describe pods -A -l "scheduler-canary-controller.appuio.io/instance"

The output should contain events showing why the pod isn’t schedulable or failed to start:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  46s (x1 over 56s)  default-scheduler  0/12 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {storagenode: True}, that the pod didn't tolerate, 6 Insufficient cpu, 6 Insufficient memory.

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x1 over 15s)  default-scheduler  0/12 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {storagenode: True}, that the pod didn't tolerate, 6 Too many pods.

Events:
  Type     Reason          Age   From               Message
  ----     ------          ----  ----               -------
  Warning  Failed          13s   kubelet            Failed to pull image "invalid:image": rpc error: code = Unknown desc = reading manifest image in docker.io/library/invalid: errors:
denied: requested access to the resource is denied
unauthorized: authentication required
  Warning  Failed   13s               kubelet  Error: ErrImagePull
  Normal   BackOff  12s               kubelet  Back-off pulling image "invalid:image"
  Warning  Failed   12s               kubelet  Error: ImagePullBackOff
  Normal   Pulling  0s (x2 over 15s)  kubelet  Pulling image "invalid:image"
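
If the canary pod has already been cleaned up and the describe output is empty, recent events can still be queried cluster-wide. A sketch filtering for failed scheduling events (adjust the reason, e.g. to Failed, for image pull problems):

kubectl get events -A --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
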
We don’t have a lot of experience with this alert yet. If you had to debug this alert, please consider adding any insight, tips, or code snippets you gained to this runbook.

Resource Exhaustion

Allocatable resources can be checked with the kubectl describe nodes output.

kubectl --as=cluster-admin describe nodes -l node-role.kubernetes.io/app=
[...]
Allocatable:
  cpu:                3500m
  ephemeral-storage:  123201474766
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15258372Ki
  pods:               110
[...]
Non-terminated Pods:                      (38 in total)
[...]
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2644m (75%)   5420m (154%)
  memory             5914Mi (39%)  14076Mi (94%)
  ephemeral-storage  100Ki (0%)    0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
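
For a quick aggregate view, the Allocated resources section can be extracted for all app nodes, and the requests compared against actual usage with kubectl top, assuming the metrics API is available; a sketch:

# Requested vs. allocatable resources per app node
kubectl --as=cluster-admin describe nodes -l node-role.kubernetes.io/app= | grep -A 8 "Allocated resources"

# Current actual usage per app node
kubectl --as=cluster-admin top nodes -l node-role.kubernetes.io/app=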

ImagePullBackOff

The canary pod uses an image from the OpenShift registry. If the registry is unavailable, the pod will fail to start.

Check the Status: section of the kubectl describe output:

kubectl -n appuio-openshift4-slos describe imagestream canary

Check if pods are crashing for the image registry:

kubectl -n openshift-image-registry get pods

Check the logs for the operator and the registry:

kubectl -n openshift-image-registry logs deployments/cluster-image-registry-operator

kubectl -n openshift-image-registry logs deployments/image-registry --all-containers
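
The overall health of the registry can also be checked through the cluster operator and its configuration resource; a sketch assuming a standard OpenShift 4 setup:

# Available/Degraded status of the image registry cluster operator
kubectl get clusteroperator image-registry

# Operator configuration and status conditions
kubectl get configs.imageregistry.operator.openshift.io cluster -o yaml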

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.

You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.

The example below sets the objective to 99.25%, adjusts the time until a pod is considered timed out, and disables the page alert. This means this SLO won’t page on-call anymore.

slos:
  workload-schedulability:
    canary:
      objective: 99.25
      alerting:
        page_alert:
          enabled: false
      _sli:
        overallPodTimeout: 5m

Disabling the SLO or changing the objective will also impact the SLO dashboard and SLA reporting. Only disable SLOs if they’re not relevant, not if the alerts are noisy.