Workload Schedulability SLOs
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
Workload Canary
Overview
This SLO measures the percentage of canary pods that fail or time out while going through a complete pod lifecycle. In the current implementation the image for the pod is pulled from the built-in OpenShift registry.
The error rate is a general indicator of cluster and workload health.
There are two types of alerts that fire if we expect to miss the configured objective.
- A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have the label severity: warning.
- A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective. A page alert should have the label severity: critical and should page on-call.
Steps for debugging
Unschedulable workloads often indicate resource exhaustion on a cluster but can have many root causes. Since the canary pod uses an image from the built-in OpenShift registry, the alert can be triggered by a misbehaving image registry.
First check the Events: section of the kubectl describe output:
kubectl describe pods -A -l "scheduler-canary-controller.appuio.io/instance"
The output should contain events explaining why the pod isn’t schedulable:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 46s (x1 over 56s) default-scheduler 0/12 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {storagenode: True}, that the pod didn't tolerate, 6 Insufficient cpu, 6 Insufficient memory.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12s (x1 over 15s) default-scheduler 0/12 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {storagenode: True}, that the pod didn't tolerate, 6 Too many pods.
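If the scheduler reports Too many pods, as in the example above, a node may have hit its per-node pod limit. Counting the pods scheduled on a suspect node can help confirm this; this is a rough sketch, with the node name as a placeholder:
kubectl --as=cluster-admin get pods -A --field-selector spec.nodeName=<node-name> --no-headers | wc -l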
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 13s kubelet Failed to pull image "invalid:image": rpc error: code = Unknown desc = reading manifest image in docker.io/library/invalid: errors:
denied: requested access to the resource is denied
unauthorized: authentication required
Warning Failed 13s kubelet Error: ErrImagePull
Normal BackOff 12s kubelet Back-off pulling image "invalid:image"
Warning Failed 12s kubelet Error: ImagePullBackOff
Normal Pulling 0s (x2 over 15s) kubelet Pulling image "invalid:image"
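To get a quick overview of all canary pods and their current phase (for example Pending or ImagePullBackOff), the same label selector can be used with kubectl get:
kubectl get pods -A -l "scheduler-canary-controller.appuio.io/instance"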
We don’t have a lot of experience with this alert yet. If you had to debug this alert, please consider adding any insight, tips, or code snippets you gained to this runbook.
Resource Exhaustion
Allocatable resources can be checked with the kubectl describe nodes output.
kubectl --as=cluster-admin describe nodes -l node-role.kubernetes.io/app=
[...]
Allocatable:
cpu: 3500m
ephemeral-storage: 123201474766
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15258372Ki
pods: 110
[...]
Non-terminated Pods: (38 in total)
[...]
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2644m (75%) 5420m (154%)
memory 5914Mi (39%) 14076Mi (94%)
ephemeral-storage 100Ki (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
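The describe output shows requests and limits, not actual consumption. If cluster metrics are available (for example via the OpenShift monitoring stack), actual usage per node can be checked as well; a sketch reusing the same node selector:
kubectl --as=cluster-admin top nodes -l node-role.kubernetes.io/app=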
ImagePullBackOff
The canary pod uses an image from the OpenShift registry. If the registry is unavailable, the pod will fail to start.
Check the Status: section of the kubectl describe output:
kubectl -n appuio-openshift4-slos describe imagestream canary
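If the describe output is inconclusive, dumping the full object shows the complete status of the image stream:
kubectl -n appuio-openshift4-slos get imagestream canary -o yaml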
Check if pods are crashing for the image registry:
kubectl -n openshift-image-registry get pods
Check the logs for the operator and the registry:
kubectl -n openshift-image-registry logs deployments/cluster-image-registry-operator
kubectl -n openshift-image-registry logs deployments/image-registry --all-containers
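On OpenShift, the image registry is managed by a cluster operator that reports its overall health; checking it can quickly surface a degraded registry (cluster-admin access may be required):
kubectl --as=cluster-admin get clusteroperator image-registry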
Tune
If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.
You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.
The example below sets the objective to 99.25%, adjusts the time after which a pod is considered timed out, and disables the page alert. This means this SLO won’t page on-call anymore.
slos:
  workload-schedulability:
    canary:
      objective: 99.25
      alerting:
        page_alert:
          enabled: false
      _sli:
        overallPodTimeout: 5m
Disabling the SLO or changing the objective will also impact the SLO dashboard and SLA reporting. Only disable SLOs if they’re not relevant, not if the alerts are noisy.