Storage SLOs

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Storage Operations

Overview

This SLO measures the percentage of failed storage operations, reported by kubelets and the controller-manager. The error rate is a general indicator of the health of the storage provider, but might not show you the root cause of the issue.

There are two types of alerts that fire if we expect to miss the configured objective.

  • A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have the label severity: warning.

  • A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to avoid missing the objective. A page alert should have the label severity: critical and should page on-call.

Steps for debugging

A high storage operations error rate can have many root causes. To narrow down the issue, we first need to find out which operations and which providers are affected.

Connect to the Prometheus instance on the cluster and run the following query:

sum(rate(storage_operation_duration_seconds_count{volume_plugin=~"kubernetes.io/csi.+", status="fail-unknown"}[5m])) by (job, operation_name, volume_plugin, node)

This should return one or more time series. With these you should be able to narrow down the issue:

  • If all time series have the same volume_plugin label, the issue is most likely caused by that provider. Look at the logs of that CSI provider to investigate further.

  • If they all have the same node label, it might be an issue with that specific node.

  • If all time series have the same operation_name label, there is an issue with that specific operation. For volume_attach and volume_detach, check whether the controller-manager logs anything useful. For all other operations, the kubelet might log more information.

  • If all time series have the same job label, it might be an issue with the kubelet or the controller-manager; look at the respective logs.

This should give you a starting point to investigate the root cause. You should also check whether there are other related alerts firing, and create and delete a test PVC to properly assess the impact of this alert, as sketched below.
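
A minimal sketch of such a test follows; the names, namespace, and image are placeholders. The PVC is mounted by a pod because the attach and mount operations only happen once a pod uses the volume. Set storageClassName to the class of the affected provider if you want to target it specifically; otherwise the cluster default is used. Delete both objects afterwards to also exercise the detach and delete operations.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-slo-debug        # placeholder name
  namespace: default             # placeholder namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  # storageClassName: <affected-storage-class>   # uncomment to target a specific provider
---
apiVersion: v1
kind: Pod
metadata:
  name: storage-slo-debug        # placeholder name
  namespace: default             # placeholder namespace
spec:
  containers:
    - name: sleep
      image: busybox             # any small image that can sleep works
      command: ["sleep", "3600"]
      volumeMounts:
        - name: test-volume
          mountPath: /mnt/test
  volumes:
    - name: test-volume
      persistentVolumeClaim:
        claimName: storage-slo-debug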

We don’t have a lot of experience with this alert yet. If you had to debug this alert, please consider adding any insights, tips, or code snippets you gained to this runbook.

Tune

If this alert isn’t actionable, is too noisy, or fired too late, you may want to tune the SLO.

You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, restrict the SLO to certain storage operations or volume plugins, or completely disable the SLO.

The example below sets the objective to 99.25%, disables the page alert, and only considers the Syn-managed CephFS storage provider. With this configuration the SLO won’t page on-call anymore and won’t alert on issues with CSI plugins other than the Syn-managed CephFS CSI plugin.

slos:
  storage:
    csi-operations:
      objective: 99.25
      _sli:
        volume_plugin: "kubernetes.io/csi:syn-rook-ceph-operator.cephfs.csi.ceph.com"
      alerting:
        page_alert:
          enabled: false
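
If the SLO isn’t relevant for a cluster at all, it can also be disabled completely. The sketch below assumes the component exposes a per-SLO enabled flag; verify the exact key name in the component’s parameter reference.

slos:
  storage:
    csi-operations:
      enabled: false  # assumed key name, check the component reference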

Disabling the SLO or changing the objective will also impact the SLO dashboard and SLA reporting. Only disable SLOs if they’re not relevant, not if the alerts are noisy.