Kubernetes API SLOs

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

API Requests

Overview

This SLO measures the percentage of valid, well-structured Kubernetes API requests that fail. The error rate is a general indicator of the health of the API server, but it might not show you the root cause of an issue.

There are two types of alerts that fire if we expect to miss the configured objective.

  • A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have a label severity: warning.

  • A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective. A page alert should have a label severity: critical and should page on-call.

Steps for debugging

A high Kubernetes API server error rate can have multiple root causes. First check whether you can generally connect to the cluster’s API through kubectl. If you can’t, you most likely also received an SLO_KubeApiServerFailure alert, and it makes sense to look at its runbook even if that alert didn’t trigger.
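A quick connectivity check might look like the following; the kubeconfig path is just an example, use whatever you normally authenticate with.

# Basic connectivity check; adjust KUBECONFIG to however you normally access the cluster
export KUBECONFIG=~/.kube/config
kubectl get nodes
kubectl get --raw /healthz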

If you can still access the API server, look at the pods running in namespace openshift-kube-apiserver. Check if any of the API server pods seem to be crashing and check their logs. Also check etcd running in namespace openshift-etcd and see if any pods are crashing or logging errors.
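The checks above roughly translate to the following commands; the pod names are placeholders.

# API server pods: look for restarts, CrashLoopBackOff, and errors in the logs
kubectl -n openshift-kube-apiserver get pods
kubectl -n openshift-kube-apiserver logs --tail=100 <kube-apiserver-pod>

# etcd pods: same checks
kubectl -n openshift-etcd get pods
kubectl -n openshift-etcd logs --tail=100 <etcd-pod>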

If you have issues with viewing logs through kubectl, but the cluster generally still works, you might be able to look at the logs in elasticsearch.

The API server logs can be fairly noisy. To help narrow down the issue, you can connect to the Prometheus instance on the cluster and run the following query:

sum (rate(apiserver_request_total{code=~"(5..|429)"}[10m])) by (verb, resource, code)
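If you don’t already have a convenient way to reach the in-cluster Prometheus, one option, assuming the default openshift-monitoring stack, is to port-forward directly to a Prometheus pod and run the query through its HTTP API.

# Assumes the default monitoring stack in openshift-monitoring; adjust if your Prometheus runs elsewhere
kubectl -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090

# In a second terminal, run the query through the Prometheus HTTP API
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(apiserver_request_total{code=~"(5..|429)"}[10m])) by (verb, resource, code)' \
  | jq .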

This should return one or more time series. With these you should be able to narrow down the issue:

  • If all, or most, time series have the code 429, this most likely means the API server is overloaded. In that case, double-check whether one or more master nodes have a high load (see the sketch after this list). If so, either investigate what generates the high load or increase the master node size.

  • If you have a large number of 504 errors, an upstream service is misbehaving. Check etcd or the OpenShift API server in namespace openshift-apiserver.

  • If you see codes other than the generic 500, look up what that specific error code means.

  • Check which verbs are affected. For example, if all writes fail, this might indicate a degraded etcd cluster.

  • Check which resources are affected. For example, if only OpenShift resources are affected, there is most likely an issue with the OpenShift API server and not the Kubernetes API server.
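A minimal sketch for the node-load and upstream checks mentioned in the list; kubectl top requires node metrics to be available, and the master node label can differ between OpenShift versions.

# Check master node load (label may be node-role.kubernetes.io/control-plane on newer versions)
kubectl top nodes -l node-role.kubernetes.io/master

# For 504s: check the upstream services
kubectl -n openshift-apiserver get pods
kubectl -n openshift-etcd get pods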

This should give you a starting point to investigate the root cause. You should also check if there are other, related firing alerts.

We don’t have a lot of experience with this alert yet. If you had to debug this alert, please consider adding any insight, tips, or code snippets you gained to this runbook.

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.

You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.

The example below sets the objective to 99.25% and disables the page alert. This means this SLO won’t alert on-call anymore.

slos:
  kubernetes_api:
    requests:
      objective: 99.25
      alerting:
        page_alert:
          enabled: false
Disabling the SLO or changing the objective will also impact the SLO dashboard and SLA reporting. Only disable SLOs if they’re not relevant, not if the alerts are noisy.

API Uptime

Overview

This SLI measures the uptime of the Kubernetes API by probing the /healthz endpoint. If this SLI results in an alert, it means the Kubernetes API server is unable to handle requests, or clients are simply unable to reach it.

There are two types of alerts that fire if we expect to miss the configured objective.

  • A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have a label severity: warning.

  • A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective. A page alert should have a label severity: critical and should page on-call.

Steps for debugging

First try to access the Kubernetes API through kubectl.

If the API server is degraded, you might not be able to authenticate to OpenShift through the web console, but the API might still mostly work. Get the admin kubeconfig from the password manager and try to connect directly.
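For example, assuming you saved the admin kubeconfig locally (the path is a placeholder):

# Path is a placeholder for wherever you stored the admin kubeconfig
kubectl --kubeconfig /path/to/admin-kubeconfig get nodes
kubectl --kubeconfig /path/to/admin-kubeconfig get --raw /readyz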

If kubectl access still seems to work, try to check what error the probe is returning by forwarding the blackbox exporter UI:

kubectl -n appuio-openshift4-slos port-forward svc/prometheus-blackbox-exporter 9115
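With the port-forward active, the blackbox exporter UI at http://localhost:9115 shows recent probes and their logs. You can also trigger a probe manually; the module and target below are assumptions, check the exporter’s configuration for the real values.

# Module and target are assumptions; look them up in the blackbox exporter config
curl 'http://localhost:9115/probe?module=kubernetes_api&target=https://api.example.com:6443/healthz&debug=true'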

You’ll probably also want to follow the SLO_KubeApiServerHighErrorRate runbook.

If you can’t reach the Kubernetes API at all, first check through the cloud provider portal whether the master nodes are running. If they’re running, but the Kubernetes API is still not reachable, try to connect to one of them using SSH. You’ll need the SSH key stored in Vault, and you’ll have to use one of the LB VMs as a jumphost.

# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>

# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)

# Login to vault
export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc

# Fetch SSH key
vault kv get -format=json clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cloudscale/ssh \
  | jq -r '.data.data.private_key' | base64 --decode > ssh_key
chmod 400 ssh_key

# Connect to master node
MASTER_NODE=etcd-0
LB_HOST=$(grep -E "^Host.*${CLUSTER_ID}" ~/.ssh/sshop_config | head -1 | awk '{print $2}')
ssh -J "${LB_HOST}" -i ssh_key "core@${MASTER_NODE}"

Check the logs in /var/log/etcd, /var/log/kube-apiserver, and /var/log/containers. Also see if any systemd service has crashed.
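A minimal sketch of those checks on the master node; the exact log file names can differ between OpenShift versions.

# Inspect the log directories mentioned above
sudo ls -lt /var/log/kube-apiserver/ /var/log/etcd/
sudo ls /var/log/containers/ | grep -Ei 'apiserver|etcd'

# Any crashed systemd units or recent errors?
systemctl --failed
sudo journalctl -p err --since "1 hour ago"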

We don’t have a lot of experience with this alert yet. If you had to debug this alert, please consider adding any insight, tips, or code snippets you gained to this runbook.

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.

You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.

The example below sets the objective to 99.9%, adjusts the probe timeout and interval, and disables the page alert. This means this SLO won’t alert on-call anymore.

slos:
  kubernetes_api:
    canary:
      objective: 99.9
      _sli:
        timeout: 10s
        interval: 30s
      alerting:
        page_alert:
          enabled: false
Disabling the SLO or changing the objective will also impact the SLO dashboard and SLA reporting. Only disable SLOs if they’re not relevant, not if the alerts are noisy.

If you adjust the objective, please be aware of how this impacts alerting.

With the default SLO of 99.9% and probe interval of 10s, if 6 probes fail in an hour we will emit a page alert.

If you adjust the objective, this number will change, and increasing the SLO or decreasing the probe interval might result in unactionable alerts. You can calculate the number of failed probes in an hour \(f\), given the SLO \(slo\) as a percentage between 0 and 100 and the probe interval in seconds \(int\).

\[f = \dfrac{51840}{int} \left( 1-\dfrac{slo}{100} \right)\]
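As a sanity check, plugging in the defaults from above (slo = 99.9 and int = 10) gives:

\[f = \dfrac{51840}{10} \left( 1-\dfrac{99.9}{100} \right) \approx 5.2\]

which is consistent with the page alert firing once 6 probes have failed within an hour.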