Alert Group: syn-PodCapacity

An OpenShift cluster can handle a limited number of pods per node. If you try to run more pods than this limit allows, OpenShift won't be able to schedule the excess pods.
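
The per-node limit is reflected in each node's pod capacity, which you can list directly, for example:

kubectl get nodes -o custom-columns='NAME:.metadata.name,POD_CAPACITY:.status.capacity.pods'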

We provide two alert rules to detect when a cluster is approaching its limit of schedulable pods.

Alert Rule: SYN_TooManyPods

Overview

This alert indicates that the cluster only has capacity for a few additional pods. By default, the threshold is the pod limit of a single node: the alert fires when the cluster's remaining pod capacity drops below the number of pods one node can run.

If you receive this alert, it's likely that the cluster won't be able to keep all workloads running in case of a worker node failure. After verifying that the number of running pods in the cluster is close to the cluster's pod capacity, you should add worker nodes as soon as possible.
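
The exact alert expression is defined by the component. As a rough illustration only, assuming the standard kube-state-metrics metrics kube_node_status_capacity and kube_pod_info, the check amounts to something like:

# The remaining schedulable pod capacity across the cluster ...
sum(kube_node_status_capacity{resource="pods"}) - sum(kube_pod_info)
<
# ... drops below the pod capacity of a single node (the default threshold)
max(kube_node_status_capacity{resource="pods"})

The threshold factor mentioned under Tune below scales this threshold.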

Investigate

  • Verify the alert

    • Check the per-worker-node pod limit of your cluster

      kubectl get KubeletConfig -o yaml
    • Check the number of pods on each worker node

      kubectl describe node -lnode-role.kubernetes.io/app | grep -E "(^Name:|^Non-terminated)"
    • Verify that the reported capacity is correct. If there's a discrepancy between the alert and the actual number of running pods reported by Kubernetes (see the example command after this list), there might be a bug in the alert rule. If that's the case, please disable this alert and open an issue for this component.

  • Look at the running pods and check for a large number of suspicious or unexpected pods

  • Add one or more worker nodes according to the install instructions for your cloud
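
To get the actual number of running pods per node (for the verification step above), you can for example run:

kubectl get pods -A --field-selector=status.phase==Running \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c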

Tune

If this alert isn't actionable, is noisy, or was raised too late, you might want to tune it.

Through the component parameters you can modify the alert threshold factor, change how long the alert condition needs to be firing before you're alerted, or disable the rule outright. The example below adapts the rule to only alert after two hours, but decreases the threshold factor to 0.5.

capacityAlerts:
  enabled: false
  groups:
    PodCapacity:
      rules:
        TooManyPods:
          enabled: true
          for: 2h
          expr:
            factor: 0.5

Alert Rule: SYN_ExpectTooManyPods

Overview

This alert indicates that we expect the cluster to soon only have capacity for a few additional pods. By default, we raise the alert if we expect that within three days the cluster won't have enough capacity to handle a node failure. After verifying that the cluster's pod count is growing, you should plan to add worker nodes within the next few days.
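
The exact expression again ships with the component; conceptually it uses PromQL's predict_linear() to extrapolate the pod count. A rough, illustrative sketch using the same metrics as above (the 3h lookback range is an arbitrary choice for the example):

# The predicted number of pods three days from now ...
predict_linear(sum(kube_pod_info)[3h:5m], 3 * 24 * 3600)
>
# ... exceeds the capacity that remains after losing one node
sum(kube_node_status_capacity{resource="pods"}) - max(kube_node_status_capacity{resource="pods"})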

Investigate

  • Look at the source of this alert in Prometheus

    • Does the prediction look realistic?

    • Compare it to the graph without the predict_linear (see the example queries after this list)

    • If there's any doubt about the prediction, monitor this graph over the next hours or days

  • Check the number of actually running pods on each worker node

    kubectl describe node -lnode-role.kubernetes.io/app | grep -E "(^Name:|^Non-terminated)"

    If the number is wildly different from the prediction, the alert is probably not actionable.

  • Look at the running pods and check for a large number of suspicious or unexpected pods

  • Add one or more worker nodes according to the install instructions for your cloud
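
Assuming the rule extrapolates the pod count with predict_linear() as sketched in the overview above, you can compare the prediction against the raw series with queries along these lines:

# Prediction three days out
predict_linear(sum(kube_pod_info)[3h:5m], 3 * 24 * 3600)

# The same series without the prediction, for comparison
sum(kube_pod_info)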

Tune

If this alert isn't actionable, is noisy, or was raised too late, you might want to tune it.

Through the component parameters you can tune the alert rule. You can modify the threshold, change how long the alert condition needs to be firing before you're alerted, adjust how far into the future to predict, or disable the rule outright.

The example below adapts the rule so that it alerts if we expect to not be able to schedule any more pods within five days, but only after the condition has been firing for 12 hours.

capacityAlerts:
  enabled: false
  groups:
    PodCapacity:
      rules:
        ExpectTooManyPods:
          enabled: true
          for: 12h
          expr:
            threshold: '0' (1)
            predict: '5*24*60*60' (2)
1 The threshold can be an arbitrary PromQL expression
2 How far into the future to predict, in seconds
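
For example, assuming the threshold is compared against the predicted free pod capacity (as the default of '0' above suggests) and that the kube-state-metrics metric kube_node_status_capacity is available, you could require half a node's worth of headroom instead:

capacityAlerts:
  groups:
    PodCapacity:
      rules:
        ExpectTooManyPods:
          expr:
            threshold: '0.5 * max(kube_node_status_capacity{resource="pods"})'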