Alert Group: syn-CpuCapacity

We provide two rules that detect high CPU utilization in a cluster. If a cluster has consistently high CPU utilization, workload latency might increase and operations might time out.

Alert Rule: SYN_ClusterCpuUsageHigh

Overview

This alert indicates that the total CPU utilization across all worker nodes is high. By default, the alert fires if the number of idle cores stays below the core count of the largest worker node for more than two hours.
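
To make the condition concrete, it can be sketched in PromQL roughly as follows. This is a simplified illustration based on the standard node exporter metrics; it ignores the configurable threshold factor and the restriction to worker nodes, and the expression actually shipped by the component may differ.

# Approximate number of idle cores across all nodes
sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))
  <
# Core count of the largest node (one "idle" series per core and node)
max(count by (instance) (node_cpu_seconds_total{mode="idle"}))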

If you receive this alert, a worker node failure in the cluster is likely to cause high latency for customer workloads. After verifying that the cluster’s CPU utilization is high, you should consider adding more CPU capacity.

Investigate

  • Verify the alert

    • Verify that the reported load is correct.

      kubectl top nodes

      If the load reported by top nodes doesn’t appear to be particularly high, there might be a bug in the alert rule. If that’s the case, please disable this alert and open an issue for this component.

    • Check whether there is a sudden increase in CPU usage, which might indicate that the load is temporary or caused by a misbehaving workload (see the query sketch after this list).

  • Either add more worker nodes or resize existing worker nodes according to the install instructions for your cloud
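
To check for the sudden increase mentioned above, or for a single workload consuming an unusual amount of CPU, queries along the following lines can help. This is a rough sketch; the exact metrics available depend on your monitoring stack.

# Non-idle CPU usage per node -- look for sudden jumps in the graph
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# CPU usage per namespace -- one namespace dominating can point to a misbehaving workload
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))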

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to modify the alert threshold factor, change how long the condition must hold before you are alerted, or disable the alert outright. The example below adapts the rule to alert only after four hours and changes the threshold factor to 0.7.

capacityAlerts:
  groups:
    CpuCapacity:
      rules:
        ClusterCpuUsageHigh:
          enabled: true
          for: 4h
          expr:
            factor: 0.7

Alert Rule: SYN_ExpectClusterCpuUsageHigh

Overview

This alert indicates that we expect the total CPU utilization across all worker nodes to become high within the next few days. By default, this alert fires if we expect the number of idle cores to drop below the core count of the largest worker node within three days.

If you receive this alert, the cluster might soon experience degraded workload performance in case of a worker node failure. After verifying the CPU utilization growth, you should consider adding more CPU capacity within the next few days.
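
As an illustration, such a prediction can be expressed in PromQL with predict_linear, for example by extrapolating the idle-core count from the last day of data three days into the future. This is a hedged sketch using standard node exporter metrics; the rule shipped by the component may be built differently.

# Idle cores, linearly extrapolated three days ahead based on the last day of data
predict_linear(sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))[1d:5m], 3 * 24 * 60 * 60)
  <
# Core count of the largest node
max(count by (instance) (node_cpu_seconds_total{mode="idle"}))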

Investigate

  • Look at the source of this alert in Prometheus

    • Does the prediction look realistic?

    • Compare it to the graph without the predict_linear function (see the query sketch after this list)

    • If there is any doubt about the prediction, monitor this graph over the next hours or days

  • Check the actual CPU usage on each worker node

    kubectl top nodes

    If the number differs widely from the prediction, the alert is probably not actionable.

  • Check whether there is a sudden increase in CPU usage, which might indicate that the load is temporary or caused by a misbehaving workload.

  • Add one or more worker nodes or resize existing worker nodes according to the install instructions for your cloud
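
For the comparison mentioned above, it helps to graph the raw idle-core series next to its extrapolation; if the straight line clearly doesn’t follow the recent trend, the prediction is questionable. A minimal sketch, again assuming node exporter metrics:

# Current idle cores, without any prediction
sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# The same series extrapolated three days into the future
predict_linear(sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))[1d:5m], 3 * 24 * 60 * 60)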

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to tune the alert rule. You can modify the threshold, change how long the condition must hold before you are alerted, adjust how far into the future to predict, or disable the rule outright.

The example below adapts the rule so that it alerts if we expect all CPU cores to be utilized within five days, but only after the condition has held for 12 hours.

capacityAlerts:
  groups:
    CpuCapacity:
      rules:
        ExpectClusterCpuUsageHigh:
          enabled: true
          for: 12h
          expr:
            threshold: '0' (1)
            predict: '5*24*60*60' (2)
1 The threshold can be an arbitrary PromQL expression.
2 How far into the future to predict, in seconds.