Alert Group: syn-MemoryCapacity

We provide two rules to detect whether a cluster has high memory usage. If a cluster uses up most of its memory, workloads might start to get terminated.

Alert Rule: SYN_ClusterMemoryUsageHigh

Overview

This alert indicates that the total memory usage across all worker nodes is high. By default, it fires if the amount of unused memory is less than the memory capacity of the largest worker node.
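
As a rough illustration of that condition, the sketch below uses the standard node-exporter metrics; the component's actual rule (including the threshold factor and the restriction to worker nodes) may differ:

# Sketch only: fires when the total unused memory drops below the
# memory capacity of the single largest node.
sum(node_memory_MemAvailable_bytes) < max(node_memory_MemTotal_bytes)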

If you receive this alert, a worker node failure in the cluster will likely cause customer workloads to be OOM killed. After verifying that the cluster’s memory utilization is high, you should consider adding more memory capacity.

Investigate

  • Verify the alert

    • Verify that the reported load is correct.

      kubectl top nodes

      If the memory utilization reported by top nodes doesn’t appear to be particularly high, there might be a bug in the alert rule. If that’s the case, please disable this alert and open an issue for this component.

    • Check if there is a sudden increase in memory usage that indicates that this might be temporary or caused by a misbehaving workload (see the example query after this list).

  • Either add more worker nodes or resize existing worker nodes according to the install instructions for your cloud
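
To check for a sudden increase caused by a misbehaving workload, graphing memory usage per namespace in Prometheus can help. The query below is a sketch based on the standard cAdvisor metric and isn't part of the alert rule itself:

# Top 10 namespaces by current container memory usage (working set).
# A namespace that recently jumped to the top is a likely culprit.
topk(10, sum by (namespace) (container_memory_working_set_bytes{container!="", container!="POD"}))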

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to modify the alert threshold factor, change how long the condition needs to be firing before you’re alerted, or disable the alert outright. The example below adapts the rule to alert only after four hours and changes the threshold factor to 0.7.

capacityAlerts:
  groups:
    MemoryCapacity:
      rules:
        ClusterMemoryUsageHigh:
          enabled: true
          for: 4h
          expr:
            factor: 0.7

Alert Rule: SYN_ExpectClusterMemoryUsageHigh

Overview

This alert indicates that the total memory utilization across all worker nodes is expected to become high over the next few days. By default, this alert fires if we expect the amount of unused memory to be less than the memory capacity of the largest worker node in three days.
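
Conceptually, the rule extrapolates the recent trend of unused memory. The sketch below illustrates the idea with predict_linear over node-exporter metrics and the default three-day horizon; the component's actual expression may differ:

# Sketch only: extrapolates available memory three days ahead based on the
# trend over the last day and compares it to the largest node's capacity.
predict_linear(sum(node_memory_MemAvailable_bytes)[1d:5m], 3*24*60*60)
  < max(node_memory_MemTotal_bytes)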

If you receive this alert, the cluster might soon not be able to keep all workloads running in case of a worker node failure. After verifying the memory utilization growth, you should consider adding more memory capacity in the coming days.

Investigate

  • Look at the source of this alert in Prometheus

    • Does the prediction look realistic?

    • Compare it to the graph without the predict_linear (see the example queries after this list)

    • If there is any doubt about the prediction, monitor this graph over the coming hours or days

  • Check the actual memory usage on each worker node

    kubectl top nodes

    If the number differs wildly from the prediction, the alert is probably not actionable.

  • Check if there is a sudden increase in memory usage that indicates that this might be temporary or caused by a misbehaving workload.

  • Add one or more worker nodes or resize existing worker nodes according to the install instructions for your cloud
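
To compare the prediction against the actual trend, you can plot both of the following queries in Prometheus. These are sketches assuming node-exporter metrics; substitute the expression from the actual alert rule if it differs:

# Actual unused memory across nodes:
sum(node_memory_MemAvailable_bytes)
# The same value, linearly extrapolated three days into the future:
predict_linear(sum(node_memory_MemAvailable_bytes)[1d:5m], 3*24*60*60)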

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to tune the alert rule. You can modify the threshold, change how long the condition needs to be firing before you’re alerted, change how far into the future to predict, or disable the alert outright.

The example below adapts the rule so that it alerts if we expect all memory to be utilized within five days, but only after the condition has been firing for 12 hours.

capacityAlerts:
  groups:
    MemoryCapacity:
      rules:
        ExpectClusterMemoryUsageHigh:
          enabled: true
          for: 12h
          expr:
            threshold: '0' (1)
            predict: '5*24*60*60' (2)
1 The threshold can be an arbitrary PromQL expression.
2 How far into the future to predict, in seconds.