Alert Group: syn-ResourceRequests

We provide four rules that detect when a cluster is approaching the limit of requested resources. If a cluster has too many requested resources, it might not be able to schedule all workloads.

Alert Rule: SYN_TooMuchMemoryRequested

Overview

This alert indicates that the total sum of memory requests of user workloads is close to the capacity of the cluster. By default, it will fire if the amount of unrequested memory is less than the capacity of the largest worker node.

If you receive this alert, it’s likely that the cluster won’t be able to reschedule all workloads in case of a worker node failure. After verifying the alert, you should add more memory capacity as soon as possible.
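
Conceptually, the rule compares the cluster’s unrequested memory against the allocatable memory of the largest node. The following PromQL sketch illustrates the idea, assuming the standard kube-state-metrics metric names; the expression actually shipped by this component may differ:

# Memory that's still unrequested across the cluster ...
sum(kube_node_status_allocatable{resource="memory"})
  - sum(kube_pod_container_resource_requests{resource="memory"})
# ... compared against the allocatable memory of the largest node
< max(kube_node_status_allocatable{resource="memory"})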

Investigate

  • Verify the alert

    • Verify that the reported capacity is correct.

      kubectl describe node -lnode-role.kubernetes.io/app | grep "Allocated resources" -A 8

      If the allocated resources reported by Kubernetes don’t indicate high memory reservation on the nodes, there might be a bug in the alert rule. If that’s the case, please disable this alert and open an issue for this component.

    • Check whether there was a sudden increase in resource requests, which would indicate that the situation might be temporary (see the query sketch after this list).

  • Either add more worker nodes or resize existing worker nodes according to the install instructions for your cloud
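
To check for a sudden increase, graph the total memory requests over time in the Prometheus UI. A possible query, again assuming the standard kube-state-metrics metric name:

# Total memory requested by all pods; graph this over the last day or so
sum(kube_pod_container_resource_requests{resource="memory"})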

Tune

If this alert isn’t actionable, is too noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to modify the alert threshold factor, change for how long the alert needs to be firing until you are alerted, or disable it outright. The example below adapts the rule to alert only after two hours and changes the threshold factor to 0.7.

capacityAlerts:
  groups:
    ResourceRequests:
      rules:
        TooMuchMemoryRequested:
          enabled: true
          for: 2h
          expr:
            factor: 0.7

Alert Rule: SYN_ExpectTooMuchMemoryRequested

Overview

This alert indicates that we expect the total sum of memory requests of user workloads to soon be close to the capacity of the cluster. By default, it will fire if the predicted amount of unrequested memory in three days is less than the capacity of the largest worker node.

If you receive this alert, the cluster may soon not be able to reschedule all workloads in case of a worker node failure. After verifying that the cluster’s memory reservations are growing, you should add more memory capacity within the next few days.
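
Conceptually the prediction is a predict_linear over the summed memory requests. A hedged sketch, with an illustrative lookback window and resolution (the component’s actual expression may differ):

# Predicted total memory requests three days from now, extrapolated
# linearly from the last day of data (subquery at 5m resolution)
predict_linear(
  sum(kube_pod_container_resource_requests{resource="memory"})[1d:5m],
  3*24*60*60
)

Graphing this next to the inner sum(...) on its own shows how far the prediction deviates from the current value.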

Investigate

  • Look at the source of this alert in Prometheus

    • Does the prediction look realistic?

    • Compare it to the graph without the predict_linear

    • If there is any doubt about the prediction, monitor this graph over the coming hours or days

  • Check the actual resource requests on each worker node

    kubectl describe node -lnode-role.kubernetes.io/app | grep "Allocated resources" -A 8

    If the number differs wildly from the prediction, the alert is probably not actionable.

  • Check whether there was a sudden increase in resource requests, which would indicate that the situation might be temporary.

  • Add one or more worker nodes according to the install instructions for your cloud

Tune

If this alert isn’t actionable, is too noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to tune the alert rule. You can modify the threshold, change for how long the alert needs to be firing until you are alerted, adjust how far into the future to predict, or disable it outright.

The example below adapts the rule so that it alerts if we expect all available memory to be requested within five days, but only after it has been firing for 12 hours.

capacityAlerts:
  groups:
    ResourceRequests:
      rules:
        ExpectTooMuchMemoryRequested:
          enabled: true
          for: 12h
          expr:
            threshold: '0' (1)
            predict: '5*24*60*60' (2)
1 The threshold can be an arbitrary PromQL expression
2 How far into the future to predict, in seconds

Alert Rule: SYN_TooMuchCPURequested

Overview

This alert indicates that the total sum of CPU requests of user workloads is close to the capacity of the cluster. By default, it will fire if the number of unrequested CPU cores is less than the core count of the largest worker node.

If you receive this alert, it’s likely that the cluster won’t be able to reschedule all workloads in case of a worker node failure. After verifying that the cluster’s CPU reservation is high, you should add more CPU capacity as soon as possible.
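
The logic mirrors the memory rule, using the CPU resource label instead. As before, this PromQL is only a sketch assuming the standard kube-state-metrics metric names:

# CPU cores still unrequested across the cluster ...
sum(kube_node_status_allocatable{resource="cpu"})
  - sum(kube_pod_container_resource_requests{resource="cpu"})
# ... compared against the core count of the largest node
< max(kube_node_status_allocatable{resource="cpu"})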

Investigate

  • Verify the alert

    • Verify that the reported capacity is correct.

      kubectl describe node -lnode-role.kubernetes.io/app | grep "Allocated resources" -A 8

      If the allocated resources reported by Kubernetes don’t indicate high CPU reservation on the nodes, there might be a bug in the alert rule. If that’s the case, please disable this alert and open an issue for this component.

    • Check whether there was a sudden increase in resource requests, which would indicate that the situation might be temporary.

  • Either add more worker nodes or resize existing worker nodes according to the install instructions for your cloud

Tune

If this alert isn’t actionable, is too noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to modify the alert threshold, change for how long the alert needs to be firing until you are alerted, or disable it outright. The example below adapts the rule to alert only after two hours and changes the threshold to 4 cores.

capacityAlerts:
  groups:
    ResourceRequests:
      rules:
        TooMuchCPURequested:
          enabled: true
          for: 2h
          expr:
            threshold: '4' (1)
1 The threshold can be an arbitrary PromQL expression

Alert Rule: SYN_ExpectTooMuchCPURequested

Overview

This alert indicates that we expect the total sum of CPU requests of user workloads to soon be close to the capacity of the cluster. By default, it will fire if the predicted number of unrequested CPU cores in three days is less than the number of cores of the largest worker node.

If you receive this alert, the cluster may soon not be able to reschedule all workloads in case of a worker node failure. After verifying that the cluster’s CPU reservation is growing, you should add more CPU capacity within the next few days.
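
As with the memory variant, the prediction is conceptually a predict_linear over the summed CPU requests; a hedged sketch with illustrative parameters:

# Predicted total CPU requests three days from now
predict_linear(
  sum(kube_pod_container_resource_requests{resource="cpu"})[1d:5m],
  3*24*60*60
)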

Investigate

  • Look at the source of this alert in Prometheus

    • Does the prediction look realistic?

    • Compare it to the graph without the predict_linear

    • If there is any doubt about the prediction, monitor this graph over the coming hours or days

  • Check the actual resource requests on each worker node

    kubectl describe node -lnode-role.kubernetes.io/app | grep "Allocated resources" -A 8

    If the number differs wildly from the prediction, the alert is probably not actionable.

  • Check whether there was a sudden increase in resource requests, which would indicate that the situation might be temporary.

  • Add one or more worker nodes according to the install instructions for your cloud

Tune

If this alert isn’t actionable, is too noisy, or was raised too late, you might want to tune it.

Through the component parameters you have the option to tune the alert rule. You can modify the threshold, change for how long the alert needs to be firing until you are alerted, adjust how far into the future to predict, or disable it outright.

The example below adapts the rule so that it alerts if we expect all CPU cores to be requested within five days, but only after it has been firing for 12 hours.

capacityAlerts:
  groups:
    ResourceRequests:
      rules:
        ExpectTooMuchCPURequested:
          enabled: true
          for: 12h
          expr:
            threshold: '0' (1)
            predict: '5*24*60*60' (2)
1 The threshold can be an arbitrary PromQL expression
2 How far into the future to predict, in seconds