Alert Group: syn-UnusedCapacity

We provide an alert rule which fires if a cluster has multiple unused nodes. When this alert fires, the cluster can be scaled down to optimize costs.

Alert Rule: SYN_ClusterHasUnusedNodes

Overview

This alert indicates that the cluster utilization is low enough that nodes can be removed safely. The alert fires if pod count, CPU requests, CPU usage, memory requests, and memory usage are are all low enough that a node can be removed without impacting cluster stability in case of a node failure. By default, the alert will fire when the equivalent of at least four nodes has been unused for each metric for eight hours. Both the threshold of four unused nodes, and the duration of eight hours can be tuned through the component parameters.

Investigate

  • Verify the alert

    • Verify remaining capacity is high.

      kubectl top nodes | grep -E "NAME|worker"
      kubectl describe node -lnode-role.kubernetes.io/app | grep "Allocatable:|Allocated resources:" -A 8
      kubectl describe node -lnode-role.kubernetes.io/app | grep "Non-terminated Pods:"

      If the allocated resources reported by Kubernetes don’t indicate low utilization on the nodes, there might be a bug in the alert rule. If that’s the case, please disable this alert and open an issue for this component.

    • Check if there is a sudden drop in resource requests that indicate that this might be temporary.

  • Either remove worker nodes or resize existing worker nodes according to the install instructions for your cloud

Tune

If this alert isn’t actionable, noisy, or was raised too early you might want to tune it.

Through the component parameters you have the option modify the number of reserved nodes. Additionally, you can change for how long the alert needs to be firing until you are alerted, or disable it completely. In the example below will adapt the rule to only alert after two days and change the number of reserved nodes to 5.

capacityAlerts:
  groups:
    UnusedCapacity:
      rules:
        ClusterHasUnusedNodes:
          enabled: true
          for: 2d
          expr:
            reserved: 5