CiliumAgentUnexpectedCount

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

This alert fires when the number of Cilium agent pods in a cluster differs from the number of nodes for more than 5 minutes.

This usually indicates that the Cilium agent DaemonSet is misconfigured or missing.
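To confirm the mismatch by hand, the two counts can be compared directly. The following is a minimal sketch, not the alert's actual expression (which is not shown in this runbook). It assumes the `cilium` namespace and the `app.kubernetes.io/name=cilium-agent` label used in the steps below, and the function name `cilium_agent_count_check` is hypothetical.

```shell
# Sketch: compare the node count with the Cilium agent pod count.
# Assumes the "cilium" namespace and the agent label used in this runbook.
# The function name is hypothetical.
cilium_agent_count_check() {
  nodes=$(kubectl get nodes -o name | wc -l)
  agents=$(kubectl -n cilium get pods -l app.kubernetes.io/name=cilium-agent -o name | wc -l)
  # Arithmetic expansion strips the padding some wc implementations add.
  nodes=$((nodes))
  agents=$((agents))
  if [ "${nodes}" -eq "${agents}" ]; then
    echo "nodes=${nodes} agents=${agents} OK"
  else
    echo "nodes=${nodes} agents=${agents} MISMATCH"
  fi
}
```

Calling `cilium_agent_count_check` against the affected cluster prints both counts on one line. The check counts every agent pod regardless of phase, so the counts can match even when some agent pods are not Running.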

Steps for debugging

  1. Check the Cilium agent DaemonSet

    kubectl -n cilium get ds -l app.kubernetes.io/name=cilium-agent
    1. If the DaemonSet is missing, check the OLM operator logs (or Helm deployment status)

      # For Cilium 1.16, use deployment/cilium-ee-olm instead.
      kubectl -n cilium logs deployment/clife-controller-manager
    2. If the DaemonSet is present, find the nodes that have no Cilium agent pod and check them for scheduling issues or untolerated taints

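      # Report every node that has no Cilium agent pod scheduled on it.
      for node in $(kubectl get nodes -o name); do
        # Strip the "node/" prefix so the name matches spec.nodeName.
        pod=$(kubectl -n cilium get pods -o name -l app.kubernetes.io/name=cilium-agent --field-selector spec.nodeName="${node#node/}")
        if [ -z "${pod}" ]; then
          echo "Cilium agent missing on node ${node}"
        fi
      done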
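Once a node without an agent pod is identified, its taints can be compared with the tolerations on the agent DaemonSet. The following is a minimal sketch, assuming the same namespace and label as above. The function name `show_taints_and_tolerations` is hypothetical, and the output is raw JSON that still has to be compared by eye.

```shell
# Sketch: print a node's taints next to the Cilium agent DaemonSet tolerations.
# Usage: show_taints_and_tolerations <node-name>   (function name is hypothetical)
show_taints_and_tolerations() {
  node="${1#node/}"   # accept both "node/foo" and "foo"
  echo "Taints on ${node}:"
  kubectl get node "${node}" -o jsonpath='{.spec.taints}'
  echo
  echo "Tolerations on the Cilium agent DaemonSet:"
  kubectl -n cilium get ds -l app.kubernetes.io/name=cilium-agent \
    -o jsonpath='{.items[*].spec.template.spec.tolerations}'
  echo
}
```

A taint with effect `NoSchedule` or `NoExecute` that no toleration matches would explain the missing pod. If the taints look fine, `kubectl describe node <node-name>` shows the node's conditions and recent events, which may point to other scheduling problems.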