CiliumBpfOperationErrorRateHigh
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
Overview
This alert fires if the error rate of eBPF operations for a given map and operation on a node is >= 50% for 10 minutes or longer. Depending on which map is affected, the impact can vary widely.
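To make the threshold concrete, here is a minimal sketch of the arithmetic behind the condition. The function names `error_rate_pct` and `exceeds_threshold` are hypothetical helpers introduced for illustration. The real alert is evaluated by the monitoring stack, and this sketch does not model the 10 minute duration.

```shell
# Illustration only: the names below are hypothetical and not part of Cilium.

# error_rate_pct FAILED TOTAL
# Prints the share of failed operations as an integer percentage (rounded down).
error_rate_pct() {
  failed=$1
  total=$2
  # No recorded operations means there is no error rate to speak of.
  if [ "$total" -eq 0 ]; then
    echo 0
    return
  fi
  echo $(( failed * 100 / total ))
}

# exceeds_threshold FAILED TOTAL
# Succeeds (exit status 0) if the error rate is at or above 50%.
exceeds_threshold() {
  [ "$(error_rate_pct "$1" "$2")" -ge 50 ]
}
```

For example, `exceeds_threshold 5 10` succeeds because 5 failed operations out of 10 is exactly 50%, while `exceeds_threshold 4 10` does not.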
Known maps
Please update this section if you encounter this alert for a map which isn’t listed yet.
cilium_policy_*
This is the eBPF map which contains endpoint policy configurations. Endpoint policy configurations are created from network policies in the cluster. If this map fills up completely or if there’s a high error rate for operations on this map, this can severely impact traffic on the cluster, since endpoints for which the policy map cannot be configured may not work correctly.
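As a hedged sketch, the policy maps on a node can be inspected from inside the agent pod. The wrapper names `list_bpf_maps` and `dump_policy_maps` are hypothetical. The sketch assumes the agent image ships the `cilium` CLI with the `map list` and `bpf policy get --all` subcommands (newer releases call the binary `cilium-dbg`), and that the pod reference was looked up as in the debugging steps of this runbook.

```shell
# Hypothetical helpers; both take the agent pod reference (e.g. the output of
# `kubectl get pods -oname`) as their only argument.

# list_bpf_maps POD
# Lists the eBPF maps the agent has open on the node.
list_bpf_maps() {
  kubectl -n cilium exec "$1" --as=cluster-admin -- cilium map list
}

# dump_policy_maps POD
# Dumps the contents of all endpoint policy maps on the node.
dump_policy_maps() {
  kubectl -n cilium exec "$1" --as=cluster-admin -- cilium bpf policy get --all
}
```

Comparing the number of entries per endpoint against the configured policy map size can indicate whether a map is close to full.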
Steps for debugging
Check Cilium agent status
NODE=<node name of affected node> (1)
AGENT_POD=$(kubectl -n cilium get pods --field-selector=spec.nodeName=$NODE \
-l app.kubernetes.io/name=cilium-agent -oname)
kubectl -n cilium exec -it $AGENT_POD --as=cluster-admin -- cilium status (2) (3)
kubectl -n cilium exec -it $AGENT_POD --as=cluster-admin -- cilium status --verbose (4)
kubectl -n cilium logs $AGENT_POD --tail=50 (5)
(1) The node indicated in the alert.
(2) --as=cluster-admin is required on VSHN managed clusters.
(3) Show the agent status on the node.
(4) Show verbose agent status on the node. In this output, you may see details about eBPF sync jobs which have errors.
(5) In some cases, you will find details on the failing eBPF operations in the agent logs.
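When the agent logs are noisy, it can help to narrow them down to likely eBPF failures. The following is a minimal sketch. The function names `filter_bpf_errors` and `agent_bpf_errors` are hypothetical, and the filter assumes that log lines carry a `level=...` field and mention "bpf" or "map", which may not hold for every Cilium version or log format.

```shell
# filter_bpf_errors
# Reads log lines on stdin and keeps warning/error lines mentioning bpf or map.
# Assumption: log lines contain a "level=<severity>" field.
filter_bpf_errors() {
  grep -Ei 'level=(warn|error)' | grep -Ei 'bpf|map'
}

# agent_bpf_errors NODE
# Looks up the agent pod on NODE and filters its recent logs.
agent_bpf_errors() {
  node=$1
  pod=$(kubectl -n cilium get pods --field-selector="spec.nodeName=$node" \
    -l app.kubernetes.io/name=cilium-agent -oname)
  kubectl -n cilium logs "$pod" --tail=500 | filter_bpf_errors
}
```

Usage: `agent_bpf_errors "$NODE"`. An empty result does not prove that there are no errors, since the filter is only a heuristic; fall back to reading the unfiltered logs in that case.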