NodeDrainStuck
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert.
Overview
This alert fires when a node has been stuck in the drain process for longer than the configured amount of time (10 minutes by default).
This can hold up the maintenance process and delay the upgrade of the cluster.
Steps for debugging
Nodes usually get stuck in the drain process when there is a pod that can't be evicted. This can be caused by a (rogue) PodDisruptionBudget, pods on new nodes not entering a Ready state, or an extremely long terminationGracePeriodSeconds, among other things.
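To narrow down the cause, it can help to first check PodDisruptionBudgets and node readiness across the cluster; a quick sketch using standard kubectl commands:

# A PodDisruptionBudget whose ALLOWED DISRUPTIONS column shows 0 blocks voluntary eviction.
kubectl get poddisruptionbudgets --all-namespaces

# Verify that all nodes are Ready, so that evicted pods can be rescheduled elsewhere.
kubectl get nodes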
Check operator logs
The drain process on OpenShift is initiated by the machine-config-operator. The operator logs the reasons why a node isn't draining. You can find these logs by running the following command:
kubectl -n openshift-machine-config-operator logs deployments/machine-config-controller
Look out for messages like the following:
E0116 14:26:30.051014 1 drain_controller.go:110] error when evicting pods/"apiserver-786c87d87d-9lkxw" -n "openshift-oauth-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
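The message names the namespace of the pod that can't be evicted. To see which PodDisruptionBudget is in the way and how many disruptions it currently allows, list and describe the budgets in that namespace, here using the namespace from the example message above:

kubectl -n openshift-oauth-apiserver get poddisruptionbudgets
kubectl -n openshift-oauth-apiserver describe poddisruptionbudgets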
If a customer workload is blocking the eviction, create a customer ticket to get the issue resolved on the customer side.
Check pods on node
You can list the pods on the node by running the following command:
kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE_NAME
Check for pods (for example, database pods) whose eviction might lead to data loss.
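Since an extremely long terminationGracePeriodSeconds can also hold up the drain, it may be worth listing the grace period of each remaining pod; one possible sketch using kubectl's custom-columns output (the column names are arbitrary):

kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE_NAME \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,GRACE:.spec.terminationGracePeriodSeconds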
Force drain the node
If the node is stuck in the drain process for a long time, you can force the drain by running the following command:
kubectl drain --force --ignore-daemonsets --delete-emptydir-data --grace-period=30 $NODE_NAME
This can lead to data loss depending on the application. Check which pods are left on the node first.
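Afterwards, you can verify that the node is cordoned and that only DaemonSet-managed pods remain on it, assuming $NODE_NAME is still set to the affected node:

# A successfully drained node reports SchedulingDisabled in its status.
kubectl get node $NODE_NAME

# Only DaemonSet pods should still be running on the node.
kubectl get pods --all-namespaces --field-selector spec.nodeName=$NODE_NAME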