UpgradeJobFailed

Overview

This alert fires when an upgrade job has failed.

MachineConfigPools that are not paused might still try to apply changes. If automatic pausing on failure is enabled, the next maintenance might be blocked.

To clear this alert, investigate the root cause and then delete the failed UpgradeJob.
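The cleanup could look like the sketch below. It assumes the UpgradeJob objects live in the `appuio-openshift-upgrade-controller` namespace used elsewhere in this runbook and can be addressed as `upgradejob`. The function names and the job name in the usage comment are placeholders.

```shell
# Assumption: UpgradeJobs live in the controller namespace used elsewhere in this runbook.
NS=appuio-openshift-upgrade-controller

# List all upgrade jobs so the failed one can be identified.
list_upgrade_jobs() {
  kubectl -n "$NS" get upgradejob
}

# Delete one upgrade job by name. Only do this after the root cause is understood.
delete_upgrade_job() {
  kubectl -n "$NS" delete upgradejob "$1"
}

# Usage (hypothetical job name):
#   list_upgrade_jobs
#   delete_upgrade_job my-failed-job
```

Depending on your permissions, you may need to add `--as=system:admin` as in the MachineConfigPool commands further down.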

You might need to unpause the MachineConfigPools if they were automatically paused on upgrade job failure.

However, be aware that unpausing MachineConfigPools may trigger further node reboots.

Steps for debugging

Check the reason given in the alert to understand why the upgrade job has failed.
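If the reason in the alert is not enough, the UpgradeJob object itself may carry more detail. The following is a sketch only: it assumes the same namespace as the hook jobs below, and the function name and job name are placeholders.

```shell
# Assumption: the UpgradeJob lives in the controller namespace used in this runbook.
NS=appuio-openshift-upgrade-controller

# Show the full object, including its status and recent events, for one upgrade job.
describe_upgrade_job() {
  kubectl -n "$NS" describe upgradejob "$1"
}

# Usage (hypothetical job name):
#   describe_upgrade_job my-failed-job
```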

Check for stuck nodes

This alert can indicate that a node is stuck in the reboot process during maintenance.

Check the [] handbook for steps on how to debug a stuck node drain process.
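As a quick first look before following the handbook, the sketch below lists nodes that are not in a plain `Ready` state. It relies on the default column layout of `kubectl get nodes` (name first, status second). The function name is a placeholder.

```shell
# Print nodes whose STATUS column is anything other than plain "Ready",
# for example "NotReady" or "Ready,SchedulingDisabled" (cordoned, e.g. for a drain).
nodes_not_ready() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# Usage:
#   nodes_not_ready
```

A node that stays in `Ready,SchedulingDisabled` or `NotReady` for a long time is a candidate for a stuck drain or reboot.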

Check the ClusterVersion for upgrade errors

kubectl get clusterversions version -ojson | jq '.status.conditions'

Look for the ReleaseAccepted, Failing, and Upgradeable conditions for more details on the failure.
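To avoid scanning the full condition list by eye, the three conditions can be filtered with `jq`. This is a sketch that builds on the command above. The function name is a placeholder, and missing `reason` or `message` fields are printed as `-`.

```shell
# Print only the ReleaseAccepted, Failing, and Upgradeable conditions,
# one per line, as tab-separated type, status, reason, and message.
cv_upgrade_conditions() {
  kubectl get clusterversions version -ojson |
    jq -r '.status.conditions[]
      | select(.type == "ReleaseAccepted" or .type == "Failing" or .type == "Upgradeable")
      | [.type, .status, (.reason // "-"), (.message // "-")]
      | @tsv'
}

# Usage:
#   cv_upgrade_conditions
```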

Check the upgrade job hook jobs

kubectl -n appuio-openshift-upgrade-controller get jobs | grep -v Complete

Look for any jobs that are not complete and check their logs for errors.
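Once an incomplete job is identified, its logs can be fetched by job name. A minimal sketch, with a placeholder function name and job name:

```shell
# Namespace taken from the command above.
NS=appuio-openshift-upgrade-controller

# Print the logs of a pod belonging to the given Job, across all its containers.
hook_job_logs() {
  kubectl -n "$NS" logs "job/$1" --all-containers
}

# Usage (hypothetical job name):
#   hook_job_logs my-hook-job
```

If the job created several pods, `kubectl logs job/...` picks only one of them. In that case, list the pods of the job and check each one.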

Pause or unpause Machine Config Pools

If the upgrade job failure has left the Machine Config Pools in an unfinished state, you might want to pause or unpause them manually.

# List unpaused MCPs (also matches pools where .spec.paused is unset)
kubectl get mcp -ojson | jq -r '.items[] | select(.spec.paused != true) | .metadata.name'
# List paused MCPs
kubectl get mcp -ojson | jq -r '.items[] | select(.spec.paused) | .metadata.name'

To pause a Machine Config Pool:

# Replace XXX with the name of the pool
POOL=XXX
kubectl --as=system:admin patch mcp "$POOL" -p '{"spec":{"paused":true}}' --type=merge

To unpause a Machine Config Pool:

# Replace XXX with the name of the pool
POOL=XXX
kubectl --as=system:admin patch mcp "$POOL" -p '{"spec":{"paused":false}}' --type=merge
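After pausing or unpausing, it can help to confirm the pause state and watch the rollout progress. The sketch below prints a compact overview using the machine counters from the MachineConfigPool status. The function name and the column headings are placeholders.

```shell
# Show pause state and machine counters for every MachineConfigPool.
# PAUSED prints <none> if .spec.paused is not set on a pool.
mcp_status() {
  kubectl get mcp -o custom-columns='NAME:.metadata.name,PAUSED:.spec.paused,MACHINES:.status.machineCount,UPDATED:.status.updatedMachineCount,DEGRADED:.status.degradedMachineCount'
}

# Usage:
#   mcp_status
```

A pool has finished rolling out when UPDATED equals MACHINES and DEGRADED is 0.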

Tune

If this alert isn’t actionable, is noisy, or was raised too early, you may want to tune the time until the alert fires. You can do this by changing the for parameter in the UpgradeJobFailed alert configuration.