PausedMachineConfigPool

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

This alert fires when a MachineConfigPool is paused but there is no active or paused UpgradeJob. A paused MachineConfigPool will likely block the next maintenance and prevent the UpgradeJob from progressing.
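
To see which MachineConfigPools are currently paused, filter on .spec.paused (same yq style as the commands further below):

kubectl get mcp -o yaml | yq '.items[] | select(.spec.paused) | .metadata.name'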

Steps for debugging

Possible reasons why a MachineConfigPool is paused:

  • (likely) An engineer manually paused the MachineConfigPool (e.g. to prevent node reboots)

  • (likely) The previous UpgradeJob timed out and the upgrade controller paused the MachineConfigPool to prevent node reboots (see the check after this list)

  • Something went wrong during a delayed upgrade of a MachineConfigPool by the upgrade controller

  • Another operator/component on the cluster paused the MachineConfigPool
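
To check whether the previous UpgradeJob timed out or whether one is still active or paused, list the UpgradeJobs in the upgrade controller's namespace (the namespace matches the hook manifest below):

kubectl -n appuio-openshift-upgrade-controller get upgradejobs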

Check last update to the .spec.paused field

You may be able to figure out who paused the pool by looking at the managed fields:

kubectl get mcp -oyaml --show-managed-fields | yq '.items[] | select(.spec.paused) | .metadata | {.name: .managedFields[] | select(.fieldsV1."f:spec"."f:paused")}'

If the manager of the field is kubectl-edit or kubectl-patch and it was updated recently, the pool was most likely paused manually by an engineer. Depending on the reason for pausing, consider either ensuring the pool is unpaused before the next maintenance or suspending maintenance on the cluster.

If the pool was paused by an operator, look for the reason in the operator logs.
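
If the upgrade controller paused the pool, its logs should state why. A sketch, assuming the deployment follows the usual kubebuilder naming; verify the name first:

# the deployment name is an assumption; check it before fetching logs
kubectl -n appuio-openshift-upgrade-controller get deployments
kubectl -n appuio-openshift-upgrade-controller logs deployment/openshift-upgrade-controller-controller-manager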

Unpause the MachineConfigPool after investigation

If the root cause is known and node reboots are acceptable, the MachineConfigPools can be unpaused:

for mcp in $(kubectl get mcp -o name); do
  kubectl --as=system:admin patch "$mcp" --type=merge -p '{"spec":{"paused":false}}'
done
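
Afterwards, verify that the pools pick up pending changes again; the UPDATED and UPDATING columns reflect the rollout progress:

kubectl get mcp -w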

If node reboots are not acceptable, either find a suitable window with the customer or unpause the pools at the start of the next maintenance with an UpgradeJobHook like the following:

---
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: unpause-mcp
  namespace: appuio-openshift-upgrade-controller
spec:
  events:
    # Trigger the hook when the upgrade starts
    - Start
  selector:
    # Must match the labels on the cluster's UpgradeJobs
    matchLabels:
      appuio-managed-upgrade: "true"
  # Run the hook for the next matching UpgradeJob only
  run: Next
  template:
    spec:
      template:
        spec:
          containers:
            - args:
                - -c
                - |
                  for mcp in $(kubectl get mcp -o name); do
                    kubectl patch $mcp --type=merge -p '{"spec":{"paused":false}}'
                  done
              command:
                - bash
              image: quay.io/appuio/oc:v4.19
              name: unpause-mcp
              env:
                - name: HOME
                  value: /export
              volumeMounts:
                - mountPath: /export
                  name: export
              workingDir: /export
          restartPolicy: Never
          # Reuse the controller's service account, which can patch MachineConfigPools
          serviceAccountName: openshift-upgrade-controller-controller-manager
          volumes:
            - emptyDir: {}
              name: export
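
Apply the hook with elevated permissions; the file name is just an example:

kubectl --as=system:admin apply -f upgradejobhook-unpause-mcp.yaml

Since the hook is configured with run: Next, it should only fire for the next matching UpgradeJob, not for subsequent maintenances.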