Alert rule: CephPGUnavailableBlockingIO

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

Data availability is reduced, impacting the cluster's ability to service I/O. One or more placement groups (PGs) are in a state that blocks I/O.

Steps for debugging

Find degraded PGs

ceph_cluster_ns=syn-rook-ceph-cluster
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg dump_stuck
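
To focus on the PGs that are actually blocking I/O, you can pass a state filter to dump_stuck. A minimal sketch, assuming the same rook-ceph-tools deployment as above:

ceph_cluster_ns=syn-rook-ceph-cluster
# Only list PGs stuck in the inactive state; these are the ones blocking I/O.
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg dump_stuck inactive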

Trigger a repeer of degraded PGs

During initial testing and benchmarking, we’ve found that sometimes triggering a repeer of PGs in state degraded+undersized can unstick the recovery process. You can use the shell snippet below to trigger a repeer for all degraded+undersized PGs.

ceph_cluster_ns=syn-rook-ceph-cluster
for pgid in $(kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools \
              -- ceph pg dump_stuck undersized -f json | sed 's/}ok/}/' | \
              jq -r '.stuck_pg_stats[] | .pgid')
do
    kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph pg repeer "${pgid}"
done
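
To check whether the repeer unstuck a PG, you can query its state afterwards. A minimal sketch; the pgid value is a placeholder for one of the PG IDs printed by dump_stuck:

ceph_cluster_ns=syn-rook-ceph-cluster
pgid=<PGID> # e.g. "2.4"
# Print the current PG state; it should eventually drop undersized+degraded.
kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph pg ${pgid} query | jq -r '.state'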

Check Ceph status

The following command should show a line starting with recovery: under io: if Ceph is making progress recovering the degraded PGs.

ceph_cluster_ns=syn-rook-ceph-cluster
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status
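
Recovery can take a while. To keep an eye on progress, you can re-run the status command periodically, for example with watch. A minimal sketch:

ceph_cluster_ns=syn-rook-ceph-cluster
# Refresh the cluster status every 30 seconds; the recovery line under io:
# disappears once all PGs are back to active+clean.
watch -n 30 kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph status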

Roll back a damaged or missing object to a prior version

This procedure leads to a loss of data. It should be used as a last resort only.
  1. Find placement groups with missing objects.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail
    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
    pg 2.4 is active+degraded, 78 unfound (1)
    1 Shows the placement group that’s missing an object.
  2. Check if other OSDs might have the object.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ pg_with_missing_objects=<PG> # e.g. "2.4" (1)
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} query | jq '.recovery_state[] | select(.name == "Started/Primary/Active") | .might_have_unfound'
    { "osd": 1, "status": "osd is down"} (2)
    1 Placement group to check.
    2 Might show why an object is missing.
  3. If no other OSD has the object, roll back the object to a prior version (see the sketch after this procedure for listing the affected objects first).

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ pg_with_missing_objects=<PG> # e.g. "2.4" (1)
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} mark_unfound_lost revert (2)
    1 Placement group with irrecoverable missing object.
    2 delete might be the safer option for systems that get confused if they see an older version of the object.
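
Before reverting or deleting anything, it can help to see exactly which objects are unfound in the affected placement group. A minimal sketch using ceph pg list_unfound, with the same placeholder PG as above:

ceph_cluster_ns=syn-rook-ceph-cluster
pg_with_missing_objects=<PG> # e.g. "2.4"
# List the unfound objects in this PG before deciding between revert and delete.
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} list_unfound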