Alert rule: CephPGsUnclean
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
Overview
PGs have been unclean for more than 15 minutes in a pool. Unclean PGs haven’t recovered from a previous failure.
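To get a quick overview of how many PGs are currently not clean, you can query the PG state summary from the toolbox pod. A minimal sketch, assuming the same rook-ceph-tools deployment used throughout this runbook:
ceph_cluster_ns=syn-rook-ceph-cluster
# Print a one-line summary of all PG states in the cluster
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg stat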
Steps for debugging
Find degraded PGs
ceph_cluster_ns=syn-rook-ceph-cluster
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg dump_stuck
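dump_stuck also accepts one or more PG states as arguments. To restrict the output to PGs that are stuck unclean (the state this alert fires on), you can pass the state explicitly; a sketch:
# Only list PGs stuck in the unclean state
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg dump_stuck unclean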
Trigger a repeer of degraded PGs
During initial testing and benchmarking, we’ve found that sometimes triggering a repeer of PGs in state degraded+undersized can unstick the recovery process. You can use the shell snippet below to trigger a repeer for all degraded+undersized PGs.
ceph_cluster_ns=syn-rook-ceph-cluster
for pgid in $(kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools \
-- ceph pg dump_stuck undersized -f json | sed 's/}ok/}/' | \
jq -r '.stuck_pg_stats[] | .pgid')
do
kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph pg repeer "${pgid}"
done
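After the loop has finished, you can re-run the stuck PG dump to verify that the repeer had an effect; a shrinking or empty list of undersized PGs indicates that recovery is moving again. A sketch:
# List PGs that are still stuck in the undersized state
kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph pg dump_stuck undersized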
Check Ceph status
The following command should show a line starting with recovery: under io: if Ceph is making progress recovering the degraded PGs.
ceph_cluster_ns=syn-rook-ceph-cluster
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status
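If you're only interested in the recovery rate, you can filter the status output down to the io: section. A sketch, assuming grep is available in the toolbox image:
# Show the io: section, which contains the recovery: line while recovery is in progress
kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status | grep -A3 'io:'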
Roll back a damaged or missing object to a prior version
This procedure leads to a loss of data. It should be used as a last resort only.
- Find placement groups with missing objects.

  $ ceph_cluster_ns=syn-rook-ceph-cluster
  $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail
  HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
  pg 2.4 is active+degraded, 78 unfound (1)

  (1) Shows the placement group that’s missing an object.

- Check if other OSDs might have the object.

  $ ceph_cluster_ns=syn-rook-ceph-cluster
  $ pg_with_missing_objects=<PG> # e.g. "2.4" (1)
  $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} query | jq '.recovery_state[] | select(.name == "Started/Primary/Active") | .might_have_unfound'
  {
    "osd": 1,
    "status": "osd is down"
  } (2)

  (1) Placement group to check.
  (2) Might show why an object is missing.

- If no other OSD has the object, roll back the object to a prior version.

  $ ceph_cluster_ns=syn-rook-ceph-cluster
  $ pg_with_missing_objects=<PG> # e.g. "2.4" (1)
  $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} mark_unfound_lost revert (2)

  (1) Placement group with irrecoverable missing object.
  (2) delete might be the safer option for systems that get confused if they see an older version of the object.
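If delete is the better option for your workload (as noted in the last callout above), the command is identical except for the final argument. A sketch; note that this permanently discards the unfound objects instead of reverting them:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ pg_with_missing_objects=<PG> # e.g. "2.4"
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} mark_unfound_lost delete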