Alert rule: CephPGsDamaged

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

During data consistency checks (scrubs), at least one placement group (PG) has been flagged as damaged or inconsistent. Identify the affected PG and attempt a manual repair if necessary.
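
To confirm which PG is affected, the cluster health detail usually lists the damaged PGs along with the number of scrub errors. A quick check from the toolbox deployment (same namespace and deployment as used in the debugging steps below):

$ ceph_cluster_ns=syn-rook-ceph-cluster
# Shows OSD_SCRUB_ERRORS and PG_DAMAGED entries with the affected PG IDs
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail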

Steps for debugging

Repair damaged PGs

$ ceph_cluster_ns=syn-rook-ceph-cluster
# List pools
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph osd pool ls

$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- rados list-inconsistent-pg <POOL> (1)
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg repair <PG_NUM> (2)
(1) Execute for the pool shown in the alert, or for all pools if no pool is shown.
(2) Attempts to repair the PG.
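
After triggering the repair, it can help to verify that the scrub errors clear and the PG returns to active+clean. A possible follow-up, reusing the same toolbox deployment and the PG ID placeholder from above:

# Inspect which objects and shards are inconsistent (useful if the repair doesn't resolve the issue)
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- rados list-inconsistent-obj <PG_NUM> --format=json-pretty
# Detailed state of the PG, including scrub timestamps
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg <PG_NUM> query
# Re-check cluster health until OSD_SCRUB_ERRORS and PG_DAMAGED disappear
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail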