Alert rule: CephObjectMissing

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

The latest version of a RADOS object cannot be found, even though all OSDs are up. Client I/O requests for this object will block (hang). Resolving this issue may require manually rolling the object back to a prior version and manually verifying the result.
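
Before starting to debug, it can help to confirm that the alert matches the cluster state: all OSDs up, but unfound objects reported. The following is a minimal sketch which assumes, like the steps below, that the rook-ceph-tools deployment runs in the syn-rook-ceph-cluster namespace; adjust the namespace for your cluster.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph osd stat # all OSDs should be "up" and "in"
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status # the health section should report unfound objects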

Steps for debugging

Roll back a damaged or missing object to a prior version

This procedure leads to a loss of data. It should be used as a last resort only.
  1. Find placement groups with missing objects.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail
    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
    pg 2.4 is active+degraded, 78 unfound (1)
    1 Shows the placement group that's missing objects. To list the unfound objects by name, see the first sketch after this procedure.
  2. Check if other OSDs might have the object.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ pg_with_missing_objects=<PG> # e.g. "2.4" (1)
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} query | jq '.recovery_state[] | select(.name == "Started/Primary/Active") | .might_have_unfound'
    { "osd": 1, "status": "osd is down"} (2)
    1 Placement group to check.
    2 Might show why an object is missing. If an OSD is reported as down, check its pod as shown in the second sketch after this procedure.
  3. If no other OSD has the object, roll the object back to a prior version, then verify recovery as shown in the last sketch after this procedure.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ pg_with_missing_objects=<PG> # e.g. "2.4" (1)
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} mark_unfound_lost revert (2)
    1 Placement group with irrecoverable missing object.
    2 Using delete instead of revert might be the safer option for systems that get confused if they see an older version of the object.
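
To list the objects that a placement group reports as unfound by name (step 1), Ceph provides list_unfound. A short sketch, reusing the variables from the steps above; the output format varies by Ceph version:

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ pg_with_missing_objects=<PG> # e.g. "2.4"
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} list_unfound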
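
If the query in step 2 reports an OSD as down, check whether its pod is merely restarting before giving up on the object. A sketch, assuming the standard Rook OSD pod label app=rook-ceph-osd:

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} get pods -l app=rook-ceph-osd
    $ kubectl -n ${ceph_cluster_ns} describe pod <OSD_POD> # inspect the events of any pod that isn't Running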
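
After the revert or delete in step 3, verify that the placement group recovers and the alert clears. A sketch reusing the variables from above; the reported state should eventually return to active+clean:

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ pg_with_missing_objects=<PG> # e.g. "2.4"
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail # the unfound warning should be gone
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg ${pg_with_missing_objects} query | jq '.state'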