Alert rule: CephDeviceFailureRelocationIncomplete

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

The device health module has determined that one or more devices will fail soon, but the normal process of relocating the data on the device to other OSDs in the cluster is blocked.

Ensure that the cluster has available free space. It may be necessary to add capacity to the cluster to allow the data from the failing device to successfully migrate, or to enable the balancer.

Steps for debugging

Check which OSD is failing.

$ ceph_cluster_ns=syn-rook-ceph-cluster
# Show life expectancy of the OSD
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph device ls

Increase cluster size.

Upstream documentation

docs.ceph.com/en/latest/rados/operations/health-checks#device-health-in-use