Alert rule: CephDeviceFailurePredictionTooHigh

Overview

The device health module has determined that the devices predicted to fail cannot be remediated automatically, because doing so would remove too many OSDs from the cluster to maintain performance and availability. Prevent data integrity issues by adding new OSDs so that data can be relocated.
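
To make the "too many OSDs" condition concrete, the following is a rough sketch of the arithmetic, not the device health module's actual code. It assumes the limit comes from Ceph's `mon_osd_min_in_ratio` setting and that its value is 0.75 (expressed here as 75 percent); verify the value on your cluster. The function name and its arguments are hypothetical.

```shell
# Hypothetical helper: would marking out the failing OSDs drop the share of
# "in" OSDs below the minimum ratio?
# Arguments: total OSDs, OSDs currently "in", OSDs predicted to fail,
#            minimum "in" ratio in percent (assumed default: 75).
would_exceed_min_in() {
  total=$1
  in_now=$2
  failing=$3
  min_pct=${4:-75}
  remaining=$((in_now - failing))
  # Integer comparison of remaining/total against min_pct/100.
  if [ $((remaining * 100)) -lt $((total * min_pct)) ]; then
    echo "too-many"
  else
    echo "ok"
  fi
}
```

For example, in a cluster of 12 OSDs that are all "in", `would_exceed_min_in 12 12 2` prints `ok`, while `would_exceed_min_in 12 12 4` prints `too-many`, because only 8 of 12 OSDs (about 67 percent) would remain.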

Steps for debugging

Check which OSDs are backed by devices that are predicted to fail.

$ ceph_cluster_ns=syn-rook-ceph-cluster
# Show life expectancy of the OSD
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph device ls
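
If the device list alone is not enough, the health detail and the per-device information can help narrow things down. The sketch below wraps the toolbox call from above in two shell functions; the function names `ceph_tools` and `inspect_devices` are hypothetical, and the device IDs passed in are assumed to be copied from the `ceph device ls` output.

```shell
ceph_cluster_ns=syn-rook-ceph-cluster

# Hypothetical wrapper: run a ceph subcommand inside the Rook toolbox.
ceph_tools() {
  kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph "$@"
}

# Hypothetical helper: print the cluster health detail, then the stored
# information for each device ID given as an argument.
inspect_devices() {
  ceph_tools health detail
  for devid in "$@"; do
    ceph_tools device info "${devid}"
  done
}
```

Call it with one or more device IDs, for example `inspect_devices <devid>`.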

Increase the cluster size by adding new OSDs, so that data can be relocated away from the devices predicted to fail.

Upstream documentation

docs.ceph.com/en/latest/rados/operations/health-checks#device-health-toomany