Alert rule: CephDeviceFailurePredicted

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert.

Overview

The device health module has determined that one or more devices will fail soon. To review device status, use ceph device ls. To show a specific device, use ceph device info <dev id>. Mark the OSD out so that data may migrate to other OSDs. Once the OSD has drained, destroy the OSD, replace the device, and redeploy the OSD.
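
To inspect a single device before draining anything, the ceph device info command mentioned above can be run through the Rook toolbox. This is a minimal sketch, assuming the same syn-rook-ceph-cluster namespace and rook-ceph-tools deployment used in the debugging steps below; replace <dev id> with an ID from the ceph device ls output.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    # Show details and life expectancy for a single device
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph device info <dev id>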

Steps for debugging

  1. Check which OSD is failing.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    # List all devices and their life expectancy
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph device ls
  2. If usage is high, or the cluster has three or fewer OSDs, you may need to add new OSDs to the cluster.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    # Show the number of OSDs and their up/in status
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph osd stat
    # Show used and available space in the cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph pg stat
  3. Mark the OSD as out so that data can migrate to the other OSDs in the cluster.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph osd out <osd id>
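  4. Once the data has migrated off the OSD, verify that it is safe to destroy before replacing the device and redeploying the OSD, as described in the overview. The commands below are a minimal sketch using the same toolbox deployment as the previous steps; replace <osd id> with the ID found in step 1.

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    # Watch recovery/backfill progress while data migrates off the OSD
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph -s
    # Check whether the OSD can be removed without risking data loss
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph osd safe-to-destroy <osd id>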