Alert rule: CephOSDBackfillFull

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

An OSD has reached the BACKFILL FULL threshold. This will prevent rebalance operations from completing. Use ceph health detail and ceph osd df to identify the problem. To resolve, add capacity to the affected OSD’s failure domain, restore down/out OSDs, or delete unwanted data.

Steps for debugging

Check current capacity utilisation

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph df
--- RAW STORAGE ---
[...]

--- POOLS ---
POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
device_health_metrics   1    8  937 KiB        3  2.7 MiB      0     93 GiB
fspool-metadata         2   32   23 MiB       29   69 MiB   0.02     93 GiB
fspool-data0            3   32      0 B        0      0 B      0     93 GiB
storagepool             4   32   20 KiB        5   71 KiB      0     93 GiB

Upstream documentation

docs.ceph.com/en/latest/rados/operations/health-checks#osd-backfillfull