Alert rule: CephDaemonCrash

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

One or more daemons have crashed recently, and the crashes haven't been acknowledged yet. This notification ensures that software crashes don't go unnoticed. To acknowledge a crash, use the ceph crash archive <id> command.
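
This alert corresponds to Ceph's RECENT_CRASH health check. To confirm which health check is firing, you can query the health detail directly (a sketch; it assumes the same rook-ceph-tools toolbox deployment and namespace used in the steps below):

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail
[... output should include a RECENT_CRASH entry naming the crashed daemons ...]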

Steps for debugging

Check possible crashes

Check whether the Ceph status reports recent crashes.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN
            1 daemons have recently crashed (1)
[ ... remaining output omitted ... ]
1 One or more lines indicating recent crashes.
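
For a quick overview of how many crash reports are stored and how recent they are, the standard ceph crash stat subcommand prints a summary grouped by age (same toolbox invocation as above):

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash stat
[... summary of stored crash reports, grouped by age ...]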

Get the list of recent crashes.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash ls-new
ID                                                                ENTITY        NEW
[... some date and uuid ...]                                      mds.fspool-b  * (1)
1 ID and affected entity of the crash
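
Note that ceph crash ls-new only lists crashes which haven't been archived yet. To list all stored crashes, including already archived ones, use ceph crash ls; archived entries simply lack the marker in the NEW column:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash ls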

Get more information about the nature of the crash.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash info {ID}
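
The crash report is printed as JSON. If you're mainly interested in the stack trace, you can extract it with jq, assuming jq is available on the machine running kubectl and that your Ceph version emits the report with a backtrace field (the -t flag is dropped so the output stays pipe-friendly):

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec deploy/rook-ceph-tools -- ceph crash info {ID} | jq -r '.backtrace[]'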

If the issue is resolved but the warning is still present, archive the crash to clear it from the list.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash archive {ID}
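
If several crashes have piled up and each of them has been investigated, they can be acknowledged in one go with the standard ceph crash archive-all subcommand:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash archive-all

Old crash records can also be discarded entirely with ceph crash prune <keep>, where <keep> is the number of days of records to retain.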