Alert rule: CephMgrModuleCrash

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

One or more mgr modules have crashed and the crash has not yet been acknowledged. A crashed module may impact functionality within the cluster. Use the ceph crash command to determine which module has failed, and archive the crash report to acknowledge the failure.

Steps for debugging

Check possible crashes

Check whether the Ceph status shows recent crashes.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN
            1 daemons have recently crashed (1)
[ ... remaining output omitted ... ]
(1) One or more lines indicating recent crashes.
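
To see which health check raises the warning, you can also query the health detail; a crashed mgr module is typically reported under a check such as RECENT_MGR_MODULE_CRASH (exact check names may vary by Ceph release).

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail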

Get the list of recent crashes.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash ls-new
ID                                                                ENTITY        NEW
[... some date and uuid ...]                                      mds.fspool-b  * (1)
(1) ID and affected entity of the crash.
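
If you also want to see crashes that have already been acknowledged, ceph crash ls lists all recorded crash reports, including archived ones.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash ls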

Get more information about the nature of the crash.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash info {ID}
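
The crash info includes metadata such as the crashing entity, a timestamp, and usually a Python traceback identifying the module. It can help to cross-check the logs of the affected mgr pod; the mgr deployment name below (rook-ceph-mgr-a) is the usual Rook default and may differ in your cluster.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ # rook-ceph-mgr-a is an assumption; adjust to the mgr deployment present in your cluster
$ kubectl -n ${ceph_cluster_ns} logs deploy/rook-ceph-mgr-a --since=24h | grep -iA5 traceback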

If the issue is resolved but the warning is still present, clear the crash list by archiving the crash report.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash archive {ID}
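
If several crash reports are pending and all of them have been investigated, they can be acknowledged in one go with ceph crash archive-all.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash archive-all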