Alert rule: CephMgrModuleCrash
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
Overview
One or more mgr modules have crashed and have yet to be acknowledged.
A crashed module may impact functionality within the cluster.
Use the ceph crash command to determine which module has failed, and archive the crash to acknowledge the failure.
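Because a crashed module may leave parts of the mgr functionality unavailable, it can also help to check which mgr modules are currently enabled. A minimal sketch, assuming the same rook-ceph-tools toolbox deployment used in the steps below:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph mgr module ls

The output lists always-on, enabled, and disabled modules, which helps cross-check the entity reported by ceph crash.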
Steps for debugging
Check possible crashes
Check whether the Ceph status reports recent crashes.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status
cluster:
id: 92716509-0f84-4739-8d04-541d2e7c3e66
health: HEALTH_WARN
1 daemons have recently crashed (1)
[ ... remaining output omitted ... ]
(1) One or more lines indicating recent crashes.
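For more detail on the warning itself, ceph health detail lists the individual health checks (the recent-crash warning appears as the RECENT_CRASH check). A sketch, assuming jq is available on your workstation for the JSON variant:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail
$ kubectl -n ${ceph_cluster_ns} exec deploy/rook-ceph-tools -- ceph status --format json | jq '.health.checks'

The second command omits -it since no interactive terminal is needed when piping the output.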
Get the list of recent crashes.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash ls-new
ID ENTITY NEW
[... some date and uuid ...] mds.fspool-b * (1)
(1) ID and affected entity of the crash.
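In addition to ls-new, ceph crash ls lists all crash reports, including ones that were already archived, which can help correlate a new crash with earlier alerts:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash ls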
Get more information about the nature of the crash.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash info {ID}
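If several crashes are listed, a small loop saves copying each ID by hand. This is only a sketch, assuming the crash ID is the first column of the ceph crash ls-new output (as in the example above):

ceph_cluster_ns=syn-rook-ceph-cluster
# Print the full crash report for every unacknowledged crash; NR>1 skips the header line
for id in $(kubectl -n ${ceph_cluster_ns} exec deploy/rook-ceph-tools -- ceph crash ls-new | awk 'NR>1 {print $1}'); do
  kubectl -n ${ceph_cluster_ns} exec deploy/rook-ceph-tools -- ceph crash info "$id"
done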
If the underlying issue is resolved but the warning is still present, archive the crash to clear it from the list.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash archive {ID}
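If several daemons crashed for the same, already-resolved reason, all pending crash reports can be acknowledged at once with ceph crash archive-all:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph crash archive-all

Once all new crashes are archived, the "recently crashed" health warning clears and the alert should resolve.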