Alert rule: CephHealthError
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub. |
Overview
The cluster state has been HEALTH_ERROR for more than 5 minutes.
Please check ceph health detail
for more information.
Steps for debugging
Check Ceph cluster status
Check Ceph for detailed information why the cluster is in HEALTH_ERROR
state
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph status
cluster:
id: 92716509-0f84-4739-8d04-541d2e7c3e66
health: HEALTH_ERROR (1)
[ ... detailed information ... ] (2)
[ ... detailed information ... ] (2)
[ ... detailed information ... ] (2)
[ ... remaining output omitted ... ]
1 | General cluster health status |
2 | One or more lines of information giving details why the cluster state is degraded.
Only available if the cluster health isn’t HEALTH_OK . |
Check Ceph crash logs
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls (1)
[ ... list of crash logs ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash info <CRASH_ID> (2)
[ ... detailed crash info ... ]
1 | List currently not archived crash logs |
2 | Show detailed information of crash log with id <CRASH_ID> |
Archive Ceph crash logs
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash archive-all (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash archive <CRASH_ID> (2)
1 | Archive all currently not archived crash logs |
2 | Archive crash log with id <CRASH_ID> |