Alert rule: CephOSDHostDown
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
Overview
This alert fires if an OSD isn’t running. A common cause is that a storage node is unavailable.
Steps for debugging
Check that all storage nodes are ready
$ kubectl get nodes -l node-role.kubernetes.io/storage
NAME STATUS ROLES AGE VERSION
storage-9649 Ready storage,worker 6d18h v1.20.0+87cc9a4
storage-96bf Ready storage,worker 6d18h v1.20.0+87cc9a4
storage-cbf0 Ready storage,worker 6d19h v1.20.0+87cc9a4
Investigate any nodes which show as NotReady in the output of the previous command.
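If a node shows as NotReady, the following commands are a starting point for finding out why. This is a minimal sketch; <NODE_NAME> is a placeholder for the name of the affected node.
$ kubectl describe node <NODE_NAME> (1)
$ kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<NODE_NAME> (2)
1 | Check the node’s conditions and recent events for hints why it isn’t ready |
2 | List events recorded for the node |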
Check OSD pod status
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-7cb86cd754-hvbkm 1/1 Running 0 4d1h
rook-ceph-osd-1-76c757c5f5-2xxrp 1/1 Running 0 3h47m
rook-ceph-osd-2-6bcb99c85d-c7dmq 1/1 Running 0 3h44m
Investigate pods which aren’t in state Running with 1/1 ready containers. The command should show 3 pods. If there are fewer than 3 pods, investigate the Ceph cluster and the CephCluster resource status.
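For OSD pods which aren’t running, describing the pod and checking the logs of the previous container instance usually shows the reason. This is a minimal sketch; <POD_NAME> is a placeholder for the affected pod.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" describe pod <POD_NAME> (1)
$ kubectl -n "${ceph_cluster_ns}" logs <POD_NAME> --previous (2)
1 | Check the pod’s events, for example for scheduling or image pull problems |
2 | Show the logs of the previous container instance if the OSD container crashed |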
Check Ceph cluster status
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
cluster:
id: 92716509-0f84-4739-8d04-541d2e7c3e66
health: HEALTH_WARN (1)
[ ... detailed information ... ] (2)
[ ... detailed information ... ] (2)
[ ... detailed information ... ] (2)
[ ... remaining output omitted ... ]
1 | General cluster health status |
2 | One or more lines of information detailing why the cluster state is degraded. Only shown if the cluster health isn’t HEALTH_OK. |
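If the cluster health isn’t HEALTH_OK, the following commands give more detail on the active health checks and show which OSDs are down and on which host they run. This is a small sketch using the standard Ceph CLI.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph health detail (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph osd tree (2)
1 | Show detailed information for each active health check |
2 | Show the OSD tree, including which OSDs are down and which host they belong to |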
Check Ceph crash logs
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls (1)
[ ... list of crash logs ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash info <CRASH_ID> (2)
[ ... detailed crash info ... ]
1 | List crash logs which haven’t been archived yet |
2 | Show detailed information for the crash log with ID <CRASH_ID> |
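For a quick overview of how many crashes were recorded, the crash module also provides a summary command. A small sketch using the standard Ceph CLI:
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash stat (1)
1 | Show a summary of the recorded crashes |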
Archive Ceph crash logs
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash archive-all (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash archive <CRASH_ID> (2)
1 | Archive all crash logs which haven’t been archived yet |
2 | Archive the crash log with ID <CRASH_ID> |
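To verify that no unarchived crash logs remain, the crash module can list only the entries which haven’t been archived yet. A small sketch; once everything is archived the command should list no remaining entries.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls-new (1)
1 | List only crash logs which haven’t been archived yet |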
Check CephCluster resource status
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster}" describe cephcluster
[ ... metadata and spec omitted ... ]
Status:
Ceph:
Capacity:
Bytes Available: 574670872576
Bytes Total: 578156433408
Bytes Used: 3485560832
Last Updated: 2022-11-03T13:54:53Z
Fsid: a5231b40-1896-448d-aae5-9b37f3d16bee
Health: HEALTH_OK (1)
Last Changed: 2022-11-03T10:31:44Z (2)
Last Checked: 2022-11-03T13:54:53Z
Previous Health: HEALTH_WARN
Versions:
Mds:
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable): 2 (3)
Mgr:
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable): 1 (4)
Mon:
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable): 3 (5)
Osd:
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable): 3 (6)
Overall:
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable): 9 (7)
Conditions:
Last Heartbeat Time: 2022-11-03T13:54:53Z
Last Transition Time: 2022-11-02T12:22:41Z
Message: Cluster created successfully
Reason: ClusterCreated
Status: True
Type: Ready (8)
Message: Cluster created successfully
Observed Generation: 3
Phase: Ready
State: Created
Storage:
Device Classes:
Name: hdd
Version:
Image: quay.io/ceph/ceph:v17.2.5 (9)
Version: 17.2.5-0
1 | Current Ceph cluster health |
2 | Time of last Ceph cluster health status change |
3 | List of running Ceph MDS version(s) |
4 | List of running Ceph MGR version(s) |
5 | List of running Ceph MON version(s) |
6 | List of running Ceph OSD version(s) |
7 | List of all running Ceph version(s) |
8 | Current condition of CephCluster resource |
9 | Ceph docker image used for the Ceph cluster |
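To check the overall phase without reading through the full describe output, a JSONPath query can be used. This is a minimal sketch which assumes there’s exactly one CephCluster resource in the namespace.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get cephcluster \
    -o jsonpath='{.items[0].status.phase}' (1)
Ready
1 | Print the current phase of the CephCluster resource, for example Ready |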
If the current condition of the CephCluster resource isn’t Ready, check the Rook-Ceph operator logs for any errors.
$ ceph_operator_ns=syn-rook-ceph-operator
$ kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h (1)
1 | Adjust the --since parameter depending on the age of the alert |
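To narrow down the operator logs, filtering for error messages is often a good starting point. A minimal sketch; the grep pattern is only an example and may need adjusting.
$ ceph_operator_ns=syn-rook-ceph-operator
$ kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h | \
    grep -iE 'error|fail' (1)
1 | Filter the operator logs for common error keywords |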