Alert rule: CephOSDDownHigh

More than 10% of the OSDs are down.
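
To relate the threshold to a concrete cluster, the share of down OSDs can be worked out from the total and up counts. The following is a minimal sketch; the helper name `osd_down_percent` is hypothetical, and the exact alert expression isn't shown in this runbook, so the comparison against 10 is only an illustration.

```shell
# Hypothetical helper: integer percentage of OSDs that are down.
# $1 = total number of OSDs, $2 = number of OSDs that are up.
osd_down_percent() {
  total=$1
  up=$2
  if [ "$total" -le 0 ]; then
    echo "total must be greater than 0" >&2
    return 1
  fi
  echo $(( (total - up) * 100 / total ))
}

# Example: with 1 of 3 OSDs down, the share of down OSDs is 33%.
pct=$(osd_down_percent 3 2)
if [ "$pct" -gt 10 ]; then
  echo "${pct}% of OSDs down: above threshold"
else
  echo "${pct}% of OSDs down: at or below threshold"
fi
```

In a 3-OSD cluster like the one in the examples below, a single down OSD is already well above 10%.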

Steps for debugging

Check that all storage nodes are ready

$ kubectl get nodes -l node-role.kubernetes.io/storage
NAME           STATUS   ROLES            AGE     VERSION
storage-9649   Ready    storage,worker   6d18h   v1.20.0+87cc9a4
storage-96bf   Ready    storage,worker   6d18h   v1.20.0+87cc9a4
storage-cbf0   Ready    storage,worker   6d19h   v1.20.0+87cc9a4

Investigate any nodes that show as NotReady in the output of the previous command.
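
On clusters with many storage nodes, the node list can be filtered down to the nodes that need attention. A minimal sketch, assuming the default `kubectl get nodes` column layout; the helper name `not_ready_nodes` is hypothetical, and the label selector in the usage comment is an assumption derived from the `storage` entry in the ROLES column.

```shell
# Hypothetical helper: read `kubectl get nodes` output on stdin and print
# the names of nodes whose STATUS column isn't exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are listed as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster (the label selector is an assumption):
#   kubectl get nodes -l node-role.kubernetes.io/storage | not_ready_nodes
```

An empty result means that all listed nodes report Ready.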

Check OSD pod status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-7cb86cd754-hvbkm   1/1     Running   0          4d1h
rook-ceph-osd-1-76c757c5f5-2xxrp   1/1     Running   0          3h47m
rook-ceph-osd-2-6bcb99c85d-c7dmq   1/1     Running   0          3h44m

Investigate pods which aren’t in state Running with 1/1 ready containers.

The command should show 3 pods. If there are fewer than 3 pods, investigate the Ceph cluster and CephCluster resource status.
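
Both checks above can be scripted against the pod list. A minimal sketch, assuming the default `kubectl get pods` column layout; the helper names `unhealthy_osd_pods` and `has_expected_pod_count` are hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of pods that aren't Running with 1/1 ready containers.
unhealthy_osd_pods() {
  awk 'NR > 1 && ($2 != "1/1" || $3 != "Running") { print $1 }'
}

# Hypothetical helper: succeed only if the pod list on stdin contains the
# expected number of pods ($1), not counting the header line.
has_expected_pod_count() {
  expected=$1
  actual=$(awk 'NR > 1 { n++ } END { print n + 0 }')
  [ "$actual" -eq "$expected" ]
}

# Usage against a live cluster:
#   kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd | unhealthy_osd_pods
#   kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd | has_expected_pod_count 3
```

The expected count of 3 matches the example cluster; adjust it to the number of OSDs configured for the cluster at hand.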

Check Ceph cluster status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN (1)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)

[ ... remaining output omitted ... ]

(1) General cluster health status
(2) One or more lines of information giving details why the cluster state is degraded. Only available if the cluster health isn’t HEALTH_OK.
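
When only the overall health value is needed, it can be extracted from the status output. A minimal sketch; the helper name `ceph_health` is hypothetical and assumes the `health:` line layout shown above. The follow-up commands in the comments (`ceph health detail`, `ceph osd tree`) are standard Ceph CLI commands that may help narrow down which OSDs are affected.

```shell
# Hypothetical helper: read `ceph status` output on stdin and print the
# value of the "health:" line, for example HEALTH_OK or HEALTH_WARN.
ceph_health() {
  awk '$1 == "health:" { print $2 }'
}

# Usage against a live cluster:
#   kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph status | ceph_health
#
# Possible follow-up commands for more detail:
#   kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph health detail
#   kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph osd tree
```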

Check CephCluster resource status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" describe cephcluster
[ ... metadata and spec omitted ... ]
      Bytes Available:  574670872576
      Bytes Total:      578156433408
      Bytes Used:       3485560832
      Last Updated:     2022-11-03T13:54:53Z
    Fsid:               a5231b40-1896-448d-aae5-9b37f3d16bee
    Health:             HEALTH_OK (1)
    Last Changed:       2022-11-03T10:31:44Z (2)
    Last Checked:       2022-11-03T13:54:53Z
    Previous Health:    HEALTH_WARN
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  2 (3)
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  1 (4)
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  3 (5)
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  3 (6)
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  9 (7)
    Last Heartbeat Time:   2022-11-03T13:54:53Z
    Last Transition Time:  2022-11-02T12:22:41Z
    Message:               Cluster created successfully
    Reason:                ClusterCreated
    Status:                True
    Type:                  Ready (8)
  Message:                 Cluster created successfully
  Observed Generation:     3
  Phase:                   Ready
  State:                   Created
    Device Classes:
      Name:  hdd
    Image: (9)
    Version:  17.2.5-0

(1) Current Ceph cluster health
(2) Time of last Ceph cluster health status change
(3) List of running Ceph MDS version(s)
(4) List of running Ceph MGR version(s)
(5) List of running Ceph MON version(s)
(6) List of running Ceph OSD version(s)
(7) List of all running Ceph version(s)
(8) Current condition of CephCluster resource
(9) Ceph container image used for the Ceph cluster
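
The describe output is long, so the two fields that matter most for this alert can be pulled out with a small filter. A minimal sketch; the helper name `cephcluster_summary` is hypothetical and assumes the `Health:` and `Phase:` field labels shown in the output above.

```shell
# Hypothetical helper: read `kubectl describe cephcluster` output on stdin
# and print the Health and Phase fields as "health=<value> phase=<value>".
# "Previous Health:" is ignored because its first word is "Previous".
cephcluster_summary() {
  awk '
    $1 == "Health:" { health = $2 }
    $1 == "Phase:"  { phase = $2 }
    END { printf "health=%s phase=%s\n", health, phase }
  '
}

# Usage against a live cluster:
#   kubectl -n "${ceph_cluster_ns}" describe cephcluster | cephcluster_summary
```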

If the current condition of the CephCluster resource isn’t Ready, check the Rook-Ceph operator logs for any errors.

$ ceph_operator_ns=syn-rook-ceph-operator
$ kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h (1)

(1) Adjust the --since parameter depending on the age of the alert.
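
Operator logs can be noisy, so it may help to keep only warnings and errors. A minimal sketch; the helper name `operator_problems` is hypothetical, and the filter assumes that each log line has the layout `<date> <time> <level> | <component>: <message>` with a single-letter level (`E` for errors, `W` for warnings). Verify this against the actual log output before relying on it.

```shell
# Hypothetical helper: keep only warning and error lines from operator logs
# read on stdin. Assumes the third whitespace-separated field is a
# single-letter log level (E = error, W = warning).
operator_problems() {
  awk '$3 == "E" || $3 == "W"'
}

# Usage against a live cluster:
#   kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h | operator_problems
```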