Alert rule: CephOSDDownHigh
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
Steps for debugging
Check that all storage nodes are ready
$ kubectl get nodes -l node-role.kubernetes.io/storage
NAME           STATUS   ROLES            AGE     VERSION
storage-9649   Ready    storage,worker   6d18h   v1.20.0+87cc9a4
storage-96bf   Ready    storage,worker   6d18h   v1.20.0+87cc9a4
storage-cbf0   Ready    storage,worker   6d19h   v1.20.0+87cc9a4
Investigate any nodes which show as NotReady in the output of the previous command.
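For example, you can inspect the conditions and recent events of a node which shows as NotReady (replace <NODE_NAME> with the name of the affected node):
$ kubectl describe node <NODE_NAME> (1)
[ ... node conditions and events ... ]
1 | Check the Conditions and Events sections for hints why the node isn't ready |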
Check OSD pod status
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-7cb86cd754-hvbkm   1/1     Running   0          4d1h
rook-ceph-osd-1-76c757c5f5-2xxrp   1/1     Running   0          3h47m
rook-ceph-osd-2-6bcb99c85d-c7dmq   1/1     Running   0          3h44m
Investigate pods which aren't in state Running with 1/1 ready containers.
The command should show 3 pods.
If there are fewer than 3 pods, investigate the Ceph cluster and CephCluster resource status.
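To investigate a pod which isn't running, you can for example describe it and check its logs (replace <POD_NAME> with the name of the affected pod):
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" describe pod <POD_NAME> (1)
[ ... pod events and container statuses ... ]
$ kubectl -n "${ceph_cluster_ns}" logs <POD_NAME> --previous (2)
[ ... logs of the previous container instance ... ]
1 | Check the pod's events and container statuses for scheduling, image pull or mount issues |
2 | Show the logs of the previous container instance if the pod is crash looping |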
Check Ceph cluster status
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN (1)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
  [ ... remaining output omitted ... ]
1 | General cluster health status |
2 | One or more lines of information giving details on why the cluster state is degraded. Only available if the cluster health isn't HEALTH_OK. |
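To see which OSDs are down and on which hosts they run, you can additionally check the detailed health output and the OSD tree, for example:
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph health detail (1)
[ ... detailed health information ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph osd tree (2)
[ ... OSD tree ... ]
1 | Show the detailed reasons for the degraded cluster health |
2 | Show all OSDs grouped by host, including their up/down status |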
Check Ceph crash logs
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls (1)
[ ... list of crash logs ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash info <CRASH_ID> (2)
[ ... detailed crash info ... ]
1 | List crash logs which haven't been archived yet |
2 | Show detailed information of crash log with id <CRASH_ID> |
Archive Ceph crash logs
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash archive-all (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
ceph crash archive <CRASH_ID> (2)
1 | Archive all crash logs which haven't been archived yet |
2 | Archive crash log with id <CRASH_ID> |
Check CephCluster resource status
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster}" describe cephcluster
[ ... metadata and spec omitted ... ]
Status:
  Ceph:
    Capacity:
      Bytes Available:  574670872576
      Bytes Total:      578156433408
      Bytes Used:       3485560832
      Last Updated:     2022-11-03T13:54:53Z
    Fsid:             a5231b40-1896-448d-aae5-9b37f3d16bee
    Health:           HEALTH_OK (1)
    Last Changed:     2022-11-03T10:31:44Z (2)
    Last Checked:     2022-11-03T13:54:53Z
    Previous Health:  HEALTH_WARN
    Versions:
      Mds:
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  2 (3)
      Mgr:
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  1 (4)
      Mon:
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  3 (5)
      Osd:
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  3 (6)
      Overall:
        ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable):  9 (7)
  Conditions:
    Last Heartbeat Time:   2022-11-03T13:54:53Z
    Last Transition Time:  2022-11-02T12:22:41Z
    Message:               Cluster created successfully
    Reason:                ClusterCreated
    Status:                True
    Type:                  Ready (8)
  Message:              Cluster created successfully
  Observed Generation:  3
  Phase:                Ready
  State:                Created
  Storage:
    Device Classes:
      Name:  hdd
  Version:
    Image:    quay.io/ceph/ceph:v17.2.5 (9)
    Version:  17.2.5-0
1 | Current Ceph cluster health |
2 | Time of last Ceph cluster health status change |
3 | List of running Ceph MDS version(s) |
4 | List of running Ceph MGR version(s) |
5 | List of running Ceph MON version(s) |
6 | List of running Ceph OSD version(s) |
7 | List of all running Ceph version(s) |
8 | Current condition of CephCluster resource |
9 | Ceph docker image used for the Ceph cluster |
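If you only need the cluster phase and the Ceph health, you can extract those fields directly instead of reading the full describe output. The example below assumes that there's exactly one CephCluster resource in the namespace.
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get cephcluster \
    -o jsonpath='{.items[0].status.phase}{"\n"}{.items[0].status.ceph.health}{"\n"}'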
If the current condition of the CephCluster resource isn't Ready, check the Rook-Ceph operator logs for any errors.
$ ceph_operator_ns=syn-rook-ceph-operator
$ kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h (1)
1 | Adjust the --since parameter depending on the age of the alert |
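To narrow down the operator log output, you can filter it for error messages, for example:
$ kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h | grep -iE 'error|fail'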