Alert rule: CephOSDDiskNotResponding

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert.

Overview

This alert fires if an OSD isn’t running. One possible cause is that a storage node is unavailable.

Steps for debugging

Check that all storage nodes are ready

$ kubectl get nodes -l node-role.kubernetes.io/storage
NAME           STATUS   ROLES            AGE     VERSION
storage-9649   Ready    storage,worker   6d18h   v1.20.0+87cc9a4
storage-96bf   Ready    storage,worker   6d18h   v1.20.0+87cc9a4
storage-cbf0   Ready    storage,worker   6d19h   v1.20.0+87cc9a4

Investigate any nodes which show as NotReady in the output of the previous command.
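
To narrow the node list down to the nodes that need attention, the output of the command above can be filtered. This is a minimal sketch, assuming the default `kubectl get nodes` column layout (name in column 1, status in column 2). The helper name `not_ready_nodes` is hypothetical and not part of any tool.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS isn't exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are printed as well.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}
```

Possible usage: `kubectl get nodes -l node-role.kubernetes.io/storage --no-headers | not_ready_nodes`. For each node that is printed, `kubectl describe node <name>` shows the node conditions and recent events.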

Check OSD pod status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-7cb86cd754-hvbkm   1/1     Running   0          4d1h
rook-ceph-osd-1-76c757c5f5-2xxrp   1/1     Running   0          3h47m
rook-ceph-osd-2-6bcb99c85d-c7dmq   1/1     Running   0          3h44m

Investigate pods which aren’t in state Running with 1/1 ready containers.

The command should show 3 pods. If there are fewer than 3 pods, investigate the Ceph cluster and CephCluster resource status.
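
The check for pods that aren’t Running with 1/1 ready containers can be sketched as a small filter. It assumes the default `kubectl get pods` column layout (name, ready count, status). The helper name `unhealthy_osd_pods` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints name, ready count and status of every pod that isn't
# Running with 1/1 ready containers.
unhealthy_osd_pods() {
  awk '$2 != "1/1" || $3 != "Running" { print $1, $2, $3 }'
}
```

Possible usage: `kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd --no-headers | unhealthy_osd_pods`. For each pod that is printed, `kubectl -n "${ceph_cluster_ns}" describe pod <pod>` and `kubectl -n "${ceph_cluster_ns}" logs <pod>` are reasonable next steps. Add `--previous` to the `logs` command if the container has restarted.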

Check Ceph cluster status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN (1)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)

[ ... remaining output omitted ... ]
1 General cluster health status
2 One or more lines of information giving details why the cluster state is degraded. Only available if the cluster health isn’t HEALTH_OK.
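
If the health details mention OSDs that are down, `ceph osd tree` (run through the same toolbox deployment) shows which OSD on which host is affected, and `ceph health detail` expands the summary lines. The following is a minimal sketch for extracting the down OSDs. It assumes the usual tree layout, in which the status column directly follows the `osd.<id>` name. The helper name `down_osds` is hypothetical.

```shell
# Hypothetical helper: reads `ceph osd tree` output on stdin and prints
# the name of every OSD whose status column reads "down".
down_osds() {
  awk '{
    # Look for an "osd.<id>" field that is directly followed by "down".
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
  }'
}
```

Possible usage: `kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph osd tree | down_osds`. The `-it` flags are left out here on purpose, since the output is piped and no terminal is needed.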

Check CephCluster resource status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" describe cephcluster
[ ... metadata and spec omitted ... ]
Status:
  Ceph:
    Capacity:
      Bytes Available:  305409896448
      Bytes Total:      578156433408
      Bytes Used:       272746536960
      Last Updated:     2021-07-13T12:02:48Z
    Health:             HEALTH_OK (1)
    Last Changed:       2021-07-13T09:18:38Z (2)
    Last Checked:       2021-07-13T12:02:48Z
    Previous Health:    HEALTH_WARN
    Versions:
      Mgr:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable):  1 (3)
      Mon:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable):  3 (4)
      Osd:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable):  3 (5)
      Overall:
        ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable):  7 (6)
  Conditions:
    Last Heartbeat Time:   2021-07-13T12:02:48Z
    Last Transition Time:  2021-07-06T18:07:00Z
    Message:               Cluster created successfully
    Reason:                ClusterCreated
    Status:                True
    Type:                  Ready (7)
  Message:                 Cluster created successfully
  Phase:                   Ready
  State:                   Created
  Storage:
    Device Classes:
      Name:  hdd
  Version:
    Image:    docker.io/ceph/ceph:v16.2.4 (8)
    Version:  16.2.4-0
1 Current Ceph cluster health
2 Time of last Ceph cluster health status change
3 List of running Ceph MGR version(s)
4 List of running Ceph MON version(s)
5 List of running Ceph OSD version(s)
6 List of all running Ceph version(s)
7 Current condition of CephCluster resource
8 Ceph container image used for the Ceph cluster
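
The two fields that matter most for this alert, the Ceph health and the resource phase, can be pulled out of the describe output with a small filter. This is a sketch that assumes the field labels shown in the output above (`Health:` and `Phase:`). The helper name `cephcluster_summary` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl describe cephcluster` output on stdin
# and prints only the current Ceph health and the resource phase.
# "Previous Health:" isn't matched because its first word is "Previous".
cephcluster_summary() {
  awk '$1 == "Health:" { print "health=" $2 }
       $1 == "Phase:"  { print "phase=" $2 }'
}
```

Possible usage: `kubectl -n "${ceph_cluster_ns}" describe cephcluster | cephcluster_summary`.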

If the current condition of the CephCluster resource isn’t Ready, check the Rook-Ceph operator logs for any errors.

$ ceph_operator_ns=syn-rook-ceph-operator
$ kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h (1)
1 Adjust parameter --since depending on the age of the alert
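
The operator log can be long, so it may help to reduce it to warnings and errors first. This sketch assumes the operator's usual log line layout of date, time, a one-letter level (`I`, `W`, `E`) and a `|` separator. Verify this against the actual output before relying on it. The helper name `operator_problems` is hypothetical.

```shell
# Hypothetical helper: reads Rook-Ceph operator log lines on stdin and keeps
# only lines whose third field is the level letter "E" (error) or "W" (warning).
operator_problems() {
  awk '$3 == "E" || $3 == "W"'
}
```

Possible usage: `kubectl -n "${ceph_operator_ns}" logs deploy/rook-ceph-operator --since=2h | operator_problems`.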