Alert rule: CephOSDFlapping

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

This alert fires when a Ceph OSD was marked down and back up at least once a minute for 5 minutes. This may indicate a network issue (latency, packet loss, MTU mismatch) on the cluster network, or on the public network if no cluster network is deployed. Check the network stats on the listed host(s).
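
For a first look at the network, you can inspect the interface counters directly on an affected node, for example from a node debug pod. The commands below are a sketch only; the nicolaka/netshoot image, the interface name eth0 and an MTU of 9000 are assumptions, adjust them to your environment.

$ kubectl debug node/<NODE_NAME> -it --image=nicolaka/netshoot (1)
$ ip -s link show eth0 (2)
$ ping -M do -s 8972 <OTHER_NODE_IP> (3)
1 Start a debug pod with host networking on the affected node; the following commands run inside that pod
2 Show packet, error and drop counters for the interface
3 Send non-fragmentable packets sized for a 9000 byte MTU (payload = MTU - 28) to detect MTU mismatches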

Steps for debugging

Check OSD pod status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-7cb86cd754-hvbkm   1/1     Running   0          4d1h
rook-ceph-osd-1-76c757c5f5-2xxrp   1/1     Running   0          3h47m
rook-ceph-osd-2-6bcb99c85d-c7dmq   1/1     Running   0          3h44m

Investigate pods which have more than zero restarts.
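
One way to spot the affected pods at a glance is to sort the listing by restart count, for example:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-osd \
      --sort-by='.status.containerStatuses[0].restartCount' \
      -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'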

Check if pod restarts were caused by resource limits

For pods with more than zero restarts, check if they were restarted due to resource consumption.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" describe pod <POD_WITH_RESTARTS>

In particular, check whether the Last State section for the osd container shows that the previous container was terminated with reason OOMKilled.
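
The previous container's termination reason can also be read out directly, and its logs are still available. This is a sketch; the container name osd matches the container mentioned above.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" get pod <POD_WITH_RESTARTS> \
      -o jsonpath='{.status.containerStatuses[?(@.name=="osd")].lastState.terminated.reason}{"\n"}' (1)
$ kubectl -n "${ceph_cluster_ns}" logs <POD_WITH_RESTARTS> -c osd --previous (2)
1 Print the termination reason of the previous osd container, for example OOMKilled
2 Show the logs of the previous osd container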

Check Ceph status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN (1)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)

[ ... remaining output omitted ... ]
1 General cluster health status
2 One or more lines of information explaining why the cluster state is degraded. These lines only appear if the cluster health isn’t HEALTH_OK.
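
To see only the lines explaining the degraded health, and which OSDs are down and on which hosts, the following commands can help:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph health detail (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph osd tree (2)
1 Print detailed reasons for the current health status
2 Show all OSDs with their host and up/down status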

Check Ceph crash logs

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls (1)
[ ... list of crash logs ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash info <CRASH_ID> (2)
[ ... detailed crash info ... ]
1 List crash logs which haven't been archived yet
2 Show detailed information for the crash log with ID <CRASH_ID>
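
A summary of recorded crashes grouped by age is also available, assuming the Ceph release in use ships the crash module:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash stat
[ ... crash summary ... ]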

Archive Ceph crash logs

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash archive-all (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash archive <CRASH_ID> (2)
1 Archive all crash logs which haven't been archived yet
2 Archive the crash log with ID <CRASH_ID>
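
After archiving, verify that the cluster health recovers, assuming no other issues are pending:

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph health
HEALTH_OK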