Alert rule: CephSlowOps

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert.

Overview

This alert fires when Ceph OSD requests take too long to process. Slow OSD requests can cause PVCs to take a long time to become bound to a PV.

Slow requests are most often caused by heavy load on the Ceph cluster. See the Ceph documentation for a more detailed explanation of possible causes of slow requests.

Steps for debugging

Check Ceph status

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN (1)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)

[ ... remaining output omitted ... ]

1	General cluster health status
2	One or more lines explaining why the cluster state is degraded. Only shown if the cluster health isn’t `HEALTH_OK`.
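
When the health details mention slow ops, the affected daemons are usually named as `osd.<ID>`. The following is a minimal sketch for extracting those IDs from the output of `ceph health detail`. The helper name `slow_ops_osd_ids` is hypothetical, and the filter assumes that the relevant lines contain the phrase `slow ops`, which may vary between Ceph releases.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# the numeric IDs of all OSDs named on lines that mention "slow ops".
# Assumption: affected daemons appear as "osd.<ID>" on those lines.
slow_ops_osd_ids() {
  grep -i 'slow ops' | grep -oE 'osd\.[0-9]+' | sed 's/^osd\.//' | sort -un
}

# Example use (no -t flag, because the output is piped):
#   kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- \
#       ceph health detail | slow_ops_osd_ids
```

The printed IDs can be used as values for `osd_id` when checking the OSD slow op log.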

Check Ceph crash logs

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls (1)
[ ... list of crash logs ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash info <CRASH_ID> (2)
[ ... detailed crash info ... ]

1	List crash logs that have not been archived yet
2	Show detailed information for the crash log with ID `<CRASH_ID>`
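
If several crash logs are listed, the two commands above can be combined into a loop. This is a sketch only: the function names `crash_ids` and `show_all_crashes` are hypothetical, and the parsing assumes that `ceph crash ls` prints a header row whose first column is `ID`, followed by one row per crash that starts with the crash ID.

```shell
# Hypothetical helper: print the first column of every non-empty row,
# skipping the header row (assumed to start with "ID").
crash_ids() {
  awk 'NF > 0 && $1 != "ID" { print $1 }'
}

# Hypothetical helper: show detailed info for every listed crash.
# $1 is the namespace of the Ceph cluster.
show_all_crashes() {
  ns="$1"
  kubectl -n "${ns}" exec deploy/rook-ceph-tools -- ceph crash ls |
    crash_ids |
    while read -r id; do
      echo "### ${id}"
      # </dev/null keeps kubectl from consuming the loop's input
      kubectl -n "${ns}" exec deploy/rook-ceph-tools -- \
        ceph crash info "${id}" </dev/null
    done
}

# Example use:
#   show_all_crashes "${ceph_cluster_ns}"
```

The `-it` flags are left out here because the output is piped rather than shown in an interactive terminal.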

Archive Ceph crash logs

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash archive-all (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash archive <CRASH_ID> (2)

1	Archive all crash logs that have not been archived yet
2	Archive the crash log with ID `<CRASH_ID>`

Check OSD slow op log

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ osd_id=0 (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it -c osd deploy/rook-ceph-osd-${osd_id} -- \
      bash -c "unset CEPH_ARGS; ceph daemon osd.${osd_id} dump_blocked_ops" (2)
$ kubectl -n "${ceph_cluster_ns}" exec -it -c osd deploy/rook-ceph-osd-${osd_id} -- \
      bash -c "unset CEPH_ARGS; ceph daemon osd.${osd_id} dump_historic_slow_ops" (3)
$ kubectl -n "${ceph_cluster_ns}" exec -it -c osd deploy/rook-ceph-osd-${osd_id} -- \
      bash -c "unset CEPH_ARGS; ceph daemon osd.${osd_id} ops" (4)

1	Set variable `osd_id` to the ID (0,1,2,…) of the OSD which has been reported as having slow ops.
2	Show currently blocked ops on OSD
3	Show recent slow ops for OSD
4	Show all ops in flight on OSD
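
When more than one OSD is reported with slow ops, the commands above can be wrapped in a loop. The sketch below assumes the same deployment naming (`rook-ceph-osd-<ID>`) and container name (`osd`) as the commands above. The function name `dump_osd_slow_ops` is hypothetical.

```shell
# Hypothetical helper: dump blocked and recent slow ops for several OSDs.
# $1 is the namespace of the Ceph cluster, the remaining arguments are OSD IDs.
dump_osd_slow_ops() {
  ns="$1"
  shift
  for osd_id in "$@"; do
    for cmd in dump_blocked_ops dump_historic_slow_ops; do
      echo "### osd.${osd_id}: ${cmd}"
      kubectl -n "${ns}" exec -c osd "deploy/rook-ceph-osd-${osd_id}" -- \
        bash -c "unset CEPH_ARGS; ceph daemon osd.${osd_id} ${cmd}"
    done
  done
}

# Example use for OSDs 0 and 2:
#   dump_osd_slow_ops "${ceph_cluster_ns}" 0 2
```

Each dump is prefixed with a `###` marker line so that the output of different OSDs can be told apart when it is redirected to a file.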

Upstream documentation

https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops