Alert rule: CephMgrPrometheusModuleInactive

Overview

The MGR/Prometheus module is unreachable. This can mean that the module has been disabled or that the MGR itself is down. Without the MGR/Prometheus module, Ceph metrics are no longer collected and the alerts that depend on them stop working.

Steps for debugging

Check if the MGR is active

Run `ceph status` and check whether the `services` section of the output lists an active MGR.

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph status
  cluster:
    id:     92716509-0f84-4739-8d04-541d2e7c3e66
    health: HEALTH_WARN (1)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)
            [ ... detailed information ... ] (2)

[ ... remaining output omitted ... ]

1	General cluster health status
2	One or more lines of information giving details why the cluster state is degraded. Only available if the cluster health isn’t `HEALTH_OK`.
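
The MGR can also be checked from the Kubernetes side, which helps when the toolbox command itself is slow or unavailable. The following is a minimal sketch, assuming the MGR pods carry the `app=rook-ceph-mgr` label used later in this runbook. The function name `mgr_pod_phases` is made up for illustration.

```shell
# Print the name and phase of every MGR pod in the given namespace.
# Assumption: MGR pods are labelled app=rook-ceph-mgr.
mgr_pod_phases() {
  kubectl -n "$1" get pods -l app=rook-ceph-mgr \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Usage:
#   mgr_pod_phases "${ceph_cluster_ns}"
```

An empty result, or a phase other than `Running`, suggests looking at the MGR pod itself before investigating the module.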

Check Ceph crash logs

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph crash ls (1)
[ ... list of crash logs ... ]
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash info <CRASH_ID> (2)
[ ... detailed crash info ... ]

1	List all crash logs, both new and archived (use `ceph crash ls-new` to list only those that haven't been archived yet)
2	Show detailed information about the crash log with ID `<CRASH_ID>`
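
For this alert, crashes reported by MGR daemons are usually the most relevant ones. The filter below is a sketch that narrows the `ceph crash ls` listing down to those entries. It assumes the tabular output has the crash ID (starting with a timestamp) in the first column and the daemon entity, for example `mgr.a`, in the second. The function name `mgr_crash_ids` is hypothetical.

```shell
# Print the IDs of crash logs reported by MGR daemons.
# Reads `ceph crash ls` output on stdin.
# Assumption: column 1 is the crash ID, column 2 is the entity (e.g. mgr.a).
mgr_crash_ids() {
  awk '$1 ~ /^[0-9][0-9][0-9][0-9]-/ && $2 ~ /^mgr\./ { print $1 }'
}

# Usage (without -it, since the output is piped):
#   kubectl -n "${ceph_cluster_ns}" exec deploy/rook-ceph-tools -- ceph crash ls | mgr_crash_ids
```

Each printed ID can then be passed to `ceph crash info`.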

Archive Ceph crash logs

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash archive-all (1)
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- \
      ceph crash archive <CRASH_ID> (2)

1	Archive all crash logs that haven't been archived yet
2	Archive the crash log with ID `<CRASH_ID>`

Activate the MGR/Prometheus module

# Check if the module is enabled (request JSON explicitly so that jq can parse the output)
$ ceph_cluster_ns=syn-rook-ceph-cluster
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph mgr module ls --format json | jq '.enabled_modules'
# Enable the module if `prometheus` isn't listed
$ kubectl -n "${ceph_cluster_ns}" exec -it deploy/rook-ceph-tools -- ceph mgr module enable prometheus
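
Once the module is enabled, the MGR should advertise an endpoint for it. `ceph mgr services` prints the endpoints of the running MGR modules as JSON, so it can confirm that the module actually started. A minimal sketch follows. The helper name `mgr_prometheus_service` is hypothetical, and the `grep` match assumes the endpoint is listed under the key `prometheus`.

```shell
# Print the Prometheus endpoint advertised by the MGR, if any.
# Exits non-zero when no "prometheus" entry is present.
# Assumption: the endpoint appears under the JSON key "prometheus".
mgr_prometheus_service() {
  kubectl -n "$1" exec deploy/rook-ceph-tools -- ceph mgr services | grep '"prometheus"'
}

# Usage:
#   mgr_prometheus_service "${ceph_cluster_ns}" || echo "module not serving"
```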

Check that Prometheus can scrape the MGR pod

If the kubectl exec command in the snippet below hangs or otherwise fails, check:

Node-to-node firewall rules, since Ceph runs in host network mode
Network policies between the Prometheus namespace and the Ceph cluster namespace

$ ceph_cluster_ns=syn-rook-ceph-cluster
$ mgr_ip=$(kubectl -n "${ceph_cluster_ns}" get pods -l app=rook-ceph-mgr \
      -o jsonpath='{.items[0].status.podIP}')
$ monitoring_ns=openshift-monitoring (1)
$ prometheus_pod=$(kubectl -n "${monitoring_ns}" get pods -l app=prometheus \
      -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n "${monitoring_ns}" exec -it "${prometheus_pod}" -- \
      curl "http://${mgr_ip}:9283/metrics"
[ ... metrics output omitted ... ]

1	Adjust the namespace to match your Kubernetes distribution
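
Reading the full metrics output by eye is tedious, so the scrape can be reduced to a single number. The sketch below counts the samples whose name starts with `ceph_`, the prefix the MGR/Prometheus module uses for its metrics. The function name `count_ceph_samples` is made up for illustration, and the `curl` options in the usage comment are a suggestion, not part of the original procedure.

```shell
# Count the Ceph samples in a metrics scrape read from stdin.
# grep -c prints the number of matching lines and exits non-zero when it is 0.
count_ceph_samples() {
  grep -c '^ceph_'
}

# Usage (--max-time keeps the check from hanging on blocked traffic):
#   kubectl -n "${monitoring_ns}" exec "${prometheus_pod}" -- \
#       curl -s --max-time 10 "http://${mgr_ip}:9283/metrics" | count_ceph_samples
```

A count of `0` with a non-zero exit status means the endpoint answered without Ceph metrics or didn't answer at all.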