Alert rule: RookCephOperatorScaledDown

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

This alert fires if the Rook-Ceph operator deployment is scaled to 0 for more than an hour. While the operator is scaled to 0, the Ceph cluster isn’t actively managed and could start degrading.
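
To confirm the current state, you can check the operator deployment's replica count directly on the cluster. The namespace and deployment name used below (syn-rook-ceph-operator and rook-ceph-operator) are the usual defaults and may differ on your cluster.

$ kubectl -n syn-rook-ceph-operator get deployment rook-ceph-operator -o jsonpath='{.spec.replicas}'

If the command prints 0, the deployment is scaled down and the alert condition is met.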

Steps for debugging

Check if the rook-ceph ArgoCD app is synced and healthy

$ kubectl -n syn get app rook-ceph
NAME        SYNC STATUS   HEALTH STATUS
rook-ceph   Synced        Healthy

If the output of the kubectl command indicates that the app isn't synced or isn't healthy, investigate the app in ArgoCD. You can use the argocd CLI or the web interface to do so.
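
For example, with the argocd CLI (assuming you're logged in to the ArgoCD instance that manages this cluster), you can inspect the app's status and trigger a sync:

$ argocd app get rook-ceph
$ argocd app sync rook-ceph

The output of `argocd app get` lists out-of-sync resources and health messages, which can help identify why the app is degraded.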

Check configured replicas in cluster catalog

Verify that the operator deployment manifest in the cluster catalog specifies .spec.replicas=1. The cluster catalog is linked in the "GitRepo URL" column on control.vshn.net. The operator deployment manifest can be found in manifests/rook-ceph/01_rook_ceph_helmchart/rook-ceph/templates/deployment.yaml.
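
As a sketch, assuming you have a local clone of the cluster catalog and yq (mikefarah v4) installed, you can check the configured replica count like this:

$ yq e 'select(.kind == "Deployment") | .spec.replicas' manifests/rook-ceph/01_rook_ceph_helmchart/rook-ceph/templates/deployment.yaml

The select filter keeps only the Deployment document in case the file contains multiple manifests. The value should be 1.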