# Alert rule: RookCephOperatorScaledDown
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.
## Overview
This alert fires if the Rook-Ceph operator deployment is scaled to 0 for more than an hour. While the operator is scaled to 0, the Ceph cluster isn’t actively managed and could start degrading.
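To confirm the condition, you can check the operator deployment's replica count directly. This is a hedged sketch: the namespace `syn-rook-ceph-operator` and deployment name `rook-ceph-operator` are assumptions and may differ on your cluster, and the output shown is illustrative.

```
$ kubectl -n syn-rook-ceph-operator get deployment rook-ceph-operator
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
rook-ceph-operator   0/0     0            0           42d
```

A `READY` value of `0/0` (zero desired replicas) matches the alert condition.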
## Steps for debugging
### Check if the `rook-ceph` ArgoCD app is synced and healthy
```
$ kubectl -n syn get app rook-ceph
NAME        SYNC STATUS   HEALTH STATUS
rook-ceph   Synced        Healthy
```
If the output of the `kubectl` command indicates that the app isn’t synced and healthy, check the app in ArgoCD.
You can use the `argocd` CLI or the web interface to do so.
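For example, with the `argocd` CLI (a sketch, assuming you’re already logged in and the app is named `rook-ceph` as above):

```
$ argocd app get rook-ceph
$ argocd app sync rook-ceph
```

`argocd app get` shows the app’s sync and health status in detail; `argocd app sync` triggers a re-sync if the app has drifted.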
### Check configured replicas in cluster catalog
Verify that the operator deployment manifest in the cluster catalog specifies `.spec.replicas=1`.
The cluster catalog is linked in the "GitRepo URL" column on control.vshn.net.
The operator deployment manifest can be found at `manifests/rook-ceph/01_rook_ceph_helmchart/rook-ceph/templates/deployment.yaml`.
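A minimal sketch of this check on a local catalog checkout. The here-doc below stands in for the real `deployment.yaml` at the path above (hypothetical content, healthy case) so the example is self-contained:

```shell
# Stand-in for manifests/rook-ceph/01_rook_ceph_helmchart/rook-ceph/templates/deployment.yaml
# from a cluster catalog checkout (hypothetical content, healthy case).
cat > deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-operator
spec:
  replicas: 1
EOF

# Extract the configured replica count. Anything other than 1 means the
# catalog itself scales the operator down, and the configuration must be
# fixed upstream rather than on the cluster.
grep -E '^[[:space:]]*replicas:' deployment.yaml
```

If the catalog correctly specifies one replica but the live deployment is still scaled to 0, the deployment was likely scaled down manually and re-syncing the ArgoCD app should restore it.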