Alert rule: CephFilesystemDegraded

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available; at worst it is completely unusable.
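
To get a quick overview of which ranks are affected, the cluster health and filesystem status can be checked from the Rook toolbox. A minimal sketch, assuming the same namespace and toolbox deployment used in the debugging steps below:

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    # Show overall cluster health, including failed or damaged MDS ranks
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph health detail
    # Show each filesystem's ranks and their current state (active, failed, damaged, ...)
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph fs status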

Steps for debugging

Check for damaged metadata

  1. List the MDS daemons and their metadata to identify the daemon serving the damaged filesystem

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph mds metadata
  2. Check for damaged metadata in the mds admin socket

    # Find the mds daemon pod
    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ damaged_fspool=<FSPOOL> # e.g. fspool-a
    $ kubectl -n ${ceph_cluster_ns} get pods | grep ceph-mds | grep ${damaged_fspool}
    # Query the mds admin socket for damaged metadata
    $ kubectl -n ${ceph_cluster_ns} exec -it <POD> -- ceph daemon /var/run/ceph/ceph-mds.${damaged_fspool}.asok damage ls
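
As an alternative to querying the admin socket in step 2, recent Ceph releases can also list the damage table through ceph tell, which runs from the toolbox pod and avoids having to locate the MDS pod. A sketch, assuming rank 0 of the damaged filesystem is the affected rank (adjust the rank as needed):

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ damaged_fspool=<FSPOOL> # e.g. fspool-a
    # List damage entries for rank 0 of the filesystem via the toolbox
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph tell mds.${damaged_fspool}:0 damage ls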