Alert rule: CephFilesystemDamaged

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

CephFS metadata has been damaged, so some data may be inaccessible. The steps below are starting points for inspecting the damage through the MDS daemon admin socket.

Steps for debugging

Check for damaged metadata

  1. List the MDS daemons and their metadata to identify the affected filesystem pool

    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ kubectl -n ${ceph_cluster_ns} exec -it deploy/rook-ceph-tools -- ceph mds metadata

  2. Check for damaged metadata in the MDS admin socket

    # Find the mds daemon pod
    $ ceph_cluster_ns=syn-rook-ceph-cluster
    $ damaged_fspool=<FSPOOL> # e.g. fspool-a
    $ kubectl -n ${ceph_cluster_ns} get pods | grep ceph-mds | grep ${damaged_fspool}
    # Query the mds admin socket for damaged metadata
    $ kubectl -n ${ceph_cluster_ns} exec -it <POD> -- ceph daemon /var/run/ceph/ceph-mds.${damaged_fspool}.asok damage ls
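
The two lookups in step 2 can be combined into small shell helpers. The following is a minimal sketch, not part of Rook or Ceph: the function names `mds_asok_path`, `find_mds_pod` and `mds_damage_ls` are hypothetical, and it assumes the MDS pod name contains both `ceph-mds` and the MDS daemon name, as in the `grep` pipeline above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not provided by Rook or Ceph.

# Namespace of the Rook Ceph cluster (same value as in the steps above).
ceph_cluster_ns="${ceph_cluster_ns:-syn-rook-ceph-cluster}"

# Print the admin socket path for an MDS daemon name, e.g. fspool-a.
mds_asok_path() {
    printf '/var/run/ceph/ceph-mds.%s.asok\n' "$1"
}

# Print the first pod (as pod/<name>) whose name contains "ceph-mds"
# and the given MDS daemon name. Assumes that naming pattern holds.
find_mds_pod() {
    kubectl -n "${ceph_cluster_ns}" get pods -o name \
        | grep ceph-mds | grep "$1" | head -n 1
}

# Query the admin socket of the given MDS daemon for damage entries.
mds_damage_ls() {
    pod="$(find_mds_pod "$1")"
    if [ -z "${pod}" ]; then
        echo "no MDS pod found for $1" >&2
        return 1
    fi
    kubectl -n "${ceph_cluster_ns}" exec "${pod}" -- \
        ceph daemon "$(mds_asok_path "$1")" damage ls
}
```

After sourcing these functions, `mds_damage_ls fspool-a` would run the same query as step 2 against the first matching pod. If several MDS pods match, check each one individually.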