Alert rule: APPUiOCloudReportingDatabaseBackfillingFailed

Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert. Click "Edit this Page" in the top right corner to create a PR directly on GitHub.

Overview

APPUiO Cloud Reporting failed while trying to backfill Prometheus metering data to the reporting database. This might leave gaps in the reported data and totals will be too low in the next invoice run.

Deleting the failed job will clear this alert.

Steps for debugging

  • Check job logs.

  • Check for missing migrations:

    NAMESPACE=appuio-cloud-reporting
    FAILED_JOB=job/XXXXXX-XXXXXX
    
    oc --as=cluster-admin -n "${NAMESPACE}" debug "${FAILED_JOB}" \
      --keep-init-containers=false --image=ghcr.io/appuio/appuio-cloud-reporting:latest -- \
      appuio-cloud-reporting migrate --show-pending
    
    # If pending migrations are shown
    oc --as=cluster-admin -n "${NAMESPACE}" debug "${FAILED_JOB}" \
      --keep-init-containers=false --image=ghcr.io/appuio/appuio-cloud-reporting:latest -- \
      appuio-cloud-reporting migrate
    oc --as=cluster-admin -n "${NAMESPACE}" debug "${FAILED_JOB}" \
      --keep-init-containers=false --image=ghcr.io/appuio/appuio-cloud-reporting:latest -- \
      appuio-cloud-reporting migrate --seed
  • Check Prometheus/Thanos connectivity and logs.

After fixing the underlying issue, you need to backfill all gaps in the reporting database.

Backfill gaps in reporting database

This runbook expects jq to be installed.

On macOS it expects the date command from GNU coreutils to be installed as gdate (brew install coreutils).

  1. Find failed jobs

CRONJOB_NAME=<name of the cronjob that has failing jobs>

query_name=$(kubectl get cronjobs $CRONJOB_NAME "-o=jsonpath={.metadata.annotations['query-name']}")
echo $query_name

failed_timestamps=$(kubectl get jobs --selector=cron-job-name=$CRONJOB_NAME -ojson \
  | jq -r '[.items[] | select((.status.conditions[] | select( .type == "Failed" ) | length ) >  0)]
            | .[].metadata.creationTimestamp')
echo $failed_timestamps
  1. Review query name and failed timestamps

  2. Re-run the queries to backfill the gaps

job_template=$(kubectl get jobs --selector=cron-job-name=$CRONJOB_NAME -ojson \
  | jq -r '.items | first | .metadata.name')

date_bin=date
if [[ "$OSTYPE" == "darwin"* ]]; then date_bin=gdate; fi

echo $failed_timestamps \
  | xargs -I @ gdate --date @ +"%Y-%m-%dT%H:00:00Z" \
  | xargs -n1 oc --as=cluster-admin debug job/$job_template -- appuio-cloud-reporting report --query-name=$query_name --begin