Alert rule: APPUiOReportingDatabaseBackfillingFailed
Please consider opening a PR to improve this runbook if you gain new information about causes of the alert, or how to debug or resolve the alert.
Overview
APPUiO Reporting failed while trying to backfill Prometheus metering data to the reporting database. This might leave gaps in the reported data, in which case totals will be too low in the next invoice run.
Deleting the failed job will clear this alert.
Steps for debugging
-
Check job logs.
-
Check for missing migrations:
NAMESPACE=appuio-reporting
FAILED_JOB=job/XXXXXX-XXXXXX

oc --as=cluster-admin -n "${NAMESPACE}" debug "${FAILED_JOB}" \
  --keep-init-containers=false --image=ghcr.io/appuio/appuio-reporting:latest -- \
  appuio-reporting migrate --show-pending

# If pending migrations are shown
oc --as=cluster-admin -n "${NAMESPACE}" debug "${FAILED_JOB}" \
  --keep-init-containers=false --image=ghcr.io/appuio/appuio-reporting:latest -- \
  appuio-reporting migrate
oc --as=cluster-admin -n "${NAMESPACE}" debug "${FAILED_JOB}" \
  --keep-init-containers=false --image=ghcr.io/appuio/appuio-reporting:latest -- \
  appuio-reporting migrate --seed
-
Check Prometheus/Thanos connectivity and logs.
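For the first step, the job logs can usually be read with oc logs as long as the pod of the failed job still exists. The following is a minimal sketch, assuming the same NAMESPACE and FAILED_JOB placeholders as in the migration check. show_job_logs is a hypothetical helper name, and the --all-containers and --tail flags are a suggestion rather than part of the original procedure.

```shell
NAMESPACE=appuio-reporting
FAILED_JOB=job/XXXXXX-XXXXXX   # placeholder, replace with the failed job

# Hypothetical helper: print the last lines of the failed job's logs.
# $1: number of lines to show (defaults to 100).
show_job_logs() {
  oc --as=cluster-admin -n "${NAMESPACE}" logs "${FAILED_JOB}" \
    --all-containers --tail="${1:-100}"
}
```

Calling show_job_logs 200 would print the last 200 log lines of each container in the job's pod.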
After fixing the underlying issue, you need to backfill all gaps in the reporting database.
Backfill gaps in reporting database
This runbook expects GNU date. On macOS it expects the gdate command, which the commands below use in place of date.
-
Find failed jobs
CRONJOB_NAME=<name of the cronjob that has failing jobs>
NAMESPACE=appuio-reporting
query_name=$(kubectl -n $NAMESPACE get cronjobs $CRONJOB_NAME "-o=jsonpath={.metadata.annotations['query-name']}")
echo $query_name
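# Collect the creation timestamp of every job that has a "Failed" condition.
# The "?" skips jobs that have no conditions yet (for example, still running).
failed_timestamps=$(kubectl -n $NAMESPACE get jobs --selector=cron-job-name=$CRONJOB_NAME -ojson \
| jq -r '.items[] | select(any(.status.conditions[]?; .type == "Failed")) | .metadata.creationTimestamp')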
echo $failed_timestamps
-
Review query name and failed timestamps
-
Re-run the queries to backfill the gaps
NAMESPACE=appuio-reporting
job_template=$(kubectl -n $NAMESPACE get jobs --selector=cron-job-name=$CRONJOB_NAME -ojson \
| jq -r '.items | first | .metadata.name')
date_bin=date
if [[ "$OSTYPE" == "darwin"* ]]; then date_bin=gdate; fi
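# Truncate each timestamp to the full hour in UTC, then run one report per timestamp.
# The quotes keep one timestamp per line; --utc avoids a shift by the local time zone.
echo "$failed_timestamps" \
| xargs -I @ $date_bin --utc --date @ +"%Y-%m-%dT%H:00:00Z" \
| xargs -n1 oc -n $NAMESPACE --as=cluster-admin debug job/$job_template -- appuio-reporting report --query-name=$query_name --begin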
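Before running the backfill, it can help to preview which commands the last pipeline would execute. The sketch below is a dry run under stated assumptions: truncate_to_hour and preview_backfill are hypothetical helper names, the job name, query name and timestamps in the example are made up, and GNU date is assumed (gdate on macOS). It only prints the commands and does not contact the cluster.

```shell
# Pick GNU date; on macOS it is assumed to be available as gdate.
date_bin=date
case "${OSTYPE:-}" in darwin*) date_bin=gdate ;; esac

# Hypothetical helper: truncate a timestamp to the start of its hour.
# --utc keeps the result independent of the local time zone.
truncate_to_hour() {
  "$date_bin" --utc --date "$1" +"%Y-%m-%dT%H:00:00Z"
}

# Hypothetical helper: print (do not run) one report command per
# timestamp read from stdin. $1: job name to debug, $2: query name.
preview_backfill() {
  while IFS= read -r ts; do
    [ -n "$ts" ] || continue
    echo "oc -n appuio-reporting --as=cluster-admin debug job/$1 --" \
      "appuio-reporting report --query-name=$2 --begin $(truncate_to_hour "$ts")"
  done
}

# Example with made-up values:
printf '%s\n' "2023-01-01T05:17:42Z" "2023-01-01T06:00:03Z" \
  | preview_backfill example-job example-query
```

With real data, piping echo "$failed_timestamps" into preview_backfill "$job_template" "$query_name" would list the intended report runs for review before anything is executed.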