Alert Group: syn-Prometheus
Alert Rule: SYN_PrometheusRemoteWriteDesiredShards
== Overview
This alert may indicate that the remote write receiver isn’t accepting metrics because of an internal problem, or that there’s a network issue between Prometheus and the remote write receiver. If the remote write receiver is a Mimir instance, the root cause may be that the nginx in front of the Mimir components has stale pod IPs in its DNS cache.
== icon:[search] Investigate
- Check that the remote write receiver’s endpoint is reachable:
kubectl -n openshift-monitoring --as=cluster-admin exec -it prometheus-k8s-0 -- curl <remote-write-endpoint>
- Check the remote write receiver for any issues. If the remote write receiver is a Mimir instance, check the Mimir nginx pod logs. Errors like the following indicate that nginx’s DNS cache contains stale pod IPs:
2022/12/13 08:50:31 [error] 9#9: *2748893 vshn-appuio-mimir-distributor-headless.vshn-appuio-mimir.svc.cluster.local could not be resolved (110: Operation timed out), client: 10.128.10.35, server: , request: "POST /api/v1/push HTTP/1.1", host: "metrics-receive.appuio.net"
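To scan the nginx logs for this kind of error, a filter along the following lines may help. This is a minimal sketch, not a verified procedure: the default namespace `vshn-appuio-mimir` is taken from the service name in the log line above, the label selector `app.kubernetes.io/component=nginx` is an assumption about how the Mimir nginx pods are labelled, and both function names are hypothetical.

```shell
# Hypothetical helper: keep only nginx error lines that report a failed DNS resolution.
filter_dns_errors() {
  grep -E '\[error\].*could not be resolved'
}

# Hypothetical helper: fetch recent nginx logs and apply the filter.
# $1: namespace (assumed default, taken from the example log line)
# $2: label selector (assumption; adjust to how your nginx pods are labelled)
nginx_dns_errors() {
  kubectl -n "${1:-vshn-appuio-mimir}" --as=cluster-admin logs \
    -l "${2:-app.kubernetes.io/component=nginx}" --tail=500 | filter_dns_errors
}
```

Running `nginx_dns_errors` with no arguments uses the assumed defaults. An empty result only means that no such error appeared in the last 500 log lines per pod.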
=== Resolve
If you’ve identified that the Mimir nginx is the cause of the issue, restart the nginx pod.
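One way to restart the nginx pod is to trigger a rollout of its workload. The sketch below assumes that the Mimir nginx runs as a Kubernetes Deployment. The function name is hypothetical, and the namespace and Deployment name depend on your installation, so they're passed in as arguments rather than hard-coded.

```shell
# Hypothetical helper: restart the Mimir nginx by rolling its Deployment.
# $1: namespace of the Mimir installation
# $2: name of the nginx Deployment (assumes nginx is managed by a Deployment)
restart_mimir_nginx() {
  kubectl -n "$1" --as=cluster-admin rollout restart "deployment/$2"
  # Wait until the replacement pod is ready, or give up after two minutes.
  kubectl -n "$1" --as=cluster-admin rollout status "deployment/$2" --timeout=120s
}
```

For example, `restart_mimir_nginx vshn-appuio-mimir <nginx-deployment>`, where the namespace is the one seen in the example log line and the Deployment name is left for you to fill in. Deleting the pod directly should have the same effect if a controller recreates it.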