PostgreSQL by VSHN
Uptime
We don’t yet have a lot of operational experience with this service. If you received this alert, please add any insights you gained to improve this runbook.
Overview
The SLI measures the uptime of each PostgreSQL by VSHN instance. This SLI is measured by a prober that executes a SQL query every second.
If this SLI results in an alert, it means that a significant number of SQL queries failed and that we risk missing the SLO.
There are two types of alerts that fire if we expect to miss the configured objective.
- A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have the label severity: warning.
- A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective.
Steps for debugging
Failed probes can have a multitude of reasons, but in general there are two kinds of issue classes: either the instance itself is failing, or provisioning or updating the instance failed.
In any case, you should first figure out where the affected instance runs.
The alert will provide you with three labels: cluster_id, namespace, and name.
Connect to the Kubernetes cluster with the provided cluster_id and get the affected claim.
export NAMESPACE={{ namespace }}
export NAME={{ name }}
export COMPOSITE=$(kubectl -n $NAMESPACE get vshnpostgresql $NAME -o jsonpath="{.spec.resourceRef.name}")
kubectl -n $NAMESPACE get vshnpostgresql $NAME
If the claim is not SYNCED, this might indicate an issue with provisioning.
If it is synced, there is most likely an issue with the instance itself, and you can skip to the next subsection.
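If the claim is not synced and you need more detail than the SYNCED column alone, describing the claim shows its status conditions and recent events. This is a generic Crossplane debugging step, not specific to this service:
kubectl -n $NAMESPACE describe vshnpostgresql $NAME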
Debugging Provisioning
To figure out what went wrong with provisioning, it usually helps to take a closer look at the composite.
kubectl --as cluster-admin describe xvshnpostgresql $COMPOSITE
If there are sync issues, there are usually events that point to the root cause of the issue.
Further, it can help to look at the Object resources that are created for this instance.
kubectl --as cluster-admin get object -l crossplane.io/composite=$COMPOSITE
If any of them are not synced, describing them should point you in the right direction.
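To drill into a single Object that isn’t synced, describe it by the name shown in the listing above (the name below is a placeholder):
kubectl --as cluster-admin describe object <object-name>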
Finally, it might also be helpful to look at the logs of the various Crossplane components in the syn-crossplane namespace.
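A sketch of how to fetch those logs; the deployment names in syn-crossplane can differ per cluster, so list them first:
kubectl --as cluster-admin -n syn-crossplane get deployments
kubectl --as cluster-admin -n syn-crossplane logs deployments/crossplane --since=1h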
Debugging PostgreSQL Instance
If the instance is synced, but still not running, we’ll need to look at the database pods themselves.
First see if the pods are running.
export INSTANCE_NAMESPACE=$(kubectl -n $NAMESPACE get vshnpostgresql $NAME -o jsonpath="{.status.instanceNamespace}")
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE get pod
If they’re running, check the logs for any obvious error messages:
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE logs sts/${COMPOSITE}
If you can’t see any pods at all, this might be a Stackgres operator issue. Check the corresponding SGCluster resource.
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE describe sgcluster ${COMPOSITE}
If there are no obvious errors, it might help to also look at the Stackgres operator logs.
kubectl -n syn-stackgres-operator logs deployments/stackgres-operator
Tune
If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.
You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.
The example below sets the objective to 99.25% and disables the page alert.
appcat:
  slos:
    vshn:
      postgresql:
        uptime:
          objective: 99.25
          alerting:
            page_alert:
              enabled: false
PostgreSQLReplicationCritical
This alert fires if the replication for a VSHNPostgreSQL instance has been broken for longer than 10 minutes.
# figure out which pod is the master
kubectl get pods -n $affected_namespace -l app=StackGresCluster -l role=master
# figure out which pods are replicas
kubectl get pods -n $affected_namespace -l app=StackGresCluster,cluster=true -l role=replica
# reinitialize replica pod (example commands)
kubectl --as cluster-admin -n vshn-postgresql-test-cluster-always-true-jnlj4 exec -ti pods/test-cluster-always-true-jnlj4-1 -- bash
(inside pod):bash-4.4$ patronictl list
(inside pod):bash-4.4$ patronictl reinit test-cluster-always-true-jnlj4 test-cluster-always-true-jnlj4-1
Note that patronictl reinit takes the name of the cluster and the name of the pod, not the name of the StatefulSet. It can be used in scripts with the --force flag, which auto-accepts all prompts.
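For example, the non-interactive variant of the reinit above, using the same cluster and pod names:
(inside pod):bash-4.4$ patronictl reinit --force test-cluster-always-true-jnlj4 test-cluster-always-true-jnlj4-1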
If that doesn’t work, delete the replica pods and let Patroni recover on its own; a sketch of this follows below. If that doesn’t work either, please refer to the Patroni documentation or the Stackgres documentation on how to fix replication issues.
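A minimal sketch of deleting a lagging replica, reusing the example instance from above; the StatefulSet recreates the pod and Patroni should re-join it as a replica:
kubectl --as cluster-admin -n vshn-postgresql-test-cluster-always-true-jnlj4 delete pod test-cluster-always-true-jnlj4-1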
We need to build know-how on how to fix replication issues, so please document your findings in this runbook.
PostgreSQLReplicationLagCritical
This alert fires if the replication lag is higher than 5 minutes for a VSHNPostgreSQL instance. It means that the replicas are behind the master, which is most likely related to network or storage issues. If storage and network look okay, you can try to re-initialize the node that has the replication lag, as described above.
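To see the current lag per member, patronictl list prints a lag column; a sketch, reusing the example instance from the reinit section above:
kubectl --as cluster-admin -n vshn-postgresql-test-cluster-always-true-jnlj4 exec -ti pods/test-cluster-always-true-jnlj4-0 -- patronictl list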
PostgreSQLPodReplicasCritical
This alert fires when there are issues with the StatefulSet responsible for the replicas. It means that there are fewer replicas than expected, most probably due to quota issues.
kubectl describe -n vshn-postgresql-<instance> sts <instance>
## for example: kubectl -n vshn-postgresql-test-cluster-always-true-jnlj4 describe sts test-cluster-always-true-jnlj4
## get events from the affected namespace and look for issues
kubectl -n vshn-postgresql-test-cluster-always-true-jnlj4 get events
PostgreSQLConnectionsCritical
This alert fires when the number of used connections is over 90% of the configured max_connections limit (which defaults to 100).
It means that either the connection limit is set too low or an application is misbehaving and spawning too many connections.
You either need to raise the max_connections parameter on the PostgreSQL instance or debug the application.
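To check how close the instance is to the limit, you can count the entries in pg_stat_activity from inside one of the database pods. This is a sketch only: the container name and local psql authentication may differ in your setup.
# count current connections against the configured limit (container name 'patroni' is an assumption)
kubectl --as cluster-admin -n vshn-postgresql-<instance> exec -ti pods/<instance>-0 -c patroni -- psql -c "SELECT count(*) AS used, current_setting('max_connections') AS max FROM pg_stat_activity;"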