PostgreSQL by VSHN
Uptime
We don’t yet have a lot of operational experience with this service. If you received this alert, please add any insights you gained to improve this runbook.
Overview
The SLI measures the uptime of each PostgreSQL by VSHN instance. This SLI is measured by a prober that executes a SQL query every second.
If this SLI results in an alert, it means that a significant number of SQL queries failed and that we risk missing the SLO.
There are two types of alerts that fire if we expect to miss the configured objective.
- A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have the label severity: warning.
- A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective.
Steps for debugging
Failed probes can have a multitude of reasons, but in general there are two kinds of issue classes: either the instance itself is failing, or provisioning or updating the instance failed.
In any case, you should first figure out where the affected instance runs.
The alert will provide you with three labels: cluster_id, namespace, and name.
Connect to the Kubernetes cluster with the provided cluster_id and get the affected claim.
export NAMESPACE={{ namespace }}
export NAME={{ name }}
export COMPOSITE=$(kubectl -n $NAMESPACE get vshnpostgresql $NAME -o jsonpath="{.spec.resourceRef.name}")
kubectl -n $NAMESPACE get vshnpostgresql $NAME
If the claim is not SYNCED, this might indicate an issue with provisioning.
If it is synced, there is most likely an issue with the instance itself, and you can skip to the next subsection.
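If the claim is not synced and you need more detail than the SYNCED column alone, describing the claim shows its status conditions and recent events. This is a generic Crossplane debugging step, not specific to this service:
kubectl -n $NAMESPACE describe vshnpostgresql $NAME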
Debugging Provisioning
To figure out what went wrong with provisioning, it usually helps to take a closer look at the composite.
kubectl --as cluster-admin describe xvshnpostgresql $COMPOSITE
If there are sync issues, there are usually events that point to the root cause of the issue.
Further, it can help to look at the Object resources that are created for this instance.
kubectl --as cluster-admin get object -l crossplane.io/composite=$COMPOSITE
If any of them are not synced, describing them should point you in the right direction.
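To drill into a single Object that isn’t synced, describe it by the name shown in the listing above (the name below is a placeholder):
kubectl --as cluster-admin describe object <object-name>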
Finally, it might also be helpful to look at the logs of the various Crossplane components in the syn-crossplane namespace.
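A sketch of how to fetch those logs; the deployment names in syn-crossplane can differ per cluster, so list them first:
kubectl --as cluster-admin -n syn-crossplane get deployments
kubectl --as cluster-admin -n syn-crossplane logs deployments/crossplane --since=1h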
Debugging PostgreSQL Instance
If the instance is synced, but still not running, we’ll need to look at the database pods themselves.
First see if the pods are running.
export INSTANCE_NAMESPACE=$(kubectl -n $NAMESPACE get vshnpostgresql $NAME -o jsonpath="{.status.instanceNamespace}")
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE get pod
If they’re running, check the logs for any obvious error messages:
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE logs sts/${COMPOSITE}
If you can’t see any pods at all, this might be a Stackgres operator issue. Check the corresponding SGCluster resource.
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE describe sgcluster ${COMPOSITE}
If there are no obvious errors, it might help to also look at the Stackgres operator logs.
kubectl -n syn-stackgres-operator logs deployments/stackgres-operator
Tune
If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.
You have the option to tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.
The example below sets the objective to 99.25% and disables the page alert.
appcat:
  slos:
    vshn:
      postgresql:
        uptime:
          objective: 99.25
          alerting:
            page_alert:
              enabled: false
PostgreSQLReplicationCritical
This alert fires if the replication for a VSHNPostgreSQL instance has been broken for longer than 10 minutes.
# figure out which pod is the master
kubectl get pods -n $affected_namespace -l app=StackGresCluster -l role=master
# figure out which pods are replicas
kubectl get pods -n $affected_namespace -l app=StackGresCluster,cluster=true -l role=replica
# reinitialize replica pod (example commands)
kubectl --as cluster-admin -n vshn-postgresql-test-cluster-always-true-jnlj4 exec -ti pods/test-cluster-always-true-jnlj4-1 -- bash
(inside pod):bash-4.4$ patronictl list
(inside pod):bash-4.4$ patronictl reinit test-cluster-always-true-jnlj4 test-cluster-always-true-jnlj4-1
Note that patronictl reinit takes the name of the cluster and the name of the pod, not the name of the StatefulSet. It can be used in scripts with the --force flag, which auto-accepts all prompts.
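For example, the non-interactive variant of the reinit above, using the same cluster and pod names:
(inside pod):bash-4.4$ patronictl reinit --force test-cluster-always-true-jnlj4 test-cluster-always-true-jnlj4-1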
If that doesn’t work, delete the replica pods and let Patroni recover on its own; a sketch of this follows below. If that doesn’t work either, please refer to the Patroni documentation or the Stackgres documentation on how to fix replication issues.
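A minimal sketch of deleting a lagging replica, reusing the example instance from above; the StatefulSet recreates the pod and Patroni should re-join it as a replica:
kubectl --as cluster-admin -n vshn-postgresql-test-cluster-always-true-jnlj4 delete pod test-cluster-always-true-jnlj4-1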
We need to build know-how on how to fix replication issues, so please document your findings in this runbook.
PostgreSQLReplicationLagCritical
This alert fires if the replication lag is higher than 5 minutes for a VSHNPostgreSQL instance. It means that the replicas are behind the master, which is most likely related to network or storage issues. If storage and network look okay, you can try to re-initialize the node that has the replication lag, as described above.
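To see the current lag per member, patronictl list prints a lag column; a sketch, reusing the example instance from the reinit section above:
kubectl --as cluster-admin -n vshn-postgresql-test-cluster-always-true-jnlj4 exec -ti pods/test-cluster-always-true-jnlj4-0 -- patronictl list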
PostgreSQLPodReplicasCritical
This alert fires when there are issues with the StatefulSet responsible for the replicas. It means that there are fewer replicas than expected, most probably due to quota issues.
kubectl describe -n vshn-postgresql-<instance> sts <instance>
## for example: kubectl -n vshn-postgresql-test-cluster-always-true-jnlj4 describe sts test-cluster-always-true-jnlj4
## get events from the affected namespace and look for issues
kubectl -n vshn-postgresql-test-cluster-always-true-jnlj4 get events
PostgreSQLConnectionsCritical
This alert fires when the number of used connections is over 90% of the configured max_connections limit (which defaults to 100).
It means that either the connection limit is set too low or an application is misbehaving and spawning too many connections.
You either need to raise the max_connections parameter on the PostgreSQL instance or debug the application.
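To check how close the instance is to the limit, you can count the entries in pg_stat_activity from inside one of the database pods. This is a sketch only: the container name and local psql authentication may differ in your setup.
# count current connections against the configured limit (container name 'patroni' is an assumption)
kubectl --as cluster-admin -n vshn-postgresql-<instance> exec -ti pods/<instance>-0 -c patroni -- psql -c "SELECT count(*) AS used, current_setting('max_connections') AS max FROM pg_stat_activity;"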