MariaDB by VSHN

Uptime

We don’t yet have a lot of operational experience with this service. If you received this alert, please add any insights you gained to improve this runbook.

Overview

The SLI measures the uptime of each MariaDB by VSHN instance. This SLI is measured by a prober that executes a SQL query every second.

If this SLI results in an alert, it means that a significant number of SQL queries failed and that we risk missing the SLO.

There are two types of alerts that fire if we expect to miss the configured objective.

  • A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have a label severity: warning.

  • A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective.

Steps for debugging

Probes can fail for many reasons, but the causes generally fall into two classes: either the instance itself is failing, or provisioning or updating the instance failed.

In either case, first figure out where the affected instance runs. The alert provides three labels: cluster_id, namespace, and name. Connect to the Kubernetes cluster with the provided cluster_id and get the affected claim.

export NAMESPACE={{ namespace }}
export NAME={{ name }}

export COMPOSITE=$(kubectl -n $NAMESPACE get vshnmariadb $NAME -o jsonpath="{.spec.resourceRef.name}")
kubectl -n $NAMESPACE get vshnmariadb $NAME

If the claim is not SYNCED, this might indicate an issue with provisioning. If it is synced, the issue is most likely with the instance itself, and you can skip ahead to Debugging MariaDB Instance.
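The SYNCED column reflects a status condition on the claim. The following is a minimal sketch for reading a single condition directly, assuming the claim exposes the usual Crossplane Synced and Ready conditions. claim_condition is a hypothetical helper name, not part of any tooling.

```shell
# Print the status ("True", "False" or "Unknown") of one condition of the claim.
# Assumes NAMESPACE and NAME are exported as shown above.
# claim_condition is a hypothetical helper name.
claim_condition() {
  kubectl -n "$NAMESPACE" get vshnmariadb "$NAME" \
    -o jsonpath="{.status.conditions[?(@.type==\"$1\")].status}"
}

# Example:
#   claim_condition Synced
#   claim_condition Ready
```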

Debugging Provisioning

To figure out what went wrong with provisioning it usually helps to take a closer look at the composite.

kubectl --as cluster-admin describe xvshnmariadb $COMPOSITE

If there are sync issues, there are usually events that point to the root cause.
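If the describe output is long, the events can also be listed on their own. This is a sketch only: composite_events is a hypothetical helper name, and -A is used because no assumption is made here about the namespace in which the events are recorded.

```shell
# List events that reference the composite, sorted by last timestamp.
# Assumes COMPOSITE is exported as shown above.
# composite_events is a hypothetical helper name.
composite_events() {
  kubectl --as cluster-admin get events -A \
    --field-selector "involvedObject.name=${COMPOSITE}" \
    --sort-by=.lastTimestamp
}

# Example:
#   composite_events
```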

Furthermore, it can help to look at the Object resources that are created for this instance, or at the releases.helm.crossplane.io object associated with the instance. In the commands below, $OBJECT_NAME stands for the name of one of the listed Object resources.

kubectl --as cluster-admin get object -l crossplane.io/composite=$COMPOSITE
kubectl --as cluster-admin get object $OBJECT_NAME
kubectl --as cluster-admin get releases.helm.crossplane.io -l crossplane.io/composite=$COMPOSITE
kubectl --as cluster-admin describe releases.helm.crossplane.io -l crossplane.io/composite=$COMPOSITE

If any of them are not synced, describing them should point you in the right direction.
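To narrow down which Object resources to describe, the list can be filtered by the Synced condition. This is a sketch under the assumption that the Object resources report a Synced condition in .status.conditions, as Crossplane managed resources usually do. unsynced_objects is a hypothetical helper name.

```shell
# Print name and Synced status of every Object of the composite whose
# Synced condition is not "True" (including objects with no such condition).
# Assumes COMPOSITE is exported as shown above.
# unsynced_objects is a hypothetical helper name.
unsynced_objects() {
  kubectl --as cluster-admin get object \
    -l "crossplane.io/composite=${COMPOSITE}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Synced")].status}{"\n"}{end}' |
    awk -F'\t' '$2 != "True"'
}

# Example:
#   unsynced_objects
```

Any name printed by this helper can be used as $OBJECT_NAME in the commands above.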

Finally, it might also be helpful to look at the logs of the Crossplane components in the syn-crossplane namespace.
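Since the exact pod names in that namespace vary, one option is to loop over all of them. This is a sketch, not a prescribed procedure: crossplane_logs is a hypothetical helper name, and the default of 50 lines per pod is an arbitrary choice.

```shell
# Print the last lines of the logs of every pod in syn-crossplane.
# The optional first argument is the number of lines per pod (default 50).
# crossplane_logs is a hypothetical helper name.
crossplane_logs() {
  for pod in $(kubectl --as cluster-admin -n syn-crossplane get pods -o name); do
    echo "==> ${pod}"
    kubectl --as cluster-admin -n syn-crossplane logs "$pod" \
      --all-containers --tail="${1:-50}"
  done
}

# Example:
#   crossplane_logs 100
```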

Debugging MariaDB Instance

If the instance is synced, but still not running, we’ll need to look at the database pods themselves.

First see if the pods are running.

export INSTANCE_NAMESPACE=$(kubectl -n $NAMESPACE get vshnmariadb $NAME -o jsonpath="{.status.instanceNamespace}")
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE get pod
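In a namespace with several pods, it can be quicker to list only those that are not in the Running phase. This is a sketch with a hypothetical helper name, not_running_pods. Note that a pod in CrashLoopBackOff may still report the Running phase, and that completed pods are listed as well, so the full pod list above remains the authoritative view.

```shell
# List pods in the instance namespace whose phase is not Running.
# Assumes INSTANCE_NAMESPACE is exported as shown above.
# not_running_pods is a hypothetical helper name.
not_running_pods() {
  kubectl --as cluster-admin -n "$INSTANCE_NAMESPACE" get pod \
    --field-selector='status.phase!=Running' -o wide
}

# Example:
#   not_running_pods
```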

If they’re running, check the logs for any obvious error messages.

kubectl --as cluster-admin -n $INSTANCE_NAMESPACE logs sts/${COMPOSITE}

If you can’t see any pods at all, there might be an issue with the statefulset (e.g. a faulty configuration). Check the corresponding statefulset and its events.

kubectl --as cluster-admin -n $INSTANCE_NAMESPACE describe sts mariadb
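Events that are not attached to the statefulset itself (for example events on pods or volume claims) can be seen by listing all events in the instance namespace. This is a sketch with a hypothetical helper name, instance_events.

```shell
# List all events in the instance namespace, sorted by last timestamp.
# Assumes INSTANCE_NAMESPACE is exported as shown above.
# instance_events is a hypothetical helper name.
instance_events() {
  kubectl --as cluster-admin -n "$INSTANCE_NAMESPACE" get events \
    --sort-by=.lastTimestamp
}

# Example:
#   instance_events
```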

Tune

If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.

You can tune the SLO through the component parameters. You can modify the objective, disable the page or ticket alert, or completely disable the SLO.

The example below sets the objective to 99.25% and disables the page alert.

appcat:
  slos:
    vshn:
      mariadb:
        uptime:
          objective: 99.25
          alerting:
            page_alert:
              enabled: false
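When choosing an objective, it can help to translate the percentage into allowed downtime. The sketch below does that arithmetic. The 30-day window is an assumption made for illustration and is not taken from the component configuration, and error_budget_minutes is a hypothetical helper name.

```shell
# Print the allowed downtime in minutes for an objective given in percent.
# The optional second argument is the window in days; the default of 30 days
# is an assumption for illustration, not a value from the component.
# error_budget_minutes is a hypothetical helper name.
error_budget_minutes() {
  awk -v objective="$1" -v days="${2:-30}" \
    'BEGIN { printf "%.0f\n", (100 - objective) / 100 * days * 24 * 60 }'
}

# Example:
#   error_budget_minutes 99.25      # prints 324
```

Under that assumption, an objective of 99.25% leaves a budget of 324 minutes per window, which corresponds to 19,440 failed probes at one probe per second.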