MinIO by VSHN
Uptime
We don’t yet have a lot of operational experience with this service. If you received this alert, please add any insights you gained to improve this runbook.
Overview
The SLI measures the uptime of each MinIO by VSHN instance. It is measured by a prober that writes a small file onto a test bucket every second.
If this SLI results in an alert, it means that a significant number of write operations failed and that we risk missing the SLO.
There are two types of alerts that fire if we expect to miss the configured objective.
- A ticket alert means that the error rate is slightly higher than the objective. If we don’t intervene at some point after receiving this alert, we can expect to miss the objective. However, no immediate, urgent action is necessary. A ticket alert should have the label severity: warning.
- A page alert means that the error rate is significantly higher than the objective. Immediate action is necessary to not miss the objective.
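To judge how urgent an alert is, it helps to know how much error budget the objective leaves. As a rough sketch, assuming a 30-day SLO window (an assumption for illustration, not something this runbook specifies), the number of failed probe minutes the objective tolerates can be computed with plain shell:

```shell
# Sketch: how much failure a given objective tolerates over a 30-day window.
# The 30-day window is an assumption; check the actual SLO period of your setup.
objective=99.25                     # percent, as set in the component parameters
window_minutes=$((30 * 24 * 60))    # minutes in a 30-day window
# Error budget = window * (100 - objective) / 100
budget=$(awk -v o="$objective" -v w="$window_minutes" \
  'BEGIN { printf "%d", w * (100 - o) / 100 }')
echo "$budget"                      # minutes of failed probes allowed
```

With the default-looking 99.25% objective this works out to 324 minutes over 30 days; a tighter 99.9% objective would leave only about 43 minutes, which is why page alerts demand immediate action.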
Steps for debugging
Failed probes can have a multitude of reasons, but in general there are two different kinds of issue classes. Either the instance itself is failing or provisioning or updating the instance failed.
In any case, you should first figure out where the affected instance runs.
The alert will provide you with three labels: cluster_id, namespace, and name.
Connect to the Kubernetes cluster with the provided cluster_id and get the affected claim.
export NAMESPACE={{ namespace }}
export NAME={{ name }}
export COMPOSITE=$(kubectl -n $NAMESPACE get vshnminio $NAME -o jsonpath="{.spec.resourceRef.name}")
kubectl -n $NAMESPACE get vshnminio $NAME
If the claim is not SYNCED, this might indicate an issue with provisioning. If it is synced, there is most likely an issue with the instance itself, and you can skip to the next subsection.
Debugging Provisioning
To figure out what went wrong with provisioning it usually helps to take a closer look at the composite.
kubectl --as cluster-admin describe xvshnminio $COMPOSITE
If there are sync issues there usually are events that point to the root cause of the issue.
Furthermore, it can help to look at the Object resources that are created for this instance, or at the releases.helm.crossplane.io object associated with the instance.
kubectl --as cluster-admin get object -l crossplane.io/composite=$COMPOSITE
kubectl --as cluster-admin get object $OBJECT_NAME
kubectl --as cluster-admin get releases.helm.crossplane.io -l crossplane.io/composite=$COMPOSITE
kubectl --as cluster-admin describe releases.helm.crossplane.io -l crossplane.io/composite=$COMPOSITE
If any of them are not synced, describing them should point you in the right direction.
Finally, it might also be helpful to look at the logs of the various Crossplane components in the syn-crossplane namespace.
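As a starting point, something like the following lists the Crossplane pods and tails the logs of one component. The exact pod and deployment names vary by installation, so the deployment name used here is an assumption; check the output of the first command before picking one.

```shell
# List the Crossplane pods first; names vary by installation.
kubectl --as cluster-admin -n syn-crossplane get pods
# Then tail the logs of the component you suspect (deployment name is an assumption):
kubectl --as cluster-admin -n syn-crossplane logs deployment/crossplane --tail=100
```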
Debugging MinIO Instance
If the instance is synced but still not running, we’ll need to look at the MinIO pods themselves.
First see if the pods are running.
export INSTANCE_NAMESPACE=$(kubectl -n $NAMESPACE get vshnminio $NAME -o jsonpath="{.status.instanceNamespace}")
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE get pod
If they’re running, check the logs for any obvious error messages.
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE logs deployment/$COMPOSITE # for standalone setups
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE logs sts/$COMPOSITE # for distributed setups
If you can’t see any pods at all, there might be an issue with the deployment or statefulset (e.g. a faulty configuration). Check the corresponding workload and its events.
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE describe deployment $COMPOSITE # for standalone setups
kubectl --as cluster-admin -n $INSTANCE_NAMESPACE describe sts $COMPOSITE # for distributed setups
Tune
If this alert isn’t actionable, is noisy, or was raised too late, you may want to tune the SLO.
You can tune the SLO through the component parameters: modify the objective, disable the page or ticket alert, or disable the SLO completely.
The example below sets the objective to 99.25% and disables the page alert.
appcat:
  slos:
    vshn:
      minio:
        uptime:
          objective: 99.25
          alerting:
            page_alert:
              enabled: false