CiliumKVStoreMeshRemoteClusterNotReady
Overview
This alert fires if a remote cluster’s KVStore is not reachable from the local cluster for 10 minutes or longer. This usually indicates one of the following:
- There are network issues preventing cluster mesh connectivity.
- The remote cluster has misconfigured network policies which prevent access to its KVStore.
Depending on the network configuration, there may be static routes on each node for the remote cluster’s cluster mesh API server.
Steps for debugging
The steps in this section assume that your current Kubernetes context points to the source cluster.
This section assumes that you’re running cluster mesh with the cluster mesh API server enabled.
Prerequisites
- cilium CLI, install from cilium/cilium-cli
Identifying the root cause
First, check the source cluster’s overall cluster mesh status:
cilium -n cilium clustermesh status --as=cluster-admin (1)
(1) --as=cluster-admin is required on VSHN Managed OpenShift. You may need to leave it out on other clusters.
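Because the impersonation flag is only needed on some clusters, a small wrapper can make it optional. The following is a minimal sketch, not part of the original runbook: the function name `clustermesh_status` and the environment variables `IMPERSONATE` and `CILIUM_NAMESPACE` are hypothetical names chosen for illustration. It only uses the `cilium` invocation shown above.

```shell
# Hypothetical helper: run the cluster mesh status check with optional impersonation.
# IMPERSONATE and CILIUM_NAMESPACE are illustrative variable names, not cilium-cli settings.
clustermesh_status() {
  # --as=... is only appended when IMPERSONATE is set and non-empty.
  cilium -n "${CILIUM_NAMESPACE:-cilium}" clustermesh status \
    ${IMPERSONATE:+"--as=${IMPERSONATE}"}
}

# Usage on VSHN Managed OpenShift:
#   IMPERSONATE=cluster-admin
#   clustermesh_status
# Usage on clusters where impersonation is not needed:
#   unset IMPERSONATE
#   clustermesh_status
```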
Investigating cluster mesh API
The cluster mesh API runs in the cilium namespace as deployment clustermesh-apiserver.
Check that the pod is running and check its logs for errors with:
kubectl -n cilium get pods -l app.kubernetes.io/name=clustermesh-apiserver
kubectl -n cilium logs deploy/clustermesh-apiserver --all-containers
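The full log output can be long. As a rough sketch, the helper below narrows it to recent warning and error lines. The function name `clustermesh_apiserver_errors` and the 15 minute default are hypothetical choices, and the filter assumes that log lines carry a `level=<severity>` field, which may not hold for every Cilium version. Adjust the pattern if your logs look different.

```shell
# Hypothetical helper: print recent warning/error lines from the clustermesh-apiserver logs.
# Assumption: log lines contain a "level=<severity>" field.
clustermesh_apiserver_errors() {
  since="${1:-15m}"   # look-back window, passed to `kubectl logs --since`
  # --prefix adds the pod and container name to each line, so it is clear
  # which container of the deployment logged the message.
  kubectl -n cilium logs deploy/clustermesh-apiserver \
    --all-containers --prefix --since="${since}" \
    | grep -iE 'level=(warn|error|fatal)' || true   # no matches is not a failure
}

# Usage:
#   clustermesh_apiserver_errors        # last 15 minutes
#   clustermesh_apiserver_errors 1h     # last hour
```

An empty result only means that nothing matched the pattern. It doesn't prove that the deployment is healthy, so still check the pod status.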
Checking KVStoreMesh status
You can check the KVStoreMesh connection to the remote cluster mesh API server in the clustermesh-apiserver deployment:
kubectl -n cilium --as=cluster-admin exec -it deploy/clustermesh-apiserver -c kvstoremesh -- \
  clustermesh-apiserver kvstoremesh-dbg status (1)
kubectl -n cilium --as=cluster-admin exec -it deploy/clustermesh-apiserver -c kvstoremesh -- \
  clustermesh-apiserver kvstoremesh-dbg troubleshoot (2)
(1) Show a connection summary of the KVStoreMesh
(2) Show connection details of the KVStoreMesh
Checking node routing tables and connectivity
For setups which use static routes to make the nodes of the clusters participating in the cluster mesh reachable from each other, you can check the routing table on the affected node and verify connectivity with ping.
On OpenShift:
NODE=<node name of affected node>
REMOTE_NODE=<ip of a node in the remote cluster>
oc -n syn-debug-nodes debug node/${NODE} --as=cluster-admin -- chroot /host ip r
oc -n syn-debug-nodes debug node/${NODE} --as=cluster-admin -- chroot /host ping -c4 ${REMOTE_NODE}
On other Kubernetes distributions:
DEBUG_IMAGE=ghcr.io/digitalocean-packages/doks-debug:latest (1)
NODE=<node name of affected node>
REMOTE_NODE=<ip of a node in the remote cluster>
kubectl debug node/${NODE} -it --image=${DEBUG_IMAGE} -- ip r
kubectl debug node/${NODE} -it --image=${DEBUG_IMAGE} -- ping -c4 ${REMOTE_NODE} (2)
(1) We’re using the DigitalOcean doks-debug image, which comes with many common tools installed. See digitalocean/doks-debug for details.
(2) This command hasn’t been tested yet. It’s possible that your cluster configuration will not allow ping in node debug containers.
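When the routing table is long, it can be easier to ask the kernel which route it would pick for one destination with `ip route get`. The sketch below wraps that in the same `kubectl debug` pattern as above. The function name `route_to_remote` is hypothetical, and like the ping variant this hasn't been tested against a real cluster.

```shell
# Debug image as used above; override by setting DEBUG_IMAGE before sourcing this.
DEBUG_IMAGE="${DEBUG_IMAGE:-ghcr.io/digitalocean-packages/doks-debug:latest}"

# Hypothetical helper: show the route a node would use to reach a remote node IP.
# Arguments: <node-name> <remote-node-ip>
route_to_remote() {
  if [ "$#" -ne 2 ]; then
    echo "usage: route_to_remote <node-name> <remote-node-ip>" >&2
    return 2
  fi
  # `ip route get` prints the single route selected for the given destination.
  kubectl debug "node/$1" -it --image="${DEBUG_IMAGE}" -- ip route get "$2"
}

# Usage:
#   route_to_remote <node name of affected node> <ip of a node in the remote cluster>
```

If the output shows the default route instead of the expected static route, the static route for the remote cluster is likely missing on that node. Note that `kubectl debug` typically leaves the debug pod behind after the command exits, so delete it once you're done.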