Parameters

The parent key for all of the following parameters is openshift4_slos.

namespace

type

string

default

appuio-openshift4-slos

The default namespace for ArgoCD to fall back to.

images

type

dictionary

The images to use for this component.

images.sloth

type

dictionary

Sloth isn’t actually deployed to the cluster, but used to render PrometheusRules.

The entry in images allows Renovate to create version upgrade PRs. The Sloth version can be overridden by the tag parameter.
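As a minimal sketch of overriding the Sloth version, assuming the usual Project Syn layout where component parameters sit under parameters.openshift4_slos (the tag value is only an illustration and not taken from the component defaults):

```yaml
parameters:
  openshift4_slos:
    images:
      sloth:
        # Hypothetical version; use a tag that exists for the Sloth image.
        tag: v0.11.0
```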

slos

type

dictionary

The configuration options for all default SLOs of the APPUiO Managed OpenShift product.

slos.storage.csi-operations

type

dictionary

default
csi-operations:
  enabled: true
  objective: 99.5
  _sli:
    volume_plugin: "kubernetes.io/csi.+"
    operation_name: ".+"

The configuration for the csi-operations SLO.

The SLO can be disabled by setting enabled to false.

You can configure which volume plugins or storage operations are considered for the SLO by setting _sli.volume_plugin or _sli.operation_name respectively. The fields can contain an arbitrary PromQL regex label matcher.

Any additional field is added directly to the slo input for sloth.

Look at the runbook for an explanation of this SLO.
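The following sketch narrows the csi-operations SLO to a single CSI driver and two operations. The driver name and operation names are assumptions chosen for illustration; check the label values that your cluster actually reports before using them.

```yaml
parameters:
  openshift4_slos:
    slos:
      storage:
        csi-operations:
          objective: 99.0
          _sli:
            # Assumed label value for one specific CSI driver
            volume_plugin: "kubernetes.io/csi:ebs.csi.aws.com"
            # Assumed operation names; regex matches either one
            operation_name: "volume_(attach|mount)"
```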

slos.kubernetes_api.requests

type

dictionary

default
requests:
  enabled: true
  objective: 99.9
  _sli:
    apiserver: "kube-apiserver"

The configuration for the Kubernetes API requests SLO.

The SLO can be disabled by setting enabled to false.

You can configure which API servers are considered for the SLO by setting _sli.apiserver. By default the SLO only considers the Kubernetes API server and not the OpenShift API server. The field can contain an arbitrary PromQL regex label matcher.

Any additional field is added directly to the slo input for sloth.

Look at the runbook for an explanation of this SLO.
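Because _sli.apiserver is a regex matcher, including the OpenShift API server could look like the sketch below. It assumes that the OpenShift API server reports the label value openshift-apiserver, which you should verify against your cluster's metrics.

```yaml
parameters:
  openshift4_slos:
    slos:
      kubernetes_api:
        requests:
          _sli:
            # Assumed label value for the OpenShift API server
            apiserver: "kube-apiserver|openshift-apiserver"
```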

slos.kubernetes_api.canary

type

dictionary

default
canary:
  enabled: true
  objective: 99.9
  _sli:
    interval: 10s
    timeout: 5s

The configuration for the Kubernetes API canary SLO.

The SLO can be disabled by setting enabled to false.

You can configure the probe interval and timeout by setting _sli.interval and _sli.timeout respectively. Both parameters are in Go duration format (for example 1m30s).

Any additional field is added directly to the slo input for sloth.

Look at the runbook for an explanation of this SLO.
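A minimal sketch of relaxing the canary probe, using the two keys shown in the default above. The durations are illustrative values, not recommendations.

```yaml
parameters:
  openshift4_slos:
    slos:
      kubernetes_api:
        canary:
          _sli:
            # Probe every 30 seconds instead of every 10 seconds
            interval: 30s
            # Treat a probe as failed after 10 seconds
            timeout: 10s
```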

slos.workload-schedulability.canary

type

dictionary

default
workload-schedulability:
  canary:
    enabled: true
    objective: 99.75
    _sli:
      podStartInterval: 1m
      overallPodTimeout: 3m

The configuration for the canary based workload schedulability SLO.

The SLO can be disabled by setting enabled to false.

You can configure the interval at which canary pods are created (podStartInterval) and the timeout after which a pod is considered stuck (overallPodTimeout). Both parameters are in Go duration format (for example 1m30s).

Any additional field is added directly to the slo input for sloth.

Look at the runbook for an explanation of this SLO.
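For a cluster where pods are expected to start slowly, the settings above could be adjusted roughly as follows. The values are illustrative assumptions.

```yaml
parameters:
  openshift4_slos:
    slos:
      workload-schedulability:
        canary:
          objective: 99.5
          _sli:
            # Create a canary pod every two minutes
            podStartInterval: 2m
            # Consider a pod stuck if it isn't done after five minutes
            overallPodTimeout: 5m
```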

slos.network.canary

type

dictionary

default
network:
  canary:
    enabled: true
    objective: 99.95

The configuration for the canary based network SLO, measuring packet loss between nodes.

The SLO can be disabled by setting enabled to false. Any additional field is added directly to the slo input for sloth.

Look at the runbook for an explanation of this SLO.

alerting

type

dictionary

Common alerting configuration for all deployed SLOs.

alerting.labels

type

dictionary

default
labels:
  syn: "true"
  syn_component: "openshift4-slos"

Labels that are added to all Prometheus alerts generated by this component.

alerting.page_labels

type

dictionary

default
page_labels:
  severity: critical

Labels that are added to all page Prometheus alerts generated by this component. Page alerts are critical alerts for a high burn rate that require immediate attention.

alerting.ticket_labels

type

dictionary

default
ticket_labels:
  severity: warning

Labels that are added to all ticket Prometheus alerts generated by this component. Ticket alerts are alerts for an elevated burn rate that might require attention, but aren’t urgent.
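The three label dictionaries can be extended from the hierarchy. The sketch below adds one label to every alert and one extra label to ticket alerts only; the label names team and routing_key and their values are hypothetical.

```yaml
parameters:
  openshift4_slos:
    alerting:
      labels:
        # Hypothetical label added to all alerts
        team: platform
      ticket_labels:
        # Hypothetical label added to ticket alerts only
        routing_key: platform-tickets
```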

specs

type

dictionary

default

{}

The SLO definitions that are passed to Sloth. The key is used as the name of the resulting PrometheusRule. It must be a valid Kubernetes name.

specs.NAME.metadata

type

dictionary

example
metadata:
  namespace: my-important-service
  labels:
    prometheus: apps

The metadata applied to the PrometheusRule manifest. The name is derived from the name of the parent dictionary.

specs.NAME.sloth_input

type

dictionary

example
appuio-ch-http-get-availability:
  sloth_input:
    version: "prometheus/v1"
    service: "appuio-ch"
    labels:
      owner: "myteam"
    _slos:
      # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
      appuio-ch-http-get-availability:
        enabled: true # (1)
        objective: 99.9
        description: "SLO based on availability for blackbox HTTP GET request."
        sli:
          raw:
            error_ratio_query: |
              1 - (
                  sum_over_time(probe_success{instance="https://www.appuio.ch/"}[{{.window}}])
                /
                  count_over_time(up{instance="https://www.appuio.ch/"}[{{.window}}])
              )
        alerting:
          name: AppuioChHttpGetErrorRatio
          labels:
            category: "availability"
          annotations:
            # Overwrite default Sloth SLO alert summary on ticket and page alerts.
            summary: "High error rate on 'appuio.ch' responses"
          page_alert:
            labels:
              severity: warning
          ticket_alert:
            labels:
              severity: warning
              routing_key: myteam
(1) enabled is an optional field that allows users to disable individual SLOs through the hierarchy. The field defaults to true if omitted.

The input for sloth to generate the PrometheusRule.spec. See Sloth introduction for more information.

The slos can be passed as either an array or as a dictionary with the key _slos. This is done to allow easier modification of the SLOs from the Project Syn hierarchy.
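Since _slos is a dictionary, a single SLO can be addressed by its key from another level of the hierarchy. As a sketch, disabling the SLO from the example above for one cluster might look like this:

```yaml
parameters:
  openshift4_slos:
    specs:
      appuio-ch-http-get-availability:
        sloth_input:
          _slos:
            appuio-ch-http-get-availability:
              # Only this field is overridden; the rest is inherited
              enabled: false
```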

controller_node_affinity

type

dictionary

default
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchExpressions:
      - key: node-role.kubernetes.io/infra
        operator: Exists

This parameter is used to configure spec.affinity.nodeAffinity for the blackbox-exporter and scheduler-canary-controller deployments. We default to scheduling the blackbox-exporter and scheduler-canary-controller on the infra nodes.

To customize the node affinity for those deployments, please use reclass’s overwrite mechanism by using key ~controller_node_affinity, since otherwise your changes will most likely be appended to the component defaults.
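A sketch of such an overwrite, assuming you want both deployments on worker nodes instead of infra nodes (the node role label is an illustrative choice):

```yaml
parameters:
  openshift4_slos:
    # The leading ~ replaces the component default instead of merging into it
    ~controller_node_affinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/worker
                operator: Exists
```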

canary_node_affinity

type

dictionary

default
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchExpressions:
      - key: node-role.kubernetes.io/app
        operator: Exists

This parameter can be used to configure spec.affinity.nodeAffinity for the SchedulerCanary custom resource generated by the component.

We don’t recommend adjusting this parameter unless the component is installed on a cluster that has all-in-one nodes.

blackbox_exporter

type

dictionary

blackbox_exporter allows setting up an optional Blackbox exporter.

blackbox_exporter.enabled

type

boolean

default

true

Controls whether the Blackbox exporter is deployed.

blackbox_exporter.name

type

string

default

prometheus-blackbox-exporter

The name of the Blackbox exporter deployment.

blackbox_exporter.namespace

type

string

default

${openshift4_slos:namespace}

The namespace of the Blackbox exporter deployment.

blackbox_exporter.deployment.resources

type

dictionary

default

see class/defaults.yml

The resources to use for the Blackbox exporter deployment.

blackbox_exporter.deployment.affinity

type

dictionary

default
deployment:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: []
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchExpressions:
              - key: name
                operator: In
                values:
                  - ${openshift4_slos:blackbox_exporter:name}

Affinity rules for the Blackbox exporter deployment.

Schedules replicas on different nodes. This is done to avoid SLO violations when rebooting a worker node.

blackbox_exporter.deployment.replicas

type

integer

default

2

The number of replicas for the Blackbox exporter deployment. Defaults to 2 to avoid SLO violations when rebooting a worker node.

blackbox_exporter.deployment.podDisruptionBudget

type

dictionary

default
deployment:
  podDisruptionBudget:
    selector:
      matchLabels:
        name: ${openshift4_slos:blackbox_exporter:name}
    minAvailable: 1

The PodDisruptionBudget for the Blackbox exporter deployment. Ensures at least one replica is available at all times.

blackbox_exporter.config

type

dictionary

default

see class/defaults.yml

The blackbox exporter configuration. See Configuration for more information.

blackbox_exporter.probes

type

dictionary

default

{}

example
probes:
  http-appuio-ch:
    spec:
      jobName: get-http-appuio-ch
      interval: 15s
      module: http_2xx
      targets:
        staticConfig:
          static:
            - https://www.appuio.ch/

The Probe definitions that are deployed in the cluster and picked up by the blackbox exporter managed by the component. The key is used as the name of the resulting Probe. It must be a valid Kubernetes name.

The .spec.prober part is automatically filled from the Blackbox exporter configuration and can be omitted.

canary_scheduler_controller

type

dictionary

canary_scheduler_controller allows setting up the canary controller to test workload schedulability. The manifests are rendered using Kustomize.

canary_scheduler_controller.enabled

type

boolean

default

true

Controls whether the controller is deployed.

canary_scheduler_controller.manifests_version

type

string

default

${openshift4_slos:images:canary_scheduler_controller:tag}

The Git reference to the canary controller manifests. The default is the tag of the canary controller image.

canary_scheduler_controller.kustomize_input

type

dictionary

default
kustomize_input:
  namespace: ${openshift4_slos:namespace}

The input passed to the Kustomize renderer. See The Kustomization File for all available options.

network_canary

type

dictionary

network_canary allows configuring the network canary used for measuring packet loss for the network SLO.

network_canary.enabled

type

boolean

default

${openshift4_slos:slos:network:canary:enabled}

Whether the canary should be deployed. By default the component will deploy the canary if and only if the network canary SLO is enabled.

network_canary.namespace

type

string

default

appuio-network-canary

In which namespace the network canary should be deployed.

INFO: This needs to differ from the default SLO namespace so that we can choose different node selectors for the canary.

network_canary.nodeselector

type

string

default

node-role.kubernetes.io/worker=

The nodes on which the canary should be deployed. By default the network canary runs on all worker nodes.

network_canary.resources

type

dictionary

default
resources:
  limits:
    memory: 40Mi
  requests:
    cpu: 1m
    memory: 20Mi

The resource requests and limits for the network canary.

network_canary.tolerations

type

dictionary

default
tolerations:
  infrastructure:
    effect: NoSchedule
    key: node-role.kubernetes.io/infra
    operator: Exists
  storage:
    key: 'storagenode'
    operator: 'Exists'

The tolerations for the network canary daemonset. The values of the dictionary will be passed as is to the manifest.
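Because tolerations is a dictionary, further entries can be added under a new key without repeating the defaults. In this sketch the key gpu and the taint it tolerates are hypothetical:

```yaml
parameters:
  openshift4_slos:
    network_canary:
      tolerations:
        # Hypothetical entry; the value is passed as is to the daemonset
        gpu:
          key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```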

Example

namespace: appuio-openshift4-slos

specs:
  appuio-ch-http-get-availability:
    sloth_input:
      version: "prometheus/v1"
      service: "appuio-ch"
      labels:
        owner: "myteam"
      _slos:
        # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
        appuio-ch-http-get-availability:
          objective: 99.9
          description: "SLO based on availability for blackbox HTTP GET request."
          sli:
            raw:
              error_ratio_query: |
                1 - (
                    sum_over_time(probe_success{instance="https://www.appuio.ch/"}[{{.window}}])
                  /
                    count_over_time(up{instance="https://www.appuio.ch/"}[{{.window}}])
                )
          alerting:
            name: AppuioChHttpGetErrorRatio
            labels:
              category: "availability"
            annotations:
              # Overwrite default Sloth SLO alert summary on ticket and page alerts.
              summary: "High error rate on 'appuio.ch' responses"
            page_alert:
              labels:
                severity: warning
            ticket_alert:
              labels:
                severity: warning
                routing_key: myteam

blackbox_exporter:
  probes:
    http-appuio-ch:
      spec:
        jobName: get-http-appuio-ch
        interval: 15s
        module: http_2xx
        targets:
          staticConfig:
            static:
              - https://www.appuio.ch/