Parameters

The parent key for all of the following parameters is openshift4_monitoring.

manifests_version

type

string

default

release-4.14

Select which version of the upstream alerting (and recording) rules should be used by the component. This parameter must be changed to match the cluster’s OCP4 minor version.

We recommend setting this parameter based on the reported OpenShift version which can be found in the cluster’s dynamic facts.

manifests_version: release-${dynamic_facts:openshiftVersion:Major}.${dynamic_facts:openshiftVersion:Minor}

defaultConfig

type

dictionary

default
nodeSelector:
  node-role.kubernetes.io/infra: ''

A dictionary holding the default configuration which should be applied to all components.

upstreamRules.networkPlugin

type

string

default

openshift-sdn

Choose either openshift-sdn or ovn-kubernetes depending on the installed network plugin. If a custom network plugin is used, set any other string as the value for this parameter. This ensures neither openshift-sdn nor OVN-Kubernetes monitoring rules are deployed.
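As an illustration, a cluster which uses OVN-Kubernetes would override the default roughly as follows. This is a minimal sketch which assumes that the parameter is set in the cluster's configuration hierarchy under the parent key openshift4_monitoring.

```yaml
parameters:
  openshift4_monitoring:
    upstreamRules:
      # Any value other than openshift-sdn or ovn-kubernetes
      # would deploy neither set of network plugin rules.
      networkPlugin: ovn-kubernetes
```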

enableAlertmanagerIsolationNetworkPolicy

type

boolean

default

true

Blocks all traffic to Alertmanager pods except the allowed API traffic.

This works around an observed accidental clustering with user workload or custom Alertmanager clusters in other namespaces.

enableUserWorkloadAlertmanagerIsolationNetworkPolicy

type

boolean

default

true

Blocks all traffic to the user workload Alertmanager pods except the allowed API traffic.

This works around an observed accidental clustering with system or custom Alertmanager clusters in other namespaces.

enableUserWorkload

type

boolean

default

true

A parameter to enable monitoring for user-defined projects.

configs

type

dictionary

default
prometheusK8s:
  remoteWrite: []
  _remoteWrite: {}
  externalLabels:
    cluster_id: ${cluster:name}
    tenant_id: ${cluster:tenant}
  retention: 8d
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 50Gi
alertmanagerMain:
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 2Gi

A dictionary holding the configurations for the monitoring components.

The component will remove empty fields (null, and empty lists or objects) from the provided configuration.

See the OpenShift docs for available parameters.

This table shows the monitoring components you can configure and the keys used to specify the components:

Component                  Key

Prometheus Operator        prometheusOperator
Prometheus                 prometheusK8s
Alertmanager               alertmanagerMain
kube-state-metrics         kubeStateMetrics
openshift-state-metrics    openshiftStateMetrics
Grafana                    grafana
Telemeter Client           telemeterClient
Prometheus Adapter         k8sPrometheusAdapter
Thanos Querier             thanosQuerier

configs.prometheusK8s._remoteWrite

type

dictionary

default

{}

example
_remoteWrite:
  example:
    url: https://prometheus.example.com/api/v1/write
    headers:
      "X-Scope-OrgID": example
    writeRelabelConfigs:
      - action: keep
        sourceLabels: ['syn']
        regex: '.+'
      - action: keep
        timeseries:
          - foo_metric_one
          - foo_metric_two
    basicAuth:
      username:
        name: remote-write
        key: username
      password:
        name: remote-write
        key: password

A dictionary holding the remote write configurations for the Prometheus component. The key is the name of the configuration, the value is the content of the configuration.

The remote write configuration will be appended to the configs.prometheusK8s.remoteWrite parameter for backwards compatibility.

In this configuration only, writeRelabelConfigs entries can hold an entry for timeseries containing a list of strings representing individual Prometheus timeseries. These will be translated into a regex entry, with a regular expression matching any one of the listed timeseries.
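As a sketch of the translation described above, the second writeRelabelConfigs entry of the example could look roughly like the following after rendering. The sourceLabels value __name__ and the exact shape of the regular expression are assumptions. The description only states that a regex entry matching any one of the listed timeseries is generated.

```yaml
writeRelabelConfigs:
  - action: keep
    # Assumption: the generated entry matches on the metric name label.
    sourceLabels: ['__name__']
    # Assumption: the listed timeseries are joined into one alternation.
    regex: '(foo_metric_one|foo_metric_two)'
```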

configsUserWorkload

type

dictionary

default
alertmanager:
  enabled: true
  enableAlertmanagerConfig: true
  volumeClaimTemplate: ${openshift4_monitoring:configs:alertmanagerMain:volumeClaimTemplate}
prometheusOperator: {}
prometheus:
  retention: 8d
  volumeClaimTemplate: ${openshift4_monitoring:configs:prometheusK8s:volumeClaimTemplate}
thanosRuler: {}

A dictionary holding the configurations for the user workload monitoring components.

By default, we configure the user workload monitoring Prometheus and Alertmanager to inherit the volumeClaimTemplate specifications from the cluster-monitoring config. This allows users to configure the default storage class and volume size of both monitoring stacks through the cluster-monitoring config.
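The following sketch shows how the inheritance could be used in practice: the volume settings are changed once for the cluster-monitoring Prometheus, and the user workload Prometheus would pick them up through the default reference. The storage class name fast-ssd and the size are hypothetical values chosen for illustration.

```yaml
parameters:
  openshift4_monitoring:
    configs:
      prometheusK8s:
        volumeClaimTemplate:
          spec:
            # Hypothetical storage class name
            storageClassName: fast-ssd
            resources:
              requests:
                storage: 100Gi
    # configsUserWorkload.prometheus.volumeClaimTemplate isn't set here,
    # so it would keep referencing the values above.
```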

This table shows the monitoring components you can configure and the keys used to specify the components:

Component              Key                   Note

Alertmanager           alertmanager          Only on OpenShift 4.11 and newer
Prometheus Operator    prometheusOperator
Prometheus             prometheus
Thanos Ruler           thanosRuler

configsUserWorkload.prometheus._remoteWrite

type

dictionary

default

{}

example
_remoteWrite:
  example:
    url: https://prometheus.example.com/api/v1/write
    headers:
      "X-Scope-OrgID": customer
    writeRelabelConfigs:
      - sourceLabels: ['customer']
        regex: '.+'
        action: keep
    basicAuth:
      username:
        name: remote-write-customer
        key: username
      password:
        name: remote-write-customer
        key: password

A dictionary holding the remote write configurations for the Prometheus component of the user workload monitoring stack. The key is the name of the configuration, the value is the content of the configuration.

The remote write configuration will be appended to the configsUserWorkload.prometheus.remoteWrite parameter for backwards compatibility.

alertManagerConfig

type

dictionary

default
route:
  group_wait: 0s
  group_interval: 5s
  repeat_interval: 10m
inhibit_rules:
  # Don't send warning or info if a critical is already firing
  - target_match_re:
      severity: warning|info
    source_match:
      severity: critical
    equal:
      - namespace
      - alertname
  # Don't send info if a warning is already firing
  - target_match_re:
      severity: info
    source_match:
      severity: warning
    equal:
      - namespace
      - alertname

A dictionary holding the configuration for Alertmanager.

See the OpenShift docs for available parameters.

The component will silently drop any fields in the provided config which are empty. The component treats null as empty for scalar fields.

alertManagerAutoDiscovery

type

dictionary

default
alertManagerAutoDiscovery:
  enabled: true
  debug_config_map: false
  team_receiver_format: team_default_%s
  additional_alert_matchers: []
  prepend_routes: []
  append_routes: []

alertManagerAutoDiscovery holds the configuration for the Alertmanager auto-discovery feature.

The auto-discovery routes alerts to the configured teams based on their namespaces and the top-level syn.teams[*].instances and syn.owner parameters. Auto-discovery first creates a list of Commodore component instances by parsing the applications array using the same rules as Commodore itself (see also the Commodore component instantiation documentation). For each discovered instance, the component then renders the instance parameters, and reads the component’s namespace from the field namespace or namespace.name in the rendered parameters. Finally, routing rules are generated to route alerts from the discovered namespaces to the associated component instance’s owning team.

syn Team Example
syn:
  owner: daring-donkeys
  teams:
    electric-elephants:
      instances: [postgres]

The auto-discovery feature is enabled by default. A ConfigMap can be enabled with debug_config_map to debug the auto-discovery feature.

The configuration is merged with the alertManagerConfig parameter. Route receivers are generated for each team based on the team_receiver_format parameter. The routes are ordered as follows:

alertManagerAutoDiscovery.prepend_routes + generated routes + alertManagerAutoDiscovery.append_routes + alertManagerConfig.routes + route all to syn.owner
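To make the ordering more concrete, the sketch below adds one route before and one route after the generated routes. The receiver names heartbeat and fallback_oncall are hypothetical, and the sketch assumes that matching receivers are defined in alertManagerConfig. The route fields matchers, receiver and continue are standard Alertmanager route fields.

```yaml
parameters:
  openshift4_monitoring:
    alertManagerAutoDiscovery:
      prepend_routes:
        # Evaluated before any generated team route
        - matchers:
            - alertname = "Watchdog"
          receiver: heartbeat # hypothetical receiver
          continue: false
      append_routes:
        # Evaluated after the generated team routes
        - matchers:
            - severity = "critical"
          receiver: fallback_oncall # hypothetical receiver
```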

additional_alert_matchers is a list of additional alert matchers to add to the generated routes. This can be used to handle special cases where the auto-discovery feature doesn’t work as expected, for example when a label indicates that an alert should go to a different team than the namespace suggests.

alertManagerAutoDiscovery:
  additional_alert_matchers:
    - 'syn_team = ""'
# becomes
- continue: true
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: team_default_lovable-lizards
- continue: false
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: __component_openshift4_monitoring_null

alerts

type

dictionary

Configuration parameters which influence the resulting alert rules.

includeNamespaces

type

list

default

See class/defaults.yml

List of namespace patterns to use for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule. The component generates a regex pattern from the list by concatenating all elements into a large OR-regex. To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter excludeNamespaces.

The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.

The component doesn’t validate that the list entries are valid regex patterns.

Example

We assume that the input config has patterns default and syn.*:

includeNamespaces:
  - default
  - syn.*

The component will generate namespace selector namespace=~"(default|syn.*)" from this input configuration.

excludeNamespaces

type

list

default

[]

List of namespace patterns to exclude for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule. The component generates a regex pattern from the list by concatenating all elements into a large OR-regex. To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter includeNamespaces.

The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.

The component doesn’t validate that the list entries are valid regex patterns.

Example

We assume that the input config has patterns default, openshift.* and syn.* for includeNamespaces and openshift-adp for excludeNamespaces:

includeNamespaces:
  - default
  - openshift.*
  - syn.*
excludeNamespaces:
  - openshift-adp

The component will generate namespace selector namespace=~"(default|openshift.*|syn.*)",namespace!~"(openshift-adp)" from this input configuration.

ignoreNames

type

list

default

See class/defaults.yml

List of alert rule names to be dropped.

This parameter is taken into account in the filterRules and filterPatchRules library functions.

ignoreWarnings

type

list

default

See class/defaults.yml

List of alert rule names for which to drop alerts with label severity: warning.

In contrast to ignoreNames, this parameter is not taken into account in the filterRules and filterPatchRules library functions.

ignoreGroups

type

list

default

See class/defaults.yml

List of complete alert rule groups to drop.

This parameter is not taken into account for filterRules and filterPatchRules.

customAnnotations

type

dict

default

{}

Maps alert names to sets of custom annotations. Allows configuring custom annotations for individual alerts.

Example:

customAnnotations:
  Watchdog:
    runbook_url: https://www.google.com/?q=Watchdog

patchRules

type

dict

keys

potential values of parameter manifests_version and *

default

See class/defaults.yml on GitHub

The parameter patchRules allows users to customize upstream alerts. The component expects that top-level keys in the parameter correspond to values of parameter manifests_version. Additionally, the component supports the special top-level key *.

Alert patches which are defined under top-level key * are applied regardless of the OpenShift 4 version specified in parameter manifests_version. Additionally, the component applies all patches under the key which matches the value of parameter manifests_version. If an alert is patched in both top-level key * and the top-level key matching parameter manifests_version, the patches are merged, with the version-specific patch overriding the generic patch.

The component expects alert names as keys and any alert configuration as values in each top-level key. See the Prometheus alerting rules documentation for extended documentation on configuring alerting rules.

Example:

patchRules:
  '*':
    PrometheusRemoteWriteBehind:
      annotations:
        runbook_url: https://example.com/runbooks/PrometheusRemoteWriteBehind.html
  release-4.14:
    SystemMemoryExceedsReservation:
      for: 30m
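To illustrate the merge behaviour described above, the following sketch patches the same alert under both top-level keys. The label value is chosen for illustration only, and the commented result is what the described merge would be expected to produce when manifests_version is release-4.14.

```yaml
patchRules:
  '*':
    SystemMemoryExceedsReservation:
      for: 15m
      labels:
        severity: warning
  release-4.14:
    SystemMemoryExceedsReservation:
      for: 30m
# Expected effective patch for manifests_version release-4.14:
# SystemMemoryExceedsReservation:
#   for: 30m            # version-specific value wins
#   labels:
#     severity: warning # kept from the generic patch
```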

ignoreUserWorkload

type

list

default

[]

A list of alerting rules for which the component should patch the expr and annotations.description fields to ensure they don’t alert for the user workload monitoring stack.

By default, we don’t turn off any alerts for the user workload monitoring stack.

The parameter supports removing entries by providing the entry to remove prefixed with ~. The parameter can be completely cleared with the following config:

parameters:
  openshift4_monitoring:
    alerts:
      ~ignoreUserWorkload: []
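Removing a single entry, rather than clearing the whole list, could look like the following sketch. It assumes that a lower level of the hierarchy has added the alert PrometheusRuleFailures to the list; the alert name is only used for illustration.

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      ignoreUserWorkload:
        # Drops the entry added elsewhere in the hierarchy
        - '~PrometheusRuleFailures'
```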

silence

type

dict

Parameters to configure the silence CronJob.

silence.silences

type

dict

default
"Silence non syn alerts":
  matchers:
    - name: alertname
      value: ".+"
      isRegex: true
    - name: syn
      value: ""
      isRegex: false

Contains the silences to be applied. The key is used as the comment of the silence and the value is a dictionary which is passed to Alertmanager.

Silences removed from the hierarchy stay active in Alertmanager for up to 24h until they expire.

Silences all non-SYN alerts by default.
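An additional silence could be configured along the lines of the following sketch, which reuses the matcher structure of the default silence. The silence comment, the alert name and the namespace pattern are hypothetical examples.

```yaml
parameters:
  openshift4_monitoring:
    silence:
      silences:
        # The key becomes the comment of the silence
        "Silence KubeJobFailed in sandbox namespaces":
          matchers:
            - name: alertname
              value: KubeJobFailed
              isRegex: false
            - name: namespace
              value: "sandbox-.*" # hypothetical namespace pattern
              isRegex: true
```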

schedule

type

string

default

'0 */4 * * *'

Schedule of the CronJob in cron syntax.

serviceAccountName

type

string

default

prometheus-k8s

Name of the service account used when running the silence job. The service account must have permission to access the Alertmanager service through its OAuth proxy.

servingCertsCABundleName

type

string

default

serving-certs-ca-bundle

Name of the config map containing the CA bundle of the Alertmanager service.

jobHistoryLimit

type

dict

Parameters to configure the number of silence job objects to keep.

failed

type

number

default

3

Number of failed jobs to keep.

successful

type

number

default

3

Number of successful jobs to keep.

capacityAlerts

type

dict

This parameter allows users to enable and configure alerts for capacity management. The capacity alerts are enabled by default and can be disabled completely by setting the key capacityAlerts.enabled to false. Predictive alerts are disabled by default and can be enabled individually, for example by setting ExpectClusterCpuUsageHigh.enabled to true.

The dictionary will be transformed into a PrometheusRule object by the component.

The component provides 10 alerts that are grouped in four groups. You can disable or modify each of these alert rules individually. The fields in these rules will be added to the final PrometheusRule, with the exception of expr. The expr field contains fields which can be used to tune the default alert rule. Alternatively, the default rule can be completely overwritten by setting the expr.raw field (see example below). See Resource Management for an explanation of every alert rule.

Example:

capacityAlerts:
  enabled: true # (1)
  groupByNodeLabels: [] # (2)
  groups:
    PodCapacity:
      rules:
        TooManyPods:
          annotations:
            message: 'The number of pods is too damn high' # (3)
          for: 3h # (4)
        ExpectTooManyPods:
          expr: # (5)
            range: '2d'
            predict: '5*24*60*60'
    ResourceRequests:
      rules:
        TooMuchMemoryRequested:
          enabled: true
          expr:
            raw: sum(kube_pod_resource_request{resource="memory"}) > 9000*1024*1024*1024 # (6)
    CpuCapacity:
      rules:
        ClusterCpuUsageHigh:
          enabled: false # (7)
        ExpectClusterCpuUsageHigh:
          enabled: false # (7)
    UnusedCapacity:
      rules:
        ClusterHasUnusedNodes:
          enabled: false # (8)

(1) Enables capacity alerts
(2) List of node labels (as they show up in the kube_node_labels metric) by which alerts are grouped
(3) Changes the alert message for the pod capacity alert
(4) Only alerts for pod capacity if it fires for 3 hours
(5) Changes the pod count prediction to look at the last two days and predict the value in five days
(6) Completely overrides the default alert rule and alerts if the total memory request is over 9000 GiB
(7) Disables both CPU capacity alert rules
(8) Disables the alert which fires if the cluster has unused nodes

rules

type

dict

default

{}

This parameter allows users to configure additional Prometheus rules to deploy on the cluster.

Each key-value pair in the dictionary is transformed into a PrometheusRule object by the component.

The component expects that values are dicts themselves and expects that keys in those dicts are prefixed with record: or alert: to indicate whether the rule is a recording or alerting rule. The component will transform the keys into fields in the resulting rule by taking the prefix as the field name and the rest of the key as the field value. For example, key "record:sum:some:metric:5m" would be transformed into record: sum:some:metric:5m which should define a recording rule with name sum:some:metric:5m. This field is then merged into the provided value which should be a valid rule definition.

See the Prometheus docs for supported configurations for recording and alerting rules.

Example:

rules:
  generic-rules:
    "alert:ContainerOOMKilled":
      annotations:
        message: A container ({{$labels.container}}) in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        source: https://git.vshn.net/swisscompks/syn-tenant-repo/-/blob/master/common.yml
        severity: devnull
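For completeness, a recording rule could be sketched in the same way, using the key from the description above. The metric some_metric_total is a hypothetical name, and the commented result shows what the key transformation would be expected to produce.

```yaml
rules:
  generic-rules:
    "record:sum:some:metric:5m":
      # Hypothetical source metric
      expr: sum(rate(some_metric_total[5m]))
# Expected rule in the resulting PrometheusRule object:
# - record: sum:some:metric:5m
#   expr: sum(rate(some_metric_total[5m]))
```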

Example

defaultConfig:
  nodeSelector:
    node-role.kubernetes.io/infra: ''
configs:
  prometheusK8s:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi
alerts:
  ignoreNames:
    - KubeAPIErrorsHigh
    - KubeClientErrors

secrets

type

dict

default

{}

A dict of secrets to create in the namespace. The key is the name of the secret, the value is the content of the secret. The value must be a dict with a key stringData which is a dict of key/value pairs to add to the secret.
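A secret matching the basicAuth references in the remote write examples could be sketched as follows. The credentials are placeholders; real values should be supplied through whatever secret management mechanism the hierarchy uses rather than committed in plain text.

```yaml
parameters:
  openshift4_monitoring:
    secrets:
      # Name of the secret, as referenced by basicAuth in _remoteWrite
      remote-write:
        stringData:
          username: example-user     # placeholder
          password: example-password # placeholder
```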

remoteWriteDefaults

type

dict

default
remoteWriteDefaults:
  cluster: {}
  userWorkload: {}
example
remoteWriteDefaults:
  cluster:
    queueConfig:
      maxShards: 80
  userWorkload:
    queueConfig:
      maxShards: 20

A dict of default remote write configurations for the Prometheus component. These values are merged into each remote write configuration in configs.prometheusK8s.remoteWrite and configsUserWorkload.prometheus.remoteWrite.
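The sketch below combines a cluster default with a single remote write configuration. The commented result is an assumption about the rendered entry and only shows the fields discussed here. The description doesn't state which side wins if a field is set in both places.

```yaml
parameters:
  openshift4_monitoring:
    remoteWriteDefaults:
      cluster:
        queueConfig:
          maxShards: 80
    configs:
      prometheusK8s:
        _remoteWrite:
          example:
            url: https://prometheus.example.com/api/v1/write
# Assumed shape of the resulting remote write entry:
# - url: https://prometheus.example.com/api/v1/write
#   queueConfig:
#     maxShards: 80
```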

cronjobs

type

dict

A dict of arbitrary cronjobs to create in the openshift-monitoring namespace. The key is the name of the cronjob and the values are its configuration options as shown below.

schedule

type

string

Schedule of the CronJob in cron syntax.

script

type

string

The script to execute as part of the cronjob.

image

type

dict

default

images.oc from class/defaults.yml

image.image

type

string

The image used by the cronjob.

image.tag

type

string

The image tag used by the cronjob.

config

type

dict

default

{}

Any additional custom configuration for the cronjob.

Example

cronjobs:
  my-cronjob:
    schedule: "1 * * * *"
    image:
      image: quay.io/appuio/oc
      tag: v4.13
    script: |
      #!/bin/sh
      echo "this is an example"
    config:
      spec:
        failedJobsHistoryLimit: 1

customNodeExporter

This parameter allows users to deploy an additional node-exporter DaemonSet. We provide this option, since OpenShift’s cluster-monitoring stack currently doesn’t allow users to customize the bundled node-exporter DaemonSet.

Currently, the parameter is tailored to allow users to run an additional node-exporter which enables collectors that aren’t enabled in the default node exporter.

The configuration is rendered by using the same Jsonnet that’s used by the OpenShift cluster-monitoring stack to generate the default node-exporter DaemonSet. The component further customizes the resulting manifests to ensure that there are no conflicts between the default node-exporter and the additional node-exporter.

The additional node-exporter is deployed in the namespace indicated by parameter namespace. By default this is namespace openshift-monitoring. The component also deploys a ServiceMonitor which ensures that the additional node-exporter is scraped by the cluster-monitoring stack’s Prometheus.

Users can configure arbitrary recording and alerting rules which use metrics scraped from the additional node-exporter via parameter rules.

enabled

type

bool

default

false

Whether to deploy the additional node-exporter.

collectors

type

list

default

["network_route"]

Which collectors to enable in the additional node-exporter. Collectors that aren’t in this list are disabled in the additional node-exporter. Users can remove entries from this list by providing an existing entry prefixed with ~.

args

type

list

default

[]

Additional command line arguments to pass to the additional node-exporter. Please note that specifying --[no-]collector.<name> here will break the DaemonSet, since node-exporter doesn’t support specifying these flags multiple times. Users should use parameter customNodeExporter.collectors to enable collectors.
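Taken together, the parameters above could be used as in the following sketch, which replaces the default collector with the systemd collector and raises the log level. The choice of collector and flag is illustrative. Note that the default metricRelabelings would likely need to be adjusted as well, since by default only node_network_route* metrics are kept.

```yaml
parameters:
  openshift4_monitoring:
    customNodeExporter:
      enabled: true
      collectors:
        - '~network_route' # remove the default entry
        - systemd          # illustrative choice of collector
      args:
        # Not a --[no-]collector.<name> flag, so it doesn't
        # clash with the flags generated from collectors.
        - --log.level=debug
```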

metricRelabelings

type

list

default

See class/defaults.yml

This parameter allows users to specify the content of field metricRelabelings of the ServiceMonitor which is created for the additional node-exporter. By default, the component drops all metrics except node_network_route* metrics for host devices prefixed with ens. Since this component only applies to OpenShift 4, we know that any node’s host interfaces will use device names that are prefixed with ens.

Users are encouraged to extend or overwrite this parameter to ensure all the metrics they’re interested in are actually scraped by Prometheus.