Parameters

The parent key for all of the following parameters is openshift4_monitoring.

`manifests_version`

type	string
default	`release-4.17`

Select which version of the upstream alerting (and recording) rules should be used by the component. This parameter must be changed to match the cluster’s OCP4 minor version.

We recommend setting this parameter based on the reported OpenShift version which can be found in the cluster’s dynamic facts.

manifests_version: release-${dynamic_facts:openshiftVersion:Major}.${dynamic_facts:openshiftVersion:Minor}

`defaultConfig`

type	dictionary

default

nodeSelector:
  node-role.kubernetes.io/infra: ''

A dictionary holding the default configuration which should be applied to all components.

The contents of this parameter aren’t applied to the components nodeExporter and prometheusOperatorAdmissionWebhook, which don’t support the nodeSelector field.

`enableAlertmanagerIsolationNetworkPolicy`

type	boolean
default	`true`

Blocks all traffic to Alertmanager pods except the allowed API traffic.

This works around an observed accidental clustering with user workload or custom Alertmanager clusters in other namespaces.

`enableUserWorkloadAlertmanagerIsolationNetworkPolicy`

type	boolean
default	`true`

Blocks all traffic to the user workload Alertmanager pods except the allowed API traffic.

This works around an observed accidental clustering with system or custom Alertmanager clusters in other namespaces.

`enableUserWorkload`

type	boolean
default	`true`

A parameter to enable monitoring for user-defined projects.
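
As a minimal sketch, the three boolean parameters above could be set in the hierarchy as follows. The values are illustrative only, not a recommendation, and the snippet assumes the usual parameters.openshift4_monitoring parent key.

```yaml
parameters:
  openshift4_monitoring:
    # Keep the isolation NetworkPolicy for the cluster Alertmanager (default)
    enableAlertmanagerIsolationNetworkPolicy: true
    # Illustrative: opt out of the policy for the user workload Alertmanager
    enableUserWorkloadAlertmanagerIsolationNetworkPolicy: false
    # Keep monitoring for user-defined projects enabled (default)
    enableUserWorkload: true
```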

`configs`

type	dictionary
default	See `class/defaults.yml`

A dictionary holding the configurations for the monitoring components.

The component will remove empty fields (null, and empty lists or objects) from the provided configuration.

See the OpenShift docs for available parameters.
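
The following sketch illustrates the empty-field removal described above. The field names are regular fields of the OpenShift cluster monitoring configuration; the values are illustrative.

```yaml
configs:
  prometheusK8s:
    retention: 8d          # non-empty, kept
    externalLabels: {}     # empty object, removed by the component
    tolerations: []        # empty list, removed by the component
    retentionSize: null    # null, removed by the component
```

One consequence is that a scalar field inherited from elsewhere in the hierarchy can presumably be unset by overriding it with null.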

This table shows the monitoring components you can configure and the keys used to specify the components:

Component	Key
Prometheus Operator	`prometheusOperator`
Prometheus Operator admission webhook	`prometheusOperatorAdmissionWebhook`
Prometheus	`prometheusK8s`
Alertmanager	`alertmanagerMain`
kube-state-metrics	`kubeStateMetrics`
openshift-state-metrics	`openshiftStateMetrics`
Telemeter Client	`telemeterClient`
Metrics Server	`metricsServer`
Thanos Querier	`thanosQuerier`
Node exporter	`nodeExporter`
Console monitoring plugin	`monitoringPlugin`

`configs.prometheusK8s._remoteWrite`

type	dictionary
default	`{}`

example

_remoteWrite:
  example:
    url: https://prometheus.example.com/api/v1/write
    headers:
      "X-Scope-OrgID": example
    writeRelabelConfigs:
      - action: keep
        sourceLabels: ['syn']
        regex: '.+'
      - action: keep
        timeseries:
          - foo_metric_one
          - foo_metric_two
    basicAuth:
      username:
        name: remote-write
        key: username
      password:
        name: remote-write
        key: password

A dictionary holding the remote write configurations for the Prometheus component. The key is the name of the configuration, the value is the content of the configuration.

The remote write configuration will be appended to the configs.prometheusK8s.remoteWrite parameter for backwards compatibility.

In this configuration only, writeRelabelConfigs entries can hold an entry for timeseries containing a list of strings representing individual Prometheus timeseries. These will be translated into a regex entry, with a regular expression matching any one of the listed timeseries.
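
As a sketch of the translation described above, the timeseries entry from the example is expected to end up as a regular expression that matches any of the listed series. The exact output generated by the component isn't shown in this documentation: matching on the `__name__` label and the precise form of the regex are assumptions.

```yaml
# Input, as in the example above
writeRelabelConfigs:
  - action: keep
    timeseries:
      - foo_metric_one
      - foo_metric_two
---
# Roughly equivalent hand-written entry (assumed shape)
writeRelabelConfigs:
  - action: keep
    sourceLabels: ['__name__']   # assumption: series are matched by metric name
    regex: '(foo_metric_one|foo_metric_two)'
```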

`configsUserWorkload`

type	dictionary

default

alertmanager:
  enabled: true
  enableAlertmanagerConfig: true
  volumeClaimTemplate: ${openshift4_monitoring:configs:alertmanagerMain:volumeClaimTemplate}
prometheusOperator: {}
prometheus:
  retention: 8d
  volumeClaimTemplate: ${openshift4_monitoring:configs:prometheusK8s:volumeClaimTemplate}
thanosRuler: {}

A dictionary holding the configurations for the user workload monitoring components.

By default, we configure the user workload monitoring Prometheus and Alertmanager to inherit the volumeClaimTemplate specifications from the cluster-monitoring config. This allows users to configure the default storageclass and volume size of both monitoring stacks through the cluster-monitoring config.
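
A minimal sketch of this inheritance, assuming the default references shown above are left untouched. The storage class name is hypothetical.

```yaml
parameters:
  openshift4_monitoring:
    configs:
      prometheusK8s:
        volumeClaimTemplate:
          spec:
            storageClassName: fast-ssd   # hypothetical storage class name
            resources:
              requests:
                storage: 100Gi
    # No explicit configsUserWorkload.prometheus.volumeClaimTemplate is set:
    # the default reference resolves to the template above, so the user
    # workload Prometheus uses the same storage class and size.
```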

This table shows the monitoring components you can configure and the keys used to specify the components:

Component	Key	Note
Alertmanager	`alertmanager`	Only on OpenShift 4.11 and newer
Prometheus Operator	`prometheusOperator`
Prometheus	`prometheus`
Thanos Ruler	`thanosRuler`

`configsUserWorkload.prometheus._remoteWrite`

type	dictionary
default	`{}`

example

_remoteWrite:
  example:
    url: https://prometheus.example.com/api/v1/write
    headers:
      "X-Scope-OrgID": customer
    writeRelabelConfigs:
      - sourceLabels: ['customer']
        regex: '.+'
        action: keep
    basicAuth:
      username:
        name: remote-write-customer
        key: username
      password:
        name: remote-write-customer
        key: password

A dictionary holding the remote write configurations for the Prometheus component of the user workload monitoring stack. The key is the name of the configuration, the value is the content of the configuration.

The remote write configuration will be appended to the configsUserWorkload.prometheus.remoteWrite parameter for backwards compatibility.

`alertManagerConfig`

type	dictionary

default

route:
  group_wait: 0s
  group_interval: 5s
  repeat_interval: 10m
inhibit_rules:
  # Don't send warning or info if a critical is already firing
  - target_match_re:
      severity: warning|info
    source_match:
      severity: critical
    equal:
      - namespace
      - alertname
  # Don't send info if a warning is already firing
  - target_match_re:
      severity: info
    source_match:
      severity: warning
    equal:
      - namespace
      - alertname

A dictionary holding the configuration for Alertmanager.

See the OpenShift docs for available parameters.

The component will silently drop any fields in the provided config which are empty. The component treats null as empty for scalar fields.
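
A small sketch of how the empty-field handling can be used, based on the default configuration shown above. The values are illustrative.

```yaml
alertManagerConfig:
  route:
    group_wait: 30s         # overrides the default 0s
    repeat_interval: null   # treated as empty and dropped from the rendered config
```

With repeat_interval dropped, the rendered configuration no longer sets the field, so Alertmanager would fall back to its own built-in default.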

`alertManagerAutoDiscovery`

type	dictionary

default

alertManagerAutoDiscovery:
  enabled: true
  debug_config_map: false
  team_receiver_format: team_default_%s
  additional_alert_matchers: []
  prepend_routes: []
  append_routes: []

alertManagerAutoDiscovery holds the configuration for the Alertmanager auto-discovery feature.

The auto-discovery routes alerts to the configured teams based on their namespaces and the top-level syn.teams[*].instances and syn.owner parameters. Auto-discovery first creates a list of Commodore component instances by parsing the applications array using the same rules as Commodore itself (see also the Commodore component instantiation documentation). For each discovered instance, the component then reads the component’s namespace from field namespace or namespace.name in the rendered parameters of this component. Finally, routing rules are generated to route alerts from the discovered namespaces to the associated component instance’s owning team.

Without special handling, the namespace discovery would discover namespace openshift4-monitoring for component instances that use namespace: ${_instance}. This is the case because we read the instance’s namespace from the rendered parameters for component openshift4-monitoring and therefore ${_instance} resolves to openshift4-monitoring.

To address this case, the component has override logic in the namespace discovery for component instances which use ${_instance} in their namespace definition. The override logic replaces all occurrences of openshift4-monitoring in the discovered namespace with the instance name for instances other than openshift4-monitoring.
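
The override logic can be sketched with a hypothetical component instance. All names below (component postgres, instance postgres-prod, the syn- namespace prefix) are assumptions made for illustration only.

```yaml
# All names in this sketch are hypothetical.
applications:
  - postgres as postgres-prod   # instance `postgres-prod` of component `postgres`

parameters:
  postgres:
    namespace: syn-${_instance}

# Namespace as rendered for component openshift4-monitoring:
#   syn-openshift4-monitoring
# Namespace after the override logic replaces `openshift4-monitoring`
# with the instance name:
#   syn-postgres-prod
```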

syn Team Example

syn:
  owner: daring-donkeys
  teams:
    electric-elephants:
      instances: [postgres]

The auto-discovery feature is enabled by default. A ConfigMap can be enabled with debug_config_map to debug the auto-discovery feature.

The configuration is merged with the alertManagerConfig parameter. Route receivers are generated for each team based on the team_receiver_format parameter. The routes are ordered as follows:

alertManagerAutoDiscovery.prepend_routes + generated routes + alertManagerAutoDiscovery.append_routes + alertManagerConfig.routes + route all to syn.owner

additional_alert_matchers is a list of additional alert matchers to add to the generated routes. This can be used to handle special cases where the auto-discovery feature does not work as expected, for example when a label indicates that an alert should go to a different team than its namespace suggests.

alertManagerAutoDiscovery:
  additional_alert_matchers:
    - 'syn_team = ""'
# becomes
- continue: true
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: team_default_lovable-lizards
- continue: false
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: __component_openshift4_monitoring_null

`operatorRuleNamespaces`

type	list
default	`[]`

Additional namespaces to manage operator managed PrometheusRules.

`alerts`

type	dictionary

Configuration parameters that influence the resulting alert rules.

`includeNamespaces`

type	list
default	See `class/defaults.yml`

List of namespace patterns to use for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule. The component generates a regex pattern from the list by concatenating all elements into a large OR-regex. To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter excludeNamespaces.

The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.

The component doesn’t validate that the list entries are valid regex patterns.

Example

We assume that the input config has patterns default and syn.*:

includeNamespaces:
  - default
  - syn.*

The component will generate namespace selector namespace=~"(default|syn.*)" from this input configuration.
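
Because the list is processed with com.renderArray(), an entry defined at one level of the hierarchy can be dropped again at another. The sketch below assumes the same ~ prefix convention that this documentation describes for ignoreUserWorkload and customNodeExporter.collectors, and assumes that includeNamespaces lives under the alerts key.

```yaml
# One level of the hierarchy defines the patterns
alerts:
  includeNamespaces:
    - default
    - syn.*
---
# A more specific level drops `default` again
alerts:
  includeNamespaces:
    - ~default
```

Under these assumptions, only syn.* would remain in the list used to build the namespace selector.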

`excludeNamespaces`

type	list
default	`[]`

List of namespace patterns to exclude for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule. The component generates a regex pattern from the list by concatenating all elements into a large OR-regex. To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter includeNamespaces.

The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.

The component doesn’t validate that the list entries are valid regex patterns.

Example

We assume that the input config has patterns default, openshift.* and syn.* for includeNamespaces and openshift-adp for excludeNamespaces:

includeNamespaces:
  - default
  - openshift.*
  - syn.*
excludeNamespaces:
  - openshift-adp

The component will generate namespace selector namespace=~"(default|openshift.*|syn.*)",namespace!~"(openshift-adp)" from this input configuration.

`ignoreNames`

type	list
default	See `class/defaults.yml`

List of alert rule names to be dropped.

This parameter is taken into account in the filterRules and filterPatchRules library functions.

`ignoreWarnings`

type	list
default	See `class/defaults.yml`

List of alert rule names for which to drop alerts with label severity: warning.

In contrast to ignoreNames, this parameter is not taken into account in the filterRules and filterPatchRules library functions.

`ignoreGroups`

type	list
default	See `class/defaults.yml`

List of complete alert rule groups to drop.

This parameter is not taken into account for filterRules and filterPatchRules.
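
The three ignore parameters can be combined. This is a sketch only: the alert names are reused from examples elsewhere in this documentation, and the group name is hypothetical.

```yaml
alerts:
  ignoreNames:
    - KubeClientErrors      # alert rule is dropped entirely
  ignoreWarnings:
    - KubeAPIErrorsHigh     # only the rule with severity: warning is dropped
  ignoreGroups:
    - example-rule-group    # hypothetical group name, whole group is dropped
```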

`customAnnotations`

type	dict
default	`{}`

Maps alert names to sets of custom annotations. Allows configuring custom annotations for individual alerts.

Example:

customAnnotations:
  Watchdog:
    runbook_url: https://www.google.com/?q=Watchdog

`patchRules`

type	dict
default	See `class/defaults.yml` on GitHub

The parameter patchRules allows users to customize upstream alerts. The component expects alert names as top-level keys, with the alert configuration to apply as the value of each key.

See the Prometheus alerting rules documentation for extended documentation on configuring alerting rules.

Example:

patchRules:
  PrometheusRemoteWriteBehind:
    annotations:
      runbook_url: https://example.com/runbooks/PrometheusRemoteWriteBehind.html
  SystemMemoryExceedsReservation:
    for: 30m

`ignoreUserWorkload`

type	list
default	`[]`

A list of alerting rules for which the component should patch the expr and annotations.description fields to ensure they don’t alert for the user workload monitoring stack.

By default, we don’t turn off any alerts for the user workload monitoring stack.

The parameter supports removing entries by providing the entry to remove prefixed with ~. The parameter can be completely cleared with the following config:

parameters:
  openshift4_monitoring:
    alerts:
      ~ignoreUserWorkload: []
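
Removing a single entry instead of clearing the whole list could look like the following sketch. The alert names are illustrative.

```yaml
# One level of the hierarchy adds two alerts
parameters:
  openshift4_monitoring:
    alerts:
      ignoreUserWorkload:
        - PrometheusRuleFailures
        - PrometheusDuplicateTimestamps
---
# A more specific level removes one of them again
parameters:
  openshift4_monitoring:
    alerts:
      ignoreUserWorkload:
        - ~PrometheusRuleFailures
```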

`silence`

type	dict

Parameters to configure the silence CronJob.

`silence.silences`

type	dict

default

"Silence non syn alerts":
  matchers:
    - name: alertname
      value: ".+"
      isRegex: true
    - name: syn
      value: ""
      isRegex: false

Contains the list of silences to be applied. The key is used as the comment of the silence and the value is a dictionary which is passed to Alertmanager.

Silences removed from the hierarchy stay active in Alertmanager for up to 24h until they expire.

Silences all non-SYN alerts by default.
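
An additional silence follows the same structure as the default. In this sketch, the silence name and the namespace value are hypothetical.

```yaml
silence:
  silences:
    # The key is used as the comment of the silence
    "Silence alerts from the sandbox namespace":
      matchers:
        - name: namespace
          value: "sandbox"
          isRegex: false
```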

`schedule`

type	string
default	'0 */4 * * *'

Schedule of the CronJob in cron syntax.

`serviceAccountName`

type	string
default	prometheus-k8s

Name of the service account used when running the silence job. The service account must have permission to access the Alertmanager service through its OAuth proxy.

`servingCertsCABundleName`

type	string
default	serving-certs-ca-bundle

Name of the config map containing the CA bundle of the Alertmanager service.

`jobHistoryLimit`

type	dict

Parameters to configure the number of silence job objects to keep.

`failed`

type	number
default	3

Number of failed jobs to keep.

`successful`

type	number
default	3

Number of successful jobs to keep.
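
Putting the silence job parameters together, a configuration might look like the sketch below. It assumes that schedule, serviceAccountName, servingCertsCABundleName and jobHistoryLimit are all nested under the silence key. The schedule and history limits are illustrative values.

```yaml
silence:
  schedule: '30 */2 * * *'   # illustrative: minute 30 of every second hour
  serviceAccountName: prometheus-k8s
  servingCertsCABundleName: serving-certs-ca-bundle
  jobHistoryLimit:
    failed: 1
    successful: 1
```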

`capacityAlerts`

type	dict

This parameter allows users to enable and configure alerts for capacity management. The capacity alerts are enabled by default and can be disabled completely by setting the key capacityAlerts.enabled to false. Predictive alerts are disabled by default and can be enabled individually as shown below by setting ExpectClusterCpuUsageHigh.enabled to true.

The dictionary will be transformed into a PrometheusRule object by the component.

The component provides 10 alerts that are grouped in four groups. You can disable or modify each of these alert rules individually. The fields in these rules will be added to the final PrometheusRule, with the exception of expr. The expr field contains fields which can be used to tune the default alert rule. Alternatively the default rule can be completely overwritten by setting the expr.raw field (see example below). See Resource Management for an explanation for every alert rule.

Example:

capacityAlerts:
  enabled: true # (1)
  groupByNodeLabels: [] # (2)
  groups:
    PodCapacity:
      rules:
        TooManyPods:
          annotations:
            message: 'The number of pods is too damn high' # (3)
          for: 3h # (4)
        ExpectTooManyPods:
          expr: # (5)
            range: '2d'
            predict: '5*24*60*60'
    ResourceRequests:
      rules:
        TooMuchMemoryRequested:
          enabled: true
          expr:
            raw: sum(kube_pod_resource_request{resource="memory"}) > 9000*1024*1024*1024 # (6)
    CpuCapacity:
      rules:
        ClusterCpuUsageHigh:
          enabled: false # (7)
        ExpectClusterCpuUsageHigh:
          enabled: false # (7)
    UnusedCapacity:
      rules:
        ClusterHasUnusedNodes:
          enabled: false # (8)

1	Enables capacity alerts
2	List of node labels (as they show up in the `kube_node_labels` metric) by which alerts are grouped
3	Changes the alert message for the pod capacity alert
4	Only alerts for pod capacity if it fires for 3 hours
5	Change the pod count prediction to look at the last two days and predict the value in five days
6	Completely overrides the default alert rule and alerts if the total memory request is over 9000 GiB
7	Disables both CPU capacity alert rules
8	Disables alert if the cluster has unused nodes.
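
The example above disables both CPU capacity rules. For the opposite case mentioned in the description, enabling a single predictive alert, a minimal sketch could look like this:

```yaml
capacityAlerts:
  groups:
    CpuCapacity:
      rules:
        ExpectClusterCpuUsageHigh:
          enabled: true   # predictive alerts are disabled by default
```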

`rules`

type	dict
default	`{}`

This parameter allows users to configure additional Prometheus rules to deploy on the cluster.

Each key-value pair in the dictionary is transformed into a PrometheusRule object by the component.

The component expects that values are dicts themselves and expects that keys in those dicts are prefixed with record: or alert: to indicate whether the rule is a recording or alerting rule. The component will transform the keys into fields in the resulting rule by taking the prefix as the field name and the rest of the key as the field value. For example, key "record:sum:some:metric:5m" would be transformed into record: sum:some:metric:5m which should define a recording rule with name sum:some:metric:5m. This field is then merged into the provided value which should be a valid rule definition.

See the Prometheus docs for supported configurations for recording and alerting rules.

Example:

rules:
  generic-rules:
    "alert:ContainerOOMKilled":
      annotations:
        message: A container ({{$labels.container}}) in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        source: https://git.vshn.net/swisscompks/syn-tenant-repo/-/blob/master/common.yml
        severity: devnull
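
A recording rule uses the record: key prefix in the same way. In this sketch the metric some_metric_total is hypothetical, and the comments show the rule that the key transformation described above is expected to produce.

```yaml
rules:
  generic-rules:
    "record:sum:some:metric:5m":
      # `some_metric_total` is a hypothetical counter
      expr: sum(rate(some_metric_total[5m]))

# Expected resulting rule:
#   record: sum:some:metric:5m
#   expr: sum(rate(some_metric_total[5m]))
```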

Example

defaultConfig:
  nodeSelector:
    node-role.kubernetes.io/infra: ''
configs:
  prometheusK8s:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi
alerts:
  ignoreNames:
    - KubeAPIErrorsHigh
    - KubeClientErrors

`secrets`

type	dict
default	`{}`

A dict of secrets to create in the namespace. The key is the name of the secret, the value is the content of the secret. The value must be a dict with a key stringData which is a dict of key/value pairs to add to the secret.
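
As a sketch, the remote-write secret referenced by the `_remoteWrite` examples in this documentation could be declared as follows. The username value and the Vault path are hypothetical. The password uses a Commodore secret reference so that no plaintext value is stored in the hierarchy.

```yaml
secrets:
  remote-write:
    stringData:
      # Illustrative value
      username: example-user
      # Hypothetical Vault path, given as a Commodore secret reference
      password: '?{vaultkv:${cluster:tenant}/${cluster:name}/openshift4-monitoring/remote-write/password}'
```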

`remoteWriteDefaults`

type	dict

default

remoteWriteDefaults:
  cluster: {}
  userWorkload: {}

example

remoteWriteDefaults:
  cluster:
    queueConfig:
      maxShards: 80
  userWorkload:
    queueConfig:
      maxShards: 20

A dict of default remote write configurations for the Prometheus component. Those values are merged into each remote write configuration in configs.prometheusK8s.remoteWrite and configsUserWorkload.prometheus.remoteWrite.
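
To illustrate the merge, the sketch below combines a cluster default with a single remote write configuration. The second YAML document shows the entry that is expected to result. Its exact shape is an assumption, since the rendered output isn't shown in this documentation.

```yaml
remoteWriteDefaults:
  cluster:
    queueConfig:
      maxShards: 80
configs:
  prometheusK8s:
    _remoteWrite:
      example:
        url: https://prometheus.example.com/api/v1/write
---
# Expected effective entry in configs.prometheusK8s.remoteWrite (sketch)
- url: https://prometheus.example.com/api/v1/write
  queueConfig:
    maxShards: 80
```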

`cronjobs`

type	dict

A dict of arbitrary cronjobs to create in the openshift-monitoring namespace. The key is the name of the cronjob and the values are its configuration options as shown below.

`schedule`

type	string

Schedule of the CronJob in cron syntax.

`script`

type	string

The script to execute as part of the cronjob.

`image`

type	dict
default	`images.oc` from `class/defaults.yml`

`image.image`

type	string

The image used by the cronjob.

`image.tag`

type	string

The image tag used by the cronjob.

`config`

type	dict
default	`{}`

Any additional custom configuration for the cronjob.

Example

cronjobs:
  my-cronjob:
    schedule: "1 * * * *"
    image:
      image: quay.io/appuio/oc
      tag: v4.13
    script: |
      #!/bin/sh
      echo "this is an example"
    config:
      spec:
        failedJobsHistoryLimit: 1

`customNodeExporter`

This parameter allows users to deploy an additional node-exporter DaemonSet. We provide this option because OpenShift’s cluster-monitoring stack currently doesn’t allow users to customize the bundled node-exporter DaemonSet.

Currently, the parameter is tailored to allow users to run an additional node-exporter which enables collectors that aren’t enabled in the default node exporter.

The configuration is rendered by using the same Jsonnet that’s used by the OpenShift cluster-monitoring stack to generate the default node-exporter DaemonSet. The component further customizes the resulting manifests to ensure that there are no conflicts between the default node-exporter and the additional node-exporter.

The additional node-exporter is deployed in the namespace indicated by parameter namespace. By default this is namespace openshift-monitoring. The component also deploys a ServiceMonitor which ensures that the additional node-exporter is scraped by the cluster-monitoring stack’s Prometheus.

Users can configure arbitrary recording and alerting rules which use metrics scraped from the additional node-exporter via parameter rules.

`enabled`

type	bool
default	`false`

Whether to deploy the additional node-exporter.

`collectors`

type	list
default	`["network_route"]`

Which collectors to enable in the additional node-exporter. By default, all collectors are disabled. Users can remove entries from this list by providing an existing entry prefixed with ~.

`args`

type	list
default	`[]`

Additional command line arguments to pass to the additional node-exporter. Please note that specifying --[no-]collector.<name> here will break the DaemonSet, since node-exporter doesn’t support specifying these flags multiple times. Users should use parameter customNodeExporter.collectors to enable collectors.

`metricRelabelings`

type	list
default	See `class/defaults.yml`

This parameter allows users to specify the content of field metricRelabelings of the ServiceMonitor which is created for the additional node-exporter. By default, the component drops all metrics except node_network_route* metrics for host devices prefixed with ens. Since this component only applies to OpenShift 4, we know that any node’s host interfaces will use device names that are prefixed with ens.

Users are encouraged to extend or overwrite this parameter to ensure all the metrics they’re interested in are actually scraped by Prometheus.
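
A minimal sketch of enabling the additional node-exporter with one extra collector. The choice of the processes collector is illustrative.

```yaml
customNodeExporter:
  enabled: true
  collectors:
    - processes   # illustrative: enabled in addition to the default network_route
  # With the default metricRelabelings only node_network_route* metrics are
  # kept, so the metrics of the extra collector would still be dropped.
  # Extend or overwrite metricRelabelings to keep them.
```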