Parameters
The parent key for all of the following parameters is openshift4_monitoring.
manifests_version
| type | string |
| default | |
Select which version of the upstream alerting (and recording) rules should be used by the component. This parameter must be changed to match the cluster’s OCP4 minor version.
| We recommend setting this parameter based on the reported OpenShift version, which can be found in the cluster’s dynamic facts. |
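As a minimal sketch, the parameter could be pinned in the cluster’s configuration as shown here. The value format release-4.16 is an assumption made for illustration; check the component defaults for the exact format and choose the version that matches the cluster.

```yaml
parameters:
  openshift4_monitoring:
    # Assumed value format; must match the cluster's OCP4 minor version.
    manifests_version: release-4.16
```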
defaultConfig
| type | dictionary |
| default | |
A dictionary holding the default configuration which should be applied to all components.
The contents of this parameter aren’t applied to components nodeExporter and prometheusOperatorAdmissionWebhook which don’t support field nodeSelector.
|
enableAlertmanagerIsolationNetworkPolicy
| type | boolean |
| default | |
Blocks all traffic to Alertmanager pods except the allowed API traffic.
This works around an observed accidental clustering with user workload or custom Alertmanager clusters in other namespaces.
enableUserWorkloadAlertmanagerIsolationNetworkPolicy
| type | boolean |
| default | |
Blocks all traffic to the user workload Alertmanager pods except the allowed API traffic.
This works around an observed accidental clustering with system or custom Alertmanager clusters in other namespaces.
enableUserWorkload
| type | boolean |
| default | |
A parameter to enable monitoring for user-defined projects.
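The three boolean parameters above could be set together as in the following sketch. The values are illustrative only and say nothing about the component’s defaults.

```yaml
parameters:
  openshift4_monitoring:
    # Enable monitoring for user-defined projects.
    enableUserWorkload: true
    # Isolate the cluster-monitoring Alertmanager pods.
    enableAlertmanagerIsolationNetworkPolicy: true
    # Isolate the user workload Alertmanager pods.
    enableUserWorkloadAlertmanagerIsolationNetworkPolicy: true
```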
configs
| type | dictionary |
| default | |
A dictionary holding the configurations for the monitoring components.
The component will remove empty fields (null, and empty lists or objects) from the provided configuration.
See the OpenShift docs for available parameters.
This table shows the monitoring components you can configure and the keys used to specify the components:
| Component | Key |
|---|---|
| Prometheus Operator | |
| Prometheus Operator admission webhook | |
| Prometheus | |
| Alertmanager | |
| kube-state-metrics | |
| openshift-state-metrics | |
| Telemeter Client | |
| Metrics Server | |
| Thanos Querier | |
| Node exporter | |
| Console monitoring plugin | |
configs.prometheusK8s._remoteWrite
| type | dictionary |
| default | |
| example | |
A dictionary holding the remote write configurations for the Prometheus component. The key is the name of the configuration, the value is the content of the configuration.
The remote write configuration will be appended to the configs.prometheusK8s.remoteWrite parameter for backwards compatibility.
In this configuration only, writeRelabelConfigs entries can hold an entry for timeseries containing a list of strings representing individual Prometheus timeseries.
These will be translated into a regex entry, with a regular expression matching any one of the listed timeseries.
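A minimal sketch of such a configuration follows. The configuration name example, the URL, and the listed timeseries are hypothetical; url, writeRelabelConfigs, sourceLabels, and action are the usual remote write fields, and the exact regular expression the component generates from timeseries isn’t shown here.

```yaml
parameters:
  openshift4_monitoring:
    configs:
      prometheusK8s:
        _remoteWrite:
          # Hypothetical configuration name
          example:
            url: https://metrics.example.com/api/v1/write
            writeRelabelConfigs:
              - sourceLabels: ['__name__']
                action: keep
                # Translated by the component into a regex entry
                # matching any one of the listed timeseries.
                timeseries:
                  - up
                  - node_load1
```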
configsUserWorkload
| type | dictionary |
| default | |
A dictionary holding the configurations for the user workload monitoring components.
By default, we configure the user workload monitoring Prometheus and Alertmanager to inherit the volumeClaimTemplate specifications from the cluster-monitoring config.
This allows users to configure the default storageclass and volume size of both monitoring stacks through the cluster-monitoring config.
This table shows the monitoring components you can configure and the keys used to specify the components:
| Component | Key | Note |
|---|---|---|
| Alertmanager | | Only on OpenShift 4.11 and newer |
| Prometheus Operator | | |
| Prometheus | | |
| Thanos Ruler | | |
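If the user workload Prometheus should not inherit the volume size from the cluster-monitoring config, it could be overridden as in this sketch. The storage size is an arbitrary example value.

```yaml
parameters:
  openshift4_monitoring:
    configsUserWorkload:
      prometheus:
        # Overrides the volumeClaimTemplate inherited from the
        # cluster-monitoring config for this component only.
        volumeClaimTemplate:
          spec:
            resources:
              requests:
                storage: 50Gi
```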
configsUserWorkload.prometheus._remoteWrite
| type | dictionary |
| default | |
| example | |
A dictionary holding the remote write configurations for the Prometheus component of the user workload monitoring stack. The key is the name of the configuration, the value is the content of the configuration.
The remote write configuration will be appended to the configsUserWorkload.prometheus.remoteWrite parameter for backwards compatibility.
alertManagerConfig
| type | dictionary |
| default | |
A dictionary holding the Alertmanager configuration.
See the OpenShift docs for available parameters.
The component will silently drop any fields in the provided config which are empty.
The component treats null as empty for scalar fields.
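A minimal sketch of an Alertmanager configuration passed through this parameter is shown here. The receiver name and webhook URL are hypothetical; the field names follow the upstream Alertmanager configuration format.

```yaml
parameters:
  openshift4_monitoring:
    alertManagerConfig:
      route:
        group_by: [namespace]
        group_wait: 30s
        receiver: default
      receivers:
        # Hypothetical receiver
        - name: default
          webhook_configs:
            - url: https://alerts.example.com/hook
```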
alertManagerAutoDiscovery
| type | dictionary |
| default | |
alertManagerAutoDiscovery holds the configuration for the Alertmanager auto-discovery feature.
The auto-discovery routes alerts to the configured teams based on their namespaces and the top-level syn.teams[*].instances and syn.owner parameters.
Auto-discovery first creates a list of Commodore component instances by parsing the applications array using the same rules as Commodore itself (see also the Commodore component instantiation documentation).
For each discovered instance, the component then reads the component’s namespace from field namespace or namespace.name in the rendered parameters of this component.
Finally, routing rules are generated to route alerts from the discovered namespaces to the associated component instance’s owning team.
|
Without special handling, the namespace discovery would discover namespace To address this case, the component has override logic in the namespace discovery for component instances which use |
syn Team Example
syn:
  owner: daring-donkeys
  teams:
    electric-elephants:
      instances: [postgres]
The auto-discovery feature is enabled by default.
A ConfigMap can be enabled with debug_config_map to debug the auto-discovery feature.
The configuration is merged with the alertManagerConfig parameter.
Route receivers are generated for each team based on the team_receiver_format parameter.
The routes are ordered as follows:
alertManagerAutoDiscovery.prepend_routes + generated routes + alertManagerAutoDiscovery.append_routes + alertManagerConfig.routes + route all to syn.owner
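As a sketch of how the ordering can be used, the following adds one route before and one after the generated routes. The receiver names heartbeat and low-priority are hypothetical and would have to be defined in alertManagerConfig.

```yaml
parameters:
  openshift4_monitoring:
    alertManagerAutoDiscovery:
      # Evaluated before the generated team routes
      prepend_routes:
        - matchers:
            - alertname = "Watchdog"
          receiver: heartbeat      # hypothetical receiver
      # Evaluated after the generated team routes
      append_routes:
        - matchers:
            - severity = "info"
          receiver: low-priority   # hypothetical receiver
```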
additional_alert_matchers is a list of additional alert matchers to add to the generated routes.
This can be used to handle special cases where the auto-discovery feature does not work as expected.
For example, an alert may need to go to a different team than its namespace suggests, based on a label.
alertManagerAutoDiscovery:
  additional_alert_matchers:
    - 'syn_team = ""'
# becomes
- continue: true
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: team_default_lovable-lizards
- continue: false
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: __component_openshift4_monitoring_null
operatorRuleNamespaces
| type | list |
| default | |
Additional namespaces to manage operator managed PrometheusRules.
alerts
| type | dictionary |
Configuration parameters that influence the resulting alert rules.
includeNamespaces
| type | list |
| default | |
List of namespace patterns to use for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule.
The component generates a regex pattern from the list by concatenating all elements into a large OR-regex.
To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter excludeNamespaces.
The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.
| The component doesn’t validate that the list entries are valid regex patterns. |
excludeNamespaces
| type | list |
| default | |
List of namespace patterns to exclude for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule.
The component generates a regex pattern from the list by concatenating all elements into a large OR-regex.
To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter includeNamespaces.
The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.
| The component doesn’t validate that the list entries are valid regex patterns. |
Example
We assume that the input config has the patterns default, openshift.* and syn.* for includeNamespaces, and openshift-adp for excludeNamespaces:
includeNamespaces:
  - default
  - openshift.*
  - syn.*
excludeNamespaces:
  - openshift-adp
The component will generate the namespace selector namespace=~"(default|openshift.*|syn.*)",namespace!~"(openshift-adp)" from this input configuration.
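Because both lists are processed with com.renderArray(), an entry inherited from higher up in the hierarchy can presumably be dropped by repeating it with a ~ prefix. This sketch assumes that convention, and the pattern my-platform-.* is hypothetical.

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      includeNamespaces:
        # Assumed: removes the inherited entry 'syn.*'
        - '~syn.*'
        # Hypothetical additional pattern
        - 'my-platform-.*'
      excludeNamespaces:
        - openshift-adp
```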
ignoreNames
| type | list |
| default | |
List of alert rule names to be dropped.
This parameter is taken into account in the filterRules and filterPatchRules library functions.
|
ignoreWarnings
| type | list |
| default | |
List of alert rule names for which to drop alerts with label severity: warning.
In contrast to ignoreNames, this parameter is not taken into account in the filterRules and filterPatchRules library functions.
|
ignoreGroups
| type | list |
| default | |
List of complete alert rule groups to drop.
This parameter is not taken into account for filterRules and filterPatchRules.
|
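The three ignore parameters could be combined as in the following sketch. KubeAPIErrorsHigh is taken from the example elsewhere in this document; the other alert and group names are hypothetical.

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      # Drop these alert rules entirely
      ignoreNames:
        - KubeAPIErrorsHigh
      # Drop only the severity: warning variant of these alerts
      ignoreWarnings:
        - ExampleAlertWithWarningSeverity   # hypothetical alert name
      # Drop complete rule groups
      ignoreGroups:
        - example-rule-group                # hypothetical group name
```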
customAnnotations
| type | dict |
| default | |
Maps alert names to sets of custom annotations. Allows configuring custom annotations for individual alerts.
Example:
customAnnotations:
  Watchdog:
    runbook_url: https://www.google.com/?q=Watchdog
patchRules
| type | dict |
| default | |
The parameter patchRules allows users to customize upstream alerts.
The component expects the top-level keys in the parameter to be alert names, and each value to hold the alert rule fields to customize. See the Prometheus alerting rules documentation for details on configuring alerting rules.
Example:
patchRules:
  PrometheusRemoteWriteBehind:
    annotations:
      runbook_url: https://example.com/runbooks/PrometheusRemoteWriteBehind.html
  SystemMemoryExceedsReservation:
    for: 30m
ignoreUserWorkload
| type | list |
| default | |
A list of alerting rules for which the component should patch the expr and annotations.description fields to ensure they don’t alert for the user workload monitoring stack.
By default, we don’t turn off any alerts for the user workload monitoring stack.
The parameter supports removing entries by providing the entry to remove prefixed with ~.
The parameter can be completely cleared with the following config:
parameters:
  openshift4_monitoring:
    alerts:
      ~ignoreUserWorkload: []
silence.silences
| type | dict |
| default | |
Contains the list of silences to be applied. The key is used as the comment of the silence and the value is a dictionary which is passed to Alertmanager.
Silences removed from the hierarchy stay active in Alertmanager for up to 24h until they expire.
Silences all non-SYN alerts by default.
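A single silence might be declared as in the following sketch. The key and the alert name are hypothetical, and the matchers structure assumes the Alertmanager silence format (name, value, isRegex); which further fields the component sets on its own isn’t covered here.

```yaml
parameters:
  openshift4_monitoring:
    silence:
      silences:
        # The key is used as the comment of the silence
        "Silence ExampleAlert during migration":
          matchers:
            - name: alertname
              value: ExampleAlert   # hypothetical alert name
              isRegex: false
```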
serviceAccountName
| type | string |
| default | prometheus-k8s |
Name of the service account used when running the silence job. The service account must have permission to access the Alertmanager service through its OAuth proxy.
capacityAlerts
| type | dict |
This parameter allows users to enable and configure alerts for capacity management.
The capacity alerts are enabled by default and can be disabled completely by setting the key capacityAlerts.enabled to false.
Predictive alerts are disabled by default and can be enabled individually, for example by setting ExpectClusterCpuUsageHigh.enabled to true.
The dictionary will be transformed into a PrometheusRule object by the component.
The component provides 10 alerts that are grouped in four groups.
You can disable or modify each of these alert rules individually.
The fields in these rules will be added to the final PrometheusRule, with the exception of expr.
The expr field contains fields which can be used to tune the default alert rule.
Alternatively the default rule can be completely overwritten by setting the expr.raw field (see example below).
See Resource Management for an explanation for every alert rule.
Example:
capacityAlerts:
  enabled: true (1)
  groupByNodeLabels: [] (2)
  groups:
    PodCapacity:
      rules:
        TooManyPods:
          annotations:
            message: 'The number of pods is too damn high' (3)
          for: 3h (4)
        ExpectTooManyPods:
          expr: (5)
            range: '2d'
            predict: '5*24*60*60'
    ResourceRequests:
      rules:
        TooMuchMemoryRequested:
          enabled: true
          expr:
            raw: sum(kube_pod_resource_request{resource="memory"}) > 9000*1024*1024*1024 (6)
    CpuCapacity:
      rules:
        ClusterCpuUsageHigh:
          enabled: false (7)
        ExpectClusterCpuUsageHigh:
          enabled: false (7)
    UnusedCapacity:
      rules:
        ClusterHasUnusedNodes:
          enabled: false (8)
| 1 | Enables capacity alerts |
| 2 | List of node labels (as they show up in the kube_node_labels metric) by which alerts are grouped |
| 3 | Changes the alert message for the pod capacity alert |
| 4 | Only alerts for pod capacity if it fires for 3 hours |
| 5 | Change the pod count prediction to look at the last two days and predict the value in five days |
| 6 | Completely overrides the default alert rule and alerts if the total memory request is over 9000 GiB |
| 7 | Disables both CPU capacity alert rules |
| 8 | Disables alert if the cluster has unused nodes. |
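To enable one of the predictive alerts, which are disabled by default, a configuration along the lines of this sketch should suffice. It reuses the group and rule names from the example above.

```yaml
parameters:
  openshift4_monitoring:
    capacityAlerts:
      groups:
        CpuCapacity:
          rules:
            # Predictive alert, disabled by default
            ExpectClusterCpuUsageHigh:
              enabled: true
```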
rules
| type | dict |
| default | |
This parameter allows users to configure additional Prometheus rules to deploy on the cluster.
Each key-value pair in the dictionary is transformed into a PrometheusRule object by the component.
The component expects that values are dicts themselves and expects that keys in those dicts are prefixed with record: or alert: to indicate whether the rule is a recording or alerting rule.
The component will transform the keys into fields in the resulting rule by taking the prefix as the field name and the rest of the key as the field value.
For example, key "record:sum:some:metric:5m" would be transformed into record: sum:some:metric:5m which should define a recording rule with name sum:some:metric:5m.
This field is then merged into the provided value which should be a valid rule definition.
Example:
rules:
  generic-rules:
    "alert:ContainerOOMKilled":
      annotations:
        message: A container ({{$labels.container}}) in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        source: https://git.vshn.net/swisscompks/syn-tenant-repo/-/blob/master/common.yml
        severity: devnull
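A recording rule follows the same pattern with the record: prefix. In this sketch the key is the one used in the explanation above, while the metric some_metric_total and the expression are hypothetical.

```yaml
parameters:
  openshift4_monitoring:
    rules:
      generic-rules:
        # Becomes a rule with field `record: sum:some:metric:5m`
        "record:sum:some:metric:5m":
          expr: sum(rate(some_metric_total[5m]))   # hypothetical metric
```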
Example
defaultConfig:
  nodeSelector:
    node-role.kubernetes.io/infra: ''
configs:
  prometheusK8s:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi
alerts:
  ignoreNames:
    - KubeAPIErrorsHigh
    - KubeClientErrors
secrets
| type | dict |
| default | |
A dict of secrets to create in the namespace.
The key is the name of the secret, the value is the content of the secret.
The value must be a dict with a key stringData which is a dict of key/value pairs to add to the secret.
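A secret in the described shape could look like the following sketch. The secret name, keys, and values are hypothetical placeholders; real credentials should come from the secret management in use rather than plain text.

```yaml
parameters:
  openshift4_monitoring:
    secrets:
      # Name of the secret (hypothetical)
      remote-write-credentials:
        stringData:
          username: example-user
          password: example-password   # placeholder only
```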
remoteWriteDefaults
| type | dict |
| default | |
| example | |
A dict of default remote write configurations for the Prometheus component.
Those values are merged into each remote write configuration in configs.prometheusK8s.remoteWrite and configsUserWorkload.prometheus.remoteWrite.
cronjobs
| type | dict |
A dict of arbitrary cronjobs to create in the openshift-monitoring namespace.
The key is the name of the cronjob and the values are its configuration options as shown below.
image
| type | dict |
| default | |
customNodeExporter
This parameter allows users to deploy an additional node-exporter DaemonSet. We provide this option, since OpenShift’s cluster-monitoring stack currently doesn’t allow users to customize the bundled node-exporter DaemonSet.
Currently, the parameter is tailored to allow users to run an additional node-exporter which enables collectors that aren’t enabled in the default node exporter.
The configuration is rendered by using the same Jsonnet that’s used by the OpenShift cluster-monitoring stack to generate the default node-exporter DaemonSet. The component further customizes the resulting manifests to ensure that there are no conflicts between the default node-exporter and the additional node-exporter.
The additional node-exporter is deployed in the namespace indicated by parameter namespace.
By default this is namespace openshift-monitoring.
The component also deploys a ServiceMonitor which ensures that the additional node-exporter is scraped by the cluster-monitoring stack’s Prometheus.
Users can configure arbitrary recording and alerting rules which use metrics scraped from the additional node-exporter via parameter rules.
collectors
| type | list |
| default | |
Which collectors to enable in the additional node-exporter.
By default, all collectors are disabled.
Users can remove entries from this list by providing an existing entry prefixed with ~.
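As a sketch, a collector could be added and another removed as shown here. Both collector names are chosen for illustration only; whether they appear in the component’s default list is an assumption.

```yaml
parameters:
  openshift4_monitoring:
    customNodeExporter:
      collectors:
        # Enable an additional collector (illustrative choice)
        - ethtool
        # Remove an entry defined higher up in the hierarchy (illustrative)
        - ~network_route
```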
args
| type | list |
| default | |
Additional command line arguments to pass to the additional node-exporter.
Please note that specifying --[no-]collector.<name> here will break the DaemonSet, since node-exporter doesn’t support specifying these flags multiple times.
Users should use parameter customNodeExporter.collectors to enable collectors.
metricRelabelings
| type | list |
| default | |
This parameter allows users to specify the content of field metricRelabelings of the ServiceMonitor which is created for the additional node-exporter.
By default, the component drops all metrics except node_network_route* metrics for host devices prefixed with ens.
Since this component only applies to OpenShift 4, we know that any node’s host interfaces will use device names that are prefixed with ens.
Users are encouraged to extend or overwrite this parameter to ensure all the metrics they’re interested in are actually scraped by Prometheus.
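A relabeling that keeps only a chosen set of metrics might look like this sketch. The regular expression is hypothetical, the field names follow the ServiceMonitor metricRelabelings format, and whether the entry extends or replaces the default list depends on how the parameter is merged in the hierarchy.

```yaml
parameters:
  openshift4_monitoring:
    customNodeExporter:
      metricRelabelings:
        # Keep only metrics whose name matches the regex; drop the rest.
        - action: keep
          sourceLabels: ['__name__']
          regex: 'node_network_route.*'   # hypothetical selection
```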