Parameters
The parent key for all of the following parameters is openshift4_monitoring.
manifests_version
type |
string |
default |
|
Select which version of the upstream alerting (and recording) rules should be used by the component. This parameter must be changed to match the cluster’s OCP4 minor version.
We recommend setting this parameter based on the reported OpenShift version which can be found in the cluster’s dynamic facts.
|
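A minimal sketch of setting this parameter for a cluster running OpenShift 4.11. The `release-4.11` value format is an assumption, taken from the key used in the `patchRules` example further down this page:

```yaml
parameters:
  openshift4_monitoring:
    # Assumed value format, mirroring the release-4.11 key in the patchRules example
    manifests_version: release-4.11
```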
defaultConfig
type |
dictionary |
default |
|
A dictionary holding the default configuration which should be applied to all components.
upstreamRules.networkPlugin
type |
string |
default |
|
Choose either openshift-sdn or ovn-kubernetes depending on the installed network plugin.
If a custom network plugin is used, set any other string as the value for this parameter.
This ensures neither openshift-sdn nor OVN-Kubernetes monitoring rules are deployed.
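As an illustration, a cluster using OVN-Kubernetes might be configured as sketched here. The nesting under `upstreamRules` is inferred from the parameter name:

```yaml
parameters:
  openshift4_monitoring:
    upstreamRules:
      # One of the two documented values. Any other string (for a custom
      # network plugin) results in neither rule set being deployed.
      networkPlugin: ovn-kubernetes
```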
enableAlertmanagerIsolationNetworkPolicy
type |
boolean |
default |
|
Blocks all traffic to Alertmanager pods except the allowed API traffic.
This works around an observed accidental clustering with user workload or custom Alertmanager clusters in other namespaces.
enableUserWorkloadAlertmanagerIsolationNetworkPolicy
type |
boolean |
default |
|
Blocks all traffic to Alertmanager pods except the allowed API traffic.
This works around an observed accidental clustering with system or custom Alertmanager clusters in other namespaces.
enableUserWorkload
type |
boolean |
default |
|
A parameter to enable monitoring for user-defined projects.
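A sketch of enabling user workload monitoring together with both Alertmanager isolation network policies. It assumes all three booleans are direct children of `openshift4_monitoring`, as the parent key statement at the top of the page suggests:

```yaml
parameters:
  openshift4_monitoring:
    # Enable monitoring for user-defined projects
    enableUserWorkload: true
    # Restrict traffic to the system Alertmanager pods
    enableAlertmanagerIsolationNetworkPolicy: true
    # Restrict traffic to the user workload Alertmanager pods
    enableUserWorkloadAlertmanagerIsolationNetworkPolicy: true
```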
configs
type |
dictionary |
default |
|
A dictionary holding the configurations for the monitoring components.
See the OpenShift docs for available parameters.
This table shows the monitoring components you can configure and the keys used to specify the components:
Component | Key |
---|---|
Prometheus Operator | |
Prometheus | prometheusK8s |
Alertmanager | |
kube-state-metrics | |
openshift-state-metrics | |
Grafana | |
Telemeter Client | |
Prometheus Adapter | |
Thanos Querier | |
configs.prometheusK8s._remoteWrite
type |
dictionary |
default |
|
example |
|
A dictionary holding the remote write configurations for the Prometheus component. The key is the name of the configuration, the value is the content of the configuration.
The remote write configuration will be appended to the configs.prometheusK8s.remoteWrite parameter for backwards compatibility.
In this configuration only, writeRelabelConfigs entries can hold an entry for timeseries containing a list of strings representing individual Prometheus timeseries.
These will be translated into a regex entry, with a regular expression matching any one of the listed timeseries.
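The following sketch shows how such a configuration might look. The configuration name `example`, the endpoint URL, and the listed timeseries are hypothetical. The `url`, `sourceLabels`, and `action` fields are assumed to follow the usual Prometheus remote write settings:

```yaml
parameters:
  openshift4_monitoring:
    configs:
      prometheusK8s:
        _remoteWrite:
          # "example" is a hypothetical configuration name
          example:
            # Hypothetical remote write endpoint
            url: https://metrics.example.com/api/v1/write
            writeRelabelConfigs:
              - sourceLabels: ['__name__']
                action: keep
                # Component-specific shorthand: list of timeseries names
                timeseries:
                  - up
                  - node_load1
```

Based on the description above, the `timeseries` list would presumably be rendered as a `regex` entry matching either name, for example something like `(up|node_load1)`. The exact form of the generated expression is an assumption.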
configsUserWorkload
type |
dictionary |
default |
|
A dictionary holding the configurations for the user workload monitoring components.
By default, we configure the user workload monitoring Prometheus and Alertmanager to inherit the volumeClaimTemplate specifications from the cluster-monitoring config.
This allows users to configure the default storageclass and volume size of both monitoring stacks through the cluster-monitoring config.
This table shows the monitoring components you can configure and the keys used to specify the components:
Component | Key | Note |
---|---|---|
Alertmanager | | Only on OpenShift 4.11 and newer |
Prometheus Operator | | |
Prometheus | prometheus | |
Thanos Ruler | | |
configsUserWorkload.prometheus._remoteWrite
type |
dictionary |
default |
|
example |
|
A dictionary holding the remote write configurations for the Prometheus component of the user workload monitoring stack. The key is the name of the configuration, the value is the content of the configuration.
The remote write configuration will be appended to the configsUserWorkload.prometheus.remoteWrite parameter for backwards compatibility.
alertManagerConfig
type |
dictionary |
default |
|
A dictionary holding the configuration for Alertmanager.
See the OpenShift docs for available parameters.
alerts
type |
dictionary |
Configuration parameters that influence the resulting alert rules.
includeNamespaces
type |
list |
default |
List of namespace patterns to use for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule.
The component generates a regex pattern from the list by joining all elements into a single OR-regex.
To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in the expr field of each alert rule and replaces it with the regex generated from the parameter.
The component processes the list with `com.renderArray()` to allow users to drop entries in the hierarchy.
The component doesn’t validate that the list entries are valid regex patterns.
|
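A sketch of extending the namespace patterns, assuming the parameter lives under `alerts` like the other parameters in this section. The `syn.*` pattern and the `my-legacy-namespace` entry are hypothetical. The `~` prefix for dropping an inherited entry is an assumption, based on the removal syntax documented for `ignoreUserWorkload`:

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      includeNamespaces:
        - default
        - kube-.*
        - openshift-.*
        # Hypothetical additional pattern
        - syn.*
        # Assumed removal syntax: drop an entry defined higher in the hierarchy
        - ~my-legacy-namespace
```

With this list, the generated selector would presumably look similar to `namespace=~"(default|kube-.*|openshift-.*|syn.*)"`. The exact ordering and grouping of the generated regex is an assumption.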
ignoreNames
type |
list |
default |
List of alert rule names to be dropped.
This parameter is taken into account in the filterRules and filterPatchRules library functions.
|
ignoreWarnings
type |
list |
default |
List of alert rule names for which to drop alerts with label severity: warning.
In contrast to ignoreNames, this parameter is not taken into account in the filterRules and filterPatchRules library functions.
|
ignoreGroups
type |
list |
default |
List of complete alert rule groups to drop.
This parameter is not taken into account in the filterRules and filterPatchRules library functions.
|
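The three filter parameters might be combined as in the following sketch. The alert names are reused from other examples on this page purely for illustration, and the group name is hypothetical:

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      ignoreNames:
        # Drop this alert rule entirely
        - SystemMemoryExceedsReservation
      ignoreWarnings:
        # Drop only the variant of this alert with severity: warning
        - PrometheusRemoteWriteBehind
      ignoreGroups:
        # Hypothetical rule group name; drops the complete group
        - example-rule-group
```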
customAnnotations
type |
dict |
default |
|
Maps alert names to sets of custom annotations. Allows configuring custom annotations for individual alerts.
Example:
customAnnotations:
  Watchdog:
    runbook_url: https://www.google.com/?q=Watchdog
patchRules
type |
dict |
keys |
Potential values of parameter manifests_version, and * |
default |
The parameter patchRules allows users to customize upstream alerts.
The component expects that top-level keys in the parameter correspond to values of parameter manifests_version.
Additionally, the component supports the special top-level key *.
Alert patches defined under top-level key * are applied regardless of the OpenShift 4 version specified in parameter manifests_version.
The component also applies all patches under the key which matches the value of parameter manifests_version.
If an alert is patched in both top-level key * and the top-level key matching parameter manifests_version, the patches are merged, with the version-specific patch overriding the generic patch.
The component expects alert names as keys and any alert configuration as values in each top-level key. See the Prometheus alerting rules documentation for extended documentation on configuring alerting rules.
Example:
patchRules:
  '*':
    PrometheusRemoteWriteBehind:
      annotations:
        runbook_url: https://example.com/runbooks/PrometheusRemoteWriteBehind.html
  release-4.11:
    SystemMemoryExceedsReservation:
      for: 30m
ignoreUserWorkload
type |
list |
default |
|
A list of alerting rules for which the component should patch the expr and annotations.description fields to ensure they don’t alert for the user workload monitoring stack.
By default, we don’t turn off any alerts for the user workload monitoring stack.
The parameter supports removing entries by providing the entry to remove prefixed with ~.
The parameter can be completely cleared with the following config:
parameters:
  openshift4_monitoring:
    alerts:
      ~ignoreUserWorkload: []
silence.silences
type |
dict |
default |
|
Contains the list of silences to be applied. The key is used as the comment of the silence and the value is a dictionary which is passed to Alertmanager.
Silences removed from the hierarchy stay active in Alertmanager for up to a year until they expire.
Silences all non-SYN alerts by default.
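A sketch of an additional silence. The comment text is hypothetical, and the `matchers` structure (`name`, `value`, `isRegex`) is assumed to follow the Alertmanager silence API, since the value is passed on to Alertmanager. Whether further fields such as start and end times need to be provided is not stated here:

```yaml
parameters:
  openshift4_monitoring:
    silence:
      silences:
        # The key is used as the comment of the silence
        "Silence Watchdog during migration":
          # Assumed payload shape, following the Alertmanager silence API
          matchers:
            - name: alertname
              value: Watchdog
              isRegex: false
```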
serviceAccountName
type |
string |
default |
prometheus-k8s |
Name of the service account used when running the silence job. The service account must have permission to access the Alertmanager service through its OAuth proxy.
capacityAlerts
type |
dict |
This parameter allows users to enable and configure alerts for capacity management.
The capacity alerts are enabled by default and can be disabled completely by setting the key capacityAlerts.enabled to false.
Predictive alerts are disabled by default and can be enabled individually, for example by setting ExpectClusterCpuUsageHigh.enabled to true.
The dictionary will be transformed into a PrometheusRule object by the component.
The component provides 10 alerts that are grouped in four groups.
You can disable or modify each of these alert rules individually.
The fields in these rules will be added to the final PrometheusRule, with the exception of expr.
The expr field contains fields which can be used to tune the default alert rule.
Alternatively, the default rule can be completely overwritten by setting the expr.raw field (see example below).
See Resource Management for an explanation of every alert rule.
Example:
---
capacityAlerts:
  enabled: true # (1)
  groupByNodeLabels: [] # (2)
  groups:
    PodCapacity:
      rules:
        TooManyPods:
          annotations:
            message: 'The number of pods is too damn high' # (3)
          for: 3h # (4)
        ExpectTooManyPods:
          expr: # (5)
            range: '2d'
            predict: '5*24*60*60'
    ResourceRequests:
      rules:
        TooMuchMemoryRequested:
          enabled: true
          expr:
            raw: sum(kube_pod_resource_request{resource="memory"}) > 9000*1024*1024*1024 # (6)
    CpuCapacity:
      rules:
        ClusterCpuUsageHigh:
          enabled: false # (7)
        ExpectClusterCpuUsageHigh:
          enabled: false # (7)
    UnusedCapacity:
      rules:
        ClusterHasUnusedNodes:
          enabled: false # (8)
---
(1) Enables capacity alerts
(2) List of node labels (as they show up in the `kube_node_labels` metric) by which alerts are grouped
(3) Changes the alert message for the pod capacity alert
(4) Only alerts for pod capacity if it fires for 3 hours
(5) Changes the pod count prediction to look at the last two days and predict the value in five days
(6) Completely overrides the default alert rule and alerts if the total memory request is over 9000 GB
(7) Disables both CPU capacity alert rules
(8) Disables the alert that fires if the cluster has unused nodes
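The example above disables both CPU capacity alerts. Conversely, a minimal sketch for enabling only the predictive CPU alert, reusing the group and rule names from that example, could look like this:

```yaml
capacityAlerts:
  groups:
    CpuCapacity:
      rules:
        ExpectClusterCpuUsageHigh:
          # Predictive alerts are disabled by default
          enabled: true
```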
rules
type |
dict |
default |
|
This parameter allows users to configure additional Prometheus rules to deploy on the cluster.
Each key-value pair in the dictionary is transformed into a PrometheusRule object by the component.
The component expects that values are dicts themselves and expects that keys in those dicts are prefixed with record: or alert: to indicate whether the rule is a recording or alerting rule.
The component will transform the keys into fields in the resulting rule by taking the prefix as the field name and the rest of the key as the field value.
For example, key "record:sum:some:metric:5m" would be transformed into record: sum:some:metric:5m which should define a recording rule with name sum:some:metric:5m.
This field is then merged into the provided value which should be a valid rule definition.
Example:
---
rules:
  generic-rules:
    "alert:ContainerOOMKilled":
      annotations:
        message: A container ({{$labels.container}}) in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        source: https://git.vshn.net/swisscompks/syn-tenant-repo/-/blob/master/common.yml
        severity: devnull
---
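The example above defines an alerting rule. A recording rule using the `record:` prefix might look like the following sketch, where the expression and the metric name `some_metric` are placeholders:

```yaml
rules:
  generic-rules:
    "record:sum:some:metric:5m":
      # Hypothetical expression; some_metric is a placeholder metric name
      expr: sum(rate(some_metric[5m]))
```

Following the key transformation described above, this entry should yield a rule with the fields `record: sum:some:metric:5m` and the given `expr`.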