Parameters
The parent key for all of the following parameters is openshift4_monitoring.
manifests_version
type | string |
default | |
Select which version of the upstream alerting (and recording) rules should be used by the component. This parameter must be changed to match the cluster’s OCP4 minor version.
We recommend setting this parameter based on the reported OpenShift version which can be found in the cluster’s dynamic facts.
defaultConfig
type | dictionary |
default | |
A dictionary holding the default configuration which should be applied to all components.
upstreamRules.networkPlugin
type | string |
default | |
Choose either openshift-sdn or ovn-kubernetes depending on the installed network plugin.
If a custom network plugin is used, set any other string as the value for this parameter.
This ensures neither openshift-sdn nor OVN-Kubernetes monitoring rules are deployed.
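A minimal sketch of how the two parameters above might be set for a cluster running OpenShift 4.14 with OVN-Kubernetes. The value format release-4.14 is an assumption inferred from the patchRules example later in this document; check the cluster’s dynamic facts for the actual version.

```yaml
parameters:
  openshift4_monitoring:
    # Assumption: value format inferred from the patchRules example
    manifests_version: release-4.14
    upstreamRules:
      # Use openshift-sdn instead on clusters running OpenShift SDN
      networkPlugin: ovn-kubernetes
```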
enableAlertmanagerIsolationNetworkPolicy
type | boolean |
default | |
Blocks all traffic to Alertmanager pods except the allowed API traffic.
This works around an observed accidental clustering with user workload or custom Alertmanager clusters in other namespaces.
enableUserWorkloadAlertmanagerIsolationNetworkPolicy
type | boolean |
default | |
Blocks all traffic to Alertmanager pods except the allowed API traffic.
This works around an observed accidental clustering with system or custom Alertmanager clusters in other namespaces.
enableUserWorkload
type | boolean |
default | |
A parameter to enable monitoring for user-defined projects.
configs
type | dictionary |
default | |
A dictionary holding the configurations for the monitoring components.
The component will remove empty fields (null, and empty lists or objects) from the provided configuration.
See the OpenShift docs for available parameters.
This table shows the monitoring components you can configure and the keys used to specify the components:
Component | Key |
---|---|
Prometheus Operator | |
Prometheus | |
Alertmanager | |
kube-state-metrics | |
openshift-state-metrics | |
Grafana | |
Telemeter Client | |
Prometheus Adapter | |
Thanos Querier | |
configs.prometheusK8s._remoteWrite
type | dictionary |
default | |
example | |
A dictionary holding the remote write configurations for the Prometheus component. The key is the name of the configuration, the value is the content of the configuration.
The remote write configuration will be appended to the configs.prometheusK8s.remoteWrite parameter for backwards compatibility.
In this configuration only, writeRelabelConfigs entries can hold an entry for timeseries containing a list of strings representing individual Prometheus timeseries.
These will be translated into a regex entry, with a regular expression matching any one of the listed timeseries.
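The following fragment sketches what a named remote write configuration with a timeseries list could look like. The configuration name, URL and metric names are placeholders, and url, sourceLabels and action are the standard Prometheus Operator remote write fields rather than anything specific to this component.

```yaml
parameters:
  openshift4_monitoring:
    configs:
      prometheusK8s:
        _remoteWrite:
          example-remote:  # hypothetical configuration name
            url: https://metrics.example.com/api/v1/write  # placeholder URL
            writeRelabelConfigs:
              - sourceLabels: ['__name__']
                action: keep
                # The component translates this list into a single
                # `regex` entry matching any one of the listed timeseries
                timeseries:
                  - up
                  - node_load1
```

Listing the timeseries individually keeps the hierarchy readable and avoids maintaining a long hand-written regular expression.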
configsUserWorkload
type | dictionary |
default | |
A dictionary holding the configurations for the user workload monitoring components.
By default, we configure the user workload monitoring Prometheus and Alertmanager to inherit the volumeClaimTemplate specifications from the cluster-monitoring config.
This allows users to configure the default storageclass and volume size of both monitoring stacks through the cluster-monitoring config.
This table shows the monitoring components you can configure and the keys used to specify the components:
Component | Key | Note |
---|---|---|
Alertmanager | | Only on OpenShift 4.11 and newer |
Prometheus Operator | | |
Prometheus | | |
Thanos Ruler | | |
configsUserWorkload.prometheus._remoteWrite
type | dictionary |
default | |
example | |
A dictionary holding the remote write configurations for the Prometheus component of the user workload monitoring stack. The key is the name of the configuration, the value is the content of the configuration.
The remote write configuration will be appended to the configsUserWorkload.prometheus.remoteWrite parameter for backwards compatibility.
alertManagerConfig
type | dictionary |
default | |
A dictionary holding the configuration for Alertmanager.
See the OpenShift docs for available parameters.
The component will silently drop any fields in the provided config which are empty.
The component treats null as empty for scalar fields.
alertManagerAutoDiscovery
type | dictionary |
default | |
alertManagerAutoDiscovery holds the configuration for the Alertmanager auto-discovery feature.
The auto-discovery routes alerts to the configured teams based on their namespaces and the top-level syn.teams[*].instances and syn.owner parameters.
Auto-discovery first creates a list of Commodore component instances by parsing the applications array using the same rules as Commodore itself (see also the Commodore component instantiation documentation).
For each discovered instance, the component then renders the instance parameters, and reads the component’s namespace from field namespace or namespace.name in the rendered parameters.
Finally, routing rules are generated to route alerts from the discovered namespaces to the associated component instance’s owning team.
syn Team Example
syn:
  owner: daring-donkeys
  teams:
    electric-elephants:
      instances: [postgres]
The auto-discovery feature is enabled by default.
A ConfigMap can be enabled with debug_config_map to debug the auto-discovery feature.
The configuration is merged with the alertManagerConfig parameter.
Route receivers are generated for each team based on the team_receiver_format parameter.
The routes are ordered as follows:
alertManagerAutoDiscovery.prepend_routes + generated routes + alertManagerAutoDiscovery.append_routes + alertManagerConfig.routes + route all to syn.owner
additional_alert_matchers is a list of additional alert matchers to add to the generated routes.
This can be used to handle special cases where the auto-discovery feature does not work as expected.
For example, an alert may need to go to a different team than its namespace suggests, based on a label.
alertManagerAutoDiscovery:
  additional_alert_matchers:
    - 'syn_team = ""'
# becomes
- continue: true
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: team_default_lovable-lizards
- continue: false
  matchers:
    - syn_team = ""
    - namespace =~ "my-ns"
  receiver: __component_openshift4_monitoring_null
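As a rough sketch of how the route ordering can be influenced, the fragment below enables the debug ConfigMap and places one route ahead of the generated ones. It assumes that debug_config_map takes a boolean and that prepend_routes accepts standard Alertmanager route objects. The alert name and receiver are hypothetical, and the receiver would have to be defined in alertManagerConfig.

```yaml
parameters:
  openshift4_monitoring:
    alertManagerAutoDiscovery:
      # Assumption: boolean switch for the debug ConfigMap
      debug_config_map: true
      # Evaluated before the generated per-namespace routes
      prepend_routes:
        - matchers:
            - alertname = "ExampleAlert"  # hypothetical alert name
          receiver: example-receiver      # hypothetical receiver
          continue: false
```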
alerts
type | dictionary |
Configuration parameters that influence the resulting alert rules.
includeNamespaces
type | list |
default | |
List of namespace patterns to use for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule.
The component generates a regex pattern from the list by concatenating all elements into a large OR-regex.
To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter excludeNamespaces.
The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.
The component doesn’t validate that the list entries are valid regex patterns.
excludeNamespaces
type | list |
default | |
List of namespace patterns to exclude for alerts which have namespace=~"(openshift-.*|kube-.*|default)" in the upstream rule.
The component generates a regex pattern from the list by concatenating all elements into a large OR-regex.
To inject the custom regex, the component searches for the exact string namespace=~"(openshift-.*|kube-.*|default)" in field expr of each alert rule and replaces it with the regex generated from this parameter and parameter includeNamespaces.
The component processes the list with com.renderArray() to allow users to drop entries in the hierarchy.
The component doesn’t validate that the list entries are valid regex patterns.
Example
We assume that the input config has patterns default, openshift.* and syn.* for includeNamespaces and openshift-adp for excludeNamespaces:
includeNamespaces:
- default
- openshift.*
- syn.*
excludeNamespaces:
- openshift-adp
The component will generate namespace selector namespace=~"(default|openshift.*|syn.*)",namespace!~"(openshift-adp)" from this input configuration.
ignoreNames
type | list |
default | |
List of alert rule names to be dropped.
This parameter is taken into account in the filterRules and filterPatchRules library functions.
ignoreWarnings
type | list |
default | |
List of alert rule names for which to drop alerts with label severity: warning.
In contrast to ignoreNames, this parameter is not taken into account in the filterRules and filterPatchRules library functions.
ignoreGroups
type | list |
default | |
List of complete alert rule groups to drop.
This parameter is not taken into account for filterRules and filterPatchRules.
customAnnotations
type | dict |
default | |
Maps alert names to sets of custom annotations. Allows configuring custom annotations for individual alerts.
Example:
customAnnotations:
  Watchdog:
    runbook_url: https://www.google.com/?q=Watchdog
patchRules
type | dict |
keys | potential values of parameter manifests_version and * |
default | |
The parameter patchRules allows users to customize upstream alerts.
The component expects that top-level keys in the parameter correspond to values of parameter manifests_version.
Additionally, the component supports special top-level key *.
Alert patches which are defined under top-level key * are applied regardless of the OpenShift 4 version specified in parameter manifests_version.
Additionally, the component applies all patches under the key which matches the value of parameter manifests_version.
If an alert is patched in both top-level key * and the top-level key matching parameter manifests_version, the patches are merged together, with the version-specific patch overriding the generic patch.
The component expects alert names as keys and any alert configuration as values in each top-level key. See the Prometheus alerting rules documentation for extended documentation on configuring alerting rules.
Example:
patchRules:
  '*':
    PrometheusRemoteWriteBehind:
      annotations:
        runbook_url: https://example.com/runbooks/PrometheusRemoteWriteBehind.html
  release-4.14:
    SystemMemoryExceedsReservation:
      for: 30m
ignoreUserWorkload
type | list |
default | |
A list of alerting rules for which the component should patch the expr and annotations.description fields to ensure they don’t alert for the user workload monitoring stack.
By default, we don’t turn off any alerts for the user workload monitoring stack.
The parameter supports removing entries by providing the entry to remove prefixed with ~.
The parameter can be completely cleared with the following config:
parameters:
  openshift4_monitoring:
    alerts:
      ~ignoreUserWorkload: []
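To remove a single entry instead of clearing the whole list, the ~ prefix goes on the list item rather than on the key. The sketch below assumes that an entry named ExampleAlert was added earlier in the hierarchy; the alert name is purely illustrative.

```yaml
parameters:
  openshift4_monitoring:
    alerts:
      ignoreUserWorkload:
        # Hypothetical entry set earlier in the hierarchy
        - ~ExampleAlert
```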
silence.silences
type | dict |
default | |
Contains the list of silences to be applied. The key is used as the comment of the silence and the value is a dictionary which is passed to Alertmanager.
Silences removed from the hierarchy stay active in Alertmanager for up to 24h until they expire.
Silences all non-SYN alerts by default.
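A sketch of a single silence entry, assuming the value uses the matcher format of the Alertmanager silences API (name, value, isRegex). The comment text and the alert name are hypothetical.

```yaml
parameters:
  openshift4_monitoring:
    silence:
      silences:
        # The key becomes the comment of the silence
        "Silence ExampleAlert during migration":
          matchers:
            - name: alertname
              value: ExampleAlert  # hypothetical alert name
              isRegex: false
```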
serviceAccountName
type | string |
default | prometheus-k8s |
Name of the service account used when running the silence job. The service account must have permission to access the Alertmanager service through its OAuth proxy.
capacityAlerts
type | dict |
This parameter allows users to enable and configure alerts for capacity management.
The capacity alerts are enabled by default and can be disabled completely by setting the key capacityAlerts.enabled to false.
Predictive alerts are disabled by default and can be enabled individually, for example by setting ExpectClusterCpuUsageHigh.enabled to true.
The dictionary will be transformed into a PrometheusRule object by the component.
The component provides 10 alerts that are grouped in four groups.
You can disable or modify each of these alert rules individually.
The fields in these rules will be added to the final PrometheusRule, with the exception of expr.
The expr field contains fields which can be used to tune the default alert rule.
Alternatively, the default rule can be completely overwritten by setting the expr.raw field (see example below).
See Resource Management for an explanation for every alert rule.
Example:
capacityAlerts:
  enabled: true (1)
  groupByNodeLabels: [] (2)
  groups:
    PodCapacity:
      rules:
        TooManyPods:
          annotations:
            message: 'The number of pods is too damn high' (3)
          for: 3h (4)
        ExpectTooManyPods:
          expr: (5)
            range: '2d'
            predict: '5*24*60*60'
    ResourceRequests:
      rules:
        TooMuchMemoryRequested:
          enabled: true
          expr:
            raw: sum(kube_pod_resource_request{resource="memory"}) > 9000*1024*1024*1024 (6)
    CpuCapacity:
      rules:
        ClusterCpuUsageHigh:
          enabled: false (7)
        ExpectClusterCpuUsageHigh:
          enabled: false (7)
    UnusedCapacity:
      rules:
        ClusterHasUnusedNodes:
          enabled: false (8)
1 | Enables capacity alerts |
2 | List of node labels (as they show up in the kube_node_labels metric) by which alerts are grouped |
3 | Changes the alert message for the pod capacity alert |
4 | Only alerts for pod capacity if it fires for 3 hours |
5 | Changes the pod count prediction to look at the last two days and predict the value in five days |
6 | Completely overrides the default alert rule and alerts if the total memory request is over 9000 GiB |
7 | Disables both CPU capacity alert rules |
8 | Disables alert if the cluster has unused nodes. |
rules
type | dict |
default | |
This parameter allows users to configure additional Prometheus rules to deploy on the cluster.
Each key-value pair in the dictionary is transformed into a PrometheusRule object by the component.
The component expects that values are dicts themselves and expects that keys in those dicts are prefixed with record: or alert: to indicate whether the rule is a recording or alerting rule.
The component will transform the keys into fields in the resulting rule by taking the prefix as the field name and the rest of the key as the field value.
For example, key "record:sum:some:metric:5m" would be transformed into record: sum:some:metric:5m which should define a recording rule with name sum:some:metric:5m.
This field is then merged into the provided value which should be a valid rule definition.
Example:
rules:
  generic-rules:
    "alert:ContainerOOMKilled":
      annotations:
        message: A container ({{$labels.container}}) in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOM killed
      expr: |
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        source: https://git.vshn.net/swisscompks/syn-tenant-repo/-/blob/master/common.yml
        severity: devnull
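Recording rules follow the same pattern with the record: prefix. The sketch below reuses the key from the description of this parameter; the group name and the expression are hypothetical.

```yaml
parameters:
  openshift4_monitoring:
    rules:
      example-recording-rules:  # hypothetical PrometheusRule name
        # Rendered as `record: sum:some:metric:5m` in the resulting rule
        "record:sum:some:metric:5m":
          expr: sum(rate(some_metric_total[5m]))  # hypothetical expression
```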
Example
defaultConfig:
  nodeSelector:
    node-role.kubernetes.io/infra: ''
configs:
  prometheusK8s:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi
alerts:
  ignoreNames:
    - KubeAPIErrorsHigh
    - KubeClientErrors
secrets
type | dict |
default | |
A dict of secrets to create in the namespace.
The key is the name of the secret, the value is the content of the secret.
The value must be a dict with a key stringData which is a dict of key/value pairs to add to the secret.
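A minimal sketch of a secret definition following the structure described above. The secret name, keys and values are placeholders; in practice the values would typically come from a secret reference rather than plain text.

```yaml
parameters:
  openshift4_monitoring:
    secrets:
      example-remote-write-credentials:  # hypothetical secret name
        stringData:
          username: example-user          # placeholder
          password: example-password      # placeholder, don't commit real credentials
```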
remoteWriteDefaults
type | dict |
default | |
example | |
A dict of default remote write configurations for the Prometheus component.
Those values are merged into each remote write configuration in configs.prometheusK8s.remoteWrite and configsUserWorkload.prometheus.remoteWrite.
cronjobs
type | dict |
A dict of arbitrary cronjobs to create in the openshift-monitoring namespace.
The key is the name of the cronjob and the values are its configuration options as shown below.
image
type | dict |
default | |
customNodeExporter
This parameter allows users to deploy an additional node-exporter DaemonSet. We provide this option, since OpenShift’s cluster-monitoring stack currently doesn’t allow users to customize the bundled node-exporter DaemonSet.
Currently, the parameter is tailored to allow users to run an additional node-exporter which enables collectors that aren’t enabled in the default node exporter.
The configuration is rendered by using the same Jsonnet that’s used by the OpenShift cluster-monitoring stack to generate the default node-exporter DaemonSet. The component further customizes the resulting manifests to ensure that there are no conflicts between the default node-exporter and the additional node-exporter.
The additional node-exporter is deployed in the namespace indicated by parameter namespace.
By default this is namespace openshift-monitoring.
The component also deploys a ServiceMonitor which ensures that the additional node-exporter is scraped by the cluster-monitoring stack’s Prometheus.
Users can configure arbitrary recording and alerting rules which use metrics scraped from the additional node-exporter via parameter rules.
collectors
type | list |
default | |
Which collectors to enable in the additional node-exporter.
By default, all collectors are disabled.
Users can remove entries from this list by providing an existing entry prefixed with ~.
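The fragment below sketches how the collector list might be adjusted. It assumes that an entry network_route was set earlier in the hierarchy and should be removed, and that the node-exporter processes collector should be enabled instead. Other keys the component may need for deploying the additional node-exporter aren’t shown.

```yaml
parameters:
  openshift4_monitoring:
    customNodeExporter:
      collectors:
        # Enable an additional collector
        - processes
        # Assumption: removes an entry defined earlier in the hierarchy
        - ~network_route
```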
args
type | list |
default | |
Additional command line arguments to pass to the additional node-exporter.
Please note that specifying --[no-]collector.<name> here will break the DaemonSet, since node-exporter doesn’t support specifying these flags multiple times.
Users should use parameter customNodeExporter.collectors to enable collectors.
metricRelabelings
type | list |
default | |
This parameter allows users to specify the content of field metricRelabelings of the ServiceMonitor which is created for the additional node-exporter.
By default, the component drops all metrics except node_network_route* metrics for host devices prefixed with ens.
Since this component only applies to OpenShift 4, we know that any node’s host interfaces will use device names that are prefixed with ens.
Users are encouraged to extend or overwrite this parameter to ensure all the metrics they’re interested in are actually scraped by Prometheus.