OpenShift Cluster Monitoring
OpenShift Monitoring is composed of Platform Monitoring and User Workload Monitoring.
The official OpenShift documentation contains all user-facing information such as usage and configuration.
Please refer to the Alerting Consistency OpenShift enhancement proposal for the recommendations applying to OCP built-in alerting rules.
The enhancement proposal mentioned above states the following for OCP built-in alerts:
Alerts SHOULD include a namespace label indicating the source of the alert.
Unfortunately this isn’t something that we can verify by static analysis because the namespace label can come from the PromQL result or be added statically. Nevertheless we can still use the Telemetry data to identify OCP alerts that don’t respect this statement.
First, create an OCP cluster from the latest stable release. Once it is installed, run this command to return the list of all OCP built-in alert names:
curl -sk -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" \
https://$(oc get routes -n openshift-monitoring thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/rules \
| jq -cr '.data.groups | map(.rules) | flatten | map(select(.type =="alerting")) | map(.name) | unique |join("|")'
Then from https://telemeter-lts.datahub.redhat.com, retrieve the list of all alerts matching the names that fired without a namespace label, grouped by minor release:
count by (alertname,version) (
  alerts{alertname=~"<insert the list of names returned by the previous command>",namespace=""} *
  on(_id) group_left(version) max by(_id, version) (
    label_replace(id_version_ebs_account_internal:cluster_subscribed{version=~"4.\\d\\d.*"}, "version", "$1", "version", "^(4.\\d+).*$")
  )
)
You should now trace the non-compliant alerts back to their component of origin and file bugs against them (example).
The exercise should be done at regular intervals, at least once per release cycle.
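For a quick spot check on a single cluster (which complements the Telemetry query but does not replace it), the same "missing namespace label" test can be applied to the alerts currently reported by the Prometheus `/api/v1/alerts` endpoint. The following is a minimal sketch under that assumption; the function name and the sample alert names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// alertsResponse mirrors only the subset of the Prometheus /api/v1/alerts
// response that this sketch needs.
type alertsResponse struct {
	Data struct {
		Alerts []struct {
			Labels map[string]string `json:"labels"`
		} `json:"alerts"`
	} `json:"data"`
}

// alertsWithoutNamespace returns the sorted, de-duplicated names of the
// alerts whose "namespace" label is missing or empty.
func alertsWithoutNamespace(body []byte) ([]string, error) {
	var resp alertsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding alerts response: %w", err)
	}
	seen := map[string]struct{}{}
	for _, a := range resp.Data.Alerts {
		if a.Labels["namespace"] == "" {
			seen[a.Labels["alertname"]] = struct{}{}
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	// Hypothetical payload: the alert names are made up for illustration.
	sample := []byte(`{"status":"success","data":{"alerts":[
		{"labels":{"alertname":"ExampleWithNamespace","namespace":"openshift-example"}},
		{"labels":{"alertname":"ExampleWithoutNamespace"}}
	]}}`)
	names, err := alertsWithoutNamespace(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

This only sees alerts that are pending or firing at the time of the request, which is why the fleet-wide Telemetry query remains the reference.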
This document explains how to ingest metrics into the OpenShift Platform monitoring stack. It applies only to the OCP core components and Red Hat certified operators.
For user application monitoring, please refer to the official OCP documentation.
This document is intended for OpenShift developers that want to expose Prometheus metrics from their operators and operands. Readers should be familiar with the architecture of the OpenShift cluster monitoring stack.
Prometheus is a monitoring system that pulls metrics over HTTP, meaning that monitored targets need to expose an HTTP endpoint (usually /metrics) which will be queried by Prometheus at regular intervals (typically every 30 seconds).
To avoid leaking sensitive information to potential attackers, all OpenShift components scraped by the in-cluster monitoring Prometheus should follow these requirements:
As described in the Client certificate scraping enhancement proposal, we recommend that the components rely on client TLS certificates for authentication/authorization. This is more efficient and robust than using bearer tokens because token-based authn/authz adds a dependency (and additional load) on the Kubernetes API.
To this end, the Cluster Monitoring Operator provisions a TLS client certificate for the in-cluster Prometheus. The client certificate is issued for the system:serviceaccount:openshift-monitoring:prometheus-k8s Common Name (CN) and signed by the kubernetes.io/kube-apiserver-client signer. The certificate can be verified using the certificate authority (CA) bundle located at the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap.
In practice, the Cluster Monitoring Operator (CMO) creates a CertificateSigningRequest object for the prometheus-k8s service account which is automatically approved by the cluster-policy-controller. Once the certificate is issued by the controller, CMO provisions a secret named metrics-client-certs which contains the TLS certificate and key (respectively under the tls.crt and tls.key keys in the secret). CMO also rotates the certificate before it expires.
There are several options available depending on which framework your component is built with.
If your component already relies on *ControllerCommandConfig from github.com/openshift/library-go/pkg/controller/controllercmd, it should automatically expose a TLS-secured /metrics endpoint which has a hardcoded authorizer for the system:serviceaccount:openshift-monitoring:prometheus-k8s service account (link).
Example: the Cluster Kubernetes API Server Operator.
The “simplest” option when the component doesn’t rely on github.com/openshift/library-go (and switching to library-go isn’t an option) is to run a kube-rbac-proxy sidecar in the same pod as the application being monitored.
Here is an example of a container definition to be added to the Pod template of the Deployment (or DaemonSet):
- args:
  - --secure-listen-address=0.0.0.0:8443
  - --upstream=http://127.0.0.1:8081
  - --config-file=/etc/kube-rbac-proxy/config.yaml
  - --tls-cert-file=/etc/tls/private/tls.crt
  - --tls-private-key-file=/etc/tls/private/tls.key
  - --client-ca-file=/etc/tls/client/client-ca-file
  - --logtostderr=true
  - --allow-paths=/metrics
  image: quay.io/brancz/kube-rbac-proxy:v0.11.0 # usually replaced by CVO by the OCP kube-rbac-proxy image reference.
  name: kube-rbac-proxy
  ports:
  - containerPort: 8443
    name: metrics
  resources:
    requests:
      cpu: 1m
      memory: 15Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
  terminationMessagePolicy: FallbackToLogsOnError
  volumeMounts:
  - mountPath: /etc/kube-rbac-proxy
    name: secret-kube-rbac-proxy-metric
    readOnly: true
  - mountPath: /etc/tls/private
    name: secret-kube-rbac-proxy-tls
    readOnly: true
  - mountPath: /etc/tls/client
    name: metrics-client-ca
    readOnly: true
[...]
- volumes:
  # Secret created by the service CA operator.
  # We assume that the Kubernetes service exposing the application's pods has the
  # "service.beta.openshift.io/serving-cert-secret-name: kube-rbac-proxy-tls"
  # annotation.
  - name: secret-kube-rbac-proxy-tls
    secret:
      secretName: kube-rbac-proxy-tls
  # Secret containing the kube-rbac-proxy configuration (see below).
  - name: secret-kube-rbac-proxy-metric
    secret:
      secretName: secret-kube-rbac-proxy-metric
  # ConfigMap containing the CA used to verify the client certificate.
  - name: metrics-client-ca
    configMap:
      name: metrics-client-ca
Note: The metrics-client-ca ConfigMap needs to be created by your component and synced from the kube-system/extension-apiserver-authentication ConfigMap.
Here is a Secret containing the kube-rbac-proxy’s configuration (it allows only HTTPS requests to the /metrics endpoint for the Prometheus service account):
apiVersion: v1
kind: Secret
metadata:
  name: secret-kube-rbac-proxy-metric
  namespace: openshift-example
stringData:
  config.yaml: |-
    "authorization":
      "static":
      - "path": "/metrics"
        "resourceRequest": false
        "user":
          "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s"
        "verb": "get"
type: Opaque
Example: node-exporter from the Cluster Monitoring operator.
Starting with v0.16.0, the controller-runtime framework provides a way to expose and secure a /metrics endpoint using TLS with minimal effort.
Refer to https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/metrics/server for details about TLS configuration and check the next section to understand how it needs to be configured.
You don’t use library-go, controller-runtime >= v0.16.0 or don’t want to run a kube-rbac-proxy sidecar.
In such situations, you need to implement your own HTTPS server for /metrics. As explained before, it needs to require and verify the TLS client certificate using the root CA stored under the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap.
In practice, the server should:
Set the ClientAuth field of its TLS configuration to RequireAndVerifyClientCert.
Example: https://github.com/openshift/cluster-monitoring-operator/pull/1870
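As an illustration of the client certificate requirement, here is a minimal sketch of such a server using only the Go standard library. It is not the implementation from the pull request above. The listen address, the command-line arguments, the placeholder handler body and the explicit check on the certificate's Common Name are assumptions of this sketch, and it does not reload the CA bundle or the serving certificate when they are rotated.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
)

// prometheusCN is the Common Name of the client certificate issued to the
// in-cluster Prometheus.
const prometheusCN = "system:serviceaccount:openshift-monitoring:prometheus-k8s"

// newMetricsTLSConfig builds a server-side TLS configuration that rejects any
// client that does not present a certificate signed by a CA in clientCAPEM.
func newMetricsTLSConfig(clientCAPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(clientCAPEM) {
		return nil, errors.New("no CA certificate found in the client CA bundle")
	}
	return &tls.Config{
		ClientCAs:  pool,
		ClientAuth: tls.RequireAndVerifyClientCert,
		MinVersion: tls.VersionTLS12,
	}, nil
}

// authorized reports whether the verified client certificate was issued to
// the Prometheus service account (an assumption of this sketch).
func authorized(r *http.Request) bool {
	if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
		return false
	}
	return r.TLS.PeerCertificates[0].Subject.CommonName == prometheusCN
}

// run serves /metrics over HTTPS until the server fails.
func run(caFile, certFile, keyFile string) error {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return fmt.Errorf("reading client CA bundle: %w", err)
	}
	cfg, err := newMetricsTLSConfig(caPEM)
	if err != nil {
		return err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		if !authorized(r) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		// Placeholder: a real component would serve its metrics registry here.
		fmt.Fprintln(w, "# metrics go here")
	})
	srv := &http.Server{Addr: ":8443", Handler: mux, TLSConfig: cfg}
	return srv.ListenAndServeTLS(certFile, keyFile)
}

func main() {
	if len(os.Args) != 4 {
		fmt.Fprintln(os.Stderr, "usage: metrics-server <client-ca-file> <tls.crt> <tls.key>")
		return
	}
	if err := run(os.Args[1], os.Args[2], os.Args[3]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The three arguments would typically point at the mounted copy of the client-ca-file bundle and at the serving certificate and key provisioned by the service CA operator.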
To tell Prometheus to scrape the metrics from your component, you need to create a Service object and a ServiceMonitor object.
The Service object looks like this:
apiVersion: v1
kind: Service
metadata:
  annotations:
    # This annotation tells the service CA operator to provision a Secret
    # holding the certificate + key to be mounted in the pods.
    # The Secret name is "<annotation value>" (e.g. "tls-my-app-tls").
    service.beta.openshift.io/serving-cert-secret-name: tls-my-app-tls
  labels:
    app.kubernetes.io/name: my-app
  name: metrics
  namespace: openshift-example
spec:
  ports:
  - name: metrics
    port: 8443
    targetPort: metrics
  selector:
    app.kubernetes.io/name: my-app
  type: ClusterIP
Then the ServiceMonitor object looks like:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: openshift-example
spec:
  endpoints:
  - interval: 30s
    # Matches the name of the service's port.
    port: metrics
    scheme: https
    tlsConfig:
      # The CA file used by Prometheus to verify the server's certificate.
      # It's the cluster's CA bundle from the service CA operator.
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      # The name of the server (CN) in the server's certificate.
      # It follows the "<service name>.<namespace>.svc" pattern.
      serverName: metrics.openshift-example.svc
      # The client's certificate file used by Prometheus when scraping the metrics.
      # This file is located in the Prometheus container.
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      # The client's key file used by Prometheus when scraping the metrics.
      # This file is located in the Prometheus container.
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  selector:
    # Select all Services in the same namespace that have the `app.kubernetes.io/name: my-app` label.
    matchLabels:
      app.kubernetes.io/name: my-app
This is a collection of resources answering frequently asked questions about configuring and debugging the in-cluster monitoring stack. In particular, it applies to two OpenShift projects: Platform Monitoring (PM) and User Workload Monitoring (UWM).
Both PM and UWM monitoring stacks rely on the ServiceMonitor and PodMonitor custom resources in order to tell Prometheus which endpoints to scrape.
The examples below show the namespace openshift-monitoring, which can be replaced with openshift-user-workload-monitoring when dealing with UWM.
A detailed description of how the resources are linked exists here, but we will walk through some common issues to debug the case of missing metrics.
The serviceMonitorSelector in the Prometheus CR matches the key in the ServiceMonitor labels.
The Service you want to scrape must have an explicitly named port.
The ServiceMonitor must reference the port by this name.
The selector of the ServiceMonitor must match an existing Service.
Assuming these criteria are met but the metrics don’t exist, we can try to debug the cause.
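The Service-related items of this checklist can be expressed as a small consistency check. The sketch below uses simplified, hypothetical stand-ins for the Service and ServiceMonitor objects rather than the real Kubernetes API types, and it does not cover the serviceMonitorSelector of the Prometheus CR.

```go
package main

import (
	"fmt"
	"sort"
)

// service is a simplified stand-in for a Kubernetes Service.
type service struct {
	Labels    map[string]string
	PortNames []string // names of spec.ports[]; "" for an unnamed port
}

// serviceMonitor is a simplified stand-in for a ServiceMonitor with a single
// endpoint.
type serviceMonitor struct {
	MatchLabels map[string]string // spec.selector.matchLabels
	Port        string            // spec.endpoints[].port
}

// problems lists the reasons why sm would not select a named port of svc.
// An empty result means the checks covered by this sketch pass.
func problems(sm serviceMonitor, svc service) []string {
	var out []string
	for k, v := range sm.MatchLabels {
		if svc.Labels[k] != v {
			out = append(out, fmt.Sprintf("selector label %s=%s is not set on the Service", k, v))
		}
	}
	found := false
	for _, p := range svc.PortNames {
		if p != "" && p == sm.Port {
			found = true
		}
	}
	if !found {
		out = append(out, fmt.Sprintf("the Service has no port named %q", sm.Port))
	}
	sort.Strings(out) // deterministic output regardless of map iteration order
	return out
}

func main() {
	svc := service{
		Labels:    map[string]string{"app.kubernetes.io/name": "my-app"},
		PortNames: []string{"metrics"},
	}
	sm := serviceMonitor{
		MatchLabels: map[string]string{"app.kubernetes.io/name": "my-app"},
		Port:        "web", // deliberately wrong to show a reported problem
	}
	for _, p := range problems(sm, svc) {
		fmt.Println(p)
	}
}
```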
There is a possibility Prometheus has not loaded the configuration yet. The following metrics will help to determine if that is in fact the case or if there are errors in the configuration:
prometheus_config_last_reload_success_timestamp_seconds
prometheus_config_last_reload_successful
If there are errors with reloading the configuration, it is likely the configuration itself is invalid and examining the logs will highlight this.
oc logs -n openshift-monitoring prometheus-k8s-0 -c <container-name>
Assuming that the reload succeeded, Prometheus should see the configuration:
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/status/config | grep "<service-monitor-name>"
If the ServiceMonitor does not exist in the output, the next step would be to investigate the logs of both prometheus and the prometheus-operator for errors.
Assuming it does exist then we know prometheus-operator is doing its job. Double check the ServiceMonitor definition.
Check the service discovery endpoint to ensure Prometheus can discover the target. It will need the appropriate RBAC to do so. An example can be found here.
In the past, we have seen cases where the TargetDown alert was firing when all endpoints appeared to be up. The following commands fetch some useful metrics to help identify the cause.
As the alert fires, get the list of active targets in Prometheus:
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl 'http://localhost:9090/api/v1/targets?state=active' > targets.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl 'http://localhost:9090/api/v1/targets?state=active' > targets.prometheus-k8s-1.json
Report all targets that Prometheus couldn’t connect to, along with the reason (timeout, refused, ...).
A dialer_name can be passed as a label to limit the query to interesting components. For example {dialer_name=~".+openshift-.*"}.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-1.json
Identify targets that are slow to serve metrics and may be considered as down.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-1.json
Often, when “high” CPU usage or spikes are identified it can be a symptom of expensive rules.
A good place to start the investigation is the /rules endpoint of Prometheus: analyse any queries which might contribute to the problem by identifying excessive rule evaluation times.
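To make that inspection concrete, here is a minimal sketch that ranks rules by evaluation time, assuming the JSON returned by the Prometheus `/api/v1/rules` API (which reports an `evaluationTime` in seconds for each rule) has been saved locally. The function name and the sample rule names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// rule holds the fields of a rule entry that this sketch uses.
type rule struct {
	Name           string  `json:"name"`
	Query          string  `json:"query"`
	EvaluationTime float64 `json:"evaluationTime"` // seconds
}

// rulesResponse mirrors the subset of the /api/v1/rules response needed here.
type rulesResponse struct {
	Data struct {
		Groups []struct {
			Name  string `json:"name"`
			Rules []rule `json:"rules"`
		} `json:"groups"`
	} `json:"data"`
}

// slowestRules returns at most n rules, ordered from the slowest to the
// fastest last evaluation.
func slowestRules(body []byte, n int) ([]rule, error) {
	var resp rulesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding rules response: %w", err)
	}
	var all []rule
	for _, g := range resp.Data.Groups {
		all = append(all, g.Rules...)
	}
	sort.SliceStable(all, func(i, j int) bool {
		return all[i].EvaluationTime > all[j].EvaluationTime
	})
	if n < 0 {
		n = 0
	}
	if n < len(all) {
		all = all[:n]
	}
	return all, nil
}

func main() {
	// Hypothetical payload with made-up rule names and timings.
	sample := []byte(`{"status":"success","data":{"groups":[{"name":"example.rules","rules":[
		{"name":"example:cheap:sum","query":"sum(up)","evaluationTime":0.0004},
		{"name":"example:expensive:rate","query":"sum(rate(example_total[5m]))","evaluationTime":1.7}
	]}]}}`)
	rules, err := slowestRules(sample, 5)
	if err != nil {
		panic(err)
	}
	for _, r := range rules {
		fmt.Printf("%.4fs %s\n", r.EvaluationTime, r.Name)
	}
}
```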
In cases where excessive CPU usage is being reported, it might be useful to obtain Pprof profiles from the Prometheus containers over a short time span.
To gather CPU profiles over a period of 30 minutes, run the following:
SLEEP_MINUTES=5
# Total collection window in minutes (override with DURATION).
duration=${DURATION:-30}
# Length of each CPU profile in seconds.
PROFILE_SECONDS=30
while [ "$duration" -gt 0 ]; do
  for i in 0 1; do
    echo "Retrieving CPU profile for prometheus-k8s-$i..."
    oc exec -n openshift-monitoring "prometheus-k8s-$i" -c prometheus -- curl -s "http://localhost:9090/debug/pprof/profile?seconds=$PROFILE_SECONDS" > "cpu.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof"
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
  duration=$(( duration - SLEEP_MINUTES ))
done
The following queries might prove useful for debugging.
Calculate the ingestion rate over the last two minutes:
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
-- curl -s http://localhost:9090/api/v1/query --data-urlencode \
'query=sum by(pod,job,namespace) (max without(instance) (rate(prometheus_tsdb_head_samples_appended_total{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[2m])))' > samples_appended.json
Calculate “non-evictable” memory:
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
-- curl -s http://localhost:9090/api/v1/query --data-urlencode \
'query=sort_desc(sum by (pod,namespace) (max without(instance) (container_memory_working_set_bytes{namespace=~"openshift-monitoring|openshift-user-workload-monitoring", container=""})))' > memory.json
In cases where excessive memory is being reported, it might be useful to obtain Pprof profiles from the Prometheus containers over a short time span.
To gather memory profiles over a period of 30 minutes, run the following:
SLEEP_MINUTES=5
# Total collection window in minutes (override with DURATION).
duration=${DURATION:-30}
while [ "$duration" -gt 0 ]; do
  for i in 0 1; do
    echo "Retrieving memory profile for prometheus-k8s-$i..."
    oc exec -n openshift-monitoring "prometheus-k8s-$i" -c prometheus -- curl -s http://localhost:9090/debug/pprof/heap > "heap.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof"
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
  duration=$(( duration - SLEEP_MINUTES ))
done
This document is intended for OpenShift developers that want to ship new metrics to the Red Hat Telemetry service.
Before going into the details, a few words about Telemetry and the process to add a new metric.
What is Telemetry?
Telemetry is a system operated and hosted by Red Hat that collects data from connected clusters to enable subscription management automation, monitor the health of clusters, assist with support, and improve customer experience.
What does sending metrics via Telemetry mean?
You should send the metrics via Telemetry when you want and need to see these metrics for all OpenShift clusters. This is primarily for gaining insights on how OpenShift is used, troubleshooting and monitoring the fleet of clusters. Users can already see these metrics in their clusters via Prometheus even when not available via Telemetry.
How are metrics shipped via Telemetry?
Only metrics which are already collected by the in-cluster monitoring stack can be shipped via Telemetry. The telemeter-client pod running in the openshift-monitoring namespace collects metrics from the prometheus-k8s service every 4m30s using the /federate endpoint and ships the samples to the Telemetry endpoint using a custom protocol.
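To illustrate what a federation request looks like, the sketch below builds a `/federate` URL in which each selected series is passed as a `match[]` query parameter, as defined by the Prometheus federation API. It is not the telemeter-client implementation; the base URL and the helper names are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/url"
)

// federateURL returns the /federate URL on the given base that selects the
// series matching any of the provided instant vector selectors.
func federateURL(base string, matchers []string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parsing base URL: %w", err)
	}
	u.Path = "/federate"
	q := url.Values{}
	for _, m := range matchers {
		q.Add("match[]", m) // one match[] parameter per selector
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// matchersFromURL extracts the match[] selectors from a federation URL.
func matchersFromURL(raw string) ([]string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parsing federation URL: %w", err)
	}
	return u.Query()["match[]"], nil
}

func main() {
	// The base URL below is an assumption for illustration only.
	u, err := federateURL("https://prometheus-k8s.openshift-monitoring.svc:9091", []string{
		`{__name__="openshift:prometheus_tsdb_head_series:sum"}`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

Only the series selected by a `match[]` parameter are returned, which is consistent with the requirement that each shipped metric be listed explicitly.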
How long will it take for my new telemetry metrics to show up?
Please start this process and involve the monitoring team as early as possible. The process described in this document includes a thorough review of the underlying metrics and labels. The monitoring team will try to understand your use case and perhaps propose improvements and optimizations. Metric, label and rule names will be reviewed for following best practices. This can take several review rounds over multiple weeks.
Shipping metrics via Telemetry is only possible for components running in namespaces with the openshift.io/cluster-monitoring=true label. In practice, it means that your component falls into one of these 2 categories:
Your component should already be instrumented and scraped by the in-cluster monitoring stack using ServiceMonitor and/or PodMonitor objects.
The overall process is as follows:
PrometheusRule objects.
The first step is to identify which metrics you want to send via Telemetry and what the cardinality of the metrics is (e.g. how many timeseries there will be in total). Typically you start with metrics that show how your component is being used. In practice, we recommend starting by shipping no more than:
If you are above these limits, you have 2 choices:
Finally your metric MUST NOT contain any personally identifiable information (names, email addresses, information about user workloads).
Use the following template to file a JIRA task in the MON project.
h1. Request for sending data via telemetry
The goal is to collect metrics about ... because ...
h2. <Metric name>
<Metric name> represents ...
Labels
* <label 1>, possible values are ...
* <label 2>, possible values are ...
The cardinality of the metric is at most <X>.
h2. <Other metric>
...
Reach out to @team-telemetry on the #forum-monitoring or #forum-observatorium Slack channels for an explicit approval (e.g. in-cluster and RHOBS team leads).
Recording rules are required to reduce the cardinality of the metrics being shipped.
Even for low-cardinality metrics, we require aggregating them before shipping to Telemetry to remove unnecessary labels such as instance or pod. This will also protect the telemetry backend against future label additions to the underlying metrics.
Let’s take a concrete example: each Prometheus pod exposes a prometheus_tsdb_head_series metric which tracks the number of active timeseries. There can be up to 4 Prometheus pods in a given cluster (2 pods in openshift-monitoring and 2 in openshift-user-workload-monitoring when user-defined monitoring is enabled). To reduce the number of timeseries shipped via Telemetry, we configure the following recording rule to sum the values by namespace and job labels:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-monitoring-operator-prometheus-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: openshift-monitoring.rules
    rules:
    - expr: |-
        sum by (job,namespace) (
          max without(instance) (
            prometheus_tsdb_head_series{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}
          )
        )
      record: openshift:prometheus_tsdb_head_series:sum
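To see how this rule reduces the number of timeseries, the sketch below emulates the two aggregation steps (`max without(instance)` followed by `sum by (job,namespace)`) on in-memory samples. It only illustrates the grouping semantics and is not how Prometheus evaluates rules. The job names, instance addresses and sample values are made up, and the grouping key assumes that label values contain no comma.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// sample is one timeseries value with its labels.
type sample struct {
	Labels map[string]string
	Value  float64
}

// groupKey builds a deterministic key from the labels accepted by keep.
func groupKey(labels map[string]string, keep func(string) bool) string {
	var parts []string
	for k, v := range labels {
		if keep(k) {
			parts = append(parts, k+"="+v)
		}
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// aggregate groups the samples by the labels accepted by keep and combines
// the values of each group with op. Groups keep their order of appearance.
func aggregate(in []sample, keep func(string) bool, op func(a, b float64) float64) []sample {
	groups := map[string]*sample{}
	var order []string
	for _, s := range in {
		k := groupKey(s.Labels, keep)
		if g, ok := groups[k]; ok {
			g.Value = op(g.Value, s.Value)
			continue
		}
		kept := map[string]string{}
		for name, v := range s.Labels {
			if keep(name) {
				kept[name] = v
			}
		}
		groups[k] = &sample{Labels: kept, Value: s.Value}
		order = append(order, k)
	}
	out := make([]sample, 0, len(order))
	for _, k := range order {
		out = append(out, *groups[k])
	}
	return out
}

// recordHeadSeries emulates sum by (job,namespace) (max without(instance) (...)).
func recordHeadSeries(in []sample) []sample {
	inner := aggregate(in, func(n string) bool { return n != "instance" }, math.Max)
	return aggregate(inner, func(n string) bool { return n == "job" || n == "namespace" },
		func(a, b float64) float64 { return a + b })
}

func main() {
	mk := func(ns, job, pod, instance string, v float64) sample {
		return sample{Labels: map[string]string{"namespace": ns, "job": job, "pod": pod, "instance": instance}, Value: v}
	}
	in := []sample{
		mk("openshift-monitoring", "prometheus-k8s", "prometheus-k8s-0", "10.0.0.1:9091", 100),
		mk("openshift-monitoring", "prometheus-k8s", "prometheus-k8s-1", "10.0.0.2:9091", 120),
		mk("openshift-user-workload-monitoring", "prometheus-user-workload", "prometheus-user-workload-0", "10.0.0.3:9091", 10),
		mk("openshift-user-workload-monitoring", "prometheus-user-workload", "prometheus-user-workload-1", "10.0.0.4:9091", 20),
	}
	for _, s := range recordHeadSeries(in) {
		fmt.Println(s.Labels["namespace"], s.Labels["job"], s.Value)
	}
}
```

With these four input series the result has two series, one per namespace and job pair, and it stays at two even if new labels are later added to the underlying metric.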
Your PrometheusRule object(s) should be created by your operator with your ServiceMonitor and/or PodMonitor objects.
Clone the cluster-monitoring-operator repository locally.
Modify the /manifests/0000_50_cluster-monitoring-operator_04-config.yaml file to add the metric to the allowed list. Include comments to:
#
# owners: (@openshift/openshift-team-monitoring)
#
# openshift:prometheus_tsdb_head_series:sum tracks the total number of active series
- '{__name__="openshift:prometheus_tsdb_head_series:sum"}'
make --always-make docs
Commit the changes into Git and open a pull request in the openshift/cluster-monitoring-operator repository linking to the initial JIRA ticket.
Ask for a review on the #forum-monitoring Slack channel.
Once the pull request in the cluster-monitoring-operator repository is merged, the configuration of the Telemetry server needs to be synchronized.
Clone the rhobs/configuration repository.
Run
make whitelisted_metrics && make
Commit the changes into Git and open a pull request in the rhobs/configuration repository.
Ask for a review on the #forum-observatorium Slack channel.
Once merged, the updated configuration should be rolled out to the production Telemetry within a few days. After this happens, clusters running the next (e.g. master) OCP version should start sending the new metric(s) to Telemetry.
A given metric may have different labels (aka dimensions) that help refine the characteristics of the thing being measured. Each unique combination of a metric name + optional key/value pairs represents a timeseries in the Prometheus parlance. The total number of active timeseries for a given metric name represents the cardinality of the metric.
For example, consider a component exposing a fictitious my_component_ready metric:
my_component_ready 1
The metric has no label but because Prometheus will automatically attach target labels such as pod and instance, the total cardinality could be 1 (single replica), 2 (2 replicas), ...
To find out the current cardinality of a metric on a live cluster, you can run this PromQL query:
count(my_component_ready)
Now consider another metric tracking HTTP requests:
http_requests_total{method="GET", code="200", path="/"} 10
http_requests_total{method="GET", code="404", path="/foo"} 1
http_requests_total{method="POST", code="200", path="/"} 12
http_requests_total{method="POST", code="500", path="/login"} 2
While you may think that the cardinality is 4 because there are 4 timeseries, this isn’t true because we can’t really predict in advance all values for the code and path labels. This is what is called a high-cardinality metric. An even worse case would be a metric with a userid or ip label (we would say that this metric has unbounded cardinality).
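To make the gap between observed and potential cardinality concrete, the sketch below counts the distinct values seen for each label of the four http_requests_total series above and multiplies them. This product is only a rough estimate based on the values seen so far, not an exact prediction; the type and function names are hypothetical.

```go
package main

import "fmt"

// series maps label names to values for one timeseries of a metric.
type series map[string]string

// distinctValues returns, for each label name, how many different values
// appear across all the given series.
func distinctValues(all []series) map[string]int {
	seen := map[string]map[string]struct{}{}
	for _, s := range all {
		for name, value := range s {
			if seen[name] == nil {
				seen[name] = map[string]struct{}{}
			}
			seen[name][value] = struct{}{}
		}
	}
	counts := make(map[string]int, len(seen))
	for name, values := range seen {
		counts[name] = len(values)
	}
	return counts
}

// potentialCardinality multiplies the number of distinct values of every
// label: the number of series if all observed values could be combined.
func potentialCardinality(all []series) int {
	total := 1
	for _, n := range distinctValues(all) {
		total *= n
	}
	return total
}

func main() {
	observed := []series{
		{"method": "GET", "code": "200", "path": "/"},
		{"method": "GET", "code": "404", "path": "/foo"},
		{"method": "POST", "code": "200", "path": "/"},
		{"method": "POST", "code": "500", "path": "/login"},
	}
	fmt.Println("observed series:", len(observed))
	fmt.Println("potential series:", potentialCardinality(observed))
}
```

Here 2 methods, 3 codes and 3 paths already allow 18 combinations, and every new path or status code multiplies that number further.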
On top of that, pod churn (e.g. pods being rolled out because of version upgrades) also increases the cardinality because the values of target-based labels (such as pod and instance) would change.
Because Prometheus keeps all active timeseries in-memory for indexing, the more timeseries, the more memory is required. The same is true for the Telemeter server. Which is why we want to keep the cardinality of metrics shipped via Telemetry under a reasonable value (typically less than 5).
See the previous section. Every metric shipped to Telemetry has to be multiplied by the number of connected clusters that may be sending that metric. Pushing too many metrics from a single cluster may cause service degradation and resource exhaustion on both the in-cluster monitoring stack and on the Telemetry server side.
Yes, the Telemeter client is already configured to collect and send firing alerts. On the Telemetry side, the alerts can be queried using the alerts metric.
Once you have updated the telemeter-client configuration in the master branch, you can create backports to older OCP releases. The procedure follows the usual OCP backport process which involves creating bug tickets in the OCPBUGS project (preferably assigned to your component) and opening pull requests in openshift/cluster-monitoring-operator against the desired release-4.x branches.
Please refer to the following links for more details:
You can also reach out to the OpenShift monitoring team for advice.
If your component’s metrics aren’t already collected by the in-cluster monitoring stack, you need to deploy at least one ServiceMonitor or one PodMonitor resource in your component’s namespace.
If your component is deployed by the Cluster Version Operator (CVO), it is enough to add the manifest to the CVO payload.
Again you can reach out to the OpenShift monitoring team for advice.
The Telemetry client authenticates against the Telemeter server using the cluster’s pull secret. The Telemeter server verifies that the pull secret is valid and matches with the cluster’s identifier. The Telemetry protocol uses HTTPS for encryption.
Finally, the Telemeter server will only accept metrics which are explicitly allowed by its running configuration.
Also known as Telemetry or Telemeter server. A service operated by Red Hat that receives metrics from all OCP connected clusters.
The telemeter-client pod running in the openshift-monitoring namespace. It is responsible for collecting the platform metrics at regular intervals (every 4m30s) and sending them to the Telemetry server.
The prometheus-k8s-0 and prometheus-k8s-1 pods running in the openshift-monitoring namespace. They are in charge of collecting metrics from the OpenShift components (operators+operands) and evaluating the associated alerting and recording rules. The Prometheus pods are configured using ServiceMonitor, PodMonitor and PrometheusRule custom resources coming from namespaces with the openshift.io/cluster-monitoring=true label.