OpenShift Cluster Monitoring

1: Instrumentation guidelines
2: Collecting metrics with Prometheus
3: Alerting
4: Dashboards
5: Sending metrics via Telemetry
6: Frequently asked questions

OpenShift Monitoring is composed of Platform Monitoring and User Workload Monitoring.

The official OpenShift documentation contains all user-facing information such as usage and configuration.

1 - Instrumentation guidelines

This document details good practices to adopt when you instrument your application for Prometheus. It is not meant to be a replacement of the upstream documentation but an introduction focused on the OpenShift use case.

Targeted audience

This document is intended for OpenShift developers that want to instrument their operators and operands for Prometheus.

Getting started

To instrument software written in Golang, see the official Golang client. For other languages, refer to the curated list of client libraries.

Prometheus stores all data as time series, which are streams of timestamped values (samples) identified by a metric name and a set of unique labels (a.k.a. dimensions or key/value pairs). Its data model is described in detail on this page. Time series are represented like this:

# HELP http_requests_total Total number of HTTP requests by method and handler.
# TYPE http_requests_total counter
http_requests_total{method="GET", handler="/messages"}  500
http_requests_total{method="POST", handler="/messages"} 10

Prometheus supports 4 metric types:

Gauge which represents a single numerical value that can arbitrarily go up and down.
Counter, a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. When querying a counter metric, you usually apply a rate() or increase() function.
Histogram which represents observations (usually things like request durations or response sizes) and counts them in configurable buckets.
Summary which represents observations too but it reports configurable quantiles over a (fixed) sliding time window. In practice, they are rarely used.

Adding metrics for any operation should be part of the code review process like any other factor that is kept in mind for production ready code.

To learn more about when to use which metric type, how to name metrics and how to choose labels, read the following documentation:

Example

Here is a fictional Go code example instrumented with a Gauge metric and a multi-dimensional Counter metric:

	cpuTemp := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cpu_temperature_celsius",
		Help: "Current temperature of the CPU.",
	})

	hdFailures := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "hd_errors_total",
			Help: "Number of hard-disk errors.",
		},
		[]string{"device"},
	)

	reg := prometheus.NewRegistry()
	reg.MustRegister(cpuTemp, hdFailures)

	cpuTemp.Set(55.2)

	// Record 1 failure for the /dev/sda device.
	hdFailures.With(prometheus.Labels{"device":"/dev/sda"}).Inc()
	// Record 3 failures for the /dev/sdb device.
	hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
	hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()
	hdFailures.With(prometheus.Labels{"device":"/dev/sdb"}).Inc()

Labels

Deciding when to add a label to a metric and when not to is difficult. The general rule is: the fewer labels, the better. Every unique combination of label names and values creates a new time series, and Prometheus memory usage is mostly driven by the number of time series loaded into RAM during ingestion and querying. A good rule of thumb is to have fewer than 10 time series per metric name and target. A common mistake is to store dynamic information such as usernames, IP addresses or error messages in a label, which can lead to thousands of time series.

Labels such as pod, service, job and instance shouldn’t be set by the application. Instead they are discovered at runtime by Prometheus when it queries the Kubernetes API to discover which targets should be scraped for metrics.

Custom collectors

It is sometimes not feasible to use one of the 4 metric types, typically when your application already has the information stored for another purpose (for instance, it maintains a list of custom objects retrieved from the Kubernetes API). In this case, the custom collector pattern can be useful.

You can find an example of this pattern in the github.com/prometheus-operator/prometheus-operator project.

Next steps

Collect metrics with Prometheus.
Configure alerting with Prometheus.
Add dashboards to the OCP console.

2 - Collecting metrics with Prometheus

This document explains how to ingest metrics into the OpenShift Platform monitoring stack. It applies only to the OCP core components and Red Hat certified operators.

For user application monitoring, please refer to the official OCP documentation.

Targeted audience

This document is intended for OpenShift developers that want to expose Prometheus metrics from their operators and operands. Readers should be familiar with the architecture of the OpenShift cluster monitoring stack.

Exposing metrics for Prometheus

Prometheus is a monitoring system that pulls metrics over HTTP, meaning that monitored targets need to expose an HTTP endpoint (usually /metrics) which will be queried by Prometheus at regular intervals (typically every 30 seconds).

To avoid leaking sensitive information to potential attackers, all OpenShift components scraped by the in-cluster monitoring Prometheus should follow these requirements:

Use HTTPS instead of plain HTTP.
Implement proper authentication (e.g. verify the identity of the requester).
Implement proper authorization (e.g. authorize requests issued by the Prometheus service account or users with GET permission on the metrics endpoint).

As described in the Client certificate scraping enhancement proposal, we recommend that the components rely on client TLS certificates for authentication/authorization. This is more efficient and robust than using bearer tokens because token-based authn/authz add a dependency (and additional load) on the Kubernetes API.

To this goal, the Cluster monitoring operator provisions a TLS client certificate for the in-cluster Prometheus. The client certificate is issued for the system:serviceaccount:openshift-monitoring:prometheus-k8s Common Name (CN) and signed by the kubernetes.io/kube-apiserver-client signer. The certificate can be verified using the certificate authority (CA) bundle located at the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap.

In practice, the Cluster Monitoring Operator (CMO) creates a CertificateSigningRequest object for the prometheus-k8s service account which is automatically approved by the cluster-policy-controller. Once the certificate is issued by the controller, CMO provisions a secret named metrics-client-certs which contains the TLS certificate and key (respectively under the tls.crt and tls.key keys in the secret). CMO also rotates the certificate before it expires.

There are several options available depending on which framework your component is built with.

library-go

If your component already relies on *ControllerCommandConfig from github.com/openshift/library-go/pkg/controller/controllercmd, it should automatically expose a TLS-secured /metrics endpoint which has a hardcoded authorizer for the system:serviceaccount:openshift-monitoring:prometheus-k8s service account (link).

Example: the Cluster Kubernetes API Server Operator.

kube-rbac-proxy sidecar

The “simplest” option when the component doesn’t rely on github.com/openshift/library-go (and switching to library-go isn’t an option) is to run a kube-rbac-proxy sidecar in the same pod as the application being monitored.

Here is an example of a container’s definition to be added to the Pod’s template of the Deployment (or DaemonSet):

  - args:
    - --secure-listen-address=0.0.0.0:8443
    - --upstream=http://127.0.0.1:8081
    - --config-file=/etc/kube-rbac-proxy/config.yaml
    - --tls-cert-file=/etc/tls/private/tls.crt
    - --tls-private-key-file=/etc/tls/private/tls.key
    - --client-ca-file=/etc/tls/client/client-ca-file
    - --logtostderr=true
    - --allow-paths=/metrics
    image: quay.io/brancz/kube-rbac-proxy:v0.11.0 # usually replaced by CVO by the OCP kube-rbac-proxy image reference.
    name: kube-rbac-proxy
    ports:
    - containerPort: 8443
      name: metrics
    resources:
      requests:
        cpu: 1m
        memory: 15Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/kube-rbac-proxy
      name: secret-kube-rbac-proxy-metric
      readOnly: true
    - mountPath: /etc/tls/private
      name: secret-kube-rbac-proxy-tls
      readOnly: true
    - mountPath: /etc/tls/client
      name: metrics-client-ca
      readOnly: true
[...]
  - volumes:
    # Secret created by the service CA operator.
    # We assume that the Kubernetes service exposing the application's pods has the
    # "service.beta.openshift.io/serving-cert-secret-name: kube-rbac-proxy-tls"
    # annotation.
    - name: secret-kube-rbac-proxy-tls
      secret:
        secretName: kube-rbac-proxy-tls
    # Secret containing the kube-rbac-proxy configuration (see below).
    - name: secret-kube-rbac-proxy-metric
      secret:
        secretName: secret-kube-rbac-proxy-metric
    # ConfigMap containing the CA used to verify the client certificate.
    - name: metrics-client-ca
      configMap:
        name: metrics-client-ca

Note: The metrics-client-ca ConfigMap needs to be created by your component and synced from the kube-system/extension-apiserver-authentication ConfigMap.

Here is a Secret containing the kube-rbac-proxy’s configuration (it allows only HTTPS requests to the /metrics endpoint for the Prometheus service account):

apiVersion: v1
kind: Secret
metadata:
  name: secret-kube-rbac-proxy-metric
  namespace: openshift-example
stringData:
  config.yaml: |-
    "authorization":
      "static":
      - "path": "/metrics"
        "resourceRequest": false
        "user":
          "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s"
        "verb": "get"    
type: Opaque

Example: node-exporter from the Cluster Monitoring operator.

controller-runtime (>= v0.16.0)

Starting with v0.16.0, the controller-runtime framework provides a way to expose and secure a /metrics endpoint using TLS with minimal effort.

Refer to https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/metrics/server for details about TLS configuration and check the next section to understand how it needs to be configured.

Roll your own HTTPS server

If you don’t use library-go or controller-runtime >= v0.16.0 and don’t want to run a kube-rbac-proxy sidecar, you need to implement your own HTTPS server for /metrics. As explained before, it needs to require and verify the TLS client certificate using the root CA stored under the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap.

In practice, the server should:

Set TLSConfig’s ClientAuth field to RequireAndVerifyClientCert.
Reload the root CA when the source ConfigMap is updated.
Reload the server’s certificate and key when they are updated.

Example: https://github.com/openshift/cluster-monitoring-operator/pull/1870

Configuring Prometheus to scrape metrics

To tell the Prometheus pods running in the openshift-monitoring namespace (e.g. prometheus-k8s-{0,1}) to scrape the metrics from your operator/operand pods, you should use ServiceMonitor and/or PodMonitor custom resources.

The workflow is:

Add the openshift.io/cluster-monitoring: "true" label to the namespace where the scraped targets live.
- Important: only OCP core components and Red Hat certified operators can set this label on namespaces.
- OCP core components can set the label on their namespaces in the CVO manifests directly.
- For OLM operators:
  - There’s no automatic way to enforce the label (yet).
  - The OCP console will display a checkbox at installation time to enable cluster monitoring for the operator if you add the operatorframework.io/cluster-monitoring=true annotation to the operator’s CSV.
  - For CLI installations, the requirement should be detailed in the installation procedure (example for the Logging operator).
Add Role and RoleBinding to give the prometheus-k8s service account access to pods, endpoints and services in your namespace.
In case of ServiceMonitor:
- Create a Service object selecting the scraped pods.
- Create a ServiceMonitor object targeting the Service.
In case of PodMonitor:
- Create a PodMonitor object targeting the pods.

Below is a fictitious example using a ServiceMonitor object to scrape metrics from pods deployed in the openshift-example namespace.

Role and RoleBinding manifests

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: openshift-example
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: openshift-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

Service manifest

apiVersion: v1
kind: Service
metadata:
  annotations:
    # This annotation tells the service CA operator to provision a Secret
    # holding the certificate + key to be mounted in the pods.
    # The Secret name is "<annotation value>" (e.g. "tls-my-app-tls").
    service.beta.openshift.io/serving-cert-secret-name: tls-my-app-tls
  labels:
    app.kubernetes.io/name: my-app
  name: metrics
  namespace: openshift-example
spec:
  ports:
  - name: metrics
    port: 8443
    targetPort: metrics
  # Select all Pods in the same namespace that have the `app.kubernetes.io/name: my-app` label.
  selector:
    app.kubernetes.io/name: my-app
  type: ClusterIP

ServiceMonitor manifest

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: openshift-example
spec:
  endpoints:
  - interval: 30s
    # Matches the name of the service's port.
    port: metrics
    scheme: https
    tlsConfig:
      # The CA file used by Prometheus to verify the server's certificate.
      # It's the cluster's CA bundle from the service CA operator.
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      # The name of the server (CN) in the server's certificate.
      # It must match the Service's DNS name: <service name>.<namespace>.svc.
      serverName: metrics.openshift-example.svc
      # The client's certificate file used by Prometheus when scraping the metrics.
      # This file is located in the Prometheus container.
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      # The client's key file used by Prometheus when scraping the metrics.
      # This file is located in the Prometheus container.
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  selector:
    # Select all Services in the same namespace that have the `app.kubernetes.io/name: my-app` label.
    matchLabels:
      app.kubernetes.io/name: my-app

Next steps

Configure alerting with Prometheus.
Send Telemetry metrics.

3 - Alerting

Targeted audience

This document is intended for OpenShift developers that want to write alerting rules for their operators and operands.

Configuring alerting rules

You configure alerting rules based on the metrics being collected for your component(s). To do so, you should create PrometheusRule objects in your operator/operand namespace which will also be picked up by the Prometheus operator (provided that the namespace has the openshift.io/cluster-monitoring="true" label for layered operators).

Here is an example of a PrometheusRule object with a single alerting rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-example-operator-rules
  namespace: openshift-example-operator
spec:
  groups:
  - name: operator
    rules:
    - alert: ClusterExampleOperatorUnhealthy
      annotations:
        description: Cluster Example operator running in pod {{$labels.namespace}}/{{$labels.pod}} is not healthy.
        summary: Operator Example not healthy
      expr: |
        max by(pod, namespace) (last_over_time(example_operator_healthy[5m])) == 0
      for: 15m
      labels:
        severity: warning

You can choose to configure all your alerting rules into a single PrometheusRule object or split them into different objects (one per component). The mechanism to deploy the object(s) depends on the context: it can be deployed by the Cluster Version Operator (CVO), the Operator Lifecycle Manager (OLM) or your own operator.

Guidelines

Please refer to the Alerting Consistency OpenShift enhancement proposal for the recommendations applying to OCP built-in alerting rules.

If you need a review of alerting rules from the OCP monitoring team, you can reach them on the #forum-openshift-monitoring channel.

Identifying alerting rules without a namespace label

The enhancement proposal mentioned above states the following for OCP built-in alerts:

Alerts SHOULD include a namespace label indicating the source of the alert.

Unfortunately this isn’t something that we can verify by static analysis because the namespace label can come from the PromQL result or be added statically. Nevertheless we can still use the Telemetry data to identify OCP alerts that don’t respect this statement.

First, create an OCP cluster from the latest stable release. Once it is installed, run this command to return the list of all OCP built-in alert names:

curl -sk -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" \
https://$(oc get routes -n openshift-monitoring thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/rules \
| jq -cr '.data.groups | map(.rules) | flatten | map(select(.type =="alerting")) | map(.name) | unique |join("|")'

Then from https://telemeter-lts.datahub.redhat.com, retrieve the list of all alerts matching the names that fired without a namespace label, grouped by minor release:

count by (alertname,version) (
  alerts{alertname=~"<insert the list of names returned by the previous command>",namespace=""} *
  on(_id) group_left(version) max by(_id, version) (
    label_replace(id_version_ebs_account_internal:cluster_subscribed{version=~"4.\\d\\d.*"}, "version", "$1", "version", "^(4.\\d+).*$")
  )
)

You should now track back the non-compliant alerts to their component of origin and file bugs against them (example).

The exercise should be done at regular intervals, at least once per release cycle.

4 - Dashboards

Targeted audience

This document is intended for OpenShift developers that want to add visualization dashboards for their operators and operands in the OCP administrator console.

Getting started

Please refer to the document written by the Observability UI team.

The team can also be found in the #forum-observability-ui Slack channel.

5 - Sending metrics via Telemetry

Targeted audience

This document is intended for OpenShift developers that want to ship new metrics to the Red Hat Telemetry service.

Background

Before going into the details, here are a few words about Telemetry and the process to add a new metric.

What is Telemetry?

Telemetry is a system operated and hosted by Red Hat that collects data from connected clusters to enable subscription management automation, monitor the health of clusters, assist with support, and improve customer experience.

What does sending metrics via Telemetry mean?

You should send the metrics via Telemetry when you want and need to see these metrics for all OpenShift clusters. This is primarily for gaining insights on how OpenShift is used, troubleshooting and monitoring the fleet of clusters. Users can already see these metrics in their clusters via Prometheus even when not available via Telemetry.

How are metrics shipped via Telemetry?

Only metrics which are already collected by the in-cluster monitoring stack can be shipped via Telemetry. The telemeter-client pod running in the openshift-monitoring namespace collects metrics from the prometheus-k8s service every 4m30s using the /federate endpoint and ships the samples to the Telemetry endpoint using a custom protocol.

How long will it take for my new telemetry metrics to show up?

Please start this process and involve the monitoring team as early as possible. The process described in this document includes a thorough review of the underlying metrics and labels. The monitoring team will try to understand your use case and perhaps propose improvements and optimizations. Metric, label and rule names will be reviewed against best practices. This can take several review rounds over multiple weeks.

Requirements

Shipping metrics via Telemetry is only possible for components running in namespaces with the openshift.io/cluster-monitoring=true label. In practice, it means that your component falls into one of these 2 categories:

Your operator/operand is included in the OCP payload (e.g. it is a core/platform component).
Your operator/operand is deployed via OLM and has been certified by Red Hat.

Your component should already be instrumented and scraped by the in-cluster monitoring stack using ServiceMonitor and/or PodMonitor objects.

Sending metrics via Telemetry step-by-step

The overall process is as follows:

Request approval from the monitoring team.
Configure recording rules using PrometheusRule objects.
Modify the configuration of the Telemeter client in the Cluster Monitoring Operator repository to collect the new metrics.
Synchronize the Telemeter server’s configuration from the Cluster Monitoring Operator project.
Wait for the Telemeter server’s configuration to be rolled out to production.

Request approval

The first step is to identify which metrics you want to send via Telemetry and what their cardinality is (e.g. how many timeseries there will be in total). Typically you start with metrics that show how your component is being used. In practice, we recommend shipping no more than:

1 to 3 metrics.
1 to 10 timeseries per metric.
10 timeseries in total.

If you are above these limits, you have 2 choices:

(recommended) aggregate the metrics before sending. For instance: sum all values for a given metric.
request an exception from the monitoring team. The exception requires approval from upper management, so make sure that your request is well justified!

Finally your metric MUST NOT contain any personally identifiable information (names, email addresses, information about user workloads).

Use the following information to file 1 JIRA ticket per metric in the MON project:

Type: Task
Title: Send metric <metric name> via Telemetry
Label: telemetry-review-request
Description template:

h1. Request for sending data via telemetry

The goal is to collect metrics about ... because ...

<Metric name> represents ...

Labels
* <label 1>, possible values are ...
* <label 2>, possible values are ...

The cardinality of the metric is at most <X>.

Component exposing the metric: https://github.com/<org>/<project>

Reach out to @team-telemetry on the #forum-openshift-monitoring or #forum-observatorium Slack channels for an explicit approval (e.g. in-cluster and RHOBS team leads).

Configure recording rules

Recording rules are required to reduce the cardinality of the metrics being shipped.

Even for low-cardinality metrics, we require that you aggregate them before shipping to Telemetry to remove unnecessary labels such as instance or pod. This will also protect the Telemetry backend against future label additions to the underlying metrics.

Let’s take a concrete example: each Prometheus pod exposes a prometheus_tsdb_head_series metric which tracks the number of active timeseries. There can be up to 4 Prometheus pods in a given cluster (2 pods in openshift-monitoring and 2 in openshift-user-workload-monitoring when user-defined monitoring is enabled). To reduce the number of timeseries shipped via Telemetry, we configure the following recording rule to sum the values by namespace and job labels:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-monitoring-operator-prometheus-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: openshift-monitoring.rules
    rules:
    - expr: |-
        sum by (job,namespace) (
          max without(instance) (
            prometheus_tsdb_head_series{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}
          )
        )
      record: openshift:prometheus_tsdb_head_series:sum

Your PrometheusRule object(s) should be created by your operator with your ServiceMonitor and/or PodMonitor objects.

Modify the Telemeter client’s configuration

Clone the cluster-monitoring-operator repository locally.
Modify the /manifests/0000_50_cluster-monitoring-operator_04-config.yaml file to add the metric to the allowed list. Include comments to:

Identify the team owning the metric.
Provide a short description.
(optional) Indicate which team(s) will consume the metric, it helps knowing who to contact if changes are made in the future.

    #
    # owners: (@openshift/openshift-team-monitoring)
    #
    # openshift:prometheus_tsdb_head_series:sum tracks the total number of active series
    - '{__name__="openshift:prometheus_tsdb_head_series:sum"}'

make --always-make docs

Commit the changes into Git and open a pull request in the openshift/cluster-monitoring-operator repository linking to the initial JIRA ticket.
Ask for a review on the #forum-openshift-monitoring Slack channel.

Synchronize the Telemeter server’s configuration

Once the pull request in the cluster-monitoring-operator repository is merged, the configuration of the Telemetry server needs to be synchronized.

Clone the rhobs/configuration repository.
Run

make whitelisted_metrics && make

Commit the changes into Git and open a pull request in the rhobs/configuration repository.
Ask for a review on the #forum-observatorium Slack channel.

Once merged, the updated configuration should be rolled out to the production Telemetry within a few days. After this happens, clusters running the next (e.g. master) OCP version should start sending the new metric(s) to Telemetry.

Frequently asked questions (FAQ)

What is the cardinality of a metric?

A given metric may have different labels (aka dimensions) that help refine the characteristics of the thing being measured. Each unique combination of a metric name and optional key/value pairs represents a timeseries in Prometheus parlance. The total number of active timeseries for a given metric name represents the cardinality of the metric.

For example, consider a component exposing a fictitious my_component_ready metric:

my_component_ready 1

The metric has no label but because Prometheus will automatically attach target labels such as pod and instance, the total cardinality could be 1 (single replica), 2 (2 replicas), …

To find out the current cardinality of a metric on a live cluster, you can run this PromQL query:

count(my_component_ready)

Now consider another metric tracking HTTP requests:

http_requests_total{method="GET", code="200", path="/"} 10
http_requests_total{method="GET", code="404", path="/foo"} 1
http_requests_total{method="POST", code="200", path="/"} 12
http_requests_total{method="POST", code="500", path="/login"} 2

While you may think that the cardinality is 4 because there are 4 timeseries, this isn’t true because we can’t really predict in advance all values for the code and path labels. This is what is called a high-cardinality metric. An even worse case would be a metric with a userid or ip label (we would say that this metric has unbounded cardinality).

On top of that, pod churn (e.g. pods being rolled out because of version upgrades) also increases the cardinality because the values of target-based labels (such as pod and instance) change.

Because Prometheus keeps all active timeseries in memory for indexing, the more timeseries there are, the more memory is required. The same is true for the Telemeter server, which is why we want to keep the cardinality of metrics shipped via Telemetry under a reasonable value (typically less than 5).

Why is there a limit on the number of metrics that can be collected?

See the previous section. Every metric shipped to Telemetry has to be multiplied by the number of connected clusters that may be sending that metric. Pushing too many metrics from a single cluster may cause service degradation and resource exhaustion on both the in-cluster monitoring stack and on the Telemetry server side.

Will Telemetry automatically collect alerts?

Yes, the Telemeter client is already configured to collect and send firing alerts. On the Telemetry side, the alerts can be queried using the alerts metric.

How do I ship metrics via Telemetry for older OCP releases?

Once you have updated the telemeter-client configuration in the master branch, you can create backports to older OCP releases. The procedure follows the usual OCP backport process which involves creating bug tickets in the OCPBUGS project (preferably assigned to your component) and opening pull requests in openshift/cluster-monitoring-operator against the desired release-4.x branches.

How do I get access to Telemetry?

Check https://gitlab.cee.redhat.com/data-hub/dh-docs/-/blob/master/docs/interacting-with-telemetry-data.adoc

How do I instrument my component for Prometheus?

Please refer to the following links for more details:

Metric and label naming (upstream Prometheus documentation).
Instrumentation (upstream Prometheus documentation).
Instrumenting Kubernetes

You can also reach out to the OpenShift monitoring team for advice.

How does the in-cluster monitoring stack scrape metrics from my component?

If your component’s metrics aren’t already collected by the in-cluster monitoring stack, you need to deploy at least one ServiceMonitor or one PodMonitor resource in your component’s namespace.

If your component is deployed by the Cluster Version Operator (CVO), it is enough to add the manifest to the CVO payload.

Again you can reach out to the OpenShift monitoring team for advice.

How is the communication secured between the Telemeter client and server?

The Telemeter client authenticates against the Telemeter server using the cluster’s pull secret. The Telemeter server verifies that the pull secret is valid and matches the cluster’s identifier. The Telemetry protocol uses HTTPS for encryption.

Finally, the Telemeter server will only accept metrics which are explicitly allowed by its running configuration.

Glossary

Telemetry

Also known as the Telemetry server or Telemeter server. A service operated by Red Hat that receives metrics from all connected OCP clusters.

Telemeter client

The telemeter-client pod running in the openshift-monitoring namespace. It is responsible for collecting the platform metrics at regular intervals (every 4m30s) and sending them to the Telemetry server.

In-cluster monitoring stack

The prometheus-k8s-0 and prometheus-k8s-1 pods running in the openshift-monitoring namespace. They are in charge of collecting metrics from the OpenShift components (operators+operands) and evaluating the associated alerting and recording rules. The Prometheus pods are configured using ServiceMonitor, PodMonitor and PrometheusRule custom resources coming from namespaces with the openshift.io/cluster-monitoring=true label.

6 - Frequently asked questions

Frequently asked questions

This page collects answers to frequently asked questions about configuring and debugging the in-cluster monitoring stack. In particular, it applies to two OpenShift projects: Platform Monitoring (PM) and User Workload Monitoring (UWM).

How can I (as a monitoring developer) troubleshoot support cases?

See this presentation to understand which tools are at your disposal.

How do I understand why targets aren’t discovered and metrics are missing?

Both PM and UWM monitoring stacks rely on the ServiceMonitor and PodMonitor custom resources in order to tell Prometheus which endpoints to scrape.

The examples below show the namespace openshift-monitoring, which can be replaced with openshift-user-workload-monitoring when dealing with UWM.

A detailed description of how the resources are linked exists here, but we will walk through some common issues to debug the case of missing metrics.

Ensure the serviceMonitorSelector in the Prometheus CR matches the key in the ServiceMonitor labels.
The Service you want to scrape must have an explicitly named port.
The ServiceMonitor must reference the port by this name.
The label selector in the ServiceMonitor must match an existing Service.
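The last three points can be sketched as a Service and ServiceMonitor pair. This is only an illustration: the namespace openshift-example, the label app.kubernetes.io/name: example, the port number 8443 and the port name metrics are all hypothetical, and TLS settings are omitted.

```shell
# Write a Service and a ServiceMonitor that reference each other correctly.
cat > example-monitoring.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: openshift-example
  labels:
    app.kubernetes.io/name: example   # matched by the ServiceMonitor selector
spec:
  selector:
    app.kubernetes.io/name: example
  ports:
  - name: metrics                     # the port must be explicitly named
    port: 8443
    targetPort: 8443
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example
  namespace: openshift-example
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example # must match the Service labels
  endpoints:
  - port: metrics                     # refers to the Service port by name
EOF

# To create both resources on a cluster:
#   oc apply -f example-monitoring.yaml
```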

Assuming these criteria are met but the metrics still don’t exist, we can try to debug the cause.

There is a possibility Prometheus has not loaded the configuration yet. The following metrics will help to determine if that is in fact the case or if there are errors in the configuration:

prometheus_config_last_reload_success_timestamp_seconds
prometheus_config_last_reload_successful
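One way to read these metrics from the command line is a small wrapper around the query API. The helper name prom_query is hypothetical; it simply runs curl inside the Prometheus container, in the same way as the other commands on this page.

```shell
# Hypothetical helper: run an instant PromQL query inside a Prometheus pod.
#   $1: pod name (for example prometheus-k8s-0)
#   $2: PromQL expression
prom_query() {
  oc exec -n openshift-monitoring "$1" -c prometheus -- \
    curl -s http://localhost:9090/api/v1/query --data-urlencode "query=$2"
}

# Example calls (need oc and access to the cluster):
#   prom_query prometheus-k8s-0 'prometheus_config_last_reload_successful'
#   prom_query prometheus-k8s-0 'time() - prometheus_config_last_reload_success_timestamp_seconds'
```

A value of 1 for prometheus_config_last_reload_successful means the last reload succeeded, and the second query returns the number of seconds since that successful reload.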

If there are errors with reloading the configuration, it is likely the configuration itself is invalid and examining the logs will highlight this.

oc logs -n openshift-monitoring prometheus-k8s-0 -c <container-name>

Assuming the reload succeeded, Prometheus should see the configuration.

oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/status/config | grep "<service-monitor-name>"

If the ServiceMonitor does not appear in the output, the next step is to investigate the logs of both Prometheus and prometheus-operator for errors.

If it does appear, then prometheus-operator is doing its job. Double-check the ServiceMonitor definition.

Check the service discovery endpoint to ensure Prometheus can discover the target. It will need the appropriate RBAC to do so. An example can be found here.
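As a rough sketch of what such RBAC can look like, the manifest below grants read access to Services, Endpoints and Pods in one namespace. It assumes that Prometheus runs under the prometheus-k8s service account in openshift-monitoring; the namespace openshift-example is a placeholder, and the exact resource list may differ between versions, so treat the linked example as the reference.

```shell
# Write a Role and RoleBinding that let Prometheus discover targets
# in the (hypothetical) openshift-example namespace.
cat > prometheus-k8s-rbac.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: openshift-example
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: openshift-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s             # assumed service account of Prometheus
  namespace: openshift-monitoring
EOF

# To create both resources on a cluster:
#   oc apply -f prometheus-k8s-rbac.yaml
```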

How do I troubleshoot the TargetDown alert?

First of all, check the TargetDown runbook.

We have, in the past, seen cases where the TargetDown alert was firing when all endpoints appeared to be up. The following commands fetch some useful metrics to help identify the cause.

While the alert is firing, get the list of active targets from each Prometheus pod:

oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl -s 'http://localhost:9090/api/v1/targets?state=active' > targets.prometheus-k8s-0.json

oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl -s 'http://localhost:9090/api/v1/targets?state=active' > targets.prometheus-k8s-1.json
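The saved files can be large. A possible way to list only the targets that Prometheus considers unhealthy is sketched below; it assumes jq is installed locally, and the helper name unhealthy_targets is hypothetical.

```shell
# Hypothetical helper: print job, scrape URL and last error for every
# active target whose health is not "up". Assumes jq is installed.
#   $1: JSON file saved from /api/v1/targets?state=active
unhealthy_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | [.labels.job, .scrapeUrl, .lastError]
         | @tsv' "$1"
}

# Example call:
#   unhealthy_targets targets.prometheus-k8s-0.json
```

Comparing the output for both Prometheus pods shows whether a target is seen as down by one replica only.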

Report all targets that Prometheus couldn’t connect to, along with the reason (timeout, refused, …).

A dialer_name can be passed as a label to limit the query to interesting components. For example {dialer_name=~".+openshift-.*"}.

oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-0.json

oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-1.json

Identify targets that are slow to serve metrics and may be considered as down.

oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-0.json

oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-1.json

How do I troubleshoot high CPU usage of Prometheus?

High CPU usage or CPU spikes are often a symptom of expensive rules.

A good place to start the investigation is the /rules endpoint of Prometheus, where excessive rule evaluation times point to the queries that might contribute to the problem.
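The same information is available from the rules API, which reports an evaluationTime (in seconds) for each rule. The sketch below ranks rules by that value; it assumes jq is installed locally, and the helper name slowest_rules and the file name rules.json are hypothetical.

```shell
# Hypothetical helper: list the rules with the longest last evaluation time.
# Assumes jq is installed.
#   $1: JSON file saved from /api/v1/rules
#   $2: number of rules to print (default 10)
slowest_rules() {
  jq -r --argjson n "${2:-10}" '
    [ .data.groups[] | .name as $group | .rules[]
      | {group: $group, rule: .name, seconds: .evaluationTime} ]
    | sort_by(-.seconds) | .[:$n][]
    | [.seconds, .group, .rule] | @tsv' "$1"
}

# Example calls (the first one needs oc and access to the cluster):
#   oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl -s http://localhost:9090/api/v1/rules > rules.json
#   slowest_rules rules.json 5
```

Each output line contains the evaluation time, the rule group and the rule name, separated by tabs.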

How do I retrieve CPU profiles?

When excessive CPU usage is reported, it can be useful to obtain pprof profiles from the Prometheus containers over a short time span.

To gather CPU profiles over a period of 30 minutes, run the following:

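SLEEP_MINUTES=5
# Length of each CPU profile, in seconds.
PROFILE_SECONDS=30
# Total collection period, in minutes (override with DURATION).
duration=${DURATION:-30}
while [ "$duration" -gt 0 ]; do
  for i in 0 1; do
    echo "Retrieving CPU profile for prometheus-k8s-$i..."
    oc exec -n openshift-monitoring "prometheus-k8s-$i" -c prometheus -- curl -s "http://localhost:9090/debug/pprof/profile?seconds=$PROFILE_SECONDS" > "cpu.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof"
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
  # Count down the remaining minutes so the loop ends after about $DURATION minutes.
  duration=$(( duration - SLEEP_MINUTES ))
done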

How do I debug high memory usage?

The following queries might prove useful for debugging.

Calculate the ingestion rate over the last two minutes:

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
-- curl -s http://localhost:9090/api/v1/query --data-urlencode \
'query=sum by(pod,job,namespace) (max without(instance) (rate(prometheus_tsdb_head_samples_appended_total{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[2m])))' > samples_appended.json

Calculate “non-evictable” memory:

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
-- curl -s http://localhost:9090/api/v1/query --data-urlencode \
'query=sort_desc(sum by (pod,namespace) (max without(instance) (container_memory_working_set_bytes{namespace=~"openshift-monitoring|openshift-user-workload-monitoring", container=""})))' > memory.json

How do I get memory profiles?

When excessive memory usage is reported, it can be useful to obtain pprof profiles from the Prometheus containers over a short time span.

To gather memory profiles over a period of 30 minutes, run the following:

SLEEP_MINUTES=5
# Total collection period, in minutes (override with DURATION).
duration=${DURATION:-30}
while [ "$duration" -gt 0 ]; do
  for i in 0 1; do
    echo "Retrieving memory profile for prometheus-k8s-$i..."
    oc exec -n openshift-monitoring "prometheus-k8s-$i" -c prometheus -- curl -s http://localhost:9090/debug/pprof/heap > "heap.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof"
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
  # Count down the remaining minutes so the loop ends after about $DURATION minutes.
  duration=$(( duration - SLEEP_MINUTES ))
done
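Once the profiles are collected, they can be inspected with the pprof tool that ships with Go. The sketch below assumes the Go toolchain is installed locally; the helper name top_profile is hypothetical.

```shell
# Hypothetical helper: print the top entries of a collected profile.
# Assumes the Go toolchain is installed locally.
#   $1: profile file (for example a cpu.*.pprof or heap.*.pprof file)
top_profile() {
  go tool pprof -top "$1"
}

# Example call, for each collected file:
#   for f in heap.prometheus-k8s-0.*.pprof; do top_profile "$f"; done
```

Comparing the top entries of successive heap profiles can show which allocations grow over time.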