OpenShift Cluster Monitoring
OpenShift Monitoring is composed of Platform Monitoring and User Workload Monitoring.
The official OpenShift documentation contains all user-facing information such as usage and configuration.
This document details good practices to adopt when you instrument your application for Prometheus. It is not meant to be a replacement for the upstream documentation but an introduction focused on the OpenShift use case.
This document is intended for OpenShift developers that want to instrument their operators and operands for Prometheus.
To instrument software written in Golang, see the official Golang client. For other languages, refer to the curated list of client libraries.
Prometheus stores all data as time series, which are streams of timestamped values (samples) identified by a metric name and a set of unique labels (a.k.a. dimensions or key/value pairs). Its data model is described in detail in this page. Time series are represented like this:
# HELP http_requests_total Total number of HTTP requests by method and handler.
# TYPE http_requests_total counter
http_requests_total{method="GET", handler="/messages"} 500
http_requests_total{method="POST", handler="/messages"} 10
Prometheus supports 4 metric types: Counter, Gauge, Histogram and Summary. Counters are typically queried using the rate() or increase() function.
Adding metrics for any operation should be part of the code review process, like any other factor that is kept in mind for production-ready code.
To learn more about when to use which metric type, how to name metrics and how to choose labels, read the following documentation:
Here is a fictional Go code example instrumented with a Gauge metric and a multi-dimensional Counter metric:
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Gauge tracking a value that can go up and down.
	cpuTemp := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cpu_temperature_celsius",
		Help: "Current temperature of the CPU.",
	})
	// Counter partitioned by the "device" label.
	hdFailures := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "hd_errors_total",
			Help: "Number of hard-disk errors.",
		},
		[]string{"device"},
	)

	// Register both metrics in a dedicated registry.
	reg := prometheus.NewRegistry()
	reg.MustRegister(cpuTemp, hdFailures)

	cpuTemp.Set(55.2)
	// Record 1 failure for the /dev/sda device.
	hdFailures.With(prometheus.Labels{"device": "/dev/sda"}).Inc()
	// Record 3 failures for the /dev/sdb device.
	hdFailures.With(prometheus.Labels{"device": "/dev/sdb"}).Inc()
	hdFailures.With(prometheus.Labels{"device": "/dev/sdb"}).Inc()
	hdFailures.With(prometheus.Labels{"device": "/dev/sdb"}).Inc()
}
Deciding when to add a label to a metric is a difficult choice. The general rule is: the fewer labels, the better. Every unique combination of label names and values creates a new time series, and Prometheus memory usage is mostly driven by the number of time series loaded into RAM during ingestion and querying. A good rule of thumb is to have fewer than 10 time series per metric name and target. A common mistake is to store dynamic information such as usernames, IP addresses or error messages in a label, which can lead to thousands of time series.
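To make the rule of thumb concrete, here is a minimal Go sketch that estimates the worst-case number of time series for one metric on one target as the product of the number of distinct values of each label. The label names and value counts are hypothetical and only serve as an illustration.

```go
package main

import "fmt"

// maxSeries returns the worst-case number of time series for one metric on
// one target: the product of the number of distinct values of each label.
// A metric without labels yields exactly one series.
func maxSeries(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical label sets, chosen for illustration only.
	bounded := map[string]int{"method": 4, "code": 5}
	withUsername := map[string]int{"method": 4, "code": 5, "username": 1000}

	fmt.Println(maxSeries(bounded))      // 20
	fmt.Println(maxSeries(withUsername)) // 20000
}
```

Adding a single label with many possible values multiplies the total, which is why dynamic values such as usernames are best kept out of labels.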
Labels such as pod, service, job and instance shouldn’t be set by the application. Instead they are discovered at runtime by Prometheus when it queries the Kubernetes API to discover which targets should be scraped for metrics.
It is sometimes not feasible to use one of the 4 metric types, typically when your application already has the information stored for other purposes (for instance, it maintains a list of custom objects retrieved from the Kubernetes API). In this case, the custom collector pattern can be useful.
You can find an example of this pattern in the github.com/prometheus-operator/prometheus-operator project.
This document explains how to ingest metrics into the OpenShift Platform monitoring stack. It only applies to the OCP core components and Red Hat certified operators.
For user application monitoring, please refer to the official OCP documentation.
This document is intended for OpenShift developers that want to expose Prometheus metrics from their operators and operands. Readers should be familiar with the architecture of the OpenShift cluster monitoring stack.
Prometheus is a monitoring system that pulls metrics over HTTP, meaning that monitored targets need to expose an HTTP endpoint (usually /metrics) which will be queried by Prometheus at regular intervals (typically every 30 seconds).
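As an illustration of what such an endpoint returns, the following Go sketch renders a sample in the Prometheus text exposition format and serves it from an HTTP handler. It only uses the standard library to keep the example self-contained; formatSample and metricsHandler are hypothetical helpers, and a real component would normally rely on a Prometheus client library instead of formatting metrics by hand.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// labelEscaper escapes backslashes, double quotes and newlines, which must
// be escaped inside label values in the text exposition format.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders one sample line. Labels are sorted by name so that
// the output is deterministic.
func formatSample(name string, labels map[string]string, value int) string {
	if len(labels) == 0 {
		return fmt.Sprintf("%s %d", name, value)
	}
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, n := range names {
		pairs = append(pairs, n+`="`+labelEscaper.Replace(labels[n])+`"`)
	}
	return fmt.Sprintf("%s{%s} %d", name, strings.Join(pairs, ","), value)
}

// metricsHandler serves a single hard-coded counter sample.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP http_requests_total Total number of HTTP requests by method and handler.")
	fmt.Fprintln(w, "# TYPE http_requests_total counter")
	fmt.Fprintln(w, formatSample("http_requests_total",
		map[string]string{"method": "GET", "handler": "/messages"}, 500))
}

func main() {
	// Call the handler in-process instead of opening a port; a real server
	// would register it with http.Handle("/metrics", ...).
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```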
To avoid leaking sensitive information to potential attackers, all OpenShift components scraped by the in-cluster monitoring Prometheus should follow these requirements:
As described in the Client certificate scraping enhancement proposal, we recommend that the components rely on client TLS certificates for authentication/authorization. This is more efficient and robust than using bearer tokens because token-based authn/authz adds a dependency (and additional load) on the Kubernetes API.
To this end, the Cluster Monitoring Operator provisions a TLS client certificate for the in-cluster Prometheus. The client certificate is issued for the system:serviceaccount:openshift-monitoring:prometheus-k8s Common Name (CN) and signed by the kubernetes.io/kube-apiserver-client signer. The certificate can be verified using the certificate authority (CA) bundle located at the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap.
In practice, the Cluster Monitoring Operator (CMO) creates a CertificateSigningRequest object for the prometheus-k8s service account, which is automatically approved by the cluster-policy-controller. Once the certificate is issued by the controller, CMO provisions a secret named metrics-client-certs which contains the TLS certificate and key (under the tls.crt and tls.key keys of the secret, respectively). CMO also rotates the certificate before it expires.
There are several options available depending on the framework your component is built with.
If your component already relies on *ControllerCommandConfig from github.com/openshift/library-go/pkg/controller/controllercmd, it should automatically expose a TLS-secured /metrics endpoint which has a hardcoded authorizer for the system:serviceaccount:openshift-monitoring:prometheus-k8s service account (link).
Example: the Cluster Kubernetes API Server Operator.
The “simplest” option when the component doesn’t rely on github.com/openshift/library-go (and switching to library-go isn’t an option) is to run a kube-rbac-proxy sidecar in the same pod as the application being monitored.
Here is an example of a container’s definition to be added to the Pod’s template of the Deployment (or DaemonSet):
- args:
  - --secure-listen-address=0.0.0.0:8443
  - --upstream=http://127.0.0.1:8081
  - --config-file=/etc/kube-rbac-proxy/config.yaml
  - --tls-cert-file=/etc/tls/private/tls.crt
  - --tls-private-key-file=/etc/tls/private/tls.key
  - --client-ca-file=/etc/tls/client/client-ca-file
  - --logtostderr=true
  - --allow-paths=/metrics
  image: quay.io/brancz/kube-rbac-proxy:v0.11.0 # usually replaced by CVO by the OCP kube-rbac-proxy image reference.
  name: kube-rbac-proxy
  ports:
  - containerPort: 8443
    name: metrics
  resources:
    requests:
      cpu: 1m
      memory: 15Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
  terminationMessagePolicy: FallbackToLogsOnError
  volumeMounts:
  - mountPath: /etc/kube-rbac-proxy
    name: secret-kube-rbac-proxy-metric
    readOnly: true
  - mountPath: /etc/tls/private
    name: secret-kube-rbac-proxy-tls
    readOnly: true
  - mountPath: /etc/tls/client
    name: metrics-client-ca
    readOnly: true
[...]
volumes:
# Secret created by the service CA operator.
# We assume that the Kubernetes service exposing the application's pods has the
# "service.beta.openshift.io/serving-cert-secret-name: kube-rbac-proxy-tls"
# annotation.
- name: secret-kube-rbac-proxy-tls
  secret:
    secretName: kube-rbac-proxy-tls
# Secret containing the kube-rbac-proxy configuration (see below).
- name: secret-kube-rbac-proxy-metric
  secret:
    secretName: secret-kube-rbac-proxy-metric
# ConfigMap containing the CA used to verify the client certificate.
- name: metrics-client-ca
  configMap:
    name: metrics-client-ca
Note: The metrics-client-ca ConfigMap needs to be created by your component and synced from the kube-system/extension-apiserver-authentication ConfigMap.
Here is a Secret containing the kube-rbac-proxy’s configuration (it allows only HTTPS requests to the /metrics endpoint for the Prometheus service account):
apiVersion: v1
kind: Secret
metadata:
  name: secret-kube-rbac-proxy-metric
  namespace: openshift-example
stringData:
  config.yaml: |-
    "authorization":
      "static":
      - "path": "/metrics"
        "resourceRequest": false
        "user":
          "name": "system:serviceaccount:openshift-monitoring:prometheus-k8s"
        "verb": "get"
type: Opaque
Example: node-exporter from the Cluster Monitoring operator.
Starting with v0.16.0, the controller-runtime framework provides a way to expose and secure a /metrics endpoint using TLS with minimal effort.
Refer to https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/metrics/server for details about TLS configuration and check the next section to understand how it needs to be configured.
You don’t use library-go, controller-runtime >= v0.16.0 or don’t want to run a kube-rbac-proxy sidecar.
In such situations, you need to implement your own HTTPS server for /metrics. As explained before, it needs to require and verify the TLS client certificate using the root CA stored under the client-ca-file key of the kube-system/extension-apiserver-authentication ConfigMap.
In practice, the server should set the ClientAuth field of its TLS configuration to RequireAndVerifyClientCert.
Example: https://github.com/openshift/cluster-monitoring-operator/pull/1870
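The requirement to require and verify the client certificate against the CA bundle could be sketched in Go as follows. This is a minimal sketch rather than the implementation used by any OpenShift component: newMetricsTLSConfig is a hypothetical helper, the minimum TLS version is an assumption, and loading the CA bundle from the ConfigMap (as well as reloading it when it changes) is left out.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newMetricsTLSConfig builds a server-side TLS configuration that rejects
// any client which does not present a certificate signed by one of the CAs
// found in caPEM.
func newMetricsTLSConfig(caPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid CA certificate found in PEM data")
	}
	return &tls.Config{
		ClientAuth: tls.RequireAndVerifyClientCert,
		ClientCAs:  pool,
		MinVersion: tls.VersionTLS12, // assumption: not mandated by this document
	}, nil
}

func main() {
	// In a real component, caPEM would hold the value of the client-ca-file
	// key. Invalid input is used here to show the error path.
	_, err := newMetricsTLSConfig([]byte("not a certificate"))
	fmt.Println(err)
}
```

The returned configuration can be assigned to the TLSConfig field of an http.Server, which is then started with ListenAndServeTLS using the server's own certificate and key.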
To tell the Prometheus pods running in the openshift-monitoring namespace (e.g. prometheus-k8s-{0,1}) to scrape the metrics from your operator/operand pods, you should use ServiceMonitor and/or PodMonitor custom resources.
The workflow is:
- Add the openshift.io/cluster-monitoring: "true" label to the namespace where the scraped targets live.
- For operators deployed through OLM, the operatorframework.io/cluster-monitoring=true annotation can be added to the operator’s CSV.
Below is a fictitious example using a ServiceMonitor object to scrape metrics from pods deployed in the openshift-example namespace.
Role and RoleBinding manifests
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: openshift-example
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: openshift-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
Service manifest
apiVersion: v1
kind: Service
metadata:
  annotations:
    # This annotation tells the service CA operator to provision a Secret
    # holding the certificate + key to be mounted in the pods.
    # The Secret name is "<annotation value>" (e.g. "tls-my-app-tls").
    service.beta.openshift.io/serving-cert-secret-name: tls-my-app-tls
  labels:
    app.kubernetes.io/name: my-app
  name: metrics
  namespace: openshift-example
spec:
  ports:
  - name: metrics
    port: 8443
    targetPort: metrics
  # Select all Pods in the same namespace that have the `app.kubernetes.io/name: my-app` label.
  selector:
    app.kubernetes.io/name: my-app
  type: ClusterIP
ServiceMonitor manifest
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: openshift-example
spec:
  endpoints:
  - interval: 30s
    # Matches the name of the service's port.
    port: metrics
    scheme: https
    tlsConfig:
      # The CA file used by Prometheus to verify the server's certificate.
      # It's the cluster's CA bundle from the service CA operator.
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      # The name of the server in the server's certificate. The service CA
      # operator issues it for "<service name>.<namespace>.svc".
      serverName: metrics.openshift-example.svc
      # The client's certificate file used by Prometheus when scraping the metrics.
      # This file is located in the Prometheus container.
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      # The client's key file used by Prometheus when scraping the metrics.
      # This file is located in the Prometheus container.
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  selector:
    # Select all Services in the same namespace that have the `app.kubernetes.io/name: my-app` label.
    matchLabels:
      app.kubernetes.io/name: my-app
This document is intended for OpenShift developers that want to write alerting rules for their operators and operands.
You configure alerting rules based on the metrics being collected for your component(s). To do so, you should create PrometheusRule objects in your operator/operand namespace, which will be picked up by the Prometheus operator (provided that the namespace has the openshift.io/cluster-monitoring="true" label for layered operators).
Here is an example of a PrometheusRule object with a single alerting rule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-example-operator-rules
  namespace: openshift-example-operator
spec:
  groups:
  - name: operator
    rules:
    - alert: ClusterExampleOperatorUnhealthy
      annotations:
        description: Cluster Example operator running in pod {{$labels.namespace}}/{{$labels.pod}} is not healthy.
        summary: Operator Example not healthy
      expr: |
        max by(pod, namespace) (last_over_time(example_operator_healthy[5m])) == 0
      for: 15m
      labels:
        severity: warning
You can choose to configure all your alerting rules into a single PrometheusRule object or split them into different objects (one per component). The mechanism to deploy the object(s) depends on the context: it can be deployed by the Cluster Version Operator (CVO), the Operator Lifecycle Manager (OLM) or your own operator.
Please refer to the Alerting Consistency OpenShift enhancement proposal for the recommendations applying to OCP built-in alerting rules.
If you need a review of alerting rules from the OCP monitoring team, you can reach them on the #forum-openshift-monitoring channel.
The enhancement proposal mentioned above states the following for OCP built-in alerts:
Alerts SHOULD include a namespace label indicating the source of the alert.
Unfortunately this isn’t something that we can verify by static analysis because the namespace label can come from the PromQL result or be added statically. Nevertheless we can still use the Telemetry data to identify OCP alerts that don’t respect this statement.
First, create an OCP cluster from the latest stable release. Once it is installed, run this command to return the list of all OCP built-in alert names:
curl -sk -H "Authorization: Bearer $(oc create token prometheus-k8s -n openshift-monitoring)" \
https://$(oc get routes -n openshift-monitoring thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/rules \
| jq -cr '.data.groups | map(.rules) | flatten | map(select(.type =="alerting")) | map(.name) | unique |join("|")'
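For readers who prefer to post-process the API response in code, the jq filter above can be approximated in Go as follows. This is a sketch under assumptions: alertNameRegex is a hypothetical helper, the response is assumed to have been fetched already, and only the fields of the /api/v1/rules response needed here are modelled. The sample payload is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// rulesResponse models only the fields of the /api/v1/rules response that
// are needed to list rule names.
type rulesResponse struct {
	Data struct {
		Groups []struct {
			Rules []struct {
				Name string `json:"name"`
				Type string `json:"type"`
			} `json:"rules"`
		} `json:"groups"`
	} `json:"data"`
}

// alertNameRegex returns the sorted, de-duplicated names of all alerting
// rules joined with "|", ready to be used in an alertname=~"..." matcher.
func alertNameRegex(body []byte) (string, error) {
	var resp rulesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	seen := map[string]struct{}{}
	for _, g := range resp.Data.Groups {
		for _, r := range g.Rules {
			if r.Type == "alerting" {
				seen[r.Name] = struct{}{}
			}
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return strings.Join(names, "|"), nil
}

func main() {
	// Made-up payload with a duplicate alert and a recording rule.
	body := []byte(`{"status":"success","data":{"groups":[
		{"rules":[{"name":"Watchdog","type":"alerting"},{"name":"TargetDown","type":"alerting"}]},
		{"rules":[{"name":"TargetDown","type":"alerting"},{"name":"job:up:sum","type":"recording"}]}
	]}}`)
	names, err := alertNameRegex(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // TargetDown|Watchdog
}
```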
Then from https://telemeter-lts.datahub.redhat.com, retrieve the list of all alerts matching the names that fired without a namespace label, grouped by minor release:
count by (alertname,version) (
  alerts{alertname=~"<insert the list of names returned by the previous command>",namespace=""} *
  on(_id) group_left(version) max by(_id, version) (
    label_replace(id_version_ebs_account_internal:cluster_subscribed{version=~"4.\\d\\d.*"}, "version", "$1", "version", "^(4.\\d+).*$")
  )
)
You should now track back the non-compliant alerts to their component of origin and file bugs against them (example).
The exercise should be done at regular intervals, at least once per release cycle.
This document is intended for OpenShift developers that want to add visualization dashboards for their operators and operands in the OCP administrator console.
Please refer to the document written by the Observability UI team.
The team can also be found in the #forum-observability-ui Slack channel.
This document is intended for OpenShift developers that want to ship new metrics to the Red Hat Telemetry service.
Before going into the details, a few words about Telemetry and the process to add a new metric.
What is Telemetry?
Telemetry is a system operated and hosted by Red Hat that collects data from connected clusters to enable subscription management automation, monitor the health of clusters, assist with support, and improve customer experience.
What does sending metrics via Telemetry mean?
You should send the metrics via Telemetry when you want and need to see these metrics for all OpenShift clusters. This is primarily for gaining insights on how OpenShift is used, troubleshooting and monitoring the fleet of clusters. Users can already see these metrics in their clusters via Prometheus even when not available via Telemetry.
How are metrics shipped via Telemetry?
Only metrics which are already collected by the in-cluster monitoring stack can be shipped via Telemetry. The telemeter-client pod running in the openshift-monitoring namespace collects metrics from the prometheus-k8s service every 4m30s using the /federate endpoint and ships the samples to the Telemetry endpoint using a custom protocol.
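To illustrate what a request to the /federate endpoint looks like, the Go sketch below builds a federation URL with one match[] parameter per series selector. It is not the telemeter-client implementation: federateURL is a hypothetical helper and the base URL is an assumption taken from the local debugging commands used elsewhere in this document.

```go
package main

import (
	"fmt"
	"net/url"
)

// federateURL returns the URL of the /federate endpoint for the given
// Prometheus base URL, with one match[] query parameter per selector.
func federateURL(base string, selectors []string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/federate"
	q := url.Values{}
	for _, s := range selectors {
		q.Add("match[]", s)
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// The base URL is an assumption made for this sketch.
	u, err := federateURL("http://localhost:9090", []string{
		`{__name__="openshift:prometheus_tsdb_head_series:sum"}`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```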
How long will it take for my new telemetry metrics to show up?
Please start this process and involve the monitoring team as early as possible. The process described in this document includes a thorough review of the underlying metrics and labels. The monitoring team will try to understand your use case and perhaps propose improvements and optimizations. Metric, label and rule names will be reviewed for following best practices. This can take several review rounds over multiple weeks.
Shipping metrics via Telemetry is only possible for components running in namespaces with the openshift.io/cluster-monitoring=true label. In practice, it means that your component falls into one of these 2 categories:
Your component should already be instrumented and scraped by the in-cluster monitoring stack using ServiceMonitor and/or PodMonitor objects.
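The overall process is as follows:
- Identify the metrics to ship and evaluate their cardinality.
- File a JIRA ticket for each metric and get an explicit approval.
- Configure recording rules using PrometheusRule objects.
- Add the recorded metrics to the telemeter-client allow list in the cluster-monitoring-operator repository.
- Synchronize the configuration of the Telemetry server.
The first step is to identify which metrics you want to send via Telemetry and what their cardinality is (i.e. how many timeseries there will be in total). Typically you start with metrics that show how your component is being used. In practice, we recommend shipping no more than: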
If you are above these limits, you have 2 choices:
Finally your metric MUST NOT contain any personally identifiable information (names, email addresses, information about user workloads).
Use the following information to file 1 JIRA ticket per metric in the MON project:
- Type: Task
- Title: Send metric <metric name> via Telemetry
- Label: telemetry-review-request
- Description template:
h1. Request for sending data via telemetry
The goal is to collect metrics about ... because ...
<Metric name> represents ...
Labels
* <label 1>, possible values are ...
* <label 2>, possible values are ...
The cardinality of the metric is at most <X>.
Component exposing the metric: https://github.com/<org>/<project>
Reach out to @team-telemetry on the #forum-openshift-monitoring or #forum-observatorium Slack channels for an explicit approval (e.g. in-cluster and RHOBS team leads).
Recording rules are required to reduce the cardinality of the metrics being shipped.
Even for low-cardinality metrics, we require them to be aggregated before shipping to Telemetry in order to remove unnecessary labels such as instance or pod. This also protects the Telemetry backend against future label additions to the underlying metrics.
Let’s take a concrete example: each Prometheus pod exposes a prometheus_tsdb_head_series metric which tracks the number of active timeseries. There can be up to 4 Prometheus pods in a given cluster (2 pods in openshift-monitoring and 2 in openshift-user-workload-monitoring when user-defined monitoring is enabled). To reduce the number of timeseries shipped via Telemetry, we configure the following recording rule to sum the values by the namespace and job labels:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-monitoring-operator-prometheus-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: openshift-monitoring.rules
    rules:
    - expr: |-
        sum by (job,namespace) (
          max without(instance) (
            prometheus_tsdb_head_series{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}
          )
        )
      record: openshift:prometheus_tsdb_head_series:sum
Your PrometheusRule object(s) should be created by your operator with your ServiceMonitor and/or PodMonitor objects.
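The effect of aggregating by job and namespace can be simulated with a small Go program. This is only an illustration of how grouping reduces the number of series, not how Prometheus evaluates rules: sumBy is a hypothetical helper, the max without(instance) step is omitted, and the job names and sample values are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// series is a simplified time series: a label set and its current value.
type series struct {
	Labels map[string]string
	Value  float64
}

// sumBy groups the input by the given label names and sums the values,
// dropping every other label, similar to PromQL's "sum by (...)".
func sumBy(in []series, by ...string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range in {
		parts := make([]string, 0, len(by))
		for _, name := range by {
			parts = append(parts, name+"="+s.Labels[name])
		}
		out[strings.Join(parts, ",")] += s.Value
	}
	return out
}

func main() {
	// Four Prometheus pods; job names and values are hypothetical.
	in := []series{
		{map[string]string{"namespace": "openshift-monitoring", "job": "prometheus-k8s", "pod": "prometheus-k8s-0"}, 1000},
		{map[string]string{"namespace": "openshift-monitoring", "job": "prometheus-k8s", "pod": "prometheus-k8s-1"}, 1000},
		{map[string]string{"namespace": "openshift-user-workload-monitoring", "job": "prometheus-user-workload", "pod": "prometheus-user-workload-0"}, 50},
		{map[string]string{"namespace": "openshift-user-workload-monitoring", "job": "prometheus-user-workload", "pod": "prometheus-user-workload-1"}, 50},
	}
	agg := sumBy(in, "job", "namespace")

	keys := make([]string, 0, len(agg))
	for k := range agg {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("{%s} %g\n", k, agg[k])
	}
}
```

The four input series collapse into two, one per (job, namespace) pair, and the pod label no longer appears in the result.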
Clone the cluster-monitoring-operator repository locally.
Modify the /manifests/0000_50_cluster-monitoring-operator_04-config.yaml file to add the metric to the allowed list. Include comments to:
#
# owners: (@openshift/openshift-team-monitoring)
#
# openshift:prometheus_tsdb_head_series:sum tracks the total number of active series
- '{__name__="openshift:prometheus_tsdb_head_series:sum"}'
Then run make --always-make docs.
Commit the changes into Git and open a pull request in the openshift/cluster-monitoring-operator repository linking to the initial JIRA ticket.
Ask for a review on the #forum-monitoring Slack channel.
Once the pull request in the cluster-monitoring-operator repository is merged, the configuration of the Telemetry server needs to be synchronized.
Clone the rhobs/configuration repository.
Run make whitelisted_metrics && make
Commit the changes into Git and open a pull request in the rhobs/configuration repository.
Ask for a review on the #forum-observatorium Slack channel.
Once merged, the updated configuration should be rolled out to the production Telemetry within a few days. After this happens, clusters running the next (e.g. master) OCP version should start sending the new metric(s) to Telemetry.
A given metric may have different labels (aka dimensions) that help refine the characteristics of the thing being measured. Each unique combination of a metric name + optional key/value pairs represents a timeseries in Prometheus parlance. The total number of active timeseries for a given metric name represents the cardinality of the metric.
For example, consider a component exposing a fictitious my_component_ready metric:
my_component_ready 1
The metric has no label but because Prometheus automatically attaches target labels such as pod and instance, the total cardinality could be 1 (single replica), 2 (2 replicas), …
To find out the current cardinality of a metric on a live cluster, you can run this PromQL query:
count(my_component_ready)
Now consider another metric tracking HTTP requests:
http_requests_total{method="GET", code="200", path="/"} 10
http_requests_total{method="GET", code="404", path="/foo"} 1
http_requests_total{method="POST", code="200", path="/"} 12
http_requests_total{method="POST", code="500", path="/login"} 2
While you may think that the cardinality is 4 because there are 4 timeseries, this isn’t true because we can’t really predict in advance all values for the code and path labels. This is what is called a high-cardinality metric. An even worse case would be a metric with a userid or ip label (we would say that this metric has unbounded cardinality).
On top of that, pod churn (e.g. pods being rolled out because of version upgrades) also increases the cardinality because the values of target-based labels (such as pod and instance) would change.
Because Prometheus keeps all active timeseries in memory for indexing, the more timeseries, the more memory is required. The same is true for the Telemeter server. This is why we want to keep the cardinality of metrics shipped via Telemetry under a reasonable value (typically less than 5).
See the previous section. Every metric shipped to Telemetry has to be multiplied by the number of connected clusters that may be sending that metric. Pushing too many metrics from a single cluster may cause service degradation and resource exhaustion on both the in-cluster monitoring stack and on the Telemetry server side.
Yes, the Telemeter client is already configured to collect and send firing alerts. On the Telemetry side, the alerts can be queried using the alerts metric.
Once you have updated the telemeter-client configuration in the master branch, you can create backports to older OCP releases. The procedure follows the usual OCP backport process which involves creating bug tickets in the OCPBUGS project (preferably assigned to your component) and opening pull requests in openshift/cluster-monitoring-operator against the desired release-4.x branches.
Please refer to the following links for more details:
You can also reach out to the OpenShift monitoring team for advice.
If your component’s metrics aren’t already collected by the in-cluster monitoring stack, you need to deploy at least one ServiceMonitor or one PodMonitor resource in your component’s namespace.
If your component is deployed by the Cluster Version Operator (CVO), it is enough to add the manifest to the CVO payload.
Again you can reach out to the OpenShift monitoring team for advice.
The Telemetry client authenticates against the Telemeter server using the cluster’s pull secret. The Telemeter server verifies that the pull secret is valid and matches the cluster’s identifier. The Telemetry protocol uses HTTPS for encryption.
Finally the Telemeter server will only allow metrics which are explicitly allowed by its running configuration.
Also known as Telemetry or Telemeter server. A service operated by Red Hat that receives metrics from all OCP connected clusters.
The telemeter-client pod running in the openshift-monitoring namespace. It is responsible for collecting the platform metrics at regular intervals (every 4m30s) and sending them to the Telemetry server.
The prometheus-k8s-0 and prometheus-k8s-1 pods running in the openshift-monitoring namespace. They are in charge of collecting metrics from the OpenShift components (operators+operands) and evaluating the associated alerting and recording rules. The Prometheus pods are configured using ServiceMonitor, PodMonitor and PrometheusRule custom resources coming from namespaces with the openshift.io/cluster-monitoring=true label.
This serves as a collection of resources related to frequently asked questions about configuring and debugging the in-cluster monitoring stack. In particular, it applies to two OpenShift projects:
- Platform Monitoring (PM)
- User Workload Monitoring (UWM)
See this presentation to understand which tools are at your disposal.
Both PM and UWM monitoring stacks rely on the ServiceMonitor and PodMonitor custom resources in order to tell Prometheus which endpoints to scrape.
The examples below show the namespace openshift-monitoring, which can be replaced with openshift-user-workload-monitoring when dealing with UWM.
A detailed description of how the resources are linked exists here, but we will walk through some common issues to debug the case of missing metrics.
- The serviceMonitorSelector in the Prometheus CR matches the key in the ServiceMonitor labels.
- The Service you want to scrape must have an explicitly named port.
- The ServiceMonitor must reference the port by this name.
- The ServiceMonitor must match an existing Service.
Assuming these criteria are met but the metrics don’t exist, we can try to debug the cause.
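The matching conditions listed above can be sketched with a simplified Go model. This is not the Prometheus operator's code: the service and serviceMonitor types are hypothetical, only matchLabels-style selection is modelled, and namespace selection is ignored.

```go
package main

import "fmt"

// service is a simplified view of a Kubernetes Service: its labels and the
// names of its ports (an empty string stands for an unnamed port).
type service struct {
	Labels    map[string]string
	PortNames []string
}

// serviceMonitor is a simplified view of a ServiceMonitor: the matchLabels
// selector and the port name referenced by its endpoint.
type serviceMonitor struct {
	MatchLabels map[string]string
	Port        string
}

// selects reports whether every matchLabels entry is present on the Service
// and the Service has a named port equal to the one referenced.
func (sm serviceMonitor) selects(svc service) bool {
	for k, want := range sm.MatchLabels {
		if got, ok := svc.Labels[k]; !ok || got != want {
			return false
		}
	}
	for _, name := range svc.PortNames {
		if name != "" && name == sm.Port {
			return true
		}
	}
	return false
}

func main() {
	svc := service{
		Labels:    map[string]string{"app.kubernetes.io/name": "my-app"},
		PortNames: []string{"metrics"},
	}
	ok := serviceMonitor{MatchLabels: map[string]string{"app.kubernetes.io/name": "my-app"}, Port: "metrics"}
	wrongPort := serviceMonitor{MatchLabels: map[string]string{"app.kubernetes.io/name": "my-app"}, Port: "web"}

	fmt.Println(ok.selects(svc))        // true
	fmt.Println(wrongPort.selects(svc)) // false
}
```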
There is a possibility Prometheus has not loaded the configuration yet. The following metrics will help to determine if that is in fact the case or if there are errors in the configuration:
prometheus_config_last_reload_success_timestamp_seconds
prometheus_config_last_reload_successful
If there are errors with reloading the configuration, it is likely the configuration itself is invalid and examining the logs will highlight this.
oc logs -n openshift-monitoring prometheus-k8s-0 -c <container-name>
Assuming that the reload was a success, Prometheus should see the configuration.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/status/config | grep "<service-monitor-name>"
If the ServiceMonitor does not exist in the output, the next step would be to investigate the logs of both prometheus and the prometheus-operator for errors.
Assuming it does exist then we know prometheus-operator is doing its job. Double check the ServiceMonitor definition.
Check the service discovery endpoint to ensure Prometheus can discover the target. It will need the appropriate RBAC to do so. An example can be found here.
First of all, check the TargetDown runbook.
We have, in the past, seen cases where the TargetDown alert was firing when all endpoints appeared to be up. The following commands fetch some useful metrics to help identify the cause.
As the alert fires, get the list of active targets in Prometheus
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl 'http://localhost:9090/api/v1/targets?state=active' > targets.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl 'http://localhost:9090/api/v1/targets?state=active' > targets.prometheus-k8s-1.json
Report all targets that Prometheus couldn’t connect to, along with the reason (timeout, refused, …).
A dialer_name can be passed as a label to limit the query to interesting components. For example {dialer_name=~".+openshift-.*"}.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=rate(net_conntrack_dialer_conn_failed_total{}[1h]) > 0' > net_conntrack_dialer_conn_failed_total.prometheus-k8s-1.json
Identify targets that are slow to serve metrics and may be considered as down.
oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-0.json
oc exec -n openshift-monitoring prometheus-k8s-1 -c prometheus -- curl http://localhost:9090/api/v1/query --data-urlencode 'query=sort_desc(max by(job) (max_over_time(scrape_duration_seconds[1h])))' > slow.prometheus-k8s-1.json
Often, when “high” CPU usage or spikes are identified it can be a symptom of expensive rules.
A good place to start the investigation is the /rules endpoint of Prometheus: look for excessive rule evaluation times to identify the queries which might contribute to the problem.
In cases where excessive CPU usage is being reported, it might be useful to obtain Pprof profiles from the Prometheus containers over a short time span.
To gather CPU profiles over a period of 30 minutes, run the following:
SLEEP_MINUTES=5
# Length of each CPU profile, in seconds.
PROFILE_SECONDS=30
# Total collection window, in minutes.
duration=${DURATION:-30}
while [ "$duration" -gt 0 ]; do
  for i in 0 1; do
    echo "Retrieving CPU profile for prometheus-k8s-$i..."
    oc exec -n openshift-monitoring prometheus-k8s-$i -c prometheus -- curl -s "http://localhost:9090/debug/pprof/profile?seconds=$PROFILE_SECONDS" > "cpu.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof"
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
  # Count down the remaining minutes so that the loop ends after the window.
  duration=$(( duration - SLEEP_MINUTES ))
done
The following queries might prove useful for debugging.
Calculate the ingestion rate over the last two minutes:
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
-- curl -s http://localhost:9090/api/v1/query --data-urlencode \
'query=sum by(pod,job,namespace) (max without(instance) (rate(prometheus_tsdb_head_samples_appended_total{namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}[2m])))' > samples_appended.json
Calculate “non-evictable” memory:
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 \
-- curl -s http://localhost:9090/api/v1/query --data-urlencode \
'query=sort_desc(sum by (pod,namespace) (max without(instance) (container_memory_working_set_bytes{namespace=~"openshift-monitoring|openshift-user-workload-monitoring", container=""})))' > memory.json
In cases where excessive memory is being reported, it might be useful to obtain Pprof profiles from the Prometheus containers over a short time span.
To gather memory profiles over a period of 30 minutes, run the following:
SLEEP_MINUTES=5
# Total collection window, in minutes.
duration=${DURATION:-30}
while [ "$duration" -gt 0 ]; do
  for i in 0 1; do
    echo "Retrieving memory profile for prometheus-k8s-$i..."
    oc exec -n openshift-monitoring prometheus-k8s-$i -c prometheus -- curl -s http://localhost:9090/debug/pprof/heap > "heap.prometheus-k8s-$i.$(date +%Y%m%d-%H%M%S).pprof"
  done
  echo "Sleeping for $SLEEP_MINUTES minutes..."
  sleep $(( 60 * SLEEP_MINUTES ))
  # Count down the remaining minutes so that the loop ends after the window.
  duration=$(( duration - SLEEP_MINUTES ))
done