Telemetry (Telemeter)

2 minute read

Telemetry (Telemeter)

For RHOBS Overview see this document

Telemeter is the metrics-only hard tenant of the RHOBS service designed as a centralized OpenShift Telemetry pipeline for OpenShift Container Platform. It is an essential part of gathering real-time telemetry for remote health monitoring, automation and billing purposes.

Product Managers

  • Roger Floren

Big Picture Overview

Source

Support

To escalate issues use, depending on issue type:

  • For questions related to the service or kind of data it ingests, use telemetry-sme@redhat.com (internal) mail address. For quick questions you can try to use #forum-telemetry on CoreOS Slack.
  • For functional bugs or feature requests use Bugzilla, with Product: Openshift Container Platform and Telemeter component (example bug). You can additionally notify us about a new bug on #forum-telemetry on CoreOS Slack.
  • For functional bugs or feature requests for historical storage (Data Hub), use the PNT Jira project.

For the managing team: See our internal agreement document.

Escalations

For urgent escalation use:

  • For Telemeter Service Unavailability: @app-sre-ic and @observatorium-oncall on CoreOS Slack.
  • For Historical Data (DataHub) Service Unavailability: @data-hub-ic on CoreOS Slack.

Service Level Agreement

SLO

RHOBS has currently established the following default Service Level Objectives. This is based on the infrastructure dependencies we have listed here (internal).

Previous docs (internal):

Metrics SLIs

API SLI Type SLI Spec Period SLI Implementation Dashboard
/write Availability The % of successful (non 5xx) requests 28d Metrics from Observatorium API Dashboard
/write Latency The % of requests under X latency 28d Metrics from Observatorium API Dashboard

Read Metrics TBD.

Agreements:

NOTE: No entry for your case (e.g. dev/staging) means zero formal guarantees.

SLI Date of Agreement Tier SLO Notes
/write Availability 2020/2019 Internal (default) 99% success rate for incoming requests This depends on SSO RedHat com SLO (98.5%). In worst case (everyone needs to refresh token) we have below 98.5%, in the avg case with caching being 5m (we cannot change it) ~99% (assuming we can preserve 5m)
/write Latency 2020/2019 Internal (default) 95% of requests < 250ms, 99% of requests < 1s
Write Limits

Within our SLO, the write request must match following criteria to be considered valid:

TODO: Provide example tune-ed client Prometheus configurations for remote write

Last modified November 14, 2024