## Evolution of the Observability Operator (fka MSO)

TL;DR: As a mid-to-long-term vision for the Monitoring Stack Operator, we propose to rename MSO to OO (Observability Operator). Along with the name, we propose to establish OO as the open source, (single-)cluster-side component for OpenShift observability needs, complementing Observatorium / RHOBS as the multi-cluster, multi-tenant, scalable observability backend. OO is intended to manage different kinds of cluster-side monitoring, alerting, logging, and tracing (and potentially profiling) stacks, covering the needs of OpenShift variants ranging from the client side of fully managed multi-cluster setups, to HyperShift, to single-node air-gapped deployments.

### Why

With the rise of new OpenShift variants that have very different observability needs, the desire grows for a consistent way of providing differently configured monitoring stacks. Examples:

- Traditional single-cluster on-prem OpenShift deployments need a self-contained monitoring stack where all components (scraping, storage, visualization, alerting, log collection) run on the same cluster. This is the kind of setup the Cluster Monitoring Operator (CMO) was designed for.
- Multi-cluster deployments need a central (aggregated, federated) view of metrics, logs, and alerts. Certain components of the stack run not on the workload clusters but in some central infrastructure.
- Resource-constrained deployments need a stripped-down version of the monitoring stack, e.g. only forwarding signals to a central infrastructure or disabling certain parts of the stack completely.
- Mixed deployments (e.g. OpenShift + OpenStack) are conceptually very similar to the multi-cluster use case, also needing a single pane of glass for observability signals.
- Special-purpose deployments (e.g. OpenShift Data Foundation) have special monitoring requirements that are tricky to align with the existing CMO setup.
- With an eye toward eventually correlating different observability signals, the cluster-side stack would also benefit from a holistic approach to deploying monitoring, logging, and tracing components and configuring them to work together.
- Managed service deployments need a multi-tenancy-capable way of deploying many similarly built monitoring stacks to a single cluster. This is the short-term focus for OO.

The proposal is to combine all these (and more) use cases into one single (meta) operator, as recommended by the operator best practices, which can be configured, e.g. via presets, to instruct lower-level operators (such as prometheus-operator, and potentially the Loki or Jaeger operators) to deploy purpose-built monitoring stacks for different use cases. This is similar to the original CMO concept, but with much higher flexibility and feature velocity in mind, thanks to not being tied to OpenShift core versioning.
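
To make the preset idea concrete, here is a minimal sketch of what such a preset-driven API could look like, written as Go CRD types. All type and field names below are hypothetical illustrations, not the actual Observability Operator API:

```go
// Package v1alpha1 sketches a hypothetical preset-driven stack API.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// StackPreset selects a purpose-built layout of the stack assembled by
// lower-level operators (prometheus-operator, Loki operator, ...).
type StackPreset string

const (
	// PresetSelfContained: everything (scraping, storage, visualization,
	// alerting) runs on the same cluster, as in traditional deployments.
	PresetSelfContained StackPreset = "SelfContained"
	// PresetForwarder: a stripped-down stack that only ships signals to a
	// central backend such as Observatorium / RHOBS.
	PresetForwarder StackPreset = "Forwarder"
)

// ObservabilityStackSpec is the user-facing configuration the meta-operator
// would translate into lower-level operator resources.
type ObservabilityStackSpec struct {
	// Preset picks one of the predefined stack layouts.
	Preset StackPreset `json:"preset"`
	// RemoteWriteURL is where a Forwarder stack ships its metrics.
	RemoteWriteURL string `json:"remoteWriteURL,omitempty"`
	// Retention bounds local storage for self-contained stacks.
	Retention metav1.Duration `json:"retention,omitempty"`
}
```

A controller reconciling such a resource would translate the chosen preset into the corresponding lower-level resources, e.g. a prometheus-operator `Prometheus` object with remote write configured for the `Forwarder` preset.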

Additionally, supporting multiple different ways of deploying monitoring stacks (CMO as the standard OpenShift way, OO for managed services, something else for e.g. HyperShift or edge scenarios, …) is a burden for the team. Instead, eventually supporting only one way to deploy monitoring stacks - with OO covering all these use cases - is a lot simpler and far more consistent.

### Pitfalls of the current solution

CMO is built for traditional, self-operated, single-cluster deployments of OpenShift. It intentionally lacks the flexibility for many other use cases (see above) in order to provide monitoring that is resilient against configuration drift. E.g. the reason for creating OO (MSO) in the first place - supporting managed service use cases - can't currently be covered by CMO. See the original MSO proposal for more details.

The results of this lack of flexibility can be readily observed: Red Hat teams have built their own solutions for their monitoring use cases, e.g. leveraging community operators or self-written deployments, with varying success, reliability and supportability.

### Goals

Goals and use cases for the solution as proposed in How:

- Widen the scope of OO to cover additional use cases besides managed services.
- Replace existing ways of deploying monitoring stacks across Red Hat products with OO.
- Focus primarily on OpenShift use cases, but don't exclude vanilla Kubernetes as a possible target.
- Create an API that easily allows common configuration across observability signals (see the sketch after this list).
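
As a rough illustration of such a cross-signal API, the following Go sketch shows a shared configuration block embedded by signal-specific sections; `CommonConfig`, `MetricsConfig`, and all field names are assumptions for illustration, not an existing API:

```go
package v1alpha1

// CommonConfig captures settings that apply to metrics, logs, and traces
// alike, so tenancy and connection details are declared only once.
type CommonConfig struct {
	// Tenant identifies the owner of the signals on a multi-tenant
	// backend such as Observatorium / RHOBS.
	Tenant string `json:"tenant"`
	// Endpoint is the central backend that signals are forwarded to.
	Endpoint string `json:"endpoint"`
	// CASecretName references a Secret holding the CA bundle used to
	// verify the backend's TLS certificate.
	CASecretName string `json:"caSecretName,omitempty"`
}

// MetricsConfig shows how a signal-specific section embeds the common
// block while adding only its own knobs.
type MetricsConfig struct {
	CommonConfig `json:",inline"`
	// ScrapeInterval applies to metrics collection only.
	ScrapeInterval string `json:"scrapeInterval,omitempty"`
}
```

Declaring tenancy and connection details once keeps the per-signal sections small and ensures all signals end up at the same backend.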

### Non-Goals

- Create a multi-cluster-capable observability operator.

### How

- Define the use cases to be covered in detail.
- Prioritize the use cases and add the needed features one by one.

### Alternatives

1. Tackle each monitoring use case across Red Hat products one by one and build a custom solution for each. This would lead to many different (but potentially simpler) implementations that need to be supported.
2. Develop signal-specific operators that can handle the required use cases. This would likely require an API between those operators to apply common configuration.

### Action Plan

Collection of requirements and prioritization of use cases is currently in progress (Q3 2022).
