
Team

A place for documentation and processes for the Red Hat Monitoring Group teams.

1 - Observability Platform

We are the team responsible (mainly) for:

1.1 - Observability Platform Team SRE Processes

This document explains a few processes related to operating our production software together with the AppSRE team. All of these operations can be summarised as the Site Reliability Engineering part of our work.

Goals

The main goal of our SRE work within this team is to deploy and operate software that meets the functional (API) and performance (Telemeter SLO, MST SLO) requirements of our customers.

Currently, we offer an internal service called RHOBS that is used across the company.

Releasing Changes / Patches

In order to maintain our software in production reliably, we need to be able to test, release and deploy our changes rapidly.

We can divide the things we change into a few categories. Let’s go through the processes for each category:

Software Libraries

A library is a project that is not deployed directly, but rather is a dependency of the micro-service applications we deploy. For example, https://github.com/prometheus/client_golang or https://github.com/grpc-ecosystem/go-grpc-middleware.
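
As a concrete illustration, a service we deploy imports such a library as a dependency instead of running it on its own. Below is a minimal, hypothetical Go sketch using client_golang to expose metrics; the metric name, port and handler are made up for illustration.

  package main

  import (
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // requestsTotal is an example metric; the name is illustrative only.
  var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
      Name: "myapp_http_requests_total",
      Help: "Total number of handled HTTP requests.",
  })

  func main() {
      // Register the metric with the default registry provided by the library.
      prometheus.MustRegister(requestsTotal)

      http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
          requestsTotal.Inc()
          w.Write([]byte("hello"))
      })
      // Expose the metrics endpoint for Prometheus to scrape.
      http.Handle("/metrics", promhttp.Handler())
      http.ListenAndServe(":8080", nil)
  }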

Testing:

  • Linters, unit and integration tests on every PR.

Releasing

  • GitHub release using git tag (RC first, then major/minor or patch release).

Deploying

Software for (Micro) Services

Software for (micro) services usually lives in many separate open source repositories on GitHub, e.g. https://github.com/thanos-io/thanos or https://github.com/observatorium/api.

Testing:

  • Linters, unit and integration tests on every PR.

Releasing

  • GitHub release using git tag (RC first, then major/minor or patch release).
  • Building Docker images on quay.io and Docker Hub (backup) using CI for each tag or main branch PR.

Deploying

Configuration

All of our configuration is rooted in https://github.com/rhobs/configuration: configuration templates written in jsonnet. These can then be overridden by parameters defined in app-interface saas files.
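
To make the template-plus-parameters idea concrete, here is a minimal, hypothetical Go sketch using the google/go-jsonnet library (our actual pipeline uses the jsonnet tooling in rhobs/configuration directly; all names below are made up). It shows a jsonnet template whose default is overridden by an externally supplied parameter, which is essentially what the app-interface saas file parameters do.

  package main

  import (
      "fmt"
      "log"

      "github.com/google/go-jsonnet"
  )

  // A toy template: a value that can be overridden from the outside,
  // similar in spirit to how app-interface parameters override our
  // configuration templates. All names here are illustrative.
  const snippet = `
  {
    deployment: {
      name: 'example-api',
      replicas: std.parseInt(std.extVar('REPLICAS')),
    },
  }
  `

  func main() {
      vm := jsonnet.MakeVM()
      // The parameter an environment (e.g. staging vs. production) would set.
      vm.ExtVar("REPLICAS", "3")

      out, err := vm.EvaluateAnonymousSnippet("example.jsonnet", snippet)
      if err != nil {
          log.Fatal(err)
      }
      fmt.Println(out)
  }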

Testing:

  • Building jsonnet resources, linting jsonnet, validating OpenShift templates.
  • Validation on the app-interface side for Kubernetes YAML compatibility.

Releasing

  • Merge to main and get the commit hash you want to deploy.

Deploying

  • Make sure you are on RH VPN.
  • Propose a PR to https://gitlab.cee.redhat.com/service/app-interface that changes the ref for the desired environment to the desired commit SHA in the app-interface saas file for the desired tenant (Telemeter or MST), or that changes an environment parameter.
  • Ask the team for review. If the change impacts production heavily, notify AppSRE.
    • If only the saas.yaml file was changed, a /lgtm from the Observability Platform team is enough for the PR to get merged automatically.
    • If any other file was changed, an AppSRE engineer has to /lgtm it.
  • When merged, CI will deploy the changes to the namespace specified in saas.yaml, e.g. to production.

NOTE: Don’t change both production and staging in the same MR.
NOTE: Deploy to production only changes that were previously deployed to staging (automation for this is TBD).

You can see the version change:

Monitoring Resources

Grafana dashboards are defined here: https://github.com/rhobs/configuration/tree/main/observability/dashboards
Alerts and rules are defined here: https://github.com/rhobs/configuration/blob/main/observability/prometheusrules.jsonnet

Testing:


Releasing

  • Merge to main and get the commit hash you want to deploy.

Deploying

On-Call Rotation

Currently, RHOBS services are supported 24/7 by two teams:

  • The AppSRE team for infrastructure and “generalist” support. They are our first incident response point. They try to support our stack as far as runbooks and general knowledge allow.
  • The Observability Platform Team (dev on-call) for incidents impacting the SLA that are outside of AppSRE’s expertise (bug fixes, more complex troubleshooting). We are notified when AppSRE needs us.

Incident Handling

This is the process we as the Observability Team try to follow during incident response.

An incident occurs when any of our services violates an SLO we set with our stakeholders. Refer to the Telemeter SLO and MST SLO documents for details on the SLA.
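
For intuition only (the actual targets are defined in the linked SLO documents): with a hypothetical availability SLO of 99% over a 28-day window, the error budget works out to

  $(1 - 0.99) \times 28 \times 24\,\mathrm{h} \approx 6.7\,\mathrm{h}$

i.e. roughly 6.7 hours of unavailability in 28 days before the SLO, and potentially the SLA behind it, is violated.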

NOTE: The following procedure applies to both production and staging. Many teams, e.g. SubWatch, depend on a working staging environment, so follow a similar process as for production. The only difference is that we do not need to mitigate or fix staging issues outside of office hours.

Potential Trigger Points:

  • You got notified on Slack by AppSRE team or paged through Pager Duty.
  • You got notified about potential SLA violation by the customer: Unexpected responses, things that worked before do not work now etc.
  • You touched production for unrelated reasons and noticed something worrying (error logs, an un-monitored resource, etc.).
  1. If you are not on-call, notify the Observability Platform on-call engineer. If you are on-call, the on-call engineer is not present, or you agreed that you will handle this incident, go to step 2.

  2. Straight away, create a JIRA ticket for the potential incident. Don’t think twice; it’s easy to create and essential for tracking the incident later on. Fill in the following fields:

    • Title: Symptom you see.

    • Type: Bug

    • Priority: Try to assess how important it is. If it impacts production, it’s a “Blocker”.

    • Component: RHOBS

    • (Important) Label: incident no-qe

    • Description: Mention how you were notified (ideally with a link to the alert/Slack thread). Mention what you know so far.

      See example incident tickets here.

  3. If AppSRE is not yet aware, drop a link to the created incident ticket in the #sd-app-sre channel and notify @app-sre-primary and @observatorium-oncall. Don’t repeat yourself; ask everyone to follow the ticket comments.

    • AppSRE may or may not create a dedicated channel, handle communication efforts, and start an on-call Zoom meeting. We as the dev team don’t need to worry about those elements; go to step 4.
  4. Investigate possible mitigation. Ensure the problem is mitigated before focusing on root cause analysis.

    • Important: Note all performed actions and observations as comments on the created JIRA ticket. This allows anyone to follow up on what was checked and how. It is also essential for the detailed Post Mortem / RCA process later on.
    • Note on the JIRA ticket all automation or monitoring gaps you wish you had. This will be useful for follow-up actions after the incident.
  5. Confirm with AppSRE that the incident is mitigated. Investigate the root cause. If mitigation is applied and the root cause is known, declare the incident over.

  6. Note potential long-term fixes and ideas. Close the incident JIRA ticket.

After Incident

  1. After some time (within a week), start a Post Mortem (RCA) document in Google Docs. Use the following Google Docs RCA template. Put it in our team RCA Google directory here. Link it in the JIRA ticket too.
  2. Collaborate on the Post Mortem. Make sure it is blameless but accurate. Share it as soon as possible with the team and AppSRE.
  3. Once done, schedule a meeting with the Observability Platform team and optionally AppSRE to discuss the RCA/Post Mortem action items and their effects.

Idea: If you have time, before sharing the Post Mortem / RCA, perform a “Wheel of Misfortune”. Select an on-call engineer who was not participating in the incident and simulate the error by triggering the root cause in a safe environment. Then meet together with the team and let that engineer coordinate the simulated incident. Help along the way to share knowledge and insights. This is the best way to onboard people to production topics.

NOTE: The freely available SRE book is a good source of general patterns for efficient incident management. Recommended read!

2 - Team member onboarding

Welcome to the Monitoring Team! We created this document to help guide you through your onboarding.

Please fork this repo and propose changes to the content as a pull request if something is not accurate, outdated or unclear. Use your fork to track the progress of your onboarding tasks, e.g. by keeping a copy in your fork with comments about your completion status. Please read the doc thoroughly and check off each item as you complete it. We continually want to improve our onboarding experience, and your feedback helps future team members.

This document contains some links to Red Hat internal documents, and you won’t be able to access them without a Red Hat associate login. We still try to keep as much information as possible in the open.

Team Background

Team Monitoring mainly focuses on Prometheus, Thanos, Observatorium, and their integration into Kubernetes and OpenShift. We are also responsible for the hosted multi-tenant Observability service which powers such services as OpenShift Telemetry and OSD metrics.

Prometheus is a monitoring project initiated at SoundCloud in 2012. It became public and widely usable in early 2015. Since then, it found adoption across many industries. In early 2016 development diversified away from SoundCloud as CoreOS hired one of the core developers. Prometheus is not a single application but an entire ecosystem of well-separated components, which work well together but can be used individually.

CoreOS was acquired by Red Hat in early 2018. The CoreOS monitoring team became the Red Hat Monitoring Team, which has since evolved into the “Observability group”. The group was divided into two teams:

  1. The In-Cluster Observability Team
  2. The Observability Platform team (aka RHOBS or Observatorium team)

You may encounter references to these two teams. In early 2021 we decided to combine the efforts of these teams more closely in order to avoid working in silos and ensure we have a well functioning end-to-end product experience across both projects. We are, however, still split into separate scrum teams for efficiency. We are now collectively known as the “Monitoring Team” and each team works together across the various domains.

We work on all important aspects of the upstream ecosystem and a seamless monitoring experience for Kubernetes using Prometheus. Part of that is integrating our open source efforts into the OpenShift Container Platform (OCP), the commercial Red Hat Kubernetes distribution.

People relevant to Prometheus who you should know about:

  • Julius Volz (@juliusv): Previously worked at SoundCloud, where he co-created Prometheus. Now working as an independent contractor and organizing PromCon (the Prometheus community conference). He also worked with weave.works on a prototype for remote/long-term time series storage and with InfluxDB on Flux PromQL support, and contributed the new React-based Prometheus UI. He also created a new company, PromLens, for a rich PromQL UI.
  • Bjoern Rabenstein (@beorn7): Worked at SoundCloud but now works at Grafana Labs. He’s active again upstream and is the maintainer of Pushgateway and client_golang (the Go Prometheus client library).
  • Frederic Branczyk (@brancz): Joined CoreOS in 2016 to work on Prometheus. A core team member since then and one of the minds behind our team’s vision. Left Red Hat in 2020 to start his new company around continuous profiling (Polar Signals). Still very active upstream.
  • Julien Pivotto (@roidelapluie): prometheus/prometheus maintainer. Very active in other upstream projects (Alertmanager, …).

Thanos

Thanos is a monitoring system that was created based on Prometheus principles. It is a distributed version of Prometheus where every piece of Prometheus, such as scraping, querying, storage, recording, alerting, and compaction, can be deployed as separate, horizontally scalable components. This allows more flexible deployments and capabilities beyond single clusters. Thanos also supports object storage as the main storage option, allowing cheap long-term retention for metrics. In the end, it exposes the same (yet extended) Prometheus APIs and uses gRPC to communicate between components.
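
Because the query API is Prometheus-compatible, existing Prometheus tooling works against Thanos unchanged. As a small sketch (the address below is an assumption, e.g. a locally port-forwarded Thanos Querier), the standard Prometheus Go client can be pointed at it directly:

  package main

  import (
      "context"
      "fmt"
      "log"
      "time"

      "github.com/prometheus/client_golang/api"
      v1 "github.com/prometheus/client_golang/api/prometheus/v1"
  )

  func main() {
      // Address of a Thanos Querier (assumed / port-forwarded for illustration);
      // it serves the same HTTP query API as a vanilla Prometheus server.
      client, err := api.NewClient(api.Config{Address: "http://localhost:10902"})
      if err != nil {
          log.Fatal(err)
      }

      ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
      defer cancel()

      // Run an instant PromQL query exactly as you would against Prometheus.
      result, warnings, err := v1.NewAPI(client).Query(ctx, "up", time.Now())
      if err != nil {
          log.Fatal(err)
      }
      if len(warnings) > 0 {
          log.Println("warnings:", warnings)
      }
      fmt.Println(result)
  }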

Thanos was created in 2017 because of the scalability limits of Prometheus. At that point a similar project, Cortex, was emerging too, but it was overly complex at the time. In November 2017, Fabian Reinartz (@fabxc, consulting for Improbable at that time) and Bartek Plotka (@bwplotka) teamed up to create Thanos based on the Prometheus storage format. Around February 2018 the project was shown at the Prometheus Meetup in London, and in Summer 2018 it was announced at PromCon 2018. In 2019, our team at Red Hat, led at that point by Frederic Branczyk (@brancz), contributed essential pieces allowing Thanos to receive remote-write (push model) Prometheus metrics. Since then, we have been able to leverage Thanos for Telemetry gathering and then for in-cluster monitoring, too.

When working with it you will most likely interact with Bartek Plotka (@bwplotka) and other team members.

These are the people you’ll be in most contact with when working upstream. If you run into any communication issues, please let us know as soon as possible.

Talks

Advocating for sane monitoring and alerting practices (especially focused on Kubernetes environments) and how Prometheus implements them is part of our team’s work. That can happen internally or on public channels. If you are comfortable giving talks on the topic or on some specific work we have done, let us know so we can plan ahead and find you a speaking opportunity at meetups or conferences. If you are not comfortable but want to break this barrier, let us know as well; we can help you get more comfortable with public speaking, step by step. If you want to submit a CFP for a talk, please add it to this spreadsheet and inform your manager.

First days (accounts & access)

  1. Follow up on administrative tasks.
  2. Understand the meetings the team attends:

Ask your manager to be added to the Observability Program calendar. Ensure you attend the following recurring meetings:

  • Team syncs
  • Sprint retro/planning
  • Sprint reviews
  • Weekly architecture call
  • 1on1 with manager
  • Weekly 1on1 with your mentor (mentors are tracked here)

First weeks

Set up your computer and development environment and do some research. Feel free to come back to these on an ongoing basis as needed. There is no need to complete them all at once.

General

  1. Review our product documentation (this is very important): Understanding the monitoring stack | Monitoring | OpenShift Container Platform 4.10
  2. Review our team’s process doc: Monitoring Team Process
  3. Review how others should formally submit requests to our team: Requests: Monitoring Team
  4. If you haven’t already, buy this book and make a plan to finish it over time (you can expense it): “Site Reliability Engineering: How Google Runs Production Systems”. An online version of the book can be found here: https://sre.google/books/.
  5. Ensure you attend a meeting with your team lead or architect for a general overview of our in-cluster OpenShift technology stack.
  6. Ensure you attend a meeting with your team lead or architect for a general overview of our hosted Observatorium/Telemetry stack.
  7. Bookmark this spreadsheet for reference of all OpenShift release dates. Alternatively, you can add the OpenShift Release Dates calendar.

Watch these talks

[optional] Additional information & Exploration

The team uses various tools; you should get familiar with them by reading through the documentation and trying them out:

Who’s Who?

  • For all the teams and people in OpenShift Engineering, see this Team Member Tracking spreadsheet. Bookmark this and refer to it as needed.
  • Schedule a meeting with your manager to go over the team organizational structure

First project

Your first project should ideally:

  • Provide an interesting set of related tasks that make you familiar with various aspects of internal and external parts of Prometheus and OpenShift.
  • Encourage discussion with other upstream maintainers and/or people at Red Hat.
  • Be aligned with the area of Prometheus and more generally the monitoring stack you want to work on.
  • Have a visible impact for you and others

Here’s a list of potential starter projects, talk to us to discuss them in more detail and figure out which one suits you.

(If you are not a new hire, please add/remove projects as appropriate)

  • Setup Prometheus, Alertmanager, and node-exporter
    • As binaries on your machine (Bonus: Compile them yourself)
    • As containers
  • Setup Prometheus as a StatefulSet on vanilla Kubernetes (minikube or your tool of choice)
  • Try the Prometheus Operator on vanilla Kubernetes (minikube or your tool of choice)
  • Try kube-prometheus on vanilla Kubernetes
  • Try the cluster-monitoring-operator on Openshift (easiest is through the cluster-bot on slack)

During the project, keep the feedback cycles with other people as long or as short as you feel confident with. If you are not sure, ask! Try to briefly check in with the team regularly.

Try to submit any coding work in small batches. This makes it easier for us to review and realign quickly.

Everyone gets stuck sometimes. There are various smaller issues around the Prometheus and Alertmanager upstream repositories and the different monitoring operators. If you need a bit of distance, tackle one of them for a while and then get back to your original problem. This will also help you to get a better overview. If you are still stuck, just ask someone and we’ll discuss things together.

First Months

  • If you will be starting out working more closely with the in-cluster stack, be sure to review this document as well: In-Cluster Monitoring Onboarding. Otherwise, if you are starting out more focused on the Observatorium service, review this doc: Observatorium Platform Onboarding
  • Try to get something (anything) merged into one of our repositories
  • Begin your 2nd project
  • Create a PR for the master onboarding doc (this one) with improvements you think would help others

Second project

After your starter project is done, we’ll discuss how it went and what your future projects will be. By then you’ll hopefully have a good overview which areas you are interested in and what their priority is. Discuss with your team lead or manager what your next project will be.

Glossary

Our team’s glossary can be found here.