Team
A place for documentation and processes for the Red Hat Monitoring Group teams.
We are the team responsible (mainly) for:
This document explains a few processes related to operating our production software together with the AppSRE team. All of these operations can be summarised as the Site Reliability Engineering part of our work.
The main goal of our SRE work within this team is to deploy and operate software that meets the functional (API) and performance (Telemeter SLO, MST SLO) requirements of our customers.
Currently, we offer an internal service called RHOBS that is used across the company.
In order to maintain our software in production reliably, we need to be able to test, release and deploy our changes rapidly.
We can divide the things we change into a few categories. Let’s elaborate the processes for each category:
A library is a project that is not deployed directly, but rather is a dependency of the micro-service applications we deploy. For example, https://github.com/prometheus/client_golang or https://github.com/grpc-ecosystem/go-grpc-middleware
Testing:
Releasing:
Deploying:
Software for (micro) services usually lives in many separate open source repositories on GitHub, e.g. https://github.com/thanos-io/thanos, https://github.com/observatorium/api.
Testing:
Releasing:
Deploying:
All of our configuration is rooted in https://github.com/rhobs/configuration: configuration templates written in jsonnet. These can then be overridden by parameters defined in app-interface saas files.
Testing:
Releasing:
Deploying:
If only the saas.yaml file was changed, a /lgtm from the Observability Platform team is enough for the PR to get merged automatically. To promote a new version, update saas.yaml for the desired environment, e.g. to deploy to production. NOTE: Don’t change both production and staging in the same MR. NOTE: Deploy to production only changes that were previously in staging (automation for this is TBD).
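The promotion step above can be sketched as a saas file change. This is a hypothetical illustration: the resource name and namespace paths below are made up, and the real saas files in app-interface may be shaped differently.

```yaml
# Illustrative sketch only; field names and paths are not copied
# from the real app-interface saas files.
resourceTemplates:
- name: observatorium-api
  url: https://github.com/rhobs/configuration
  targets:
  - namespace:
      $ref: /services/rhobs/namespaces/rhobs-staging.yml     # hypothetical path
    ref: main        # staging typically tracks a branch
  - namespace:
      $ref: /services/rhobs/namespaces/rhobs-production.yml  # hypothetical path
    ref: 0a1b2c3d    # production is pinned; bump this SHA to deploy
```

Promoting to production then means changing only the production target’s `ref` to a commit that already ran in staging.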
You can see the version change announced in the #team-monitoring-info Slack channel on CoreOS Slack.

Grafana dashboards are defined here: https://github.com/rhobs/configuration/tree/main/observability/dashboards
Alerts and rules are defined here: https://github.com/rhobs/configuration/blob/main/observability/prometheusrules.jsonnet
Testing:
Releasing:
Deploying:
Use synchronize.sh to create an MR/PR against app-interface. This will copy all generated YAML resources to the proper places.

Currently, RHOBS services are supported 24h/7d by two teams:

AppSRE team for infrastructure and “generalist” support. They are our first incident response point. They try to support our stack as far as runbooks and general knowledge allow.
Observability Platform team (Dev on-call) for incidents impacting SLA outside of AppSRE expertise (bug fixes, more complex troubleshooting). We are notified when AppSRE needs us.

This is the process we as the Observability Team try to follow during incident response.
An incident occurs when any of our services violates an SLO we set with our stakeholders. Refer to the Telemeter SLO and MST SLO for details on SLAs.
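To make an SLO violation concrete, here is a minimal sketch of how an availability target translates into an error budget. The 99.5% target and 28-day window are illustrative numbers, not our actual Telemeter or MST SLO values.

```python
# Illustrative only: 99.5% and 28 days are example numbers,
# not the real Telemeter/MST SLO targets.

def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.5% SLO over 28 days leaves roughly 201.6 minutes of error budget;
# an incident is any event that burns through this budget faster than allowed.
print(round(error_budget_minutes(0.995), 1))
```

Once the budget for the window is exhausted, any further unavailability is an SLO violation and triggers this incident process.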
NOTE: The following procedure applies to both production and staging. Many teams, e.g. SubWatch, depend on a working staging environment, so follow a similar process as for production. The only difference is that we do not need to mitigate or fix staging issues outside of office hours.
Potential Trigger Points:
If you are not on-call, notify the Observability Platform on-call engineer. If you are on-call, if the on-call engineer is not present, or if you agreed that you will handle this incident, go to step 2.
Straight away, create a JIRA ticket for the potential incident. Don’t think twice: it’s easy to create and essential for tracking the incident later on. Fill in the following fields:
Title: Symptom you see.
Type: Bug
Priority: Try to assess how important it is. If it impacts production, it’s a “Blocker”.
Component: RHOBS
(Important) Labels: incident, no-qe
Description: Mention how you were notified (ideally with a link to the alert/Slack thread). Mention what you know so far.
See example incident tickets here
If AppSRE is not yet aware, drop a link to the created incident ticket in the #sd-app-sre channel and notify @app-sre-primary and @observatorium-oncall. Don’t repeat yourself; ask everyone to follow the ticket comments.
Investigate possible mitigation. Ensure the problem is mitigated before focusing on root cause analysis.
Confirm with AppSRE that the incident is mitigated. Investigate the root cause. If mitigation is applied and the root cause is known, declare the incident over.
Note potential long-term fixes and ideas. Close the incident JIRA ticket.
Idea: If you have time, before sharing the Post Mortem / RCA, perform a “Wheel of Misfortune”. Select an on-call engineer who was not participating in the incident and simulate the error by triggering the root cause in a safe environment. Then meet together with the team and let that engineer coordinate the simulated incident. Help along the way to share knowledge and insights. This is the best way to onboard people to production topics.
NOTE: The freely available SRE book is a good source of general patterns for efficient incident management. Recommended read!
Welcome to the Monitoring Team! We created this document to help guide you through your onboarding.
Please fork this repo and propose changes to the content as a pull request if something is not accurate, outdated or unclear. Use your fork to track the progress of your onboarding tasks, i.e. by keeping a copy in your fork with comments about your completion status. Please read the doc thoroughly and check off each item as you complete it. We consistently want to improve our onboarding experience and your feedback helps future team members.
This document contains some links to Red Hat internal documents, and you won’t be able to access them without a Red Hat associate login. We nevertheless try to keep as much information as possible in the open.
Team Monitoring mainly focuses on Prometheus, Thanos, Observatorium, and their integration into Kubernetes and OpenShift. We are also responsible for the hosted multi-tenant Observability service which powers such services as OpenShift Telemetry and OSD metrics.
Prometheus is a monitoring project initiated at SoundCloud in 2012. It became public and widely usable in early 2015. Since then, it has found adoption across many industries. In early 2016 development diversified away from SoundCloud as CoreOS hired one of the core developers. Prometheus is not a single application but an entire ecosystem of well-separated components, which work well together but can be used individually.
CoreOS was acquired by Red Hat in early 2018. The CoreOS monitoring team became the Red Hat Monitoring Team, which evolved into the “Observability group”. The group was divided into two teams:
You may encounter references to these two teams. In early 2021 we decided to combine the efforts of these teams more closely in order to avoid working in silos and ensure we have a well functioning end-to-end product experience across both projects. We are, however, still split into separate scrum teams for efficiency. We are now collectively known as the “Monitoring Team” and each team works together across the various domains.
We work on all important aspects of the upstream ecosystem and a seamless monitoring experience for Kubernetes using Prometheus. Part of that is integrating our open source efforts into the OpenShift Container Platform (OCP), the commercial Red Hat Kubernetes distribution.
Thanos is a monitoring system, which was created based on Prometheus principles. It is a distributed version of Prometheus where every piece of Prometheus like scraping, querying, storage, recording, alerting, and compaction can be deployed as separate horizontally scalable components. This allows more flexible deployments and capabilities beyond single clusters. Thanos also supports object storage as the main storage option, allowing cheap long term retention for metrics. At the end it exposes the same (yet extended) Prometheus APIs and uses gRPC to communicate between components.
Thanos was created in 2017 because of the scalability limits of Prometheus. At that point a similar project, Cortex, was emerging too, but it was overly complex at the time. In November 2017, Fabian Reinartz (@fabxc, consulting for Improbable at that time) and Bartek Plotka (@bwplotka) teamed up to create Thanos based on the Prometheus storage format. Around February 2018 the project was shown at a Prometheus Meetup in London, and in summer 2018 it was announced at PromCon 2018. In 2019, our team at Red Hat, led at that point by Frederic Branczyk (@brancz), contributed essential pieces allowing Thanos to receive remote-write (push model) Prometheus metrics. Since then, we have been able to leverage Thanos for Telemetry gathering and later for in-cluster monitoring, too.
When working with it you will most likely interact with Bartek Plotka (@bwplotka) and other team members.
These are the people you’ll be in most contact with when working upstream. If you run into any communication issues, please let us know as soon as possible.
Advocating for sane monitoring and alerting practices (especially focused on Kubernetes environments) and how Prometheus implements them is part of our team’s work. That can happen internally or on public channels. If you are comfortable giving talks on the topic or on some specific work we have done, let us know so we can plan ahead to find you a speaking opportunity at meetups or conferences. If you are not comfortable but want to break this barrier, let us know as well; we can help you get more comfortable with public speaking step by step. If you want to submit a CFP for a talk, please add it to this spreadsheet and inform your manager.
Ask your manager to be added to the Observability Program calendar. Ensure you attend the following recurring meetings:
Set up your computer and development environment and do some research. Feel free to come back to these on an ongoing basis as needed. There is no need to complete them all at once.
The team uses various tools; you should get familiar with them by reading through the documentation and trying them out:
Your first project should ideally:
Here’s a list of potential starter projects; talk to us to discuss them in more detail and figure out which one suits you.
(If you are not a new hire, please add/remove projects as appropriate)
During the project, keep the feedback cycles with other people as long or as short as you feel confident with. If you are not sure, ask! Try to check in briefly with the team regularly.
Try to submit any coding work in small batches. This makes it easier for us to review and realign quickly.
Everyone gets stuck sometimes. There are various smaller issues around the Prometheus and Alertmanager upstream repositories and the different monitoring operators. If you need a bit of distance, tackle one of them for a while and then get back to your original problem. This will also help you to get a better overview. If you are still stuck, just ask someone and we’ll discuss things together.
After your starter project is done, we’ll discuss how it went and what your future projects will be. By then you’ll hopefully have a good overview of which areas you are interested in and what their priority is. Discuss with your team lead or manager what your next project will be.
Our team’s glossary can be found here.