Team member onboarding

Welcome to the Monitoring Team! We created this document to help guide you through your onboarding.

Please fork this repo and propose changes to the content as a pull request if something is inaccurate, outdated, or unclear. Use your fork to track the progress of your onboarding tasks, e.g. by keeping a copy in your fork with comments about your completion status. Please read the doc thoroughly and check off each item as you complete it. We continuously want to improve our onboarding experience, and your feedback helps future team members.

This document contains some links to Red Hat internal documents, and you won’t be able to access them without a Red Hat associate login. We still try to keep as much information as possible in the open.

Team Background

Team Monitoring mainly focuses on Prometheus, Thanos, Observatorium, and their integration into Kubernetes and OpenShift. We are also responsible for the hosted multi-tenant Observability service which powers such services as OpenShift Telemetry and OSD metrics.

Prometheus is a monitoring project initiated at SoundCloud in 2012. It became public and widely usable in early 2015. Since then, it has found adoption across many industries. In early 2016, development diversified away from SoundCloud as CoreOS hired one of the core developers. Prometheus is not a single application but an entire ecosystem of well-separated components, which work well together but can be used individually.

CoreOS was acquired by Red Hat in early 2018. The CoreOS monitoring team became the Red Hat Monitoring Team, which has evolved into the “Observability group”. The group was divided into two teams:

The In-Cluster Observability Team
The Observability Platform team (aka RHOBS or Observatorium team)

You may encounter references to these two teams. In early 2021 we decided to combine the efforts of these teams more closely in order to avoid working in silos and to ensure we have a well-functioning end-to-end product experience across both projects. We are, however, still split into separate scrum teams for efficiency. We are now collectively known as the “Monitoring Team”, and the teams work together across the various domains.

We work on all important aspects of the upstream ecosystem and on a seamless monitoring experience for Kubernetes using Prometheus. Part of that is integrating our open source efforts into the OpenShift Container Platform (OCP), the commercial Red Hat Kubernetes distribution.

People relevant to Prometheus who you should know about:

Julius Volz (@juliusv): Previously worked at SoundCloud, where he developed Prometheus. Now working as an independent contractor and organizing PromCon (the Prometheus community conference). He also worked with weave.works on a prototype for remote/long-term time series storage, and with InfluxDB on Flux PromQL support. Contributed the new Prometheus UI in React. He also founded PromLabs, the company behind PromLens, a rich PromQL UI.
Bjoern Rabenstein (@beorn7): Worked at SoundCloud but now works at Grafana. He’s active again upstream and is the maintainer of pushgateway and client_golang (the Prometheus Go client library).
Frederic Branczyk (@brancz): Joined CoreOS in 2016 to work on Prometheus. Core Team member since then and one of the minds behind our team’s vision. Left Red Hat in 2020 to start his new company around continuous profiling (Polar Signals). Still very active upstream.
Julien Pivotto (@roidelapluie): prometheus/prometheus maintainer. Very active in other upstream projects (Alertmanager, …).

Thanos

Thanos is a monitoring system built on Prometheus principles. It is a distributed version of Prometheus in which every piece of Prometheus, such as scraping, querying, storage, recording, alerting, and compaction, can be deployed as a separate, horizontally scalable component. This allows more flexible deployments and capabilities beyond single clusters. Thanos also supports object storage as the main storage option, allowing cheap long-term retention for metrics. In the end, it exposes the same (yet extended) Prometheus APIs and uses gRPC to communicate between components.
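
Because Thanos exposes the same HTTP query API as Prometheus, a client written against one can usually be pointed at the other. The sketch below is a minimal illustration in Python, not part of the team's tooling: it builds an instant-query URL for the `/api/v1/query` endpoint and parses a `vector` response. The base URL, the function names `instant_query_url` and `parse_vector`, and the sample payload are assumptions made for this example, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical sample body in the shape the query API returns for an
# instant query; the label values are made up for illustration.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
        "value": [1435781451.781, "1"]
      }
    ]
  }
}
"""


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series (as a sorted tuple of label pairs) to its float value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    samples = {}
    for series in data["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        samples[labels] = float(value)
    return samples
```

Pointing `base_url` at a Thanos Querier instead of a Prometheus server should require no other change for this endpoint, which is the practical meaning of exposing the same API.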

Thanos was created in 2017 because of the scalability limits of Prometheus. At that point a similar project, Cortex, was emerging too, but it was overly complex at that time. In November 2017, Fabian Reinartz (@fabxc, consulting for Improbable at that time) and Bartek Plotka (@bwplotka) teamed up to create Thanos based on the Prometheus storage format. Around February 2018 the project was shown at the Prometheus Meetup in London, and in summer 2018 it was announced at PromCon 2018. In 2019, our team at Red Hat, led at that point by Frederic Branczyk (@brancz), contributed essential pieces allowing Thanos to receive remote-write (push model) for Prometheus metrics. Since then, we have been able to use Thanos for Telemetry gathering and later for in-cluster monitoring, too.

When working with Thanos you will most likely interact with Bartek Plotka (@bwplotka) and other team members.

These are the people you’ll be in most contact with when working upstream. If you run into any communication issues, please let us know as soon as possible.

Talks

Advocating for sane monitoring and alerting practices (especially in Kubernetes environments) and showing how Prometheus implements them is part of our team’s work. That can happen internally or on public channels. If you are comfortable giving talks on the topic or on some specific work we have done, let us know so we can plan ahead to find you a speaking opportunity at meetups or conferences. If you are not comfortable but want to break this barrier, let us know as well; we can help you get more comfortable with public speaking step by step. If you want to submit a CFP for a talk, please add it to this spreadsheet and inform your manager.

First days (accounts & access)

Follow up on administrative tasks.
Understand the meetings the team attends:

Ask your manager to be added to the Observability Program calendar. Ensure you attend the following recurring meetings:

Team syncs
Sprint retro/planning
Sprint reviews
Weekly architecture call
1on1 with manager
Weekly 1on1 with your mentor (mentors are tracked here)

First weeks

Set up your computer and development environment and do some research. Feel free to come back to these on an ongoing basis as needed. There is no need to complete them all at once.

General

Review our product documentation (this is very important): Understanding the monitoring stack | Monitoring | OpenShift Container Platform 4.10
Review our team’s process doc: Monitoring Team Process
Review how others should formally submit requests to our team: Requests: Monitoring Team
If you haven’t already, buy this book and make a plan to finish it over time (you can expense it): “Site Reliability Engineering: How Google Runs Production Systems”. An online version of the book can be found here: https://sre.google/books/.
Ensure you attend a meeting with your team lead or architect for a general overview of our in-cluster OpenShift technology stack.
Ensure you attend a meeting with your team lead or architect for a general overview of our hosted Observatorium/Telemetry stack.
Bookmark this spreadsheet for reference of all OpenShift release dates. Alternatively, you can add the OpenShift Release Dates calendar.

Watch these talks

Prometheus introduction by Julius Volz (project’s cofounder) @ KubeCon EU 2020
The Zen of Prometheus, by Kemal (ex-Observability Platform team) @ PromCon 2020
The RED Method: How To Instrument Your Services by Tom Wilkie @ GrafanaCon EU 2018
Thanos: Prometheus at Scale by Lucas and Bartek (Observability Platform team) @ DevConf 2020
Instrumenting Applications and Alerting with Prometheus, from Simon (Cluster Observability team) @ OSSEU 2019
PromQL for mere mortals by Ian Billett (Observability Platform team) @ PromCon 2019
Life of an alert (Alertmanager), @ PromCon 2018
Best practices and pitfalls @ PromCon 2017
Deep Dive: Kubernetes Metric APIs using Prometheus
Monitoring Kubernetes with prometheus-operator by Lili @ Cloud Native Computing Berlin meetup 2021
Using Jsonnet to Package Together Dashboards, Alerts and Exporters by Tom Wilkie (Grafana Labs) @ KubeCon Europe 2018
(Internal) Observatorium Deep Dive, January 2021 by Kemal
(Internal) How to Get Reviewers to Block your Changes, March 2022 by Assaf

[optional] Additional information & Exploration

https://www.linkedin.com/learning/
- Red Hat has a corporate subscription to LinkedIn Learning that has great introductory courses to many topics relevant to our team
Prometheus Monitoring channel on YouTube
PromLabs PromQL cheat sheet
Prometheus-example-app
Kubernetes-sample-controller

The team uses various tools; you should get familiar with them by reading through their documentation and trying them out.

Who’s Who?

For all the teams and people in OpenShift Engineering, see this Team Member Tracking spreadsheet. Bookmark this and refer to it as needed.
Schedule a meeting with your manager to go over the team organizational structure

First project

Your first project should ideally:

Provide an interesting set of related tasks that make you familiar with various aspects of internal and external parts of Prometheus and OpenShift.
Encourage discussion with other upstream maintainers and/or people at Red Hat.
Be aligned with the area of Prometheus and more generally the monitoring stack you want to work on.
Have a visible impact for you and others.

Here’s a list of potential starter projects. Talk to us to discuss them in more detail and figure out which one suits you.

(If you are not a new hire, please add/remove projects as appropriate)

Set up Prometheus, Alertmanager, and node-exporter
- As binaries on your machine (Bonus: Compile them yourself)
- As containers
Set up Prometheus as a StatefulSet on vanilla Kubernetes (minikube or your tool of choice)
Try the Prometheus Operator on vanilla Kubernetes (minikube or your tool of choice)
Try kube-prometheus on vanilla Kubernetes
Try the cluster-monitoring-operator on OpenShift (easiest is through the cluster-bot on Slack)
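
While working through the setups above, it can help to have a tiny scrape target of your own next to node-exporter. The following is a minimal sketch using only the Python standard library, not an official example app: it counts requests and serves them on `/metrics` in the Prometheus text exposition format. The metric name `demo_http_requests_total`, the `handler` label, the helper names, and the port are all assumptions made for this illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory state: requests seen per handler name.
# Only two fixed label values are used so cardinality stays bounded.
REQUESTS = {"root": 0, "other": 0}


def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(counts: dict) -> str:
    """Render the counts as a single counter family in the text format."""
    lines = [
        "# HELP demo_http_requests_total Requests handled, by handler.",
        "# TYPE demo_http_requests_total counter",
    ]
    for name, count in sorted(counts.items()):
        lines.append(f'demo_http_requests_total{{handler="{escape_label(name)}"}} {count}')
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Scrapes themselves are not counted in this sketch.
            body = render_metrics(REQUESTS).encode("utf-8")
            content_type = "text/plain; version=0.0.4; charset=utf-8"
        else:
            key = "root" if self.path == "/" else "other"
            REQUESTS[key] += 1
            body = b"hello\n"
            content_type = "text/plain; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve until interrupted; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and adding `127.0.0.1:8000` as a scrape target in your local Prometheus configuration should make the counter show up in queries. For real services, prefer an official client library over hand-written exposition code.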

During the project, keep the feedback cycles with other people as long or as short as you feel comfortable with. If you are not sure, ask! Try to check in briefly with the team on a regular basis.

Try to submit any coding work in small batches. This makes it easier for us to review and realign quickly.

Everyone gets stuck sometimes. There are various smaller issues around the Prometheus and Alertmanager upstream repositories and the different monitoring operators. If you need a bit of distance, tackle one of them for a while and then get back to your original problem. This will also help you to get a better overview. If you are still stuck, just ask someone and we’ll discuss things together.

First Months

If you will be starting out working more closely with the in-cluster stack, be sure to review this document as well: In-Cluster Monitoring Onboarding. Otherwise, if you are starting out more focused on the Observatorium service, review this doc: Observatorium Platform Onboarding
Try to get something (anything) merged into one of our repositories
Begin your 2nd project
Create a PR for the master onboarding doc (this one) with improvements you think would help others

Second project

After your starter project is done, we’ll discuss how it went and what your future projects will be. By then you’ll hopefully have a good overview of which areas you are interested in and what their priority is. Discuss with your team lead or manager what your next project will be.

Glossary

Our team’s glossary can be found here.

Last modified January 17, 2024