Observability Platform
We are the team responsible (mainly) for:
- RHOBS, including:
  - Thanos
  - Observatorium
This document explains a few processes related to operating our production software together with the AppSRE team. All of these operations can be summarised as the Site Reliability Engineering part of our work.
The main goal of our SRE work within this team is to deploy and operate software that meets the functional (API) and performance (Telemeter SLO, MST SLO) requirements of our customers.
Currently, we offer an internal service called RHOBS that is used across the company.
In order to maintain our software in production reliably, we need to be able to test, release and deploy our changes rapidly.
The things we change fall into a few categories. Let's walk through the processes for each category:
A library is a project that is not deployed directly; rather, it is a dependency of a micro-service application we deploy, for example https://github.com/prometheus/client_golang or https://github.com/grpc-ecosystem/go-grpc-middleware.
Testing:
Releasing:
Deploying:
Software for (micro) services usually lives in many separate open source repositories on GitHub, e.g. https://github.com/thanos-io/thanos or https://github.com/observatorium/api.
Testing:
Releasing:
Deploying:
All of our configuration is rooted in https://github.com/rhobs/configuration: configuration templates written in jsonnet. These can then be overridden by parameters defined in app-interface saas files.
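For illustration, here is a minimal, hypothetical sketch of how a saas file target can override a template parameter; every name, path and value below is invented, and the authoritative schema lives in app-interface:

```yaml
# Hypothetical saas file fragment (names, paths and parameters are
# illustrative only; see app-interface for the real schema and values).
resourceTemplates:
  - name: rhobs
    url: https://github.com/rhobs/configuration
    path: /resources/services/telemeter-template.yaml          # hypothetical
    targets:
      - namespace:
          $ref: /services/rhobs/namespaces/rhobs-stage.yml     # hypothetical
        ref: 7f3a2c9d...  # git SHA of rhobs/configuration to deploy
        parameters:
          THANOS_QUERIER_REPLICAS: "3"  # overrides the jsonnet template default
```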
Testing:
Releasing:
Deploying:
Once the saas.yaml file is changed, a /lgtm from the Observability Platform team is enough for the PR to get merged automatically. Update saas.yaml to promote a change to a given environment, e.g. to production.
NOTE: Don't change both production and staging in the same MR.
NOTE: Deploy to production only changes that were previously in staging (automation for this TBD).
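As a sketch of what such a promotion looks like (refs and target names below are hypothetical; the real layout is defined by the app-interface schema), each environment is a separate target pinned to a git SHA of rhobs/configuration, and promoting means bumping that SHA for one environment at a time:

```yaml
# Hypothetical promotion sketch: staging already runs the new SHA,
# and this MR bumps production only (never both in one MR).
targets:
  - namespace:
      $ref: /services/rhobs/namespaces/rhobs-stage.yml        # hypothetical
    ref: 9b1c4e7a...  # bumped earlier, in a staging-only MR
  - namespace:
      $ref: /services/rhobs/namespaces/rhobs-production.yml   # hypothetical
    ref: 9b1c4e7a...  # bumped in this MR, after the change proved out in staging
```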
You can see the version change in the #team-monitoring-info Slack channel on CoreOS Slack.
Grafana dashboards are defined here: https://github.com/rhobs/configuration/tree/main/observability/dashboards. Alerts and rules are defined here: https://github.com/rhobs/configuration/blob/main/observability/prometheusrules.jsonnet.
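To give a sense of what those jsonnet definitions render to, here is a trimmed, hypothetical PrometheusRule in the shape that prometheusrules.jsonnet generates; the alert name, job label and threshold are invented for illustration:

```yaml
# Illustrative output only; the real rules are generated from
# observability/prometheusrules.jsonnet in rhobs/configuration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhobs-telemeter-slos  # hypothetical name
spec:
  groups:
    - name: telemeter-write-availability
      rules:
        - alert: TelemeterWriteErrorBudgetBurn  # hypothetical alert
          expr: |
            sum(rate(http_requests_total{job="telemeter-server", code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="telemeter-server"}[1h])) > 0.05
          for: 15m
          labels:
            severity: critical
```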
Testing:
Releasing:
Deploying:
Use the synchronize.sh script to create an MR/PR against app-interface. This will copy all generated YAML resources to the proper places.

Currently, RHOBS services are supported 24h/7d by two teams:
- AppSRE team for infrastructure and "generalist" support. They are our first incident response point. They try to support our stack as far as runbooks and general knowledge allow.
- Observability Platform team (Dev on-call) for incidents impacting SLA outside of AppSRE expertise (bug fixes, more complex troubleshooting). We are notified when AppSRE needs us.

This is the process we as the Observability Team try to follow during incident response.
An incident occurs when any of our services violates an SLO we set with our stakeholders. Refer to the Telemeter SLO and MST SLO for details on the SLAs.
NOTE: The following procedure applies to both Production and Staging. Many teams, e.g. SubWatch, depend on a working staging environment, so follow a similar process as for production. The only difference is that we do not need to mitigate or fix staging issues outside of office hours.
Potential Trigger Points:
1. If you are not on-call, notify the Observability Platform on-call engineer. If you are on-call, if the on-call engineer is not present, or if you have agreed to handle this incident, go to step 2.
2. Straight away, create a JIRA ticket for the potential incident. Don't think twice; it's easy to create and essential for tracking the incident later on. Fill in the following fields:
   - Title: The symptom you see.
   - Type: Bug
   - Priority: Try to assess how important it is. If it impacts production, it's a "Blocker".
   - Component: RHOBS
   - (Important) Labels: incident, no-qe
   - Description: Mention how you were notified (ideally with a link to the alert/Slack thread). Mention what you know so far.
   See example incident tickets here.
3. If AppSRE is not yet aware, drop a link to the created incident ticket in the #sd-app-sre channel and notify @app-sre-primary and @observatorium-oncall. Don't repeat yourself; ask everyone to follow the ticket comments.
4. Investigate possible mitigations. Ensure the problem is mitigated before focusing on root cause analysis.
5. Confirm with AppSRE that the incident is mitigated. Investigate the root cause. Once mitigation is applied and the root cause is known, declare the incident over.
6. Note potential long-term fixes and ideas. Close the incident JIRA ticket.
Idea: If you have time, before sharing the Post Mortem / RCA, perform a "Wheel of Misfortune": select an on-call engineer who was not participating in the incident and simulate the error by triggering the root cause in a safe environment. Then meet together with the team and let the engineer coordinate the simulated incident. Help along the way to share knowledge and insights. This is the best way to onboard people to production topics.
NOTE: The freely available SRE book is a good source of general patterns for efficient incident management. Recommended read!