Rules and alerting capabilities

9 minute read

Rules and alerting capabilities

Overview

As explained in more details here, RHOBS features a deployment of Observatorium.

Through the Observatorium API, tenants are able to create, read, update and delete their own Prometheus recording and alerting rules via the Observatorium Rules API.

In addition to this, each of the RHOBS instances has an Alertmanager deployed, which makes possible for tenants to configure custom alert routing configuration to route firing alerts to their specified receivers.

Goal

This page aims to provide a simple tutorial of how a tenant can create an alerting rule via the Observatorium Rules API and configure Alertmanager properly to get alerted via a desired receiver.

For this tutorial we will be using the rhobs tenant in the MST stage environment. URLs may change slightly in case another tenant is used.

Authenticate against the Observatorium API

To have access to the Observatorium API, the tenant making the requests needs to be correctly authenticated. For this you can install obsctl which is a dedicated CLI tool to interact with Observatorium instances as tenants. It uses the provided credentials to fetch OAuth2 access tokens via OIDC and saves both the token and credentials for multiple tenants and APIs, locally.

You can get up and running quickly with the following steps,

  • Make sure you have Go 1.17+ on your system and install obsctl,

    go install github.com/observatorium/obsctl@latest
    
  • Add your desired Observatorium API,

    obsctl context api add --name='staging-api' --url='https://observatorium-mst.api.stage.openshift.com'
    
  • Save credentials for a tenant under the API you just added (you will need your own OIDC Client ID, Client Secret & TENANT),

    obsctl login --api='staging-api' --oidc.audience='observatorium' --oidc.client-id='<CLIENT_ID>' --oidc.client-secret='<SECRET>' --oidc.issuer-url='https://sso.redhat.com/auth/realms/redhat-external' --tenant='<TENANT>'
    
  • Verify that you are using the correct API + tenant combination or “context” (in this case it would be staging-api/rhobs),

    obsctl context current
    

For this tutorial we will be using https://observatorium-mst.api.stage.openshift.com as our target Observatorium API.

Now that we have set up obsctl, let’s start creating an alerting rule.

Create an alerting rule

A tenant can create and list recording and alerting rules via the Observatorium Rules API. For this tutorial we will be creating an alerting rule, to also make use of the alerting capabilities that are available in Observatorium.

If you want to get more details about how to interact with the Rules API and its different endpoints, refer to the upstream documentation or OpenAPI spec.

In your local environment, create a Prometheus alerting rule YAML file with the definition of the alert you want to add. Note that the file should be defined following the Observatorium OpenAPI specification. The file should be in Prometheus recording and/or alerting rules format.

For example, you can create a file named alerting-rule.yaml:

groups:
- interval: 30s
  name: test-firing-alert
  rules:
  - alert: TestFiringAlert
    annotations:
      dashboard: https://grafana.stage.devshift.net/d/Tg-mH0rizaSJDKSADX/api?orgId=1&refresh=1m
      description: Test firing alert
      message: Message of firing alert here
      runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md
      summary: Summary of firing alert here
    expr: vector(1)
    for: 1m
    labels:
      severity: page

Now to set this rule file, you can use obsctl,

obsctl metrics set --rule.file=/path/to/alerting-rule.yaml

obsctl uses the credentials you saved earlier, to make an authenticated application/yaml PUT request to the api/v1/rules/raw endpoint of the Observatorium API, which creates your alerting rule.

obsctl should print out the response, which, if successful, would be: successfully updated rules file.

Besides checking this response, you can also list or read the configured rules for your tenant by,

obsctl metrics get rules.raw

This would make a GET request to api/v1/rules/raw endpoint and return the rules you configured in YAML form. This endpoint will immediately reflect any newly set rules.

Note that in the response a tenant_id label for the particular tenant was added automatically. Since Observatorium API is tenant-aware, this extra validation step is also performed. Also, in the case of rules expressions, the tenant_id labels are injected into the PromQL query, which ensures that only data from a specific tenant is selected during evaluation.

You can also check your rule’s configuration, health, and resultant alerts by,

obsctl metrics get rules

This would make a GET request to api/v1/rules endpoint, and return rules you configured in Prometheus HTTP API format JSON response. You can read more about checking rule state here.

Note that this endpoint does not reflect newly set rules immediately and might take up to a minute to sync.

How to update and delete an alerting rule

As mentioned in the upstream docs that each time a PUT request is made to the /api/v1/rules/raw endpoint, the rules contained in the request will overwrite all the other rules for that tenant. Thus, each time you use obsctl metrics set --rule.file=<file> it will overwrite all other rules for a tenant with the new rule file.

Make sure to grab your existing rules YAML file and append any new rules or groups you want to create, to this file.

Using the example above, in case you want to create a second alerting rule, a new alert rule should be added to the file,

groups:
- interval: 30s
  name: test-firing-alert
  rules:
  - alert: TestFiringAlert
    annotations:
      dashboard: https://grafana.stage.devshift.net/d/Tg-mH0rizaSJDKSADX/api?orgId=1&refresh=1m
      description: Test firing alert!!
      message: Message of firing alert here
      runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md
      summary: Summary of firing alert here
    expr: vector(1)
    for: 1m
    labels:
      severity: page
- interval: 30s
  name: test-new-firing-alert
  rules:
  - alert: TestNewFiringAlert
    annotations:
      dashboard: https://grafana.stage.devshift.net/d/Tg-mH0rizaSJDKSADX/api?orgId=1&refresh=1m
      description: Test new firing alert!!
      message: Message of new firing alert here
      runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md
      summary: Summary of new firing alert here
    expr: vector(1)
    for: 1m
    labels:
      severity: page

And then this new file can be set via obsctl metrics set --rule.file=/path/to/alerting-rule.yaml to update your rules configuration.

Similarly, if you want to delete a rule, you can remove that from your existing rule file, before setting it with obsctl.

If you want to delete all rule(s) for a tenant, you can run obsctl metrics set --rule.file= with an empty file.

Sync rules from your cluster

Alternatively, you can choose to sync rules from your cluster to Observatorium Rules API via prometheus-operator’s PrometheusRule CRD.

Create PrometheusRule objects in your cluster containing your desired rules (you can even create multiple). Ensure that you have a tenant label specifying your tenant in metadata.labels like below,

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    tenant: rhobs
  name: obsctl-reloader-example
spec:
  groups:
  - interval: 30s
    name: test-firing-alert
    rules:
    - alert: TestFiringAlert
      annotations:
        dashboard: https://grafana.stage.devshift.net/d/Tg-mH0rizaSJDKSADX/api?orgId=1&refresh=1m
        description: Test firing alert!!
        message: Message of firing alert here
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md
        summary: Summary of firing alert here
      expr: vector(1)
      for: 1m
      labels:
        severity: page
  - interval: 30s
    name: test-new-firing-alert
    rules:
    - alert: TestNewFiringAlert
      annotations:
        dashboard: https://grafana.stage.devshift.net/d/Tg-mH0rizaSJDKSADX/api?orgId=1&refresh=1m
        description: Test new firing alert!!
        message: Message of new firing alert here
        runbook: https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md
        summary: Summary of new firing alert here
      expr: vector(1)
      for: 1m
      labels:
        severity: page

Once the rules are present in a namespace, you can run a obsctl-reloader deployment to sync these rules to Observatorium. Images can be found at https://quay.io/repository/app-sre/obsctl-reloader.

You can create a deployment using example OpenShift template with correct K8s RBAC. Ensure that you specify your OIDC credentials, tenant name and cluster namespace containing PrometheusRules, as environment variables,

...
  spec:
    replicas: 1
    selector:
      matchLabels:
        app.kubernetes.io/component: obsctl-reloader
        app.kubernetes.io/instance: obsctl-reloader
        app.kubernetes.io/name: obsctl-reloader
    template:
      metadata:
        labels:
          app.kubernetes.io/component: obsctl-reloader
          app.kubernetes.io/instance: obsctl-reloader
          app.kubernetes.io/name: obsctl-reloader
          app.kubernetes.io/version: latest
      spec:
        containers:
        - env:
          - name: NAMESPACE_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: OBSERVATORIUM_URL
            value: https://observatorium-mst.api.stage.openshift.com
          - name: OIDC_AUDIENCE
            value: observatorium
          - name: OIDC_ISSUER_URL
            value: https://sso.redhat.com/auth/realms/redhat-external
          - name: SLEEP_DURATION_SECONDS
            value: 15
          - name: MANAGED_TENANTS
            value: rhobs
          - name: OIDC_CLIENT_ID
            valueFrom:
              secretKeyRef:
                key: client_id
                name: ${TENANT_SECRET}
          - name: OIDC_CLIENT_SECRET
            valueFrom:
              secretKeyRef:
                key: client_secret
                name: ${TENANT_SECRET}
          image: quay.io/app-sre/obsctl-reloader:a9daddf
          imagePullPolicy: IfNotPresent
          name: obsctl-reloader
        serviceAccountName: obsctl-reloader

Keep in mind that all the PrometheusRule CRDs are combined to a single rule file by obsctl-reloader. Also, note that the obsctl-reloader project is still experimental, and needs to be deployed by the tenant.

Create a routing configuration in Alertmanager

Now that the alerting rule is correctly created, you can start to configure Alertmanager.

Configure alertmanager.yaml

Create a merge request to app-interface/resources/rhobs:

  • Choose the desired environment (production/stage) folder. For this tutorial, we will be using the stage environment.
  • Modify the alertmanager-routes-<instance>-secret.yaml file with the desired configuration.
  • After changing the file, open a merge request with the updated configuration file.

The alertmanager-routes-<instance>-secret.yaml already contains basic configuration, such as a customized template for slack notifications and a few receivers. For this tutorial, a slack-monitoring-alerts-stage receiver was configured with a route matching the rhobs tenant_id:

routes:
- matchers:
  - tenant_id = 0fc2b00e-201b-4c17-b9f2-19d91adc4fd2
  receiver: slack-monitoring-alerts-stage

For more information about how to configure Alertmanager, check out the official Alertmanager documentation.

Check the alerting rule state

It is possible to check all rule groups for a tenant by querying /api/v1/rules endpoint (i.e by running obsctl metrics get rules). /api/v1/rules supports only GET requests and proxies to the upstream read endpoint (in this case, Thanos Querier).

This endpoint returns the processed and evaluated rules from Observatorium’s Thanos Rule in Prometheus HTTP API format JSON.

It is different from api/v1/rules/raw endpoint (which can be queried by running obsctl metrics get rules.raw) in a few ways,

  • api/v1/rules/raw only returns the unprocessed/raw rule file YAML that was configured whereas api/v1/rules returns processed JSON rules with health and alert data.
  • api/v1/rules/raw immediately reflects changes to rules, whereas api/v1/rules can take up to a minute to sync with new changes.

Thanos Ruler evaluates the Prometheus rules - in this case for example, it checks which alerting rules will be triggered, the last time they were evaluated and more.

For example, if TestFiringAlert is already firing, the response will contain a "state": "firing" entry for this alert:

"alerts": [
  {
    "labels": {
      "alertname": "TestFiringAlert",
      "severity": "page",
      "tenant_id": "0fc2b00e-201b-4c17-b9f2-19d91adc4fd2"
    },
    "annotations": {
      "dashboard": "https://grafana.stage.devshift.net/d/Tg-mH0rizaSJDKSADX/api?orgId=1&refresh=1m",
      "description": "Test firing alert",
      "message": "Message of firing alert here",
      "runbook": "https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md",
      "summary": "Summary of firing alert here"
    },
    "state": "firing",
    "activeAt": "2022-03-02T10:13:39.051462148Z",
    "value": "1e+00",
    "partialResponseStrategy": "ABORT"
  }
],

If the alert has already the "state": "firing" entry, with the Alertmanager having the routing configuration for a specific receiver (in our case, slack), it should be possible to see the alert showing up on slack, in the configured slack channel.

Configure secrets in Vault

In case you want to configure a receiver (e.g. slack, pagerduty) to receive alert notifications, it is likely necessary that you’d need to provide secrets so that Alertmanager has push access to. Currently, you have to store the desired secrets in Vault and embed them via app-sre templating. Refer to https://vault.devshift.net/ui/vault/ to create a new secret or to retrieve an existing one. You can them embed this secret in your Alertmanager configuration file using the following syntax:

{{{ vault('app-sre/integrations-input/alertmanager-integration', 'slack_api_url') }}}

Where app-sre/integrations-input/alertmanager-integration is the path of the stored secret in Vault and slack_api_url is the key.

You can refer to the app-interface documentation to get more information about this.

Once your MR is merged with the desired Alertmanager configuration, the configuration file is reloaded by the Observatorium Alertmanager instances. To get your MR merged an approval from app-sre is necessary.

Testing the route configuration

If you want to test your Alertmanager configuration to verify that the configured receivers are receiving the right alert, we recommend the use of amtool.

Note that the original configuration file in app-interface is a file of type Secret. In this case, you should aim to test the data what is under alertmanager.yaml key. There may be also app-interface specific annotation (e.g. how the slack_url is constructed by retrieving a Vault secret) - which may prompt the validation by amtool to fail.

After installing amtool correctly, you can check the configuration of the alertmanager.yaml file with:

amtool check-config alertmanager.yaml

It is also possible to check the configuration against specific receivers.

For our example, we have slack-monitoring-alerts-stage receiver configured.

To check that the configured route matches the RHOBS tenant_id, we can run:

amtool config routes test --config.file=alertmanager.yaml --verify.receivers=slack-monitoring-alerts-stage tenant_id=0fc2b00e-201b-4c17-b9f2-19d91adc4fd2

Summary

After this tutorial, you should be able to:

  1. Create an alerting rule through Observatorium Rules API.
  2. Setup Observatorium Alertmanager instances with the desired routing configuration.
  3. Check that the integration works properly on the configured receiver.

Additional resources

In case problems occur or if you want to have a general overview, here is a list of links that can help you:

Stage Production
Alertmanager UI Alertmanager UI
Alertmanager logs Alertmanager logs
Thanos Rule logs Thanos Rule logs

Note: As of today, tenants are unable to access the Alertmanager UI. Please reach out to @observatorium-support in the #forum-observatorium to get help if needed.

Last modified November 14, 2024