Alerting based on metrics

In this tutorial we will create alerts on the ping_request_count metric that we instrumented earlier in the Instrumenting HTTP server written in Go tutorial.

For the sake of this tutorial, we will alert when the ping_request_count metric is greater than 5. Check out the real-world best practices to learn more about alerting principles.

Download the latest release of Alertmanager for your operating system from the Prometheus download page.

Alertmanager supports various receivers, such as email, webhook, PagerDuty, and Slack, through which it can notify you when an alert is firing. You can find the list of receivers and how to configure them in the Alertmanager configuration documentation. We will use a webhook as the receiver for this tutorial. Head over to webhook.site and copy the webhook URL, which we will use later to configure Alertmanager.

First, let's set up Alertmanager with the webhook receiver.

alertmanager.yml

global:
  resolve_timeout: 5m
route:
  receiver: webhook_receiver
receivers:
    - name: webhook_receiver
      webhook_configs:
        - url: '<INSERT-YOUR-WEBHOOK>'
          send_resolved: false

In the alertmanager.yml file, replace <INSERT-YOUR-WEBHOOK> with the webhook URL that we copied earlier, then run Alertmanager using the following command.

alertmanager --config.file=alertmanager.yml

Once Alertmanager is up and running, navigate to http://localhost:9093 and you should be able to access it.

Now that we have configured Alertmanager with the webhook receiver, let's add the rules to the Prometheus config.

prometheus.yml

global:
 scrape_interval: 15s
 evaluation_interval: 10s
rule_files:
  - rules.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093
scrape_configs:
 - job_name: prometheus
   static_configs:
       - targets: ["localhost:9090"]
 - job_name: simple_server
   static_configs:
       - targets: ["localhost:8090"]

Note that the evaluation_interval, rule_files and alerting sections were added to the Prometheus config. evaluation_interval defines the interval at which the rules are evaluated, rule_files accepts an array of YAML files that define the rules, and the alerting section defines the Alertmanager configuration. As mentioned at the beginning of this tutorial, we will create a basic rule that raises an alert when the ping_request_count value is greater than 5.

rules.yml

groups:
 - name: Count greater than 5
   rules:
   - alert: CountGreaterThan5
     expr: ping_request_count > 5
     for: 10s
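
The rule above is the minimum needed to fire an alert. As an optional variant, the sketch below adds labels and annotations to the same rule so that the notification carries a severity and a readable summary. The severity value and the annotation wording are assumptions made for illustration; they are not part of the original tutorial.

```yaml
groups:
  - name: Count greater than 5
    rules:
      - alert: CountGreaterThan5
        expr: ping_request_count > 5
        for: 10s
        # Assumed label; pick values that match your own routing setup.
        labels:
          severity: warning
        # Annotation values are templates; $labels and $value are
        # filled in by Prometheus when the alert is evaluated.
        annotations:
          summary: "Ping count is above 5 on {{ $labels.instance }}"
          description: "ping_request_count is {{ $value }} (threshold: 5)."
```

Labels become part of the alert's identity and can be used for routing in Alertmanager, while annotations only carry descriptive text.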

Now let's run Prometheus using the following command.

prometheus --config.file=./prometheus.yml

Open http://localhost:9090/rules in your browser to see the rules. Next, run the instrumented ping server, visit the http://localhost:8090/ping endpoint, and refresh the page at least 6 times. You can check the ping count by navigating to the http://localhost:8090/metrics endpoint. To see the status of the alert, visit http://localhost:9090/alerts. When ping_request_count > 5 first evaluates to true, the alert enters the PENDING state; once the condition has held for the 10s "for" duration, the state becomes FIRING. If you now navigate back to your webhook.site URL, you will see the alert message.
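
If you would rather check the rule without clicking through the browser, promtool can unit test it against synthetic data. The following is a minimal sketch: the file name rules_test.yml and the sample values are hypothetical, and it assumes rules.yml is exactly as shown earlier, with no labels or annotations on the rule.

```yaml
# rules_test.yml (hypothetical file name)
rule_files:
  - rules.yml

evaluation_interval: 10s

tests:
  - interval: 10s
    # One synthetic sample every 10s: the counter reaches 6 at t=30s.
    input_series:
      - series: 'ping_request_count{job="simple_server", instance="localhost:8090"}'
        values: '0 2 4 6 6 6 6'
    alert_rule_test:
      # At 20s the value is 4, so no alert is expected.
      - eval_time: 20s
        alertname: CountGreaterThan5
      # At 1m the condition has held well beyond the 10s "for" duration.
      - eval_time: 1m
        alertname: CountGreaterThan5
        exp_alerts:
          - exp_labels:
              job: simple_server
              instance: localhost:8090
```

Run it with promtool test rules rules_test.yml from the directory that contains rules.yml.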

Similarly Alertmanager can be configured with other receivers to notify when an alert is firing.
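
As an illustration, a Slack receiver could look roughly like the sketch below. The receiver name, the channel, and the <INSERT-YOUR-SLACK-WEBHOOK-URL> placeholder are assumptions; consult the Alertmanager configuration documentation for the full set of options.

```yaml
global:
  resolve_timeout: 5m
route:
  receiver: slack_receiver
receivers:
  - name: slack_receiver
    slack_configs:
      # Placeholder: use the incoming webhook URL from your Slack workspace.
      - api_url: '<INSERT-YOUR-SLACK-WEBHOOK-URL>'
        # Assumed channel name.
        channel: '#alerts'
        # Also notify when the alert stops firing.
        send_resolved: true
```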

Inhibiting alerts from an entire cluster

When a whole cluster (or instance) becomes unreachable, you usually don't want a separate notification for every alert that fires as a consequence. Alertmanager's inhibition feature lets a single "cluster is down" alert mute all the dependent alerts coming from that same cluster, so you receive one meaningful notification instead of a flood.

Inhibition is configured with inhibit_rules in alertmanager.yml. The following rule mutes every alert that shares the same cluster label value as a firing ClusterUnreachable alert:

alertmanager.yml

inhibit_rules:
  - source_matchers:
      - 'alertname = "ClusterUnreachable"'
    target_matchers:
      - 'alertname != "ClusterUnreachable"'
    equal:
      - 'cluster'

In this rule:

  • source_matchers selects the alert that suppresses others when it is firing (here, ClusterUnreachable).
  • target_matchers selects the alerts to mute. ClusterUnreachable is excluded so the source alert itself is still delivered.
  • equal lists the labels whose values must match between the source and target alerts for the inhibition to apply. Alerts are muted only when they share the same cluster value, so an outage in one cluster never hides alerts from another.

For this to work, both the ClusterUnreachable alert and the alerts you want to mute must carry a cluster label, set for example in your alerting rules or added through external_labels. An alert also never inhibits itself, so ClusterUnreachable is always delivered.
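
The two ways of attaching the cluster label can be sketched as follows. The two fragments belong to separate files and are shown together only for brevity; you would normally pick one approach. The cluster name cluster-a, the job name cluster_probe, and the ClusterUnreachable expression are hypothetical and only illustrate the shape of the configuration.

```yaml
# prometheus.yml (fragment): attach a cluster label to every alert this
# Prometheus server sends to Alertmanager. "cluster-a" is a placeholder.
global:
  external_labels:
    cluster: cluster-a
---
# rules.yml (fragment): alternatively, set the label on individual rules.
# The job name "cluster_probe" and the expression are hypothetical.
groups:
  - name: cluster-availability
    rules:
      - alert: ClusterUnreachable
        expr: up{job="cluster_probe"} == 0
        for: 1m
        labels:
          cluster: cluster-a
```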
