Chapter 21 - Metrics Monitoring and Alerting System

This chapter focuses on designing a highly scalable metrics monitoring and alerting system, which is critical for ensuring high availability and reliability.

Step 1 - Understand the Problem and Establish Design Scope

A metrics monitoring system can mean many different things - e.g., you don't want to design a log aggregation system when the interviewer is only interested in infrastructure metrics.

Let’s try to understand the problem first:

High-level requirements and assumptions

The infrastructure being monitored is large-scale:

A variety of metrics can be monitored:

Non-functional requirements

What requirements are out of scope?

Step 2 - Propose High-Level Design and Get Buy-In

Fundamentals

There are five core components involved in a metrics monitoring and alerting system: metrics-monitoring-core-components

Data model

Metrics data is usually recorded as a time-series, which contains a set of values with timestamps. The series can be identified by name and an optional set of tags.

Example 1 - What is the CPU load on production server instance i631 at 20:00? metrics-example-1

The data can be identified by the following table: metrics-example-1-data

A time series is identified by the metric name and labels, and each row represents a single point of that series at a specific time.
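As a rough illustration of this data model, a single point of a time series could be represented as follows in Python (a minimal sketch - the field names and example values are assumptions, not tied to any particular database):

from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    # The metric name plus the labels identify the time series;
    # (timestamp, value) is one sample of that series.
    name: str        # e.g. "CPU.load"
    labels: dict     # e.g. {"host": "i631", "env": "prod"}
    timestamp: int   # Unix epoch seconds
    value: float     # the measured value at that timestamp

point = MetricPoint(
    name="CPU.load",
    labels={"host": "i631", "env": "prod"},
    timestamp=1613707265,
    value=0.29,      # hypothetical CPU load reading
)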

Example 2 - What is the average CPU load across all web servers in the us-west region for the last 10min?

CPU.load host=webserver01,region=us-west 1613707265 50
CPU.load host=webserver01,region=us-west 1613707265 62
CPU.load host=webserver02,region=us-west 1613707265 43
CPU.load host=webserver02,region=us-west 1613707265 53
...
CPU.load host=webserver01,region=us-west 1613707265 76
CPU.load host=webserver01,region=us-west 1613707265 83

This is example data we might pull from storage to answer that question. The average CPU load can be calculated by averaging the values in the last column.
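As a tiny sketch of that computation, assuming the rows above have already been fetched from storage as line-protocol strings:

rows = [
    "CPU.load host=webserver01,region=us-west 1613707265 50",
    "CPU.load host=webserver01,region=us-west 1613707265 62",
    "CPU.load host=webserver02,region=us-west 1613707265 43",
]

# the metric value is the last whitespace-separated field on each line
values = [float(line.split()[-1]) for line in rows]
average_cpu_load = sum(values) / len(values)
print(average_cpu_load)  # roughly 51.67 for these three rows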

The format shown above is called the line protocol and is used by many popular monitoring tools on the market - e.g., Prometheus and OpenTSDB.

What every time series consists of: time-series-data-example

A good way to visualize what the data looks like: time-series-data-viz

The data access pattern is write-heavy with spiky reads - we collect a lot of metrics, but they are accessed infrequently, usually in bursts (e.g., during ongoing incidents).

The data storage system is the heart of this design.

There are many databases specifically tailored for storing time-series data. Many of them support custom query interfaces that allow for effective querying of time-series data.

Example scale of InfluxDB - more than 250k writes per second when provisioned with 8 cores and 32 GB of RAM: influxdb-scale

You are not expected to understand the internals of a metrics database, as that is niche knowledge. You might be asked about it only if you've mentioned it on your resume.

For the purposes of the interview, it is sufficient to understand that metrics are time-series data and to be aware of popular time-series databases, like InfluxDB.

One nice feature of time-series databases is the efficient aggregation and analysis of large amounts of time-series data by labels. InfluxDB, for example, builds indexes for each label.

It is critical, however, to keep the cardinality of labels low - i.e., not to use too many unique labels.

High-level Design

high-level-design

Step 3 - Design Deep Dive

Let’s deep dive into several of the more interesting parts of the system.

Metrics collection

For metrics collection, occasional data loss is not critical. It’s acceptable for clients to fire and forget. metrics-collection

There are two ways to implement metrics collection - pull or push.

Here's what the pull model might look like: pull-model-example

For this solution, the metrics collector needs to maintain an up-to-date list of services and metrics endpoints. We can use Zookeeper or etcd for that purpose - service discovery.

Service discovery contains configuration rules about when and where to collect metrics from: service-discovery-example

Here’s a detailed explanation of the metrics collection flow: metrics-collection-flow
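A heavily simplified sketch of that flow - the collector resolves its targets from service discovery and scrapes each one's metrics endpoint. The etcd3 client, the /services/ key layout, and the metrics URL scheme are all assumptions for illustration:

import json
import time
import urllib.request

import etcd3  # third-party etcd client, used here as the service-discovery store

sd = etcd3.client(host="etcd.internal", port=2379)  # assumed endpoint

def discover_targets():
    # Assumed registration scheme: every service instance stores a JSON blob
    # containing its metrics endpoint under the /services/ prefix.
    return [json.loads(value)["metrics_url"]
            for value, _meta in sd.get_prefix("/services/")]

def scrape(url):
    # Pull whatever the service exposes on its metrics endpoint.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode()

while True:
    for target in discover_targets():
        try:
            payload = scrape(target)
            # ... hand the payload off to the transmission pipeline (e.g. Kafka)
        except OSError:
            pass  # fire-and-forget: occasional data loss is acceptable here
    time.sleep(10)  # assumed scrape interval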

At our scale, a single metrics collector is not enough. There must be multiple instances. However, there must also be some kind of synchronization among them so that two collectors don’t collect the same metrics twice.

One solution for this is to position collectors and servers on a consistent hash ring and associate a set of servers with a single collector only: consistent-hash-ring
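A minimal sketch of that idea - both collectors and servers are hashed onto a ring, and each server is owned by the first collector found clockwise from it. Virtual nodes and rebalancing on membership changes are omitted:

import bisect
import hashlib

def ring_position(name: str) -> int:
    # Stable hash that every collector instance computes identically.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

collectors = ["collector-1", "collector-2", "collector-3"]
ring = sorted((ring_position(c), c) for c in collectors)
positions = [pos for pos, _ in ring]

def collector_for(server: str) -> str:
    # Walk clockwise from the server's position to the next collector.
    idx = bisect.bisect(positions, ring_position(server)) % len(ring)
    return ring[idx][1]

# Every collector runs the same computation, so they all agree on ownership
# without extra coordination, and no server is scraped twice.
print(collector_for("webserver01"))
print(collector_for("webserver02"))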

With the push model, on the other hand, services push their metrics to the metrics collector proactively: push-model-example

In this approach, typically a collection agent is installed alongside service instances. The agent collects metrics from the server and pushes them to the metrics collector. metrics-collector-agent

With this model, we can potentially aggregate metrics before sending them to the collector, which reduces the volume of data processed by the collector.
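As a rough sketch of what agent-side aggregation could look like - counter increments are accumulated in memory and only one aggregated value per metric is flushed each interval (the actual push to the collector is stubbed out):

import threading
import time
from collections import defaultdict

def send_to_collector(metric, value, timestamp):
    # Stub for the real push (HTTP/UDP) to the metrics collector.
    print(f"push {metric}={value} @ {timestamp}")

class AggregatingAgent:
    # Accumulates counter increments locally and flushes one aggregated
    # data point per metric every flush_interval seconds.

    def __init__(self, flush_interval=10):
        self.flush_interval = flush_interval
        self._counters = defaultdict(float)
        self._lock = threading.Lock()

    def increment(self, metric, value=1.0):
        with self._lock:
            self._counters[metric] += value

    def flush_forever(self):
        while True:
            time.sleep(self.flush_interval)
            with self._lock:
                snapshot, self._counters = dict(self._counters), defaultdict(float)
            for metric, total in snapshot.items():
                # one point per metric per interval instead of one per event
                send_to_collector(metric, total, int(time.time()))

agent = AggregatingAgent()
threading.Thread(target=agent.flush_forever, daemon=True).start()
agent.increment("requests.count")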

On the flip side, the metrics collector can reject push requests if it can't handle the load. It is therefore important to put the collector in an auto-scaling group behind a load balancer.

So which one is better? There are trade-offs between both approaches, and different systems make different choices.

Here are some of the main differences between push and pull:

Easy debugging - Pull: the /metrics endpoint on application servers used for pulling metrics can be used to view metrics at any time; you can even do this on your laptop. Push: if the metrics collector doesn't receive metrics, the problem might be caused by network issues. Pull wins.

Health check - Pull: if an application server doesn't respond to the pull, you can quickly figure out whether it is down. Push: if the metrics collector doesn't receive metrics, the problem might be caused by network issues. Pull wins.

Short-lived jobs - Some batch jobs might be short-lived and don't last long enough to be pulled. This can be fixed by introducing push gateways for the pull model [22]. Push wins.

Firewall or complicated network setups - Pull: having servers pulling metrics requires all metric endpoints to be reachable, which is potentially problematic in multi-data-center setups and might require a more elaborate network infrastructure. Push: if the metrics collector is set up with a load balancer and an auto-scaling group, it is possible to receive data from anywhere. Push wins.

Performance - Pull methods typically use TCP, while push methods typically use UDP, which means push provides lower-latency transport of metrics. The counterargument is that the effort of establishing a TCP connection is small compared to sending the metrics payload.

Data authenticity - Pull: application servers to collect metrics from are defined in config files in advance, so metrics gathered from those servers are guaranteed to be authentic. Push: any kind of client can push metrics to the metrics collector; this can be fixed by whitelisting servers from which to accept metrics, or by requiring authentication.

There is no clear winner. A large organization probably needs to support both - for some workloads, there might not even be a way to install a push agent in the first place.

Scale the metrics transmission pipeline

metrics-transmission-pipeline

The metrics collector is provisioned in an auto-scaling group, regardless of whether we use the push or pull model.

There is a chance of data loss if the time-series DB is down, however. To mitigate this, we’ll provision a queuing mechanism: queuing-mechanism

This approach has several advantages:

Kafka can be configured with one partition per metric name, so that consumers can aggregate data by metric names. To scale this, we can further partition by tags/labels and categorize/prioritize metrics to be collected first. metrics-collection-kafka
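A sketch of how the collector could publish to Kafka so that all points of a given metric end up in the same partition, by using the metric name as the partition key (kafka-python is the assumed client here, and the topic name is made up):

import json
from kafka import KafkaProducer  # assumed client library (kafka-python)

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(metric_name, labels, timestamp, value):
    # Keying by metric name routes all points of the same metric to the same
    # partition, so a downstream consumer can aggregate a metric without
    # shuffling data between consumers.
    producer.send(
        "metrics",  # hypothetical topic name
        key=metric_name,
        value={"name": metric_name, "labels": labels, "ts": timestamp, "value": value},
    )

publish("CPU.load", {"host": "webserver01", "region": "us-west"}, 1613707265, 50)
producer.flush()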

The main downside of using Kafka for this problem is the maintenance/operational overhead. An alternative is to use a large-scale ingestion system like Facebook's Gorilla; it can be argued that this would be as scalable as using Kafka for queuing.

Where aggregations can happen

Metrics can be aggregated at several places. There are trade-offs between different choices:

Query Service

Having a query service separate from the time-series DB decouples the visualization and alerting systems from the database, which lets us swap the DB at will.

We can add a cache layer here to reduce the load on the time-series database: cache-layer-query-service

We can also avoid adding a query service altogether, as most visualization and alerting systems have powerful plugins to integrate with most time-series databases. With a well-chosen time-series DB, we might not need to introduce our own caching layer either.

Most time-series DBs don't support SQL, simply because it is unwieldy for querying time-series data. Here's an example SQL query that computes a moving average:

select id,
       temp,
       avg(temp) over (partition by group_nr order by time_read) as rolling_avg
from (
  -- number consecutive readings within the same 15-minute bucket
  select id,
         temp,
         time_read,
         interval_group,
         id - row_number() over (partition by interval_group order by time_read) as group_nr
  from (
    -- bucket each reading into a 15-minute (900-second) interval
    select id,
           time_read,
           'epoch'::timestamp + '900 seconds'::interval * (extract(epoch from time_read)::int4 / 900) as interval_group,
           temp
    from readings
  ) t1
) t2
order by time_read;

Here's a similar query in Flux, the query language used in InfluxDB:

from(db:"telegraf")
  |> range(start:-1h)
  |> filter(fn: (r) => r._measurement == "foo")
  |> exponentialMovingAverage(size:-10s)

Storage layer

It is important to choose the time-series database carefully.

According to research published by Facebook, ~85% of queries to the operational store were for data from the past 26h.

If we choose a database that harnesses this property, it could have a significant impact on system performance. InfluxDB is one such option.

Regardless of the database we choose, there are some optimizations we might employ.

Data encoding and compression can significantly reduce the size of data. Those features are usually built into a good time-series database. double-delta-encoding

In the above example, instead of storing full timestamps, we can store timestamp deltas.
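A tiny worked example of the idea. Storing deltas (and deltas of deltas) of timestamps yields very small numbers that compress well; real systems such as Gorilla additionally bit-pack them, which is omitted here:

timestamps = [1610000000, 1610000010, 1610000020, 1610000030, 1610000041]

deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
# [10, 10, 10, 11] - much smaller numbers than the full timestamps

deltas_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
# [0, 0, 1] - for regularly scraped metrics this is almost always 0,
# which is why double-delta encoding compresses so well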

Another technique we can employ is down-sampling - converting high-resolution data to low-resolution in order to reduce disk usage.

We can apply this to old data and make the retention/resolution rules configurable by data scientists.

For example, here’s a 10-second resolution metrics table:

metric   timestamp              hostname   metric_value
cpu      2021-10-24T19:00:00Z   host-a     10
cpu      2021-10-24T19:00:10Z   host-a     16
cpu      2021-10-24T19:00:20Z   host-a     20
cpu      2021-10-24T19:00:30Z   host-a     30
cpu      2021-10-24T19:00:40Z   host-a     20
cpu      2021-10-24T19:00:50Z   host-a     30

down-sampled to 30-second resolution:

metric   timestamp              hostname   metric_value (avg)
cpu      2021-10-24T19:00:00Z   host-a     19
cpu      2021-10-24T19:00:30Z   host-a     25
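A minimal sketch of this kind of down-sampling - averaging raw points within fixed windows. The bucket boundaries here are plain left-aligned windows, so the resulting averages won't necessarily match the illustrative numbers in the table above:

from collections import defaultdict

raw = [  # (seconds since 19:00:00, value) at 10-second resolution
    (0, 10), (10, 16), (20, 20), (30, 30), (40, 20), (50, 30),
]

def downsample(points, window=30):
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window].append(value)  # align each point to its window start
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

print(downsample(raw))  # [(0, 15.33...), (30, 26.66...)]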

Finally, we can also move old data that is rarely accessed to cold storage, whose financial cost is much lower.

Alerting system

alerting-system

Alert configuration is loaded into cache servers. Rules are typically defined in YAML format. Here's an example:

- name: instance_down
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: instance_down
    expr: up == 0
    for: 5m
    labels:
      severity: page

The alert manager fetches alert configurations from cache. Based on configuration rules, it also calls the query service at a predefined interval. If a rule is met, an alert event is created.
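A heavily simplified sketch of that evaluation loop. The query-service endpoint, its response format, and the parsed rule fields are all assumptions loosely matching the YAML example above:

import json
import time
import urllib.parse
import urllib.request

rule = {  # parsed from the alert configuration shown above
    "alert": "instance_down",
    "expr": "up == 0",
    "for_seconds": 300,       # "for: 5m"
    "labels": {"severity": "page"},
}

pending_since = {}  # instance -> first time the expression matched it

def fire_alert(rule, instance):
    # In the full design this creates an alert event, records it in the
    # alert store, and publishes it to Kafka.
    print(f"ALERT {rule['alert']} on {instance} labels={rule['labels']}")

def evaluate(rule):
    # Ask the query service which instances currently match the expression
    # (hypothetical endpoint and response shape).
    url = "http://query-service/query?expr=" + urllib.parse.quote(rule["expr"])
    with urllib.request.urlopen(url, timeout=5) as resp:
        matching = set(json.loads(resp.read())["instances"])

    now = time.time()
    for instance in matching:
        first_seen = pending_since.setdefault(instance, now)
        if now - first_seen >= rule["for_seconds"]:
            fire_alert(rule, instance)
    for instance in list(pending_since):
        if instance not in matching:
            del pending_since[instance]  # the instance recovered

while True:
    evaluate(rule)
    time.sleep(60)  # assumed evaluation interval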

Other responsibilities of the alert manager are:

The alert store is a key-value database, like Cassandra, which keeps the state of all alerts. It ensures a notification is sent at least once. Once an alert is triggered, it is published to Kafka.

Finally, alert consumers pull alerts data from Kafka and send notifications over to different channels - Email, text message, PagerDuty, webhooks.
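A small sketch of such a consumer fanning alerts out to notification channels (kafka-python assumed again; the channel senders are stubs and the topic name is made up):

import json
from kafka import KafkaConsumer  # assumed client library (kafka-python)

def send_email(alert):
    print("email:", alert["alert"])

def send_page(alert):
    print("pagerduty:", alert["alert"])

CHANNELS = {"page": [send_email, send_page], "warn": [send_email]}

consumer = KafkaConsumer(
    "alerts",                          # hypothetical topic the alert manager publishes to
    bootstrap_servers=["kafka-1:9092"],
    value_deserializer=lambda v: json.loads(v),
    enable_auto_commit=False,          # commit only after the notification is sent
)

for message in consumer:
    alert = message.value
    for send in CHANNELS.get(alert["labels"]["severity"], []):
        send(alert)                    # retrying before committing gives at-least-once delivery
    consumer.commit()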

In the real world, there are many off-the-shelf alerting solutions. It is difficult to justify building your own system in-house.

Visualization system

The visualization system shows metrics and alerts over a time period. Here's a dashboard built with Grafana: grafana-dashboard

A high-quality visualization system is very hard to build. It is hard to justify not using an off-the-shelf solution like Grafana.

Step 4 - Wrap up

Here’s our final design: final-design