configure-alerting-rules
# Configure Alerting Rules
Set up Prometheus alerting rules and Alertmanager for reliable, actionable incident notifications.
See Extended Examples for complete configuration files and templates.
## When to Use
- Implementing proactive monitoring with automated incident detection
- Routing alerts to appropriate teams based on severity and service ownership
- Reducing alert fatigue through intelligent grouping and deduplication
- Integrating monitoring with on-call systems (PagerDuty, Opsgenie)
- Establishing escalation policies for critical production issues
- Migrating from legacy monitoring systems to Prometheus-based alerting
- Creating actionable alerts that guide responders to resolution
## Inputs
- Required: Prometheus metrics to alert on (error rates, latency, saturation)
- Required: On-call rotation and escalation policies
- Optional: Existing alert definitions to migrate
- Optional: Notification channels (Slack, email, PagerDuty)
- Optional: Runbook documentation for common alerts
## Procedure
### Step 1: Deploy Alertmanager
Install and configure Alertmanager to receive alerts from Prometheus.
**Docker Compose deployment** (basic structure):

```yaml
version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
# ... (see EXAMPLES.md for complete configuration)
```

**Basic Alertmanager configuration** (`alertmanager.yml`, excerpt):

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
# ... (see EXAMPLES.md for complete routing, inhibition rules, and receivers)
```
**Configure Prometheus to use Alertmanager** (`prometheus.yml`):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2
```

**Expected:** Alertmanager UI accessible at `http://localhost:9093`, Prometheus "Status > Alertmanagers" shows UP status.

**On failure:**
- Check Alertmanager logs: `docker logs alertmanager`
- Verify Prometheus can reach Alertmanager: `curl http://alertmanager:9093/api/v2/status`
- Test webhook URLs: `curl -X POST <SLACK_WEBHOOK_URL> -d '{"text":"test"}'`
- Validate YAML syntax: `amtool check-config alertmanager.yml`
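To exercise the notification path before any real rules fire, you can push a synthetic alert straight into Alertmanager's v2 API; this assumes Alertmanager is reachable on the port mapped above:

```bash
# Post a synthetic alert; it should appear in the Alertmanager UI and be
# delivered through whichever route matches its labels.
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "SyntheticTest", "severity": "warning"},
    "annotations": {"summary": "End-to-end notification path test"}
  }]'
```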
### Step 2: Define Alerting Rules in Prometheus
Create alerting rules that fire when conditions are met.
**Create alerting rules file** (`/etc/prometheus/rules/alerts.yml`, excerpt):

```yaml
groups:
  - name: instance_alerts
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for >5min."
          runbook_url: "https://wiki.example.com/runbooks/instance-down"
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
# ... (see EXAMPLES.md for complete alerts)
```

**Alert design best practices:**
- `for` duration: Prevents flapping alerts. Use 5-10 minutes for most alerts.
- Descriptive annotations: Include current value, affected resource, and runbook link.
- Severity levels: critical (pages on-call), warning (investigate), info (FYI)
- Team labels: Enable routing to correct team/channel
- Runbook links: Every alert should have a runbook URL

**Load rules into Prometheus** (`prometheus.yml`):

```yaml
rule_files:
  - "rules/*.yml"
```
**Validate and reload:**

```bash
promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload
```

**Expected:** Alerts visible in Prometheus "Alerts" page, alerts fire when thresholds exceeded, Alertmanager receives fired alerts.

**On failure:**
- Check Prometheus logs for rule evaluation errors
- Verify rule syntax with `promtool check rules`
- Test alert queries independently in Prometheus UI
- Inspect alert state transitions: Inactive → Pending → Firing
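Rules can also be unit-tested offline with `promtool test rules` before deployment. A minimal sketch for the `InstanceDown` rule above; the file name `alert_tests.yml` and the `job="node"` label are assumptions for illustration:

```yaml
# alert_tests.yml -- run with: promtool test rules alert_tests.yml
rule_files:
  - /etc/prometheus/rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # up stays at 0 for 15 samples, i.e. the instance is down
      - series: 'up{instance="app-server-1", job="node"}'
        values: '0x15'
    alert_rule_test:
      - eval_time: 10m          # past the 5m `for` window, so the alert fires
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              instance: app-server-1
              job: node
            exp_annotations:
              summary: "Instance app-server-1 is down"
              description: "app-server-1 has been down for >5min."
              runbook_url: "https://wiki.example.com/runbooks/instance-down"
```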
### Step 3: Create Notification Templates
Design readable, actionable notification messages.
**Create template file** (`/etc/alertmanager/templates/default.tmpl`, excerpt):

```gotmpl
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}
```

... (see EXAMPLES.md for complete email and PagerDuty templates)
**Use templates in receivers**:

```yaml
receivers:
  - name: 'slack-custom'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
```

**Expected:** Notifications formatted consistently, include all relevant context, actionable with runbook links.

**On failure:**
- Test template rendering: `amtool template test --config.file=alertmanager.yml`
- Check template syntax errors in Alertmanager logs
- Use `{{ . | json }}` to debug template data structure
### Step 4: Configure Routing and Grouping
Optimize alert delivery with intelligent routing rules.
**Advanced routing configuration** (excerpt):

```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  routes:
    - match:
        team: platform
      receiver: 'team-platform'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-platform'
          group_wait: 10s
          repeat_interval: 15m
          continue: true  # Also send to Slack
```

... (see EXAMPLES.md for complete routing with time intervals)
**Grouping strategies**:

```yaml
# Group by alertname: all HighCPU alerts bundled together
group_by: ['alertname']

# Group by alertname AND cluster: separate notifications per cluster
group_by: ['alertname', 'cluster']
```

**Expected:** Alerts routed to correct teams, grouped logically, timing appropriate for severity.

**On failure:**
- Test routing: `amtool config routes test --config.file=alertmanager.yml --alertname=HighCPU --label=severity=critical`
- Check routing tree: `amtool config routes show --config.file=alertmanager.yml`
- Verify `continue: true` if alert should match multiple routes

### Step 5: Implement Inhibition and Silencing
Reduce alert noise with inhibition rules and temporary silences.
**Inhibition rules** (suppress dependent alerts):

```yaml
inhibit_rules:
  # Cluster down suppresses all node alerts in that cluster
  - source_match:
      alertname: 'ClusterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(InstanceDown|HighCPU|HighMemory)'
    equal: ['cluster']
  # Service down suppresses latency and error alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate)'
    equal: ['service', 'namespace']
```

... (see EXAMPLES.md for more inhibition patterns)
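The matching semantics can be hard to reason about from YAML alone. This Python sketch (not Alertmanager code, just an illustration of the rule logic) shows when a firing source alert suppresses a target: the source must match every `source_match` label exactly, the target must match every `target_match_re` regex, and both must agree on all `equal` labels.

```python
import re

def inhibits(rule: dict, source: dict, target: dict) -> bool:
    """Sketch of Alertmanager inhibition semantics for one rule."""
    # Source alert must carry every source_match label with the exact value.
    if any(source.get(k) != v for k, v in rule.get("source_match", {}).items()):
        return False
    # Target alert must match every target_match_re regex (anchored).
    for k, pattern in rule.get("target_match_re", {}).items():
        if not re.fullmatch(pattern, target.get(k, "")):
            return False
    # Both alerts must agree on every label listed in `equal`.
    return all(source.get(k) == target.get(k) for k in rule.get("equal", []))

rule = {
    "source_match": {"alertname": "ClusterDown", "severity": "critical"},
    "target_match_re": {"alertname": "(InstanceDown|HighCPU|HighMemory)"},
    "equal": ["cluster"],
}
source = {"alertname": "ClusterDown", "severity": "critical", "cluster": "prod-1"}
print(inhibits(rule, source, {"alertname": "InstanceDown", "cluster": "prod-1"}))  # → True
print(inhibits(rule, source, {"alertname": "InstanceDown", "cluster": "prod-2"}))  # → False (different cluster)
```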
**Create silences programmatically**:

```bash
# Silence during maintenance
amtool silence add \
  instance=app-server-1 \
  --author="ops-team" \
  --comment="Scheduled maintenance" \
  --duration=2h

# List and manage silences
amtool silence query
amtool silence expire <SILENCE_ID>
```

**Expected:** Inhibition reduces cascade alerts automatically, silences prevent notifications during planned maintenance.

**On failure:**
- Test inhibition logic with live alerts
- Check Alertmanager UI "Silences" tab
- Verify silence matchers are exact (labels must match perfectly)
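Silences can also be created over the HTTP API (`POST /api/v2/silences`). A minimal Python sketch of building that request body; the helper name is an illustrative assumption:

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers: dict, duration_hours: float, author: str, comment: str) -> dict:
    """Build a request body for Alertmanager's POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # isRegex=False means exact label matching, like amtool's name=value
            {"name": k, "value": v, "isRegex": False, "isEqual": True}
            for k, v in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

payload = build_silence({"instance": "app-server-1"}, 2, "ops-team", "Scheduled maintenance")
print(json.dumps(payload, indent=2))
# POST the JSON to http://localhost:9093/api/v2/silences to create the silence.
```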
### Step 6: Integrate with External Systems
Connect Alertmanager to PagerDuty, Opsgenie, Jira, etc.
**PagerDuty integration** (excerpt):

```yaml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_INTEGRATION_KEY'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ range .Alerts.Firing }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          alertname: '{{ .GroupLabels.alertname }}'
# ... (see EXAMPLES.md for complete integration examples)
```

**Webhook for custom integrations**:

```yaml
receivers:
  - name: 'webhook-custom'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'
        send_resolved: true
```

**Expected:** Alerts create incidents in PagerDuty, appear in team communication channels, trigger on-call escalations.

**On failure:**
- Verify API keys/tokens are valid
- Check network connectivity to external services
- Test webhook endpoints independently with curl
- Enable debug mode: `--log.level=debug`
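On the custom-webhook side, Alertmanager POSTs a JSON document (payload `version: "4"`) containing the grouped alerts. A minimal Python sketch of a receiver; the summary format and server port are assumptions for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload: dict) -> list[str]:
    """One line per alert from an Alertmanager webhook payload (version 4)."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f"[{alert.get('status', 'unknown').upper()}] "
            f"{labels.get('alertname', '?')} "
            f"severity={labels.get('severity', 'none')} "
            f"instance={labels.get('instance', '?')}"
        )
    return lines

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for line in summarize(json.loads(body)):
            print(line)  # forward to your ticketing/chat system here
        self.send_response(200)
        self.end_headers()

# To serve: HTTPServer(("", 8080), AlertHandler).serve_forever()

sample = {
    "version": "4",
    "status": "firing",
    "alerts": [{"status": "firing",
                "labels": {"alertname": "InstanceDown",
                           "severity": "critical",
                           "instance": "app-server-1"}}],
}
print(summarize(sample))
```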
## Validation
- Alertmanager receives alerts from Prometheus successfully
- Alerts routed to correct teams based on labels and severity
- Notifications delivered to Slack, email, or PagerDuty
- Alert grouping reduces notification volume appropriately
- Inhibition rules suppress dependent alerts correctly
- Silences prevent notifications during maintenance windows
- Notification templates include runbook links and context
- Repeat interval prevents alert fatigue for long-running issues
- Resolved notifications sent when alerts clear
- External integrations (PagerDuty, Opsgenie) create incidents
## Common Pitfalls
- Alert fatigue: Too many low-priority alerts cause responders to ignore critical ones. Set strict thresholds, use inhibition.
- Missing `for` duration: Alerts without `for` fire on transient spikes. Always use 5-10 minute windows.
- Overly broad grouping: Grouping by `['...']` (group by all labels) sends individual notifications. Use specific label grouping.
- No runbook links: Alerts without runbooks leave responders guessing. Every alert needs a runbook URL.
- Incorrect severity: Mislabeling warnings as critical desensitizes team. Reserve critical for emergencies.
- Forgotten silences: Silences without expiration can hide real issues. Always set end times.
- Single route: All alerts to one channel loses context. Use team-specific routing.
- No inhibition: Cascade alerts during outages create noise. Implement inhibition rules.
## Related Skills
- `setup-prometheus-monitoring` - Define metrics and recording rules that feed alerting rules
- `define-slo-sli-sla` - Generate SLO burn rate alerts for error budget management
- `write-incident-runbook` - Create runbooks linked from alert annotations
- `build-grafana-dashboards` - Visualize alert firing history and silence patterns