configure-alerting-rules
# Configure Alerting Rules
Set up Prometheus alerting rules and Alertmanager for reliable, actionable incident notifications.
See Extended Examples for complete configuration files and templates.
## When to Use
- Implementing proactive monitoring with automated incident detection
- Routing alerts to appropriate teams based on severity and service ownership
- Reducing alert fatigue through intelligent grouping and deduplication
- Integrating monitoring with on-call systems (PagerDuty, Opsgenie)
- Establishing escalation policies for critical production issues
- Migrating from legacy monitoring systems to Prometheus-based alerting
- Creating actionable alerts that guide responders to resolution
## Inputs
- Required: Prometheus metrics to alert on (error rates, latency, saturation)
- Required: On-call rotation and escalation policies
- Optional: Existing alert definitions to migrate
- Optional: Notification channels (Slack, email, PagerDuty)
- Optional: Runbook documentation for common alerts
## Procedure
### Step 1: Deploy Alertmanager
Install and configure Alertmanager to receive alerts from Prometheus.
**Docker Compose deployment** (basic structure):

```yaml
version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
# ... (see EXAMPLES.md for complete configuration)
```

**Basic Alertmanager configuration** (`alertmanager.yml`, excerpt):

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
# ... (see EXAMPLES.md for complete routing, inhibition rules, and receivers)
```
**Configure Prometheus to use Alertmanager** (`prometheus.yml`):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2
```

**Expected:** Alertmanager UI accessible at `http://localhost:9093`, Prometheus "Status > Alertmanagers" shows UP status.

**On failure:**
- Check Alertmanager logs: `docker logs alertmanager`
- Verify Prometheus can reach Alertmanager: `curl http://alertmanager:9093/api/v2/status`
- Test webhook URLs: `curl -X POST <SLACK_WEBHOOK_URL> -d '{"text":"test"}'`
- Validate YAML syntax: `amtool check-config alertmanager.yml`
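To exercise the notification path before any real rules fire, you can push a synthetic alert straight into Alertmanager's v2 API; this assumes Alertmanager is reachable on the port mapped above:

```bash
# Post a synthetic alert; it should appear in the Alertmanager UI and be
# delivered through whichever route matches its labels.
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "SyntheticTest", "severity": "warning"},
    "annotations": {"summary": "End-to-end notification path test"}
  }]'
```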
### Step 2: Define Alerting Rules in Prometheus
Create alerting rules that fire when conditions are met.
**Create alerting rules file** (`/etc/prometheus/rules/alerts.yml`, excerpt):

```yaml
groups:
  - name: instance_alerts
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for >5min."
          runbook_url: "https://wiki.example.com/runbooks/instance-down"
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
# ... (see EXAMPLES.md for complete alerts)
```

**Alert design best practices:**
- `for` duration: Prevents flapping alerts. Use 5-10 minutes for most alerts.
- Descriptive annotations: Include current value, affected resource, and runbook link.
- Severity levels: critical (pages on-call), warning (investigate), info (FYI)
- Team labels: Enable routing to correct team/channel
- Runbook links: Every alert should have a runbook URL

**Load rules into Prometheus** (`prometheus.yml`):

```yaml
rule_files:
  - "rules/*.yml"
```
**Validate and reload:**

```bash
promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload
```

**Expected:** Alerts visible in Prometheus "Alerts" page, alerts fire when thresholds exceeded, Alertmanager receives fired alerts.

**On failure:**
- Check Prometheus logs for rule evaluation errors
- Verify rule syntax with `promtool check rules`
- Test alert queries independently in Prometheus UI
- Inspect alert state transitions: Inactive → Pending → Firing
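Rules can also be unit-tested offline with `promtool test rules` before deployment. A minimal sketch for the `InstanceDown` rule above; the file name `alert_tests.yml` and the `job="node"` label are assumptions for illustration:

```yaml
# alert_tests.yml -- run with: promtool test rules alert_tests.yml
rule_files:
  - /etc/prometheus/rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # up stays at 0 for 15 samples, i.e. the instance is down
      - series: 'up{instance="app-server-1", job="node"}'
        values: '0x15'
    alert_rule_test:
      - eval_time: 10m          # past the 5m `for` window, so the alert fires
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              instance: app-server-1
              job: node
            exp_annotations:
              summary: "Instance app-server-1 is down"
              description: "app-server-1 has been down for >5min."
              runbook_url: "https://wiki.example.com/runbooks/instance-down"
```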
### Step 3: Create Notification Templates
Design readable, actionable notification messages.
**Create template file** (`/etc/alertmanager/templates/default.tmpl`, excerpt):

```gotmpl
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}
```

... (see EXAMPLES.md for complete email and PagerDuty templates)
**Use templates in receivers**:

```yaml
receivers:
  - name: 'slack-custom'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
```

**Expected:** Notifications formatted consistently, include all relevant context, actionable with runbook links.

**On failure:**
- Test template rendering: `amtool template test --config.file=alertmanager.yml`
- Check template syntax errors in Alertmanager logs
- Use `{{ . | json }}` to debug template data structure
### Step 4: Configure Routing and Grouping
Optimize alert delivery with intelligent routing rules.
**Advanced routing configuration** (excerpt):

```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  routes:
    - match:
        team: platform
      receiver: 'team-platform'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-platform'
          group_wait: 10s
          repeat_interval: 15m
          continue: true  # Also send to Slack
```

... (see EXAMPLES.md for complete routing with time intervals)
**Grouping strategies**:

```yaml
# Group by alertname: all HighCPU alerts bundled together
group_by: ['alertname']

# Group by alertname AND cluster: separate notifications per cluster
group_by: ['alertname', 'cluster']
```

**Expected:** Alerts routed to correct teams, grouped logically, timing appropriate for severity.

**On failure:**
- Test routing: `amtool config routes test --config.file=alertmanager.yml --alertname=HighCPU --label=severity=critical`
- Check routing tree: `amtool config routes show --config.file=alertmanager.yml`
- Verify `continue: true` if alert should match multiple routes

### Step 5: Implement Inhibition and Silencing
Reduce alert noise with inhibition rules and temporary silences.
**Inhibition rules** (suppress dependent alerts):

```yaml
inhibit_rules:
  # Cluster down suppresses all node alerts in that cluster
  - source_match:
      alertname: 'ClusterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(InstanceDown|HighCPU|HighMemory)'
    equal: ['cluster']
  # Service down suppresses latency and error alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate)'
    equal: ['service', 'namespace']
```

... (see EXAMPLES.md for more inhibition patterns)
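The matching semantics can be hard to reason about from YAML alone. This Python sketch (not Alertmanager code, just an illustration of the rule logic) shows when a firing source alert suppresses a target: the source must match every `source_match` label exactly, the target must match every `target_match_re` regex, and both must agree on all `equal` labels.

```python
import re

def inhibits(rule: dict, source: dict, target: dict) -> bool:
    """Sketch of Alertmanager inhibition semantics for one rule."""
    # Source alert must carry every source_match label with the exact value.
    if any(source.get(k) != v for k, v in rule.get("source_match", {}).items()):
        return False
    # Target alert must match every target_match_re regex (anchored).
    for k, pattern in rule.get("target_match_re", {}).items():
        if not re.fullmatch(pattern, target.get(k, "")):
            return False
    # Both alerts must agree on every label listed in `equal`.
    return all(source.get(k) == target.get(k) for k in rule.get("equal", []))

rule = {
    "source_match": {"alertname": "ClusterDown", "severity": "critical"},
    "target_match_re": {"alertname": "(InstanceDown|HighCPU|HighMemory)"},
    "equal": ["cluster"],
}
source = {"alertname": "ClusterDown", "severity": "critical", "cluster": "prod-1"}
print(inhibits(rule, source, {"alertname": "InstanceDown", "cluster": "prod-1"}))  # → True
print(inhibits(rule, source, {"alertname": "InstanceDown", "cluster": "prod-2"}))  # → False (different cluster)
```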
**Create silences programmatically**:

```bash
# Silence during maintenance
amtool silence add \
  instance=app-server-1 \
  --author="ops-team" \
  --comment="Scheduled maintenance" \
  --duration=2h

# List and manage silences
amtool silence query
amtool silence expire <SILENCE_ID>
```

**Expected:** Inhibition reduces cascade alerts automatically, silences prevent notifications during planned maintenance.

**On failure:**
- Test inhibition logic with live alerts
- Check Alertmanager UI "Silences" tab
- Verify silence matchers are exact (labels must match perfectly)
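Silences can also be created over the HTTP API (`POST /api/v2/silences`). A minimal Python sketch of building that request body; the helper name is an illustrative assumption:

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence(matchers: dict, duration_hours: float, author: str, comment: str) -> dict:
    """Build a request body for Alertmanager's POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # isRegex=False means exact label matching, like amtool's name=value
            {"name": k, "value": v, "isRegex": False, "isEqual": True}
            for k, v in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

payload = build_silence({"instance": "app-server-1"}, 2, "ops-team", "Scheduled maintenance")
print(json.dumps(payload, indent=2))
# POST the JSON to http://localhost:9093/api/v2/silences to create the silence.
```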
### Step 6: Integrate with External Systems
Connect Alertmanager to PagerDuty, Opsgenie, Jira, etc.
**PagerDuty integration** (excerpt):

```yaml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_INTEGRATION_KEY'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ range .Alerts.Firing }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          alertname: '{{ .GroupLabels.alertname }}'
# ... (see EXAMPLES.md for complete integration examples)
```

**Webhook for custom integrations**:

```yaml
receivers:
  - name: 'webhook-custom'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'
        send_resolved: true
```

**Expected:** Alerts create incidents in PagerDuty, appear in team communication channels, trigger on-call escalations.

**On failure:**
- Verify API keys/tokens are valid
- Check network connectivity to external services
- Test webhook endpoints independently with curl
- Enable debug mode: `--log.level=debug`
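On the custom-webhook side, Alertmanager POSTs a JSON document (payload `version: "4"`) containing the grouped alerts. A minimal Python sketch of a receiver; the summary format and server port are assumptions for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload: dict) -> list[str]:
    """One line per alert from an Alertmanager webhook payload (version 4)."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f"[{alert.get('status', 'unknown').upper()}] "
            f"{labels.get('alertname', '?')} "
            f"severity={labels.get('severity', 'none')} "
            f"instance={labels.get('instance', '?')}"
        )
    return lines

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for line in summarize(json.loads(body)):
            print(line)  # forward to your ticketing/chat system here
        self.send_response(200)
        self.end_headers()

# To serve: HTTPServer(("", 8080), AlertHandler).serve_forever()

sample = {
    "version": "4",
    "status": "firing",
    "alerts": [{"status": "firing",
                "labels": {"alertname": "InstanceDown",
                           "severity": "critical",
                           "instance": "app-server-1"}}],
}
print(summarize(sample))
```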
## Validation
- Alertmanager receives alerts from Prometheus successfully
- Alerts routed to correct teams based on labels and severity
- Notifications delivered to Slack, email, or PagerDuty
- Alert grouping reduces notification volume appropriately
- Inhibition rules suppress dependent alerts correctly
- Silences prevent notifications during maintenance windows
- Notification templates include runbook links and context
- Repeat interval prevents alert fatigue for long-running issues
- Resolved notifications sent when alerts clear
- External integrations (PagerDuty, Opsgenie) create incidents
## Common Pitfalls
- Alert fatigue: Too many low-priority alerts cause responders to ignore critical ones. Set strict thresholds, use inhibition.
- Missing `for` duration: Alerts without `for` fire on transient spikes. Always use 5-10 minute windows.
- Overly broad grouping: Grouping by `['...']` (group by all labels) sends individual notifications. Use specific label grouping.
- No runbook links: Alerts without runbooks leave responders guessing. Every alert needs a runbook URL.
- Incorrect severity: Mislabeling warnings as critical desensitizes team. Reserve critical for emergencies.
- Forgotten silences: Silences without expiration can hide real issues. Always set end times.
- Single route: All alerts to one channel loses context. Use team-specific routing.
- No inhibition: Cascade alerts during outages create noise. Implement inhibition rules.
## Related Skills
- `setup-prometheus-monitoring` - Define metrics and recording rules that feed alerting rules
- `define-slo-sli-sla` - Generate SLO burn rate alerts for error budget management
- `write-incident-runbook` - Create runbooks linked from alert annotations
- `build-grafana-dashboards` - Visualize alert firing history and silence patterns