signoz-explaining-alerts
Alert Explain
Decode an existing SigNoz alert's configuration into a plain-language
explanation. The skill is read-only and stays focused on the rule
itself: what it watches, when it fires, where it notifies. A single
line of fire-frequency data is included to ground the explanation, but
this skill does not investigate any specific fire — that is
signoz-investigating-alerts's job.
Prerequisites
This skill calls SigNoz MCP server tools (signoz:signoz_get_alert,
signoz:signoz_list_alert_rules, signoz:signoz_get_alert_history). Before running
the workflow, confirm the signoz:signoz_* tools are available. If they are
not, the SigNoz MCP server is not installed or configured — stop and
direct the user to set it up:
https://signoz.io/docs/ai/signoz-mcp-server/. Do not guess at alert
configuration from the rule name alone.
When to use
Use this skill when the user wants to:
- Understand or interpret an existing alert rule.
- Confirm what signal an alert watches and at what threshold.
- Audit whether an alert is reasonably configured.
- Translate raw alert JSON into operational language.
Do NOT use when the user wants to:
- Create a new alert → signoz-creating-alerts.
- Diagnose why an alert fired or correlate signals around a fire window → signoz-investigating-alerts.
- Modify an existing alert → call signoz:signoz_update_alert directly.
Required inputs
| Input | Required | Source if missing |
|---|---|---|
| Alert identifier (rule ID or name) | yes | |
If the input is missing or ambiguous, this skill is best-effort (not
strict — read-only operations are cheap to recover from):
- Call signoz:signoz_list_alert_rules, paginate through every page, and find the closest name match.
- State the interpretation in the response: "Interpreting your request as alert 'High Error Rate — Checkout' (id 42). If you meant a different one, tell me the name or id."
- Proceed with the explanation. The user can correct after.
Workflow
Step 1: Resolve the alert
If the user provided a numeric id, skip to Step 2. Otherwise:
- Call signoz:signoz_list_alert_rules and paginate every page — pagination.hasMore is true until the full list is walked.
- Match by name (case-insensitive substring). If multiple match, present the candidates and ask which one (interactive) or pick the closest and flag the assumption (autonomous).
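The paginate-then-match loop above can be sketched as follows. This is a sketch, not the skill's implementation: `call_tool` is a hypothetical helper for issuing MCP calls, and the `rules` field of the response is an assumption — only `pagination.hasMore` is named in the text. Adjust field names to what your MCP client actually returns.

```python
def resolve_alert(name_fragment, call_tool):
    """Walk every page of the alert list, then match by name substring."""
    rules, page = [], 1
    while True:
        # `call_tool` is a stand-in for however your client invokes MCP tools.
        resp = call_tool("signoz:signoz_list_alert_rules", page=page)
        rules.extend(resp["rules"])  # assumed response field
        if not resp["pagination"]["hasMore"]:  # walk every page before giving up
            break
        page += 1
    needle = name_fragment.lower()
    # Case-insensitive substring match; multiple hits => present candidates
    # (interactive) or pick the closest and flag the assumption (autonomous).
    return [r for r in rules if needle in r["name"].lower()]
```

Returning all matches (rather than silently taking the first) is what lets the caller decide between the interactive and autonomous behaviors described above.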
Step 2: Fetch the full configuration
Call signoz:signoz_get_alert with the rule id. This is mandatory — the
list response does not include the full condition / thresholds /
notification settings, and explanations based on the name alone are
guesses.
Step 3: Pull a one-line fire-frequency summary
Call signoz:signoz_get_alert_history for the rule with a 7-day lookback. From
the response, derive a single line:
Fired N times in the last 7d (last fire: <relative-time>).
If the alert never fired in the window, say so explicitly:
"Has not fired in the last 7d." If the alert is disabled, mention that
and skip the history line.
This single line grounds the explanation. Do not drill into specific
fires here — that's signoz-investigating-alerts.
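The derivation of the one-line summary can be sketched like this, assuming the history response reduces to a list of fire timestamps (epoch seconds) — the real signoz:signoz_get_alert_history payload shape may differ:

```python
import time

def fire_frequency_line(fire_timestamps, now=None):
    """Collapse a 7-day fire history into the single grounding line."""
    now = now or time.time()
    week_ago = now - 7 * 24 * 3600
    fires = sorted(t for t in fire_timestamps if t >= week_ago)
    if not fires:
        return "Has not fired in the last 7d."
    hours_ago = (now - fires[-1]) / 3600
    return f"Fired {len(fires)} times in the last 7d (last fire: {hours_ago:.0f}h ago)."
```

The disabled-alert case is intentionally outside this helper: the skill skips the history line entirely for disabled rules rather than emitting one.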
Step 4: Build the explanation
The single most useful thing for the user is a tight summary. Lead
with a TL;DR that directly answers the question they asked, not a
generic alert summary. The TL;DR is the only thing some users will
read — burying their answer under a fixed template forces them to
scroll for what they wanted in the first place.
Match the TL;DR shape to the user's question:
- "What does this alert do?" / "Explain X" — describe what fires: TL;DR: Fires when <condition> for <scope>, notifies <channel>. <fire-frequency line>
- "Is it configured correctly?" / "Audit this" / "Anything I should change?" — lead with the verdict and the top 1–3 changes, not the description of what fires: TL;DR: Mostly well-configured, but recommend: (1) add alertOnAbsent — currently a crashed service stays silent; (2) fix annotation template {{$topic}} → {{$labels.topic}} (won't interpolate); (3) split critical to PagerDuty (both tiers currently route to Slack). <fire-frequency line>
- "How does X work?" / "Explain the count guard" — answer the mechanism in 1–2 sentences before any framing: TL;DR: The count guard is a having: count() > 50 clause on query A — any 1-minute bucket with ≤50 spans is dropped before evaluation, so low-traffic minutes can't fire the alert.
- "What's the threshold?" / focused config question — state the exact thing they asked about: TL;DR: Threshold is 3 standard deviations (z-score), not a raw rate value. Daily seasonality means the model compares each hour against historical norms for that hour.
Always include the fire-frequency line and disabled status if
non-default — those ground every kind of TL;DR. But put the answer to
the user's specific question first.
After the TL;DR, write the explanation in prose, organized into the
four sections below. Skip any section that has nothing meaningful to
add — empty severity labels, default notification settings, vanilla
annotations don't deserve a header. Short and skimmable beats
perfunctorily complete; the user is not reading a checklist.
1. What it watches — one short paragraph. Combine signal type
(metrics / logs / traces / exceptions), what the query measures, and
scope. Translate the query to operational language; for formulas, name
each sub-query (A, B, …) and state what F1 (or whichever
selectedQueryName triggers) computes — e.g. "F1 = A × 100 / B → error
percentage". Decode filter operators (= equals, != not equals,
IN / NOT IN, LIKE / ILIKE, CONTAINS, REGEXP, EXISTS / NOT EXISTS);
enumerate IN / NOT IN value lists so the user can verify them. Name
each groupBy dimension and its practical effect ("fires separately per
service" for service.name).
For anomaly rules (ruleType: anomaly_rule), explicitly state that
the threshold is in standard deviations from the learned pattern, not
the raw value — this is the most common point of confusion. Include
algorithm (zscore), seasonality (hourly / daily / weekly), and how
lower/higher targets shift sensitivity (lower → more noise, higher →
only extreme deviations).
2. When it fires — one paragraph covering threshold + timing.
Decode the threshold spec into plain English using these mappings:
- op codes: 1 above, 2 below, 3 equal, 4 not equal.
- matchType codes: 1 at_least_once (any point in window), 2 all_the_times (entire window), 3 on_average (window average), 4 in_total (window sum), 5 last (most recent point).
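The two code tables above can be captured as lookup dicts. A decoding sketch under stated assumptions: real SigNoz payloads may carry these codes as strings rather than ints, and the sentence template here is invented for illustration.

```python
# Code-to-phrase tables, mirroring the mappings listed above.
OP = {1: "above", 2: "below", 3: "equal", 4: "not equal"}
MATCH_TYPE = {
    1: "at_least_once (any point in window)",
    2: "all_the_times (entire window)",
    3: "on_average (window average)",
    4: "in_total (window sum)",
    5: "last (most recent point)",
}

def decode_threshold(op, match_type, target, unit):
    """Render a threshold spec as plain English; unknown codes fall back to raw."""
    return (f"fires when the value is {OP.get(op, f'op={op}')} "
            f"{target} {unit}, evaluated {MATCH_TYPE.get(match_type, f'matchType={match_type}')}")
```

Falling back to the raw code for unknown values keeps the explanation honest instead of guessing at an undocumented mapping.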
State each threshold tier's name, target, targetUnit, and
attached channels. Always state the threshold in targetUnit, not
the native query unit (e.g. "fires when p99 exceeds 500 ms", not
"…exceeds 500 000 000 ns"). Note if recoveryTarget is set (hysteresis);
if absent, mention flap risk when the value hovers near the boundary.
Describe timing as "checks every <frequency> over the last
<evalWindow>", and mention that with at_least_once a single-point
breach triggers, while all_the_times requires the full window.
3. Where it notifies — channels per tier (resolved by name from
signoz_list_notification_channels if needed), notificationSettings.groupBy
(how notifications are bundled), renotify (interval + which states),
usePolicy (label-based routing). Skip this section entirely if
notification settings are vanilla and the user already saw the channel
in the TL;DR.
4. Notable concerns — flag only what's non-default and worth the
user's attention. Don't list every absent field; focus on the
high-leverage ones:
- Missing alertOnAbsent when the signal is critical: silent data loss (crashed service, broken instrumentation) won't trigger the alert. Always call this out for production-tier rules.
- alertOnAbsent: true but nodata not in renotify.alertStates: the absent-data fire pages once and then goes silent — easy to miss.
- Template variable bugs: {{$topic}} won't interpolate; the correct form is {{$labels.topic}}. Dots in label keys become underscores (service.name → {{$labels.service_name}}).
- Multiple severity tiers but missing labels.severity on the rule — breaks label-based routing policies. Common gap.
- All tiers route to the same channel — defeats the point of graduated thresholds.
- High-cardinality groupBy (e.g. partition × pod.name) → notification-storm risk during cluster-wide events.
- Annotation/description text contradicts matchType (e.g. description says "for over 5 minutes" but matchType=at_least_once fires on first breach within the window).
- Alert name doesn't match the filter target (e.g. name says "checkout" but filter targets payments) — call this out.
If none of these apply, omit the section. Better silent than padded.
If the user asked only "what does this alert do", stop here. The audit
(Step 5) is for "is it configured correctly" / "audit this" /
"anything I should change" requests.
Step 5: Assess the configuration (only if asked)
The user may ask "is this alert reasonable" alongside the explanation.
Only assess when asked or when the request implies it (audit, review,
"is this configured correctly"). Keep assessment grounded in what's
actually in the config:
- Threshold calibration — appropriate for the signal? Consider service criticality and traffic.
- matchType fit — at_least_once is sensitive (catches transients); all_the_times is conservative; on_average smooths noise.
- Window vs frequency — short window + at_least_once can be noisy. Long window can delay detection.
- Multi-severity — alerts with both warning and critical thresholds enable graduated response. Single-severity alerts miss this.
- Notification routing — critical → high-urgency channels (PagerDuty); warning → low-urgency (Slack).
- Missing runbook / description — if annotations are empty or default, suggest adding context.
- Absent-data monitoring — for critical signals, recommend alertOnAbsent: true if it isn't set.
- GroupBy cardinality — high-cardinality groupBy fields can produce many independent alert series; flag potential notification storms.
- Filter completeness — for IN / NOT IN filters with explicit value lists, flag values that look out of place or missing values that seem expected.
- Fire frequency vs threshold — if Step 3 shows the alert fires many times a day (>10/day in the 7d window), the threshold is likely too tight; if it never fires and the user is asking because they expected it to, the threshold may be too loose or the query may be wrong.
Step 6: Offer next steps
End with two or three actionable follow-ups:
- "Want me to investigate the most recent fire?" (→ signoz-investigating-alerts)
- "Want me to run the underlying query to see current values?" (→ signoz-generating-queries)
- "Want me to adjust the threshold or add a severity level?" (→ signoz:signoz_update_alert directly — out of scope here)
- "Want me to create a related alert for [gap noticed]?" (→ signoz-creating-alerts)
Guardrails
- Fetch before explaining. Always call signoz:signoz_get_alert. Do not base explanations on the rule name or list response alone.
- Always pull fire history. The one-line frequency summary is cheap (one MCP call) and grounds the explanation. Skip it only if the alert is disabled.
- Decode, don't dump. Translate op, matchType, filter expressions, and query JSON into operational language. Show raw JSON only if the user asks.
- Decode every sub-query. For formula alerts, explain each builder query and the formula. Don't gloss over a formula as "error rate" — show the user what A, B, and F1 each compute.
- Anchor to the actual config. Don't speculate based on the alert name. If the name says "checkout" but the filter targets payments, call that out.
- No live queries by default. Do not run the alert's query unless the user asks. Explain is about understanding the rule, not inspecting current data.
- Paginate alert listing. Before concluding an alert does not exist, walk every page of signoz:signoz_list_alert_rules.
- SigNoz evaluation model. SigNoz uses evalWindow + frequency + matchType, not Prometheus-style for durations. Don't describe the evaluation model in Prometheus terms.
- Stay in scope. Investigation of a specific fire window belongs to signoz-investigating-alerts. Modifications belong to signoz:signoz_update_alert.
Examples
User: "Explain my checkout error rate alert"
Agent:
- signoz:signoz_list_alert_rules (paginated) → finds "High Error Rate — Checkout" id 42.
- signoz:signoz_get_alert id=42 → traces formula (A errored spans / B total spans × 100), single critical threshold at 5%, op=1, matchType=1, channel pagerduty-oncall.
- signoz:signoz_get_alert_history id=42 lookback=7d → fired 3 times, last fire 2h ago.
- Replies: TL;DR: Fires when checkout error rate (errored spans / total spans) exceeds 5% at any point in the last 5 min, pages pagerduty-oncall. Fired 3 times in the last 7d (last 2h ago).
What it watches — traces from service.name = 'checkout'. Query A counts spans with hasError = true, query B counts all spans, F1 = A × 100 / B is the error percentage; the alert triggers on F1.
When it fires — op=above, matchType=at_least_once, target 5%. Checks every 1 min over the last 5 min — a single elevated minute fires it.
Notable — no recoveryTarget (could flap near 5%); no alertOnAbsent (a checkout outage that drops traffic to zero won't trigger this rule).
- Offers: "Want me to investigate the fire from 2h ago?"
User: "Explain alert rule 17 — is it well configured?"
Agent:
- signoz:signoz_get_alert id=17 → metrics, system.cpu.utilization, two thresholds (warning 80%, critical 90%), matchType=3 (on_average), filter deployment.environment.name = 'production', groupBy host.name.
- signoz:signoz_get_alert_history id=17 → has not fired in last 7d.
- The user asked "is it well configured?" — the TL;DR leads with the verdict, not a generic description. Then explanation + full audit: TL;DR: Mostly well-configured (dual-threshold + on_average is the right call for CPU), but two gaps to fix: (1) no alertOnAbsent — a crashed agent silently stops reporting and this rule won't catch it; (2) annotations are default — a runbook URL would help on-call. Has not fired in the last 7d (could mean production is healthy or thresholds too loose).
What it watches — system.cpu.utilization on deployment.environment.name = 'production', grouped by host.name (fires per host).
When it fires — op=above, matchType=on_average, two tiers (warning 80%, critical 90%) over a 5-min window. Short spikes don't fire — smoothing is correct for CPU.
Notable concerns — same as TL;DR fixes above; nothing else non-default.
- Offers next steps.
User: "Help me understand this anomaly detection alert" (alert id 88)
Agent:
- signoz:signoz_get_alert id=88 → ruleType: anomaly_rule, algorithm=zscore, target 3, seasonality=daily, metric http.server.request.duration, scope service.name = 'api-gateway'.
- History: fired 1 time in last 7d.
- Replies: TL;DR: Fires when api-gateway request latency deviates by more than 3 standard deviations (not raw latency, not a fixed value) from its learned daily pattern. Fired once in the last 7d.
What it watches — http.server.request.duration for service.name = 'api-gateway', evaluated as a Z-score anomaly with daily seasonality — the model learns the typical pattern for each hour of day, so peak-hour latency won't false-trigger if it matches the historical norm for that hour.
When it fires — when |Z-score| > 3, i.e. the value is more than 3 standard deviations away from the expected pattern. Lower target → more sensitive (more noise); higher → only extreme deviations. The threshold is not in seconds or milliseconds.
- Offers to investigate the recent fire.