signoz-investigating-alerts
Alert Investigate
Diagnose why a SigNoz alert fired. The skill correlates the alert's own
signal with neighbor signals around the fire window, and surfaces a
ranked list of likely causes with supporting evidence. It is the
companion to `signoz-explaining-alerts`: explain decodes the rule
statically; investigate diagnoses a specific incident.
Prerequisites
This skill calls SigNoz MCP server tools heavily (`signoz:signoz_get_alert`,
`signoz:signoz_get_alert_history`, `signoz:signoz_execute_builder_query`,
`signoz:signoz_query_metrics`, `signoz:signoz_search_traces`, `signoz:signoz_search_logs`,
`signoz:signoz_get_trace_details`, etc.). Before running the workflow,
confirm the `signoz:signoz_*` tools are available. If they are not, the
SigNoz MCP server is not installed or configured; stop and direct
the user to set it up: https://signoz.io/docs/ai/signoz-mcp-server/.
The investigation depends on correlating multiple MCP queries; without
the server there is no way to ground the analysis.
When to use
Use this skill when the user wants to:
- Understand why a specific alert fired.
- Find the root cause of a recent incident triggered by an alert.
- Correlate the alert's signal with related metrics, traces, and logs.
- Distinguish "real signal" fires from flapping or threshold-mistuning.
Do NOT use when the user wants to:
- Understand what an alert is configured to monitor → `signoz-explaining-alerts`.
- Create a new alert → `signoz-creating-alerts`.
- Modify an alert (raise threshold, add hysteresis) → call `signoz:signoz_update_alert` directly.
- Run a free-form ad-hoc investigation without an alert as the anchor → `signoz-generating-queries`.
Required inputs
| Input | Required | Source if missing |
|---|---|---|
| Alert identifier (rule ID or name) | yes | |
| Time window | no | default to most recent fire from `signoz:signoz_get_alert_history` |
If the alert name is fuzzy, this skill is best-effort (read-only):
- Call `signoz:signoz_list_alert_rules`, paginate, fuzzy-match the name.
- State the interpretation: "Investigating fire of 'High Error Rate - Checkout' (id 42) at 14:32 UTC. If you meant a different alert or fire, tell me."
- Proceed.
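The fuzzy-match step above can be sketched in Python. This is a minimal illustration, not the skill's actual matcher: it assumes the rules from `signoz:signoz_list_alert_rules` have already been paged into a list of dicts with hypothetical `id` and `name` keys, and the `min_score` cutoff and the word-containment bonus are arbitrary choices for the sketch.

```python
from difflib import SequenceMatcher


def best_alert_match(query, rules, min_score=0.5):
    """Return (rule, score) for the closest rule name, or None.

    `rules` is a hypothetical list of {"id": ..., "name": ...} dicts;
    the real MCP response shape may differ.
    """
    def score(name):
        q, n = query.lower(), name.lower()
        ratio = SequenceMatcher(None, q, n).ratio()
        # Reward queries whose words all appear somewhere in the rule name.
        words = q.split()
        if words and all(w in n for w in words):
            ratio = max(ratio, 0.9)
        return ratio

    if not rules:
        return None
    best = max(rules, key=lambda r: score(r["name"]))
    best_score = score(best["name"])
    return (best, best_score) if best_score >= min_score else None


# Hypothetical rule list for illustration only.
rules = [
    {"id": 42, "name": "High Error Rate - Checkout"},
    {"id": 88, "name": "CPU Usage High"},
]
match = best_alert_match("checkout error rate", rules)
```

Returning the score alongside the rule makes it easy to state the interpretation back to the user and to ask for confirmation when the match is weak.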
If the alert has never fired in the lookback window, stop: there is
nothing to investigate. Respond with:
"Alert '[name]' has not fired in the last 7d, so there is no fire window to investigate. Use `signoz-explaining-alerts` to walk through the rule, or check whether the alert is enabled."
Workflow
The investigation runs in three tiers with strict early-stop gates.
Tier 1 always runs. Tier 2 runs only if tier 1 confirms a real fire.
Tier 3 runs only if tier 2 surfaces correlated anomalies. Skipping the
gates produces hundreds of unnecessary trace/log queries on quiet
alerts.
Step 1: Resolve alert + fire window (Tier 0)
- Resolve the alert id via `signoz:signoz_list_alert_rules` (paginated) if not given.
- Call `signoz:signoz_get_alert` for the full rule config, which is needed to know what query, threshold, and resource scope the alert evaluated.
- Call `signoz:signoz_get_alert_history` with a 7d lookback. From the response:
  - Pick the fire window. Default to the most recent fire. If the
    user passed an explicit time window via `$ARGUMENTS[1]`, honor it.
  - Note the fire pattern:
    - `one-off` → single fire with a long quiet period before/after.
    - `sustained` → fires that stayed firing for ≥ 1 evaluation cycle.
    - `flapping` → ≥ 3 fires within a 1h window, alternating fire/resolve.
    - `recurring` → fires at regular intervals (cron-like, e.g., every hour).
  - The pattern tells you what to expect from tiers 2/3.
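The fire-pattern labels from Step 1 could be derived from the alert history roughly as follows. This is a sketch under stated assumptions: fires are given as hypothetical `(start, end)` datetime pairs sorted by start time, and the precedence between labels (flapping, then recurring, then sustained, then one-off) and the 10% regularity tolerance are choices made for this illustration, not rules from the skill.

```python
from datetime import datetime, timedelta


def classify_fire_pattern(fires, eval_interval, tolerance=0.1):
    """Classify a list of (start, end) fires, sorted by start time.

    The label precedence and the regularity tolerance are assumptions
    of this sketch.
    """
    if not fires:
        return None
    starts = [s for s, _ in fires]
    # flapping: >= 3 fires whose starts fall inside some 1h window
    for i in range(len(starts) - 2):
        if starts[i + 2] - starts[i] <= timedelta(hours=1):
            return "flapping"
    # recurring: >= 3 fires at near-constant intervals
    if len(starts) >= 3:
        gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
        mean = sum(gaps) / len(gaps)
        if all(abs(g - mean) <= tolerance * mean for g in gaps):
            return "recurring"
    # sustained: the most recent fire lasted at least one evaluation cycle
    start, end = fires[-1]
    if end - start >= eval_interval:
        return "sustained"
    return "one-off"


t0 = datetime(2024, 1, 1, 14, 32)
single_long_fire = [(t0, t0 + timedelta(minutes=8))]
pattern = classify_fire_pattern(single_long_fire, timedelta(minutes=1))
```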
Step 2: Tier 1 — what fired and how hard
This tier always runs. It establishes the fire is real (vs. transient
threshold tickle or flap) and quantifies the magnitude.
- Re-run the alert's primary query over a window centered on the fire
  start: `[fire_start - 30m, fire_start + 30m]`.
  - Use `signoz:signoz_execute_builder_query` for builder/formula alerts.
  - Use `signoz:signoz_query_metrics` for PromQL alerts.
- Compute:
  - Peak value during the fire window.
  - Threshold breach magnitude: `(peak - threshold) / threshold * 100` for "above" alerts, inverted for "below".
  - Fire duration: how long the breach lasted.
  - Pre-fire baseline: average value in the 30m before fire start.
- Early-stop gate: if the breach magnitude is < 10% over the threshold AND the fire duration is < 1 evaluation window, classify as "marginal fire"; the alert may be too sensitive. Skip tiers 2 and 3 and go to Step 5 with a single hypothesis: "threshold may be too tight, recommend tuning."
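The Tier 1 arithmetic and its early-stop gate can be written out as a short Python sketch. The function names are hypothetical, the threshold is assumed to be non-zero, and reading "inverted for below" as `(threshold - peak) / threshold * 100` is an interpretation made for this example.

```python
def breach_magnitude_pct(peak, threshold, direction="above"):
    """Percent by which the peak crossed the threshold (threshold != 0)."""
    if direction == "above":
        return (peak - threshold) / threshold * 100
    # "below" alerts: invert so a deeper dip yields a larger positive value
    return (threshold - peak) / threshold * 100


def is_marginal_fire(peak, threshold, fire_duration_s, eval_window_s,
                     direction="above"):
    """Early-stop gate: breach < 10% AND duration < 1 evaluation window."""
    magnitude = breach_magnitude_pct(peak, threshold, direction)
    return magnitude < 10 and fire_duration_s < eval_window_s


# Error rate peaking at 12.4% against a 5% threshold.
checkout_breach = round(breach_magnitude_pct(12.4, 5), 1)
```

With a peak of 12.4 against a threshold of 5, the breach is 148% over, far past the 10% bar, so the gate does not trigger and Tier 2 runs.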
Step 3: Tier 2 — neighbor signals vs baseline
Run only if Tier 1 confirms a real breach. Pull related signals for the
same resource scope as the alert and compare the fire window to a
baseline window.
- Pick a baseline window. Use the same hour, previous day (`fire_start - 24h, fire_start - 24h + fire_duration`). If the alert fired during a known-anomalous time (deploy, weekly job), note it in the output but still proceed.
- Look up neighbor signals for the alert's resource type. See `references/neighbor-signals.md` for the lookup table. Common cases:
  - Service-level alert (`service.name = X`): pull error rate, p95/p99 latency, request throughput, dependency error rates if trace data is available.
  - Host / VM alert (`host.name = X`): CPU, memory, disk I/O, network I/O.
  - K8s pod / namespace alert: pod restarts, container CPU/memory limits, node pressure, recent rollouts.
- For each neighbor signal:
  - Query both windows (fire + baseline) via `signoz:signoz_execute_builder_query` or `signoz:signoz_query_metrics`.
  - Compute the delta (% change in fire window vs baseline).
  - Rank by absolute delta.
- Early-stop gate: if no neighbor signal shows ≥ 25% deviation from baseline, classify as "isolated fire: the alert's own signal moved but nothing else did." This is unusual and worth surfacing. Skip Tier 3 and go to Step 5 with hypotheses focused on the alert's own query (likely causes: data source change, instrumentation change, downstream silent failure that only shows in this metric).
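The Tier 2 comparison reduces to three small computations: the baseline window, the per-signal delta, and the 25% gate. The Python sketch below assumes each neighbor signal has already been reduced to one aggregate value per window; the signal names and the throughput figures are hypothetical, and skipping zero-baseline signals is a simplification of this example.

```python
from datetime import datetime, timedelta


def baseline_window(fire_start, fire_duration):
    """Same hour, previous day: fire_start - 24h to fire_start - 24h + fire_duration."""
    start = fire_start - timedelta(hours=24)
    return start, start + fire_duration


def rank_neighbor_deltas(signals):
    """signals maps name -> (fire_value, baseline_value).

    Returns [(name, pct_delta)] sorted by absolute delta, largest first.
    Signals with a zero baseline are skipped in this sketch because the
    percent change is undefined.
    """
    deltas = []
    for name, (fire, base) in signals.items():
        if base == 0:
            continue
        deltas.append((name, (fire - base) / base * 100))
    return sorted(deltas, key=lambda d: abs(d[1]), reverse=True)


def is_isolated_fire(ranked, min_deviation_pct=25):
    """Early-stop gate: no neighbor deviates >= 25% from baseline."""
    return not any(abs(delta) >= min_deviation_pct for _, delta in ranked)


# Hypothetical aggregates: (fire window value, baseline window value).
signals = {
    "p99_latency_ms": (4100, 320),
    "payments_error_rate_pct": (18, 0.2),
    "throughput_rps": (58, 100),
}
ranked = rank_neighbor_deltas(signals)
```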
Step 4: Tier 3 — traces and logs at the fire window
Run only if Tier 2 found correlated neighbor anomalies. Drill into
specific failing operations.
- Traces (if the alert is service-scoped and traces are available):
  - Call `signoz:signoz_search_traces` for the fire window with filter: `service.name = <scope>` AND `hasError = true`. Cap at top 20.
  - Group results by `name` (operation) and `error_message`. Surface the top 3 by frequency with a representative trace ID for each.
  - Optionally call `signoz:signoz_get_trace_details` on one representative to extract specific span attributes (DB statement, downstream URL, status code).
- Logs for the fire window:
  - Call `signoz:signoz_search_logs` with filter: `<scope_filter>` AND `severity_text IN ('ERROR', 'FATAL')`. Cap at top 20 most recent.
  - Group by `body` pattern (or `exception.type` if present). Surface the top 3 distinct messages with counts.
- Cross-reference: do the traces and logs point at the same downstream service, dependency, or code path? If so, that becomes the leading hypothesis.
See `references/baseline-comparison.md` for query templates that pair
fire-window and baseline-window calls cleanly.
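The trace grouping in Tier 3 can be illustrated with a small Python sketch. It assumes the search results were already flattened into dicts with hypothetical `trace_id`, `name`, and `error_message` keys; the actual `signoz:signoz_search_traces` response may be shaped differently, and the IDs below are placeholders for the example only.

```python
from collections import Counter


def top_error_groups(traces, limit=3):
    """Group error traces by (operation name, error message).

    Returns up to `limit` groups, most frequent first, each with a count
    and one representative trace ID (the first one seen for that group).
    """
    counts = Counter()
    representative = {}
    for t in traces:
        key = (t["name"], t["error_message"])
        counts[key] += 1
        representative.setdefault(key, t["trace_id"])
    return [
        {"operation": op, "error_message": msg, "count": n,
         "trace_id": representative[(op, msg)]}
        for (op, msg), n in counts.most_common(limit)
    ]


# Placeholder records, not real SigNoz identifiers.
traces = [
    {"trace_id": "t1", "name": "POST /checkout/submit",
     "error_message": "context deadline exceeded calling payments-api"},
    {"trace_id": "t2", "name": "POST /checkout/submit",
     "error_message": "context deadline exceeded calling payments-api"},
    {"trace_id": "t3", "name": "GET /cart", "error_message": "not found"},
]
groups = top_error_groups(traces)
```

The same grouping applies to logs by swapping the key for the `body` pattern or `exception.type`.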
Step 5: Build the structured output
Use this exact section order. Lead with a TL;DR — engineers under
pressure scan the top first and stop reading once they have what
they need. Compression plus proof: every claim cites the MCP query
that produced it; no generic "check logs / verify connectivity"
filler.
1. TL;DR — one or two sentences, no more. Leading hypothesis,
overall confidence, blast radius, and the single most useful next
action. Example:
"checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC; most likely cause is payments-api timing out (high confidence). Open trace `7af3a09b…` to see the failing call."
If no hypothesis reaches medium confidence, the leading line is
"No clear root cause found." rather than a low-confidence guess
dressed up as the answer.
2. What fired
The alert (id, name), the fire window (absolute UTC + relative),
peak magnitude ("error rate hit 12.4% vs. 5% threshold — 148% over"),
fire duration, and the fire pattern (`one-off` / `sustained` /
`flapping` / `recurring` / `marginal`).
3. Investigation trail
A scannable list of what was checked, with ✅ for confirmed signals
and ❌ for ruled out, each followed by a one-line finding. The point
is that the reader can see what work the AI did and what it found —
this is where trust is earned. Example:
- ✅ Tier 1 — peak error rate 12.4%, fire was real (not marginal).
- ✅ Tier 2 — payments error rate +8900%, p99 +1180%; downstream cascade.
- ❌ CPU / memory pressure — flat through the fire window.
- ✅ Tier 3 — 30 error traces all hit payments-api, same message.
4. Likely causes (ranked, max 3)
Each cause has three parts:
- Hypothesis — one sentence, specific. Bad: "service is unhealthy". Good: "checkout is timing out on calls to payments-api".
- Evidence — the supporting numbers from tiers 1/2/3, with the underlying query inline so the user can re-run it. State the neighbor signal, the delta vs baseline, the trace/log pattern that supports it.
- Confidence: `high` requires ≥2 of: temporal precedence, topology / dependency edge, shared service or entity, correlated metric/log/trace evidence, recent deploy or config change. `medium` is one tier's evidence with at least one of those signals. `low` is a single signal moved with no corroboration; in that case label it a "co-occurring signal," not a cause.
If only Tier 1 ran (marginal fire / no neighbor anomalies), output
fewer hypotheses with `low` confidence and explicitly call out the
limitation.
5. Ruled out
Short but explicit. List candidates the evidence eliminated and the
one-line reason why. Skip the section if there's nothing meaningful
to rule out — but if you considered something and dropped it, say so
here so the user doesn't waste time re-checking it.
6. Suggested next steps
Action items the user can take. Be concrete and use SigNoz-native
handles so the user can act immediately:
- Specific trace, dashboard, or alert to open
  (e.g., "open trace `7af3a09b…` in the SigNoz UI").
- Specific query to run with `signoz-generating-queries`; paste the exact filter and time window.
- "Tune this alert" if the fire was marginal; name the field
  (`matchType`, `target`, `recoveryTarget`) and the change to make via `signoz:signoz_update_alert`.
- "Open an incident" or "page the owning team" if the cause is cross-service.
Do not pad with generic advice ("verify connectivity", "check
dashboards") — that's noise during an active incident.
Mirroring as navigation chips. Mirror up to 3 of these "Suggested
next steps" as host follow-up intents — the most actionable,
alert-scoped ones. Keep the rest in the report prose so the user has
the full picture. The chip surface is capped; the prose is not.
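The confidence rule from the "Likely causes" section, together with the "No clear root cause found." fallback for the TL;DR, can be sketched as below. This is a simplified reading for illustration: the signal keys are hypothetical names for the five criteria listed above, and "one tier's evidence" is approximated as exactly one matching signal.

```python
CAUSAL_SIGNALS = {
    "temporal_precedence",
    "topology_edge",
    "shared_entity",
    "correlated_evidence",
    "recent_change",
}


def confidence_label(signals_present):
    """Map the set of converging signals to a confidence label."""
    hits = CAUSAL_SIGNALS & set(signals_present)
    if len(hits) >= 2:
        return "high"
    if len(hits) == 1:
        return "medium"
    return "low"  # report as a "co-occurring signal", not a cause


def leading_line(hypotheses):
    """hypotheses: list of (text, label) pairs, best first."""
    for text, label in hypotheses:
        if label in ("high", "medium"):
            return text
    return "No clear root cause found."


label = confidence_label({"topology_edge", "temporal_precedence", "shared_entity"})
```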
Out of scope (v1)
- Deployment / config-change correlation — SigNoz MCP does not expose a deployments tool; do not fabricate one. If the user mentions a recent deploy, surface it as context but don't claim it caused the fire without the signal evidence.
- Cross-service blast-radius walking — investigating downstream callers of the alert's service. Out of scope to keep context bounded.
- Long-horizon historical baselines: Tier 2 compares to one
  prior-day window, not to weekly/monthly seasonality. If the user
  says "is this normal for a Friday afternoon", suggest an anomaly
  alert (`signoz-creating-alerts` with `anomaly_rule`).
Guardrails
- Three-tier early-stop is mandatory. Skipping the gates pulls hundreds of traces/logs on quiet alerts and explodes context. The gates are not optional optimizations.
- Anchor every claim to an MCP query result. No speculation. If evidence is missing, lower confidence and say so.
- Show the supporting query with each hypothesis so the user can reproduce and dig deeper.
- Compression plus proof. TL;DR is one or two sentences max; the full report is a triage card, not a postmortem. Engineers under pressure should be able to skim the top and act. Every section earns its place by adding evidence the user couldn't already see in the alert payload.
- Correlation ≠ causation. Label something a cause only when at least two of the following converge: temporal precedence (signal moved before symptom), topology / dependency edge, shared service or entity, correlated metric/log/trace evidence, or a recent deploy/config change. A single time-aligned anomaly is a "co-occurring signal," not a cause — say so explicitly.
- Don't restate the alert or recommend the obvious. "Check logs", "verify connectivity", "investigate dashboards" — the reader of this output already knows they need to. Replace generic suggestions with specific queries, traces, or filters they can run immediately.
- No fabricated identifiers. Trace IDs, span names, alert rule IDs, channel names, deploy IDs — every identifier in the output must come from a real MCP response. Don't invent placeholders that look plausible.
- Honest uncertainty wins. If no hypothesis reaches medium confidence, the answer is "No clear root cause found — here's what we checked and what's ruled out." Do not promote a low-confidence guess to the leading hypothesis just to sound useful. False positives waste active incident time more than false negatives.
- Prefer resource-attribute filters in every drill-down query. This is the SigNoz MCP guideline and it directly affects query speed at scale.
- Do not modify any alert. Investigate is read-only. If the user
  says "and tighten this alert", surface that as a next-step
  recommendation; do not call `signoz:signoz_update_alert`.
- Stay in scope. Static rule explanation belongs to
  `signoz-explaining-alerts`. Cause analysis without an alert anchor belongs to
  `signoz-generating-queries`.
- Time zones. Always state fire windows in UTC alongside relative time ("14:32 UTC, 2h ago") so autonomous and interactive consumers agree on the window.
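The time-zone guardrail can be made concrete with a small formatter. The Python sketch below is one possible rendering, assuming both timestamps are timezone-aware; the function name and the coarse minute/hour/day rounding are choices made for this example.

```python
from datetime import datetime, timedelta, timezone


def format_fire_time(fire_start, now):
    """Render a fire start as 'HH:MM UTC, <relative> ago'.

    Both arguments must be timezone-aware datetimes.
    """
    fire_utc = fire_start.astimezone(timezone.utc)
    elapsed = int((now - fire_start).total_seconds())
    if elapsed < 3600:
        relative = f"{elapsed // 60}m ago"
    elif elapsed < 86400:
        relative = f"{elapsed // 3600}h ago"
    else:
        relative = f"{elapsed // 86400}d ago"
    return f"{fire_utc:%H:%M} UTC, {relative}"


# A fire recorded in a UTC-5 local zone still renders in UTC.
local = timezone(timedelta(hours=-5))
fire = datetime(2024, 1, 1, 9, 32, tzinfo=local)
now = datetime(2024, 1, 1, 16, 32, tzinfo=timezone.utc)
stamp = format_fire_time(fire, now)
```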
Examples
User: "Why did the checkout error rate alert fire?"
Agent:
- Resolves alert: "High Error Rate - Checkout" (id 42).
- `signoz:signoz_get_alert_history` → most recent fire 2h ago at 14:32 UTC, sustained for 8m, single fire (not flapping).
- Tier 1: re-runs error-rate formula over `[14:02, 15:02]`. Peak error rate 12.4% (vs 5% threshold, 148% over). Pre-fire baseline 0.3%. Real fire, not marginal.
- Tier 2: pulls neighbor signals for `service.name = checkout`:
  - p99 latency: 4.1s vs 320ms baseline (+1180%).
  - Throughput: -42% (drop).
  - Downstream `payments` error rate: 18% vs 0.2% baseline (+8900%).
  - CPU/memory: flat (no resource pressure).
- Tier 3: traces for `service.name = checkout, hasError = true` in the fire window: top operation `POST /checkout/submit`, top error message "context deadline exceeded calling payments-api". 30 traces, all hitting the same downstream URL. Logs show matching "payments client timeout" lines, 142 occurrences.
- Output:
  TL;DR: checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC. Most likely cause: payments-api timing out (high confidence: converging trace + log + neighbor evidence). Open trace `7af3a09b…` to see the failing call.
  - What fired: alert 42 fired 2h ago at 14:32 UTC, sustained 8m. Error rate peaked at 12.4% (148% over threshold).
  - Investigation trail:
    - ✅ Tier 1: peak 12.4% vs 5% threshold, pre-fire baseline 0.3%. Real fire.
    - ✅ Tier 2: payments error rate +8900%, p99 latency +1180%, throughput −42%.
    - ❌ CPU / memory pressure on checkout: flat.
    - ✅ Tier 3: 30 error traces all hit payments-api with `context deadline exceeded`; 142 matching timeout logs.
  - Likely causes (high confidence): payments service errors cascading into checkout. Evidence converges across topology (checkout → payments edge), temporal precedence (payments errors lead checkout p99), and shared entity (every error trace targets the payments-api URL).
  - Ruled out: checkout-side resource pressure (CPU/memory flat); upstream traffic spike (throughput dropped, didn't spike).
  - Next steps: open trace `7af3a09b…` in the SigNoz UI; if payments has its own alert, run `signoz-explaining-alerts` on it, otherwise `signoz-generating-queries` for payments error rate over the last 4h.
User: "Investigate the flapping CPU alert on prod-api-3"
Agent:
- Resolves alert (id 88, host.name = prod-api-3).
- History: 7 fires in last 1h, alternating fire/resolve every 8-12 minutes; flapping pattern.
- Tier 1: peak 84% (threshold 80%, only 5% over). Each fire lasted 2-4 minutes. Marginal: the value hovered near threshold. `matchType = at_least_once` made each blip trigger.
- Early-stop kicks in. Skip tiers 2/3.
- Output:
  TL;DR: alert 88 has been flapping on `prod-api-3`: 7 fires in the last 1h, each 2–4m, all within 5% of the 80% threshold. No clear root cause; this is threshold tuning, not an incident. Switch `matchType` to `on_average` or add hysteresis to stop the noise.
  - What fired: alert 88, host `prod-api-3`, fire pattern `flapping`. 7 fires in the last 1h, each 2–4m. Peak 84% (5% over the 80% threshold).
  - Investigation trail:
    - ✅ Tier 1: every fire was within 5% of threshold; duration short; baseline already at 70–75%. Marginal fire; early-stop triggered, Tier 2/3 skipped.
  - Likely causes (low / co-occurring signal only): threshold tuned too tight or `matchType` is too sensitive. Evidence: every fire was within 5% of threshold; baseline already runs at 70–75%. Not promoted to a "cause": single signal, no corroboration.
  - Ruled out: real CPU saturation incident (peaks too small and short-lived; baseline already near threshold).
  - Next steps: change `matchType` to `on_average` (smooths transient spikes) OR raise threshold to 85% with hysteresis (`recoveryTarget: 75`). Use `signoz:signoz_update_alert` to apply.
User: "RCA for the log volume alert that fired last night"
Agent:
- Resolves alert: "Error Log Volume Spike" (id 14, no service filter; groupBy `service.name`).
- History: fired at 03:12 UTC, sustained 22m, broke down by service in the alert annotations: `service.name = inventory` was the firing series.
- Tier 1: re-runs log count for inventory in fire window. Peak 3,400 errors/min vs 1,000/min threshold (240% over). Pre-fire baseline 12/min. Real, large fire.
- Tier 2: neighbor signals for `service.name = inventory`:
  - Request error rate: +600%.
  - p99 latency: +30% (mild).
  - CPU: -80% (collapsed). Memory: -60%.
  - Pod restarts (k8s): 4 in fire window.
- Tier 3: logs for inventory in fire window. Top message: "OOMKilled restarting" (1,200 occurrences). Top trace error: graceful-shutdown exceptions.
- Output:
  TL;DR: log volume alert 14 fired at 03:12 UTC for `service.name = inventory`, sustained 22m at 240% over threshold. Most likely cause: inventory pods OOM-killed and restarted 4 times (high confidence). Check container memory limits for the inventory deployment.
  - What fired: alert 14 fired at 03:12 UTC for service `inventory`, sustained 22m, 240% over threshold.
  - Investigation trail:
    - ✅ Tier 1: peak 3,400 errors/min vs 1,000/min threshold; pre-fire baseline 12/min. Real fire.
    - ✅ Tier 2: request error rate +600%; CPU/memory collapsed (−80%/−60%); 4 pod restarts in window.
    - ❌ p99 latency: only +30%, not a latency-driven incident.
    - ✅ Tier 3: top log message "OOMKilled restarting" (1,200 occurrences); top trace error: graceful-shutdown exceptions.
  - Likely causes (high confidence): inventory pods OOM-killed
    and restarted 4 times during the window. Evidence converges
    across topology (single service), temporal precedence (memory
    fell to zero before error spike), shared entity (all log lines
    from `service.name = inventory`), and a single coherent pattern (OOM → restart → graceful-shutdown noise).
  - Ruled out: a true application error-rate change (errors are restart noise, not request-path failures); upstream traffic surge (throughput unchanged).
  - Next steps: check container memory limits for inventory pods; review recent deploys; consider whether the alert should exclude restart-related error patterns or whether the underlying OOM is the real concern.
Additional resources
- `references/neighbor-signals.md`: lookup table mapping resource type (service / host / k8s) to the neighbor signals to pull in Tier 2.
- `references/baseline-comparison.md`: query templates that pair fire-window and baseline-window calls cleanly, including how to format `signoz:signoz_execute_builder_query` for both.
- `signoz-explaining-alerts` skill: to decode the rule before investigating, if the user is unfamiliar with what the alert monitors.
- `signoz-generating-queries` skill: for ad-hoc follow-up queries on the same resource scope.