signoz-investigating-alerts

Alert Investigate

Diagnose why a SigNoz alert fired. The skill correlates the alert's own signal with neighbor signals around the fire window and surfaces a ranked list of likely causes with supporting evidence. It is the companion to `signoz-explaining-alerts`: explain decodes the rule statically; investigate diagnoses a specific incident.

Prerequisites

This skill calls SigNoz MCP server tools heavily (`signoz:signoz_get_alert`, `signoz:signoz_get_alert_history`, `signoz:signoz_execute_builder_query`, `signoz:signoz_query_metrics`, `signoz:signoz_search_traces`, `signoz:signoz_search_logs`, `signoz:signoz_get_trace_details`, etc.). Before running the workflow, confirm the `signoz:signoz_*` tools are available. If they are not, the SigNoz MCP server is not installed or configured: stop and direct the user to set it up at https://signoz.io/docs/ai/signoz-mcp-server/. The investigation depends on correlating multiple MCP queries; without the server there is no way to ground the analysis.

When to use

Use this skill when the user wants to:

Understand why a specific alert fired.
Find the root cause of a recent incident triggered by an alert.
Correlate the alert's signal with related metrics, traces, and logs.
Distinguish "real signal" fires from flapping or threshold-mistuning.

Do NOT use when the user wants to:

Understand what an alert is configured to monitor → `signoz-explaining-alerts`.
Create a new alert → `signoz-creating-alerts`.
Modify an alert (raise threshold, add hysteresis) → call `signoz:signoz_update_alert` directly.
Run a free-form ad-hoc investigation without an alert as the anchor → `signoz-generating-queries`.

Required inputs

| Input | Required | Source if missing |
| --- | --- | --- |
| Alert identifier (rule ID or name) | yes | `$ARGUMENTS[0]` or recent context |
| Time window | no | default to most recent fire from `signoz:signoz_get_alert_history` |

If the alert name is fuzzy, this skill is best-effort (read-only):

Call `signoz:signoz_list_alert_rules`, paginate, fuzzy-match the name.
State the interpretation: "Investigating fire of 'High Error Rate — Checkout' (id 42) at 14:32 UTC. If you meant a different alert or fire, tell me."
Proceed.

If the alert has never fired in the lookback window, stop: there is nothing to investigate. Respond with:

"Alert '[name]' has not fired in the last 7d, so there is no fire window to investigate. Use `signoz-explaining-alerts` to walk through the rule, or check whether the alert is enabled."
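
The fuzzy-match step for resolving an alert name could be sketched as below. The skill does not specify a matching algorithm, so the token-overlap (Jaccard) score, the `min_score` cutoff, and the shape of the `rules` list are all assumptions made for illustration; the rule names other than the checkout one are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a name and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_rule_match(query: str, rules: list[dict], min_score: float = 0.3):
    """Return (rule, score) for the best token-overlap match, or None."""
    query_tokens = tokens(query)
    best = None
    for rule in rules:
        name_tokens = tokens(rule["name"])
        union = query_tokens | name_tokens
        score = len(query_tokens & name_tokens) / len(union) if union else 0.0
        if best is None or score > best[1]:
            best = (rule, score)
    if best is None or best[1] < min_score:
        return None  # nothing close enough: ask the user instead of guessing
    return best


# Hypothetical, already-paginated result of signoz:signoz_list_alert_rules.
rules = [
    {"id": "42", "name": "High Error Rate — Checkout"},
    {"id": "88", "name": "CPU Usage High"},
    {"id": "14", "name": "Error Log Volume Spike"},
]
match = best_rule_match("checkout error rate", rules)
```

Whatever scoring is used, the interpretation should still be stated back to the user before proceeding, since a fuzzy match can pick the wrong rule.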

Workflow

The investigation runs in three tiers with strict early-stop gates. Tier 1 always runs. Tier 2 runs only if tier 1 confirms a real fire. Tier 3 runs only if tier 2 surfaces correlated anomalies. Skipping the gates produces hundreds of unnecessary trace/log queries on quiet alerts.

Step 1: Resolve alert + fire window (Tier 0)

Resolve the alert id via `signoz:signoz_list_alert_rules` (paginated) if not given.
Call `signoz:signoz_get_alert` for the full rule config, needed to know what query, threshold, and resource scope the alert evaluated.
Call `signoz:signoz_get_alert_history` with a 7d lookback. From the response:
- Pick the fire window. Default to the most recent fire. If the user passed an explicit time window via `$ARGUMENTS[1]`, honor it.
- Note the fire pattern:
  - `one-off` → single fire with a long quiet period before/after.
  - `sustained` → fires that stayed firing for ≥ 1 evaluation cycle.
  - `flapping` → ≥ 3 fires within a 1h window, alternating fire/resolve.
  - `recurring` → fires at regular intervals (cron-like, e.g., every hour).
- The pattern tells you what to expect from tiers 2/3.
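
As a rough illustration, the fire-pattern labels above could be derived from the history response along these lines. This is a sketch under stated assumptions: the history is assumed to be already parsed into `(start, end)` pairs, and the precedence between labels, the 10% spacing `tolerance` for `recurring`, and the rule that separates `sustained` from `one-off` are illustrative choices the skill does not define.

```python
from datetime import datetime, timedelta


def classify_fire_pattern(fires, eval_interval, tolerance=0.1):
    """Classify fires, given as (start, end) datetime pairs sorted by start."""
    if not fires:
        return None  # never fired: nothing to classify
    starts = [start for start, _ in fires]
    # flapping: at least 3 fires starting within a 1h window
    for i in range(len(starts) - 2):
        if starts[i + 2] - starts[i] <= timedelta(hours=1):
            return "flapping"
    # recurring: at least 3 fires with near-equal spacing (assumed tolerance)
    if len(starts) >= 3:
        gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
        mean_gap = sum(gaps) / len(gaps)
        if all(abs(gap - mean_gap) <= tolerance * mean_gap for gap in gaps):
            return "recurring"
    # Otherwise judge the most recent fire by how long it stayed firing.
    start, end = fires[-1]
    return "sustained" if end - start >= eval_interval else "one-off"
```

Checking `flapping` first is a deliberate choice in this sketch: a flapping alert usually points at threshold tuning, so it is worth recognizing before any other label.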

Step 2: Tier 1 — what fired and how hard

This tier always runs. It establishes the fire is real (vs. transient threshold tickle or flap) and quantifies the magnitude.

Re-run the alert's primary query over a window centered on the fire start: `[fire_start - 30m, fire_start + 30m]`.
- Use `signoz:signoz_execute_builder_query` for builder/formula alerts.
- Use `signoz:signoz_query_metrics` for PromQL alerts.
Compute:
- Peak value during the fire window.
- Threshold breach magnitude: `(peak - threshold) / threshold * 100` for "above" alerts, inverted for "below".
- Fire duration: how long the breach lasted.
- Pre-fire baseline: average value in the 30m before fire start.
Early-stop gate: if the breach magnitude is < 10% over the threshold AND the fire duration is < 1 evaluation window, classify as "marginal fire": the alert may be too sensitive. Skip tiers 2 and 3 and go to Step 5 with a single hypothesis: "threshold may be too tight, recommend tuning."
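
The Tier 1 arithmetic and its early-stop gate can be written down in a few lines. The sketch below assumes the peak, threshold, and durations have already been extracted from the query result. The function names are hypothetical, and reading "inverted for below" as `(threshold - peak) / threshold` is an interpretation rather than something the skill spells out.

```python
from datetime import datetime, timedelta


def tier1_window(fire_start):
    """Query window centered on the fire start: [fire_start - 30m, fire_start + 30m]."""
    return fire_start - timedelta(minutes=30), fire_start + timedelta(minutes=30)


def breach_magnitude_pct(peak, threshold, direction="above"):
    """Percent by which the peak breached the threshold (threshold must be non-zero)."""
    if direction == "above":
        return (peak - threshold) / threshold * 100
    # Assumed reading of "inverted for below": how far the value fell under.
    return (threshold - peak) / threshold * 100


def is_marginal_fire(peak, threshold, fire_duration, eval_window, direction="above"):
    """Early-stop gate: breach under 10% AND shorter than one evaluation window."""
    magnitude = breach_magnitude_pct(peak, threshold, direction)
    return magnitude < 10 and fire_duration < eval_window
```

With the numbers used in the examples in this document, a 12.4% peak against a 5% threshold is about 148% over, while an 84% peak against an 80% threshold is 5% over and qualifies as marginal only if each fire was also shorter than the alert's evaluation window.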

Step 3: Tier 2 — neighbor signals vs baseline

Run only if Tier 1 confirms a real breach. Pull related signals for the same resource scope as the alert and compare the fire window to a baseline window.

Pick a baseline window. Use the same hour, previous day (`fire_start - 24h, fire_start - 24h + fire_duration`). If the alert fired during a known-anomalous time (deploy, weekly job), note it in the output but still proceed.
Look up neighbor signals for the alert's resource type. See `references/neighbor-signals.md` for the lookup table. Common cases:
- Service-level alert (`service.name = X`): pull error rate, p95/p99 latency, request throughput, and dependency error rates if trace data is available.
- Host / VM alert (`host.name = X`): CPU, memory, disk I/O, network I/O.
- K8s pod / namespace alert: pod restarts, container CPU/memory limits, node pressure, recent rollouts.
For each neighbor signal:
- Query both windows (fire + baseline) via `signoz:signoz_execute_builder_query` or `signoz:signoz_query_metrics`.
- Compute the delta (% change in fire window vs baseline).
- Rank by absolute delta.
Early-stop gate: if no neighbor signal shows ≥ 25% deviation from baseline, classify as "isolated fire: the alert's own signal moved but nothing else did." This is unusual and worth surfacing. Skip Tier 3 and go to Step 5 with hypotheses focused on the alert's own query (likely causes: data source change, instrumentation change, downstream silent failure that only shows in this metric).
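
A minimal sketch of the Tier 2 comparison follows, assuming each neighbor signal has already been reduced to one value per window. The signal names and the `signals` dictionary shape are hypothetical, and skipping signals with a zero baseline is an assumption, since a percentage change is undefined there.

```python
from datetime import datetime, timedelta


def baseline_window(fire_start, fire_duration):
    """Same hour, previous day: (fire_start - 24h, fire_start - 24h + fire_duration)."""
    start = fire_start - timedelta(hours=24)
    return start, start + fire_duration


def pct_delta(fire_value, baseline_value):
    """Percent change of the fire window vs the baseline; None if undefined."""
    if baseline_value == 0:
        return None
    return (fire_value - baseline_value) / baseline_value * 100


def rank_neighbors(signals, min_deviation_pct=25.0):
    """signals maps name -> (fire_value, baseline_value).

    Returns (deltas ranked by absolute size, isolated flag). The flag is True
    when no signal deviates by at least min_deviation_pct, i.e. the early-stop
    gate applies and Tier 3 should be skipped.
    """
    deltas = []
    for name, (fire_value, baseline_value) in signals.items():
        delta = pct_delta(fire_value, baseline_value)
        if delta is not None:
            deltas.append((name, delta))
    deltas.sort(key=lambda item: abs(item[1]), reverse=True)
    isolated = all(abs(delta) < min_deviation_pct for _, delta in deltas)
    return deltas, isolated
```

Ranking by absolute delta means a sharp drop (for example, throughput falling) surfaces next to sharp increases instead of being buried at the bottom.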

Step 4: Tier 3 — traces and logs at the fire window

Run only if Tier 2 found correlated neighbor anomalies. Drill into specific failing operations.

Traces (if the alert is service-scoped and traces are available):
- Call `signoz:signoz_search_traces` for the fire window with filter `service.name = <scope>` AND `hasError = true`. Cap at top 20.
- Group results by `name` (operation) and `error_message`. Surface the top 3 by frequency with a representative trace ID for each.
- Optionally call `signoz:signoz_get_trace_details` on one representative to extract specific span attributes (DB statement, downstream URL, status code).
Logs for the fire window:
- Call `signoz:signoz_search_logs` with filter `<scope_filter>` AND `severity_text IN ('ERROR', 'FATAL')`. Cap at top 20 most recent.
- Group by `body` pattern (or `exception.type` if present). Surface the top 3 distinct messages with counts.
Cross-reference: do the traces and logs point at the same downstream service, dependency, or code path? If so, that becomes the leading hypothesis.

See `references/baseline-comparison.md` for query templates that pair fire-window and baseline-window calls cleanly.
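
The trace grouping in this step amounts to a frequency count keyed on operation and error message. The sketch below assumes the search results have been flattened into dictionaries. The `name` and `error_message` keys follow the field names used above, while `trace_id` and the overall record shape are assumptions about the response, not the documented MCP schema.

```python
from collections import Counter


def top_error_groups(spans, top_n=3):
    """Group error spans by (operation, error message) and keep the most frequent."""
    counts = Counter()
    representative = {}
    for span in spans:
        key = (span["name"], span["error_message"])
        counts[key] += 1
        # Keep the first trace ID seen for each group as its representative.
        representative.setdefault(key, span["trace_id"])
    return [
        {
            "operation": operation,
            "error_message": message,
            "count": count,
            "trace_id": representative[(operation, message)],
        }
        for (operation, message), count in counts.most_common(top_n)
    ]
```

The same shape works for logs by keying on the `body` pattern (or `exception.type`) instead. Because representatives are taken from the input, every trace ID in the report comes from a real response.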

Step 5: Build the structured output

Use this exact section order. Lead with a TL;DR — engineers under pressure scan the top first and stop reading once they have what they need. Compression plus proof: every claim cites the MCP query that produced it; no generic "check logs / verify connectivity" filler.

1. TL;DR: one or two sentences, no more. Leading hypothesis, overall confidence, blast radius, and the single most useful next action. Example:

"checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC. Most likely cause is payments-api timing out (high confidence). Open trace `7af3a09b…` to see the failing call."

If no hypothesis reaches medium confidence, the leading line is "No clear root cause found." rather than a low-confidence guess dressed up as the answer.

2. What fired: the alert (id, name), the fire window (absolute UTC + relative), peak magnitude ("error rate hit 12.4% vs. 5% threshold, 148% over"), fire duration, and the fire pattern (`one-off` / `sustained` / `flapping` / `recurring` / `marginal`).
3. Investigation trail: a scannable list of what was checked, with ✅ for confirmed signals and ❌ for ruled out, each followed by a one-line finding. The point is that the reader can see what work the AI did and what it found; this is where trust is earned. Example:

✅ Tier 1 — peak error rate 12.4%, fire was real (not marginal).
✅ Tier 2 — payments error rate +8900%, p99 +1180%; downstream cascade.
❌ CPU / memory pressure — flat through the fire window.
✅ Tier 3 — 30 error traces all hit payments-api, same message.

4. Likely causes (ranked, max 3): each cause has three parts:

Hypothesis: one sentence, specific. Bad: "service is unhealthy". Good: "checkout is timing out on calls to payments-api".
Evidence: the supporting numbers from tiers 1/2/3, with the underlying query inline so the user can re-run it. State the neighbor signal, the delta vs baseline, and the trace/log pattern that supports it.
Confidence: `high` requires ≥2 of: temporal precedence, topology / dependency edge, shared service or entity, correlated metric/log/trace evidence, recent deploy or config change. `medium` is one tier's evidence with at least one of those signals. `low` is a single signal that moved with no corroboration; in that case label it a "co-occurring signal," not a cause.

If only Tier 1 ran (marginal fire / no neighbor anomalies), output fewer hypotheses with `low` confidence and explicitly call out the limitation.
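
One way to keep the confidence labels honest is to derive them mechanically from the set of corroborating signals that were actually observed. The sketch below is a simplification under stated assumptions: the signal identifiers are made-up names for the five criteria listed above, and `medium` is reduced to "exactly one corroborating signal".

```python
CORROBORATING_SIGNALS = frozenset({
    "temporal_precedence",   # signal moved before the symptom
    "topology_edge",         # topology / dependency edge
    "shared_entity",         # shared service or entity
    "correlated_evidence",   # correlated metric/log/trace evidence
    "recent_change",         # recent deploy or config change
})


def confidence(observed: set[str]) -> str:
    """Map the observed corroborating signals to high / medium / low."""
    unknown = observed - CORROBORATING_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    if len(observed) >= 2:
        return "high"
    if len(observed) == 1:
        return "medium"
    return "low"


def report_label(level: str) -> str:
    """Low-confidence findings are reported as co-occurring signals, not causes."""
    return "co-occurring signal" if level == "low" else "likely cause"
```

Raising on unknown signal names is a design choice in this sketch: it prevents an invented criterion from quietly inflating the confidence level.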

5. Ruled out: short but explicit. List candidates the evidence eliminated and the one-line reason why. Skip the section if there's nothing meaningful to rule out, but if you considered something and dropped it, say so here so the user doesn't waste time re-checking it.

6. Suggested next steps: action items the user can take. Be concrete and use SigNoz-native handles so the user can act immediately:

Specific trace, dashboard, or alert to open (e.g., "open trace `7af3a09b…` in the SigNoz UI").
Specific query to run with `signoz-generating-queries`: paste the exact filter and time window.
"Tune this alert" if the fire was marginal: name the field (`matchType`, `target`, `recoveryTarget`) and the change to make via `signoz:signoz_update_alert`.
"Open an incident" or "page the owning team" if the cause is cross-service.

Do not pad with generic advice ("verify connectivity", "check dashboards") — that's noise during an active incident.

Mirroring as navigation chips. Mirror up to 3 of these "Suggested next steps" as host follow-up intents — the most actionable, alert-scoped ones. Keep the rest in the report prose so the user has the full picture. The chip surface is capped; the prose is not.

Out of scope (v1)

Deployment / config-change correlation: SigNoz MCP does not expose a deployments tool; do not fabricate one. If the user mentions a recent deploy, surface it as context but don't claim it caused the fire without the signal evidence.
Cross-service blast-radius walking: investigating downstream callers of the alert's service. Out of scope to keep context bounded.
Long-horizon historical baselines: Tier 2 compares to one prior-day window, not to weekly/monthly seasonality. If the user says "is this normal for a Friday afternoon", suggest an anomaly alert (`signoz-creating-alerts` with `anomaly_rule`).

Guardrails

Three-tier early-stop is mandatory. Skipping the gates pulls hundreds of traces/logs on quiet alerts and explodes context. The gates are not optional optimizations.
Anchor every claim to an MCP query result. No speculation. If evidence is missing, lower confidence and say so.
Show the supporting query with each hypothesis so the user can reproduce and dig deeper.
Compression plus proof. TL;DR is one or two sentences max; the full report is a triage card, not a postmortem. Engineers under pressure should be able to skim the top and act. Every section earns its place by adding evidence the user couldn't already see in the alert payload.
Correlation ≠ causation. Label something a cause only when at least two of the following converge: temporal precedence (signal moved before symptom), topology / dependency edge, shared service or entity, correlated metric/log/trace evidence, or a recent deploy/config change. A single time-aligned anomaly is a "co-occurring signal," not a cause — say so explicitly.
Don't restate the alert or recommend the obvious. "Check logs", "verify connectivity", "investigate dashboards" — the reader of this output already knows they need to. Replace generic suggestions with specific queries, traces, or filters they can run immediately.
No fabricated identifiers. Trace IDs, span names, alert rule IDs, channel names, deploy IDs — every identifier in the output must come from a real MCP response. Don't invent placeholders that look plausible.
Honest uncertainty wins. If no hypothesis reaches medium confidence, the answer is "No clear root cause found — here's what we checked and what's ruled out." Do not promote a low-confidence guess to the leading hypothesis just to sound useful. False positives waste active incident time more than false negatives.
Prefer resource-attribute filters in every drill-down query. This is the SigNoz MCP guideline and it directly affects query speed at scale.
Do not modify any alert. Investigate is read-only. If the user says "and tighten this alert", surface that as a next-step recommendation; do not call `signoz:signoz_update_alert`.
Stay in scope. Static rule explanation belongs to `signoz-explaining-alerts`. Cause analysis without an alert anchor belongs to `signoz-generating-queries`.
Time zones. Always state fire windows in UTC alongside relative time ("14:32 UTC, 2h ago") so autonomous and interactive consumers agree on the window.
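
The "UTC alongside relative time" convention is easy to get wrong when timestamps arrive in a local zone. A small formatter along the following lines may help. It is only a sketch: the function name is hypothetical, it expects timezone-aware datetimes, and the rounding down to whole minutes, hours, or days is an assumption.

```python
from datetime import datetime, timedelta, timezone


def describe_fire_time(fire_start: datetime, now: datetime) -> str:
    """Render a fire start as absolute UTC plus relative time, e.g. '14:32 UTC, 2h ago'."""
    if fire_start.tzinfo is None or now.tzinfo is None:
        raise ValueError("timezone-aware datetimes are required")
    fire_utc = fire_start.astimezone(timezone.utc)
    minutes = int((now - fire_start).total_seconds() // 60)
    if minutes < 60:
        relative = f"{minutes}m ago"
    elif minutes < 24 * 60:
        relative = f"{minutes // 60}h ago"
    else:
        relative = f"{minutes // (24 * 60)}d ago"
    return f"{fire_utc:%H:%M} UTC, {relative}"
```

Rejecting naive datetimes avoids silently treating a local timestamp as UTC, which would shift the fire window that every later query depends on.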

Examples

User: "Why did the checkout error rate alert fire?"

Agent:

Resolves alert: "High Error Rate — Checkout" (id 42).
`signoz:signoz_get_alert_history` → most recent fire 2h ago at 14:32 UTC, sustained for 8m, single fire (not flapping).
Tier 1: re-runs error-rate formula over `[14:02, 15:02]`. Peak error rate 12.4% (vs 5% threshold, 148% over). Pre-fire baseline 0.3%. Real fire, not marginal.
Tier 2: pulls neighbor signals for `service.name = checkout`:
- p99 latency: 4.1s vs 320ms baseline (+1180%).
- Throughput: -42% (drop).
- Downstream `payments` error rate: 18% vs 0.2% baseline (+8900%).
- CPU/memory: flat (no resource pressure).
Tier 3: traces for `service.name = checkout, hasError = true` in the fire window: top operation `POST /checkout/submit`, top error message "context deadline exceeded calling payments-api". 30 traces, all hitting the same downstream URL. Logs show matching "payments client timeout" lines, 142 occurrences.
Output:
TL;DR: checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC. Most likely cause: payments-api timing out (high confidence; converging trace + log + neighbor evidence). Open trace `7af3a09b…` to see the failing call.
- What fired: alert 42 fired 2h ago at 14:32 UTC, sustained 8m. Error rate peaked at 12.4% (148% over threshold).
- Investigation trail:
  - ✅ Tier 1 — peak 12.4% vs 5% threshold, pre-fire baseline 0.3%. Real fire.
  - ✅ Tier 2 — payments error rate +8900%, p99 latency +1180%, throughput −42%.
  - ❌ CPU / memory pressure on checkout — flat.
  - ✅ Tier 3 — 30 error traces all hit payments-api with `context deadline exceeded`; 142 matching timeout logs.
- Likely causes (high confidence): payments service errors cascading into checkout. Evidence converges across topology (checkout → payments edge), temporal precedence (payments errors lead checkout p99), and shared entity (every error trace targets the payments-api URL).
- Ruled out: checkout-side resource pressure (CPU/memory flat); upstream traffic spike (throughput dropped, didn't spike).
- Next steps: open trace `7af3a09b…` in the SigNoz UI; if payments has its own alert, run `signoz-explaining-alerts` on it, otherwise `signoz-generating-queries` for payments error rate over the last 4h.

User: "Investigate the flapping CPU alert on prod-api-3"

Agent:

Resolves alert (id 88, host.name = prod-api-3).
History: 7 fires in last 1h, alternating fire/resolve every 8-12 minutes: flapping pattern.
Tier 1: peak 84% (threshold 80%, only 5% over). Each fire lasted 2-4 minutes. Marginal: the value hovered near threshold. `matchType = at_least_once` made each blip trigger.
Early-stop kicks in. Skip tiers 2/3.
Output:
TL;DR: alert 88 has been flapping on `prod-api-3`: 7 fires in the last 1h, each 2–4m, all within 5% of the 80% threshold. No clear root cause; this is threshold tuning, not an incident. Switch `matchType` to `on_average` or add hysteresis to stop the noise.
- What fired: alert 88, host `prod-api-3`, fire pattern `flapping`. 7 fires in the last 1h, each 2–4m. Peak 84% (5% over the 80% threshold).
- Investigation trail:
  - ✅ Tier 1 — every fire was within 5% of threshold; duration short; baseline already at 70–75%. Marginal fire — early-stop triggered, Tier 2/3 skipped.
- Likely causes (low / co-occurring signal only): threshold tuned too tight or `matchType` is too sensitive. Evidence: every fire was within 5% of threshold; baseline already runs at 70–75%. Not promoted to a "cause": single signal, no corroboration.
- Ruled out: real CPU saturation incident (peaks too small and short-lived; baseline already near threshold).
- Next steps: change `matchType` to `on_average` (smooths transient spikes) OR raise threshold to 85% with hysteresis (`recoveryTarget: 75`). Use `signoz:signoz_update_alert` to apply.

User: "RCA for the log volume alert that fired last night"

Agent:

Resolves alert: "Error Log Volume Spike" (id 14, no service filter; groupBy `service.name`).
History: fired at 03:12 UTC, sustained 22m. The alert annotations break the fire down by service: `service.name = inventory` was the firing series.
Tier 1: re-runs log count for inventory in fire window. Peak 3,400 errors/min vs 1,000/min threshold (240% over). Pre-fire baseline 12/min. Real, large fire.
Tier 2: neighbor signals for `service.name = inventory`:
- Request error rate: +600%.
- p99 latency: +30% (mild).
- CPU: -80% (collapsed). Memory: -60%.
- Pod restarts (k8s): 4 in fire window.
Tier 3: logs for inventory in fire window. Top message: "OOMKilled restarting" (1,200 occurrences). Top trace error: graceful-shutdown exceptions.
Output:
TL;DR: log volume alert 14 fired at 03:12 UTC for `service.name = inventory`, sustained 22m at 240% over threshold. Most likely cause: inventory pods OOM-killed and restarted 4 times (high confidence). Check container memory limits for the inventory deployment.
- What fired: alert 14 fired at 03:12 UTC for service `inventory`, sustained 22m, 240% over threshold.
- Investigation trail:
  - ✅ Tier 1 — peak 3,400 errors/min vs 1,000/min threshold; pre-fire baseline 12/min. Real fire.
  - ✅ Tier 2 — request error rate +600%; CPU/memory collapsed (−80%/−60%); 4 pod restarts in window.
  - ❌ p99 latency — only +30%, not a latency-driven incident.
  - ✅ Tier 3 — top log message "OOMKilled restarting" (1,200 occurrences); top trace error: graceful-shutdown exceptions.
- Likely causes (high confidence): inventory pods OOM-killed and restarted 4 times during the window. Evidence converges across topology (single service), temporal precedence (memory fell to zero before error spike), shared entity (all log lines from `service.name = inventory`), and a single coherent pattern (OOM → restart → graceful-shutdown noise).
- Ruled out: a true application error-rate change (errors are restart noise, not request-path failures); upstream traffic surge (throughput unchanged).
- Next steps: check container memory limits for inventory pods; review recent deploys; consider whether the alert should exclude restart-related error patterns or whether the underlying OOM is the real concern.

Additional resources

`references/neighbor-signals.md`: lookup table mapping resource type (service / host / k8s) to the neighbor signals to pull in Tier 2.
`references/baseline-comparison.md`: query templates that pair fire-window and baseline-window calls cleanly, including how to format `signoz:signoz_execute_builder_query` for both.
`signoz-explaining-alerts` skill: to decode the rule before investigating, if the user is unfamiliar with what the alert monitors.
`signoz-generating-queries` skill: for ad-hoc follow-up queries on the same resource scope.
