signoz-investigating-alerts

Alert Investigate

Diagnose why a SigNoz alert fired. The skill correlates the alert's own signal with neighbor signals around the fire window, and surfaces a ranked list of likely causes with supporting evidence. It is the companion to
signoz-explaining-alerts
— explain decodes the rule statically; investigate diagnoses a specific incident.

Prerequisites

This skill calls SigNoz MCP server tools heavily (
signoz:signoz_get_alert
,
signoz:signoz_get_alert_history
,
signoz:signoz_execute_builder_query
,
signoz:signoz_query_metrics
,
signoz:signoz_search_traces
,
signoz:signoz_search_logs
,
signoz:signoz_get_trace_details
, etc.). Before running the workflow, confirm the
signoz:signoz_*
tools are available. If they are not, the SigNoz MCP server is not installed or configured — stop and direct the user to set it up: https://signoz.io/docs/ai/signoz-mcp-server/. The investigation depends on correlating multiple MCP queries; without the server there is no way to ground the analysis.

When to use

Use this skill when the user wants to:
  • Understand why a specific alert fired.
  • Find the root cause of a recent incident triggered by an alert.
  • Correlate the alert's signal with related metrics, traces, and logs.
  • Distinguish "real signal" fires from flapping or threshold-mistuning.
Do NOT use when the user wants to:
  • Understand what an alert is configured to monitor →
    signoz-explaining-alerts
    .
  • Create a new alert →
    signoz-creating-alerts
    .
  • Modify an alert (raise threshold, add hysteresis) → call
    signoz:signoz_update_alert
    directly.
  • Run a free-form ad-hoc investigation without an alert as the anchor →
    signoz-generating-queries
    .

Required inputs

| Input | Required | Source if missing |
| --- | --- | --- |
| Alert identifier (rule ID or name) | yes | $ARGUMENTS[0] or recent context |
| Time window | no | default to most recent fire from signoz:signoz_get_alert_history |
If the alert name is fuzzy, this skill is best-effort (read-only):
  1. Call
    signoz:signoz_list_alert_rules
    , paginate, fuzzy-match the name.
  2. State the interpretation: "Investigating fire of 'High Error Rate — Checkout' (id 42) at 14:32 UTC. If you meant a different alert or fire, tell me."
  3. Proceed.
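As a rough illustration of the fuzzy-match step, the sketch below picks the closest rule name with Python's standard difflib. The rule list shape (dicts with "id" and "name"), the 0.4 cutoff, and the sample rules are assumptions made for the example; the actual signoz:signoz_list_alert_rules response may be structured differently.

```python
from difflib import get_close_matches


def match_alert_name(query, rules, cutoff=0.4):
    """Return (id, name) of the closest-named rule, or None if nothing is close.

    `rules` is assumed to be a list of dicts with "id" and "name" keys,
    already collected across all pages of the rule listing.
    """
    by_lowered = {rule["name"].lower(): rule for rule in rules}
    hits = get_close_matches(query.lower(), list(by_lowered), n=1, cutoff=cutoff)
    if not hits:
        return None
    best = by_lowered[hits[0]]
    return best["id"], best["name"]


# Hypothetical sample data, not real rule IDs.
SAMPLE_RULES = [
    {"id": 42, "name": "High Error Rate - Checkout"},
    {"id": 88, "name": "CPU Usage High"},
]
```

Returning None rather than a weak match keeps the "state the interpretation" step honest: when nothing is close, it is safer to ask the user than to guess.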
If the alert has never fired in the lookback window, stop: there is nothing to investigate. Respond with:
"Alert '[name]' has not fired in the last 7d, so there is no fire window to investigate. Use
signoz-explaining-alerts
to walk through the rule, or check whether the alert is enabled."
输入项是否必填缺失时的来源
告警标识符(规则ID或名称)
$ARGUMENTS[0]
或近期上下文
时间窗口默认使用
signoz:signoz_get_alert_history
返回的最近一次告警触发时段
如果告警名称模糊,本技能会进行尽力处理(只读模式):
  1. 调用
    signoz:signoz_list_alert_rules
    ,分页查询,模糊匹配名称。
  2. 说明解析结果:“正在排查‘高错误率——结账服务’(ID 42)在14:32 UTC触发的告警。如果您指的是其他告警或触发事件,请告知我。”
  3. 继续排查。
如果告警在回溯窗口内从未触发,请停止操作:没有可排查的内容。回复:
“告警‘[名称]’在过去7天内未触发,因此没有可排查的触发时段。使用
signoz-explaining-alerts
查看告警规则,或检查告警是否已启用。”

Workflow

The investigation runs in three tiers with strict early-stop gates. Tier 1 always runs. Tier 2 runs only if tier 1 confirms a real fire. Tier 3 runs only if tier 2 surfaces correlated anomalies. Skipping the gates produces hundreds of unnecessary trace/log queries on quiet alerts.
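The gating described above can be sketched as a small driver that refuses to run a later tier unless the earlier one passes its gate. The three callables and the "marginal" and "anomalies" keys are hypothetical stand-ins for whatever each tier actually returns.

```python
def run_investigation(run_tier1, run_tier2, run_tier3):
    """Run the tiers in order, stopping at the first gate that fails.

    Each argument is a zero-argument callable (hypothetical) that performs
    that tier's queries and returns a dict of findings.
    """
    report = {"tier1": run_tier1()}
    if report["tier1"].get("marginal"):
        report["stopped_at"] = "tier1"  # marginal fire: skip tiers 2 and 3
        return report

    report["tier2"] = run_tier2()
    if not report["tier2"].get("anomalies"):
        report["stopped_at"] = "tier2"  # isolated fire: skip tier 3
        return report

    report["tier3"] = run_tier3()
    report["stopped_at"] = None  # all three tiers ran
    return report
```

Because the later tiers are passed in as callables, their trace and log queries are never issued when an earlier gate stops the run.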

Step 1: Resolve alert + fire window (Tier 0)

  1. Resolve the alert id via
    signoz:signoz_list_alert_rules
    (paginated) if not given.
  2. Call
    signoz:signoz_get_alert
    for the full rule config — needed to know what query, threshold, and resource scope the alert evaluated.
  3. Call
    signoz:signoz_get_alert_history
    with a 7d lookback. From the response:
    • Pick the fire window. Default to the most recent fire. If the user passed an explicit time window via
      $ARGUMENTS[1]
      , honor it.
    • Note the fire pattern:
      • one-off
        → single fire with a long quiet period before/after.
      • sustained
        → fires that stayed firing for ≥ 1 evaluation cycle.
      • flapping
        → ≥ 3 fires within a 1h window, alternating fire/resolve.
      • recurring
        → fires at regular intervals (cron-like, e.g., every hour).
    • The pattern tells you what to expect from tiers 2/3.
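One way to turn the fire-pattern definitions above into a check is sketched below. The input shape (a list of (start, end) datetimes taken from the alert history), the precedence order (flapping, then recurring, then sustained, then one-off), and the 10% gap tolerance for "regular intervals" are all assumptions, since the text does not specify them.

```python
from datetime import datetime, timedelta


def classify_fire_pattern(fires, eval_cycle, tolerance=0.1):
    """Classify a fire history as flapping, recurring, sustained, or one-off.

    fires: list of (start, end) datetime pairs; eval_cycle: timedelta.
    """
    if not fires:
        raise ValueError("no fires in the lookback window")
    fires = sorted(fires)
    starts = [start for start, _ in fires]

    # flapping: at least 3 fires whose starts fall within a 1h window
    for i in range(len(starts) - 2):
        if starts[i + 2] - starts[i] <= timedelta(hours=1):
            return "flapping"

    # recurring: at least 3 fires separated by near-equal gaps
    if len(starts) >= 3:
        gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
        mean_gap = sum(gaps) / len(gaps)
        if all(abs(gap - mean_gap) <= tolerance * mean_gap for gap in gaps):
            return "recurring"

    # sustained: the most recent fire lasted at least one evaluation cycle
    last_start, last_end = fires[-1]
    if last_end - last_start >= eval_cycle:
        return "sustained"
    return "one-off"
```

An empty history raises instead of returning a label, which matches the instruction to stop when the alert has never fired in the lookback window.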

Step 2: Tier 1 — what fired and how hard

This tier always runs. It establishes that the fire is real (rather than a transient brush against the threshold or a flap) and quantifies its magnitude.
  1. Re-run the alert's primary query over a window centered on the fire start:
    [fire_start - 30m, fire_start + 30m]
    .
    • Use
      signoz:signoz_execute_builder_query
      for builder/formula alerts.
    • Use
      signoz:signoz_query_metrics
      for PromQL alerts.
  2. Compute:
    • Peak value during the fire window.
    • Threshold breach magnitude:
      (peak - threshold) / threshold * 100
      for "above" alerts, inverted for "below".
    • Fire duration: how long the breach lasted.
    • Pre-fire baseline: average value in the 30m before fire start.
  3. Early-stop gate: if the breach magnitude is < 10% over the threshold AND the fire duration is < 1 evaluation window, classify as "marginal fire" — the alert may be too sensitive. Skip tiers 2 and 3 and go to Step 5 with a single hypothesis: "threshold may be too tight, recommend tuning."
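The Tier 1 arithmetic can be sketched as follows. The function names are illustrative, and the "below" branch is one assumed reading of "inverted" (how far the value fell under the threshold, as a percentage of the threshold).

```python
from datetime import datetime, timedelta


def tier1_window(fire_start, half_width=timedelta(minutes=30)):
    """Query window centered on the fire start."""
    return fire_start - half_width, fire_start + half_width


def breach_magnitude_pct(peak, threshold, direction="above"):
    """Percent by which the peak breached the threshold (threshold must be non-zero)."""
    if direction == "above":
        return (peak - threshold) / threshold * 100
    # Assumed inversion for "below" alerts.
    return (threshold - peak) / threshold * 100


def is_marginal_fire(magnitude_pct, fire_duration, eval_window):
    """Early-stop gate: small breach AND shorter than one evaluation window."""
    return magnitude_pct < 10 and fire_duration < eval_window


# Example: a fire starting at 14:32 is re-queried over 14:02 to 15:02.
EXAMPLE_WINDOW = tier1_window(datetime(2024, 1, 2, 14, 32))
```

For instance, a 12.4% peak against a 5% threshold gives (12.4 - 5) / 5 * 100 = 148% over, which is far from marginal, while an 84% peak against an 80% threshold is only 5% over.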

Step 3: Tier 2 — neighbor signals vs baseline

Run only if Tier 1 confirms a real breach. Pull related signals for the same resource scope as the alert and compare the fire window to a baseline window.
  1. Pick a baseline window. Use the same hour, previous day (
    fire_start - 24h, fire_start - 24h + fire_duration
    ). If the alert fired during a known-anomalous time (deploy, weekly job), note it in the output but still proceed.
  2. Look up neighbor signals for the alert's resource type. See
    references/neighbor-signals.md
    for the lookup table. Common cases:
    • Service-level alert (
      service.name = X
      ): pull error rate, p95/p99 latency, request throughput, dependency error rates if trace data is available.
    • Host / VM alert (
      host.name = X
      ): CPU, memory, disk I/O, network I/O.
    • K8s pod / namespace alert: pod restarts, container CPU/memory limits, node pressure, recent rollouts.
  3. For each neighbor signal:
    • Query both windows (fire + baseline) via
      signoz:signoz_execute_builder_query
      or
      signoz:signoz_query_metrics
      .
    • Compute the delta (% change in fire window vs baseline).
    • Rank by absolute delta.
  4. Early-stop gate: if no neighbor signal shows ≥ 25% deviation from baseline, classify as "isolated fire — the alert's own signal moved but nothing else did." This is unusual and worth surfacing. Skip Tier 3 and go to Step 5 with hypotheses focused on the alert's own query (likely causes: data source change, instrumentation change, downstream silent failure that only shows in this metric).
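A minimal sketch of the Tier 2 comparison, assuming each neighbor signal has already been reduced to one number per window (for example an average). The signal names and the handling of a zero baseline are assumptions for illustration.

```python
from datetime import datetime, timedelta


def baseline_window(fire_start, fire_duration):
    """Same time of day on the previous day, same length as the fire."""
    start = fire_start - timedelta(hours=24)
    return start, start + fire_duration


def pct_delta(fire_value, baseline_value):
    """Percent change of the fire window relative to the baseline window."""
    if baseline_value == 0:
        # Assumed convention: any move off a zero baseline counts as unbounded.
        return 0.0 if fire_value == 0 else float("inf")
    return (fire_value - baseline_value) / baseline_value * 100


def rank_neighbors(signals, min_deviation_pct=25.0):
    """signals: {name: (fire_value, baseline_value)}.

    Returns (ranked, isolated): ranked is a list of (name, delta_pct) sorted
    by absolute delta, isolated is True when no signal moved by the minimum.
    """
    deltas = {name: pct_delta(f, b) for name, (f, b) in signals.items()}
    ranked = sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)
    isolated = all(abs(delta) < min_deviation_pct for _, delta in ranked)
    return ranked, isolated
```

Ranking by absolute delta means a sharp drop (throughput falling 42%) ranks alongside sharp rises, which matters when the symptom is a collapse rather than a spike.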

Step 4: Tier 3 — traces and logs at the fire window

Run only if Tier 2 found correlated neighbor anomalies. Drill into specific failing operations.
  1. Traces (if the alert is service-scoped and traces are available):
    • Call
      signoz:signoz_search_traces
      for the fire window with filter:
      service.name = <scope>
      AND
      hasError = true
      . Cap at top 20.
    • Group results by
      name
      (operation) and
      error_message
      . Surface the top 3 by frequency with a representative trace ID for each.
    • Optionally call
      signoz:signoz_get_trace_details
      on one representative to extract specific span attributes (DB statement, downstream URL, status code).
  2. Logs for the fire window:
    • Call
      signoz:signoz_search_logs
      with filter:
      <scope_filter>
      AND
      severity_text IN ('ERROR', 'FATAL')
      . Cap at top 20 most recent.
    • Group by
      body
      pattern (or
      exception.type
      if present). Surface the top 3 distinct messages with counts.
  3. Cross-reference: do the traces and logs point at the same downstream service, dependency, or code path? If so, that becomes the leading hypothesis.
See
references/baseline-comparison.md
for query templates that pair fire-window and baseline-window calls cleanly.
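The trace grouping in the Tier 3 steps above could look like the sketch below. It assumes each result is a dict carrying the name and error_message fields mentioned in the text plus a trace_id key; the real signoz:signoz_search_traces response shape may differ.

```python
from collections import Counter


def top_error_groups(spans, top_n=3):
    """Group error spans by (operation, error message) and keep the top N.

    Each group keeps the first trace ID seen as its representative.
    """
    counts = Counter()
    representative = {}
    for span in spans:
        key = (span["name"], span.get("error_message", ""))
        counts[key] += 1
        representative.setdefault(key, span["trace_id"])

    return [
        {
            "operation": operation,
            "error_message": message,
            "count": count,
            "trace_id": representative[(operation, message)],
        }
        for (operation, message), count in counts.most_common(top_n)
    ]
```

The representative trace ID always comes from the query results themselves, in line with the rule that no identifier in the output may be invented.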

Step 5: Build the structured output

Use this exact section order. Lead with a TL;DR — engineers under pressure scan the top first and stop reading once they have what they need. Compression plus proof: every claim cites the MCP query that produced it; no generic "check logs / verify connectivity" filler.
1. TL;DR — one or two sentences, no more. Leading hypothesis, overall confidence, blast radius, and the single most useful next action. Example:
"checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC — most likely cause is payments-api timing out (high confidence). Open trace
7af3a09b…
to see the failing call."
If no hypothesis reaches medium confidence, the leading line is "No clear root cause found." rather than a low-confidence guess dressed up as the answer.
2. What fired. The alert (id, name), the fire window (absolute UTC + relative), peak magnitude ("error rate hit 12.4% vs. 5% threshold, 148% over"), fire duration, and the fire pattern (
one-off
/
sustained
/
flapping
/
recurring
/
marginal
).
3. Investigation trail. A scannable list of what was checked, with ✅ for confirmed signals and ❌ for ruled out, each followed by a one-line finding. The point is that the reader can see what work the AI did and what it found; this is where trust is earned. Example:
  • ✅ Tier 1 — peak error rate 12.4%, fire was real (not marginal).
  • ✅ Tier 2 — payments error rate +8900%, p99 +1180%; downstream cascade.
  • ❌ CPU / memory pressure — flat through the fire window.
  • ✅ Tier 3 — 30 error traces all hit payments-api, same message.
4. Likely causes (ranked, max 3). Each cause has three parts:
  • Hypothesis — one sentence, specific. Bad: "service is unhealthy". Good: "checkout is timing out on calls to payments-api".
  • Evidence — the supporting numbers from tiers 1/2/3, with the underlying query inline so the user can re-run it. State the neighbor signal, the delta vs baseline, the trace/log pattern that supports it.
  • Confidence
    high
    requires ≥2 of: temporal precedence, topology / dependency edge, shared service or entity, correlated metric/log/trace evidence, recent deploy or config change.
    medium
    is one tier's evidence with at least one of those signals.
    low
    is a single signal moved with no corroboration — in that case label it a "co-occurring signal," not a cause.
If an early-stop gate ended the investigation (marginal fire at Tier 1, or no neighbor anomalies at Tier 2), output fewer hypotheses with
low
confidence and explicitly call out the limitation.
5. Ruled out. Short but explicit. List candidates the evidence eliminated and the one-line reason why. Skip the section if there's nothing meaningful to rule out, but if you considered something and dropped it, say so here so the user doesn't waste time re-checking it.
6. Suggested next steps. Action items the user can take. Be concrete and use SigNoz-native handles so the user can act immediately:
  • Specific trace, dashboard, or alert to open (e.g., "open trace
    7af3a09b…
    in the SigNoz UI").
  • Specific query to run with
    signoz-generating-queries
    — paste the exact filter and time window.
  • "Tune this alert" if the fire was marginal — name the field (
    matchType
    ,
    target
    ,
    recoveryTarget
    ) and the change to make via
    signoz:signoz_update_alert
    .
  • "Open an incident" or "page the owning team" if the cause is cross-service.
Do not pad with generic advice ("verify connectivity", "check dashboards") — that's noise during an active incident.
Mirroring as navigation chips. Mirror up to 3 of these "Suggested next steps" as host follow-up intents — the most actionable, alert-scoped ones. Keep the rest in the report prose so the user has the full picture. The chip surface is capped; the prose is not.
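The confidence rubric in section 4 can be approximated with a simple count of converging signals, as in the sketch below. This is a simplification: the signal labels are made-up identifiers for the five categories named in the rubric, and the "one tier's evidence" nuance for medium is reduced to "exactly one signal".

```python
CAUSAL_SIGNALS = frozenset({
    "temporal_precedence",
    "topology_edge",
    "shared_entity",
    "correlated_evidence",
    "recent_change",
})


def confidence(signals_present):
    """Map the set of converging signals to high / medium / low."""
    present = set(signals_present)
    unknown = present - CAUSAL_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    if len(present) >= 2:
        return "high"
    if len(present) == 1:
        return "medium"
    return "low"


def tldr_lead(hypothesis, level):
    """Never promote a low-confidence guess to the leading line."""
    if level == "low":
        return "No clear root cause found."
    return f"Most likely cause: {hypothesis} ({level} confidence)."
```

Keeping the rubric as an explicit count makes it harder to label a single time-aligned anomaly as a cause by accident.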

Out of scope (v1)

  • Deployment / config-change correlation — SigNoz MCP does not expose a deployments tool; do not fabricate one. If the user mentions a recent deploy, surface it as context but don't claim it caused the fire without the signal evidence.
  • Cross-service blast-radius walking — investigating downstream callers of the alert's service. Out of scope to keep context bounded.
  • Long-horizon historical baselines — Tier 2 compares to one prior-day window, not to weekly/monthly seasonality. If the user says "is this normal for a Friday afternoon", suggest an anomaly alert (
    signoz-creating-alerts
    with
    anomaly_rule
    ).

Guardrails

  • Three-tier early-stop is mandatory. Skipping the gates pulls hundreds of traces/logs on quiet alerts and explodes context. The gates are not optional optimizations.
  • Anchor every claim to an MCP query result. No speculation. If evidence is missing, lower confidence and say so.
  • Show the supporting query with each hypothesis so the user can reproduce and dig deeper.
  • Compression plus proof. TL;DR is one or two sentences max; the full report is a triage card, not a postmortem. Engineers under pressure should be able to skim the top and act. Every section earns its place by adding evidence the user couldn't already see in the alert payload.
  • Correlation ≠ causation. Label something a cause only when at least two of the following converge: temporal precedence (signal moved before symptom), topology / dependency edge, shared service or entity, correlated metric/log/trace evidence, or a recent deploy/config change. A single time-aligned anomaly is a "co-occurring signal," not a cause — say so explicitly.
  • Don't restate the alert or recommend the obvious. "Check logs", "verify connectivity", "investigate dashboards" — the reader of this output already knows they need to. Replace generic suggestions with specific queries, traces, or filters they can run immediately.
  • No fabricated identifiers. Trace IDs, span names, alert rule IDs, channel names, deploy IDs — every identifier in the output must come from a real MCP response. Don't invent placeholders that look plausible.
  • Honest uncertainty wins. If no hypothesis reaches medium confidence, the answer is "No clear root cause found — here's what we checked and what's ruled out." Do not promote a low-confidence guess to the leading hypothesis just to sound useful. False positives waste active incident time more than false negatives.
  • Prefer resource-attribute filters in every drill-down query. This is the SigNoz MCP guideline and it directly affects query speed at scale.
  • Do not modify any alert. Investigate is read-only. If the user says "and tighten this alert", surface that as a next-step recommendation; do not call
    signoz:signoz_update_alert
    .
  • Stay in scope. Static rule explanation belongs to
    signoz-explaining-alerts
    . Cause analysis without an alert anchor belongs to
    signoz-generating-queries
    .
  • Time zones. Always state fire windows in UTC alongside relative time ("14:32 UTC, 2h ago") so autonomous and interactive consumers agree on the window.
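For the time-zone guardrail, a small formatter like the one below produces the "14:32 UTC, 2h ago" form. It is a sketch that assumes both timestamps are timezone-aware; the rounding to whole minutes, hours, or days is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone


def format_fire_time(fire_start, now):
    """Render a fire start as absolute UTC plus a coarse relative age."""
    if fire_start.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    fire_utc = fire_start.astimezone(timezone.utc)
    minutes = int((now - fire_start).total_seconds() // 60)
    if minutes < 60:
        relative = f"{minutes}m ago"
    elif minutes < 24 * 60:
        relative = f"{minutes // 60}h ago"
    else:
        relative = f"{minutes // (24 * 60)}d ago"
    return f"{fire_utc:%H:%M} UTC, {relative}"
```

Rejecting naive datetimes avoids silently mixing local time and UTC, which is the failure this guardrail is meant to prevent.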

Examples

User: "Why did the checkout error rate alert fire?"
Agent:
  1. Resolves alert: "High Error Rate — Checkout" (id 42).
  2. signoz:signoz_get_alert_history
    → most recent fire 2h ago at 14:32 UTC, sustained for 8m, single fire (not flapping).
  3. Tier 1: re-runs error-rate formula over
    [14:02, 15:02]
    . Peak error rate 12.4% (vs 5% threshold — 148% over). Pre-fire baseline 0.3%. Real fire, not marginal.
  4. Tier 2: pulls neighbor signals for
    service.name = checkout
    :
    • p99 latency: 4.1s vs 320ms baseline (+1180%).
    • Throughput: -42% (drop).
    • Downstream
      payments
      error rate: 18% vs 0.2% baseline (+8900%).
    • CPU/memory: flat (no resource pressure).
  5. Tier 3: traces for
    service.name = checkout, hasError = true
    in the fire window — top operation
    POST /checkout/submit
    , top error message "context deadline exceeded calling payments-api". 30 traces, all hitting the same downstream URL. Logs show matching "payments client timeout" lines, 142 occurrences.
  6. Output:
    TL;DR: checkoutservice error rate hit 12.4% (threshold 5%) for 8m at 14:32 UTC; the most likely cause is payments-api timing out (high confidence: converging trace, log, and neighbor evidence). Open trace
    7af3a09b…
    to see the failing call.
    • What fired: alert 42 fired 2h ago at 14:32 UTC, sustained 8m. Error rate peaked at 12.4% (148% over threshold).
    • Investigation trail:
      • ✅ Tier 1 — peak 12.4% vs 5% threshold, pre-fire baseline 0.3%. Real fire.
      • ✅ Tier 2 — payments error rate +8900%, p99 latency +1180%, throughput −42%.
      • ❌ CPU / memory pressure on checkout — flat.
      • ✅ Tier 3 — 30 error traces all hit payments-api with
        context deadline exceeded
        ; 142 matching timeout logs.
    • Likely causes (high confidence): payments service errors cascading into checkout. Evidence converges across topology (checkout → payments edge), temporal precedence (payments errors lead checkout p99), and shared entity (every error trace targets the payments-api URL).
    • Ruled out: checkout-side resource pressure (CPU/memory flat); upstream traffic spike (throughput dropped, didn't spike).
    • Next steps: open trace
      7af3a09b…
      in the SigNoz UI; if payments has its own alert, run
      signoz-explaining-alerts
      on it, otherwise
      signoz-generating-queries
      for payments error rate over the last 4h.

User: "Investigate the flapping CPU alert on prod-api-3"
Agent:
  1. Resolves alert (id 88, host.name = prod-api-3).
  2. History: 7 fires in last 1h, alternating fire/resolve every 8-12 minutes — flapping pattern.
  3. Tier 1: peak 84% (threshold 80%, only 5% over). Each fire lasted 2-4 minutes. Marginal — the value hovered near threshold.
    matchType = at_least_once
    made each blip trigger.
  4. Early-stop kicks in. Skip tiers 2/3.
  5. Output:
    TL;DR: alert 88 has been flapping on
    prod-api-3
    : 7 fires in the last 1h, each 2–4m, all within 5% of the 80% threshold, so there is no clear root cause; this is threshold tuning, not an incident. Switch
    matchType
    to
    on_average
    or add hysteresis to stop the noise.
    • What fired: alert 88, host
      prod-api-3
      , fire pattern
      flapping
      . 7 fires in the last 1h, each 2–4m. Peak 84% (5% over the 80% threshold).
    • Investigation trail:
      • ✅ Tier 1 — every fire was within 5% of threshold; duration short; baseline already at 70–75%. Marginal fire — early-stop triggered, Tier 2/3 skipped.
    • Likely causes (low / co-occurring signal only): threshold tuned too tight or
      matchType
      is too sensitive. Evidence: every fire was within 5% of threshold; baseline already runs at 70–75%. Not promoted to a "cause" — single signal, no corroboration.
    • Ruled out: real CPU saturation incident (peaks too small and short-lived; baseline already near threshold).
    • Next steps: change
      matchType
      to
      on_average
      (smooths transient spikes) OR raise threshold to 85% with hysteresis (
      recoveryTarget: 75
      ). Use
      signoz:signoz_update_alert
      to apply.

User: "RCA for the log volume alert that fired last night"
Agent:
  1. Resolves alert: "Error Log Volume Spike" (id 14, no service filter — groupBy
    service.name
    ).
  2. History: fired at 03:12 UTC, sustained 22m; the alert annotations break the fire down by service, and
    service.name = inventory
    was the firing series.
  3. Tier 1: re-runs log count for inventory in fire window. Peak 3,400 errors/min vs 1,000/min threshold (240% over). Pre-fire baseline 12/min. Real, large fire.
  4. Tier 2: neighbor signals for
    service.name = inventory
    :
    • Request error rate: +600%.
    • p99 latency: +30% (mild).
    • CPU: -80% (collapsed). Memory: -60%.
    • Pod restarts (k8s): 4 in fire window.
  5. Tier 3: logs for inventory in fire window. Top message: "OOMKilled restarting" (1,200 occurrences). Top trace error: graceful-shutdown exceptions.
  6. Output:
    TL;DR: log volume alert 14 fired at 03:12 UTC for
    service.name = inventory
    , sustained 22m at 240% over threshold; the most likely cause is inventory pods being OOM-killed and restarted 4 times (high confidence). Check container memory limits for the inventory deployment.
    • What fired: alert 14 fired at 03:12 UTC for service
      inventory
      , sustained 22m, 240% over threshold.
    • Investigation trail:
      • ✅ Tier 1 — peak 3,400 errors/min vs 1,000/min threshold; pre-fire baseline 12/min. Real fire.
      • ✅ Tier 2 — request error rate +600%; CPU/memory collapsed (−80%/−60%); 4 pod restarts in window.
      • ❌ p99 latency — only +30%, not a latency-driven incident.
      • ✅ Tier 3 — top log message "OOMKilled restarting" (1,200 occurrences); top trace error: graceful-shutdown exceptions.
    • Likely causes (high confidence): inventory pods OOM-killed and restarted 4 times during the window. Evidence converges across topology (single service), temporal precedence (memory dropped 60% before the error spike), shared entity (all log lines from
      service.name = inventory
      ), and a single coherent pattern (OOM → restart → graceful-shutdown noise).
    • Ruled out: a true application error-rate change (errors are restart noise, not request-path failures); upstream traffic surge (throughput unchanged).
    • Next steps: check container memory limits for inventory pods; review recent deploys; consider whether the alert should exclude restart-related error patterns or whether the underlying OOM is the real concern.

Additional resources

  • references/neighbor-signals.md
    — lookup table mapping resource type (service / host / k8s) to the neighbor signals to pull in Tier 2.
  • references/baseline-comparison.md
    — query templates that pair fire-window and baseline-window calls cleanly, including how to format
    signoz:signoz_execute_builder_query
    for both.
  • signoz-explaining-alerts
    skill — to decode the rule before investigating, if the user is unfamiliar with what the alert monitors.
  • signoz-generating-queries
    skill — for ad-hoc follow-up queries on the same resource scope.