signoz-creating-alerts

Create Alert

Build a SigNoz alert from a user's natural-language intent. The skill targets two consumers: an autonomous AI SRE agent that runs without a human in the loop, and a human at a Claude Code / Codex / Cursor prompt. Both go through the same flow — the human just gets a chance to intervene at the preview step.

Prerequisites

This skill calls SigNoz MCP server tools (`signoz:signoz_create_alert`, `signoz:signoz_list_alerts`, `signoz:signoz_get_field_keys`, etc.). Before running the workflow, confirm the `signoz:signoz_*` tools are available. If they are not, the SigNoz MCP server is not installed or configured: stop and direct the user to set it up at https://signoz.io/docs/ai/signoz-mcp-server/. Do not fall back to raw HTTP calls or fabricate alert configs without the MCP tools.

When to use

Use this skill when the user wants to:
  • Create, set up, or configure a new alert rule.
  • Get paged or notified when a metric, log volume, latency, or error rate crosses a threshold.
  • Detect anomalous behavior on a service, host, or signal.
  • Catch silent data loss ("alert if data stops arriving from X").
Do NOT use when the user wants to:
  • Understand what an existing alert monitors → signoz-explaining-alerts.
  • Diagnose why an existing alert fired → signoz-investigating-alerts.
  • Modify thresholds, queries, or routing on an existing alert → call `signoz:signoz_update_alert` directly.

Required inputs (strict)

Alert creation is a write operation against a shared system. Guessing here creates noisy alerts on the wrong service that someone else has to clean up. The skill enforces a strict input contract:
| Input | Required | Source if missing |
|---|---|---|
| Alert intent (NL goal) | yes | `$ARGUMENTS` or recent user turn |
| Resource attribute filter (e.g. `service.name`, `k8s.namespace.name`, `host.name`) | yes | discover via `signoz:signoz_get_field_keys` + `signoz:signoz_get_field_values` |
| Threshold value(s) | inferred from intent | derive a sensible default and surface in the preview |
| Severity | inferred from intent | default `warning`; promote to `critical` only if the user said "page", "wake up", "critical" |
| Notification channel | yes | `signoz:signoz_list_notification_channels` + offer "create new" |
If a required input is missing and cannot be discovered, emit a structured `needs_input` block and stop before calling any write tool:

```text
needs_input:
  missing:
    - resource_attribute_filter: "no service or host specified — pick one"
  candidates:
    service.name: ["frontend", "checkout", "payments", "inventory"]
    host.name: ["prod-api-1", "prod-api-2", "prod-db-1"]
```
In interactive mode, the human picks from candidates. In autonomous mode, the caller fills the gap from upstream context or escalates. Either way, do not proceed to `signoz:signoz_create_alert` with a guessed value.
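The input contract above can be sketched as a small gate that runs before any write tool. This is a minimal sketch: `check_inputs`, the field names, and the return shape are hypothetical, not part of the skill's API.

```python
# Hypothetical sketch of the strict-input gate: collect missing required
# inputs and return a needs_input structure instead of proceeding.
REQUIRED = ("intent", "resource_filter", "channel")

def check_inputs(inputs, candidates=None):
    """Return None if all required inputs are present, else a needs_input dict."""
    missing = [k for k in REQUIRED if not inputs.get(k)]
    if not missing:
        return None  # safe to continue toward signoz_create_alert
    return {
        "needs_input": {
            "missing": missing,
            "candidates": candidates or {},
        }
    }

# A request with only the intent filled in gets blocked with candidates attached.
blocked = check_inputs(
    {"intent": "alert on high error rate"},
    candidates={"service.name": ["frontend", "checkout"]},
)
```

The caller (human or autonomous agent) resolves the `missing` list and re-enters the flow; the gate never picks a candidate itself.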

Workflow

Step 1: Parse intent and check what's missing

Extract from the user's request:
  1. What to monitor — signal type (metrics / logs / traces / exceptions) and the specific condition (CPU, error rate, p99 latency, log count, ...).
  2. Resource scope — which service, host, namespace, or environment.
  3. Threshold — numeric value and comparison ("above 80%", "below 100/s").
  4. Severity — implicit from urgency words ("page" → critical, default warning otherwise).
  5. Channel — explicit channel name if the user provided one.
Map signal phrasing to alert type:

| User says | alertType | signal |
|---|---|---|
| metric, CPU, memory, latency, request rate | METRIC_BASED_ALERT | metrics |
| log, error logs, log volume, log pattern | LOGS_BASED_ALERT | logs |
| trace, span, latency p99, slow requests | TRACES_BASED_ALERT | traces |
| exception, stack trace, crash | EXCEPTIONS_BASED_ALERT | (clickhouse_sql) |
If resource scope is missing, run discovery (Step 2). If still missing after discovery, emit `needs_input` and stop.
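The phrasing-to-alertType table can be sketched as an ordered keyword lookup. The keyword lists below are illustrative, not exhaustive, and the ordering (exceptions before logs before traces before metrics) is one way to disambiguate overlapping words like "latency":

```python
# Illustrative keyword map; more specific signals are checked first so that
# "p99 latency" lands on traces while bare "latency" lands on metrics.
SIGNAL_MAP = [
    (("exception", "stack trace", "crash"), "EXCEPTIONS_BASED_ALERT", "exceptions"),
    (("log",), "LOGS_BASED_ALERT", "logs"),
    (("trace", "span", "slow request", "p99"), "TRACES_BASED_ALERT", "traces"),
    (("metric", "cpu", "memory", "latency", "request rate"), "METRIC_BASED_ALERT", "metrics"),
]

def classify(intent: str):
    text = intent.lower()
    for keywords, alert_type, signal in SIGNAL_MAP:
        if any(k in text for k in keywords):
            return alert_type, signal
    return None  # ambiguous: ask the user, don't guess
```

A `None` result feeds straight into the `needs_input` path rather than defaulting to any alert type.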

Step 2: Discover resource attributes and metric names

When the user does not name a service / host / namespace, the SigNoz MCP guideline applies: always prefer a resource-attribute filter. Discover candidates instead of guessing:
  1. Call `signoz:signoz_get_field_keys` with `fieldContext=resource` to enumerate resource attributes for the chosen signal.
  2. Call `signoz:signoz_get_field_values` for the most likely attribute (typically `service.name`, then `host.name`, then `k8s.namespace.name`) to get concrete values.
  3. If the user mentioned a metric by name, call `signoz:signoz_list_metrics` with a search term to verify the exact OTel metric name. Wrong names create alerts that never fire.
Surface the candidates in the `needs_input` block. Do not pick one.

Step 2.5: Probe data existence for the chosen filter (fail fast)

Before authoring any alert config, confirm the specific combination the alert will watch (metric × service × any other filter) actually emits data. The most common silent failure is "metric exists in the catalog and the service exists in the catalog, but the service doesn't emit that metric" — each piece checks out in isolation, the alert saves successfully, and it silently never fires.
Run a single probe over the last 1 hour using the same filter the alert will use, but with the simplest aggregation that confirms data exists:
  • Metrics: `signoz:signoz_query_metrics` (PromQL) or `signoz:signoz_execute_builder_query` with `count()` (or `count_distinct(service.name)` if scope-discovering).
  • Logs: `signoz:signoz_aggregate_logs` with `count()` over the filter.
  • Traces: `signoz:signoz_aggregate_traces` with `count()` over the filter.
Inspect the result:
  • Probe returns rows → proceed to Step 3.
  • Probe returns empty → STOP. Do not build an alert config the user will then be asked to throw away. Emit a `needs_input` block describing what was missing and offer concrete recovery:
    • Service doesn't emit the metric → call `signoz:signoz_get_field_values signal=metrics name=service.name metricName=<metric>` to list the services that do emit it; let the user pick a different service or a different metric.
    • Wrong attribute name (`service` instead of `service.name`) → suggest the semantic-convention name and re-probe.
    • Service emits the metric but not in the expected time range → widen the probe window once (e.g. last 24h) before declaring no-data.
Exception — log-based crash / panic / OOMKilled / FATAL alerts. These intentionally have zero matches in a healthy system. The probe will return empty by design. Do not stop; instead, surface the zero-match result and ask the user to confirm before save. Treat this exception narrowly: it applies to "alert me when bad thing happens" log queries, not to alerts that depend on continuous data flow.
This probe is cheap (one query, ~100ms), and catching the no-data case early avoids the worst UX failure mode of this skill — the user reading through a fully-authored JSON payload and only then learning the alert can never fire.
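The probe decision tree above, including the crash-log exception and the one-time window widening, can be sketched as a single function. The function name and return labels are hypothetical tags for the branches, not skill API:

```python
# Sketch of the Step 2.5 fail-fast decision. `rows` is the probe result;
# `is_bad_event_log_alert` marks crash/panic/OOMKilled/FATAL log queries,
# which are empty by design in a healthy system.
def probe_decision(rows, is_bad_event_log_alert=False, widened=False):
    """Return the next action for a probe result."""
    if rows:
        return "proceed"  # data exists for this exact filter combination
    if is_bad_event_log_alert:
        # zero matches is expected; confirm with the user instead of stopping
        return "confirm_zero_match"
    if not widened:
        return "widen_window_and_reprobe"  # retry once over e.g. last 24h
    return "needs_input"  # still empty after widening: stop before authoring
```

Note the exception is applied narrowly: only "alert me when bad thing happens" log queries bypass the stop, never alerts that depend on continuous data flow.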

Step 3: Check for duplicate alerts

Call `signoz:signoz_list_alerts` and paginate through every page while `pagination.hasMore` is true, until you have walked the full list. Check for existing alerts that match the user's intent (same signal + same scope + similar threshold). If a likely duplicate exists, surface it and ask whether to create a new one anyway, modify the existing one (out of scope here; use `signoz:signoz_update_alert`), or cancel.
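The full-pagination rule can be sketched as a loop over `pagination.hasMore`. Here `list_page` stands in for the MCP call, and the two-page toy backend is fabricated for illustration:

```python
# Walk every page of an alerts listing; stopping at page 1 would miss
# duplicates living on later pages.
def walk_alerts(list_page):
    alerts, page = [], 1
    while True:
        resp = list_page(page)
        alerts.extend(resp["alerts"])
        if not resp["pagination"]["hasMore"]:
            return alerts
        page += 1

# Toy two-page backend standing in for signoz:signoz_list_alerts.
pages = {
    1: {"alerts": [{"id": 1, "name": "cpu-checkout"}], "pagination": {"hasMore": True}},
    2: {"alerts": [{"id": 2, "name": "err-payments"}], "pagination": {"hasMore": False}},
}
all_alerts = walk_alerts(lambda p: pages[p])
```

Duplicate detection then runs over `all_alerts`, never over a single page.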

Step 4: Build the alert config

The MCP server is the source of truth for the alert JSON schema, threshold codes, and validation rules. Read the `signoz://alert/instructions` and `signoz://alert/examples` MCP resources for the canonical, version-current shape. Do not transcribe schema text into this skill; it will rot out of sync with the server.
For most user intents, the config is one of a small number of patterns:

| Pattern | Where to author | Example intents |
|---|---|---|
| Single-metric threshold | inline (this skill) | "alert when CPU > 80%", "p99 latency > 2s" |
| Log volume threshold | inline | "more than N error logs/min" |
| Trace-based count or p-tile | inline | "p99 span duration > 2s on checkout" |
| Error-rate formula (A/B*100) | inline (see "Common query shapes" below) | "error rate > 5%" |
| Anomaly detection (Z-score) | inline, but only with `METRIC_BASED_ALERT` | "alert me on anomalous traffic" |
| Absent-data alert | inline | "alert if data stops arriving" |
| ClickHouse SQL alert | inline (this skill): author the SQL directly using the schema in `signoz://alert/examples` | non-trivial joins, custom aggregations the builder cannot express |
| PromQL alert | delegate to signoz-generating-queries for the PromQL, then return here | when user already has PromQL |
Threshold `op` and `matchType` values. v2alpha1 accepts the human-readable strings (`"above"`, `"on_average"`); the legacy numeric codes (`"1"`, `"3"`) are also accepted but harder to read in the UI. Prefer the words. Anomaly rules only support `op: "above"`: the engine already treats z-score breaches as two-sided when the threshold is positive, so `"above_or_below"` is rejected and unnecessary.
| Comparison | `op` | Evaluation behavior | `matchType` |
|---|---|---|---|
| above / exceeds / > | `"above"` | breach at any point | `"at_least_once"` |
| below / under / < | `"below"` | breach for entire window | `"all_the_times"` |
| equal / = | `"equals"` | average breaches | `"on_average"` |
| not equal / != | `"not_equals"` | sum breaches | `"in_total"` |
|  |  | last value breaches | `"last"` |
Defaults the skill applies (and surfaces in the preview):
  • `evalWindow: 5m0s`, `frequency: 1m0s`; change only if the intent implies a slower or faster cadence.
  • `matchType: "on_average"` for CPU / memory / latency, to smooth transient spikes.
  • `matchType: "at_least_once"` for error counts / error rates, to catch any breach.
Severity defaults — derive from the intrinsic urgency of the alert, not just the user's words. The user saying "alert me" doesn't force `warning` when the condition itself describes a critical event. Use this table; an explicit user cue overrides it ("just FYI" → demote, "page me" / "wake me up" → promote).
| Alert intent | Default severity |
|---|---|
| Pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL log signals | critical |
| Service down / no-data on a production service | critical |
| Error rate above any non-trivial threshold (>1%) | critical |
| Error logs / exception spikes | warning |
| Latency degradation (p95/p99 above target) | warning |
| CPU / memory / disk pressure | warning |
| Request-rate / traffic anomaly | warning |
| SLO budget burn (info-level burn rate) | info / warning |
When the user's intent is ambiguous on severity (no urgency cue, no clearly-critical condition), default to `warning` and surface the choice in the preview so they can adjust.
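A hedged sketch of the severity-default logic, covering only a few rows of the table. The keyword matching is illustrative; a real implementation would need fuller cue lists:

```python
# Explicit user cues win; otherwise intrinsically critical conditions
# promote; everything else defaults to warning and is surfaced in preview.
CRITICAL_CONDITIONS = (
    "oomkilled", "crashloopbackoff", "panic", "service down", "no-data",
)

def default_severity(intent: str) -> str:
    text = intent.lower()
    if "page" in text or "wake" in text:
        return "critical"  # explicit urgency cue overrides the table
    if "fyi" in text:
        return "info"      # explicit demotion cue
    if any(c in text for c in CRITICAL_CONDITIONS):
        return "critical"  # condition is intrinsically critical
    return "warning"       # ambiguous: warning, adjustable in the preview
```

The fallthrough to `warning` mirrors the rule above: when in doubt, under-promote and let the preview step correct it.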
OTel attribute names — always use semantic conventions: `service.name`, `host.name`, `k8s.namespace.name`, `deployment.environment.name`. Never `service`, `host`, or `env`.
Annotation templates — the on-call engineer sees the notification, not the alert config. A notification that says "Pod crash detected" with no service name, no count, and no value is nearly useless at 3am. Always include the moving values:
  • `summary` — single-line headline. Include the resource scope and the numeric value: `"checkoutservice error rate {{$value}}% above 3%"`.
  • `description` — longer message. Include `{{$value}}`, `{{$threshold}}`, the groupBy values (e.g. `{{$labels.service_name}}`), and a sentence on what to do or where to look. For count-based alerts include the count explicitly: `"{{$value}} crash log lines in the last 5 minutes from service {{$labels.service_name}}"`.
Use `{{$value}}` for the breaching value, `{{$threshold}}` for the target, and `{{$labels.<key>}}` for groupBy values (note SigNoz substitutes the dotted attribute name with underscores: `service.name` → `service_name`).
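The dot-to-underscore label substitution is mechanical but easy to get wrong when authoring templates by hand; a one-line helper makes it explicit. The helper name is hypothetical:

```python
# SigNoz substitutes dots with underscores in template label keys:
# the resource attribute service.name is referenced as $labels.service_name.
def label_key(attribute: str) -> str:
    return attribute.replace(".", "_")

# Building a description template for a groupBy on service.name.
template = (
    "{{$value}} crash log lines in the last 5 minutes "
    "from service {{$labels.%s}}" % label_key("service.name")
)
```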

Common query shapes

Three patterns cover most non-trivial alerts. The MCP resources above carry the full schema; these are quick references for the query block only.
Error rate — two queries + formula `A * 100 / B`. Set `disabled: true` on the component queries A and B so only the formula F1 renders in the alert chart and notification; the raw counts are intermediate, not the alert signal. Forgetting this clutters the alert preview with three series and confuses the on-call engineer reading the notification.
```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "traces",
        "disabled": true,
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "hasError = true" } } },
    { "type": "builder_query", "spec": { "name": "B", "signal": "traces",
        "disabled": true,
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "" } } },
    { "type": "builder_formula",
      "spec": { "name": "F1", "expression": "A * 100 / B" } }
  ],
  "selectedQueryName": "F1"
}
```
p99 latency — single trace query with `groupBy` for per-service breakdown. Threshold target is in nanoseconds (2s → 2000000000), `targetUnit: "ns"`:
```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "traces",
        "aggregations": [{ "expression": "p99(durationNano)" }],
        "groupBy": [{ "name": "service.name", "fieldContext": "resource",
                      "fieldDataType": "string" }] } }
  ]
}
```
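The seconds-to-nanoseconds conversion for trace thresholds is a frequent source of off-by-10^x targets; a tiny helper keeps it explicit (hypothetical helper, not part of the schema):

```python
# durationNano thresholds are plain integers in nanoseconds:
# a "2 second" p99 target must be written as 2000000000.
def seconds_to_ns(seconds: float) -> int:
    return int(seconds * 1_000_000_000)
```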
Log volume spike — count of error/fatal logs grouped by service:
```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "logs",
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "severity_text IN ('ERROR', 'FATAL')" },
        "groupBy": [{ "name": "service.name", "fieldContext": "resource",
                      "fieldDataType": "string" }] } }
  ]
}
```
For absent-data, anomaly, PromQL, and ClickHouse SQL alerts, read the `signoz://alert/examples` MCP resource for current shapes.

Step 5: Resolve notification channels

The skill must resolve at least one channel before save. An alert with no channels saves successfully and silently never notifies anyone — the second most common silent failure after bad queries.
  1. Call `signoz:signoz_list_notification_channels` to enumerate existing channels.
  2. If the user named a channel ("send to slack-infra"), use it if it exists; if not, fall through.
  3. Otherwise present the user with two options:
    • Pick from existing — list channels with their type (Slack, PagerDuty, email, webhook) so the user can choose.
    • Create new inline — call `signoz:signoz_create_notification_channel` with channel parameters the user provides (name, type, type-specific config like Slack webhook URL or PagerDuty integration key).
  4. If neither path resolves a channel, emit `needs_input: notification_channel` and stop.
For multi-severity alerts, attach channels per threshold: `thresholds.spec[N].channels` is an array — typically warning → Slack only, critical → Slack + PagerDuty.
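The resolution order in steps 1 to 4 can be sketched as follows; `resolve_channel` and its return shapes are hypothetical labels for the three outcomes:

```python
# Named channel if it exists -> use it; else offer existing channels;
# else emit needs_input and stop. Never save an alert with no channels.
def resolve_channel(requested, existing):
    if requested and requested in existing:
        return {"channels": [requested]}
    if existing:
        return {"choose_from": sorted(existing)}
    return {"needs_input": "notification_channel"}
```

The "create new inline" path would run before the final `needs_input` branch when the user supplies channel parameters.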

Step 6: Validate the threshold (would-have-fired count)

Step 2.5 already confirmed the underlying data exists. Step 6 is about the threshold — given the full proposed query (including formulas, groupBy, and unit conversions) and the proposed threshold, would this alert have fired a sensible number of times in the last hour?
  1. Run the alert's full primary query (or formula) over the last hour using:
    • `signoz:signoz_execute_builder_query` for builder/formula queries.
    • `signoz:signoz_query_metrics` for PromQL queries.
    • `signoz:signoz_aggregate_logs` / `signoz:signoz_aggregate_traces` if those fit better.
  2. Compute how many evaluation points in the last hour breached the proposed threshold. Surface this in the preview as "would have fired N times in the last 1h":
    • N = 0 → the threshold may be too loose or the gating too strict. Mention this so the user can adjust if intent was tighter.
    • N is large (e.g. > 30) → likely alert storm. Surface and recommend tightening or adding hysteresis (`recoveryTarget`).
    • N is small and non-zero → calibrated; proceed.
  3. Exceptions:
    • Anomaly alerts — skip the breach count entirely (Z-scores aren't directly comparable to raw values). Step 2.5 already verified the underlying metric × service has data; nothing more to validate here.
    • Log-based crash / panic / OOMKilled / FATAL alerts — these intentionally have zero matches in a healthy system. Step 2.5 has already surfaced the zero-match result and obtained user confirmation; skip the breach count.
If Step 2.5 was somehow skipped (e.g. a downstream skill is invoking this flow mid-stream), the no-data stop rule applies here too: empty result → emit `needs_input` instead of saving an alert that will never fire.
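The would-have-fired calibration can be sketched in a few lines. The 0 and 30 cutoffs come from the text above; the data points and function names are fabricated for illustration:

```python
# Count evaluation points breaching the proposed threshold, then classify
# the count into the three calibration buckets surfaced in the preview.
def breach_count(points, threshold, op="above"):
    if op == "above":
        return sum(1 for v in points if v > threshold)
    return sum(1 for v in points if v < threshold)

def calibration(n):
    if n == 0:
        return "maybe_too_loose"      # or gating too strict; mention it
    if n > 30:
        return "likely_alert_storm"   # recommend tightening / recoveryTarget
    return "calibrated"               # small and non-zero: proceed

# Fabricated error-rate series over the last hour, threshold 5%.
n = breach_count([0.3, 0.4, 6.2, 0.2, 7.9], threshold=5.0)
```

In the preview this surfaces as "would have fired N times in the last 1h" alongside the bucket's recommendation.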

Step 7: Preview the prepared config

Emit a fenced JSON code block containing the exact payload that will be sent to `signoz:signoz_create_alert`, plus a one-paragraph plain-language summary:

```json
{
  "alert": "<name>",
  "alertType": "...",
  "ruleType": "...",
  "condition": { ... },
  "labels": { "severity": "..." },
  "annotations": { "description": "...", "summary": "..." },
  "evaluation": { ... },
  "preferredChannels": ["..."]
}
```
Summary: This alert fires when [condition] for [resource scope], evaluated every [frequency] over the last [window]. Thresholds: warning at X, critical at Y. Notifications go to [channels]. Dry-run on the last hour: would have fired N times.
In autonomous mode the consumer proceeds. In interactive mode the human can intervene before Step 8.

Step 8: Save and report

  1. Call `signoz:signoz_create_alert` with the JSON payload from Step 7.
  2. Name collision — if `signoz:signoz_create_alert` returns a duplicate-name error, do not suffix-append or call `signoz:signoz_update_alert`. Stop and tell the user the existing alert blocked creation; offer to use a different name or modify the existing alert (which is out of scope for this skill).
  3. On success, report:
    • The alert ID and name.
    • What it watches and at what threshold.
    • Which channels are wired up.
    • The dry-run summary ("would have fired N times in last 1h").
    • Two follow-up offers: "Want to test the query live with signoz-generating-queries?" and "Want me to add a runbook URL to the annotations?"

Guardrails

  • Strict inputs over guessing. Resource attribute and channel are required. If missing, emit `needs_input` and stop. Creating an alert on a guessed service is harder to undo than asking.
  • Always paginate `signoz:signoz_list_alerts`. Stopping at page 1 misses duplicates and produces noise.
  • Dry-run is mandatory. Saving an alert whose query returns no data is a silent failure mode and must be prevented.
  • No duplicate updates. Name collision → error and stop. Do not silently update an existing alert from a "create" skill.
  • OTel attribute names only. `service.name`, not `service`.
  • Prefer readable threshold codes. `op: "above"` and `matchType: "on_average"` over the legacy numeric codes (`"1"`, `"3"`).
  • Signal must match alertType. `signal: "logs"` requires `LOGS_BASED_ALERT`. Mismatches fail validation.
  • Anomaly rules are metrics-only. `anomaly_rule` + non-metric alertType is rejected.
  • Channels must exist. Use names from `signoz:signoz_list_notification_channels` exactly, or create the channel inline first.
  • Scope boundary. This skill only creates new rules. Modifications use `signoz:signoz_update_alert` directly.

Examples

User: "Alert me when CPU usage on the checkout service goes above 80%, page me at 90%"
Agent:
  1. Intent parsed: metric, CPU, scope `service.name = checkout`, thresholds 80% (warning) / 90% (critical), severity bumped to critical on the higher level because "page me" was used.
  2. `signoz:signoz_list_metrics searchText=cpu` → confirms `system.cpu.utilization`.
  3. `signoz:signoz_list_alerts` (paginated) → no existing CPU alert for checkout.
  4. `signoz:signoz_list_notification_channels` → presents existing channels; user picks `slack-infra` for warning and `pagerduty-oncall` for critical.
  5. Builds JSON: `METRIC_BASED_ALERT`, `threshold_rule`, `signal=metrics`, two thresholds (`op="above"`, `matchType="on_average"`, `targetUnit="percent"`), filter `service.name = 'checkout'`.
  6. Dry-run via `signoz:signoz_execute_builder_query` over last 1h: returns data, would have fired 0 times (clean baseline).
  7. Emits JSON preview + summary.
  8. Calls `signoz:signoz_create_alert`. Reports created alert with ID, threshold summary, channel routing, and dry-run result.

User: "Set up an alert when error rate goes above 5% on payments"
Agent:
  1. Intent parsed: traces, error-rate formula, scope `service.name = payments`, threshold 5%, severity default warning.
  2. `signoz:signoz_get_field_keys fieldContext=resource signal=traces` confirms `service.name`. `signoz:signoz_get_field_values` confirms `payments` exists.
  3. No existing payments error-rate alert.
  4. Builds formula alert: query A counts spans with `hasError = true` for `service.name = 'payments'`, query B counts all spans for the same service, formula F1 = `A * 100 / B`, `selectedQueryName: "F1"`, threshold target 5, `targetUnit: "percent"`, `op: "1"`, `matchType: "1"` (catch any breach).
  5. Channel: user picks `slack-payments`.
  6. Dry-run on the last 1h: payments error rate hovered around 0.3%, would have fired 0 times. Clean; not too tight.
  7. Preview, save, report.
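The A/B/F1 layout from step 4 can be sketched the same way. Hedged sketch: the query letters, filters, formula, `selectedQueryName`, and threshold codes come from the transcript; every other key name is an illustrative assumption, and the `signoz://alert/examples` MCP resource shows the real shape.

```python
# Hedged sketch of the error-rate formula alert from step 4. Query A
# counts errored spans, query B counts all spans, and the alert
# evaluates the formula F1 rather than either raw query.

def build_error_rate_alert(service: str, target_pct: float) -> dict:
    scope = f"service.name = '{service}'"
    return {
        "signal": "traces",
        "queries": {
            # A: errored spans only; B: all spans for the same service.
            "A": {"aggregation": "count()",
                  "filter": f"{scope} AND hasError = true"},
            "B": {"aggregation": "count()",
                  "filter": scope},
        },
        "formulas": {"F1": "A * 100 / B"},   # error rate as a percentage
        "selectedQueryName": "F1",           # threshold applies to F1
        "threshold": {
            "target": target_pct,
            "targetUnit": "percent",
            "op": "1",         # above
            "matchType": "1",  # "1": fire on any single breach
        },
    }

payload = build_error_rate_alert("payments", 5)
```

Pointing `selectedQueryName` at the formula is the load-bearing choice here: without it the threshold would compare against a raw span count instead of a ratio.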

User: "Notify me if log volume spikes more than 1000 error logs per minute on any service"
Agent:
  1. Intent parsed: logs, volume threshold, scope = all services (groupBy `service.name`), threshold 1000/min, severity warning.
  2. `signoz:signoz_get_field_keys signal=logs fieldContext=resource` confirms `service.name` exists; no service filter required since the user said "any service", so the groupBy provides the scoping.
  3. Builds: `LOGS_BASED_ALERT`, `signal=logs`, `aggregations: [{expression: "count()"}]`, `filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}`, `groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}]`, threshold 1000, `targetUnit: ""`, `evalWindow: 1m0s`, `matchType: "1"` (catch any minute that breaches).
  4. Channels: user picks a Slack channel.
  5. Dry-run: returned per-service counts; the max in the last 1h was 87, so it would have fired 0 times. Within reasonable headroom.
  6. Preview, save, report.
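Step 3 quotes most of this config verbatim, so a sketch is nearly mechanical. Hedged as before: the `aggregations`, `filter`, `groupBy`, `evalWindow`, and threshold codes are from the transcript; the top-level layout is an assumption.

```python
# Hedged sketch of the per-service log-volume alert from step 3.
# The fragments in braces are quoted from the transcript; the
# surrounding structure is illustrative.

def build_log_volume_alert(limit_per_minute: int) -> dict:
    return {
        "alertType": "LOGS_BASED_ALERT",
        "signal": "logs",
        "aggregations": [{"expression": "count()"}],
        "filter": {"expression": "severity_text IN ('ERROR', 'FATAL')"},
        # No service filter: grouping by service.name evaluates every
        # service independently, matching the "any service" intent.
        "groupBy": [{"name": "service.name",
                     "fieldContext": "resource",
                     "fieldDataType": "string"}],
        "evalWindow": "1m0s",               # evaluate minute by minute
        "threshold": {"target": limit_per_minute,
                      "targetUnit": "",     # raw count, no unit conversion
                      "op": "1",
                      "matchType": "1"},    # any breaching minute fires
    }

payload = build_log_volume_alert(1000)
```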

User: "Alert me on anomalous request latency for the api gateway"
Agent:
  1. Intent parsed: metric, anomaly detection, scope `service.name = api-gateway`. Anomaly detection requires `METRIC_BASED_ALERT` + `anomaly_rule`.
  2. `signoz:signoz_list_metrics searchText=duration` → finds `http.server.request.duration`.
  3. Builds: `anomaly_rule`, `algorithm=zscore`, `seasonality=daily`, threshold target 3 (3 standard deviations), `op: "1"`, `matchType: "1"`.
  4. Channel: user picks `slack-api`.
  5. Dry-run validates that the query returns data. Skip breach-count for anomaly alerts.
  6. Preview emphasizes that the threshold is in standard deviations, not raw latency. Save, report.
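The anomaly variant differs from the threshold examples only in the rule type and the meaning of the target. Hedged sketch: `anomaly_rule`, `algorithm=zscore`, `seasonality=daily`, and the standard-deviation target are from step 3; the key names and nesting are illustrative assumptions, with `signoz://alert/instructions` authoritative.

```python
# Hedged sketch of the anomaly rule from step 3. Note the target is a
# z-score (standard deviations from the learned baseline), NOT a raw
# latency value.

def build_latency_anomaly_alert(service: str) -> dict:
    return {
        "alertType": "METRIC_BASED_ALERT",
        "ruleType": "anomaly_rule",      # anomaly detection, not a fixed threshold
        "metric": "http.server.request.duration",
        "filter": f"service.name = '{service}'",
        "algorithm": "zscore",
        "seasonality": "daily",          # baseline repeats on a daily cycle
        "threshold": {
            "target": 3,                 # 3 standard deviations
            "op": "1",
            "matchType": "1",
        },
    }

payload = build_latency_anomaly_alert("api-gateway")
```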
用户:“当checkout服务的CPU使用率超过80%时提醒我,超过90%时呼叫我”
Agent:
  1. 解析指令:指标、CPU、范围 `service.name = checkout`、阈值80%(warning)/90%(critical),因用户提及“呼叫我”,将高级别阈值的严重级别升级为critical。
  2. `signoz:signoz_list_metrics searchText=cpu` → 确认 `system.cpu.utilization` 指标存在。
  3. `signoz:signoz_list_alerts`(遍历所有分页)→ 无针对checkout服务的现有CPU告警。
  4. `signoz:signoz_list_notification_channels` → 展示现有渠道;用户选择 `slack-infra` 作为warning级别渠道,`pagerduty-oncall` 作为critical级别渠道。
  5. 构建JSON:`METRIC_BASED_ALERT`、`threshold_rule`、`signal=metrics`、两个阈值(`op="1"`、`matchType="3"` 即on_average、`targetUnit="percent"`)、过滤器 `service.name = 'checkout'`。
  6. 通过 `signoz:signoz_execute_builder_query` 在过去1小时内模拟运行:返回数据,会触发0次(基线正常)。
  7. 输出JSON预览+摘要。
  8. 调用 `signoz:signoz_create_alert`。报告已创建的告警ID、阈值摘要、渠道路由和模拟运行结果。

用户:“为payments服务设置错误率超过5%时的告警”
Agent:
  1. 解析指令:链路追踪、错误率公式、范围 `service.name = payments`、阈值5%、默认严重级别warning。
  2. `signoz:signoz_get_field_keys fieldContext=resource signal=traces` 确认 `service.name` 存在。`signoz:signoz_get_field_values` 确认 `payments` 服务存在。
  3. 无针对payments服务的现有错误率告警。
  4. 构建公式告警:查询A统计 `service.name = 'payments'` 中 `hasError = true` 的Span数量,查询B统计同一服务的所有Span数量,公式F1 = `A * 100 / B`,`selectedQueryName: "F1"`,阈值目标5,`targetUnit: "percent"`、`op: "1"`、`matchType: "1"`(捕获任何一次突破)。
  5. 渠道:用户选择 `slack-payments`。
  6. 过去1小时模拟运行:payments服务错误率约为0.3%,会触发0次。阈值合理,不过于严格。
  7. 预览、保存、报告。

用户:“当任何服务的错误日志量每分钟超过1000条时通知我”
Agent:
  1. 解析指令:日志、量阈值、范围=所有服务(按 `service.name` 分组)、阈值1000条/分钟、严重级别warning。
  2. `signoz:signoz_get_field_keys signal=logs fieldContext=resource` 确认 `service.name` 存在;无需服务过滤器,因用户要求“任何服务”→ groupBy提供范围划分。
  3. 构建配置:`LOGS_BASED_ALERT`、`signal=logs`、`aggregations: [{expression: "count()"}]`、`filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}`、`groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}]`、阈值1000、`targetUnit: ""`、`evalWindow: 1m0s`、`matchType: "1"`(捕获任何突破的分钟)。
  4. 渠道:用户选择Slack渠道。
  5. 模拟运行:返回按服务统计的数量,过去1小时内最大值为87,会触发0次。阈值留有合理余量。
  6. 预览、保存、报告。

用户:“针对api网关的异常请求延迟触发告警”
Agent:
  1. 解析指令:指标、异常检测、范围 `service.name = api-gateway`。异常检测需要 `METRIC_BASED_ALERT` + `anomaly_rule`。
  2. `signoz:signoz_list_metrics searchText=duration` → 找到 `http.server.request.duration` 指标。
  3. 构建配置:`anomaly_rule`、`algorithm=zscore`、`seasonality=daily`、阈值目标3(3个标准差)、`op: "1"`、`matchType: "1"`。
  4. 渠道:用户选择 `slack-api`。
  5. 模拟运行验证查询返回数据。异常告警跳过触发次数计算。
  6. 预览中强调阈值为标准差而非原始延迟值。保存、报告。

Additional resources

额外资源

  • `signoz://alert/instructions` and `signoz://alert/examples` MCP resources: full alert config JSON schema, threshold codes, filter expression syntax, and version-current pattern examples. Always preferred over any transcribed copy.
  • `signoz-generating-queries` skill: for authoring PromQL or testing queries before wrapping them in an alert.
  • `signoz://alert/instructions` 和 `signoz://alert/examples` MCP资源:完整的告警配置JSON schema、阈值代码、过滤器表达式语法以及版本同步的模式示例。始终优先使用这些资源,而非任何转录副本。
  • `signoz-generating-queries` 技能:用于编写PromQL或在封装为告警前测试查询。