signoz-creating-alerts
Build a SigNoz alert from a user's natural-language intent. The skill targets
two consumers: an autonomous AI SRE agent that runs without a human in the
loop, and a human at a Claude Code / Codex / Cursor prompt. Both go through
the same flow — the human just gets a chance to intervene at the preview step.
Prerequisites
This skill calls SigNoz MCP server tools (`signoz:signoz_create_alert`, `signoz:signoz_list_alerts`, `signoz:signoz_get_field_keys`, etc.). Before running the workflow, confirm the `signoz:signoz_*` tools are available. If they are not, the SigNoz MCP server is not installed or configured — stop and direct the user to set it up: https://signoz.io/docs/ai/signoz-mcp-server/. Do not try to fall back to raw HTTP calls or fabricate alert configs without the MCP tools.
When to use
Use this skill when the user wants to:
- Create, set up, or configure a new alert rule.
- Get paged or notified when a metric, log volume, latency, or error rate crosses a threshold.
- Detect anomalous behavior on a service, host, or signal.
- Catch silent data loss ("alert if data stops arriving from X").
Do NOT use when the user wants to:
- Understand what an existing alert monitors → `signoz-explaining-alerts`.
- Diagnose why an existing alert fired → `signoz-investigating-alerts`.
- Modify thresholds, queries, or routing on an existing alert → call `signoz:signoz_update_alert` directly.
Required inputs (strict)
Alert creation is a write operation against a shared system. Guessing here
creates noisy alerts on the wrong service that someone else has to clean up.
The skill enforces a strict input contract:
| Input | Required | Source if missing |
|---|---|---|
| Alert intent (NL goal) | yes | none — emit `needs_input` |
| Resource attribute filter (e.g. `service.name`) | yes | discover via `signoz:signoz_get_field_keys` / `signoz:signoz_get_field_values` (Step 2) |
| Threshold value(s) | inferred from intent | derive a sensible default and surface in the preview |
| Severity | inferred from intent | default `warning` |
| Notification channel | yes | list via `signoz:signoz_list_notification_channels` or create one (Step 5) |
If a required input is missing and cannot be discovered, emit a structured `needs_input` block and stop before calling any write tool:

```text
needs_input:
  missing:
    - resource_attribute_filter: "no service or host specified — pick one"
  candidates:
    service.name: ["frontend", "checkout", "payments", "inventory"]
    host.name: ["prod-api-1", "prod-api-2", "prod-db-1"]
```

In interactive mode, the human picks from candidates. In autonomous mode, the caller fills the gap from upstream context or escalates. Either way, do not proceed to `signoz:signoz_create_alert` with a guessed value.
Workflow
Step 1: Parse intent and check what's missing
Extract from the user's request:
- What to monitor — signal type (metrics / logs / traces / exceptions) and the specific condition (CPU, error rate, p99 latency, log count, ...).
- Resource scope — which service, host, namespace, or environment.
- Threshold — numeric value and comparison ("above 80%", "below 100/s").
- Severity — implicit from urgency words ("page" → critical, default warning otherwise).
- Channel — explicit channel name if the user provided one.
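A quick sketch of what the parsed intent might look like for the checkout-CPU request in the Examples section below — an illustrative working structure for the skill's own bookkeeping, not a SigNoz wire format:

```json
{
  "monitor": { "signal": "metrics", "condition": "CPU utilization above threshold" },
  "scope": { "service.name": "checkout" },
  "thresholds": { "warning": 80, "critical": 90 },
  "severity": "critical",
  "channel": null
}
```

A `null` channel here is exactly the kind of gap that Step 5 (or a `needs_input` block) has to close before anything is saved.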
Map signal phrasing to alert type:
| User says | alertType | signal |
|---|---|---|
| metric, CPU, memory, latency, request rate | METRIC_BASED_ALERT | metrics |
| log, error logs, log volume, log pattern | LOGS_BASED_ALERT | logs |
| trace, span, latency p99, slow requests | TRACES_BASED_ALERT | traces |
| exception, stack trace, crash | EXCEPTIONS_BASED_ALERT | (clickhouse_sql) |
If resource scope is missing, run discovery (Step 2). If still missing after discovery, emit `needs_input` and stop.
Step 2: Discover resource attributes and metric names
When the user does not name a service / host / namespace, the SigNoz MCP
guideline applies: always prefer a resource-attribute filter. Discover
candidates instead of guessing:
- Call `signoz:signoz_get_field_keys` with `fieldContext=resource` to enumerate resource attributes for the chosen signal.
- Call `signoz:signoz_get_field_values` for the most likely attribute (typically `service.name`, then `host.name`, then `k8s.namespace.name`) to get concrete values.
- If the user mentioned a metric by name, call `signoz:signoz_list_metrics` with a search term to verify the exact OTel metric name. Wrong names create alerts that never fire.

Surface the candidates in the `needs_input` block. Do not pick one.
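For the common case where the user names nothing, the two discovery calls look roughly like this, reusing the parameter notation this skill uses elsewhere — the exact tool signatures come from the SigNoz MCP server itself:

```text
signoz:signoz_get_field_keys   signal=metrics fieldContext=resource
signoz:signoz_get_field_values signal=metrics name=service.name
```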
Step 2.5: Probe data existence for the chosen filter (fail fast)
Before authoring any alert config, confirm the specific combination the
alert will watch (metric × service × any other filter) actually emits data.
The most common silent failure is "metric exists in the catalog and the
service exists in the catalog, but the service doesn't emit that metric"
— each piece checks out in isolation, the alert saves successfully, and it
silently never fires.
Run a single probe over the last 1 hour using the same filter the alert
will use, but with the simplest aggregation that confirms data exists:
- Metrics: `signoz:signoz_query_metrics` (PromQL) or `signoz:signoz_execute_builder_query` with `count()` (or `count_distinct(service.name)` if scope-discovering).
- Logs: `signoz:signoz_aggregate_logs` with `count()` over the filter.
- Traces: `signoz:signoz_aggregate_traces` with `count()` over the filter.
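As a concrete sketch, a log-signal probe scoped to a hypothetical checkout service can reuse the builder-query shape shown under "Common query shapes"; only the query block is shown here, and the one-hour window is assumed to be passed through the tool's own time-range parameters:

```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "logs",
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "service.name = 'checkout'" } } }
  ]
}
```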
Inspect the result:
- Probe returns rows → proceed to Step 3.
- Probe returns empty → STOP. Do not build an alert config the user will then be asked to throw away. Emit a `needs_input` block describing what was missing and offer concrete recovery:
  - Service doesn't emit the metric → call `signoz:signoz_get_field_values signal=metrics name=service.name metricName=<metric>` to list the services that do emit it; let the user pick a different service or a different metric.
  - Wrong attribute name (`service` instead of `service.name`) → suggest the semantic-convention name and re-probe.
  - Service emits the metric but not in the expected time range → widen the probe window once (e.g. last 24h) before declaring no-data.
Exception — log-based crash / panic / OOMKilled / FATAL alerts. These
intentionally have zero matches in a healthy system. The probe will
return empty by design. Do not stop; instead, surface the zero-match
result and ask the user to confirm before save. Treat this exception
narrowly: it applies to "alert me when bad thing happens" log queries,
not to alerts that depend on continuous data flow.
This probe is cheap (one query, ~100ms), and catching the no-data case
early avoids the worst UX failure mode of this skill — the user reading
through a fully-authored JSON payload and only then learning the alert
can never fire.
Step 3: Check for duplicate alerts
Call `signoz:signoz_list_alerts` and paginate through every page — `pagination.hasMore` is true until you have walked the full list. Check for existing alerts that match the user's intent (same signal + same scope + similar threshold). If a likely duplicate exists, surface it and ask whether to create a new one anyway, modify the existing one (out of scope here — use `signoz:signoz_update_alert`), or cancel.
Step 4: Build the alert config
The MCP server is the source of truth for the alert JSON schema, threshold codes, and validation rules. Read the `signoz://alert/instructions` and `signoz://alert/examples` MCP resources for the canonical, version-current shape. Do not transcribe schema text into this skill — it will rot out of sync with the server.
For most user intents, the config is one of a small number of patterns:
| Pattern | Where to author | Example intents |
|---|---|---|
| Single-metric threshold | inline (this skill) | "alert when CPU > 80%", "p99 latency > 2s" |
| Log volume threshold | inline | "more than N error logs/min" |
| Trace-based count or p-tile | inline | "p99 span duration > 2s on checkout" |
| Error-rate formula (A/B*100) | inline (see "Common query shapes" below) | "error rate > 5%" |
| Anomaly detection (Z-score) | inline, but metrics-only (`anomaly_rule`) | "alert me on anomalous traffic" |
| Absent-data alert | inline | "alert if data stops arriving" |
| ClickHouse SQL alert | inline (this skill) — author the SQL directly using the schema in the MCP resources above | non-trivial joins, custom aggregations the builder cannot express |
| PromQL alert | delegate to `signoz-generating-queries` | when user already has PromQL |
Threshold `op` and `matchType` values. v2alpha1 accepts the human-readable strings (`"above"`, `"on_average"`); the legacy numeric codes (`"1"`, `"3"`) are also accepted but harder to read in the UI. Prefer the words. Anomaly rules only support `op: "above"` — the engine already treats z-score breaches as two-sided when the threshold is positive, so `"above_or_below"` is rejected and unnecessary.
| Comparison | op | Evaluation behavior | matchType |
|---|---|---|---|
| above / exceeds / > | `"1"` | breach at any point | `"1"` (at_least_once) |
| below / under / < | `"2"` | breach for entire window | `"2"` |
| equal / = | `"3"` | average breaches | `"3"` (on_average) |
| not equal / != | `"4"` | sum breaches | `"4"` |
| | | last value breaches | `"5"` |
Defaults the skill applies (and surfaces in the preview):
- `evalWindow: 5m0s`, `frequency: 1m0s` — change only if the intent implies a slower or faster cadence.
- `matchType: "3"` (on_average) for CPU / memory / latency — smooths transient spikes.
- `matchType: "1"` (at_least_once) for error counts / error rates — catches any breach.
Severity defaults — derive from the intrinsic urgency of the alert, not just the user's words. The user saying "alert me" doesn't force `warning` when the condition itself describes a critical event. Use this table; an explicit user cue overrides it ("just FYI" → demote, "page me" / "wake me up" → promote).
| Alert intent | Default severity |
|---|---|
| Pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL log signals | critical |
| Service down / no-data on a production service | critical |
| Error rate above any non-trivial threshold (>1%) | critical |
| Error logs / exception spikes | warning |
| Latency degradation (p95/p99 above target) | warning |
| CPU / memory / disk pressure | warning |
| Request-rate / traffic anomaly | warning |
| SLO budget burn (info-level burn rate) | info / warning |
When the user's intent is ambiguous on severity (no urgency cue, no clearly-critical condition), default to `warning` and surface the choice in the preview so they can adjust.
OTel attribute names — always use semantic conventions: `service.name`, `host.name`, `k8s.namespace.name`, `deployment.environment.name`. Never `service`, `host`, or `env`.
Annotation templates — the on-call engineer sees the notification, not
the alert config. A notification that says "Pod crash detected" with no
service name, no count, and no value is nearly useless at 3am. Always
include the moving values:
- `summary` — single-line headline. Include the resource scope and the numeric value: `"checkoutservice error rate {{$value}}% above 3%"`.
- `description` — longer message. Include `{{$value}}`, `{{$threshold}}`, the groupBy values (e.g. `{{$labels.service_name}}`), and a sentence on what to do or where to look. For count-based alerts include the count explicitly: `"{{$value}} crash log lines in the last 5 minutes from service {{$labels.service_name}}"`.
Use `{{$value}}` for the breaching value, `{{$threshold}}` for the target, and `{{$labels.<key>}}` for groupBy values (note SigNoz substitutes the dotted attribute name with underscores: `service.name` → `service_name`).
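Putting the two templates together for the checkout error-rate example, the annotations block might look like this — the wording is illustrative, but the template variables are the ones described above:

```json
{
  "annotations": {
    "summary": "checkoutservice error rate {{$value}}% above {{$threshold}}%",
    "description": "{{$value}}% of requests from service {{$labels.service_name}} failed over the evaluation window (threshold {{$threshold}}%). Check recent deploys and the service dashboard before escalating."
  }
}
```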
Common query shapes
Three patterns cover most non-trivial alerts. The MCP resources above carry
the full schema; these are quick references for the query block only.
Error rate — two queries + formula `A * 100 / B`. Set `disabled: true` on the component queries A and B so only the formula F1 renders in the alert chart and notification — the raw counts are intermediate, not the alert signal. Forgetting this clutters the alert preview with three series and confuses the on-call engineer reading the notification.

```json
{
"queries": [
{ "type": "builder_query", "spec": { "name": "A", "signal": "traces",
"disabled": true,
"aggregations": [{ "expression": "count()" }],
"filter": { "expression": "hasError = true" } } },
{ "type": "builder_query", "spec": { "name": "B", "signal": "traces",
"disabled": true,
"aggregations": [{ "expression": "count()" }],
"filter": { "expression": "" } } },
{ "type": "builder_formula",
"spec": { "name": "F1", "expression": "A * 100 / B" } }
],
"selectedQueryName": "F1"
}
```

p99 latency — single trace query with `groupBy` for per-service breakdown. Threshold target is in nanoseconds (2s → 2000000000), `targetUnit: "ns"`:

```json
{
"queries": [
{ "type": "builder_query", "spec": { "name": "A", "signal": "traces",
"aggregations": [{ "expression": "p99(durationNano)" }],
"groupBy": [{ "name": "service.name", "fieldContext": "resource",
"fieldDataType": "string" }] } }
]
}
```

Log volume spike — count of error/fatal logs grouped by service:

```json
{
"queries": [
{ "type": "builder_query", "spec": { "name": "A", "signal": "logs",
"aggregations": [{ "expression": "count()" }],
"filter": { "expression": "severity_text IN ('ERROR', 'FATAL')" },
"groupBy": [{ "name": "service.name", "fieldContext": "resource",
"fieldDataType": "string" }] } }
]
}
```

For absent-data, anomaly, PromQL, and ClickHouse SQL alerts, read the `signoz://alert/examples` MCP resource for current shapes.
Step 5: Resolve notification channels
The skill must resolve at least one channel before save. An alert with no
channels saves successfully and silently never notifies anyone — the second
most common silent failure after bad queries.
- Call `signoz:signoz_list_notification_channels` to enumerate existing channels.
- If the user named a channel ("send to slack-infra"), use it if it exists; if not, fall through.
- Otherwise present the user with two options:
  - Pick from existing — list channels with their type (Slack, PagerDuty, email, webhook) so the user can choose.
  - Create new inline — call `signoz:signoz_create_notification_channel` with channel parameters the user provides (name, type, type-specific config like Slack webhook URL or PagerDuty integration key).
- If neither path resolves a channel, emit `needs_input: notification_channel` and stop.
For multi-severity alerts, attach channels per threshold: `thresholds.spec[N].channels` is an array — typically warning → Slack only, critical → Slack + PagerDuty.
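A sketch of how that per-threshold routing could look — only `thresholds.spec[N].channels` is taken from the text above; the surrounding field names (`name`, `target`) are assumptions to verify against `signoz://alert/instructions`:

```json
{
  "thresholds": {
    "spec": [
      { "name": "warning",  "target": 80, "channels": ["slack-infra"] },
      { "name": "critical", "target": 90, "channels": ["slack-infra", "pagerduty-oncall"] }
    ]
  }
}
```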
Step 6: Validate the threshold (would-have-fired count)
Step 2.5 already confirmed the underlying data exists. Step 6 is about
the threshold — given the full proposed query (including formulas,
groupBy, and unit conversions) and the proposed threshold, would this
alert have fired a sensible number of times in the last hour?
- Run the alert's full primary query (or formula) over the last hour using:
  - `signoz:signoz_execute_builder_query` for builder/formula queries.
  - `signoz:signoz_query_metrics` for PromQL queries.
  - `signoz:signoz_aggregate_logs` / `signoz:signoz_aggregate_traces` if those fit better.
- Compute how many evaluation points in the last hour breached the proposed threshold. Surface this in the preview as **"would have fired N times in the last 1h"**:
  - N = 0 → the threshold may be too loose or the gating too strict. Mention this so the user can adjust if intent was tighter.
  - N is large (e.g. > 30) → likely alert storm. Surface and recommend tightening or adding hysteresis (`recoveryTarget`).
  - N is small and non-zero → calibrated; proceed.
- Exceptions:
  - Anomaly alerts — skip the breach count entirely (Z-scores aren't directly comparable to raw values). Step 2.5 already verified the underlying metric × service has data; nothing more to validate here.
  - Log-based crash / panic / OOMKilled / FATAL alerts — these intentionally have zero matches in a healthy system. Step 2.5 has already surfaced the zero-match result and obtained user confirmation; skip the breach count.
If Step 2.5 was somehow skipped (e.g. a downstream skill is invoking this flow mid-stream), the no-data stop rule applies here too: empty result → emit `needs_input` instead of saving an alert that will never fire.
Step 7: Preview the prepared config
Emit a fenced JSON code block containing the exact payload that will be sent to `signoz:signoz_create_alert`, plus a one-paragraph plain-language summary:

```json
{
"alert": "<name>",
"alertType": "...",
"ruleType": "...",
"condition": { ... },
"labels": { "severity": "..." },
"annotations": { "description": "...", "summary": "..." },
"evaluation": { ... },
"preferredChannels": ["..."]
}
```

Summary: This alert fires when [condition] for [resource scope], evaluated every [frequency] over the last [window]. Thresholds: warning at X, critical at Y. Notifications go to [channels]. Dry-run on the last hour: would have fired N times.
In autonomous mode the consumer proceeds. In interactive mode the human can
intervene before Step 8.
Step 8: Save and report
- Call `signoz:signoz_create_alert` with the JSON payload from Step 7.
- Name collision — if `signoz:signoz_create_alert` returns a duplicate-name error, do not suffix-append or call `signoz:signoz_update_alert`. Stop and tell the user the existing alert blocked creation; offer to use a different name or modify the existing alert (which is out of scope for this skill).
- On success, report:
  - The alert ID and name.
  - What it watches and at what threshold.
  - Which channels are wired up.
  - The dry-run summary ("would have fired N times in last 1h").
  - Two follow-up offers: "Want to test the query live with `signoz-generating-queries`?" and "Want me to add a runbook URL to the annotations?"
Guardrails
- Strict inputs over guessing. Resource attribute and channel are required. If missing, emit `needs_input` and stop. Creating an alert on a guessed service is harder to undo than asking.
- Always paginate `signoz:signoz_list_alerts`. Stopping at page 1 misses duplicates and produces noise.
- Dry-run is mandatory. Saving an alert whose query returns no data is a silent failure mode and must be prevented.
- No duplicate updates. Name collision → error and stop. Do not silently update an existing alert from a "create" skill.
- OTel attribute names only. `service.name`, not `service`.
- Threshold codes are strings. `op: "1"` and `op: "above"` are both accepted; never pass a bare number or an unquoted word.
- Signal must match alertType. `LOGS_BASED_ALERT` requires `signal: "logs"`. Mismatches fail validation.
- Anomaly rules are metrics-only. `anomaly_rule` + non-metric alertType is rejected.
- Channels must exist. Use names from `signoz:signoz_list_notification_channels` exactly, or create the channel inline first.
- Scope boundary. This skill only creates new rules. Modifications use `signoz:signoz_update_alert` directly.
Examples
User: "Alert me when CPU usage on the checkout service goes above 80%, page me at 90%"
Agent:
- Intent parsed: metric, CPU, scope , thresholds 80% (warning) / 90% (critical), severity bumped to critical on the higher level because "page me" was used.
service.name = checkout - → confirms
signoz:signoz_list_metrics searchText=cpu.system.cpu.utilization - (paginated) → no existing CPU alert for checkout.
signoz:signoz_list_alerts - → presents existing channels; user picks
signoz:signoz_list_notification_channelsfor warning andslack-infrafor critical.pagerduty-oncall - Builds JSON: ,
METRIC_BASED_ALERT,threshold_rule, two thresholds (signal=metrics,op="1"on_average,matchType="3"), filtertargetUnit="percent".service.name = 'checkout' - Dry-run via over last 1h: returns data, would have fired 0 times (clean baseline).
signoz:signoz_execute_builder_query - Emits JSON preview + summary.
- Calls . Reports created alert with ID, threshold summary, channel routing, and dry-run result.
signoz:signoz_create_alert
User: "Set up an alert when error rate goes above 5% on payments"
Agent:
- Intent parsed: traces, error-rate formula, scope
, threshold 5%, severity default warning.
service.name = payments - confirms
signoz:signoz_get_field_keys fieldContext=resource signal=traces.service.nameconfirmssignoz:signoz_get_field_valuesexists.payments - No existing payments error-rate alert.
- Builds formula alert: query A counts spans with for
hasError = true, query B counts all spans for the same service, formula F1 =service.name = 'payments',A * 100 / B, threshold target 5,selectedQueryName: "F1",targetUnit: "percent",op: "1"(catch any breach).matchType: "1" - Channel: user picks .
slack-payments - Dry-run on last 1h: payments error rate hovered around 0.3%, would have fired 0 times. Clean — not too tight.
- Preview, save, report.
User: "Notify me if log volume spikes more than 1000 error logs per minute on any service"
Agent:
- Intent parsed: logs, volume threshold, scope = all services (groupBy
), threshold 1000/min, severity warning.
service.name - confirms
signoz:signoz_get_field_keys signal=logs fieldContext=resourceexists; no service filter required since the user said "any service" → groupBy provides the scoping.service.name - Builds: ,
LOGS_BASED_ALERT,signal=logs,aggregations: [{expression: "count()"}],filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}, threshold 1000,groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}],targetUnit: "",evalWindow: 1m0s(catch any minute that breaches).matchType: "1" - Channels: user picks slack channel.
- Dry-run: returned per-service counts, max in last 1h was 87 — would have fired 0 times. Within reasonable headroom.
- Preview, save, report.
User: "Alert me on anomalous request latency for the api gateway"
Agent:
- Intent parsed: metric, anomaly detection, scope
. Anomaly detection requires
service.name = api-gateway+METRIC_BASED_ALERT.anomaly_rule - → finds
signoz:signoz_list_metrics searchText=duration.http.server.request.duration - Builds: ,
anomaly_rule,algorithm=zscore, threshold target 3 (3 standard deviations),seasonality=daily,op: "1".matchType: "1" - Channel: user picks slack-api.
- Dry-run validates query returns data. Skip breach-count for anomaly alerts.
- Preview emphasizes that the threshold is in standard deviations, not raw latency. Save, report.
用户:“当checkout服务的CPU使用率超过80%时提醒我,超过90%时呼叫我”
Agent:
- 解析指令:指标、CPU、范围、阈值80%(warning)/90%(critical),因用户提及“呼叫我”,将高级别阈值的严重级别升级为critical。
service.name = checkout - → 确认
signoz:signoz_list_metrics searchText=cpu指标存在。system.cpu.utilization - (遍历所有分页)→ 无针对checkout服务的现有CPU告警。
signoz:signoz_list_alerts - → 展示现有渠道;用户选择
signoz:signoz_list_notification_channels作为warning级别渠道,slack-infra作为critical级别渠道。pagerduty-oncall - 构建JSON:、
METRIC_BASED_ALERT、threshold_rule、两个阈值(signal=metrics、op="1"即on_average、matchType="3")、过滤器targetUnit="percent"。service.name = 'checkout' - 通过在过去1小时内模拟运行:返回数据,会触发0次(基线正常)。
signoz:signoz_execute_builder_query - 输出JSON预览+摘要。
- 调用。报告已创建的告警ID、阈值摘要、渠道路由和模拟运行结果。
signoz:signoz_create_alert
用户:“为payments服务设置错误率超过5%时的告警”
Agent:
- 解析指令:链路追踪、错误率公式、范围、阈值5%、默认严重级别warning。
service.name = payments - 确认
signoz:signoz_get_field_keys fieldContext=resource signal=traces存在。service.name确认signoz:signoz_get_field_values服务存在。payments - 无针对payments服务的现有错误率告警。
- 构建公式告警:查询A统计且
service.name = 'payments'的Span数量,查询B统计同一服务的所有Span数量,公式F1=hasError = true,A * 100 / B,阈值目标5,selectedQueryName: "F1",targetUnit: "percent",op: "1"(捕获任何一次突破)。matchType: "1" - 渠道:用户选择。
slack-payments - 过去1小时模拟运行:payments服务错误率约为0.3%,会触发0次。阈值合理——不过于严格。
- 预览、保存、报告。
用户:“当任何服务的错误日志量每分钟超过1000条时通知我”
Agent:
- 解析指令:日志、量阈值、范围=所有服务(按分组)、阈值1000条/分钟、严重级别warning。
service.name - 确认
signoz:signoz_get_field_keys signal=logs fieldContext=resource存在;无需服务过滤器,因用户要求“任何服务”→groupBy提供范围划分。service.name - 构建配置:、
LOGS_BASED_ALERT、signal=logs、aggregations: [{expression: "count()"}]、filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}、阈值1000、groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}]、targetUnit: ""、evalWindow: 1m0s(捕获任何突破的分钟)。matchType: "1" - 渠道:用户选择Slack渠道。
- 模拟运行:返回按服务统计的数量,过去1小时内最大值为87——会触发0次。阈值留有合理余量。
- 预览、保存、报告。
用户:“针对api网关的异常请求延迟触发告警”
Agent:
- 解析指令:指标、异常检测、范围。异常检测需要
service.name = api-gateway+METRIC_BASED_ALERT。anomaly_rule - → 找到
signoz:signoz_list_metrics searchText=duration指标。http.server.request.duration - 构建配置:、
anomaly_rule、algorithm=zscore、阈值目标3(3个标准差)、seasonality=daily、op: "1"。matchType: "1" - 渠道:用户选择。
slack-api - 模拟运行验证查询返回数据。异常告警跳过触发次数计算。
- 预览中强调阈值为标准差而非原始延迟值。保存、报告。
Additional resources
- `signoz://alert/instructions` and `signoz://alert/examples` MCP resources — full alert config JSON schema, threshold codes, filter expression syntax, and version-current pattern examples. Always preferred over any transcribed copy.
- `signoz-generating-queries` skill — for authoring PromQL or testing queries before wrapping them in an alert.