signoz-creating-alerts

Create Alert

Build a SigNoz alert from a user's natural-language intent. The skill targets two consumers: an autonomous AI SRE agent that runs without a human in the loop, and a human at a Claude Code / Codex / Cursor prompt. Both go through the same flow — the human just gets a chance to intervene at the preview step.

Prerequisites

This skill calls SigNoz MCP server tools (`signoz:signoz_create_alert`, `signoz:signoz_list_alerts`, `signoz:signoz_get_field_keys`, etc.). Before running the workflow, confirm the `signoz:signoz_*` tools are available. If they are not, the SigNoz MCP server is not installed or configured: stop and direct the user to set it up at https://signoz.io/docs/ai/signoz-mcp-server/. Do not fall back to raw HTTP calls or fabricate alert configs without the MCP tools.

When to use

Use this skill when the user wants to:
  • Create, set up, or configure a new alert rule.
  • Get paged or notified when a metric, log volume, latency, or error rate crosses a threshold.
  • Detect anomalous behavior on a service, host, or signal.
  • Catch silent data loss ("alert if data stops arriving from X").
Do NOT use when the user wants to:
  • Understand what an existing alert monitors → signoz-explaining-alerts.
  • Diagnose why an existing alert fired → signoz-investigating-alerts.
  • Modify thresholds, queries, or routing on an existing alert → call `signoz:signoz_update_alert` directly.

Required inputs (strict)

Alert creation is a write operation against a shared system. Guessing here creates noisy alerts on the wrong service that someone else has to clean up. The skill enforces a strict input contract:
| Input | Required | Source if missing |
|---|---|---|
| Alert intent (NL goal) | yes | `$ARGUMENTS` or recent user turn |
| Resource attribute filter (e.g. `service.name`, `k8s.namespace.name`, `host.name`) | yes | discover via `signoz:signoz_get_field_keys` + `signoz:signoz_get_field_values` |
| Threshold value(s) | inferred from intent | derive a sensible default and surface in the preview |
| Severity | inferred from intent | default `warning`; promote to `critical` only if the user said "page", "wake up", "critical" |
| Notification channel | yes | `signoz:signoz_list_notification_channels` + offer "create new" |
If a required input is missing and cannot be discovered, emit a structured `needs_input` block and stop before calling any write tool:

```text
needs_input:
  missing:
    - resource_attribute_filter: "no service or host specified — pick one"
  candidates:
    service.name: ["frontend", "checkout", "payments", "inventory"]
    host.name: ["prod-api-1", "prod-api-2", "prod-db-1"]
```
In interactive mode, the human picks from candidates. In autonomous mode, the caller fills the gap from upstream context or escalates. Either way, do not proceed to `signoz:signoz_create_alert` with a guessed value.
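The input contract above can be sketched as a small gate that runs before any write tool. This is a minimal sketch: `check_inputs`, the field names, and the return shape are hypothetical, not part of the skill's API.

```python
# Hypothetical sketch of the strict-input gate: collect missing required
# inputs and return a needs_input structure instead of proceeding.
REQUIRED = ("intent", "resource_filter", "channel")

def check_inputs(inputs, candidates=None):
    """Return None if all required inputs are present, else a needs_input dict."""
    missing = [k for k in REQUIRED if not inputs.get(k)]
    if not missing:
        return None  # safe to continue toward signoz_create_alert
    return {
        "needs_input": {
            "missing": missing,
            "candidates": candidates or {},
        }
    }

# A request with only the intent filled in gets blocked with candidates attached.
blocked = check_inputs(
    {"intent": "alert on high error rate"},
    candidates={"service.name": ["frontend", "checkout"]},
)
```

The caller (human or autonomous agent) resolves the `missing` list and re-enters the flow; the gate never picks a candidate itself.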

Workflow

Step 1: Parse intent and check what's missing

Extract from the user's request:
  1. What to monitor — signal type (metrics / logs / traces / exceptions) and the specific condition (CPU, error rate, p99 latency, log count, ...).
  2. Resource scope — which service, host, namespace, or environment.
  3. Threshold — numeric value and comparison ("above 80%", "below 100/s").
  4. Severity — implicit from urgency words ("page" → critical, default warning otherwise).
  5. Channel — explicit channel name if the user provided one.
Map signal phrasing to alert type:

| User says | alertType | signal |
|---|---|---|
| metric, CPU, memory, latency, request rate | METRIC_BASED_ALERT | metrics |
| log, error logs, log volume, log pattern | LOGS_BASED_ALERT | logs |
| trace, span, latency p99, slow requests | TRACES_BASED_ALERT | traces |
| exception, stack trace, crash | EXCEPTIONS_BASED_ALERT | (clickhouse_sql) |
If resource scope is missing, run discovery (Step 2). If still missing after discovery, emit `needs_input` and stop.
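The phrasing-to-alertType table can be sketched as an ordered keyword lookup. The keyword lists below are illustrative, not exhaustive, and the ordering (exceptions before logs before traces before metrics) is one way to disambiguate overlapping words like "latency":

```python
# Illustrative keyword map; more specific signals are checked first so that
# "p99 latency" lands on traces while bare "latency" lands on metrics.
SIGNAL_MAP = [
    (("exception", "stack trace", "crash"), "EXCEPTIONS_BASED_ALERT", "exceptions"),
    (("log",), "LOGS_BASED_ALERT", "logs"),
    (("trace", "span", "slow request", "p99"), "TRACES_BASED_ALERT", "traces"),
    (("metric", "cpu", "memory", "latency", "request rate"), "METRIC_BASED_ALERT", "metrics"),
]

def classify(intent: str):
    text = intent.lower()
    for keywords, alert_type, signal in SIGNAL_MAP:
        if any(k in text for k in keywords):
            return alert_type, signal
    return None  # ambiguous: ask the user, don't guess
```

A `None` result feeds straight into the `needs_input` path rather than defaulting to any alert type.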

Step 2: Discover resource attributes and metric names

When the user does not name a service / host / namespace, the SigNoz MCP guideline applies: always prefer a resource-attribute filter. Discover candidates instead of guessing:
  1. Call `signoz:signoz_get_field_keys` with `fieldContext=resource` to enumerate resource attributes for the chosen signal.
  2. Call `signoz:signoz_get_field_values` for the most likely attribute (typically `service.name`, then `host.name`, then `k8s.namespace.name`) to get concrete values.
  3. If the user mentioned a metric by name, call `signoz:signoz_list_metrics` with a search term to verify the exact OTel metric name. Wrong names create alerts that never fire.
Surface the candidates in the `needs_input` block. Do not pick one.

Step 2.5: Probe data existence for the chosen filter (fail fast)

Before authoring any alert config, confirm the specific combination the alert will watch (metric × service × any other filter) actually emits data. The most common silent failure is "metric exists in the catalog and the service exists in the catalog, but the service doesn't emit that metric" — each piece checks out in isolation, the alert saves successfully, and it silently never fires.
Run a single probe over the last 1 hour using the same filter the alert will use, but with the simplest aggregation that confirms data exists:
  • Metrics: `signoz:signoz_query_metrics` (PromQL) or `signoz:signoz_execute_builder_query` with `count()` (or `count_distinct(service.name)` if scope-discovering).
  • Logs: `signoz:signoz_aggregate_logs` with `count()` over the filter.
  • Traces: `signoz:signoz_aggregate_traces` with `count()` over the filter.
Inspect the result:
  • Probe returns rows → proceed to Step 3.
  • Probe returns empty → STOP. Do not build an alert config the user will then be asked to throw away. Emit a `needs_input` block describing what was missing and offer concrete recovery:
    • Service doesn't emit the metric → call `signoz:signoz_get_field_values signal=metrics name=service.name metricName=<metric>` to list the services that do emit it; let the user pick a different service or a different metric.
    • Wrong attribute name (`service` instead of `service.name`) → suggest the semantic-convention name and re-probe.
    • Service emits the metric but not in the expected time range → widen the probe window once (e.g. last 24h) before declaring no-data.
Exception — log-based crash / panic / OOMKilled / FATAL alerts. These intentionally have zero matches in a healthy system. The probe will return empty by design. Do not stop; instead, surface the zero-match result and ask the user to confirm before save. Treat this exception narrowly: it applies to "alert me when bad thing happens" log queries, not to alerts that depend on continuous data flow.
This probe is cheap (one query, ~100ms), and catching the no-data case early avoids the worst UX failure mode of this skill — the user reading through a fully-authored JSON payload and only then learning the alert can never fire.
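The probe decision tree above, including the crash-log exception and the one-time window widening, can be sketched as a single function. The function name and return labels are hypothetical tags for the branches, not skill API:

```python
# Sketch of the Step 2.5 fail-fast decision. `rows` is the probe result;
# `is_bad_event_log_alert` marks crash/panic/OOMKilled/FATAL log queries,
# which are empty by design in a healthy system.
def probe_decision(rows, is_bad_event_log_alert=False, widened=False):
    """Return the next action for a probe result."""
    if rows:
        return "proceed"  # data exists for this exact filter combination
    if is_bad_event_log_alert:
        # zero matches is expected; confirm with the user instead of stopping
        return "confirm_zero_match"
    if not widened:
        return "widen_window_and_reprobe"  # retry once over e.g. last 24h
    return "needs_input"  # still empty after widening: stop before authoring
```

Note the exception is applied narrowly: only "alert me when bad thing happens" log queries bypass the stop, never alerts that depend on continuous data flow.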

Step 3: Check for duplicate alerts

Call `signoz:signoz_list_alerts` and paginate through every page while `pagination.hasMore` is true, until you have walked the full list. Check for existing alerts that match the user's intent (same signal + same scope + similar threshold). If a likely duplicate exists, surface it and ask whether to create a new one anyway, modify the existing one (out of scope here; use `signoz:signoz_update_alert`), or cancel.
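The full-pagination rule can be sketched as a loop over `pagination.hasMore`. Here `list_page` stands in for the MCP call, and the two-page toy backend is fabricated for illustration:

```python
# Walk every page of an alerts listing; stopping at page 1 would miss
# duplicates living on later pages.
def walk_alerts(list_page):
    alerts, page = [], 1
    while True:
        resp = list_page(page)
        alerts.extend(resp["alerts"])
        if not resp["pagination"]["hasMore"]:
            return alerts
        page += 1

# Toy two-page backend standing in for signoz:signoz_list_alerts.
pages = {
    1: {"alerts": [{"id": 1, "name": "cpu-checkout"}], "pagination": {"hasMore": True}},
    2: {"alerts": [{"id": 2, "name": "err-payments"}], "pagination": {"hasMore": False}},
}
all_alerts = walk_alerts(lambda p: pages[p])
```

Duplicate detection then runs over `all_alerts`, never over a single page.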

Step 4: Build the alert config

The MCP server is the source of truth for the alert JSON schema, threshold codes, and validation rules. Read the `signoz://alert/instructions` and `signoz://alert/examples` MCP resources for the canonical, version-current shape. Do not transcribe schema text into this skill; it will rot out of sync with the server.
For most user intents, the config is one of a small number of patterns:

| Pattern | Where to author | Example intents |
|---|---|---|
| Single-metric threshold | inline (this skill) | "alert when CPU > 80%", "p99 latency > 2s" |
| Log volume threshold | inline | "more than N error logs/min" |
| Trace-based count or p-tile | inline | "p99 span duration > 2s on checkout" |
| Error-rate formula (A/B*100) | inline (see "Common query shapes" below) | "error rate > 5%" |
| Anomaly detection (Z-score) | inline, but only with `METRIC_BASED_ALERT` | "alert me on anomalous traffic" |
| Absent-data alert | inline | "alert if data stops arriving" |
| ClickHouse SQL alert | inline (this skill): author the SQL directly using the schema in `signoz://alert/examples` | non-trivial joins, custom aggregations the builder cannot express |
| PromQL alert | delegate to signoz-generating-queries for the PromQL, then return here | when user already has PromQL |
Threshold `op` and `matchType` values. v2alpha1 accepts the human-readable strings (`"above"`, `"on_average"`); the legacy numeric codes (`"1"`, `"3"`) are also accepted but harder to read in the UI. Prefer the words. Anomaly rules only support `op: "above"`: the engine already treats z-score breaches as two-sided when the threshold is positive, so `"above_or_below"` is rejected and unnecessary.
| Comparison | `op` | Evaluation behavior | `matchType` |
|---|---|---|---|
| above / exceeds / > | `"above"` | breach at any point | `"at_least_once"` |
| below / under / < | `"below"` | breach for entire window | `"all_the_times"` |
| equal / = | `"equals"` | average breaches | `"on_average"` |
| not equal / != | `"not_equals"` | sum breaches | `"in_total"` |
|  |  | last value breaches | `"last"` |
Defaults the skill applies (and surfaces in the preview):
  • `evalWindow: 5m0s`, `frequency: 1m0s`; change only if the intent implies a slower or faster cadence.
  • `matchType: "on_average"` for CPU / memory / latency, to smooth transient spikes.
  • `matchType: "at_least_once"` for error counts / error rates, to catch any breach.
Severity defaults — derive from the intrinsic urgency of the alert, not just the user's words. The user saying "alert me" doesn't force `warning` when the condition itself describes a critical event. Use this table; an explicit user cue overrides it ("just FYI" → demote, "page me" / "wake me up" → promote).
| Alert intent | Default severity |
|---|---|
| Pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL log signals | critical |
| Service down / no-data on a production service | critical |
| Error rate above any non-trivial threshold (>1%) | critical |
| Error logs / exception spikes | warning |
| Latency degradation (p95/p99 above target) | warning |
| CPU / memory / disk pressure | warning |
| Request-rate / traffic anomaly | warning |
| SLO budget burn (info-level burn rate) | info / warning |
When the user's intent is ambiguous on severity (no urgency cue, no clearly-critical condition), default to `warning` and surface the choice in the preview so they can adjust.
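A hedged sketch of the severity-default logic, covering only a few rows of the table. The keyword matching is illustrative; a real implementation would need fuller cue lists:

```python
# Explicit user cues win; otherwise intrinsically critical conditions
# promote; everything else defaults to warning and is surfaced in preview.
CRITICAL_CONDITIONS = (
    "oomkilled", "crashloopbackoff", "panic", "service down", "no-data",
)

def default_severity(intent: str) -> str:
    text = intent.lower()
    if "page" in text or "wake" in text:
        return "critical"  # explicit urgency cue overrides the table
    if "fyi" in text:
        return "info"      # explicit demotion cue
    if any(c in text for c in CRITICAL_CONDITIONS):
        return "critical"  # condition is intrinsically critical
    return "warning"       # ambiguous: warning, adjustable in the preview
```

The fallthrough to `warning` mirrors the rule above: when in doubt, under-promote and let the preview step correct it.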
OTel attribute names — always use semantic conventions: `service.name`, `host.name`, `k8s.namespace.name`, `deployment.environment.name`. Never `service`, `host`, or `env`.
Annotation templates — the on-call engineer sees the notification, not the alert config. A notification that says "Pod crash detected" with no service name, no count, and no value is nearly useless at 3am. Always include the moving values:
  • `summary` — single-line headline. Include the resource scope and the numeric value: `"checkoutservice error rate {{$value}}% above 3%"`.
  • `description` — longer message. Include `{{$value}}`, `{{$threshold}}`, the groupBy values (e.g. `{{$labels.service_name}}`), and a sentence on what to do or where to look. For count-based alerts include the count explicitly: `"{{$value}} crash log lines in the last 5 minutes from service {{$labels.service_name}}"`.
Use `{{$value}}` for the breaching value, `{{$threshold}}` for the target, and `{{$labels.<key>}}` for groupBy values (note SigNoz substitutes the dotted attribute name with underscores: `service.name` → `service_name`).
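The dot-to-underscore label substitution is mechanical but easy to get wrong when authoring templates by hand; a one-line helper makes it explicit. The helper name is hypothetical:

```python
# SigNoz substitutes dots with underscores in template label keys:
# the resource attribute service.name is referenced as $labels.service_name.
def label_key(attribute: str) -> str:
    return attribute.replace(".", "_")

# Building a description template for a groupBy on service.name.
template = (
    "{{$value}} crash log lines in the last 5 minutes "
    "from service {{$labels.%s}}" % label_key("service.name")
)
```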

Common query shapes

Three patterns cover most non-trivial alerts. The MCP resources above carry the full schema; these are quick references for the query block only.
Error rate — two queries + formula `A * 100 / B`. Set `disabled: true` on the component queries A and B so only the formula F1 renders in the alert chart and notification; the raw counts are intermediate, not the alert signal. Forgetting this clutters the alert preview with three series and confuses the on-call engineer reading the notification.
```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "traces",
        "disabled": true,
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "hasError = true" } } },
    { "type": "builder_query", "spec": { "name": "B", "signal": "traces",
        "disabled": true,
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "" } } },
    { "type": "builder_formula",
      "spec": { "name": "F1", "expression": "A * 100 / B" } }
  ],
  "selectedQueryName": "F1"
}
```
p99 latency — single trace query with `groupBy` for per-service breakdown. Threshold target is in nanoseconds (2s → 2000000000), `targetUnit: "ns"`:
```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "traces",
        "aggregations": [{ "expression": "p99(durationNano)" }],
        "groupBy": [{ "name": "service.name", "fieldContext": "resource",
                      "fieldDataType": "string" }] } }
  ]
}
```
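The seconds-to-nanoseconds conversion for trace thresholds is a frequent source of off-by-10^x targets; a tiny helper keeps it explicit (hypothetical helper, not part of the schema):

```python
# durationNano thresholds are plain integers in nanoseconds:
# a "2 second" p99 target must be written as 2000000000.
def seconds_to_ns(seconds: float) -> int:
    return int(seconds * 1_000_000_000)
```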
Log volume spike — count of error/fatal logs grouped by service:
```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "logs",
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "severity_text IN ('ERROR', 'FATAL')" },
        "groupBy": [{ "name": "service.name", "fieldContext": "resource",
                      "fieldDataType": "string" }] } }
  ]
}
```
For absent-data, anomaly, PromQL, and ClickHouse SQL alerts, read the `signoz://alert/examples` MCP resource for current shapes.

Step 5: Resolve notification channels

The skill must resolve at least one channel before save. An alert with no channels saves successfully and silently never notifies anyone — the second most common silent failure after bad queries.
  1. Call `signoz:signoz_list_notification_channels` to enumerate existing channels.
  2. If the user named a channel ("send to slack-infra"), use it if it exists; if not, fall through.
  3. Otherwise present the user with two options:
    • Pick from existing — list channels with their type (Slack, PagerDuty, email, webhook) so the user can choose.
    • Create new inline — call `signoz:signoz_create_notification_channel` with channel parameters the user provides (name, type, type-specific config like Slack webhook URL or PagerDuty integration key).
  4. If neither path resolves a channel, emit `needs_input: notification_channel` and stop.
For multi-severity alerts, attach channels per threshold: `thresholds.spec[N].channels` is an array — typically warning → Slack only, critical → Slack + PagerDuty.
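The resolution order in steps 1 to 4 can be sketched as follows; `resolve_channel` and its return shapes are hypothetical labels for the three outcomes:

```python
# Named channel if it exists -> use it; else offer existing channels;
# else emit needs_input and stop. Never save an alert with no channels.
def resolve_channel(requested, existing):
    if requested and requested in existing:
        return {"channels": [requested]}
    if existing:
        return {"choose_from": sorted(existing)}
    return {"needs_input": "notification_channel"}
```

The "create new inline" path would run before the final `needs_input` branch when the user supplies channel parameters.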

Step 6: Validate the threshold (would-have-fired count)

Step 2.5 already confirmed the underlying data exists. Step 6 is about the threshold — given the full proposed query (including formulas, groupBy, and unit conversions) and the proposed threshold, would this alert have fired a sensible number of times in the last hour?
  1. Run the alert's full primary query (or formula) over the last hour using:
    • `signoz:signoz_execute_builder_query` for builder/formula queries.
    • `signoz:signoz_query_metrics` for PromQL queries.
    • `signoz:signoz_aggregate_logs` / `signoz:signoz_aggregate_traces` if those fit better.
  2. Compute how many evaluation points in the last hour breached the proposed threshold. Surface this in the preview as "would have fired N times in the last 1h":
    • N = 0 → the threshold may be too loose or the gating too strict. Mention this so the user can adjust if intent was tighter.
    • N is large (e.g. > 30) → likely alert storm. Surface and recommend tightening or adding hysteresis (`recoveryTarget`).
    • N is small and non-zero → calibrated; proceed.
  3. Exceptions:
    • Anomaly alerts — skip the breach count entirely (Z-scores aren't directly comparable to raw values). Step 2.5 already verified the underlying metric × service has data; nothing more to validate here.
    • Log-based crash / panic / OOMKilled / FATAL alerts — these intentionally have zero matches in a healthy system. Step 2.5 has already surfaced the zero-match result and obtained user confirmation; skip the breach count.
If Step 2.5 was somehow skipped (e.g. a downstream skill is invoking this flow mid-stream), the no-data stop rule applies here too: empty result → emit `needs_input` instead of saving an alert that will never fire.
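The would-have-fired calibration can be sketched in a few lines. The 0 and 30 cutoffs come from the text above; the data points and function names are fabricated for illustration:

```python
# Count evaluation points breaching the proposed threshold, then classify
# the count into the three calibration buckets surfaced in the preview.
def breach_count(points, threshold, op="above"):
    if op == "above":
        return sum(1 for v in points if v > threshold)
    return sum(1 for v in points if v < threshold)

def calibration(n):
    if n == 0:
        return "maybe_too_loose"      # or gating too strict; mention it
    if n > 30:
        return "likely_alert_storm"   # recommend tightening / recoveryTarget
    return "calibrated"               # small and non-zero: proceed

# Fabricated error-rate series over the last hour, threshold 5%.
n = breach_count([0.3, 0.4, 6.2, 0.2, 7.9], threshold=5.0)
```

In the preview this surfaces as "would have fired N times in the last 1h" alongside the bucket's recommendation.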

Step 7: Preview the prepared config

Emit a fenced JSON code block containing the exact payload that will be sent to `signoz:signoz_create_alert`, plus a one-paragraph plain-language summary:

```json
{
  "alert": "<name>",
  "alertType": "...",
  "ruleType": "...",
  "condition": { ... },
  "labels": { "severity": "..." },
  "annotations": { "description": "...", "summary": "..." },
  "evaluation": { ... },
  "preferredChannels": ["..."]
}
```
Summary: This alert fires when [condition] for [resource scope], evaluated every [frequency] over the last [window]. Thresholds: warning at X, critical at Y. Notifications go to [channels]. Dry-run on the last hour: would have fired N times.
In autonomous mode the consumer proceeds. In interactive mode the human can intervene before Step 8.

Step 8: Save and report

  1. Call `signoz:signoz_create_alert` with the JSON payload from Step 7.
  2. Name collision — if `signoz:signoz_create_alert` returns a duplicate-name error, do not suffix-append or call `signoz:signoz_update_alert`. Stop and tell the user the existing alert blocked creation; offer to use a different name or modify the existing alert (which is out of scope for this skill).
  3. On success, report:
    • The alert ID and name.
    • What it watches and at what threshold.
    • Which channels are wired up.
    • The dry-run summary ("would have fired N times in last 1h").
    • Two follow-up offers: "Want to test the query live with signoz-generating-queries?" and "Want me to add a runbook URL to the annotations?"

Guardrails

  • Strict inputs over guessing. Resource attribute and channel are required. If missing, emit `needs_input` and stop. Creating an alert on a guessed service is harder to undo than asking.
  • Always paginate `signoz:signoz_list_alerts`. Stopping at page 1 misses duplicates and produces noise.
  • Dry-run is mandatory. Saving an alert whose query returns no data is a silent failure mode and must be prevented.
  • No duplicate updates. Name collision → error and stop. Do not silently update an existing alert from a "create" skill.
  • OTel attribute names only. `service.name`, not `service`.
  • Prefer readable threshold codes. `op: "above"` and `matchType: "on_average"` over the legacy numeric codes (`"1"`, `"3"`).
  • Signal must match alertType. `signal: "logs"` requires `LOGS_BASED_ALERT`. Mismatches fail validation.
  • Anomaly rules are metrics-only. `anomaly_rule` + non-metric alertType is rejected.
  • Channels must exist. Use names from `signoz:signoz_list_notification_channels` exactly, or create the channel inline first.
  • Scope boundary. This skill only creates new rules. Modifications use `signoz:signoz_update_alert` directly.

Examples

User: "Alert me when CPU usage on the checkout service goes above 80%, page me at 90%"
Agent:
  1. Intent parsed: metric, CPU, scope `service.name = checkout`, thresholds 80% (warning) / 90% (critical), severity bumped to critical on the higher level because "page me" was used.
  2. `signoz:signoz_list_metrics searchText=cpu` → confirms `system.cpu.utilization`.
  3. `signoz:signoz_list_alerts` (paginated) → no existing CPU alert for checkout.
  4. `signoz:signoz_list_notification_channels` → presents existing channels; user picks `slack-infra` for warning and `pagerduty-oncall` for critical.
  5. Builds JSON: `METRIC_BASED_ALERT`, `threshold_rule`, `signal=metrics`, two thresholds (`op="above"`, `matchType="on_average"`, `targetUnit="percent"`), filter `service.name = 'checkout'`.
  6. Dry-run via `signoz:signoz_execute_builder_query` over last 1h: returns data, would have fired 0 times (clean baseline).
  7. Emits JSON preview + summary.
  8. Calls `signoz:signoz_create_alert`. Reports created alert with ID, threshold summary, channel routing, and dry-run result.

User: "Set up an alert when error rate goes above 5% on payments"
Agent:
  1. Intent parsed: traces, error-rate formula, scope `service.name = payments`, threshold 5%, severity default warning.
  2. `signoz:signoz_get_field_keys fieldContext=resource signal=traces` confirms `service.name`. `signoz:signoz_get_field_values` confirms `payments` exists.
  3. No existing payments error-rate alert.
  4. Builds formula alert: query A counts spans with `hasError = true` for `service.name = 'payments'`, query B counts all spans for the same service, formula F1 = `A * 100 / B`, `selectedQueryName: "F1"`, threshold target 5, `targetUnit: "percent"`, `op: "1"`, `matchType: "1"` (catch any breach).
  5. Channel: user picks `slack-payments`.
  6. Dry-run on the last 1h: payments error rate hovered around 0.3%, would have fired 0 times. Clean; not too tight.
  7. Preview, save, report.
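The A/B/F1 layout from step 4 can be sketched the same way. Hedged sketch: the query letters, filters, formula, `selectedQueryName`, and threshold codes come from the transcript; every other key name is an illustrative assumption, and the `signoz://alert/examples` MCP resource shows the real shape.

```python
# Hedged sketch of the error-rate formula alert from step 4. Query A
# counts errored spans, query B counts all spans, and the alert
# evaluates the formula F1 rather than either raw query.

def build_error_rate_alert(service: str, target_pct: float) -> dict:
    scope = f"service.name = '{service}'"
    return {
        "signal": "traces",
        "queries": {
            # A: errored spans only; B: all spans for the same service.
            "A": {"aggregation": "count()",
                  "filter": f"{scope} AND hasError = true"},
            "B": {"aggregation": "count()",
                  "filter": scope},
        },
        "formulas": {"F1": "A * 100 / B"},   # error rate as a percentage
        "selectedQueryName": "F1",           # threshold applies to F1
        "threshold": {
            "target": target_pct,
            "targetUnit": "percent",
            "op": "1",         # above
            "matchType": "1",  # "1": fire on any single breach
        },
    }

payload = build_error_rate_alert("payments", 5)
```

Pointing `selectedQueryName` at the formula is the load-bearing choice here: without it the threshold would compare against a raw span count instead of a ratio.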

User: "Notify me if log volume spikes more than 1000 error logs per minute on any service"
Agent:
  1. Intent parsed: logs, volume threshold, scope = all services (groupBy `service.name`), threshold 1000/min, severity warning.
  2. `signoz:signoz_get_field_keys signal=logs fieldContext=resource` confirms `service.name` exists; no service filter required since the user said "any service", so the groupBy provides the scoping.
  3. Builds: `LOGS_BASED_ALERT`, `signal=logs`, `aggregations: [{expression: "count()"}]`, `filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}`, `groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}]`, threshold 1000, `targetUnit: ""`, `evalWindow: 1m0s`, `matchType: "1"` (catch any minute that breaches).
  4. Channels: user picks a Slack channel.
  5. Dry-run: returned per-service counts; the max in the last 1h was 87, so it would have fired 0 times. Within reasonable headroom.
  6. Preview, save, report.
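Step 3 quotes most of this config verbatim, so a sketch is nearly mechanical. Hedged as before: the `aggregations`, `filter`, `groupBy`, `evalWindow`, and threshold codes are from the transcript; the top-level layout is an assumption.

```python
# Hedged sketch of the per-service log-volume alert from step 3.
# The fragments in braces are quoted from the transcript; the
# surrounding structure is illustrative.

def build_log_volume_alert(limit_per_minute: int) -> dict:
    return {
        "alertType": "LOGS_BASED_ALERT",
        "signal": "logs",
        "aggregations": [{"expression": "count()"}],
        "filter": {"expression": "severity_text IN ('ERROR', 'FATAL')"},
        # No service filter: grouping by service.name evaluates every
        # service independently, matching the "any service" intent.
        "groupBy": [{"name": "service.name",
                     "fieldContext": "resource",
                     "fieldDataType": "string"}],
        "evalWindow": "1m0s",               # evaluate minute by minute
        "threshold": {"target": limit_per_minute,
                      "targetUnit": "",     # raw count, no unit conversion
                      "op": "1",
                      "matchType": "1"},    # any breaching minute fires
    }

payload = build_log_volume_alert(1000)
```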

User: "Alert me on anomalous request latency for the api gateway"
Agent:
  1. Intent parsed: metric, anomaly detection, scope `service.name = api-gateway`. Anomaly detection requires `METRIC_BASED_ALERT` + `anomaly_rule`.
  2. `signoz:signoz_list_metrics searchText=duration` → finds `http.server.request.duration`.
  3. Builds: `anomaly_rule`, `algorithm=zscore`, `seasonality=daily`, threshold target 3 (3 standard deviations), `op: "1"`, `matchType: "1"`.
  4. Channel: user picks `slack-api`.
  5. Dry-run validates that the query returns data. Skip breach-count for anomaly alerts.
  6. Preview emphasizes that the threshold is in standard deviations, not raw latency. Save, report.
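The anomaly variant differs from the threshold examples only in the rule type and the meaning of the target. Hedged sketch: `anomaly_rule`, `algorithm=zscore`, `seasonality=daily`, and the standard-deviation target are from step 3; the key names and nesting are illustrative assumptions, with `signoz://alert/instructions` authoritative.

```python
# Hedged sketch of the anomaly rule from step 3. Note the target is a
# z-score (standard deviations from the learned baseline), NOT a raw
# latency value.

def build_latency_anomaly_alert(service: str) -> dict:
    return {
        "alertType": "METRIC_BASED_ALERT",
        "ruleType": "anomaly_rule",      # anomaly detection, not a fixed threshold
        "metric": "http.server.request.duration",
        "filter": f"service.name = '{service}'",
        "algorithm": "zscore",
        "seasonality": "daily",          # baseline repeats on a daily cycle
        "threshold": {
            "target": 3,                 # 3 standard deviations
            "op": "1",
            "matchType": "1",
        },
    }

payload = build_latency_anomaly_alert("api-gateway")
```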
用户:“当checkout服务的CPU使用率超过80%时提醒我,超过90%时呼叫我”
Agent:
  1. 解析指令:指标、CPU、范围 `service.name = checkout`、阈值80%(warning)/90%(critical),因用户提及“呼叫我”,将高级别阈值的严重级别升级为critical。
  2. `signoz:signoz_list_metrics searchText=cpu` → 确认 `system.cpu.utilization` 指标存在。
  3. `signoz:signoz_list_alerts`(遍历所有分页)→ 无针对checkout服务的现有CPU告警。
  4. `signoz:signoz_list_notification_channels` → 展示现有渠道;用户选择 `slack-infra` 作为warning级别渠道,`pagerduty-oncall` 作为critical级别渠道。
  5. 构建JSON:`METRIC_BASED_ALERT`、`threshold_rule`、`signal=metrics`、两个阈值(`op="1"`、`matchType="3"` 即on_average、`targetUnit="percent"`)、过滤器 `service.name = 'checkout'`。
  6. 通过 `signoz:signoz_execute_builder_query` 在过去1小时内模拟运行:返回数据,会触发0次(基线正常)。
  7. 输出JSON预览+摘要。
  8. 调用 `signoz:signoz_create_alert`。报告已创建的告警ID、阈值摘要、渠道路由和模拟运行结果。

用户:“为payments服务设置错误率超过5%时的告警”
Agent:
  1. 解析指令:链路追踪、错误率公式、范围 `service.name = payments`、阈值5%、默认严重级别warning。
  2. `signoz:signoz_get_field_keys fieldContext=resource signal=traces` 确认 `service.name` 存在。`signoz:signoz_get_field_values` 确认 `payments` 服务存在。
  3. 无针对payments服务的现有错误率告警。
  4. 构建公式告警:查询A统计 `service.name = 'payments'` 中 `hasError = true` 的Span数量,查询B统计同一服务的所有Span数量,公式F1 = `A * 100 / B`,`selectedQueryName: "F1"`,阈值目标5,`targetUnit: "percent"`、`op: "1"`、`matchType: "1"`(捕获任何一次突破)。
  5. 渠道:用户选择 `slack-payments`。
  6. 过去1小时模拟运行:payments服务错误率约为0.3%,会触发0次。阈值合理,不过于严格。
  7. 预览、保存、报告。

用户:“当任何服务的错误日志量每分钟超过1000条时通知我”
Agent:
  1. 解析指令:日志、量阈值、范围=所有服务(按 `service.name` 分组)、阈值1000条/分钟、严重级别warning。
  2. `signoz:signoz_get_field_keys signal=logs fieldContext=resource` 确认 `service.name` 存在;无需服务过滤器,因用户要求“任何服务”→ groupBy提供范围划分。
  3. 构建配置:`LOGS_BASED_ALERT`、`signal=logs`、`aggregations: [{expression: "count()"}]`、`filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}`、`groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}]`、阈值1000、`targetUnit: ""`、`evalWindow: 1m0s`、`matchType: "1"`(捕获任何突破的分钟)。
  4. 渠道:用户选择Slack渠道。
  5. 模拟运行:返回按服务统计的数量,过去1小时内最大值为87,会触发0次。阈值留有合理余量。
  6. 预览、保存、报告。

用户:“针对api网关的异常请求延迟触发告警”
Agent:
  1. 解析指令:指标、异常检测、范围 `service.name = api-gateway`。异常检测需要 `METRIC_BASED_ALERT` + `anomaly_rule`。
  2. `signoz:signoz_list_metrics searchText=duration` → 找到 `http.server.request.duration` 指标。
  3. 构建配置:`anomaly_rule`、`algorithm=zscore`、`seasonality=daily`、阈值目标3(3个标准差)、`op: "1"`、`matchType: "1"`。
  4. 渠道:用户选择 `slack-api`。
  5. 模拟运行验证查询返回数据。异常告警跳过触发次数计算。
  6. 预览中强调阈值为标准差而非原始延迟值。保存、报告。

Additional resources

额外资源

  • `signoz://alert/instructions` and `signoz://alert/examples` MCP resources: full alert config JSON schema, threshold codes, filter expression syntax, and version-current pattern examples. Always preferred over any transcribed copy.
  • `signoz-generating-queries` skill: for authoring PromQL or testing queries before wrapping them in an alert.
  • `signoz://alert/instructions` 和 `signoz://alert/examples` MCP资源:完整的告警配置JSON schema、阈值代码、过滤器表达式语法以及版本同步的模式示例。始终优先使用这些资源,而非任何转录副本。
  • `signoz-generating-queries` 技能:用于编写PromQL或在封装为告警前测试查询。