# Alert Create
Build a SigNoz alert from a user's natural-language intent. The skill targets
two consumers: an autonomous AI SRE agent that runs without a human in the
loop, and a human at a Claude Code / Codex / Cursor prompt. Both go through
the same flow — the human just gets a chance to intervene at the preview step.
## Prerequisites

This skill calls SigNoz MCP server tools (`signoz:signoz_create_alert`, `signoz:signoz_list_alerts`, `signoz:signoz_get_field_keys`, etc.). Before running the workflow, confirm the tools are available. If they are not, the SigNoz MCP server is not installed or configured — stop and direct the user to set it up: https://signoz.io/docs/ai/signoz-mcp-server/. Do not try to fall back to raw HTTP calls or fabricate alert configs without the MCP tools.
## When to use
Use this skill when the user wants to:
- Create, set up, or configure a new alert rule.
- Get paged or notified when a metric, log volume, latency, or error rate
crosses a threshold.
- Detect anomalous behavior on a service, host, or signal.
- Catch silent data loss ("alert if data stops arriving from X").
Do NOT use when the user wants to:
- Understand what an existing alert monitors → that is read-only inspection, out of scope for this skill.
- Diagnose why an existing alert fired → `signoz-investigating-alerts`.
- Modify thresholds, queries, or routing on an existing alert → call `signoz:signoz_update_alert` directly.
## Required inputs (strict)
Alert creation is a write operation against a shared system. Guessing here
creates noisy alerts on the wrong service that someone else has to clean up.
The skill enforces a strict input contract:
| Input | Required | Source if missing |
|---|---|---|
| Alert intent (NL goal) | yes | the invoking request or recent user turn |
| Resource attribute filter (e.g. `service.name`, `host.name`) | yes | discover via `signoz:signoz_get_field_keys` + `signoz:signoz_get_field_values` |
| Threshold value(s) | inferred from intent | derive a sensible default and surface it in the preview |
| Severity | inferred from intent | default `warning`; promote to `critical` only if user said "page", "wake up", "critical" |
| Notification channel | yes | `signoz:signoz_list_notification_channels` + offer "create new" |
If a required input is missing and cannot be discovered, emit a structured
`needs_input` block and stop before calling any write tool:

```text
needs_input:
  missing:
    - resource_attribute_filter: "no service or host specified — pick one"
  candidates:
    service.name: ["frontend", "checkout", "payments", "inventory"]
    host.name: ["prod-api-1", "prod-api-2", "prod-db-1"]
```
In interactive mode, the human picks from candidates. In autonomous mode, the caller fills the gap from upstream context or escalates. Either way, do not proceed to `signoz:signoz_create_alert` with a guessed value.
## Workflow

### Step 1: Parse intent and check what's missing
Extract from the user's request:
- What to monitor — signal type (metrics / logs / traces / exceptions)
and the specific condition (CPU, error rate, p99 latency, log count, ...).
- Resource scope — which service, host, namespace, or environment.
- Threshold — numeric value and comparison ("above 80%", "below 100/s").
- Severity — implicit from urgency words ("page" → critical, default
warning otherwise).
- Channel — explicit channel name if the user provided one.
Map signal phrasing to alert type:
| User says | alertType | signal |
|---|---|---|
| metric, CPU, memory, latency, request rate | METRIC_BASED_ALERT | metrics |
| log, error logs, log volume, log pattern | LOGS_BASED_ALERT | logs |
| trace, span, latency p99, slow requests | TRACES_BASED_ALERT | traces |
| exception, stack trace, crash | EXCEPTIONS_BASED_ALERT | (clickhouse_sql) |
If resource scope is missing, run discovery (Step 2). If still missing after discovery, emit `needs_input` and stop.
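One way to hold the parsed intent before discovery (a minimal sketch; the field names below are illustrative working notes, not a schema any SigNoz MCP tool expects):

```json
{
  "signal": "traces",
  "alertType": "TRACES_BASED_ALERT",
  "condition": "p99 latency above threshold",
  "resource_scope": null,
  "threshold": { "target": 2, "unit": "s", "comparison": "above" },
  "severity": "warning",
  "channel": null,
  "missing": ["resource_scope", "channel"]
}
```

The two `null` fields are exactly what Step 2 discovery and Step 5 channel resolution must fill before any write call.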
### Step 2: Discover resource attributes and metric names
When the user does not name a service / host / namespace, the SigNoz MCP
guideline applies: always prefer a resource-attribute filter. Discover
candidates instead of guessing:
- Call `signoz:signoz_get_field_keys` with `fieldContext=resource` and the chosen `signal` to enumerate resource attributes for that signal.
- Call `signoz:signoz_get_field_values` for the most likely attribute (typically `service.name`, then `host.name`, then a namespace or environment attribute) to get concrete values.
- If the user mentioned a metric by name, call `signoz:signoz_list_metrics` with a search term to verify the exact OTel metric name. Wrong names create alerts that never fire.

Surface the candidates in the `needs_input` block. Do not pick one.
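A discovery sequence might look like the following. The argument names mirror the example calls used elsewhere in this skill and are illustrative; the MCP tool definitions are authoritative:

```json
[
  { "tool": "signoz:signoz_get_field_keys",   "arguments": { "signal": "metrics", "fieldContext": "resource" } },
  { "tool": "signoz:signoz_get_field_values", "arguments": { "signal": "metrics", "name": "service.name" } },
  { "tool": "signoz:signoz_list_metrics",     "arguments": { "searchText": "cpu" } }
]
```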
### Step 2.5: Probe data existence for the chosen filter (fail fast)
Before authoring any alert config, confirm the specific combination the
alert will watch (metric × service × any other filter) actually emits data.
The most common silent failure is "metric exists in the catalog and the
service exists in the catalog, but the service doesn't emit that metric"
— each piece checks out in isolation, the alert saves successfully, and it
silently never fires.
Run a single probe over the last 1 hour using the same filter the alert
will use, but with the simplest aggregation that confirms data exists:
- Metrics: `signoz:signoz_query_metrics` (PromQL) or `signoz:signoz_execute_builder_query` with the simplest aggregation that returns rows (or `count_distinct(service.name)` if scope-discovering).
- Logs: `signoz:signoz_aggregate_logs` with `count()` over the filter.
- Traces: `signoz:signoz_aggregate_traces` with `count()` over the filter.
Inspect the result:
- Probe returns rows → proceed to Step 3.
- Probe returns empty → STOP. Do not build an alert config the user will then be asked to throw away. Emit a `needs_input` block describing what was missing and offer concrete recovery:
  - Service doesn't emit the metric → call `signoz:signoz_get_field_values signal=metrics name=service.name metricName=<metric>` to list the services that do emit it; let the user pick a different service or a different metric.
  - Wrong attribute name (a non-convention spelling such as `service_name` instead of `service.name`) → suggest the semantic-convention name and re-probe.
  - Service emits the metric but not in the expected time range → widen the probe window once (e.g. last 24h) before declaring no-data.
Exception — log-based crash / panic / OOMKilled / FATAL alerts. These
intentionally have zero matches in a healthy system. The probe will
return empty by design. Do not stop; instead, surface the zero-match
result and ask the user to confirm before save. Treat this exception
narrowly: it applies to "alert me when bad thing happens" log queries,
not to alerts that depend on continuous data flow.
This probe is cheap (one query, ~100ms), and catching the no-data case
early avoids the worst UX failure mode of this skill — the user reading
through a fully-authored JSON payload and only then learning the alert
can never fire.
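As a concrete illustration, a logs probe for a checkout-scoped error alert could reuse the alert's own filter with a bare `count()`. The query spec shape follows the "Common query shapes" examples in Step 4; the filter value is a placeholder:

```json
{
  "queries": [
    { "type": "builder_query", "spec": { "name": "A", "signal": "logs",
        "aggregations": [{ "expression": "count()" }],
        "filter": { "expression": "service.name = 'checkout'" } } }
  ]
}
```

Run it over the last hour; any non-empty result is enough to proceed, since the probe only answers "does this combination emit data at all".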
### Step 3: Check for duplicate alerts
Call `signoz:signoz_list_alerts` and paginate through every page — the response keeps signalling more pages until you have walked the full list. Check for existing alerts that match the user's intent (same signal + same scope + similar threshold). If a likely duplicate exists, surface it and ask whether to create a new one anyway, modify the existing one (out of scope here — use `signoz:signoz_update_alert`), or cancel.
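When a likely duplicate turns up, surface it in a small structured block so both the autonomous caller and the human see the same decision point. The shape below is illustrative, not a required format:

```json
{
  "possible_duplicate": {
    "id": "<existing-alert-id>",
    "name": "checkout CPU high",
    "signal": "metrics",
    "scope": "service.name = 'checkout'",
    "threshold": "above 80"
  },
  "options": ["create anyway", "modify existing (signoz:signoz_update_alert)", "cancel"]
}
```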
### Step 4: Build the alert config
The MCP server is the source of truth for the alert JSON schema, threshold codes, and validation rules. Read the `signoz://alert/instructions` MCP resource (and the other alert resources the server exposes) for the canonical, version-current shape. Do not transcribe schema text into this skill — it will rot out of sync with the server.
For most user intents, the config is one of a small number of patterns:
| Pattern | Where to author | Example intents |
|---|---|---|
| Single-metric threshold | inline (this skill) | "alert when CPU > 80%", "p99 latency > 2s" |
| Log volume threshold | inline | "more than N error logs/min" |
| Trace-based count or p-tile | inline | "p99 span duration > 2s on checkout" |
| Error-rate formula (A/B*100) | inline (see "Common query shapes" below) | "error rate > 5%" |
| Anomaly detection (Z-score) | inline, but only for metric signals | "alert me on anomalous traffic" |
| Absent-data alert | inline | "alert if data stops arriving" |
| ClickHouse SQL alert | inline (this skill) — author the SQL directly using the schema in the MCP alert resources | non-trivial joins, custom aggregations the builder cannot express |
| PromQL alert | delegate to `signoz-generating-queries` for the PromQL, then return here | when user already has PromQL |
Threshold `op` and `matchType` values. v2alpha1 accepts human-readable strings (for example `at_least_once`, `on_average`); the legacy numeric codes are also accepted but harder to read in the UI. Prefer the words.
Anomaly rules only support the above comparison — the engine already treats z-score breaches as two-sided when the threshold is positive, so a two-sided op is rejected and unnecessary.

Comparison (`op`) phrasings to recognize: above / exceeds / >, below / under / <, equal / =, not equal / !=.

Evaluation behaviors (`matchType`): breach at any point (`at_least_once`), breach for the entire window, average breaches (`on_average`), sum breaches, last value breaches.

Take the exact `op` and `matchType` strings from the MCP alert resources above; they are deliberately not transcribed here.
Defaults the skill applies (and surfaces in the preview):
- Evaluation window and frequency — keep the server defaults; change only if the intent implies a slower or faster cadence.
- `on_average` match for CPU / memory / latency — smooths transient spikes.
- `at_least_once` match for error counts / error rates — catches any breach.
Severity defaults — derive from the intrinsic urgency of the alert, not just the user's words. The user saying "alert me" doesn't force `warning` when the condition itself describes a critical event. Use this table; an explicit user cue overrides it ("just FYI" → demote, "page me" / "wake me up" → promote).
| Alert intent | Default severity |
|---|---|
| Pod crash / OOMKilled / CrashLoopBackOff / panic / FATAL log signals | critical |
| Service down / no-data on a production service | critical |
| Error rate above any non-trivial threshold (>1%) | critical |
| Error logs / exception spikes | warning |
| Latency degradation (p95/p99 above target) | warning |
| CPU / memory / disk pressure | warning |
| Request-rate / traffic anomaly | warning |
| SLO budget burn (info-level burn rate) | info / warning |
When the user's intent is ambiguous on severity (no urgency cue, no clearly-critical condition), default to `warning` and surface the choice in the preview so they can adjust.
OTel attribute names — always use semantic conventions: `service.name`, `host.name`, `deployment.environment.name`. Never non-convention variants of the same attributes (for example `serviceName` or `service_name`).
Annotation templates — the on-call engineer sees the notification, not the alert config. A notification that says "Pod crash detected" with no service name, no count, and no value is nearly useless at 3am. Always include the moving values:
- `summary` — single-line headline. Include the resource scope and the numeric value: "checkoutservice error rate {{$value}}% above 3%".
- `description` — longer message. Include the breaching value, the threshold, the groupBy values (e.g. `{{$labels.service_name}}`), and a sentence on what to do or where to look. For count-based alerts include the count explicitly: "{{$value}} crash log lines in the last 5 minutes from service {{$labels.service_name}}".

Use `{{$value}}` for the breaching value, `{{$threshold}}` for the target, and `{{$labels.<attribute>}}` for groupBy values (note SigNoz substitutes the dotted attribute name with underscores: `service.name` → `service_name`).
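Put together, an annotations block for the error-rate example might look like this. The `summary` and `description` keys match the preview payload in Step 7; the wording itself is up to the intent:

```json
{
  "annotations": {
    "summary": "checkoutservice error rate {{$value}}% above {{$threshold}}%",
    "description": "Error rate for service {{$labels.service_name}} is {{$value}}%, above the {{$threshold}}% threshold over the evaluation window. Check recent deploys and the checkout dashboards."
  }
}
```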
#### Common query shapes

Three patterns cover most non-trivial alerts. The MCP resources above carry the full schema; these are quick references for the query block only.
Error rate — two queries + formula. Set `"disabled": true` on the component queries A and B so only the formula F1 renders in the alert chart and notification — the raw counts are intermediate, not the alert signal. Forgetting this clutters the alert preview with three series and confuses the on-call engineer reading the notification.
```json
{
"queries": [
{ "type": "builder_query", "spec": { "name": "A", "signal": "traces",
"disabled": true,
"aggregations": [{ "expression": "count()" }],
"filter": { "expression": "hasError = true" } } },
{ "type": "builder_query", "spec": { "name": "B", "signal": "traces",
"disabled": true,
"aggregations": [{ "expression": "count()" }],
"filter": { "expression": "" } } },
{ "type": "builder_formula",
"spec": { "name": "F1", "expression": "A * 100 / B" } }
],
"selectedQueryName": "F1"
}
```
p99 latency — single trace query with `groupBy` on `service.name` for a per-service breakdown. The threshold target is in nanoseconds (2s → 2000000000):
```json
{
"queries": [
{ "type": "builder_query", "spec": { "name": "A", "signal": "traces",
"aggregations": [{ "expression": "p99(durationNano)" }],
"groupBy": [{ "name": "service.name", "fieldContext": "resource",
"fieldDataType": "string" }] } }
]
}
```
Log volume spike — count of error/fatal logs grouped by service:
```json
{
"queries": [
{ "type": "builder_query", "spec": { "name": "A", "signal": "logs",
"aggregations": [{ "expression": "count()" }],
"filter": { "expression": "severity_text IN ('ERROR', 'FATAL')" },
"groupBy": [{ "name": "service.name", "fieldContext": "resource",
"fieldDataType": "string" }] } }
]
}
```
For absent-data, anomaly, PromQL, and ClickHouse SQL alerts, read the MCP alert resources for the current shapes.
### Step 5: Resolve notification channels
The skill must resolve at least one channel before save. An alert with no
channels saves successfully and silently never notifies anyone — the second
most common silent failure after bad queries.
- Call `signoz:signoz_list_notification_channels` to enumerate existing channels.
- If the user named a channel ("send to slack-infra"), use it if it exists; if not, fall through.
- Otherwise present the user with two options:
  - Pick from existing — list channels with their type (Slack, PagerDuty, email, webhook) so the user can choose.
  - Create new inline — call `signoz:signoz_create_notification_channel` with channel parameters the user provides (name, type, type-specific config like Slack webhook URL or PagerDuty integration key).
- If neither path resolves a channel, emit `needs_input: notification_channel` and stop.
For multi-severity alerts, attach channels per threshold: `thresholds.spec[N].channels` is an array — typically warning → Slack only, critical → Slack + PagerDuty.
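A per-threshold channel attachment might look like the sketch below. The `thresholds.spec[N].channels` path is the one named above, the channel names are placeholders, and the exact threshold schema should be read from the MCP alert resources rather than from this fragment:

```json
{
  "thresholds": {
    "spec": [
      { "name": "warning",  "channels": ["slack-infra"] },
      { "name": "critical", "channels": ["slack-infra", "pagerduty-sre"] }
    ]
  }
}
```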
### Step 6: Validate the threshold (would-have-fired count)
Step 2.5 already confirmed the underlying data exists. Step 6 is about
the threshold — given the full proposed query (including formulas,
groupBy, and unit conversions) and the proposed threshold, would this
alert have fired a sensible number of times in the last hour?
- Run the alert's full primary query (or formula) over the last hour using:
  - `signoz:signoz_execute_builder_query` for builder/formula queries.
  - `signoz:signoz_query_metrics` for PromQL queries.
  - `signoz:signoz_aggregate_logs` / `signoz:signoz_aggregate_traces` if those fit better.
- Compute how many evaluation points in the last hour breached the proposed threshold. Surface this in the preview as "would have fired N times in the last 1h":
  - N = 0 → the threshold may be too loose or the gating too strict. Mention this so the user can adjust if the intent was tighter.
  - N is large (e.g. > 30) → likely alert storm. Surface and recommend tightening the threshold or adding hysteresis.
  - N is small and non-zero → calibrated; proceed.
- Exceptions:
- Anomaly alerts — skip the breach count entirely (Z-scores aren't
directly comparable to raw values). Step 2.5 already verified the
underlying metric × service has data; nothing more to validate here.
- Log-based crash / panic / OOMKilled / FATAL alerts — these
intentionally have zero matches in a healthy system. Step 2.5 has
already surfaced the zero-match result and obtained user confirmation;
skip the breach count.
If Step 2.5 was somehow skipped (e.g. a downstream skill is invoking this flow mid-stream), the no-data stop rule applies here too: empty result → emit `needs_input` instead of saving an alert that will never fire.
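The validation result that feeds the Step 7 preview can be as simple as this (an illustrative shape only; the "would have fired N times" sentence is what matters):

```json
{
  "dry_run": {
    "window": "last 1h",
    "evaluation_points": 60,
    "breaches": 3,
    "verdict": "calibrated: small, non-zero breach count"
  }
}
```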
### Step 7: Preview the prepared config
Emit a fenced JSON code block containing the exact payload that will be sent to `signoz:signoz_create_alert`, plus a one-paragraph plain-language summary:
```json
{
"alert": "<name>",
"alertType": "...",
"ruleType": "...",
"condition": { ... },
"labels": { "severity": "..." },
"annotations": { "description": "...", "summary": "..." },
"evaluation": { ... },
"preferredChannels": ["..."]
}
```
Summary: This alert fires when [condition] for [resource scope],
evaluated every [frequency] over the last [window]. Thresholds:
warning at X, critical at Y. Notifications go to [channels]. Dry-run on
the last hour: would have fired N times.
In autonomous mode the consumer proceeds. In interactive mode the human can
intervene before Step 8.
### Step 8: Save and report
- Call `signoz:signoz_create_alert` with the JSON payload from Step 7.
- Name collision — if `signoz:signoz_create_alert` returns a duplicate-name error, do not suffix-append or call `signoz:signoz_update_alert`. Stop and tell the user the existing alert blocked creation; offer to use a different name or modify the existing alert (which is out of scope for this skill).
- On success, report:
- The alert ID and name.
- What it watches and at what threshold.
- Which channels are wired up.
- The dry-run summary ("would have fired N times in last 1h").
  - Two follow-up offers: "Want to test the query live with `signoz-generating-queries`?" and "Want me to add a runbook URL to the annotations?"
## Guardrails
- Strict inputs over guessing. Resource attribute and channel are required. If missing, emit `needs_input` and stop. Creating an alert on a guessed service is harder to undo than asking.
- Always paginate `signoz:signoz_list_alerts`. Stopping at page 1 misses duplicates and produces noise.
- Dry-run is mandatory. Saving an alert whose query returns no data is a silent failure mode and must be prevented.
- No duplicate updates. Name collision → error and stop. Do not silently update an existing alert from a "create" skill.
- OTel attribute names only. `service.name`, not a non-convention variant like `service_name`.
- Threshold codes are the exact enum strings, not free-form words (`at_least_once`, not "at least once").
- Signal must match alertType. For example, `signal: logs` requires `LOGS_BASED_ALERT`. Mismatches fail validation.
- Anomaly rules are metrics-only. An anomaly rule with a non-metric alertType is rejected.
- Channels must exist. Use names from `signoz:signoz_list_notification_channels` exactly, or create the channel inline first.
- Scope boundary. This skill only creates new rules. Modifications use `signoz:signoz_update_alert` directly.
## Examples
User: "Alert me when CPU usage on the checkout service goes above 80%, page me at 90%"
Agent:
- Intent parsed: metric, CPU, scope `service.name = 'checkout'`, thresholds 80% (warning) / 90% (critical), severity bumped to critical on the higher level because "page me" was used.
- `signoz:signoz_list_metrics searchText=cpu` → confirms the exact CPU utilization metric name.
- `signoz:signoz_list_alerts` (paginated) → no existing CPU alert for checkout.
- `signoz:signoz_list_notification_channels` → presents existing channels; user picks a Slack channel for warning and a PagerDuty channel for critical.
- Builds JSON: `alertType: METRIC_BASED_ALERT`, the confirmed CPU metric, two thresholds (warning at 80 and critical at 90, both `on_average`), filter `service.name = 'checkout'`.
- Dry-run via `signoz:signoz_execute_builder_query` over last 1h: returns data, would have fired 0 times (clean baseline).
- Emits JSON preview + summary.
- Calls `signoz:signoz_create_alert`. Reports created alert with ID, threshold summary, channel routing, and dry-run result.
User: "Set up an alert when error rate goes above 5% on payments"
Agent:
- Intent parsed: traces, error-rate formula, scope `service.name = 'payments'`, threshold 5%, severity default warning.
- `signoz:signoz_get_field_keys fieldContext=resource signal=traces` confirms `service.name` is available; `signoz:signoz_get_field_values` confirms `payments` exists.
- No existing payments error-rate alert.
- Builds formula alert: query A counts spans with `hasError = true` for `service.name = 'payments'`, query B counts all spans for the same service, formula F1 = `A * 100 / B` with A and B disabled, threshold target 5, `at_least_once` (catch any breach).
- Channel: user picks an existing Slack channel.
- Dry-run on last 1h: payments error rate hovered around 0.3%, would have
fired 0 times. Clean — not too tight.
- Preview, save, report.
User: "Notify me if log volume spikes more than 1000 error logs per minute on any service"
Agent:
- Intent parsed: logs, volume threshold, scope = all services (groupBy `service.name`), threshold 1000/min, severity warning.
- `signoz:signoz_get_field_keys signal=logs fieldContext=resource` confirms `service.name` exists; no service filter required since the user said "any service" → groupBy provides the scoping.
- Builds: `alertType: LOGS_BASED_ALERT`, `aggregations: [{expression: "count()"}]`, `filter: {expression: "severity_text IN ('ERROR', 'FATAL')"}`, `groupBy: [{name: "service.name", fieldContext: "resource", fieldDataType: "string"}]`, threshold 1000, `at_least_once` (catch any minute that breaches).
- Channels: user picks a Slack channel.
- Dry-run: returned per-service counts, max in last 1h was 87 — would
have fired 0 times. Within reasonable headroom.
- Preview, save, report.
User: "Alert me on anomalous request latency for the api gateway"
Agent:
- Intent parsed: metric, anomaly detection, scope `service.name = 'api-gateway'`. Anomaly detection requires an anomaly rule on a metric signal.
- `signoz:signoz_list_metrics searchText=duration` → finds `http.server.request.duration`.
- Builds: the anomaly rule config on that metric, scoped to the api gateway, threshold target 3 (3 standard deviations).
- Channel: user picks slack-api.
- Dry-run validates query returns data. Skip breach-count for
anomaly alerts.
- Preview emphasizes that the threshold is in standard deviations, not raw
latency. Save, report.
## Additional resources

- `signoz://alert/instructions` and the other alert MCP resources — full alert config JSON schema, threshold codes, filter expression syntax, and version-current pattern examples. Always preferred over any transcribed copy.
- `signoz-generating-queries` skill — for authoring PromQL or testing queries before wrapping them in an alert.