# Alert Explain
Decode an existing SigNoz alert's configuration into a plain-language
explanation. The skill is read-only and stays focused on the rule
itself: what it watches, when it fires, where it notifies. A single
line of fire-frequency data is included to ground the explanation, but
this skill does not investigate any specific fire — that is
`signoz-investigating-alerts`'s job.
## Prerequisites
This skill calls SigNoz MCP server tools (`signoz:signoz_get_alert`,
`signoz:signoz_list_alert_rules`, `signoz:signoz_get_alert_history`).
Before running the workflow, confirm the tools are available. If they
are not, the SigNoz MCP server is not installed or configured — stop and
direct the user to set it up:
https://signoz.io/docs/ai/signoz-mcp-server/. Do not guess at alert
configuration from the rule name alone.
## When to use
Use this skill when the user wants to:
- Understand or interpret an existing alert rule.
- Confirm what signal an alert watches and at what threshold.
- Audit whether an alert is reasonably configured.
- Translate raw alert JSON into operational language.

Do NOT use when the user wants to:
- Create a new alert → use the alert-creation skill instead.
- Diagnose why an alert fired or correlate signals around a fire window
→ `signoz-investigating-alerts`.
- Modify an existing alert → call `signoz:signoz_update_alert` directly.
## Required inputs
| Input | Required | Source if missing |
|---|---|---|
| Alert identifier (rule ID or name) | yes | `signoz:signoz_list_alert_rules`, recent context, or fuzzy match |

If the input is missing or ambiguous, this skill is best-effort (not
strict — read-only operations are cheap to recover from):
- Call `signoz:signoz_list_alert_rules`, paginate through every page, and
find the closest name match.
- State the interpretation in the response: "Interpreting your request as
alert 'High Error Rate — Checkout' (id 42). If you meant a different one,
tell me the name or id."
- Proceed with the explanation. The user can correct after.
## Workflow

### Step 1: Resolve the alert
If the user provided a numeric id, skip to Step 2. Otherwise:
- Call `signoz:signoz_list_alert_rules` and paginate every page — keep
fetching until the response indicates there are no more pages.
- Match by name (case-insensitive substring). If multiple match,
present the candidates and ask which one (interactive) or pick the
closest and flag the assumption (autonomous).
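A minimal sketch of the matching pass, assuming each listed rule exposes an
`id` and an `alert` (name) field; the real list-tool response shape may differ:

```python
def match_alert_rules(rules, requested_name):
    """Case-insensitive substring match over the full, already-paginated list.

    Each rule is assumed to look like
    {"id": "42", "alert": "High Error Rate — Checkout"}; adjust the keys to
    whatever the list tool actually returns.
    """
    needle = requested_name.strip().lower()
    return [r for r in rules if needle in r.get("alert", "").lower()]

# One match: proceed. Zero or several: present candidates (interactive),
# or pick the closest and flag the assumption (autonomous).
```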
### Step 2: Fetch the full configuration
Call `signoz:signoz_get_alert` with the rule id. This is mandatory — the
list response does not include the full condition / thresholds /
notification settings, and explanations based on the name alone are
guesses.
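For orientation, a rough sketch of the kind of payload the full fetch returns.
The field names mirror the ones this skill refers to (condition, op, matchType,
target, evalWindow, frequency, labels, annotations, channels), but treat the
exact shape and key spelling as assumptions to verify against the real
`signoz:signoz_get_alert` response for your SigNoz version:

```python
# Illustrative shape only; not the authoritative SigNoz schema.
example_rule = {
    "id": "42",
    "alert": "High Error Rate — Checkout",
    "ruleType": "threshold_rule",       # or "anomaly_rule"
    "evalWindow": "5m0s",               # lookback window per evaluation
    "frequency": "1m0s",                # how often the rule is evaluated
    "condition": {
        "op": "above",                  # comparison (often stored as a code)
        "matchType": "at_least_once",   # often stored as a code as well
        "target": 5,
        "targetUnit": "%",
        "alertOnAbsent": False,         # absent-data monitoring toggle
        "compositeQuery": {"queries": "A, B, ... plus formulas like F1"},
    },
    "labels": {"severity": "critical"},
    "annotations": {"summary": "...", "description": "..."},
    "preferredChannels": ["checkout-oncall"],   # hypothetical channel name
}
```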
### Step 3: Pull a one-line fire-frequency summary
Call `signoz:signoz_get_alert_history` for the rule with a 7-day lookback.
From the response, derive a single line:
"Fired N times in the last 7d (last fire: <relative-time>)."
If the alert never fired in the window, say so explicitly:
"Has not fired in the last 7d." If the alert is disabled, mention that
and skip the history line.
This single line grounds the explanation. Do not drill into specific
fires here — that's `signoz-investigating-alerts`.
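A small sketch of how the one-liner can be derived, assuming the history
response reduces to a list of firing timestamps (the actual field names in
`signoz:signoz_get_alert_history` may differ):

```python
from datetime import datetime, timezone

def fire_frequency_line(fire_timestamps):
    """Collapse a 7-day alert history into the single grounding line.

    `fire_timestamps` is assumed to be a list of ISO-8601 strings with UTC
    offsets, e.g. "2024-05-01T12:00:00+00:00"; adapt parsing to whatever
    the history tool actually returns.
    """
    fires = sorted(datetime.fromisoformat(ts) for ts in fire_timestamps)
    if not fires:
        return "Has not fired in the last 7d."
    hours_ago = int((datetime.now(timezone.utc) - fires[-1]).total_seconds() // 3600)
    n = len(fires)
    label = "time" if n == 1 else "times"
    return f"Fired {n} {label} in the last 7d (last fire: {hours_ago}h ago)."
```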
### Step 4: Build the explanation
The single most useful thing for the user is a tight summary. Lead
with a TL;DR that directly answers the question they asked, not a
generic alert summary. The TL;DR is the only thing some users will
read — burying their answer under a fixed template forces them to
scroll for what they wanted in the first place.
Match the TL;DR shape to the user's question:
- "What does this alert do?" / "Explain X" — describe what fires:
  "TL;DR: Fires when <condition> for <window>, notifies <channel>."
- "Is it configured correctly?" / "Audit this" / "Anything I should
  change?" — lead with the verdict and the top 1–3 changes, not the
  description of what fires:
  "TL;DR: Mostly well-configured, but recommend: (1) add `alertOnAbsent`
  — currently a crashed service stays silent; (2) fix the broken
  annotation template variable (it won't interpolate as written);
  (3) split critical to PagerDuty (both tiers currently route to Slack)."
- "How does X work?" / "Explain the count guard" — answer the
  mechanism in 1–2 sentences before any framing:
  "TL;DR: The count guard is a `having` clause on query A — any 1-minute
  bucket with ≤50 spans is dropped before evaluation, so low-traffic
  minutes can't fire the alert."
- "What's the threshold?" / focused config question — state the
  exact thing they asked about:
  "TL;DR: Threshold is 3 standard deviations (z-score), not a raw rate
  value. Daily seasonality means the model compares each hour against
  historical norms for that hour."
Always include the fire-frequency line and the disabled status if
non-default — those ground every kind of TL;DR. But put the answer to
the user's specific question first.
After the TL;DR, write the explanation in prose, organized into the
four sections below. Skip any section that has nothing meaningful to
add — empty severity labels, default notification settings, vanilla
annotations don't deserve a header. Short and skimmable beats
perfunctorily complete; the user is not reading a checklist.
1. What it watches — one short paragraph. Combine signal type
(metrics / logs / traces / exceptions), what the query measures, and
scope. Translate the query to operational language; for formulas, name
each sub-query (A, B, …) and state what F1 (or whichever triggers)
computes — e.g. "F1 = A × 100 / B → error percentage". Decode filter
operators (equals, not equals, in/nin, like/nlike, contains, regex,
exists/nexists); enumerate in/nin value lists so the user can verify
them. Name each groupBy dimension and its practical effect ("fires
separately per service" for `service.name`). A hedged sketch of the
formula shape follows at the end of this section.
For anomaly rules (ruleType `anomaly_rule`), explicitly state that the
threshold is in standard deviations from the learned pattern, not the
raw value — this is the most common point of confusion. Include the
algorithm (zscore), the seasonality (hourly / daily / weekly), and how
lower/higher targets shift sensitivity (lower → more noise, higher →
only extreme deviations).
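Two hedged sketches to make this section concrete. First, a hypothetical
error-rate formula in the A / B / F1 shape described above; the key names and
filter syntax are illustrative, not the exact SigNoz builder JSON:

```python
# Hypothetical builder queries for an error-rate formula alert.
composite_query = {
    "A": {  # errored spans for the checkout service, per bucket
        "dataSource": "traces",
        "aggregation": "count",
        "filters": [
            {"key": "service.name", "op": "=", "value": "checkout"},
            {"key": "hasError", "op": "=", "value": True},
        ],
    },
    "B": {  # all checkout spans, per bucket
        "dataSource": "traces",
        "aggregation": "count",
        "filters": [{"key": "service.name", "op": "=", "value": "checkout"}],
    },
    "F1": {"expression": "A * 100 / B"},  # the alert triggers on F1
}
# Operational translation: "F1 = A × 100 / B → the percentage of checkout
# spans that errored in each bucket."
```

Second, the anomaly-specific fields, again with key names as assumptions to
verify against the rule JSON actually returned:

```python
# Illustrative anomaly-rule fragment; key names are assumptions.
anomaly_rule = {
    "ruleType": "anomaly_rule",
    "condition": {
        "algorithm": "zscore",   # deviation measured in standard deviations
        "seasonality": "daily",  # hourly / daily / weekly
        "target": 3,             # 3 standard deviations, NOT a raw metric value
    },
}
# Lower target → more sensitive (more noise); higher → only extreme deviations.
```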
2. When it fires — one paragraph covering threshold + timing.
Decode the threshold spec into plain English using these mappings:
- op codes: above, below, equal, not equal.
- matchType codes: at_least_once (any point in window),
all_the_times (entire window), on_average (window average),
in_total (window sum), last (most recent point).
State each threshold tier's severity, target value, unit, and attached
channels. Always state the threshold in the user-facing unit, not the
native query unit (e.g. "fires when p99 exceeds 500 ms", not "…exceeds
500 000 000 ns"). Note any recovery threshold if set (hysteresis); if
absent, mention flap risk when the value hovers near the boundary.
Describe timing as "checks every `frequency` over the last `evalWindow`",
and mention that with at_least_once a single-point breach triggers,
while all_the_times requires the full window. A sketch of this decode
follows below.
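A compact sketch of the decode step. The English meanings are the ones listed
above; the numeric code values and key names are assumptions to confirm
against the rule JSON you actually fetch:

```python
# Assumed code-to-meaning mappings; verify against your SigNoz version.
OP_CODES = {"1": "above", "2": "below", "3": "equal", "4": "not equal"}
MATCH_TYPES = {
    "1": "at_least_once",   # any point in the window
    "2": "all_the_times",   # the entire window
    "3": "on_average",      # window average
    "4": "in_total",        # window sum
    "5": "last",            # most recent point
}

def describe_when_it_fires(condition, frequency="1m", eval_window="5m"):
    """Render a rule condition as the plain-English 'when it fires' line."""
    op = OP_CODES.get(str(condition.get("op")), str(condition.get("op")))
    match = MATCH_TYPES.get(str(condition.get("matchType")), str(condition.get("matchType")))
    unit = condition.get("targetUnit", "")
    return (
        f"Fires when the value is {op} {condition.get('target')}{unit} "
        f"({match}); checks every {frequency} over the last {eval_window}."
    )
```

For example, a condition of `{"op": "1", "matchType": "1", "target": 5, "targetUnit": "%"}`
would read back as "Fires when the value is above 5% (at_least_once); checks
every 1m over the last 5m."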
3. Where it notifies — channels per tier (resolved by name from
`signoz_list_notification_channels` if needed), `notificationSettings.groupBy`
(how notifications are bundled), re-notification settings (interval + which
states), and routing policies (label-based routing). Skip this section
entirely if notification settings are vanilla and the user already saw the
channel in the TL;DR.
4. Notable concerns — flag only what's non-default and worth the
user's attention. Don't list every absent field; focus on the
high-leverage ones:
- Missing `alertOnAbsent` when the signal is critical: silent data
loss (crashed service, broken instrumentation) won't trigger the
alert. Always call this out for production-tier rules.
- `alertOnAbsent` set but the absent state not covered by re-notification:
the absent-data fire pages once and then goes silent — easy to miss.
- Template variable bugs: a malformed template variable won't interpolate,
and dots in label keys become underscores (`service.name` →
`service_name`; see the sketch after this list).
- Multiple severity tiers but no severity label on the rule
— breaks label-based routing policies. Common gap.
- All tiers route to the same channel — defeats the point of
graduated thresholds.
- High-cardinality groupBy (e.g. host × pod) →
notification-storm risk during cluster-wide events.
- Annotation/description text contradicts the matchType (e.g.
description says "for over 5 minutes" but at_least_once
fires on first breach within the window).
- Alert name doesn't match the filter target (e.g. name says
"checkout" but the filter targets a different service) — call this out.
If none of these apply, omit the section. Better silent than padded.
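A tiny illustration of the label-key rule from the template-variable bullet
above; only the dots-to-underscores behavior is taken from this skill, and the
helper name is made up:

```python
def template_label_key(label_key: str) -> str:
    """Dots in label keys become underscores when referenced in templates."""
    return label_key.replace(".", "_")

assert template_label_key("service.name") == "service_name"
```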
If the user asked only "what does this alert do", stop here. The audit
(Step 5) is for "is it configured correctly" / "audit this" /
"anything I should change" requests.
### Step 5: Assess the configuration (only if asked)
The user may ask "is this alert reasonable" alongside the explanation.
Only assess when asked or when the request implies it (audit, review,
"is this configured correctly"). Keep assessment grounded in what's
actually in the config:
- Threshold calibration — appropriate for the signal? Consider
service criticality and traffic.
- matchType fit — at_least_once is sensitive (catches transients);
all_the_times is conservative; on_average smooths noise.
- Window vs frequency — short window + at_least_once can be noisy.
Long window can delay detection.
- Multi-severity — alerts with both warning and critical thresholds
enable graduated response. Single-severity alerts miss this.
- Notification routing — critical → high-urgency channels (PagerDuty);
warning → low-urgency (Slack).
- Missing runbook / description — if the annotations are empty or
default, suggest adding context.
- Absent-data monitoring — for critical signals, recommend
`alertOnAbsent` if it isn't set.
- GroupBy cardinality — high-cardinality groupBy fields can produce
many independent alert series; flag potential notification storms.
- Filter completeness — for in / nin filters with explicit
value lists, flag values that look out of place or missing values
that seem expected.
- Fire frequency vs threshold — if Step 3 shows the alert fires
many times a day (>10/day in the 7d window), the threshold is likely
too tight; if it never fires and the user is asking because they
expected it to, the threshold may be too loose or the query may be
wrong.
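The fire-frequency heuristic from the last bullet, as a small sketch (the
10/day cut-off is the one stated above; everything else is illustrative):

```python
def threshold_calibration_hint(fires_in_7d, user_expected_fires=False):
    """Rough calibration read on the fire count from Step 3 (7-day window)."""
    if fires_in_7d > 10 * 7:  # more than ~10 fires/day
        return "Fires >10/day: threshold is likely too tight."
    if fires_in_7d == 0 and user_expected_fires:
        return ("Never fired but the user expected it to: threshold may be "
                "too loose, or the query may be wrong.")
    return None
```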
### Step 6: Offer next steps
End with two or three actionable follow-ups:
- "Want me to investigate the most recent fire?" (→
signoz-investigating-alerts
)
- "Want me to run the underlying query to see current values?" (→
signoz-generating-queries
)
- "Want me to adjust the threshold or add a severity level?" (→
signoz:signoz_update_alert
directly — out of scope here)
- "Want me to create a related alert for [gap noticed]?" (→
)
## Guardrails
- Fetch before explaining. Always call `signoz:signoz_get_alert`. Do not
base explanations on the rule name or list response alone.
- Always pull fire history. The one-line frequency summary is
cheap (one MCP call) and grounds the explanation. Skip it only if
the alert is disabled.
- Decode, don't dump. Translate op codes, matchType codes, filter
expressions, and query JSON into operational language. Show raw JSON
only if the user asks.
- Decode every sub-query. For formula alerts, explain each builder
query and the formula. Don't gloss over a formula as "error rate" —
show the user what A, B, and F1 each compute.
- Anchor to the actual config. Don't speculate based on the alert
name. If the name says "checkout" but the filter targets a different
service, call that out.
- No live queries by default. Do not run the alert's query unless
the user asks. Explain is about understanding the rule, not
inspecting current data.
- Paginate alert listing. Before concluding an alert does not
exist, walk every page of `signoz:signoz_list_alert_rules`.
- SigNoz evaluation model. SigNoz uses evalWindow + matchType-based
evaluation, not Prometheus-style `for` durations. Don't describe
the evaluation model in Prometheus terms.
- Stay in scope. Investigation of a specific fire window belongs
to `signoz-investigating-alerts`. Modifications belong to
`signoz:signoz_update_alert`.
## Examples

User: "Explain my checkout error rate alert"

Agent:
- `signoz:signoz_list_alert_rules` (paginated) → finds "High Error Rate —
Checkout" id 42.
- `signoz:signoz_get_alert id=42` → traces formula (A errored spans / B
total spans × 100), single critical threshold at 5%, at_least_once,
5-min window checked every 1 min, one paging channel.
- `signoz:signoz_get_alert_history id=42 lookback=7d` → fired 3 times,
last fire 2h ago.
- Replies:
  TL;DR: Fires when checkout error rate (errored spans / total
  spans) exceeds 5% at any point in the last 5 min, pages the attached
  channel. Fired 3 times in the last 7d (last 2h ago).
  What it watches — traces from `service.name = 'checkout'`.
  Query A counts spans with `hasError = true`, query B counts all
  spans, F1 = A × 100 / B is the error percentage; the alert
  triggers on F1.
  When it fires — above, at_least_once, target 5%. Checks every 1 min
  over the last 5 min — a single elevated minute fires it.
  Notable — no recovery threshold (could flap near 5%); no
  `alertOnAbsent` (a checkout outage that drops traffic to zero
  won't trigger this rule).
- Offers: "Want me to investigate the fire from 2h ago?"
User: "Explain alert rule 17 — is it well configured?"
Agent:
-
signoz:signoz_get_alert id=17
→ metrics,
,
two thresholds (warning 80%, critical 90%),
on_average,
filter
deployment.environment.name = 'production'
, groupBy
.
-
signoz:signoz_get_alert_history id=17
→ has not fired in last 7d.
-
The user asked "is it well configured?" — the TL;DR leads with the
verdict, not a generic description. Then explanation + full
audit:
TL;DR: Mostly well-configured (dual-threshold +
is the right call for CPU), but two gaps to fix: (1) no
— a crashed agent silently stops reporting and
this rule won't catch it; (2) annotations are default — a
runbook URL would help on-call. Has not fired in the last 7d
(could mean production is healthy or thresholds too loose).
What it watches —
on
deployment.environment.name = 'production'
, grouped by
(fires per host).
When it fires —
,
, two tiers
(warning 80%, critical 90%) over a 5-min window. Short spikes
don't fire — smoothing is correct for CPU.
Notable concerns — same as TL;DR fixes above; nothing else
non-default.
-
Offers next steps.
User: "Help me understand this anomaly detection alert" (alert id 88)
Agent:
-
signoz:signoz_get_alert id=88
→
,
,
, target 3, metric
http.server.request.duration
, scope
service.name = 'api-gateway'
.
-
History: fired 1 time in last 7d.
-
Replies:
TL;DR: Fires when api-gateway request latency deviates by more
than 3 standard deviations (not raw latency, not a fixed value)
from its learned daily pattern. Fired once in the last 7d.
What it watches —
http.server.request.duration
for
service.name = 'api-gateway'
, evaluated as a Z-score anomaly
with
daily seasonality — the model learns the typical pattern
for each hour of day, so peak-hour latency won't false-trigger if
it matches the historical norm for that hour.
When it fires — when |Z-score| > 3, i.e. the value is more
than 3 standard deviations away from the expected pattern. Lower
target → more sensitive (more noise); higher → only extreme
deviations. The threshold is not in seconds or milliseconds.
-
Offers to investigate the recent fire.