Elastic ML Anomaly Detection
Single skill covering all anomaly detection work against the Kibana Agent Builder MCP at `{KIBANA_URL}/api/agent_builder/mcp`. Use the Mode Selector below to pick the right approach for the user's question; modes share the same tool surface and concepts.
Platform
- Read path: ES|QL against , , ,
- Always-available: `platform.core.execute_esql` (plus additional platform tools for search, index mapping, and documentation; see `scripts/agent_builder_constants.json`)
- ML API spec (if available): `.kibana_ai_openapi_spec_elasticsearch`; see references/anomaly-detection-openapi-spec-discover.md for the discovery pattern.
- Run `ad_validate_ml_tool_permissions` first when tools return empty or misleading results; missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.
Mode Selector
| User intent | Mode |
|---|---|
| "What broke?" / RCA / cross-job / blast radius / influencers / log categories | Investigate |
| "Why score high/low?" / renormalization / model bounds / forecasts | Explain |
| Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars | Troubleshoot |
| Create a job / configure a datafeed / start analysis / retrieve results | Manage |
| Security framing (attack chains, MITRE, exfil) | Investigate + references/security-anomaly-expert.md |
| Observability/SRE framing (degradation, capacity, deployment regression) | Investigate + references/observability-anomaly-expert.md |
When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic — finish one before moving
on.
Score Quick Reference
- bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
- → sustained shift (not a transient spike)
- `initial_record_score >> record_score` → renormalization (model saw worse anomalies later)
- with // → absence/outage, not just low value
- Low scores across many jobs > one high score — composite cross-job signal often beats single-detector severity
Full score definitions, renormalization mechanics, and `anomaly_score_explanation` components: references/score-reference.md.
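The band thresholds above can be expressed as a small lookup. This is a minimal sketch, not part of the skill's tooling: the function name `score_band` is hypothetical, and treating the shared boundaries (75, 50, 25) as belonging to the band whose range starts or ends there is an assumption, since the quick reference does not say which side each boundary falls on.

```python
def score_band(score: float) -> str:
    """Map a 0-100 anomaly score to the band names from the quick reference.

    Assumption: 75 counts as warning (critical is strictly >75), and the
    lower bounds 50 and 25 are inclusive.
    """
    if score > 75:
        return "critical"
    if score >= 50:
        return "warning"
    if score >= 25:
        return "minor"
    return "informational"


if __name__ == "__main__":
    for s in (91.2, 75, 50, 24.9):
        print(s, score_band(s))
```

A helper like this is only useful for labeling output; severity decisions should still weigh the cross-job signal described above.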
Core concepts
Treat
as three layers, accessed via
:
- — bucket-level unusualness per . is the aggregate across all detectors.
- — finest-grained rows with vs , , ,
anomaly_score_explanation
.
- — entity contributions ranked within a bucket ().
Read scores this way:
- / = current normalized values (move as the model sees new extremes).
- / = immutable snapshots from detection time.
- Compare to ; use for raw likelihood.
- Map entities via / / .
- Read (-5 to +5) to separate single-bucket spikes from sustained trends.
Mode: Investigate — RCA
When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.
Tool chain
| Phase | Tools |
|---|---|
| Discovery | `ad_get_available_metadata`, , , `ad_discover_jobs_by_datafeed_index` |
| Timeline / scope | `ad_query_anomaly_timeline` |
| Cross-job / entities | `ad_rca_cross_job_entity_match`, `ad_rca_multi_job_entities`, |
| Records / influencers | , |
| RCA depth | `ad_rca_detector_fingerprint`, , , `ad_rca_score_reassessment` |
| Evidence / categories | `ad_get_job_datafeed_config`, , , `ad_search_log_category_examples` |
Protocol
Follow the 14-step sequence in references/protocols/investigation.md. High level: `ad_get_available_metadata` → pair `ad_discover_jobs_by_datafeed_index` with → `ad_query_anomaly_timeline` → rank with `ad_rca_multi_job_entities` ( ) → `ad_rca_detector_fingerprint` → drill with + (low ) → profile with → order with → confirm with . When `by_field_name == "mlcategory"`, compare with + paired `ad_search_log_category_examples` (baseline vs. anomaly window).
Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class
(resource/network/application) · severity · recommended actions. Worked example:
references/worked-example.md. Full ES|QL templates and parameters:
references/investigate-anomaly-esql-tools.md.
Rules
- Multi-job entities are prime suspects; single-job entities are usually victims. Use .
- Earliest anomaly timestamp wins — sort by timestamp; first-appearing entity = origin.
- = sustained behavioral shift, weight higher than transient spikes.
- Never close an RCA without — raw source documents are ground truth.
- Use low (25 or lower) for influencer queries — high thresholds miss correlated entities.
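The first two rules (multi-job entities first, then earliest timestamp) amount to a two-key sort. The sketch below illustrates that ordering on rows already fetched from the results index; the `Anomaly` tuple and `rank_suspects` function are hypothetical names for illustration and do not mirror the output schema of any `ad_rca_*` tool.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Anomaly(NamedTuple):
    job_id: str
    entity: str
    timestamp: int        # epoch millis of the anomalous bucket
    record_score: float


def rank_suspects(rows: Iterable[Anomaly]) -> list[tuple[str, int, int]]:
    """Return (entity, job_count, first_seen) ordered by suspicion.

    Entities flagged by more jobs come first; ties are broken by the
    earliest anomaly timestamp, so the first-appearing entity leads.
    """
    jobs: dict[str, set[str]] = defaultdict(set)
    first_seen: dict[str, int] = {}
    for row in rows:
        jobs[row.entity].add(row.job_id)
        first_seen[row.entity] = min(first_seen.get(row.entity, row.timestamp), row.timestamp)
    ranked = sorted(jobs, key=lambda e: (-len(jobs[e]), first_seen[e]))
    return [(e, len(jobs[e]), first_seen[e]) for e in ranked]


if __name__ == "__main__":
    sample = [
        Anomaly("cpu", "host-a", 1000, 80.0),
        Anomaly("net", "host-a", 1200, 40.0),
        Anomaly("cpu", "host-b", 900, 95.0),
    ]
    for entity, job_count, ts in rank_suspects(sample):
        print(entity, job_count, ts)
```

Note that `host-b` has the highest single score and the earliest timestamp in the sample, yet ranks below `host-a` because it appears in only one job. The ranking is a starting point; raw source documents remain the final check.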
Mode: Explain — Score / model behavior
When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".
Score types
| Field | Scope | Meaning |
|---|---|---|
| `record_score` | Single record | Normalized severity after renormalization. |
| `initial_record_score` | Single record | Score at detection time. Gap vs `record_score` = renormalization drift. |
| `anomaly_score` | Bucket | Aggregate severity across all detectors in a bucket. |
| `influencer_score` | Entity × bucket | How anomalous a specific entity is in that bucket. |
anomaly_score_explanation
components
| Component | Effect | What it means |
|---|---|---|
|  | ↑ score | More consecutive anomalous buckets |
|  | ↑ score | Lower probability → higher impact |
|  | ↑ score | Sustained pattern contribution |
| `anomaly_characteristics_impact` | ↑ score | Mean shift vs. variance change |
|  | ↓ score | Noisy data → wide bounds → anomaly less surprising |
| `incomplete_bucket_penalty` | ↓ score | Bucket has less data than expected (ingest lag, sparse data) |
Why a score looks wrong
- Unexpectedly low: , renormalization, <3 weeks training for weekly seasonality,
too large, wrong detector function ( vs ),
incomplete_bucket_penalty
, suppression by
.
- Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per
entity), on a sparse field.
Tool chain
| Purpose | Tools |
|---|---|
| Records + explanation | (exact ) |
| Renormalization drift | `ad_rca_score_reassessment` (`score_drift = initial_record_score - record_score`) |
| Model bounds (visual) | — actual outside / = anomaly |
| Forecast overlap | |
| Influencer attribution | |
| Config & detector | `ad_get_job_datafeed_config`: , function, , |
| Categorization | |
| Model snapshots | |
| Structured diagnostic | `ad_wf_troubleshoot_anomaly_score` (full decision tree) |
Decision tree (`ad_wf_troubleshoot_anomaly_score`)
- — ≥3 weeks data for weekly seasonality?
- `ad_ts_model_memory_health`: healthy?
- `ad_ts_delayed_data_annotations`: no incomplete buckets?
- — compare vs .
- `ad_get_job_datafeed_config`: , detector function, , .
- — wide bounds → .
- `ad_rca_score_reassessment`: renormalization drift across history.
- Explain `anomaly_score_explanation` factors.
Rules
- Always show both `initial_record_score` and `record_score`; the gap between them is the renormalization story.
- Explain renormalization before diagnosing config — score drift is the most common "score dropped" cause and needs
no config change.
- with / is an absence anomaly — distinguish outages from value spikes.
- and
incomplete_bucket_penalty
explain most "low score" surprises without remediation.
- Weekly seasonality needs ≥3 weeks of training data — flag young jobs as the cause.
For detector function selection details, see
references/anomaly-detection-functions.md.
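The drift formula from the tool chain table (`score_drift = initial_record_score - record_score`) is simple enough to sanity-check by hand. The sketch below is illustrative only: `describe_drift` and its 20-point `threshold` are hypothetical, chosen to show how a "score dropped" question might be triaged before touching configuration.

```python
def score_drift(initial_record_score: float, record_score: float) -> float:
    """Drift as defined in this document: detection-time score minus current score."""
    return initial_record_score - record_score


def describe_drift(initial_record_score: float, record_score: float,
                   threshold: float = 20.0) -> str:
    """Label the drift. The threshold is an illustrative assumption, not an Elastic default."""
    drift = score_drift(initial_record_score, record_score)
    if drift >= threshold:
        return "renormalized down: the model has since seen more extreme anomalies"
    if drift <= -threshold:
        return "renormalized up: the current score exceeds the detection-time score"
    return "stable: little renormalization drift"


if __name__ == "__main__":
    # The "90 to 55" case from the Examples section.
    print(score_drift(90, 55), describe_drift(90, 55))
```

A large positive drift on its own needs no config change, which is why the rules above put the renormalization explanation ahead of any configuration diagnosis.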
Mode: Troubleshoot — Job ops
When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars,
CCS.
Common issues → fast paths
| Issue | Fast path | Full decision tree |
|---|---|---|
| Missing docs / warning | `ad_ts_delayed_data_annotations` → → `ad_ts_ingest_latency_estimate` → `ad_update_datafeed_query_delay` | `ad_wf_troubleshoot_query_delay` |
| Memory / | `ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit` | `ad_wf_troubleshoot_memory_limit` |
| Datafeed not running / job state | (state) → → | n/a |
| CCS / indices | | n/a |
| Score sanity check | n/a | `ad_wf_troubleshoot_anomaly_score` |
corrupts model state and causes downstream missing-doc false alarms (categorizer silently skips events
for unknown categories).
Fix memory before fixing .
Memory concepts
| Field | Meaning |
|---|---|
| Current memory used |
| High-water mark since job opened |
| Configured |
| / (pruning) / (critical) |
| `total_by_field_count > 100k` | cardinality too high; dominant driver |
| `total_partition_field_count > 10k` | Partition explosion |
| `total_category_count > 10k` | Too many distinct log patterns |
Prefer `ad_estimate_memory_requirement` (samples cardinality from source, calls the Estimate Model Memory API) over heuristics like
— the heuristic ignores pure influencer and categorization memory.
Datafeed & timing concepts
- `query_delay`: how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set to P95 ingest latency + buffer (default 60s–120s).
- `delayed_data_check_config`: how aggressively the datafeed checks for late data.
- `bucket_span`: analysis interval. Align with data granularity and detection window.
- — defaults to
min(query_delay, bucket_span / 2)
.
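The "P95 ingest latency + buffer" guidance for `query_delay` can be worked through on a list of latency samples. This is a sketch under stated assumptions: the function names, the 30-second buffer, and the 60-second floor are illustrative choices (the floor borrows the lower end of the 60s–120s safe default mentioned in this document), and the nearest-rank percentile is one of several valid P95 definitions.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def suggest_query_delay(latencies_s: list[float],
                        buffer_s: float = 30.0,
                        floor_s: float = 60.0) -> float:
    """Suggested query_delay in seconds: P95 ingest latency plus a buffer.

    buffer_s and floor_s are assumptions for illustration only.
    """
    return max(floor_s, p95(latencies_s) + buffer_s)


if __name__ == "__main__":
    observed = [4.0, 6.5, 8.0, 12.0, 45.0, 90.0]
    print(f"query_delay ~ {suggest_query_delay(observed):.0f}s")
```

In practice the latency samples would come from `ad_ts_ingest_latency_estimate` rather than a hand-built list.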
Lifecycle for config changes (memory limit, query_delay)
- Stop datafeed: ()
- Close job
- Update config: `ad_update_model_memory_limit`, `ad_update_datafeed_query_delay`, `ad_update_delayed_data_check_config`
- Open job:
- Start datafeed: ()
Recover a corrupted period without resetting the whole model:
.
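The stop → close → update → open → start ordering can be made explicit as a plan of REST calls. The sketch below only builds the plan and sends nothing. It assumes the `datafeed-<job_id>` naming used in the Manage workflow and the standard Elasticsearch ML `_stop`, `_close`, `_update`, `_open`, and `_start` endpoints; whether the `ad_update_*` tools wrap exactly these calls is an assumption, and `config_change_plan` is a hypothetical helper.

```python
from typing import Optional

Step = tuple[str, str, Optional[dict]]  # (HTTP method, path, request body)


def config_change_plan(job_id: str,
                       model_memory_limit: Optional[str] = None,
                       query_delay: Optional[str] = None) -> list[Step]:
    """Return the ordered REST calls for a memory-limit or query_delay change."""
    datafeed_id = f"datafeed-{job_id}"  # naming convention assumed from the Manage section
    steps: list[Step] = [
        ("POST", f"_ml/datafeeds/{datafeed_id}/_stop", None),
        ("POST", f"_ml/anomaly_detectors/{job_id}/_close", None),
    ]
    if model_memory_limit is not None:
        steps.append(("POST", f"_ml/anomaly_detectors/{job_id}/_update",
                      {"analysis_limits": {"model_memory_limit": model_memory_limit}}))
    if query_delay is not None:
        steps.append(("POST", f"_ml/datafeeds/{datafeed_id}/_update",
                      {"query_delay": query_delay}))
    steps.append(("POST", f"_ml/anomaly_detectors/{job_id}/_open", None))
    steps.append(("POST", f"_ml/datafeeds/{datafeed_id}/_start", None))
    return steps


if __name__ == "__main__":
    for method, path, body in config_change_plan("nginx-errors", model_memory_limit="512mb"):
        print(method, path, body or "")
```

Building the plan as data first makes it easy to show the user the full sequence for confirmation before any step runs.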
Tool surface
| Category | Tools |
|---|---|
| Permissions / metadata | `ad_validate_ml_tool_permissions`, `ad_get_available_metadata`, |
| Job + datafeed state | `ad_get_job_datafeed_config`, , , `ad_preview_datafeed_with_latency` |
| Timing / missing docs | `ad_ts_delayed_data_annotations`, , `ad_ts_ingest_latency_estimate`, `ad_update_datafeed_query_delay`, `ad_update_delayed_data_check_config`, `ad_wf_troubleshoot_query_delay` |
| Memory | `ad_ts_model_memory_health`, `ad_wf_ts_field_cardinality`, `ad_estimate_memory_requirement`, `ad_update_model_memory_limit`, `ad_wf_troubleshoot_memory_limit` |
| Model / lifecycle | , , , |
| CCS | |
| Calendars | , |
Full parameter tables, ES|QL templates, and REST step lists:
references/troubleshoot-anomaly-tool-reference.md.
Rules
- Run `ad_validate_ml_tool_permissions` first; missing privileges produce misleading empty results.
- Fix memory before — corrupts state; fixes on a memory-limited job are
wasted.
- Stop the datafeed before updating it. Updating a running datafeed is rejected.
- Close the job before updating memory limit. Sequence above.
- Prefer workflow tools () over manually chaining diagnostics for complex decisions.
- Run `ad_preview_datafeed_with_latency` before starting; confirm the datafeed returns data after config changes.
Mode: Manage — Create / configure jobs
When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".
4-step workflow
```text
PUT  _ml/anomaly_detectors/<job_id>                  # 1. Define job (ad_create_job)
PUT  _ml/datafeeds/datafeed-<job_id>                 # 2. Define datafeed (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open            # 3a. Open job (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start          # 3b. Start datafeed (ad_manage_datafeed action=_start)
GET  _ml/anomaly_detectors/<job_id>/results/records  # 4. Read results
```
Process
- Build configs. Parse the user request into job + datafeed JSON with no null fields.
- Apply smart defaults:
| Field | Default | Override when |
|---|---|---|
| | User specifies a different span |
| | User names a different timestamp field |
| | User specifies an index or pattern |
| | User mentions filters, processes, or time windows |
| by/over/partition fields from detectors | User adds extra influencer fields |
| Generated from user description | User provides an explicit ID |
| | P95 ingest latency is higher |
- Choose detector function from user intent (full table in references/anomaly-detection-functions.md):
- "high CPU" / "unusually large" → or
- "rare logins" / "unusual values" → (variants below)
- "too many requests" / "spike in count" →
- Infrequent globally →
- Infrequent vs peers →
rare by_field_name: X over_field_name: Y
- Infrequent per segment →
rare by_field_name: X partition_field_name: Y
- Infrequent per segment vs peers →
rare by_field_name: X over_field_name: Y partition_field_name: Z
- Validate. `platform.core.get_index_mapping` on the target index to verify field existence/types → . If there are errors, fix and re-validate (max 3 attempts).
- Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If there is feedback, incorporate it and re-present (up to 3 rounds).
-
Deploy. After confirmation:
→
→
→
(
). Report final
and
.
For
batch analysis on historical data, pass
and
to the datafeed start call.
Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters:
references/job-creation-recipes.md.
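Two of the process rules above lend themselves to a short sketch: "no null fields" and "influencers default to the by/over/partition fields from detectors". The helper names below are hypothetical, and the `15m` bucket span and `@timestamp` time field are placeholder defaults assumed for illustration; the body layout follows the standard Elasticsearch anomaly detection job structure (`analysis_config`, `data_description`).

```python
def strip_nulls(value):
    """Recursively drop None values so the request body has no null fields."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value if v is not None]
    return value


def build_job_body(description: str, detectors: list[dict],
                   bucket_span: str = "15m",          # assumed placeholder default
                   time_field: str = "@timestamp",    # assumed placeholder default
                   extra_influencers: tuple[str, ...] = ()) -> dict:
    """Assemble a job body; influencers default to the detectors' split fields."""
    influencers: list[str] = []
    for det in detectors:
        for key in ("by_field_name", "over_field_name", "partition_field_name"):
            field = det.get(key)
            if field and field not in influencers:
                influencers.append(field)
    for field in extra_influencers:
        if field not in influencers:
            influencers.append(field)
    return strip_nulls({
        "description": description,
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": detectors,
            "influencers": influencers,
        },
        "data_description": {"time_field": time_field},
    })


if __name__ == "__main__":
    import json
    body = build_job_body(
        "Rare usernames per host",
        [{"function": "rare", "by_field_name": "user.name",
          "over_field_name": None, "partition_field_name": "host.name"}],
    )
    print(json.dumps(body, indent=2))
```

The resulting body would still go through the validate and present-and-confirm steps before deployment.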
Rules
- Create job before datafeed. Datafeed references job by ID.
- Open job before starting datafeed. Start on a closed job is rejected.
- = P95 ingest latency + buffer (60s–120s safe default).
- Forecasts require non-population jobs — jobs cannot be forecasted; warn before attempting.
- vs : compares entity to its own history; compares to peer group in
the same bucket. = fully independent sub-model with its own normalization.
- matches detection granularity — 15m for high-frequency, 1h for operational metrics, 1d for daily
patterns. Larger smooths short spikes; smaller increases noise.
Registration (Kibana Agent Builder)
Requires Node.js 18+. Defaults to
/
when no credentials are supplied.
```bash
cd skills/kibana/kibana-anomaly-detection
# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601
# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure
```
runs
, then
, then
. Kibana allows
at most five
per skill; the script fills them by scanning
for tool mentions (in document order), then appends
ids from
references/kibana/tools/esql/*.json
until the cap (workflow-only tools omitted by default). If you run
alone, run
first so those ids exist.
Workflow tool exclusions and prefixes live in `scripts/agent_builder_constants.json`.
MCP API key permissions:
- Kibana: ,
- Index: , on , , ,
- For source evidence: on source data indices
Tool inventory
ES|QL tool specs live under `references/kibana/tools/esql/*.json`; workflow definitions live under `references/kibana/workflows/*.yaml`. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).
Key system indices
| Index | Relevant content |
|---|---|
| , , , , , , , |
| job/datafeed documents (visible even for never-run jobs) |
| delayed data () |
| job messages (: info/warning/error) |
Examples
RCA: "Something caused a spike in our error rate at 2pm. What broke?" → Investigate → `ad_get_available_metadata` → `ad_query_anomaly_timeline` → `ad_rca_cross_job_entity_match` → `ad_rca_multi_job_entities` → RCA report.
Score drop: "My anomaly score went from 90 to 55 — did the model change?" → Explain →
ad_rca_score_reassessment
for drift → explain renormalization if
is large.
Memory limit: "Job status shows
and results look wrong." → Troubleshoot →
ad_ts_model_memory_health
→
ad_wf_ts_field_cardinality
→
ad_estimate_memory_requirement
→
ad_update_model_memory_limit
(lifecycle: stop
datafeed → close → update → open → start).
New job: "Detect unusual error rates per host on nginx access logs." → Manage →
detector with
by_field_name: "host.keyword"
→ validate → present → deploy.
Multi-mode: "We had an incident last night, scores were high but now low — is the job healthy?" → Investigate the
incident → Explain the score drift → Troubleshoot if
or delayed data is suspected.
Guidelines
- Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
- Run `ad_validate_ml_tool_permissions` first on empty results; privileges are the most common false-negative cause.
- Score bands are absolute thresholds: >75 critical, 50–75 warning, 25–50 minor, <25 informational.
- Multi-job entities are prime suspects. Use in
ad_rca_multi_job_entities
.
- Show `initial_record_score` alongside `record_score`; the gap tells the renormalization story.
- Fix memory before . invalidates downstream diagnostics.
- Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query
delay.
- Confirm RCAs with . Raw source documents are ground truth.