kibana-anomaly-detection

Elastic ML Anomaly Detection

A single skill covering all anomaly detection work against the Kibana Agent Builder MCP at `{KIBANA_URL}/api/agent_builder/mcp`. Use the Mode Selector below to pick the right approach for the user's question. The modes share the same tool surface and concepts.

Platform

- Read path: ES|QL against `.ml-anomalies-*`, `.ml-config`, `.ml-notifications-*`, and `.ml-annotations-*`.
- Always available: `platform.core.execute_esql`, plus additional platform tools for search, index mapping, and documentation (see `scripts/agent_builder_constants.json`).
- ML API spec (if available): `.kibana_ai_openapi_spec_elasticsearch`. See references/anomaly-detection-openapi-spec-discover.md for the discovery pattern.
- Run `ad_validate_ml_tool_permissions` first when tools return empty or misleading results; missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.

Mode Selector

User intent	Mode
"What broke?" / RCA / cross-job / blast radius / influencers / log categories	Investigate
"Why score high/low?" / renormalization / model bounds / forecasts	Explain
Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars	Troubleshoot
Create a job / configure a datafeed / start analysis / retrieve results	Manage
Security framing (attack chains, MITRE, exfil)	Investigate + references/security-anomaly-expert.md
Observability/SRE framing (degradation, capacity, deployment regression)	Investigate + references/observability-anomaly-expert.md

When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic — finish one before moving on.

Score Quick Reference

- `record_score` bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
- `multi_bucket_impact ≥ 3` → sustained shift (not a transient spike)
- `initial_record_score >> record_score` → renormalization (the model saw worse anomalies later)
- `actual << typical` with `count` / `low_count` / `low_mean` → absence/outage, not just a low value
- Low scores across many jobs > one high score: a composite cross-job signal often beats single-detector severity

Full score definitions, renormalization mechanics, and `anomaly_score_explanation` components: references/score-reference.md.
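The bands above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the handling of the exact boundary values (75, 50, 25) is an assumption, since the quick reference does not say which band a boundary score belongs to.

```python
def score_band(record_score: float) -> str:
    """Map a record_score (0-100) to the severity band from the quick reference.

    Assumption: boundary values fall into the lower band (75 -> warning).
    """
    if record_score > 75:
        return "critical"
    if record_score >= 50:
        return "warning"
    if record_score >= 25:
        return "minor"
    return "informational"


def is_sustained(multi_bucket_impact: float) -> bool:
    """True when the anomaly reads as a sustained shift, not a transient spike."""
    return multi_bucket_impact >= 3


print(score_band(82.0), is_sustained(4.0))
```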

Core concepts

Treat `.ml-anomalies-*` as three layers, accessed via `result_type`:

- `bucket`: bucket-level unusualness per `bucket_span`. `anomaly_score` is the aggregate across all detectors.
- `record`: finest-grained rows with `actual` vs `typical`, `probability`, `record_score`, and `anomaly_score_explanation`.
- `influencer`: entity contributions ranked within a bucket (`influencer_score`).

Read scores this way:

- `anomaly_score` / `record_score` = current normalized values (they move as the model sees new extremes).
- `initial_anomaly_score` / `initial_record_score` = immutable snapshots from detection time.
- Compare `actual` to `typical`; use `probability` for raw likelihood.
- Map entities via `partition_field_value` / `by_field_value` / `over_field_value`.
- Read `multi_bucket_impact` (-5 to +5) to separate single-bucket spikes from sustained trends.
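As a rough illustration of reading the `record` layer, the sketch below builds an ES|QL query string over `.ml-anomalies-*`. The helper name, the default threshold, and the selected columns are assumptions made for this example; how the string is handed to `platform.core.execute_esql` is not shown because the tool's parameters are not described here.

```python
def build_record_query(job_id: str, min_score: float = 25.0, limit: int = 20) -> str:
    """Build an ES|QL query for record-level results of one job.

    Hypothetical helper: it only assembles the query text.
    """
    if '"' in job_id or "\\" in job_id:
        raise ValueError("job_id must not contain quotes or backslashes")
    return (
        "FROM .ml-anomalies-* "
        f'| WHERE result_type == "record" AND job_id == "{job_id}" '
        f"AND record_score >= {min_score} "
        "| KEEP timestamp, job_id, record_score, initial_record_score, "
        "actual, typical, probability "
        "| SORT record_score DESC "
        f"| LIMIT {limit}"
    )


print(build_record_query("my-job"))
```

Swapping `"record"` for `"bucket"` or `"influencer"` (and the score column for `anomaly_score` or `influencer_score`) would target the other two layers in the same way.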

Mode: Investigate — RCA

When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.

Tool chain

Phase	Tools
Discovery	`ad_get_available_metadata` , `ad_get_jobs` , `ad_discover_related_jobs` , `ad_discover_jobs_by_datafeed_index`
Timeline / scope	`ad_query_anomaly_timeline`
Cross-job / entities	`ad_rca_cross_job_entity_match` , `ad_rca_multi_job_entities` , `ad_rca_entity_profile`
Records / influencers	`ad_query_anomaly_records` , `ad_query_influencers`
RCA depth	`ad_rca_detector_fingerprint` , `ad_rca_correlation` , `ad_rca_blast_radius` , `ad_rca_score_reassessment`
Evidence / categories	`ad_get_job_datafeed_config` , `ad_rca_source_evidence` , `ad_get_categories` , `ad_search_log_category_examples`

Protocol

Follow the 14-step sequence in references/protocols/investigation.md. High level:

`ad_get_available_metadata` → pair `ad_discover_jobs_by_datafeed_index` with `ad_discover_related_jobs` → `ad_query_anomaly_timeline` → rank with `ad_rca_multi_job_entities` (`min_job_count=2`) → `ad_rca_detector_fingerprint` → drill with `ad_query_anomaly_records` / `ad_query_influencers` (low `min_score=25`) → profile with `ad_rca_entity_profile` → order with `ad_rca_correlation` → confirm with `ad_rca_source_evidence`. When `by_field_name == "mlcategory"`, compare with `ad_get_categories` + paired `ad_search_log_category_examples` (baseline vs. anomaly window).

Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class (resource/network/application) · severity · recommended actions. Worked example: references/worked-example.md. Full ES|QL templates and parameters: references/investigate-anomaly-esql-tools.md.

Rules

- Multi-job entities are prime suspects; single-job entities are usually victims. Use `min_job_count=2`.
- Earliest anomaly timestamp wins: sort `ad_rca_correlation` by timestamp; the first-appearing entity is the origin.
- `multi_bucket_impact ≥ 3` = sustained behavioral shift; weight it higher than transient spikes.
- Never close an RCA without `ad_rca_source_evidence`: raw source documents are ground truth.
- Use a low `min_score` (25 or lower) for influencer queries; high thresholds miss correlated entities.
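The first two rules can be sketched in a few lines. This is not how `ad_rca_multi_job_entities` or `ad_rca_correlation` are implemented; it is a minimal illustration that assumes each anomaly has already been reduced to a dict with hypothetical `entity`, `job_id`, and `timestamp` keys.

```python
from collections import defaultdict


def rank_suspects(records, min_job_count=2):
    """Keep entities seen in >= min_job_count jobs, earliest first anomaly first."""
    jobs = defaultdict(set)
    first_seen = {}
    for rec in records:
        entity = rec["entity"]
        jobs[entity].add(rec["job_id"])
        ts = rec["timestamp"]
        if entity not in first_seen or ts < first_seen[entity]:
            first_seen[entity] = ts
    suspects = [e for e in jobs if len(jobs[e]) >= min_job_count]
    # Earliest timestamp wins; ties go to the entity seen in more jobs.
    return sorted(suspects, key=lambda e: (first_seen[e], -len(jobs[e])))


records = [
    {"entity": "host-a", "job_id": "cpu", "timestamp": 1000},
    {"entity": "host-b", "job_id": "cpu", "timestamp": 900},
    {"entity": "host-a", "job_id": "errors", "timestamp": 950},
    {"entity": "host-c", "job_id": "cpu", "timestamp": 1100},
    {"entity": "host-c", "job_id": "latency", "timestamp": 1200},
]
print(rank_suspects(records))  # ['host-a', 'host-c']
```

`host-b` appears earliest but only in one job, so it is treated as a likely victim and dropped from the suspect list.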

Mode: Explain — Score / model behavior

When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".

Score types

Field	Scope	Meaning
`record_score`	Single record	Normalized severity after renormalization.
`initial_record_score`	Single record	Score at detection time. Gap vs `record_score` = renormalization drift.
`anomaly_score`	Bucket	Aggregate severity across all detectors in a bucket.
`influencer_score`	Entity × bucket	How anomalous a specific entity is in that bucket.

`anomaly_score_explanation` components

Component	Effect	What it means
`anomaly_length`	↑ score	More consecutive anomalous buckets
`single_bucket_impact`	↑ score	Lower probability → higher impact
`multi_bucket_impact`	↑ score	Sustained pattern contribution
`anomaly_characteristics_impact`	↑ score	Mean shift vs. variance change
`high_variance_penalty`	↓ score	Noisy data → wide bounds → anomaly less surprising
`incomplete_bucket_penalty`	↓ score	Bucket has less data than expected (ingest lag, sparse data)

Why a score looks wrong

- Unexpectedly low: `high_variance_penalty`, renormalization, <3 weeks of training for weekly seasonality, `bucket_span` too large, wrong detector function (`mean` vs `high_mean`), `incomplete_bucket_penalty`, suppression by `custom_rules`.
- Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per entity), `use_null: true` on a sparse field.
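A minimal sketch of turning the two penalty components into plain-language reasons, assuming the `anomaly_score_explanation` object of a record has been loaded as a Python dict in which both penalties are boolean flags. The function name and wording are illustrative only.

```python
def low_score_reasons(explanation: dict) -> list:
    """List the score-reducing components present in anomaly_score_explanation."""
    reasons = []
    if explanation.get("high_variance_penalty"):
        reasons.append("high_variance_penalty: noisy data widened the model bounds")
    if explanation.get("incomplete_bucket_penalty"):
        reasons.append("incomplete_bucket_penalty: the bucket held less data than expected")
    return reasons


example = {"single_bucket_impact": 40, "high_variance_penalty": True}
for reason in low_score_reasons(example):
    print(reason)
```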

Tool chain

Purpose	Tools
Records + explanation	`ad_query_anomaly_records` (exact `job_id_pattern` )
Renormalization drift	`ad_rca_score_reassessment` ( `score_drift = initial_record_score - record_score` )
Model bounds (visual)	`ad_get_model_plot` — actual outside `model_lower` / `model_upper` = anomaly
Forecast overlap	`ad_get_forecast_results`
Influencer attribution	`ad_query_influencers`
Config & detector	`ad_get_job_datafeed_config` — `bucket_span` , function, `custom_rules` , `use_null`
Categorization	`ad_get_categories`
Model snapshots	`ad_get_model_snapshots`
Structured diagnostic	`ad_wf_troubleshoot_anomaly_score` (full decision tree)

Decision tree (`ad_wf_troubleshoot_anomaly_score`)

1. `ad_get_jobs`: ≥3 weeks of data for weekly seasonality?
2. `ad_ts_model_memory_health`: is `memory_status` healthy?
3. `ad_ts_delayed_data_annotations`: no incomplete buckets?
4. `ad_query_anomaly_records`: compare `record_score` with `initial_record_score`.
5. `ad_get_job_datafeed_config`: check `bucket_span`, detector function, `custom_rules`, `use_null`.
6. `ad_get_model_plot`: wide bounds → `high_variance_penalty`.
7. `ad_rca_score_reassessment`: renormalization drift across history.
8. Explain the `anomaly_score_explanation` factors.

Rules

- Always show both `initial_record_score` and `record_score`; the gap is the renormalization story.
- Explain renormalization before diagnosing config: score drift is the most common "score dropped" cause and needs no config change.
- `actual << typical` with `count` / `low_count` is an absence anomaly; distinguish outages from value spikes.
- `high_variance_penalty` and `incomplete_bucket_penalty` explain most "low score" surprises without remediation.
- Weekly seasonality needs ≥3 weeks of training data; flag young jobs as the cause.

For detector function selection details, see references/anomaly-detection-functions.md.
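The drift arithmetic behind the first rule is simple enough to show directly. The formula matches the `score_drift = initial_record_score - record_score` definition in the tool chain table; the 20-point threshold used to call a drift "large" is a hypothetical value chosen only for this sketch.

```python
def score_drift(initial_record_score: float, record_score: float) -> float:
    """Renormalization drift: detection-time score minus current score."""
    return initial_record_score - record_score


def describe_drift(initial: float, current: float, large: float = 20.0) -> str:
    """Summarize both scores and whether renormalization is the likely story.

    `large` is an illustrative threshold, not a documented value.
    """
    drift = score_drift(initial, current)
    verdict = "likely renormalization" if drift >= large else "little drift"
    return f"initial={initial:.0f} current={current:.0f} drift={drift:.0f} ({verdict})"


print(describe_drift(90, 55))
```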

Mode: Troubleshoot — Job ops

When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars, CCS.

Common issues → fast paths

Issue	Fast path	Full decision tree
Missing docs / `query_delay` warning	`ad_ts_delayed_data_annotations` → `ad_ts_bucket_event_gaps` → `ad_ts_ingest_latency_estimate` → `ad_update_datafeed_query_delay`	`ad_wf_troubleshoot_query_delay`
Memory `soft_limit` / `hard_limit`	`ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit`	`ad_wf_troubleshoot_memory_limit`
Datafeed not running / job state	`ad_get_jobs` (state) → `ad_get_job_messages` → `ad_manage_datafeed`	—
CCS / `remote_cluster:` indices	`ad_ts_ccs_diagnostics`	—
Score sanity check	—	`ad_wf_troubleshoot_anomaly_score`

`hard_limit` corrupts model state and causes downstream missing-doc false alarms (the categorizer silently skips events for unknown categories). Fix memory before fixing `query_delay`.

Memory concepts

Field	Meaning
`model_bytes`	Current memory used
`peak_model_bytes`	High-water mark since job opened
`model_bytes_memory_limit`	Configured `model_memory_limit`
`memory_status`	`ok` / `soft_limit` (pruning) / `hard_limit` (critical)
`total_by_field_count > 100k`	`by_field` cardinality too high — dominant driver
`total_partition_field_count > 10k`	Partition explosion
`total_category_count > 10k`	Too many distinct log patterns

Prefer `ad_estimate_memory_requirement` (samples cardinality from source, calls the Estimate Model Memory API) over heuristics like `peak_model_bytes * 1.3`; the heuristic ignores pure influencer and categorization memory.
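The table above can be read as a checklist. The sketch below applies it to a dict of model size stats; the field names and thresholds come from the table, while the function name and the message wording are assumptions for illustration.

```python
def memory_findings(stats: dict) -> list:
    """Apply the memory table's status values and cardinality thresholds."""
    findings = []
    status = stats.get("memory_status", "ok")
    if status == "hard_limit":
        findings.append("hard_limit: critical, fix memory before anything else")
    elif status == "soft_limit":
        findings.append("soft_limit: the model is pruning")
    if stats.get("total_by_field_count", 0) > 100_000:
        findings.append("by_field cardinality too high")
    if stats.get("total_partition_field_count", 0) > 10_000:
        findings.append("partition explosion")
    if stats.get("total_category_count", 0) > 10_000:
        findings.append("too many distinct log patterns")
    return findings


stats = {"memory_status": "hard_limit", "total_by_field_count": 250_000}
print(memory_findings(stats))
```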

Datafeed & timing concepts

- `query_delay`: how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set it to P95 ingest latency + buffer (when unset, Elasticsearch picks a value between `60s` and `120s`).
- `delayed_data_check_config`: how aggressively the datafeed checks for late data.
- `bucket_span`: analysis interval. Align it with data granularity and the detection window.
- `frequency`: how often the datafeed runs its scheduled queries in real time. When unset, it is derived from `bucket_span`: the bucket span itself for short spans, a fraction of it for longer ones.
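The "P95 ingest latency + buffer" guidance can be worked through as follows. This is a sketch under stated assumptions: latencies are measured in seconds, the percentile uses the nearest-rank method, and the 30-second buffer and 60-second floor are illustrative defaults, not values taken from `ad_ts_ingest_latency_estimate`.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def recommend_query_delay(latencies_s, buffer_s=30.0, floor_s=60.0):
    """Return a query_delay string: P95 latency plus buffer, never below the floor."""
    seconds = max(p95(latencies_s) + buffer_s, floor_s)
    return f"{math.ceil(seconds)}s"


latencies = list(range(1, 101))  # 1s .. 100s of observed ingest latency
print(recommend_query_delay(latencies))  # 125s
```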

Lifecycle for config changes (memory limit, query_delay)

1. Stop datafeed: `ad_manage_datafeed` (`action=_stop`)
2. Close job
3. Update config: `ad_update_model_memory_limit` / `ad_update_datafeed_query_delay` / `ad_update_delayed_data_check_config`
4. Open job: `ad_open_job`
5. Start datafeed: `ad_manage_datafeed` (`action=_start`)

Recover a corrupted period without resetting the whole model: `ad_revert_model_snapshot`.
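For orientation, the lifecycle above corresponds roughly to the following sequence of Elasticsearch ML REST calls for a memory limit change. The sketch only builds the plan as data; it does not send requests. The `datafeed-<job_id>` naming follows the convention used elsewhere in this skill, and the function name is hypothetical.

```python
def memory_limit_change_plan(job_id, new_limit, datafeed_id=None):
    """Return the ordered (method, path, body) steps for changing model_memory_limit."""
    datafeed_id = datafeed_id or f"datafeed-{job_id}"
    update_body = {"analysis_limits": {"model_memory_limit": new_limit}}
    return [
        ("POST", f"_ml/datafeeds/{datafeed_id}/_stop", None),
        ("POST", f"_ml/anomaly_detectors/{job_id}/_close", None),
        ("POST", f"_ml/anomaly_detectors/{job_id}/_update", update_body),
        ("POST", f"_ml/anomaly_detectors/{job_id}/_open", None),
        ("POST", f"_ml/datafeeds/{datafeed_id}/_start", None),
    ]


for method, path, body in memory_limit_change_plan("nginx-errors", "512mb"):
    print(method, path, body or "")
```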

Tool surface

Category	Tools
Permissions / metadata	`ad_validate_ml_tool_permissions` , `ad_get_available_metadata` , `ad_get_jobs`
Job + datafeed state	`ad_get_job_datafeed_config` , `ad_get_job_messages` , `ad_manage_datafeed` , `ad_preview_datafeed_with_latency`
Timing / missing docs	`ad_ts_delayed_data_annotations` , `ad_ts_bucket_event_gaps` , `ad_ts_ingest_latency_estimate` , `ad_update_datafeed_query_delay` , `ad_update_delayed_data_check_config` , `ad_wf_troubleshoot_query_delay`
Memory	`ad_ts_model_memory_health` , `ad_wf_ts_field_cardinality` , `ad_estimate_memory_requirement` , `ad_update_model_memory_limit` , `ad_wf_troubleshoot_memory_limit`
Model / lifecycle	`ad_get_model_snapshots` , `ad_revert_model_snapshot` , `ad_open_job` , `ad_create_job`
CCS	`ad_ts_ccs_diagnostics`
Calendars	`ad_get_calendar_events` , `ad_create_calendar_event`

Full parameter tables, ES|QL templates, and REST step lists: references/troubleshoot-anomaly-tool-reference.md.

Rules

- Run `ad_validate_ml_tool_permissions` first: missing privileges produce misleading empty results.
- Fix memory before `query_delay`: `hard_limit` corrupts state, so `query_delay` fixes on a memory-limited job are wasted.
- Stop the datafeed before updating it. Updating a running datafeed is rejected.
- Close the job before updating the memory limit, following the lifecycle sequence above.
- Prefer workflow tools (`ad_wf_*`) over manually chaining diagnostics for complex decisions.
- Run `ad_preview_datafeed_with_latency` before starting, to confirm the datafeed returns data after config changes.

Mode: Manage — Create / configure jobs

When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".

4-step workflow

text

PUT  _ml/anomaly_detectors/<job_id>          # 1. Define job        (ad_create_job)
PUT  _ml/datafeeds/datafeed-<job_id>         # 2. Define datafeed   (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open    # 3a. Open job         (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start  # 3b. Start datafeed   (ad_manage_datafeed action=_start)
GET  _ml/anomaly_detectors/<job_id>/results/records  # 4. Read results

Process

Build configs. Parse the user request into job + datafeed JSON with no null fields.

Apply smart defaults:

Field	Default	Override when
`bucket_span`	`"15m"`	User specifies a different span
`time_field`	`"@timestamp"`	User names a different timestamp field
`index`	`"logs-*"`	User specifies an index or pattern
`datafeed_query`	`{"match_all": {}}`	User mentions filters, processes, or time windows
`influencers`	by/over/partition fields from detectors	User adds extra influencer fields
`job_id`	Generated from user description	User provides an explicit ID
`query_delay`	`"60s"`	P95 ingest latency is higher

Choose the detector function from user intent (full table in references/anomaly-detection-functions.md):
- "high CPU" / "unusually large" → `high_mean` or `high_sum`
- "rare logins" / "unusual values" → `rare` (variants below)
- "too many requests" / "spike in count" → `high_count`

`rare` variants:
- Infrequent globally → `rare by_field_name: X`
- Infrequent vs peers → `rare by_field_name: X over_field_name: Y`
- Infrequent per segment → `rare by_field_name: X partition_field_name: Y`
- Infrequent per segment vs peers → `rare by_field_name: X over_field_name: Y partition_field_name: Z`

Validate. Run `platform.core.get_index_mapping` on the target index to verify field existence and types, then `ad_validate_job_spec`. If there are errors, fix and re-validate (max 3 attempts).

Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If there is feedback, incorporate it and re-present (up to 3 rounds).

Deploy. After confirmation: `ad_create_job` → `ad_create_datafeed` → `ad_open_job` → `ad_manage_datafeed` (`action=_start`). Report the final `job_id` and `datafeed_id`. For batch analysis on historical data, pass `start` and `end` to the datafeed start call.

Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters: references/job-creation-recipes.md.
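The "build configs" and "smart defaults" steps can be sketched as a function that assembles the two request bodies. The defaults mirror the table above and the body layout follows the standard anomaly detection job and datafeed formats; the function name and signature are hypothetical, and the real `ad_create_job` / `ad_create_datafeed` tools may expect different inputs.

```python
import json


def build_job_and_datafeed(job_id, function, field_name=None, by_field_name=None,
                           over_field_name=None, partition_field_name=None,
                           bucket_span="15m", time_field="@timestamp",
                           index="logs-*", query=None, query_delay="60s"):
    """Assemble job and datafeed bodies with no null fields."""
    detector = {"function": function}
    optional = {
        "field_name": field_name,
        "by_field_name": by_field_name,
        "over_field_name": over_field_name,
        "partition_field_name": partition_field_name,
    }
    # Only set keys the caller supplied, so the bodies carry no nulls.
    detector.update({k: v for k, v in optional.items() if v is not None})
    influencers = [v for v in (by_field_name, over_field_name, partition_field_name)
                   if v is not None]
    job = {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            "influencers": influencers,
        },
        "data_description": {"time_field": time_field},
    }
    datafeed = {
        "job_id": job_id,
        "indices": [index],
        "query": query or {"match_all": {}},
        "query_delay": query_delay,
    }
    return job, datafeed


job, datafeed = build_job_and_datafeed(
    "nginx-errors-per-host", "high_count", by_field_name="host.keyword")
print(json.dumps(job, indent=2))
print(json.dumps(datafeed, indent=2))
```

The job body would go to `PUT _ml/anomaly_detectors/<job_id>` and the datafeed body to `PUT _ml/datafeeds/datafeed-<job_id>`, matching steps 1 and 2 of the workflow.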

Rules

- Create the job before the datafeed. The datafeed references the job by ID.
- Open the job before starting the datafeed. Starting on a closed job is rejected.
- `query_delay` = P95 ingest latency + buffer (60s–120s is a safe default).
- Forecasts require non-population jobs: `over_field_name` jobs cannot be forecasted; warn before attempting.
- `by_field_name` vs `over_field_name`: `by` compares an entity to its own history; `over` compares it to the peer group in the same bucket. `partition_field_name` = fully independent sub-model with its own normalization.
- `bucket_span` matches detection granularity: 15m for high-frequency data, 1h for operational metrics, 1d for daily patterns. A larger span smooths short spikes; a smaller one increases noise.

Registration (Kibana Agent Builder)

Requires Node.js 18+. Defaults to `elastic` / `changeme` when no credentials are supplied.

bash

cd skills/kibana/kibana-anomaly-detection

# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601

# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure


`all register` runs `tools register`, then `workflows register`, then `skills register`. Kibana allows **at most five**
`tool_ids` per skill; the script fills them by scanning `SKILL.md` for tool mentions (in document order), then appends
ids from `references/kibana/tools/esql/*.json` until the cap (workflow-only tools omitted by default). If you run
`skills register` alone, run `tools register` first so those ids exist.

Workflow tool exclusions and prefixes live in `scripts/agent_builder_constants.json`.

**MCP API key permissions:**

- Kibana: `read_onechat`, `space_read`
- Index: `read`, `view_index_metadata` on `.ml-anomalies-*`, `.ml-annotations-*`, `.ml-notifications-*`, `.ml-config`
- For source evidence: `read` on source data indices

---

Tool inventory

ES|QL tool specs live under `references/kibana/tools/esql/*.json`; workflow definitions live under `references/kibana/workflows/*.yaml`. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).

Key system indices

Index	Relevant content
`.ml-anomalies-*`	`record` , `bucket` , `influencer` , `model_plot` , `model_forecast` , `model_snapshot` , `category_definition` , `model_size_stats`
`.ml-config`	job/datafeed documents (visible even for never-run jobs)
`.ml-annotations-*`	delayed data ( `event == "delayed_data"` )
`.ml-notifications-*`	job messages ( `level` : info/warning/error)

Examples

RCA: "Something caused a spike in our error rate at 2pm. What broke?" → Investigate → `ad_get_available_metadata` → `ad_query_anomaly_timeline` → `ad_rca_cross_job_entity_match` → `ad_rca_multi_job_entities` → RCA report.

Score drop: "My anomaly score went from 90 to 55. Did the model change?" → Explain → `ad_rca_score_reassessment` for drift → explain renormalization if `score_drift` is large.

Memory limit: "Job status shows `hard_limit` and results look wrong." → Troubleshoot → `ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit` (lifecycle: stop datafeed → close → update → open → start).

New job: "Detect unusual error rates per host on nginx access logs." → Manage → `high_count` detector with `by_field_name: "host.keyword"` → validate → present → deploy.

Multi-mode: "We had an incident last night, scores were high but now low. Is the job healthy?" → Investigate the incident → Explain the score drift → Troubleshoot if `hard_limit` or delayed data is suspected.

Guidelines

- Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
- Run `ad_validate_ml_tool_permissions` first on empty results; missing privileges are the most common false-negative cause.
- Score bands are absolute thresholds: `>75` critical, `50–75` warning, `25–50` minor, `<25` informational.
- Multi-job entities are prime suspects. Use `min_job_count=2` in `ad_rca_multi_job_entities`.
- Show `initial_record_score` alongside `record_score`; the gap tells the renormalization story.
- Fix memory before `query_delay`. `hard_limit` invalidates downstream diagnostics.
- Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query delay.
- Confirm RCAs with `ad_rca_source_evidence`. Raw source documents are ground truth.
