kibana-anomaly-detection


Elastic ML Anomaly Detection


Single skill covering all anomaly detection work against Kibana Agent Builder MCP at `{KIBANA_URL}/api/agent_builder/mcp`. Use the Mode Selector below to pick the right approach for the user's question. All modes share the same tool surface and concepts.

Platform


  • Read path: ES|QL against `.ml-anomalies-*`, `.ml-config`, `.ml-notifications-*`, `.ml-annotations-*`
  • Always-available: `platform.core.execute_esql` (plus additional platform tools for search, index mapping, and documentation; see `scripts/agent_builder_constants.json`)
  • ML API spec (if available): `.kibana_ai_openapi_spec_elasticsearch`; see references/anomaly-detection-openapi-spec-discover.md for the discovery pattern.
  • Run `ad_validate_ml_tool_permissions` first when tools return empty or misleading results: missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.

Mode Selector


| User intent | Mode |
| --- | --- |
| "What broke?" / RCA / cross-job / blast radius / influencers / log categories | Investigate |
| "Why score high/low?" / renormalization / model bounds / forecasts | Explain |
| Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars | Troubleshoot |
| Create a job / configure a datafeed / start analysis / retrieve results | Manage |
| Security framing (attack chains, MITRE, exfil) | Investigate + references/security-anomaly-expert.md |
| Observability/SRE framing (degradation, capacity, deployment regression) | Investigate + references/observability-anomaly-expert.md |

When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic; finish one mode before moving to the next.


Score Quick Reference


  • `record_score` bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
  • `multi_bucket_impact ≥ 3` → sustained shift (not a transient spike)
  • `initial_record_score >> record_score` → renormalization (model saw worse anomalies later)
  • `actual << typical` with `count` / `low_count` / `low_mean` → absence/outage, not just a low value
  • Low scores across many jobs can outweigh one high score: a composite cross-job signal often beats single-detector severity
Full score definitions, renormalization mechanics, and `anomaly_score_explanation` components: references/score-reference.md.
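
The bands above can be sketched as a small lookup. This is only an illustration: the function name `score_band` is hypothetical, and the handling of the exact boundary values (75, 50, 25) is an assumption, since the list does not say which band owns them.

```python
def score_band(record_score: float) -> str:
    """Map a record_score to the severity bands in the quick reference.

    Assumption: each band includes its lower bound, so 75.0 is "warning"
    and only values strictly above 75 are "critical".
    """
    if record_score > 75:
        return "critical"
    if record_score >= 50:
        return "warning"
    if record_score >= 25:
        return "minor"
    return "informational"
```

For example, `score_band(82.4)` falls in the critical band and `score_band(30)` in the minor band.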

Core concepts


Treat `.ml-anomalies-*` as three layers, accessed via `result_type`:
  • `bucket`: bucket-level unusualness per `bucket_span`. `anomaly_score` is the aggregate across all detectors.
  • `record`: finest-grained rows with `actual` vs `typical`, `probability`, `record_score`, `anomaly_score_explanation`.
  • `influencer`: entity contributions ranked within a bucket (`influencer_score`).
Read scores this way:
  • `anomaly_score` / `record_score` = current normalized values (move as the model sees new extremes).
  • `initial_anomaly_score` / `initial_record_score` = immutable snapshots from detection time.
  • Compare `actual` to `typical`; use `probability` for raw likelihood.
  • Map entities via `partition_field_value` / `by_field_value` / `over_field_value`.
  • Read `multi_bucket_impact` (-5 to +5) to separate single-bucket spikes from sustained trends.
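
To make the `record` layer concrete, here is a minimal sketch that assembles an ES|QL query string of the kind `platform.core.execute_esql` might be given. The helper `build_record_query` is hypothetical, the field list is taken from the concepts above, and the query has not been validated against a live cluster; the skill's own templates live in references/investigate-anomaly-esql-tools.md.

```python
def build_record_query(job_id: str, min_score: float = 25.0, limit: int = 20) -> str:
    """Build an ES|QL query for record-level results of one job (illustrative only)."""
    # Reject characters that would break out of the quoted ES|QL string literal.
    if '"' in job_id or "\\" in job_id:
        raise ValueError("job_id must not contain quotes or backslashes")
    return "\n".join([
        "FROM .ml-anomalies-*",
        f'| WHERE result_type == "record" AND job_id == "{job_id}"',
        f"| WHERE record_score >= {min_score}",
        "| SORT record_score DESC",
        "| KEEP timestamp, job_id, record_score, initial_record_score, "
        "actual, typical, probability, multi_bucket_impact",
        f"| LIMIT {limit}",
    ])


print(build_record_query("nginx-errors"))
```

Swapping the `result_type` literal for `"bucket"` or `"influencer"` (and adjusting the kept fields) would target the other two layers.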


Mode: Investigate — RCA


When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.

Tool chain


| Phase | Tools |
| --- | --- |
| Discovery | `ad_get_available_metadata`, `ad_get_jobs`, `ad_discover_related_jobs`, `ad_discover_jobs_by_datafeed_index` |
| Timeline / scope | `ad_query_anomaly_timeline` |
| Cross-job / entities | `ad_rca_cross_job_entity_match`, `ad_rca_multi_job_entities`, `ad_rca_entity_profile` |
| Records / influencers | `ad_query_anomaly_records`, `ad_query_influencers` |
| RCA depth | `ad_rca_detector_fingerprint`, `ad_rca_correlation`, `ad_rca_blast_radius`, `ad_rca_score_reassessment` |
| Evidence / categories | `ad_get_job_datafeed_config`, `ad_rca_source_evidence`, `ad_get_categories`, `ad_search_log_category_examples` |

Protocol


Follow the 14-step sequence in references/protocols/investigation.md. High level: `ad_get_available_metadata` → pair `ad_discover_jobs_by_datafeed_index` with `ad_discover_related_jobs` → `ad_query_anomaly_timeline` → rank with `ad_rca_multi_job_entities` (`min_job_count=2`) → `ad_rca_detector_fingerprint` → drill with `ad_query_anomaly_records` + `ad_query_influencers` (low `min_score=25`) → profile with `ad_rca_entity_profile` → order with `ad_rca_correlation` → confirm with `ad_rca_source_evidence`. When `by_field_name == "mlcategory"`, compare with `ad_get_categories` + paired `ad_search_log_category_examples` (baseline vs. anomaly window).
Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class (resource/network/application) · severity · recommended actions. Worked example: references/worked-example.md. Full ES|QL templates and parameters: references/investigate-anomaly-esql-tools.md.

Rules


  1. Multi-job entities are prime suspects; single-job entities are usually victims. Use `min_job_count=2`.
  2. Earliest anomaly timestamp wins: sort `ad_rca_correlation` by timestamp; first-appearing entity = origin.
  3. `multi_bucket_impact ≥ 3` = sustained behavioral shift; weight it higher than transient spikes.
  4. Never close an RCA without `ad_rca_source_evidence`: raw source documents are ground truth.
  5. Use a low `min_score` (25 or lower) for influencer queries; high thresholds miss correlated entities.
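
Rules 1 and 2 can be illustrated with a small ranking routine. This is a sketch of the idea only, not how `ad_rca_multi_job_entities` or `ad_rca_correlation` are implemented; the function `rank_suspects` and the flat record shape (`entity`, `job_id`, `timestamp`) are assumptions made for the example.

```python
from collections import defaultdict


def rank_suspects(records, min_job_count=2):
    """Keep entities seen in at least min_job_count jobs, earliest anomaly first.

    Each record is assumed to be a dict with "entity", "job_id" and a
    sortable "timestamp" (for example epoch milliseconds).
    """
    jobs_by_entity = defaultdict(set)
    first_seen = {}
    for rec in records:
        entity = rec["entity"]
        jobs_by_entity[entity].add(rec["job_id"])
        ts = rec["timestamp"]
        if entity not in first_seen or ts < first_seen[entity]:
            first_seen[entity] = ts
    suspects = [
        (entity, len(jobs), first_seen[entity])
        for entity, jobs in jobs_by_entity.items()
        if len(jobs) >= min_job_count
    ]
    # Earliest first; ties broken by wider job spread, then by name.
    return sorted(suspects, key=lambda s: (s[2], -s[1], s[0]))
```

An entity that appears in only one job is dropped as a likely victim, even if it has the earliest timestamp of all.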


Mode: Explain — Score / model behavior


When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".

Score types


| Field | Scope | Meaning |
| --- | --- | --- |
| `record_score` | Single record | Normalized severity after renormalization. |
| `initial_record_score` | Single record | Score at detection time. Gap vs `record_score` = renormalization drift. |
| `anomaly_score` | Bucket | Aggregate severity across all detectors in a bucket. |
| `influencer_score` | Entity × bucket | How anomalous a specific entity is in that bucket. |

`anomaly_score_explanation` components


| Component | Effect | What it means |
| --- | --- | --- |
| `anomaly_length` | ↑ score | More consecutive anomalous buckets |
| `single_bucket_impact` | ↑ score | Lower probability → higher impact |
| `multi_bucket_impact` | ↑ score | Sustained pattern contribution |
| `anomaly_characteristics_impact` | ↑ score | Mean shift vs. variance change |
| `high_variance_penalty` | ↓ score | Noisy data → wide bounds → anomaly less surprising |
| `incomplete_bucket_penalty` | ↓ score | Bucket has less data than expected (ingest lag, sparse data) |

Why a score looks wrong


  • Unexpectedly low: `high_variance_penalty`, renormalization, <3 weeks training for weekly seasonality, `bucket_span` too large, wrong detector function (`mean` vs `high_mean`), `incomplete_bucket_penalty`, suppression by `custom_rules`.
  • Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per entity), `use_null: true` on a sparse field.

Tool chain


| Purpose | Tools |
| --- | --- |
| Records + explanation | `ad_query_anomaly_records` (exact `job_id_pattern`) |
| Renormalization drift | `ad_rca_score_reassessment` (`score_drift = initial_record_score - record_score`) |
| Model bounds (visual) | `ad_get_model_plot`: actual outside `model_lower` / `model_upper` = anomaly |
| Forecast overlap | `ad_get_forecast_results` |
| Influencer attribution | `ad_query_influencers` |
| Config & detector | `ad_get_job_datafeed_config`: `bucket_span`, function, `custom_rules`, `use_null` |
| Categorization | `ad_get_categories` |
| Model snapshots | `ad_get_model_snapshots` |
| Structured diagnostic | `ad_wf_troubleshoot_anomaly_score` (full decision tree) |

Decision tree (`ad_wf_troubleshoot_anomaly_score`)


  1. `ad_get_jobs`: ≥3 weeks of data for weekly seasonality?
  2. `ad_ts_model_memory_health`: `memory_status` healthy?
  3. `ad_ts_delayed_data_annotations`: no incomplete buckets?
  4. `ad_query_anomaly_records`: compare `record_score` vs `initial_record_score`.
  5. `ad_get_job_datafeed_config`: `bucket_span`, detector function, `custom_rules`, `use_null`.
  6. `ad_get_model_plot`: wide bounds → `high_variance_penalty`.
  7. `ad_rca_score_reassessment`: renormalization drift across history.
  8. Explain `anomaly_score_explanation` factors.

Rules


  1. Always show both `initial_record_score` and `record_score`: the gap is the renormalization story.
  2. Explain renormalization before diagnosing config. Score drift is the most common "score dropped" cause and needs no config change.
  3. `actual << typical` with `count` / `low_count` is an absence anomaly: distinguish outages from value spikes.
  4. `high_variance_penalty` and `incomplete_bucket_penalty` explain most "low score" surprises without remediation.
  5. Weekly seasonality needs ≥3 weeks of training data; flag young jobs as the cause.
For detector function selection details, see references/anomaly-detection-functions.md.
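
Rules 1 and 3 lend themselves to a short worked check. The sketch below is an assumption-laden illustration: `explain_record` is a hypothetical helper, the record is flattened to scalar `actual` and `typical` values, and the cut-offs `drift_threshold` and `absence_ratio` are invented here to stand in for "large gap" and `<<`, which the rules do not quantify.

```python
ABSENCE_FUNCTIONS = {"count", "low_count", "low_mean"}


def explain_record(record, drift_threshold=10.0, absence_ratio=0.1):
    """Return labels that describe why a record's score looks the way it does."""
    labels = []
    # score_drift follows the definition in the tool chain table above.
    score_drift = record["initial_record_score"] - record["record_score"]
    if score_drift >= drift_threshold:
        labels.append("renormalization")
    # Assumed reading of "actual << typical": actual below 10% of typical.
    if (record["function"] in ABSENCE_FUNCTIONS
            and record["actual"] < absence_ratio * record["typical"]):
        labels.append("absence")
    return labels
```

A record detected at 90 that now reads 55 would be labelled as renormalization drift before any configuration is questioned, which matches rule 2.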


Mode: Troubleshoot — Job ops


When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars, CCS.

Common issues → fast paths


| Issue | Fast path | Full decision tree |
| --- | --- | --- |
| Missing docs / `query_delay` warning | `ad_ts_delayed_data_annotations` → `ad_ts_bucket_event_gaps` → `ad_ts_ingest_latency_estimate` → `ad_update_datafeed_query_delay` | `ad_wf_troubleshoot_query_delay` |
| Memory `soft_limit` / `hard_limit` | `ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit` | `ad_wf_troubleshoot_memory_limit` |
| Datafeed not running / job state | `ad_get_jobs` (state) → `ad_get_job_messages` → `ad_manage_datafeed` | |
| CCS / `remote_cluster:` indices | `ad_ts_ccs_diagnostics` | |
| Score sanity check | | `ad_wf_troubleshoot_anomaly_score` |

`hard_limit` corrupts model state and causes downstream missing-doc false alarms (the categorizer silently skips events for unknown categories). Fix memory before fixing `query_delay`.

Memory concepts


| Field | Meaning |
| --- | --- |
| `model_bytes` | Current memory used |
| `peak_model_bytes` | High-water mark since job opened |
| `model_bytes_memory_limit` | Configured `model_memory_limit` |
| `memory_status` | `ok` / `soft_limit` (pruning) / `hard_limit` (critical) |
| `total_by_field_count > 100k` | `by_field` cardinality too high; dominant driver |
| `total_partition_field_count > 10k` | Partition explosion |
| `total_category_count > 10k` | Too many distinct log patterns |

Prefer `ad_estimate_memory_requirement` (samples cardinality from source, calls the Estimate Model Memory API) over heuristics like `peak_model_bytes * 1.3`: the heuristic ignores pure influencer and categorization memory.
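
The thresholds in the table can be applied mechanically to a `model_size_stats` document. The following is a minimal sketch under that assumption; `memory_findings` and its short finding labels are hypothetical, and it is not a substitute for `ad_ts_model_memory_health`.

```python
def memory_findings(stats: dict) -> list[str]:
    """Flag the memory conditions listed in the table for one stats document."""
    findings = []
    status = stats.get("memory_status", "ok")
    if status in ("soft_limit", "hard_limit"):
        findings.append(status)
    if stats.get("total_by_field_count", 0) > 100_000:
        findings.append("by_field_cardinality")
    if stats.get("total_partition_field_count", 0) > 10_000:
        findings.append("partition_explosion")
    if stats.get("total_category_count", 0) > 10_000:
        findings.append("category_explosion")
    return findings
```

An empty list means none of the listed thresholds were crossed; it does not prove the job has enough headroom.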

Datafeed & timing concepts


  • `query_delay`: how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set to P95 ingest latency + buffer (default `60s`–`120s`).
  • `delayed_data_check_config`: how aggressively the datafeed checks for late data.
  • `bucket_span`: analysis interval. Align with data granularity and detection window.
  • `frequency`: defaults to `min(query_delay, bucket_span / 2)`.
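
As a worked version of "P95 ingest latency + buffer", the sketch below derives a `query_delay` string from a list of observed ingest latencies in seconds. The helper name, the 30-second buffer, and the 60-second floor are assumptions chosen for illustration; `ad_ts_ingest_latency_estimate` remains the tool the skill relies on for real measurements.

```python
import math


def suggest_query_delay(latencies_s, buffer_s=30, floor_s=60):
    """Suggest a query_delay from ingest latency samples (nearest-rank P95)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integers.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return f"{max(floor_s, math.ceil(p95 + buffer_s))}s"
```

With 100 samples spread evenly from 1 to 100 seconds the P95 is 95 seconds, so the suggestion is `125s`; with only a few seconds of latency the floor keeps it at `60s`.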

Lifecycle for config changes (memory limit, query_delay)


  1. Stop datafeed: `ad_manage_datafeed` (`action=_stop`)
  2. Close job
  3. Update config: `ad_update_model_memory_limit`, `ad_update_datafeed_query_delay`, `ad_update_delayed_data_check_config`
  4. Open job: `ad_open_job`
  5. Start datafeed: `ad_manage_datafeed` (`action=_start`)
Recover a corrupted period without resetting the whole model: `ad_revert_model_snapshot`.
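
The five steps can be written down as an ordered plan so that no step is skipped or reordered. This is a sketch only: `config_change_plan` is hypothetical, the parameter names (`job_id`, `datafeed_id`) are assumptions about the tool inputs, and the close step carries no tool name because this section does not name one.

```python
UPDATE_TOOLS = {
    "ad_update_model_memory_limit",
    "ad_update_datafeed_query_delay",
    "ad_update_delayed_data_check_config",
}


def config_change_plan(job_id, datafeed_id, update_tool, update_params):
    """Return the stop → close → update → open → start sequence as data."""
    if update_tool not in UPDATE_TOOLS:
        raise ValueError(f"unknown update tool: {update_tool}")
    return [
        {"step": "stop datafeed", "tool": "ad_manage_datafeed",
         "params": {"datafeed_id": datafeed_id, "action": "_stop"}},
        # No tool is named for closing the job in this section.
        {"step": "close job", "tool": None, "params": {"job_id": job_id}},
        {"step": "update config", "tool": update_tool, "params": update_params},
        {"step": "open job", "tool": "ad_open_job", "params": {"job_id": job_id}},
        {"step": "start datafeed", "tool": "ad_manage_datafeed",
         "params": {"datafeed_id": datafeed_id, "action": "_start"}},
    ]
```

Keeping the plan as plain data makes it easy to show the user the full sequence before anything is executed.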

Tool surface


| Category | Tools |
| --- | --- |
| Permissions / metadata | `ad_validate_ml_tool_permissions`, `ad_get_available_metadata`, `ad_get_jobs` |
| Job + datafeed state | `ad_get_job_datafeed_config`, `ad_get_job_messages`, `ad_manage_datafeed`, `ad_preview_datafeed_with_latency` |
| Timing / missing docs | `ad_ts_delayed_data_annotations`, `ad_ts_bucket_event_gaps`, `ad_ts_ingest_latency_estimate`, `ad_update_datafeed_query_delay`, `ad_update_delayed_data_check_config`, `ad_wf_troubleshoot_query_delay` |
| Memory | `ad_ts_model_memory_health`, `ad_wf_ts_field_cardinality`, `ad_estimate_memory_requirement`, `ad_update_model_memory_limit`, `ad_wf_troubleshoot_memory_limit` |
| Model / lifecycle | `ad_get_model_snapshots`, `ad_revert_model_snapshot`, `ad_open_job`, `ad_create_job` |
| CCS | `ad_ts_ccs_diagnostics` |
| Calendars | `ad_get_calendar_events`, `ad_create_calendar_event` |

Full parameter tables, ES|QL templates, and REST step lists: references/troubleshoot-anomaly-tool-reference.md.

Rules


  1. `ad_validate_ml_tool_permissions` first: missing privileges produce misleading empty results.
  2. Fix memory before `query_delay`: `hard_limit` corrupts state; `query_delay` fixes on a memory-limited job are wasted.
  3. Stop the datafeed before updating it. Updating a running datafeed is rejected.
  4. Close the job before updating the memory limit. Follow the lifecycle sequence above.
  5. Prefer workflow tools (`ad_wf_*`) over manually chaining diagnostics for complex decisions.
  6. `ad_preview_datafeed_with_latency` before starting: confirm the datafeed returns data after config changes.


Mode: Manage — Create / configure jobs


When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".

4-step workflow


```text
PUT  _ml/anomaly_detectors/<job_id>          # 1. Define job        (ad_create_job)
PUT  _ml/datafeeds/datafeed-<job_id>         # 2. Define datafeed   (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open    # 3a. Open job         (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start  # 3b. Start datafeed   (ad_manage_datafeed action=_start)
GET  _ml/anomaly_detectors/<job_id>/results/records  # 4. Read results
```

Process


  1. Build configs. Parse the user request into job + datafeed JSON with no null fields.
  2. Apply smart defaults:

     | Field | Default | Override when |
     | --- | --- | --- |
     | `bucket_span` | `"15m"` | User specifies a different span |
     | `time_field` | `"@timestamp"` | User names a different timestamp field |
     | `index` | `"logs-*"` | User specifies an index or pattern |
     | `datafeed_query` | `{"match_all": {}}` | User mentions filters, processes, or time windows |
     | `influencers` | by/over/partition fields from detectors | User adds extra influencer fields |
     | `job_id` | Generated from user description | User provides an explicit ID |
     | `query_delay` | `"60s"` | P95 ingest latency is higher |

  3. Choose the detector function from user intent (full table in references/anomaly-detection-functions.md):
    • "high CPU" / "unusually large" → `high_mean` or `high_sum`
    • "rare logins" / "unusual values" → `rare` (variants below)
    • "too many requests" / "spike in count" → `high_count`
    `rare` variants:
    • Infrequent globally → `rare by_field_name: X`
    • Infrequent vs peers → `rare by_field_name: X over_field_name: Y`
    • Infrequent per segment → `rare by_field_name: X partition_field_name: Y`
    • Infrequent per segment vs peers → `rare by_field_name: X over_field_name: Y partition_field_name: Z`
  4. Validate. Run `platform.core.get_index_mapping` on the target index to verify field existence/types → `ad_validate_job_spec`. If errors, fix and re-validate (max 3 attempts).
  5. Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If there is feedback, incorporate it and re-present (up to 3 rounds).
  6. Deploy. After confirmation: `ad_create_job` → `ad_create_datafeed` → `ad_open_job` → `ad_manage_datafeed` (`action=_start`). Report the final `job_id` and `datafeed_id`.
For batch analysis on historical data, pass `start` and `end` to the datafeed start call.
Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters: references/job-creation-recipes.md.
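
Steps 1 and 2 can be sketched as a small builder that applies the defaults from the table and omits unset fields rather than emitting nulls. The function `build_job_and_datafeed` is hypothetical; the body keys (`analysis_config`, `data_description`, `indices`, `query`, `query_delay`) follow the Elasticsearch create-job and create-datafeed request bodies, and the recipes file remains the authoritative source for full examples.

```python
def build_job_and_datafeed(job_id, function, field_name=None, by_field_name=None,
                           over_field_name=None, partition_field_name=None,
                           bucket_span="15m", time_field="@timestamp",
                           index="logs-*", query=None, query_delay="60s"):
    """Return (job_body, datafeed_body) with the smart defaults applied."""
    detector = {"function": function}
    optional = {
        "field_name": field_name,
        "by_field_name": by_field_name,
        "over_field_name": over_field_name,
        "partition_field_name": partition_field_name,
    }
    # Only include fields the caller set, so the body has no null fields.
    detector.update({k: v for k, v in optional.items() if v is not None})
    # Default influencers: the by/over/partition fields of the detector.
    influencers = [v for v in (by_field_name, over_field_name, partition_field_name)
                   if v is not None]
    job = {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            "influencers": influencers,
        },
        "data_description": {"time_field": time_field},
    }
    datafeed = {
        "job_id": job_id,
        "indices": [index],
        "query": query if query is not None else {"match_all": {}},
        "query_delay": query_delay,
    }
    return job, datafeed
```

For instance, `build_job_and_datafeed("nginx-errors", "high_count", by_field_name="host.keyword")` yields a single `high_count` detector split by host, with `host.keyword` as the only influencer and a `match_all` datafeed query.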

Rules


  1. Create job before datafeed. Datafeed references job by ID.
  2. Open job before starting datafeed. Start on a closed job is rejected.
  3. `query_delay` = P95 ingest latency + buffer (60s–120s safe default).
  4. Forecasts require non-population jobs: `over_field_name` jobs cannot be forecasted; warn before attempting.
  5. `by_field_name` vs `over_field_name`: `by` compares an entity to its own history; `over` compares it to the peer group in the same bucket. `partition_field_name` = fully independent sub-model with its own normalization.
  6. `bucket_span` matches detection granularity: 15m for high-frequency, 1h for operational metrics, 1d for daily patterns. Larger spans smooth short spikes; smaller spans increase noise.


Registration (Kibana Agent Builder)


Requires Node.js 18+. Defaults to `elastic` / `changeme` when no credentials are supplied.

```bash
cd skills/kibana/kibana-anomaly-detection

# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601

# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure
```

`all register` runs `tools register`, then `workflows register`, then `skills register`. Kibana allows **at most five**
`tool_ids` per skill; the script fills them by scanning `SKILL.md` for tool mentions (in document order), then appends
ids from `references/kibana/tools/esql/*.json` until the cap (workflow-only tools omitted by default). If you run
`skills register` alone, run `tools register` first so those ids exist.

Workflow tool exclusions and prefixes live in `scripts/agent_builder_constants.json`.

**MCP API key permissions:**

- Kibana: `read_onechat`, `space_read`
- Index: `read`, `view_index_metadata` on `.ml-anomalies-*`, `.ml-annotations-*`, `.ml-notifications-*`, `.ml-config`
- For source evidence: `read` on source data indices

---

Tool inventory


ES|QL tool specs live under `references/kibana/tools/esql/*.json`; workflow definitions under `references/kibana/workflows/*.yaml`. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).

Key system indices


| Index | Relevant content |
| --- | --- |
| `.ml-anomalies-*` | `record`, `bucket`, `influencer`, `model_plot`, `model_forecast`, `model_snapshot`, `category_definition`, `model_size_stats` |
| `.ml-config` | job/datafeed documents (visible even for never-run jobs) |
| `.ml-annotations-*` | delayed data (`event == "delayed_data"`) |
| `.ml-notifications-*` | job messages (`level`: info/warning/error) |


Examples


RCA: "Something caused a spike in our error rate at 2pm — what broke?" → Investigate → `ad_get_available_metadata` → `ad_query_anomaly_timeline` → `ad_rca_cross_job_entity_match` → `ad_rca_multi_job_entities` → RCA report.
Score drop: "My anomaly score went from 90 to 55 — did the model change?" → Explain → `ad_rca_score_reassessment` for drift → explain renormalization if `score_drift` is large.
Memory limit: "Job status shows `hard_limit` and results look wrong." → Troubleshoot → `ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit` (lifecycle: stop datafeed → close → update → open → start).
New job: "Detect unusual error rates per host on nginx access logs." → Manage → `high_count` detector with `by_field_name: "host.keyword"` → validate → present → deploy.
Multi-mode: "We had an incident last night, scores were high but now low — is the job healthy?" → Investigate the incident → Explain the score drift → Troubleshoot if `hard_limit` or delayed data is suspected.


Guidelines


  1. Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
  2. `ad_validate_ml_tool_permissions` first on empty results: privileges are the most common false-negative cause.
  3. Score bands are absolute thresholds: `>75` critical, `50–75` warning, `25–50` minor, `<25` informational.
  4. Multi-job entities are prime suspects. Use `min_job_count=2` in `ad_rca_multi_job_entities`.
  5. Show `initial_record_score` alongside `record_score`: the gap tells the renormalization story.
  6. Fix memory before `query_delay`. `hard_limit` invalidates downstream diagnostics.
  7. Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query delay.
  8. Confirm RCAs with `ad_rca_source_evidence`. Raw source documents are ground truth.