kibana-anomaly-detection
Elastic ML Anomaly Detection
Single skill covering all anomaly detection work against Kibana Agent Builder MCP at `{KIBANA_URL}/api/agent_builder/mcp`. Use the Mode Selector below to pick the right approach for the user's question; modes share the same tool surface and concepts.
Platform
- Read path: ES|QL against `.ml-anomalies-*`, `.ml-config`, `.ml-notifications-*`, `.ml-annotations-*`
- Always-available: `platform.core.execute_esql` (plus additional platform tools for search, index mapping, and documentation; see `scripts/agent_builder_constants.json`)
- ML API spec (if available): `.kibana_ai_openapi_spec_elasticsearch`; see references/anomaly-detection-openapi-spec-discover.md for discovery pattern.
- Run `ad_validate_ml_tool_permissions` first when tools return empty/misleading results; missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.
Mode Selector
| User intent | Mode |
|---|---|
| "What broke?" / RCA / cross-job / blast radius / influencers / log categories | Investigate |
| "Why score high/low?" / renormalization / model bounds / forecasts | Explain |
| Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars | Troubleshoot |
| Create a job / configure a datafeed / start analysis / retrieve results | Manage |
| Security framing (attack chains, MITRE, exfil) | Investigate + references/security-anomaly-expert.md |
| Observability/SRE framing (degradation, capacity, deployment regression) | Investigate + references/observability-anomaly-expert.md |
When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic — finish one before moving
on.
Score Quick Reference
- `record_score` bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
- `multi_bucket_impact ≥ 3` → sustained shift (not a transient spike)
- `initial_record_score >> record_score` → renormalization (model saw worse anomalies later)
- `actual << typical` with `count`/`low_count`/`low_mean` → absence/outage, not just low value
- Low scores across many jobs > one high score: a composite cross-job signal often beats single-detector severity
Full score definitions, renormalization mechanics, and `anomaly_score_explanation` components: references/score-reference.md.
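The band thresholds above can be sketched as a small helper. This is only an illustration: the function name `score_band` is hypothetical, and the handling of the exact boundaries (75 counted as warning because the source says ">75", and 50 and 25 counted in the higher band) is an assumption, since the quick reference does not say which side they fall on.

```python
def score_band(record_score: float) -> str:
    """Map a record_score (0-100) to the severity band used in this skill."""
    if record_score > 75:
        return "critical"
    if record_score >= 50:  # assumption: 50 and 75 both count as warning
        return "warning"
    if record_score >= 25:  # assumption: 25 counts as minor
        return "minor"
    return "informational"


for score in (91.2, 75.0, 50.0, 12.5):
    print(score, score_band(score))
```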
Core concepts
Treat `.ml-anomalies-*` as three layers, accessed via `result_type`:
- `bucket`: bucket-level unusualness per `bucket_span`. `anomaly_score` is the aggregate across all detectors.
- `record`: finest-grained rows with `actual` vs `typical`, `probability`, `record_score`, `anomaly_score_explanation`.
- `influencer`: entity contributions ranked within a bucket (`influencer_score`).
Read scores this way:
- `anomaly_score` / `record_score` = current normalized values (move as the model sees new extremes).
- `initial_anomaly_score` / `initial_record_score` = immutable snapshots from detection time.
- Compare `actual` to `typical`; use `probability` for raw likelihood.
- Map entities via `partition_field_value` / `by_field_value` / `over_field_value`.
- Read `multi_bucket_impact` (-5 to +5) to separate single-bucket spikes from sustained trends.
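As a rough sketch of how a single `record` row might be read, the helper below tags sustained shifts and absence anomalies. It assumes the row has already been flattened into a dict with scalar `actual` and `typical` values and a `function` field; the helper name `classify_record` and the `absence_ratio` cut-off used to approximate "actual << typical" are illustrative assumptions, not part of the skill.

```python
ABSENCE_FUNCTIONS = {"count", "low_count", "low_mean"}


def classify_record(record: dict, absence_ratio: float = 0.1) -> list[str]:
    """Return interpretation tags for one flattened record row."""
    tags = []
    # multi_bucket_impact ranges from -5 to +5; >= 3 marks a sustained shift.
    if record.get("multi_bucket_impact", 0) >= 3:
        tags.append("sustained_shift")
    actual, typical = record.get("actual"), record.get("typical")
    # Assumption: "actual << typical" means actual is below 10% of typical.
    if (
        record.get("function") in ABSENCE_FUNCTIONS
        and actual is not None
        and typical
        and actual < absence_ratio * typical
    ):
        tags.append("absence")
    return tags


outage = {"function": "low_count", "actual": 2, "typical": 480, "multi_bucket_impact": 4}
print(classify_record(outage))
```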
Mode: Investigate — RCA
When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.
Tool chain
| Phase | Tools |
|---|---|
| Discovery | |
| Timeline / scope | |
| Cross-job / entities | |
| Records / influencers | |
| RCA depth | |
| Evidence / categories | |
Protocol
Follow the 14-step sequence in references/protocols/investigation.md. High level: `ad_get_available_metadata` → pair `ad_discover_jobs_by_datafeed_index` with `ad_discover_related_jobs` → `ad_query_anomaly_timeline` → rank with `ad_rca_multi_job_entities` (`min_job_count=2`) → `ad_rca_detector_fingerprint` → drill with `ad_query_anomaly_records` + `ad_query_influencers` (low `min_score=25`) → profile with `ad_rca_entity_profile` → order with `ad_rca_correlation` → confirm with `ad_rca_source_evidence`. When `by_field_name == "mlcategory"`, compare with `ad_get_categories` + paired `ad_search_log_category_examples` (baseline vs. anomaly window).
Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class (resource/network/application) · severity · recommended actions. Worked example: references/worked-example.md. Full ES|QL templates and parameters: references/investigate-anomaly-esql-tools.md.
Rules
- Multi-job entities are prime suspects; single-job entities are usually victims. Use `min_job_count=2`.
- Earliest anomaly timestamp wins: sort `ad_rca_correlation` by timestamp; first-appearing entity = origin.
- `multi_bucket_impact ≥ 3` = sustained behavioral shift, weight higher than transient spikes.
- Never close an RCA without `ad_rca_source_evidence`; raw source documents are ground truth.
- Use low `min_score` (25 or lower) for influencer queries; high thresholds miss correlated entities.
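The first two rules (multi-job entities first, earliest timestamp wins) can be illustrated with a short ranking sketch. It assumes anomaly rows have already been fetched and flattened into dicts with `entity`, `job_id`, and `timestamp` keys; those key names and the helper `rank_suspects` are hypothetical and do not describe the output of any specific tool.

```python
from collections import defaultdict


def rank_suspects(anomalies, min_job_count=2):
    """Keep entities seen in >= min_job_count jobs, ordered by first appearance."""
    jobs_by_entity = defaultdict(set)
    first_seen = {}
    for row in anomalies:
        entity = row["entity"]
        jobs_by_entity[entity].add(row["job_id"])
        ts = row["timestamp"]
        if entity not in first_seen or ts < first_seen[entity]:
            first_seen[entity] = ts
    suspects = [e for e, jobs in jobs_by_entity.items() if len(jobs) >= min_job_count]
    # Earliest anomaly wins; the entity name breaks ties deterministically.
    return sorted(suspects, key=lambda e: (first_seen[e], e))


rows = [
    {"entity": "host-b", "job_id": "cpu", "timestamp": 90},
    {"entity": "host-a", "job_id": "cpu", "timestamp": 100},
    {"entity": "host-c", "job_id": "latency", "timestamp": 120},
    {"entity": "host-c", "job_id": "errors", "timestamp": 130},
    {"entity": "host-a", "job_id": "latency", "timestamp": 160},
]
print(rank_suspects(rows))
```

Here `host-b` is dropped even though it appears first, because it shows up in only one job; it is more likely a victim than the origin.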
Mode: Explain — Score / model behavior
When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".
Score types
| Field | Scope | Meaning |
|---|---|---|
| `record_score` | Single record | Normalized severity after renormalization. |
| `initial_record_score` | Single record | Score at detection time. Gap vs `record_score` indicates renormalization. |
| `anomaly_score` | Bucket | Aggregate severity across all detectors in a bucket. |
| `influencer_score` | Entity × bucket | How anomalous a specific entity is in that bucket. |
`anomaly_score_explanation` components
| Component | Effect | What it means |
|---|---|---|
| `anomaly_length` | ↑ score | More consecutive anomalous buckets |
| `single_bucket_impact` | ↑ score | Lower probability → higher impact |
| `multi_bucket_impact` | ↑ score | Sustained pattern contribution |
| `anomaly_characteristics_impact` | ↑ score | Mean shift vs. variance change |
| `high_variance_penalty` | ↓ score | Noisy data → wide bounds → anomaly less surprising |
| `incomplete_bucket_penalty` | ↓ score | Bucket has less data than expected (ingest lag, sparse data) |
Why a score looks wrong
- Unexpectedly low: `high_variance_penalty`, renormalization, <3 weeks training for weekly seasonality, `bucket_span` too large, wrong detector function (`mean` vs `high_mean`), `incomplete_bucket_penalty`, suppression by `custom_rules`.
- Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per entity), `use_null: true` on a sparse field.
Tool chain
| Purpose | Tools |
|---|---|
| Records + explanation | |
| Renormalization drift | |
| Model bounds (visual) | |
| Forecast overlap | |
| Influencer attribution | |
| Config & detector | |
| Categorization | |
| Model snapshots | |
| Structured diagnostic | |
Decision tree (`ad_wf_troubleshoot_anomaly_score`)
- `ad_get_jobs`: ≥3 weeks data for weekly seasonality?
- `ad_ts_model_memory_health`: `memory_status` healthy?
- `ad_ts_delayed_data_annotations`: no incomplete buckets?
- `ad_query_anomaly_records`: compare `record_score` vs `initial_record_score`.
- `ad_get_job_datafeed_config`: `bucket_span`, detector function, `custom_rules`, `use_null`.
- `ad_get_model_plot`: wide bounds → `high_variance_penalty`.
- `ad_rca_score_reassessment`: renormalization drift across history.
- Explain `anomaly_score_explanation` factors.
Rules
- Always show both `initial_record_score` and `record_score`; the gap is the renormalization story.
- Explain renormalization before diagnosing config; score drift is the most common "score dropped" cause and needs no config change.
- `actual << typical` with `count`/`low_count` is an absence anomaly; distinguish outages from value spikes.
- `high_variance_penalty` and `incomplete_bucket_penalty` explain most "low score" surprises without remediation.
- Weekly seasonality needs ≥3 weeks of training data; flag young jobs as the cause.
For detector function selection details, see references/anomaly-detection-functions.md.
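A minimal sketch of the first and fourth rules: given one record, report the renormalization gap and any penalty flags before looking at configuration. It assumes the record is a dict carrying `initial_record_score`, `record_score`, and an `anomaly_score_explanation` object with boolean penalty fields; the helper name `low_score_reasons` and the 20-point `drift_threshold` are illustrative assumptions.

```python
def low_score_reasons(record: dict, drift_threshold: float = 20.0) -> list[str]:
    """List likely no-remediation explanations for a lower-than-expected score."""
    reasons = []
    drift = record["initial_record_score"] - record["record_score"]
    if drift >= drift_threshold:  # assumption: 20 points counts as a large gap
        reasons.append(f"renormalization: score fell {drift:.0f} points since detection")
    explanation = record.get("anomaly_score_explanation", {})
    if explanation.get("high_variance_penalty"):
        reasons.append("high_variance_penalty: noisy data widened the model bounds")
    if explanation.get("incomplete_bucket_penalty"):
        reasons.append("incomplete_bucket_penalty: bucket had less data than expected")
    return reasons


record = {
    "initial_record_score": 90.0,
    "record_score": 55.0,
    "anomaly_score_explanation": {"high_variance_penalty": True},
}
for reason in low_score_reasons(record):
    print(reason)
```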
Mode: Troubleshoot — Job ops
When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars, CCS.
Common issues → fast paths
| Issue | Fast path | Full decision tree |
|---|---|---|
Missing docs / | | |
Memory | | |
| Datafeed not running / job state | | — |
CCS / | | — |
| Score sanity check | — | |
`hard_limit` corrupts model state and causes downstream missing-doc false alarms (categorizer silently skips events for unknown categories). Fix memory before fixing `query_delay`.
Memory concepts
| Field | Meaning |
|---|---|
| Current memory used |
| High-water mark since job opened |
| Configured |
| |
| |
| Partition explosion |
| Too many distinct log patterns |
Prefer `ad_estimate_memory_requirement` (samples cardinality from source, calls Estimate Model Memory API) over heuristics like `peak_model_bytes * 1.3`; the heuristic ignores pure influencer and categorization memory.
Datafeed & timing concepts
- `query_delay`: how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set to P95 ingest latency + buffer (default `60s`–`120s`).
- `delayed_data_check_config`: how aggressively the datafeed checks for late data.
- `bucket_span`: analysis interval. Align with data granularity and detection window.
- `frequency`: how often the datafeed queries while running in real time. Defaults to the bucket span for short bucket spans, or a fraction of the bucket span for longer ones.
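The "P95 ingest latency + buffer" guidance for `query_delay` can be worked through as a small calculation. This sketch assumes you already have a sample of ingest latencies in seconds (for example, ingest time minus event time); the helper name `recommend_query_delay`, the nearest-rank percentile method, and the 30-second default buffer are assumptions chosen for illustration.

```python
import math


def recommend_query_delay(latencies_s, buffer_s=30):
    """Return a query_delay string: nearest-rank P95 latency plus a buffer."""
    ordered = sorted(latencies_s)
    if not ordered:
        raise ValueError("need at least one latency sample")
    # Nearest-rank P95: ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return f"{math.ceil(p95 + buffer_s)}s"


print(recommend_query_delay(range(1, 101)))
print(recommend_query_delay([2.0, 3.5, 41.2], buffer_s=10))
```

With 100 samples of 1 to 100 seconds the P95 is 95 seconds, so the suggestion is `125s`; a result well above the 60s–120s default range is a hint that ingest latency, not the job, explains missing documents.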
Lifecycle for config changes (memory limit, query_delay)
- Stop datafeed: `ad_manage_datafeed` (`action=_stop`)
- Close job
- Update config: `ad_update_model_memory_limit`, `ad_update_datafeed_query_delay`, `ad_update_delayed_data_check_config`
- Open job: `ad_open_job`
- Start datafeed: `ad_manage_datafeed` (`action=_start`)
Recover a corrupted period without resetting the whole model: `ad_revert_model_snapshot`.
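The stop → close → update → open → start ordering can be sketched as a plan builder that an agent could walk through step by step. The tool names come from the list above, but the parameter names (`job_id`, `datafeed_id`, `action`), the `"close job"` placeholder step (the source names no tool for it), and the helper `config_change_plan` are assumptions; the `datafeed-<job_id>` naming follows the convention used in the job-creation workflow.

```python
UPDATE_TOOLS = {
    "model_memory_limit": "ad_update_model_memory_limit",
    "query_delay": "ad_update_datafeed_query_delay",
    "delayed_data_check_config": "ad_update_delayed_data_check_config",
}


def config_change_plan(job_id: str, updates: dict) -> list[tuple[str, dict]]:
    """Return the ordered (tool, params) steps for a safe config change."""
    unknown = set(updates) - set(UPDATE_TOOLS)
    if unknown:
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    datafeed_id = f"datafeed-{job_id}"
    plan = [
        ("ad_manage_datafeed", {"datafeed_id": datafeed_id, "action": "_stop"}),
        ("close job", {"job_id": job_id}),  # placeholder: no tool name given
    ]
    for setting, value in updates.items():
        plan.append((UPDATE_TOOLS[setting], {"job_id": job_id, setting: value}))
    plan.append(("ad_open_job", {"job_id": job_id}))
    plan.append(("ad_manage_datafeed", {"datafeed_id": datafeed_id, "action": "_start"}))
    return plan


for tool, params in config_change_plan("nginx-errors", {"model_memory_limit": "512mb"}):
    print(tool, params)
```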
Tool surface
| Category | Tools |
|---|---|
| Permissions / metadata | |
| Job + datafeed state | |
| Timing / missing docs | |
| Memory | |
| Model / lifecycle | |
| CCS | |
| Calendars | |
Full parameter tables, ES|QL templates, and REST step lists:
references/troubleshoot-anomaly-tool-reference.md.
Rules
- `ad_validate_ml_tool_permissions` first; missing privileges produce misleading empty results.
- Fix memory before `query_delay`: `hard_limit` corrupts state, so `query_delay` fixes on a memory-limited job are wasted.
- Stop the datafeed before updating it. Updating a running datafeed is rejected.
- Close the job before updating memory limit. Sequence above.
- Prefer workflow tools (`ad_wf_*`) over manually chaining diagnostics for complex decisions.
- `ad_preview_datafeed_with_latency` before starting; confirm the datafeed returns data after config changes.
Mode: Manage — Create / configure jobs
When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".
4-step workflow
```text
PUT  _ml/anomaly_detectors/<job_id>                  # 1. Define job (ad_create_job)
PUT  _ml/datafeeds/datafeed-<job_id>                 # 2. Define datafeed (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open            # 3a. Open job (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start          # 3b. Start datafeed (ad_manage_datafeed action=_start)
GET  _ml/anomaly_detectors/<job_id>/results/records  # 4. Read results
```
Process
1. Build configs. Parse the user request into job + datafeed JSON with no null fields.
2. Apply smart defaults:

   | Field | Default | Override when |
   |---|---|---|
   | `bucket_span` | `"15m"` | User specifies a different span |
   | `time_field` | `"@timestamp"` | User names a different timestamp field |
   | `index` | `"logs-*"` | User specifies an index or pattern |
   | `datafeed_query` | `{"match_all": {}}` | User mentions filters, processes, or time windows |
   | `influencers` | by/over/partition fields from detectors | User adds extra influencer fields |
   | `job_id` | Generated from user description | User provides an explicit ID |
   | `query_delay` | `"60s"` | P95 ingest latency is higher |

3. Choose detector function from user intent (full table in references/anomaly-detection-functions.md):
   - "high CPU" / "unusually large" → `high_mean` or `high_sum`
   - "rare logins" / "unusual values" → `rare` (variants below)
   - "too many requests" / "spike in count" → `high_count`

   `rare` variants:
   - Infrequent globally → `rare by_field_name: X`
   - Infrequent vs peers → `rare by_field_name: X over_field_name: Y`
   - Infrequent per segment → `rare by_field_name: X partition_field_name: Y`
   - Infrequent per segment vs peers → `rare by_field_name: X over_field_name: Y partition_field_name: Z`
4. Validate. `platform.core.get_index_mapping` on the target index to verify field existence/types → `ad_validate_job_spec`. If errors, fix and re-validate (max 3 attempts).
5. Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If feedback, incorporate and re-present (up to 3 rounds).
6. Deploy. After confirmation: `ad_create_job` → `ad_create_datafeed` → `ad_open_job` → `ad_manage_datafeed` (`action=_start`). Report final `job_id` and `datafeed_id`.

For batch analysis on historical data, pass `start` and `end` to the datafeed start call.
Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters: references/job-creation-recipes.md.
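Steps 1 and 2 can be sketched as a builder that applies the smart defaults and omits unset fields so that no nulls reach the request bodies. The body layout (`analysis_config`, `data_description`, `indices`, `query`, `query_delay`) follows the Elasticsearch create-job and create-datafeed request shapes; the helper name `build_job_and_datafeed` and its signature are hypothetical, and this is not how the skill's own tools construct the payloads.

```python
import json


def build_job_and_datafeed(
    job_id,
    function,
    field_name=None,
    by_field_name=None,
    over_field_name=None,
    partition_field_name=None,
    bucket_span="15m",
    time_field="@timestamp",
    index="logs-*",
    query=None,
    query_delay="60s",
):
    """Build job and datafeed bodies with defaults applied and no null fields."""
    split_fields = {
        "by_field_name": by_field_name,
        "over_field_name": over_field_name,
        "partition_field_name": partition_field_name,
    }
    detector = {"function": function}
    if field_name is not None:
        detector["field_name"] = field_name
    detector.update({k: v for k, v in split_fields.items() if v is not None})

    job = {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            # Default influencers: the by/over/partition fields of the detector.
            "influencers": [v for v in split_fields.values() if v is not None],
        },
        "data_description": {"time_field": time_field},
    }
    datafeed = {
        "job_id": job_id,
        "indices": [index],
        "query": query if query is not None else {"match_all": {}},
        "query_delay": query_delay,
    }
    return job, datafeed


job, datafeed = build_job_and_datafeed(
    "nginx-errors-per-host", "high_count", by_field_name="host.keyword"
)
print(f"PUT _ml/anomaly_detectors/{datafeed['job_id']}")
print(json.dumps(job, indent=2))
print(f"PUT _ml/datafeeds/datafeed-{datafeed['job_id']}")
print(json.dumps(datafeed, indent=2))
```

Printing the bodies next to their `PUT` lines mirrors step 5, where the user sees the exact API calls before approving.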
Rules
- Create job before datafeed. Datafeed references job by ID.
- Open job before starting datafeed. Start on a closed job is rejected.
- `query_delay` = P95 ingest latency + buffer (60s–120s safe default).
- Forecasts require non-population jobs: `over_field_name` jobs cannot be forecasted; warn before attempting.
- `by_field_name` vs `over_field_name`: `by` compares entity to its own history; `over` compares to peer group in the same bucket. `partition_field_name` = fully independent sub-model with its own normalization.
- `bucket_span` matches detection granularity: 15m for high-frequency, 1h for operational metrics, 1d for daily patterns. Larger smooths short spikes; smaller increases noise.
Registration (Kibana Agent Builder)
Requires Node.js 18+. Defaults to `elastic` / `changeme` when no credentials are supplied.
```bash
cd skills/kibana/kibana-anomaly-detection

# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601

# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure
```
`all register` runs `tools register`, then `workflows register`, then `skills register`. Kibana allows **at most five**
`tool_ids` per skill; the script fills them by scanning `SKILL.md` for tool mentions (in document order), then appends
ids from `references/kibana/tools/esql/*.json` until the cap (workflow-only tools omitted by default). If you run
`skills register` alone, run `tools register` first so those ids exist.
Workflow tool exclusions and prefixes live in `scripts/agent_builder_constants.json`.
**MCP API key permissions:**
- Kibana: `read_onechat`, `space_read`
- Index: `read`, `view_index_metadata` on `.ml-anomalies-*`, `.ml-annotations-*`, `.ml-notifications-*`, `.ml-config`
- For source evidence: `read` on source data indices
---
Tool inventory
ES|QL tool specs live under `references/kibana/tools/esql/*.json`; workflow definitions under `references/kibana/workflows/*.yaml`. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).
Key system indices
| Index | Relevant content |
|---|---|
| |
| job/datafeed documents (visible even for never-run jobs) |
| delayed data ( |
| job messages ( |
Examples
RCA: "Something caused a spike in our error rate at 2pm. What broke?" → Investigate → `ad_get_available_metadata` → `ad_query_anomaly_timeline` → `ad_rca_cross_job_entity_match` → `ad_rca_multi_job_entities` → RCA report.

Score drop: "My anomaly score went from 90 to 55. Did the model change?" → Explain → `ad_rca_score_reassessment` for drift → explain renormalization if `score_drift` is large.

Memory limit: "Job status shows `hard_limit` and results look wrong." → Troubleshoot → `ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit` (lifecycle: stop datafeed → close → update → open → start).

New job: "Detect unusual error rates per host on nginx access logs." → Manage → `high_count` detector with `by_field_name: "host.keyword"` → validate → present → deploy.

Multi-mode: "We had an incident last night, scores were high but now low. Is the job healthy?" → Investigate the incident → Explain the score drift → Troubleshoot if `hard_limit` or delayed data is suspected.
Guidelines
- Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
- `ad_validate_ml_tool_permissions` first on empty results; privileges are the most common false-negative cause.
- Score bands are absolute thresholds: `>75` critical, `50–75` warning, `25–50` minor, `<25` informational.
- Multi-job entities are prime suspects. Use `min_job_count=2` in `ad_rca_multi_job_entities`.
- Show `initial_record_score` alongside `record_score`; the gap tells the renormalization story.
- Fix memory before `query_delay`. `hard_limit` invalidates downstream diagnostics.
- Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query delay.
- Confirm RCAs with `ad_rca_source_evidence`. Raw source documents are ground truth.