alibabacloud-elasticsearch-instance-diagnose

Alibaba Cloud Elasticsearch Instance Diagnosis

Collect signals from Alibaba Cloud OpenAPI (control plane) and the Elasticsearch REST API (data plane), combine them with the SOP knowledge base under `references/`, and produce root-cause analysis, an evidence chain, prioritized remediation guidance, and, when multiple dimensions fire, a recency-ordered incident timeline (severity vs time in window; see Timeline and recency (MUST) in §5 Step 4).

Architecture: Alibaba Cloud Elasticsearch OpenAPI + Alibaba CloudMonitor (CMS) + Elasticsearch REST API + diagnostic SOPs

Closure: If MUST applies and `ES_*` is set, finish authenticated ES API evidence before the final report (see Feasibility order in §5).

1. Prerequisites

1.1 Aliyun CLI

Pre-check: Aliyun CLI >= 3.3.1 required (for RAM permission checks and OpenAPI CLI fallback).
Run `aliyun version` to verify the version is >= 3.3.1. If the CLI is missing or too old, see `references/cli-installation-guide.md`. After installation, run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation (do not pass plaintext AccessKey pairs on this command line; see §1.2).
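The version pre-check can be scripted. The following is a minimal sketch, not part of this repo: `version_ge` and `check_aliyun_cli` are hypothetical helper names, and it assumes `aliyun version` prints a bare dotted version string with numeric fields only.

```bash
# version_ge A B: succeed when dotted version A is >= B (numeric fields only).
version_ge() (
  a=$1 b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    fa=${a%%.*}; fb=${b%%.*}
    # Drop the field just read; clear the string when no dot remains.
    case $a in *.*) a=${a#*.} ;; *) a= ;; esac
    case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    fa=${fa:-0}; fb=${fb:-0}
    [ "$fa" -gt "$fb" ] && exit 0
    [ "$fa" -lt "$fb" ] && exit 1
  done
  exit 0
)

# check_aliyun_cli: report whether the installed CLI meets the 3.3.1 minimum.
# Assumption: "aliyun version" prints only the version string.
check_aliyun_cli() {
  cac_installed=$(aliyun version 2>/dev/null) || { echo "aliyun CLI not found"; return 1; }
  version_ge "$cac_installed" 3.3.1 || { echo "aliyun CLI too old; need >= 3.3.1"; return 1; }
  echo "aliyun CLI OK"
}
```

Calling `check_aliyun_cli` before any other step gives a single pass/fail line; on failure, fall back to `references/cli-installation-guide.md`.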

1.2 Alibaba Cloud account authentication and security (MUST)

Security rules (mandatory):
  • NEVER read, echo, or print AccessKey ID or AccessKey Secret values.
  • NEVER prompt or ask the user to paste plaintext AccessKeys in the conversation.
  • NEVER embed AccessKeys in scripts, CLI arguments, or `curl` URLs.
  • NEVER use `aliyun configure set` (or similar) to pass literal AccessKey ID/Secret on the command line.
  • NEVER accept AccessKeys that the user pastes into the chat, even if offered voluntarily.
  • ONLY use configured CLI profiles (`aliyun configure`) or environment variables such as `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` that the user has set in their local shell (the agent must not echo those values in the session).
⚠️ If the user provides AccessKeys in the chat (e.g. “my AK is xxx”):
  1. Stop immediately: do not run any Alibaba Cloud command that requires credentials.
  2. Decline politely and give only the names of approved configuration methods (do not repeat any secret the user may have leaked):
    • Recommended: run `aliyun configure` in a local terminal and enter credentials when prompted; credentials are stored in the local profile file.
    • Alternatively: set `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` in the local shell (the user types values only in the terminal, not in chat).
  3. Resume the diagnosis request only after credentials are configured correctly.
Verify credentials without exposing secrets:
```bash
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
```
Credential policy:
  1. Prefer an `aliyun configure` profile (default or `--profile`).
  2. If there is no valid identity (`configure list` / `get-caller-identity` fails), STOP and guide the user to configure locally; do not guess or fabricate credentials.
  3. Never pass plaintext AccessKeys through the conversation.
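The "no valid identity means STOP" rule can be expressed as a small gate. This is a sketch under assumptions: `require_identity` is a hypothetical helper name, and the CLI output is discarded so that nothing from the profile is echoed into the session.

```bash
# require_identity PROFILE: succeed only when the profile resolves to a caller identity.
# CLI output is sent to /dev/null; only a pass/fail line is printed.
require_identity() {
  ri_profile=${1:-default}
  if aliyun --profile "$ri_profile" sts get-caller-identity >/dev/null 2>&1; then
    echo "identity: OK (profile=$ri_profile)"
    return 0
  fi
  echo "identity: FAILED (profile=$ri_profile). STOP and configure credentials locally with 'aliyun configure'."
  return 1
}
```

A diagnosis script could start with `require_identity "<profile_name>" || exit 1` so that no later command runs without a verified identity.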

1.3 Elasticsearch direct-connect credential boundary

  • NEVER ask the user to paste `ES_PASSWORD` in chat; NEVER echo, print, or log the password; NEVER copy a password from chat into commands, hooks, or repo files.
  • Shell expansion for `curl -u "$ES_USERNAME:$ES_PASSWORD"` (or equivalent) is allowed when vars are pre-exported in the user’s local shell; NEVER put the secret as a literal in chat, scripts checked into repos, or command output.
  • If the user tries to send a password in chat: STOP as well and ask them to set `ES_PASSWORD` only locally via `export` (see §2.2).

2. Environment setup

2.1 Control plane OpenAPI (via Aliyun CLI)

All control-plane and CMS data collection for this skill uses the Aliyun CLI.
[MUST] `elasticsearch` / `cms`: plugin-mode shell only (avoid legacy CLI)

Whenever the agent emits executable `aliyun` lines (chat, reproducibility exports, or copy-paste steps), use plugin subcommands (lowercase-hyphenated) and kebab-case flags, the same shape as `scripts/openapi_cli_collect.py` and references/verification-method.md.
  • Do not use legacy POP-style invocations: a PascalCase verb immediately after `elasticsearch` or `cms` on the same `aliyun` line (the old “action name = subcommand” style), or CamelCase flags like `--InstanceId`, `--Namespace`, `--StartTime` in new commands. Use plugin verbs only (`describe-instance`, `describe-metric-list`, …).
  • Naming split: `DescribeInstance`, `ListSearchLog`, `DescribeMetricList`, etc. are OpenAPI action names (PascalCase; used in docs, RAM, console). The token after `aliyun elasticsearch` or `aliyun cms` in a shell must be the CLI plugin name (`describe-instance`, `list-search-log`, `describe-metric-list`, …).
  • Prefer `python3 scripts/check_es_instance_health.py` for the standard control-plane + CMS bundle so subprocess calls stay aligned with this repo.
  • CLI references: Elasticsearch CLI center, CloudMonitor CLI center.
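The plugin-mode rule above lends itself to a quick lint before a command is emitted. The sketch below is illustrative only: `is_legacy_aliyun_line` is a hypothetical helper, and its two patterns (a capitalized token after the product name, or a flag starting with a capital letter) are a heuristic, not the static check this repo actually runs.

```bash
# is_legacy_aliyun_line LINE: succeed when LINE looks like a legacy POP-style invocation.
is_legacy_aliyun_line() {
  # PascalCase verb right after the product name, e.g. "elasticsearch DescribeInstance".
  printf '%s\n' "$1" | grep -Eq '(elasticsearch|cms)[[:space:]]+[A-Z]' && return 0
  # CamelCase flag, e.g. "--InstanceId".
  printf '%s\n' "$1" | grep -Eq '(^|[[:space:]])--[A-Z]' && return 0
  return 1
}
```

For example, `is_legacy_aliyun_line 'aliyun elasticsearch DescribeInstance --InstanceId <id>'` succeeds (legacy shape), while the plugin form `aliyun elasticsearch describe-instance --instance-id <id>` does not.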
AI-Mode and plugin baseline (required): wrap every diagnosis session that runs `aliyun` OpenAPI/CMS commands:
```bash
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
aliyun plugin update
# … diagnosis: aliyun / python3 scripts/check_es_instance_health.py …
aliyun configure ai-mode disable
```

> **`configure ai-mode` missing or failing:** Skip the wrapper above; use **`ALIBABA_CLOUD_USER_AGENT`** (next block). Log the CLI failure (e.g. subcommand unavailable). Whether the profile is **valid** is determined only by **`aliyun configure list`** and **`sts get-caller-identity`**. Write **valid** / **validity**, not *vaild*.

**User-Agent (required)**: set a User-Agent for Alibaba Cloud API calls:
```bash
export ALIBABA_CLOUD_USER_AGENT="AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
```
CLI hardening (recommended): when authoring raw `aliyun` commands, use the §2.1 MUST plugin shape first, then add `--connect-timeout 3 --read-timeout 10` (increase `read-timeout` for large responses or CMS), consistent with the instance-management skill examples, to avoid indefinite hangs on network faults. If the global User-Agent is not set, add `--user-agent AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose` per invocation. For optional Elasticsearch probes inside `check_es_instance_health.py` (when `ES_*` is set), the same knobs exist as `--connect-timeout` / `--read-timeout` on that script; they map to `curl` for engine calls only, not to the Aliyun OpenAPI client.
Run before diagnosis:
```bash
aliyun version
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
```
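The enable/disable wrapper and the User-Agent fallback described in this section can be combined into one helper. This is a sketch, not the repo's implementation: `with_ai_mode` and `SKILL_UA` are hypothetical names, and the fallback simply exports `ALIBABA_CLOUD_USER_AGENT` when `configure ai-mode enable` fails.

```bash
SKILL_UA="AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"

# with_ai_mode CMD...: run CMD between "ai-mode enable" and "ai-mode disable".
# When the ai-mode subcommand is unavailable, log it and fall back to the env var.
with_ai_mode() {
  if aliyun configure ai-mode enable >/dev/null 2>&1; then
    aliyun configure ai-mode set-user-agent --user-agent "$SKILL_UA" >/dev/null 2>&1 \
      || echo "warning: set-user-agent failed" >&2
    aliyun plugin update >/dev/null 2>&1 || echo "warning: plugin update failed" >&2
    # Capture the command's status so disable always runs afterwards.
    "$@" && wam_status=0 || wam_status=$?
    aliyun configure ai-mode disable >/dev/null 2>&1 || echo "warning: ai-mode disable failed" >&2
    return "$wam_status"
  fi
  echo "ai-mode unavailable; falling back to ALIBABA_CLOUD_USER_AGENT" >&2
  export ALIBABA_CLOUD_USER_AGENT="$SKILL_UA"
  "$@"
}
```

Usage would look like `with_ai_mode python3 scripts/check_es_instance_health.py -i <InstanceId> -r <RegionId>`; the wrapped command's exit status is returned unchanged.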

2.2 Elasticsearch API direct access (`curl`)

Have the user set connection variables in a local terminal after you confirm the Elasticsearch endpoint (VPC or public) and admin credentials. Do not hardcode user-specific values in chat:
```bash
export ES_ENDPOINT="http://<elasticsearch-endpoint-ip>:9200"
export ES_USERNAME="elastic"
export ES_PASSWORD="<elasticsearch-admin-password>"
```
Public access and `http` vs `https`: From `DescribeInstance`, use `publicDomain` / `domain` and the reported `protocol`. When `protocol` is `HTTP` (typical public listener), set `ES_ENDPOINT` to `http://<publicDomain>:9200`. Using `https://` against an HTTP-only endpoint causes TLS errors (e.g. `WRONG_VERSION_NUMBER`). Use `https://` only when `protocol` is `HTTPS` (or TLS is actually enabled on the port you use), and supply CA / fingerprint options as in HTTPS options below.
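Deriving the endpoint URL from the reported `protocol` is mechanical, so it can be sketched as a helper. `build_es_endpoint` is a hypothetical name, and the default port 9200 is taken from the examples in this section; adjust it if the deployment uses another port.

```bash
# build_es_endpoint PROTOCOL HOST [PORT]: print the ES_ENDPOINT URL for the reported protocol.
build_es_endpoint() {
  bee_port=${3:-9200}
  case $1 in
    HTTP|http)   printf 'http://%s:%s\n'  "$2" "$bee_port" ;;
    HTTPS|https) printf 'https://%s:%s\n' "$2" "$bee_port" ;;
    *) echo "unknown protocol: $1" >&2; return 1 ;;
  esac
}
```

The user would then run, in their own terminal, something like `export ES_ENDPOINT="$(build_es_endpoint HTTP <publicDomain>)"`.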
If `http://` “does not work”, when to try `https://`: Treat `DescribeInstance` `protocol` as the source of truth for the REST listener. `000`, timeouts, or connection refused on `http://` usually mean network path / allowlist / security group / wrong host or port, not “try HTTPS next”, when `protocol` is still `HTTP`. Do switch to `https://` when `protocol` is `HTTPS` (or the console / product doc states TLS on that endpoint) and the failure on `http://` is a TLS or scheme symptom (e.g. `WRONG_VERSION_NUMBER`, `error:0A00010B`, immediate SSL alert while probing with the wrong scheme). If `protocol` is `HTTP` and only plain TCP is advertised, HTTPS is not a fallback for reachability.
Credential safety
  • NEVER echo, print, or log `ES_PASSWORD`; NEVER copy credentials from chat into shell history or saved files.
  • NEVER ask the user to paste the password in plaintext in chat.
  • ONLY use the following checks to verify that variables are set:
```bash
[[ -n "$ES_ENDPOINT" ]] && echo "ES_ENDPOINT: $ES_ENDPOINT" || echo "ES_ENDPOINT: NOT SET"
[[ -n "$ES_PASSWORD" ]] && echo "ES_PASSWORD: SET" || echo "ES_PASSWORD: NOT SET"
```
Network connectivity and access control
| Issue | How to check | Mitigation |
| --- | --- | --- |
| Public network access disabled | Elasticsearch console → Network | Enable public access or use the VPC endpoint |
| Public access allowlist | Console → Security → Public access allowlist | Add the agent host’s public IP |
| VPC isolation | e.g. `telnet <ES_IP> 9200` | VPC peering, Express Connect, or equivalent |
| Security group | Inbound rules on the ECS/security group hosting Elasticsearch | Allow TCP 9200 (or the configured port) |
Connectivity probe: `curl -sS -o /dev/null -w "%{http_code}" --connect-timeout 5 "${ES_ENDPOINT}"`. HTTP code `000` usually means the path is unreachable. `401` without `-u` is normal (auth required); if `ES_PASSWORD` is SET, proceed to authenticated `GET /_cluster/health` (§7). `401` with `-u` → wrong credentials. `000` / refused / timeout → network, allowlist, or TLS/scheme mismatch.
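The probe outcomes above map cleanly onto a lookup. The following sketch assumes a hypothetical helper `classify_probe` that takes the HTTP code printed by the probe and whether `-u` was used; the labels it prints are illustrative, not output of any script in this repo.

```bash
# classify_probe HTTP_CODE USED_AUTH(yes|no): print a one-word classification.
classify_probe() {
  case "$1:$2" in
    000:*)   echo "unreachable" ;;        # network, allowlist, or TLS/scheme mismatch
    401:no)  echo "auth-required" ;;      # normal without -u; continue with an authenticated call
    401:yes) echo "wrong-credentials" ;;
    2??:*)   echo "ok" ;;
    *)       echo "other" ;;
  esac
}

# Example wiring (not executed here):
#   code=$(curl -sS -o /dev/null -w "%{http_code}" --connect-timeout 5 "${ES_ENDPOINT}")
#   classify_probe "$code" no
```

Keeping the classification separate from the `curl` call makes it easy to report the blocking reason without re-running the probe.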
HTTPS: prerequisites (what must be true)
  1. Listener: The Elasticsearch HTTP port you call (9200 unless changed) must actually speak TLS; align with `DescribeInstance` `protocol` (`HTTPS`) or console/network documentation.
  2. URL: `https://<host>:<port>` with the same host (e.g. `publicDomain`) you would use for HTTP.
  3. Client trust of the server certificate: Your client must trust the cluster’s certificate chain (cluster / cloud CA PEM, or corporate proxy CA if TLS is intercepted). `curl`: prefer `curl --cacert /path/to/ca.crt ...`; `-k` / `--insecure` only for short, non-production diagnosis.
  4. Auth: Same `ES_USERNAME` / `ES_PASSWORD` as for HTTP (Basic auth over TLS).
HTTPS: how this skill documents it
  • Manual `curl` (§7 and es-api-call-failures.md): Add `--cacert` (or `-k` for testing) to every `curl` when using `https://` if the default trust store does not include your cluster CA.
  • `check_es_instance_health.py` optional ES probes: They invoke `curl` with `-u` only; they do not read `ES_CA_CERTS` / `ES_SSL_FINGERPRINT` / `ES_VERIFY_CERTS` (those names are common for Python Elasticsearch clients). For HTTPS instances, use §7 `curl` with `--cacert` for deep checks, or extend the script later to pass `--cacert` from an env var.
  • Python-style env vars (reference for other tooling): `ES_CA_CERTS`, `ES_SSL_FINGERPRINT`, `ES_VERIFY_CERTS=false` (testing only). These are not wired into this repo’s optional `curl` path today.
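For manual `curl` calls, the choice between `--cacert` and `--insecure` can be made in one place. This is a sketch under assumptions: `es_tls_args` is a hypothetical helper, the CA file path is passed in by the caller, and nothing here is wired into `check_es_instance_health.py`.

```bash
# es_tls_args ENDPOINT [CA_FILE] [ALLOW_INSECURE(yes|no)]: print extra curl options, one per line.
# Prints nothing for http:// endpoints or when no TLS option applies.
es_tls_args() {
  case $1 in
    https://*)
      if [ -n "${2:-}" ]; then
        printf '%s\n' --cacert "$2"
      elif [ "${3:-no}" = "yes" ]; then
        printf '%s\n' --insecure     # short, non-production diagnosis only
      fi ;;
  esac
}

# Example wiring (not executed here; assumes the CA path contains no whitespace):
#   curl -sS $(es_tls_args "$ES_ENDPOINT" /path/to/ca.crt) -u "$ES_USERNAME:$ES_PASSWORD" "$ES_ENDPOINT/_cluster/health"
```

Preferring the CA file over `--insecure` whenever one is supplied mirrors the order of preference stated in the prerequisites.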

3. RAM permission check

[MUST] RAM permission pre-check
Before running this skill, verify the principal has the required RAM permissions. See `references/ram-policies.md` for the full list. If the user reports insufficient permissions, direct them to attach the corresponding policies in the RAM console.

4. Parameter confirmation

IMPORTANT: Parameter confirmation. Confirm the following with the user before any command or API call. Do not assume undeclared defaults or hardcode user-specific parameters.
Boundary controls (MUST)
  • Region and `instance-id` must not be guessed or taken from unverified defaults; if they disagree with `DescribeInstance` or the user’s explicit statement, reconfirm.
  • Do not apply metrics, logs, or `DescribeInstance` conclusions from instance A to instance B; `ES_ENDPOINT` must match the instance under diagnosis (see Pre-flight validation for Elasticsearch API below).
  • This skill is read-only diagnosis: do not invoke mutating control-plane APIs (create, resize, restart, delete instance, etc.). If the user requests a change, provide recommendations only; execution belongs in the console or an approved change workflow.
| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| `instance-id` | Yes | Elasticsearch instance ID, e.g. `es-cn-xxxxx`. `aliyun` flag is `--instance-id` (not `--InstanceId`). | - |
| `region` | Yes | Region ID (e.g. `cn-hangzhou`). `aliyun` flag is `--region` (not `--region-id`). | - |
| `profile` | No | Aliyun CLI profile (explicit `--profile` recommended) | `default` |
| `ES_ENDPOINT` | No | Elasticsearch endpoint (direct API access only) | - |
| `ES_PASSWORD` | No | Elasticsearch admin password (direct API access only) | - |
| `--window` | No | `check_es_instance_health.py`: analysis window in minutes | 60 |
| `--connect-timeout`, `--read-timeout` | No | `check_es_instance_health.py`: `curl` timeouts for optional ES engine probes when `ES_*` is set (`--connect-timeout` maps to `curl --connect-timeout`; `--read-timeout` contributes to `curl -m` together with connect). | 5 / 10 seconds |

5. End-to-end diagnostic workflow

Agent hard rules (non-negotiable)

Aliyun CLI shape: For `aliyun elasticsearch` and `aliyun cms`, follow §2.1 MUST (plugin mode only) in every new executable command. Do not resurrect legacy `DescribeInstance` / `ListSearchLog`-as-subcommand lines or `--InstanceId`-style flags in session exports or user-facing step lists (they drift from `openapi_cli_collect.py` and fail static checks).
OpenAPI/CMS cannot replace MUST engine APIs. For any §5 MUST table row or `check_es_instance_health.py` rule-engine MUST, Alibaba Cloud OpenAPI and CloudMonitor do not replace the listed Elasticsearch REST calls for engine-level root cause. When feasibility holds, run those `curl` endpoints (see §7); they are complementary layers, not interchangeable.
Feasibility is decided only by checks, not by assumption. Whether the agent may call Elasticsearch must be determined by actually running the Feasibility order (§5): at minimum verify `ES_ENDPOINT` / `ES_PASSWORD` per §2.2, align `ES_ENDPOINT` with `DescribeInstance`, then run an authenticated `GET /_cluster/health`. Do not assume `ES_*` is unset or the path is unreachable without performing these steps in the session.
For Elasticsearch incidents, follow these four steps; each has a distinct role.

Execution strategy (root-cause driven)

Full policy: es-api-diagnosis-strategy.md
Data-plane `curl` collection requires both:
  1. Feasibility: `ES_ENDPOINT` and `ES_PASSWORD` are set and the network path works.
  2. Necessity: root-cause analysis needs data-plane evidence that the control plane or CMS cannot establish alone.
For endpoints listed under a fired MUST table row or rule-engine MUST, necessity for those calls is already satisfied by the trigger; feasibility is still required (Feasibility order). For optional engine `curl` not in those lists, apply feasibility and necessity per es-api-diagnosis-strategy.md.
MUST triggers (if any CMS condition below holds, collect the listed Elasticsearch evidence):
| Trigger | Scenario | Required Elasticsearch evidence |
| --- | --- | --- |
| `ClusterStatus` max ≥ Yellow / Red | Cluster health | `allocation/explain`, `_cat/shards` |
| `NodeCPUUtilization` max > 80% | CPU overload | `_nodes/hot_threads`, `_tasks` |
| `NodeHeapMemoryUtilization` max > 85% | Memory pressure | `_nodes/stats/breaker`, `GET /_cluster/settings?include_defaults=true` (`indices.breaker.*` in transient / persistent) |
| Thread pool `rejected` > 0 | Performance | `_nodes/hot_threads`, `_nodes/stats/thread_pool` |
| Inter-node resource CV > 0.3 | Load imbalance | `_cat/shards`, `_cat/allocation` |
| Write failures or index read-only | Disk / watermark / blocks | `_cluster/settings`, `_all/_settings?filter_path=*.settings.index.blocks`, `_cat/allocation` |
| Intermittent Elasticsearch API timeouts + CMS CPU > 80% | Possible cascading failure | `_nodes/hot_threads`, `_nodes/stats/thread_pool`, `_tasks` |
Thread-pool row: interpret search vs write / bulk using sop-query-thread-pool.md vs sop-write-performance.md (see also Write-path / bulk saturation below).
Rule-engine MUST: If `check_es_instance_health.py` prints a §5 MUST / §5–§7 callout for this run, treat it like a row above and collect the listed ES evidence when feasibility holds.
Binding rule (MUST triggers): If any MUST-trigger row or the rule-engine MUST line above applies, necessity is satisfied for that evidence set; OpenAPI/CMS cannot replace those calls for engine-level root cause (cluster-health: `allocation/explain` + `_cat/shards` for Yellow/Red). Confirm feasibility per Feasibility order below. If reachable with auth, run the MUST-listed endpoints in Step 2 in parallel with control-plane collection. If still blocked after authenticated `GET /_cluster/health`, lead with the blocking reason: unset `ES_*`; transport failure (`000`, refused, timeout); 401 with `-u`; or scheme/TLS mismatch. A 401 on an unauthenticated probe when `ES_PASSWORD` is SET is not a blocking reason.
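The numeric MUST triggers can be evaluated mechanically once the CMS maxima are known. The sketch below covers the first five table rows only and rests on several assumptions: `cv`, `gt`, and `must_evidence` are hypothetical helpers, the coefficient of variation uses the population standard deviation (the source does not say which variant the rule engine uses), and `allocation/explain` is written with its full Elasticsearch path `_cluster/allocation/explain`.

```bash
# cv VALUE...: print the coefficient of variation (population stddev / mean).
cv() {
  printf '%s\n' "$@" | awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0 || sum == 0) { print "0.000"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0
      printf "%.3f\n", sqrt(var) / mean
    }'
}

# gt A B: succeed when A > B (floating point comparison via awk).
gt() { awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 > b + 0) }'; }

# must_evidence STATUS CPU_MAX HEAP_MAX REJECTED CV: print required ES endpoints, deduplicated.
must_evidence() {
  {
    case $1 in Yellow|Red|yellow|red) printf '%s\n' _cluster/allocation/explain _cat/shards ;; esac
    if gt "$2" 80;  then printf '%s\n' _nodes/hot_threads _tasks; fi
    if gt "$3" 85;  then printf '%s\n' _nodes/stats/breaker '_cluster/settings?include_defaults=true'; fi
    if gt "$4" 0;   then printf '%s\n' _nodes/hot_threads _nodes/stats/thread_pool; fi
    if gt "$5" 0.3; then printf '%s\n' _cat/shards _cat/allocation; fi
  } | LC_ALL=C sort -u
}
```

For instance, per-node CPU values of 20, 20, 80, 80 give `cv 20 20 80 80` a result above 0.3, which would add `_cat/shards` and `_cat/allocation` to the evidence list. Necessity is then settled; feasibility still has to be checked separately.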

Write-path / bulk saturation

If `ThreadPool.WriteRejected` or `write` pool stress matches high-QPS bulk indexing, read and follow `references/sop-write-performance.md` §2, subsection “Evidence interpretation: bulk QPS → write pool”, for the evidence chain, `rejected` semantics (cumulative since node start), report ordering vs Old GC / heap (causal chain or dual P0; write path before JVM-only headline), per-node `rejected` / `completed` numbers (reject share), per-node asymmetry, and write-only vs search. Do not lead with a JVM-only narrative when that subsection applies. For write-queue-style acceptance prompts, the opening conclusion should read as write-capacity (data-plane counters + optional CMS rule names), not only a GC/heap headline.
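Per-node reject share is simple arithmetic on the `rejected` and `completed` counters. The sketch below assumes reject share means rejected / (rejected + completed), expressed as a percentage; the SOP may define it differently, so treat the formula as an assumption. `reject_share` and `reject_share_table` are hypothetical helper names.

```bash
# reject_share REJECTED COMPLETED: print rejected / (rejected + completed) as a percentage.
reject_share() {
  awk -v r="$1" -v c="$2" 'BEGIN {
    total = r + c
    if (total == 0) { print "0.00"; exit }
    printf "%.2f\n", 100 * r / total
  }'
}

# reject_share_table: read "node rejected completed" lines on stdin, print "node share%".
reject_share_table() {
  while read -r rst_node rst_rej rst_done; do
    [ -n "$rst_node" ] || continue
    printf '%s %s%%\n' "$rst_node" "$(reject_share "$rst_rej" "$rst_done")"
  done
}
```

Because both counters are cumulative since node start, a share computed this way describes the whole uptime, not the incident window; it shows per-node asymmetry but not recency.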

Search-primary vs write (both pools show cumulative `rejected`)

When `_nodes/stats/thread_pool` shows `search.rejected` well above `write.rejected` on the same node(s) and `ThreadPool.SearchRejected` / query-driven overload applies, lead the executive summary and P0 ordering with `search` (high concurrent query / terms / slow query; hot index when verified), not `write` first. `write.rejected` may remain P0/P1 as parallel or secondary (bulk, catch-up); Old GC / CPU / node disconnect stay co-stress or cascade. Checker listing order is not proof of narrative order; see acceptance-criteria.md §6.5 and sop-query-thread-pool.md Report narrative.
Recency overrides this magnitude default when time-resolved evidence exists: do not rank the opening story by `search.rejected` vs `write.rejected` alone, because cumulative counters lack timestamps. Full rubric: acceptance-criteria.md §6.5 (P0 / executive order vs `search` / `write`: unless write dominated by time) and §6.6 (Executive order, No false recency from counters). Binding: Timeline and recency (MUST) below (same skill).

`activating` / change workflow stuck (cross-layer root cause)

When an instance stays in `activating`, a change is unfinished, and Red or unassigned shards coexist, follow `references/sop-activating-change-stuck.md` end-to-end (MUST includes `ListActionRecords`, `DescribeInstance` before/after remediation, collection order section 3.1, reporting section 4).

Pre-flight validation for Elasticsearch API

[IMPORTANT] `ES_ENDPOINT` must match the diagnosed instance
Compare `publicDomain` / `domain` and `protocol` from `DescribeInstance` with `ES_ENDPOINT`. If they differ, warn:
⚠️ ES_ENDPOINT does not match the current instance; run `export ES_ENDPOINT="http://{publicDomain}:9200"` when `protocol` is `HTTP`, or `https://…` only when `protocol` is `HTTPS` (adjust host/port to match the deployment).
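The pre-flight comparison can be sketched as a pure string check on the endpoint URL. `endpoint_matches` is a hypothetical helper; it compares only scheme and host, and the domain values are expected to come from a prior `DescribeInstance` call that the caller has already made.

```bash
# endpoint_matches ES_ENDPOINT PROTOCOL DOMAIN [PUBLIC_DOMAIN]
# Succeeds when the scheme agrees with PROTOCOL and the host is one of the instance domains.
endpoint_matches() (
  endpoint=$1 protocol=$2 domain=$3 public_domain=${4:-}
  scheme=${endpoint%%://*}
  rest=${endpoint#*://}
  host=${rest%%[:/]*}          # strip port and path
  expected=$(printf '%s' "$protocol" | tr 'A-Z' 'a-z')
  if [ "$scheme" != "$expected" ]; then
    echo "scheme mismatch: got $scheme, expected $expected"; exit 1
  fi
  if [ "$host" = "$domain" ] || { [ -n "$public_domain" ] && [ "$host" = "$public_domain" ]; }; then
    echo "match"; exit 0
  fi
  echo "host mismatch: $host is not a domain of this instance"; exit 1
)
```

A mismatch message from this check is the cue to show the warning above and ask the user to re-export `ES_ENDPOINT` locally.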

When Elasticsearch credentials are missing or connections fail

[CRITICAL] Guide the user to fix connectivity explicitly; classify failure modes (do not default persistent timeouts to “allowlist only”). Do not imply the agent “forgot” Elasticsearch. If the first answer is CMS/OpenAPI-heavy, give the blocking reason per Feasibility order below: unset `ES_*`; transport errors; 401 with valid `-u`; or TLS/scheme mismatch. A 401 on a probe without `-u` when `ES_PASSWORD` is SET is not a blocking reason (use authenticated `curl` first).
Progressive playbook (read in order): references/es-api-call-failures.md (sections 1 → 4).
MUST / strategy context: references/es-api-diagnosis-strategy.md (sections 1–3 and 3.5 summary table).

Mandatory warning when MUST applies but Elasticsearch is not configured

[CRITICAL] If a MUST trigger fires but data-plane evidence is missing, put a warning at the top of the report: follow section 4 of references/es-api-call-failures.md (blocking reason first, then MUST list, missing evidence; if `ES_*` is unset, point to section 2.2 of this SKILL; if the variables are set, use es-api-call-failures sections 1–2 for auth vs transport).

Step 1: Quick health scan (initial signals)

Run the lightweight rules engine (17 metric rules) to list P0 / P1 / P2 findings and steer deeper collection:
```bash
python3 scripts/check_es_instance_health.py -i <InstanceId> -r <RegionId> [--window <minutes, default 60>] [--profile <profile_name>]
```

Feasibility order (agent)

  1. Run §2.2 `ES_*` checks (password = SET only); do not skip; never infer feasibility without this step.
  2. `ES_ENDPOINT` matches `DescribeInstance` `domain` / `publicDomain` (scheme/port).
  3. Authenticated `GET /_cluster/health`; do not stop at 401 on an unauthenticated probe if `ES_PASSWORD` is SET.
  4. MUST scope: table rows and/or rule-engine MUST line in §5.
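Step 1 of this order can be illustrated with a check that reports only SET or UNSET, so the password value never reaches the output. This is a sketch under assumptions: §2.2 is not reproduced here, the function names are hypothetical, and `ES_USERNAME` is treated as optional because the `curl` samples in §7 default it to `elastic`.

```python
import os
from typing import Mapping, Optional


def es_env_status(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Map each ES_* variable to "SET" or "UNSET" without exposing its value."""
    environ = os.environ if environ is None else environ
    names = ("ES_ENDPOINT", "ES_USERNAME", "ES_PASSWORD")
    return {name: "SET" if environ.get(name) else "UNSET" for name in names}


def data_plane_feasible(status: Mapping[str, str]) -> bool:
    """Endpoint and password are required; the username has a default."""
    return status["ES_ENDPOINT"] == "SET" and status["ES_PASSWORD"] == "SET"
```

Passing an explicit mapping instead of `os.environ` keeps the check easy to exercise without touching the real environment.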

Step 2: Collect evidence in parallel

Based on Step 1, run collection in parallel (prioritize dimensions with signals).
If a MUST-trigger row or rule-engine MUST applies: run Feasibility order, then run that Required Elasticsearch evidence via `curl` in the same round (see §7). If no MUST applies, add optional data-plane `curl` only when feasibility and necessity both hold per the strategy doc.
Re-run `check_es_instance_health.py` with the same invocation pattern as Step 1; for this parallel round, `--window 120` and explicit `--profile <profile_name>` are common.
To backfill control-plane evidence (`DescribeInstance`, `ListSearchLog`, CMS-style calls), use `aliyun` patterns in references/verification-method.md (epoch times, profiles, namespaces).
Note: data-plane access still requires `ES_ENDPOINT` / `ES_PASSWORD`; the Aliyun CLI cannot replace `curl` to the cluster.
For MUST-trigger rows, necessity for the listed endpoints is already established; do not skip them when feasibility including reachability holds. Outside those rows, avoid unrelated bulk `curl` solely because `ES_*` is set; use the strategy doc’s feasibility + necessity test instead.
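One way to picture the parallel round is a small thread pool that runs one collector per dimension and records failures instead of aborting. The collector callables and the result layout are hypothetical; how each collector shells out to `aliyun` or `curl` is outside this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def collect_in_parallel(collectors: Dict[str, Callable[[], object]], max_workers: int = 4) -> dict:
    """Run every collector concurrently and keep per-dimension outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                results[name] = {"ok": True, "value": future.result()}
            except Exception as exc:
                # One failed dimension must not hide evidence from the others.
                results[name] = {"ok": False, "error": str(exc)}
    return results
```

Recording the error text per dimension gives the report a ready-made blocking reason when the data-plane collector fails while the control-plane ones succeed.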

Step 3: Read SOPs by signal

Map signals to SOPs and read for deeper reasoning. With multiple signals, process P0 → P1 → P2 for severity, then apply Timeline and recency (MUST) in Step 4 so the narrative order matches when signals mattered in the window—not only static rule-engine print order.
| Observed signal | Read |
|---|---|
| Cluster Red/Yellow, node loss, pending tasks | `references/sop-cluster-health.md` |
| Long `activating`, unfinished change records, Red / unassigned shards | `references/sop-cluster-health.md` + `references/sop-activating-change-stuck.md` |
| High CPU, load, imbalance | `references/sop-cpu-load.md` |
| Per-node load imbalance (CPU/memory/disk/shard count) | `references/sop-node-load-imbalance.md` |
| JVM pressure, GC, circuit breaker, OOM | `references/sop-memory-gc.md` |
| Disk watermark, IO, write failures (read-only) | `references/sop-disk-storage.md` |
| Watermark misconfiguration, index blocks, “normal” disk % but write failures | `references/sop-disk-storage.md` (Section 3, watermark misconfiguration) |
| Write timeouts / rejections / latency / QPS drop | `references/sop-write-performance.md` |
| Query timeouts / rejections / slow queries | `references/sop-query-thread-pool.md` |
| Nodes look down but CPU still reported; `all shards failed` | `references/sop-service-avalanche.md` |
| Intermittent Elasticsearch timeouts + CMS CPU > 80% | `references/sop-service-avalanche.md` |
| Risky settings, Ngram issues, API anomalies | `references/sop-configuration.md` |
| Event code definitions | `references/health-events-catalog.md` |
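The table lends itself to a lookup keyed by signal, read in P0 → P1 → P2 order. The sketch below assumes hypothetical signal keys (`cluster_health`, `activating_stuck`, and so on) and a hypothetical finding shape; only the SOP paths come from the table, and not every row is reproduced.

```python
SOP_BY_SIGNAL = {
    "cluster_health": ["references/sop-cluster-health.md"],
    "activating_stuck": [
        "references/sop-cluster-health.md",
        "references/sop-activating-change-stuck.md",
    ],
    "cpu_load": ["references/sop-cpu-load.md"],
    "node_imbalance": ["references/sop-node-load-imbalance.md"],
    "memory_gc": ["references/sop-memory-gc.md"],
    "disk_storage": ["references/sop-disk-storage.md"],
    "write_performance": ["references/sop-write-performance.md"],
    "query_thread_pool": ["references/sop-query-thread-pool.md"],
    "service_avalanche": ["references/sop-service-avalanche.md"],
    "configuration": ["references/sop-configuration.md"],
}
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


def sops_to_read(findings: list) -> list:
    """Return deduplicated SOP paths, most severe finding first.

    Each finding is assumed to look like {"signal": "...", "priority": "P0"}.
    """
    ordered = sorted(findings, key=lambda f: PRIORITY_ORDER[f["priority"]])
    seen, paths = set(), []
    for finding in ordered:
        for path in SOP_BY_SIGNAL.get(finding["signal"], []):
            if path not in seen:
                seen.add(path)
                paths.append(path)
    return paths
```

This covers only the severity pass; the recency ordering required in Step 4 is a separate concern and is not encoded here.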

Step 4: Synthesize and write the structured report

Acceptance-style optional checklists: references/acceptance-criteria.md §6.1–§6.6: Red/Yellow; read-heavy CPU + `search` pool (+ CMS alignment); JVM / breakers / fielddata; write-queue vs GC + `rejected` / `completed`; read-heavy search pool vs GC-only headline (expand in sop-query-thread-pool.md, Report narrative: search pool vs GC / CPU headlines); timeline/recency. Bulk/write: references/sop-write-performance.md §2. Shard `reroute`: references/sop-node-load-imbalance.md §1.3 (allocator / change control only).
[CRITICAL] Remediation must match the diagnosed root cause; avoid generic templates. For wrong breaker or concurrency fixes (e.g. `in_flight_requests` vs `request`, “split query” when concurrency is the issue), see `sop-memory-gc.md` and the fired signal’s SOP.
`activating` + data-plane anomaly: include the one-line cross-layer root cause; see `references/sop-activating-change-stuck.md` section 4.
Report skeleton (copy/fill): references/report-template.md.

Timeline and recency (MUST for synthesized reports)

Problem: `check_es_instance_health.py` and P0/P1/P2 bands express severity, not when a signal mattered most within the analysis window. Cumulative engine counters (`search.rejected`, `write.rejected`) do not encode recency: write and search issues can both be “real” while only one path dominated the recent past (e.g. search pressure closer to window end than write pressure).
Binding rules for the agent:
  1. Two axes: Treat severity (P0/P1/P2) and temporal relevance (proximity to window end / “now”) as orthogonal. Do not infer recency from priority alone (e.g. “write is P0 so it must be the current headline”) when time-resolved evidence says otherwise.
  2. Mandatory human-facing section: When more than one major finding fires (e.g. write pool + search pool + GC/CPU), the synthesized report must include an `### Incident timeline (recency-ordered)` (or equivalent) block before or immediately after the executive summary, unless the user explicitly asks for a minimal report. In that block:
    • Order bullets or rows by time (earlier → later), or state which signal cluster peaked / persisted in the latter portion of `{begin} ~ {end}`.
    • Call out divergence: e.g. “write-path stress earlier in window; search-path / CPU more recent” when CMS or logs support it.
  3. Evidence for recency (use what exists; do not invent timestamps):
    • CloudMonitor: per-metric time series; note peak timestamp or sustained-high interval for `NodeCPUUtilization`, `NodeHeapMemoryUtilization`, GC-related metrics, `ThreadPool.*` if exposed as rates or non-cumulative series in the collected JSON.
    • Slow logs / `ListSearchLog`: correlate query vs index slow entries to minutes.
    • Engine (optional): two `_nodes/stats/thread_pool` samples at known times to show delta on `rejected` / `completed`; or `_tasks` / `hot_threads` for current skew vs historical cumulative counters.
  4. Executive summary ordering: The opening 2–4 sentences should reflect recency-weighted user impact: if search pressure is closer to current than write pressure, lead with search/query concurrency and co-stress (GC/CPU) as appropriate, and place historical write saturation as context or second wave, without dropping P0 write findings if they remain valid for remediation backlog.
  5. Explicit uncertainty: If only cumulative counters exist and no time series differentiates paths, state one line: recency is undifferentiated; recommend narrower window, slow logs, or delta sampling for the next run.
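The delta sampling mentioned in rule 3 can be sketched as a subtraction between two `_nodes/stats/thread_pool` responses taken at known times. The function name is hypothetical; the nested `nodes → <id> → thread_pool → <pool>` layout with `rejected` and `completed` counters follows the Elasticsearch nodes stats response.

```python
def thread_pool_deltas(first: dict, second: dict, pools=("search", "write")) -> dict:
    """Per-node change in rejected/completed between two samples.

    Keys are (node name, pool name). A negative delta suggests the node
    restarted between samples and its counters were reset.
    """
    deltas = {}
    for node_id, node in second.get("nodes", {}).items():
        before = first.get("nodes", {}).get(node_id)
        if before is None:
            continue  # node absent from the first sample: no baseline
        for pool in pools:
            new = node.get("thread_pool", {}).get(pool)
            old = before.get("thread_pool", {}).get(pool)
            if new is None or old is None:
                continue
            deltas[(node.get("name", node_id), pool)] = {
                "rejected": new["rejected"] - old["rejected"],
                "completed": new["completed"] - old["completed"],
            }
    return deltas
```

If the cumulative `write.rejected` is larger than `search.rejected` but only the search delta is non-zero between the two samples, the timeline should place search pressure as the more recent signal.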

6. Data collection details (CLI OpenAPI + injected input)

One-shot entry

Use the same `check_es_instance_health.py` command as §5 Step 1 (optional `--window` / `--profile`; default window 60 minutes if omitted).

Injected input mode (paired with CLI)

`check_es_instance_health.py` accepts external JSON to avoid duplicate calls:
```bash
python3 scripts/check_es_instance_health.py \
  -i <InstanceId> -r <RegionId> \
  --data-source input \
  --input-json-file /path/to/diag-input.json
```
Input JSON shape:
```json
{
  "status_info": {},
  "metrics": {},
  "events": [],
  "logs": []
}
```
`--data-source` modes:
  • `auto`: prefer injected fields; backfill gaps via Aliyun CLI.
  • `cli`: ignore injection; fetch everything via CLI.
  • `input`: injection only; no OpenAPI calls.
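Before handing a file to `--input-json-file`, it may help to confirm that the top-level shape matches the skeleton above. The checker below is a hypothetical helper, not part of the script; it assumes only the four top-level keys and their object/array types, and says nothing about what each field must contain.

```python
EXPECTED_TYPES = {"status_info": dict, "metrics": dict, "events": list, "logs": list}


def check_diag_input(payload: dict) -> dict:
    """Report keys with the wrong JSON type and keys that are absent.

    Absent keys are listed separately because `auto` mode can backfill
    them, while `input` mode cannot.
    """
    type_errors = [
        key for key, expected in EXPECTED_TYPES.items()
        if key in payload and not isinstance(payload[key], expected)
    ]
    missing = [key for key in EXPECTED_TYPES if key not in payload]
    return {"type_errors": type_errors, "missing": missing}
```

With `--data-source input`, any entry under `missing` means the rules engine will run without that evidence, which is worth stating in the report.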

Manual control-plane CLI backfill

For additional OpenAPI examples, see `references/verification-method.md`.

7. Elasticsearch direct API access (data-plane deep dive)

When feasibility holds (including reachability), execute the REST calls required by any MUST-trigger row (§5). For endpoints not listed in a fired MUST row, call them only when feasibility and necessity both hold per the strategy doc.
`ES_ENDPOINT` may be `host:port` or a full URL. For the samples below, normalize to `http://${ES_ENDPOINT#http://}` (use `https://` consistently when the cluster serves TLS, i.e. `https://${ES_ENDPOINT#https://}`).
Timeouts: every `curl` must use `--connect-timeout 10 --max-time 30`.
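The shell expansion `${ES_ENDPOINT#http://}` strips only an `http://` prefix, so an `https://` value would come out with two schemes. A scheme-aware normalizer avoids that; the sketch below is a hypothetical helper with an assumed `use_tls` flag for the bare `host:port` case.

```python
def normalize_endpoint(es_endpoint: str, use_tls: bool = False) -> str:
    """Return a base URL with exactly one scheme and no trailing slash."""
    value = es_endpoint.strip().rstrip("/")
    for prefix in ("http://", "https://"):
        if value.lower().startswith(prefix):
            return value  # a full URL already states its scheme; keep it
    return ("https://" if use_tls else "http://") + value
```

For example, `normalize_endpoint("host:9200")` yields `http://host:9200`, while an input that already starts with `https://` is returned unchanged apart from the trailing slash.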
可行性满足(包括可达性)时,执行必选触发行要求的REST调用(第5节)。对于在触发的必选行中列出的端点,仅当可行性必要性同时满足策略文档要求时才调用。
ES_ENDPOINT
可以是
host:port
或完整URL。对于以下示例,统一格式为
http://${ES_ENDPOINT#http://}
(当集群支持TLS时,始终使用
https://
)。
超时设置:每个
curl
命令必须使用
--connect-timeout 10 --max-time 30

Red / Yellow (MUST) — recommended set

Scope: The cluster-health MUST row uses `ClusterStatus` max ≥ Yellow (includes Red). Use this set for unassigned / misallocated shard root cause on the engine.
```bash
# Overall status and unassigned shard count
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cluster/health?pretty"

# Why a shard is unassigned
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  -H "Content-Type: application/json" \
  -X POST "http://${ES_ENDPOINT#http://}/_cluster/allocation/explain?pretty" \
  -d '{}'

# Per-shard state and unassigned reason
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state"

# Queued cluster-state updates
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cluster/pending_tasks?pretty"

# Thread-pool queue, rejected, and completed counters
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/stats/thread_pool?pretty"
```

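To turn the `_cat/shards` output from the Red / Yellow set into report evidence, the rows can be grouped by unassigned reason. This sketch assumes the call is made with `format=json` instead of the `v` text layout shown in the samples, so each row arrives as a dict keyed by the requested column names; the function name is hypothetical.

```python
from collections import Counter


def summarize_unassigned(rows: list) -> dict:
    """Count unassigned shards by reason and list the affected indices."""
    unassigned = [row for row in rows if row.get("state") == "UNASSIGNED"]
    reasons = Counter(row.get("unassigned.reason") or "unknown" for row in unassigned)
    return {
        "total": len(unassigned),
        # An unassigned primary makes the cluster Red; replicas alone give Yellow.
        "primaries": sum(1 for row in unassigned if row.get("prirep") == "p"),
        "by_reason": dict(reasons),
        "indices": sorted({row["index"] for row in unassigned}),
    }
```

The `by_reason` counts pair naturally with the `allocation/explain` response when writing the evidence chain.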
Query / write performance (MUST) — recommended set

Include `_cluster/settings` when heap / GC / breaker rules fired in Step 1 or `_nodes/stats/breaker` shows concern; read transient and persistent `indices.breaker.*` / `network.breaker.*`.
```bash
# Busiest threads per node
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/hot_threads?threads=3"

# Circuit-breaker limits, estimates, and trip counts
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/stats/breaker?pretty"

# Transient, persistent, and default cluster settings
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cluster/settings?include_defaults=true&pretty"
```
`/_cluster/pending_tasks` and `GET /_nodes/stats/thread_pool` are also listed under Red / Yellow (MUST) above; one call each per session is enough when both sections apply. If you run only this performance block, add those two `curl` lines from the Red / Yellow block.
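When `_nodes/stats/breaker` is collected, a short pass over the response can show which breaker is actually under pressure, which helps avoid the `in_flight_requests` vs `request` mix-up noted in Step 4. The function name and the 0.8 ratio threshold are assumptions for illustration; the `breakers` layout with `limit_size_in_bytes`, `estimated_size_in_bytes`, and `tripped` follows the nodes stats response.

```python
def breaker_findings(stats: dict, ratio_threshold: float = 0.8) -> list:
    """List breakers that have tripped or sit close to their limit."""
    findings = []
    for node_id, node in stats.get("nodes", {}).items():
        for name, breaker in node.get("breakers", {}).items():
            limit = breaker.get("limit_size_in_bytes", 0)
            used = breaker.get("estimated_size_in_bytes", 0)
            ratio = used / limit if limit > 0 else 0.0
            tripped = breaker.get("tripped", 0)
            if tripped > 0 or ratio >= ratio_threshold:
                findings.append({
                    "node": node.get("name", node_id),
                    "breaker": name,
                    "tripped": tripped,  # cumulative since the node started
                    "ratio": round(ratio, 2),
                })
    return findings
```

Because `tripped` is cumulative, a non-zero value alone does not show that the breaker fired inside the analysis window; a second sample is needed for that.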

Resource anomalies without a closed loop (SHOULD) — recommended set

```bash
# Nodes sorted by CPU, with heap, RAM, and 1-minute load
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cat/nodes?v&s=cpu:desc&h=name,ip,cpu,heap.percent,ram.percent,load_1m"

# JVM heap and GC statistics
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/stats/jvm?pretty"

# Shard count and disk usage per node
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cat/allocation?v&bytes=gb"
```
`GET /_cluster/settings?include_defaults=true` also appears under Query / write performance (MUST) above; reuse one response when both blocks apply. If you run only this SHOULD block, add the same `curl` line from that block.
Protocol sanity (avoid `WRONG_VERSION_NUMBER`): usually an http/https scheme mismatch on `ES_ENDPOINT`; fix scheme/port and retry.
Scenario → endpoint index: references/es-api-catalog.md.

8. Diagnostic coverage

The knowledge base covers 48+ health-event-style rules and chained scenarios (e.g. disk pressure → allocation → Red). Per-category counts, P0/P1/P2 mix, and event codes: references/health-events-catalog.md. Scenario runbooks: `references/sop-*.md` (index: references/README.md).

9. Best practices

Read-only: no mutating control-plane APIs; no teardown.
  1. Layered + evidence-bound: scan → SOP depth; every conclusion cites metrics/logs/events; if ES is unreachable, state limits (es-api-call-failures.md).
  2. Priority vs narrative: P0→P2 for urgency; Incident timeline when multiple dimensions differ in time (Step 4). Credentials / TLS / parameters: §1–2 and §4.
  3. Green is not “all clear” — watermarks, blocks, mis-set limits still matter; MUST + reachable ES: do not skip §5/§7 evidence because the cluster is Green or OpenAPI “explains” symptoms.
  4. Thread-pool `rejected`: cumulative unless you show a delta; see sop-query-thread-pool.md §1–2; write/bulk: sop-write-performance.md §2.

10. Reference links

  • `references/verification-method.md`: Verification (how to validate diagnosis; metrics, APIs, workflows)
  • `references/report-template.md`: Structured diagnosis report skeleton
  • `references/README.md`: Language map (reference assets and `sop-*.md` runbooks; English in this repo)
  • `references/ram-policies.md`: RAM policy list
  • `references/acceptance-criteria.md`: Correct/incorrect patterns and acceptance (includes credential and safety anti-patterns)
  • `references/cli-installation-guide.md`: Aliyun CLI installation
  • `references/es-api-catalog.md`: Elasticsearch REST API catalog
  • `references/health-events-catalog.md`: Health event catalog
  • `references/sop-*.md`: Scenario SOPs (e.g. `sop-activating-change-stuck.md` for `activating` / change stuck, cross-layer root cause)
  • `references/es-api-diagnosis-strategy.md`: Elasticsearch API diagnosis strategy