alibabacloud-elasticsearch-instance-diagnose
Alibaba Cloud Elasticsearch Instance Diagnosis
Collect signals from Alibaba Cloud OpenAPI (control plane) and the Elasticsearch REST API (data plane), combine them with the SOP knowledge base under `references/`, and produce root-cause analysis, an evidence chain, prioritized remediation guidance, and, when multiple dimensions fire, a recency-ordered incident timeline (severity vs time in window; see Timeline and recency (MUST) in §5 Step 4).
Architecture: Alibaba Cloud Elasticsearch OpenAPI + Alibaba CloudMonitor (CMS) + Elasticsearch REST API + diagnostic SOPs
Closure: If MUST applies and `ES_*` is set, finish authenticated ES API evidence before the final report (see Feasibility order in §5).
1. Prerequisites
1.1 Aliyun CLI
Pre-check: Aliyun CLI >= 3.3.1 required (for RAM permission checks and OpenAPI CLI fallback). Run `aliyun version` to verify the version is >= 3.3.1. If the CLI is missing or too old, see `references/cli-installation-guide.md`. After installation, run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation (do not pass plaintext AccessKey pairs on this command line; see §1.2).
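The version gate can be checked numerically rather than by eye. The following is a minimal sketch, assuming the version string has already been captured (for example from `aliyun version`) and consists of plain dotted integers; `parse_version` and `meets_minimum` are hypothetical helper names, not part of this repo.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '3.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str = "3.3.1") -> bool:
    """Return True when the installed version is >= the minimum.

    Tuples compare numerically per component, so '3.10.0' is
    correctly treated as newer than '3.3.1'.
    """
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("3.3.1", "3.10.0", "3.2.9"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the pitfall of comparing the raw strings, where `"3.10.0"` would sort before `"3.3.1"`.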
1.2 Alibaba Cloud account authentication and security (MUST)
Security rules (mandatory):
- NEVER read, echo, or print AccessKey ID or AccessKey Secret values.
- NEVER prompt or ask the user to paste plaintext AccessKeys in the conversation.
- NEVER embed AccessKeys in scripts, CLI arguments, or `curl` URLs.
- NEVER use `aliyun configure set` (or similar) to pass literal AccessKey ID/Secret on the command line.
- NEVER accept AccessKeys that the user pastes into the chat, even if offered voluntarily.
- ONLY use configured CLI profiles (`aliyun configure`) or environment variables such as `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` that the user has set in their local shell (the agent must not echo those values in the session).

⚠️ If the user provides AccessKeys in the chat (e.g. “my AK is xxx”)
- Stop immediately: do not run any Alibaba Cloud command that requires credentials.
- Decline politely and give only the names of approved configuration methods (do not repeat any secret the user may have leaked):
  - Recommended: run `aliyun configure` in a local terminal and enter credentials when prompted; credentials are stored in the local profile file.
  - Alternatively: set `ALIBABA_CLOUD_ACCESS_KEY_ID` / `ALIBABA_CLOUD_ACCESS_KEY_SECRET` in the local shell (the user types values only in the terminal, not in chat).
- Resume the diagnosis request only after credentials are configured correctly.

Verify credentials without exposing secrets:
```bash
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
```
Credential policy:
- Prefer an `aliyun configure` profile (default or `--profile`).
- If there is no valid identity (`configure list` / `get-caller-identity` fails), STOP and guide the user to configure locally; do not guess or fabricate credentials.
- Never pass plaintext AccessKeys through the conversation.
1.3 Elasticsearch direct-connect credential boundary
- NEVER ask the user to paste `ES_PASSWORD` in chat; NEVER echo, print, or log the password; NEVER copy a password from chat into commands, hooks, or repo files.
- Shell expansion for `curl -u "$ES_USERNAME:$ES_PASSWORD"` (or equivalent) is allowed when vars are pre-exported in the user’s local shell; NEVER put the secret as a literal in chat, scripts checked into repos, or command output.
- If the user tries to send a password in chat: STOP as well and ask them to set `ES_PASSWORD` only locally via `export` (see §2.2).
2. Environment setup
2.1 Control plane OpenAPI (via Aliyun CLI)
All control-plane and CMS data collection for this skill uses the Aliyun CLI.
[MUST] `elasticsearch` / `cms`: plugin-mode shell only (avoid legacy CLI)
Whenever the agent emits executable `aliyun` lines (chat, reproducibility exports, or copy-paste steps), use plugin subcommands (lowercase-hyphenated) and kebab-case flags, the same shape as `scripts/openapi_cli_collect.py` and references/verification-method.md.
- Do not use legacy POP-style invocations: a PascalCase verb immediately after `elasticsearch` or `cms` on the same `aliyun` line (the old “action name = subcommand” style), or CamelCase flags like `--InstanceId`, `--Namespace`, `--StartTime` in new commands. Use plugin verbs only (`describe-instance`, `describe-metric-list`, …).
- Naming split: `DescribeInstance`, `ListSearchLog`, `DescribeMetricList`, etc. are OpenAPI action names (PascalCase; used in docs, RAM, console). The token after `aliyun elasticsearch` or `aliyun cms` in a shell must be the CLI plugin name (`describe-instance`, `list-search-log`, `describe-metric-list`, …).
- Prefer `python3 scripts/check_es_instance_health.py` for the standard control-plane + CMS bundle so subprocess calls stay aligned with this repo.
- CLI references: Elasticsearch CLI Center, CloudMonitor CLI Center.
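The naming split between OpenAPI action names and plugin verbs is mechanical enough to lint. Below is a minimal sketch, assuming commands are available as plain strings; `to_plugin_verb` and `is_legacy_shape` are hypothetical helpers (not the repo's static check), and the instance ID and flag values in the usage lines are placeholders.

```python
import re

# A PascalCase verb immediately after `elasticsearch` or `cms`.
_LEGACY_VERB = re.compile(r"\b(?:elasticsearch|cms)\s+[A-Z][A-Za-z]+")
# A CamelCase flag such as --InstanceId (preceded by whitespace or line start).
_CAMEL_FLAG = re.compile(r"(?<!\S)--[A-Z]")


def to_plugin_verb(action: str) -> str:
    """Convert a PascalCase action name to lowercase-hyphenated form.

    Assumes simple PascalCase words; acronyms (e.g. 'ACL') would be
    split letter by letter and need special handling.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "-", action).lower()


def is_legacy_shape(command: str) -> bool:
    """Return True when a command line uses the legacy POP-style shape."""
    return bool(_LEGACY_VERB.search(command) or _CAMEL_FLAG.search(command))


if __name__ == "__main__":
    print(to_plugin_verb("ListSearchLog"))
    print(is_legacy_shape("aliyun elasticsearch DescribeInstance --InstanceId es-cn-example"))
    print(is_legacy_shape("aliyun elasticsearch describe-instance --instance-id es-cn-example"))
```

A check like this is only a shape test: it does not confirm that a given verb or flag actually exists in the installed plugin.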
AI-Mode and plugin baseline (required): wrap every diagnosis session that runs `aliyun` OpenAPI/CMS commands:
```bash
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
aliyun plugin update
# … diagnosis: aliyun / python3 scripts/check_es_instance_health.py …
aliyun configure ai-mode disable
```
> **`configure ai-mode` missing or failing:** Skip the wrapper above; use **`ALIBABA_CLOUD_USER_AGENT`** (next block). Log the CLI failure (e.g. subcommand unavailable). Whether the profile is **valid** is determined only by **`aliyun configure list`** and **`sts get-caller-identity`**; write **valid** / **validity**, not *vaild*.

**User-Agent (required)**: set a User-Agent for Alibaba Cloud API calls:
```bash
export ALIBABA_CLOUD_USER_AGENT="AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose"
```
CLI hardening (recommended): when authoring raw `aliyun` commands, use §2.1 MUST plugin shape first, then add `--connect-timeout 3 --read-timeout 10` (increase `read-timeout` for large responses or CMS), consistent with the instance-management skill examples, to avoid indefinite hangs on network faults. If the global User-Agent is not set, add `--user-agent AlibabaCloud-Agent-Skills/alibabacloud-elasticsearch-instance-diagnose` per invocation. For optional Elasticsearch probes inside `check_es_instance_health.py` (when `ES_*` is set), the same knobs exist as `--connect-timeout` / `--read-timeout` on that script; they map to `curl` for engine calls only, not to the Aliyun OpenAPI client.
Run before diagnosis:
```bash
aliyun version
aliyun configure list
aliyun --profile <profile_name> sts get-caller-identity
```
2.2 Elasticsearch API direct access (`curl`)
Have the user set connection variables in a local terminal after you confirm the Elasticsearch endpoint (VPC or public) and admin credentials; do not hardcode user-specific values in chat:
```bash
export ES_ENDPOINT="http://<elasticsearch-endpoint-ip>:9200"
export ES_USERNAME="elastic"
export ES_PASSWORD="<elasticsearch-admin-password>"
```
Public access and `http` vs `https`: From `DescribeInstance`, use `publicDomain` / `domain` and the reported `protocol`. When `protocol` is `HTTP` (typical public listener), set `ES_ENDPOINT` to `http://<publicDomain>:9200`. Using `https://` against an HTTP-only endpoint causes TLS errors (e.g. `WRONG_VERSION_NUMBER`). Use `https://` only when `protocol` is `HTTPS` (or TLS is actually enabled on the port you use), and supply CA / fingerprint options as in HTTPS options below.
If `http://` “does not work” (when to try `https://`): Treat `DescribeInstance` `protocol` as the source of truth for the REST listener. `000`, timeouts, or connection refused on `http://` usually mean network path / allowlist / security group / wrong host or port, not “try HTTPS next” when `protocol` is still `HTTP`. Do switch to `https://` when `protocol` is `HTTPS` (or the console / product doc states TLS on that endpoint) and the failure on `http://` is a TLS or scheme symptom (e.g. `WRONG_VERSION_NUMBER`, `error:0A00010B`, immediate SSL alert while probing with the wrong scheme). If `protocol` is `HTTP` and only plain TCP is advertised, HTTPS is not a fallback for reachability.
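The scheme rule above reduces to two small decisions. The sketch below is illustrative only, assuming the `protocol` value and the host have already been read from a `DescribeInstance` response and that the client error text is available as a string; `build_es_endpoint`, `should_switch_to_https`, and the example host are hypothetical.

```python
# Error fragments the text above names as TLS/scheme symptoms.
TLS_SYMPTOMS = ("WRONG_VERSION_NUMBER", "error:0A00010B")


def build_es_endpoint(protocol: str, host: str, port: int = 9200) -> str:
    """Build ES_ENDPOINT from the reported protocol, never from a guess."""
    scheme = protocol.strip().lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unexpected protocol: {protocol!r}")
    return f"{scheme}://{host}:{port}"


def should_switch_to_https(protocol: str, error_text: str) -> bool:
    """Switch only when the listener is reported as HTTPS and the
    failure looks like a TLS/scheme symptom, not a reachability one."""
    reported_https = protocol.strip().upper() == "HTTPS"
    tls_symptom = any(marker in error_text for marker in TLS_SYMPTOMS)
    return reported_https and tls_symptom


if __name__ == "__main__":
    print(build_es_endpoint("HTTP", "es-cn-example.example.com"))
    print(should_switch_to_https("HTTP", "curl: (28) Connection timed out"))
```

Timeouts and refused connections deliberately return `False` here: they point at the network path, so changing the scheme would not help.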
Credential safety
- NEVER echo, print, or log `ES_PASSWORD`; NEVER copy credentials from chat into shell history or saved files.
- NEVER ask the user to paste the password in plaintext in chat.
- ONLY use the following checks to verify that variables are set:
```bash
[[ -n "$ES_ENDPOINT" ]] && echo "ES_ENDPOINT: $ES_ENDPOINT" || echo "ES_ENDPOINT: NOT SET"
[[ -n "$ES_PASSWORD" ]] && echo "ES_PASSWORD: SET" || echo "ES_PASSWORD: NOT SET"
```
Network connectivity and access control

| Issue | How to check | Mitigation |
|---|---|---|
| Public network access disabled | Elasticsearch console → Network | Enable public access or use the VPC endpoint |
| Public access allowlist | Console → Security → Public access allowlist | Add the agent host’s public IP |
| VPC isolation | e.g. `telnet <ES_IP> 9200` | VPC peering, Express Connect, or equivalent |
| Security group | Inbound rules on the ECS/security group hosting Elasticsearch | Allow TCP 9200 (or the configured port) |

Connectivity probe: `curl -sS -o /dev/null -w "%{http_code}" --connect-timeout 5 "${ES_ENDPOINT}"`. HTTP code `000` usually means the path is unreachable. `401` without `-u` is normal (auth required); if `ES_PASSWORD` is SET, proceed to authenticated `GET /_cluster/health` (§7). `401` with `-u` → wrong credentials. `000` / refused / timeout → network, allowlist, or TLS/scheme mismatch.
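The probe outcomes can be read as a small decision table. The following sketch assumes the HTTP code printed by the probe is passed in as a string, together with whether `-u` was used and whether `ES_PASSWORD` is set; `classify_probe` is a hypothetical name and the returned labels are illustrative wording, not output of any repo script.

```python
def classify_probe(http_code: str, used_auth: bool, password_set: bool) -> str:
    """Map a connectivity-probe result to the next diagnostic step."""
    if http_code == "000":
        # No HTTP response at all: the request never completed.
        return "unreachable: check network path, allowlist, or TLS/scheme mismatch"
    if http_code == "401":
        if used_auth:
            return "wrong credentials"
        if password_set:
            return "auth required: proceed to authenticated GET /_cluster/health"
        return "auth required: ES_PASSWORD is not set"
    return f"reachable (HTTP {http_code})"


if __name__ == "__main__":
    print(classify_probe("401", used_auth=False, password_set=True))
    print(classify_probe("000", used_auth=True, password_set=True))
```

The key distinction is that a `401` on an unauthenticated probe is not a failure; only a `401` returned while `-u` was supplied indicates bad credentials.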
HTTPS: prerequisites (what must be true)
- Listener: The Elasticsearch HTTP port you call (9200 unless changed) must actually speak TLS; align with `DescribeInstance` `protocol` (`HTTPS`) or console/network documentation.
- URL: `https://<host>:<port>` with the same host (e.g. `publicDomain`) you would use for HTTP.
- Client trust of the server certificate: Your client must trust the cluster’s certificate chain (cluster / cloud CA PEM, or corporate proxy CA if TLS is intercepted). `curl`: prefer `curl --cacert /path/to/ca.crt ...`; `-k` / `--insecure` only for short, non-production diagnosis.
- Auth: Same `ES_USERNAME` / `ES_PASSWORD` as for HTTP (Basic auth over TLS).

HTTPS: how this skill documents it
- Manual `curl` (§7 and es-api-call-failures.md): Add `--cacert` (or `-k` for testing) to every `curl` when using `https://` if the default trust store does not include your cluster CA.
- `check_es_instance_health.py` optional ES probes: They invoke `curl` with `-u` only; they do not read `ES_CA_CERTS` / `ES_SSL_FINGERPRINT` / `ES_VERIFY_CERTS` (those names are common for Python Elasticsearch clients). For HTTPS instances, use §7 `curl` with `--cacert` for deep checks, or extend the script later to pass `--cacert` from an env var.
- Python-style env vars (reference for other tooling): `ES_CA_CERTS`, `ES_SSL_FINGERPRINT`, `ES_VERIFY_CERTS=false` (testing only); these are not wired into this repo’s optional `curl` path today.
3. RAM permission check
[MUST] RAM permission pre-check: Before running this skill, verify the principal has the required RAM permissions. See `references/ram-policies.md` for the full list. If the user reports insufficient permissions, direct them to attach the corresponding policies in the RAM console.
4. Parameter confirmation
IMPORTANT: Parameter confirmation. Confirm the following with the user before any command or API call. Do not assume undeclared defaults or hardcode user-specific parameters.
Boundary controls (MUST)
- Region and `instance-id` must not be guessed or taken from unverified defaults; if they disagree with `DescribeInstance` or the user’s explicit statement, reconfirm.
- Do not apply metrics, logs, or `DescribeInstance` conclusions from instance A to instance B; `ES_ENDPOINT` must match the instance under diagnosis (see Pre-flight validation for Elasticsearch API below).
- This skill is read-only diagnosis: do not invoke mutating control-plane APIs (create, resize, restart, delete instance, etc.). If the user requests a change, provide recommendations only; execution belongs in the console or an approved change workflow.

| Parameter | Required | Description | Default |
|---|---|---|---|
| `-i <InstanceId>` | Yes | Elasticsearch instance ID | - |
| `-r <RegionId>` | Yes | Region ID | - |
| `--profile <profile_name>` | No | Aliyun CLI profile (explicit `--profile` recommended) | |
| `ES_ENDPOINT` | No | Elasticsearch endpoint (direct API access only) | - |
| `ES_PASSWORD` | No | Elasticsearch admin password (direct API access only) | - |
| `--window` | No | Window in minutes | 60 |
| `--connect-timeout` / `--read-timeout` | No | Timeouts for the optional Elasticsearch probes (passed to `curl`) | 5 / 10 |
5. End-to-end diagnostic workflow
Agent hard rules (non-negotiable)
Aliyun CLI shape: For `aliyun elasticsearch` and `aliyun cms`, follow §2.1 MUST (plugin mode only) in every new executable command. Do not resurrect legacy `DescribeInstance` / `ListSearchLog`-as-subcommand lines or `--InstanceId`-style flags in session exports or user-facing step lists (they drift from `openapi_cli_collect.py` and fail static checks).
OpenAPI/CMS cannot replace MUST engine APIs. For any §5 MUST table row or `check_es_instance_health.py` rule-engine MUST, Alibaba Cloud OpenAPI and CloudMonitor do not replace the listed Elasticsearch REST calls for engine-level root cause. When feasibility holds, run those `curl` endpoints (see §7); they are complementary layers, not interchangeable.
Feasibility is decided only by checks, not by assumption. Whether the agent may call Elasticsearch must be determined by actually running the Feasibility order (§5): at minimum verify `ES_ENDPOINT` / `ES_PASSWORD` per §2.2, align `ES_ENDPOINT` with `DescribeInstance`, then run authenticated `GET /_cluster/health`. Do not assume `ES_*` is unset or the path is unreachable without performing these steps in the session.
For Elasticsearch incidents, follow these four steps; each has a distinct role.
Execution strategy (root-cause driven)
Full policy: es-api-diagnosis-strategy.md
Data-plane `curl` collection requires both:
- Feasibility: `ES_ENDPOINT` and `ES_PASSWORD` are set and the network path works.
- Necessity: root-cause analysis needs data-plane evidence that the control plane or CMS cannot establish alone.

For endpoints listed under a fired MUST table row or rule-engine MUST, necessity for those calls is already satisfied by the trigger; still require feasibility (Feasibility order). For optional engine `curl` not in those lists, apply feasibility and necessity per es-api-diagnosis-strategy.md.
MUST triggers (if any CMS condition below holds, collect the listed Elasticsearch evidence):

| Trigger | Scenario | Required Elasticsearch evidence |
|---|---|---|
| | Cluster health | |
| | CPU overload | |
| | Memory pressure | |
| Thread pool | Performance | |
| Inter-node resource CV > 0.3 | Load imbalance | |
| Write failures or index read-only | Disk / watermark / blocks | |
| Intermittent Elasticsearch API timeouts + CMS CPU > 80% | Possible cascading failure | |

Thread-pool row: interpret search vs write / bulk using sop-query-thread-pool.md vs sop-write-performance.md (see also Write-path / bulk saturation below).
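For the “Inter-node resource CV > 0.3” trigger, the coefficient of variation (CV) is the standard deviation divided by the mean of a per-node metric. The sketch below is a worked illustration, assuming the population standard deviation and one sample per node; the rules engine may compute it differently, and `coefficient_of_variation` / `is_imbalanced` are hypothetical names.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values) -> float:
    """Return stdev / mean for a non-empty list of per-node values.

    Assumption: population standard deviation (pstdev). A zero mean
    is reported as 0.0 to avoid dividing by zero.
    """
    avg = mean(values)
    if avg == 0:
        return 0.0
    return pstdev(values) / avg


def is_imbalanced(values, threshold: float = 0.3) -> bool:
    """True when the spread across nodes exceeds the trigger threshold."""
    return coefficient_of_variation(values) > threshold


if __name__ == "__main__":
    # Hypothetical per-node CPU percentages.
    print(is_imbalanced([48, 50, 52]))
    print(is_imbalanced([20, 30, 90]))
```

Because CV is relative to the mean, the same absolute gap between nodes matters more on a lightly loaded cluster than on a busy one.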
Rule-engine MUST: If `check_es_instance_health.py` prints a §5 MUST / §5–§7 callout for this run, treat it like a row above and collect that listed ES evidence when feasibility holds.
Binding rule (MUST triggers): If any MUST-trigger row or the rule-engine MUST line above applies, necessity is satisfied for that evidence set. OpenAPI/CMS cannot replace those calls for engine-level root cause (cluster-health: `allocation/explain` + `_cat/shards` for Yellow/Red). Confirm feasibility per Feasibility order below. If reachable with auth, run the MUST-listed endpoints in Step 2 in parallel with control-plane collection. If still blocked after authenticated `GET /_cluster/health`, lead with blocking reason: unset `ES_*`; transport failure (`000`, refused, timeout); 401 with `-u`; scheme/TLS mismatch. A 401 on an unauthenticated probe when `ES_PASSWORD` is SET is not a blocking reason.
Write-path / bulk saturation
If `ThreadPool.WriteRejected` or `write` pool stress matches high-QPS bulk indexing, read and follow `references/sop-write-performance.md`, §2, subsection “Evidence interpretation: bulk QPS → write pool” for the evidence chain, `rejected` semantics (cumulative since node start), report ordering vs Old GC / heap (causal chain or dual P0, with the write path before a JVM-only headline), per-node `rejected` / `completed` numbers (reject share), per-node asymmetry, and write-only vs search. Do not lead with a JVM-only narrative when that subsection applies. For write-queue–style acceptance prompts, the opening conclusion should read as write-capacity (data-plane counters + optional CMS rule names), not only a GC/heap headline.
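Per-node reject share can be derived from the thread-pool counters in a `_nodes/stats/thread_pool` response. This is a minimal sketch, assuming the response has been parsed into a dict and that reject share means `rejected / (rejected + completed)`; that formula is an assumption here, so check `references/sop-write-performance.md` for the definition the SOP uses. `reject_shares` and the sample data are hypothetical.

```python
def reject_shares(stats: dict, pool: str = "write") -> dict:
    """Return {node name: reject share} for one thread pool.

    Counters are cumulative since node start, so the share describes
    the whole uptime, not the current window.
    """
    shares = {}
    for node_id, node in stats.get("nodes", {}).items():
        counters = node.get("thread_pool", {}).get(pool, {})
        rejected = counters.get("rejected", 0)
        completed = counters.get("completed", 0)
        total = rejected + completed
        shares[node.get("name", node_id)] = rejected / total if total else 0.0
    return shares


if __name__ == "__main__":
    sample = {  # hypothetical, trimmed response
        "nodes": {
            "a": {"name": "node-1", "thread_pool": {"write": {"rejected": 50, "completed": 950}}},
            "b": {"name": "node-2", "thread_pool": {"write": {"rejected": 0, "completed": 1000}}},
        }
    }
    print(reject_shares(sample))
```

Passing `pool="search"` gives the same view for the search pool, which makes per-node asymmetry between the two pools easy to compare.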
Search-primary vs write (both pools show cumulative `rejected`)
When `_nodes/stats/thread_pool` shows `search.rejected` ≫ `write.rejected` on the same node(s) and `ThreadPool.SearchRejected` / query-driven overload applies, lead the executive summary and P0 ordering with `search` (high concurrent query / terms / slow query; hot index when verified), not `write` first. `write.rejected` may remain P0/P1 as parallel or secondary (bulk, catch-up); Old GC / CPU / node disconnect stay co-stress or cascade. Checker listing order is not proof of narrative order; see acceptance-criteria.md §6.5 and sop-query-thread-pool.md Report narrative.
Recency overrides this magnitude default when time-resolved evidence exists: do not rank the opening story by `search.rejected` vs `write.rejected` alone, because cumulative counters lack timestamps. Full rubric: acceptance-criteria.md §6.5 (P0 / executive order vs `search` ≫ `write`: unless write dominated by time) and §6.6 (Executive order, No false recency from counters). Binding: Timeline and recency (MUST) below (same skill).
`activating` / change workflow stuck (cross-layer root cause)
When an instance stays in `activating`, a change is unfinished, and Red or unassigned shards coexist, follow `references/sop-activating-change-stuck.md` end-to-end (MUST includes `ListActionRecords`, `DescribeInstance` before/after remediation, collection order section 3.1, reporting section 4).
Pre-flight validation for Elasticsearch API
[IMPORTANT] `ES_ENDPOINT` must match the diagnosed instance. Compare `publicDomain` / `domain` and `protocol` from `DescribeInstance` with `ES_ENDPOINT`. If they differ, warn: `⚠️ ES_ENDPOINT does not match the current instance; run export ES_ENDPOINT="http://{publicDomain}:9200"` when `protocol` is `HTTP`, or `https://…` only when `protocol` is `HTTPS` (adjust host/port to match the deployment).
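The pre-flight comparison can be expressed as a scheme check plus a host check. The sketch below assumes `publicDomain`, `domain`, and `protocol` have already been extracted from the `DescribeInstance` response; `endpoint_matches` and the example host names are hypothetical.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def endpoint_matches(es_endpoint: str, protocol: str,
                     hosts: Iterable[Optional[str]]) -> bool:
    """True when ES_ENDPOINT uses the reported scheme and one of the
    instance's hosts (publicDomain or domain; None entries are skipped)."""
    parts = urlsplit(es_endpoint)
    scheme_ok = parts.scheme == protocol.strip().lower()
    known_hosts = {h.lower() for h in hosts if h}
    # urlsplit lowercases hostname; it is None when the URL has no scheme.
    host_ok = parts.hostname in known_hosts
    return scheme_ok and host_ok


if __name__ == "__main__":
    hosts = ["es-cn-example.public.example.com", "es-cn-example.example.com"]
    print(endpoint_matches("http://es-cn-example.public.example.com:9200", "HTTP", hosts))
    print(endpoint_matches("http://other-instance.example.com:9200", "HTTP", hosts))
```

A mismatch should lead to the warning above rather than to silent correction, since the user may simply have exported the endpoint of a different instance.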
When Elasticsearch credentials are missing or connections fail
[CRITICAL] Guide the user to fix connectivity explicitly; classify failure modes (do not default persistent timeouts to “allowlist only”). Do not imply the agent “forgot” Elasticsearch. If the first answer is CMS/OpenAPI-heavy, give the blocking reason per Feasibility order below: unset `ES_*`; transport errors; 401 with valid `-u`; TLS/scheme. Do not cite a 401 on a probe without `-u` when `ES_PASSWORD` is SET (use authenticated `curl` first).
Progressive playbook (read in order): references/es-api-call-failures.md (sections 1 → 4).
MUST / strategy context: references/es-api-diagnosis-strategy.md (sections 1–3 and 3.5 summary table).
Mandatory warning when MUST applies but Elasticsearch is not configured
[CRITICAL] If a MUST trigger fires but data-plane evidence is missing, put a warning at the top of the report: follow section 4 of references/es-api-call-failures.md (blocking reason first, then MUST list, missing evidence; if `ES_*` is unset, point to section 2.2 of this SKILL; if vars are set, use es-api-call-failures sections 1–2 for auth vs transport).
Step 1: Quick health scan (initial signals)
Run the lightweight rules engine (17 metric rules) to list P0 / P1 / P2 findings and steer deeper collection:
```bash
python3 scripts/check_es_instance_health.py -i <InstanceId> -r <RegionId> [--window <minutes, default 60>] [--profile <profile_name>]
```
Feasibility order (agent)
1. Run §2.2 `ES_*` checks (password = SET only). Do not skip; never infer feasibility without this step.
2. `ES_ENDPOINT` matches `DescribeInstance` `domain` / `publicDomain` (scheme/port).
3. Authenticated `GET /_cluster/health`. Do not stop at 401 on an unauthenticated probe if `ES_PASSWORD` is SET.
4. MUST scope: table rows and/or rule-engine MUST line in §5.
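The feasibility order can be read as a short chain that stops at the first blocker. This is a minimal sketch, assuming each check has already been run and its result is passed in; `first_blocker` is a hypothetical helper and the returned strings mirror the blocking reasons named in this section.

```python
from typing import Optional


def first_blocker(endpoint_set: bool, password_set: bool,
                  endpoint_matches: bool, auth_health_code: str) -> Optional[str]:
    """Return the first blocking reason in feasibility order, or None.

    auth_health_code is the HTTP code from the authenticated
    GET /_cluster/health call ('000' when no response arrived).
    """
    if not (endpoint_set and password_set):
        return "ES_* unset"
    if not endpoint_matches:
        return "ES_ENDPOINT does not match DescribeInstance"
    if auth_health_code == "000":
        return "transport failure (000, refused, timeout)"
    if auth_health_code == "401":
        return "401 with -u"
    if auth_health_code != "200":
        return f"unexpected HTTP {auth_health_code}"
    return None  # feasible: MUST-listed endpoints can be collected


if __name__ == "__main__":
    print(first_blocker(True, True, True, "200"))
    print(first_blocker(True, False, True, "200"))
```

Only an authenticated result feeds the last step, so a `401` from an unauthenticated probe never appears as a blocker here.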
Step 2: Collect evidence in parallel
Based on Step 1, run collection in parallel (prioritize dimensions with signals).
If a MUST-trigger row or rule-engine MUST applies: run Feasibility order, then run that Required Elasticsearch evidence via `curl` in the same round (see §7). If no MUST applies, add optional data-plane `curl` only when feasibility and necessity both hold per the strategy doc.
Re-run `check_es_instance_health.py` with the same invocation pattern as Step 1; for this parallel round, `--window 120` and explicit `--profile <profile_name>` are common.
To backfill control-plane evidence (`DescribeInstance`, `ListSearchLog`, CMS-style calls), use `aliyun` patterns in references/verification-method.md (epoch times, profiles, namespaces).
Note: data-plane access still requires `ES_ENDPOINT` / `ES_PASSWORD`; the Aliyun CLI cannot replace `curl` to the cluster.
For MUST-trigger rows, necessity for the listed endpoints is already established, so do not skip them when feasibility including reachability holds. Outside those rows, avoid unrelated bulk `curl` solely because `ES_*` is set; use the strategy doc’s feasibility + necessity test instead.
Step 3: Read SOPs by signal
Map signals to SOPs and read for deeper reasoning. With multiple signals, process P0 → P1 → P2 for severity, then apply Timeline and recency (MUST) in Step 4 so the narrative order matches when signals mattered in the window—not only static rule-engine print order.
| Observed signal | Read |
|---|---|
| Cluster Red/Yellow, node loss, pending tasks | |
| Long | |
| High CPU, load, imbalance | |
| Per-node load imbalance (CPU/memory/disk/shard count) | |
| JVM pressure, GC, circuit breaker, OOM | |
| Disk watermark, IO, write failures (read-only) | |
| Watermark misconfiguration, index blocks, “normal” disk % but write failures | |
| Write timeouts / rejections / latency / QPS drop | |
| Query timeouts / rejections / slow queries | |
| Nodes look down but CPU still reported; | |
| Intermittent Elasticsearch timeouts + CMS CPU > 80% | |
| Risky settings, Ngram issues, API anomalies | |
| Event code definitions | |
Step 4: Synthesize and write the structured report
Acceptance-style optional checklists: references/acceptance-criteria.md §6.1–§6.6, covering Red/Yellow; read-heavy CPU + `search` pool (+ CMS alignment); JVM / breakers / fielddata; write-queue vs GC + `rejected`/`completed`; read-heavy search pool vs GC-only headline (expand in sop-query-thread-pool.md, "Report narrative: search pool vs GC / CPU headlines"); timeline/recency. Bulk/write: references/sop-write-performance.md §2. Shard `reroute`: references/sop-node-load-imbalance.md §1.3 (allocator / change control only).
[CRITICAL] Remediation must match the diagnosed root cause; avoid generic templates. For wrong breaker or concurrency fixes (e.g. `in_flight_requests` vs `request`, or "split query" when concurrency is the issue), see `sop-memory-gc.md` and the fired signal's SOP.
`activating` + data-plane anomaly: include the one-line cross-layer root cause; see `references/sop-activating-change-stuck.md` section 4.
Report skeleton (copy/fill): references/report-template.md.
Timeline and recency (MUST for synthesized reports)
Problem: `check_es_instance_health.py` and P0/P1/P2 bands express severity, not when a signal mattered most within the analysis window. Cumulative engine counters (`search.rejected`, `write.rejected`) do not encode recency: write and search issues can both be "real" while only one path dominated the recent past (e.g. search pressure closer to window end than write pressure).
Binding rules for the agent:
- Two axes: treat severity (P0/P1/P2) and temporal relevance (proximity to window end / "now") as orthogonal. Do not infer recency from priority alone (e.g. "write is P0 so it must be the current headline") when time-resolved evidence says otherwise.
- Mandatory human-facing section: when more than one major finding fires (e.g. write pool + search pool + GC/CPU), the synthesized report must include an `### Incident timeline (recency-ordered)` (or equivalent) block before or immediately after the executive summary, unless the user explicitly asks for a minimal report. In that block:
  - Order bullets or rows by time (earlier → later), or state which signal cluster peaked / persisted in the latter portion of `{begin} ~ {end}`.
  - Call out divergence: e.g. "write-path stress earlier in window; search-path / CPU more recent" when CMS or logs support it.
- Evidence for recency (use what exists; do not invent timestamps):
  - CloudMonitor: per-metric time series. Note the peak timestamp or sustained-high interval for `NodeCPUUtilization`, `NodeHeapMemoryUtilization`, GC-related metrics, and `ThreadPool.*` if exposed as rates or non-cumulative series in the collected JSON.
  - Slow logs / `ListSearchLog`: correlate query vs index slow entries to minutes.
  - Engine (optional): two samples at known times to show delta on `_nodes/stats/thread_pool` `rejected`/`completed`; or `_tasks` / `hot_threads` for current skew vs historical cumulative counters.
- Executive summary ordering: the opening 2–4 sentences should reflect recency-weighted user impact. If search pressure is closer to current than write pressure, lead with search/query concurrency and co-stress (GC/CPU) as appropriate, and place historical write saturation as context or second wave, without dropping P0 write findings if they remain valid for remediation backlog.
- Explicit uncertainty: if only cumulative counters exist and no time series differentiates paths, state one line: recency is undifferentiated; recommend narrower window, slow logs, or delta sampling for the next run.
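The delta-sampling idea in the evidence list can be illustrated as follows. This is a minimal sketch, assuming two `_nodes/stats/thread_pool` responses captured at known times and already parsed into dicts; the helper name `rejected_delta` is hypothetical and not part of `check_es_instance_health.py`. The `nodes.<id>.thread_pool.<pool>.rejected` layout follows the Elasticsearch nodes stats response.

```python
def rejected_delta(earlier: dict, later: dict, pools=("search", "write")) -> dict:
    """Sum per-pool `rejected` growth between two _nodes/stats/thread_pool samples.

    Only nodes present in both samples contribute, because a cumulative
    counter is comparable only against the same node's earlier value.
    """
    totals = {pool: 0 for pool in pools}
    for node_id, node in later.get("nodes", {}).items():
        before = earlier.get("nodes", {}).get(node_id)
        if before is None:
            continue
        for pool in pools:
            now = node.get("thread_pool", {}).get(pool, {}).get("rejected", 0)
            then = before.get("thread_pool", {}).get(pool, {}).get("rejected", 0)
            # A counter that went backwards suggests a restart; count it as zero growth.
            totals[pool] += max(now - then, 0)
    return totals
```

With such a delta, a large but flat `write.rejected` total reads as historical context, while a growing `search.rejected` is the more recent signal, which is the distinction the timeline block asks for.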
6. Data collection details (CLI OpenAPI + injected input)
One-shot entry
Use the same `check_es_instance_health.py` command as §5 Step 1 (optional `--window` / `--profile`; default window 60 minutes if omitted).
Injected input mode (paired with CLI)
`check_es_instance_health.py`:
```bash
python3 scripts/check_es_instance_health.py \
  -i <InstanceId> -r <RegionId> \
  --data-source input \
  --input-json-file /path/to/diag-input.json
```
Input JSON shape:
```json
{
  "status_info": {},
  "metrics": {},
  "events": [],
  "logs": []
}
```
`--data-source`:
- `auto`: prefer injected fields; backfill gaps via Aliyun CLI.
- `cli`: ignore injection; fetch everything via CLI.
- `input`: injection only; no OpenAPI calls.
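The three `--data-source` modes can be pictured with the sketch below. It is an illustration only, not the script's real logic: `resolve_inputs`, `EXPECTED_KEYS`, and the `fetch_via_cli` callback are hypothetical names, and treating an empty injected field as a gap in `auto` mode is an assumption.

```python
EXPECTED_KEYS = ("status_info", "metrics", "events", "logs")
LIST_KEYS = ("events", "logs")


def resolve_inputs(mode: str, injected: dict, fetch_via_cli) -> dict:
    """Hypothetical illustration of the auto / cli / input modes.

    fetch_via_cli(key) stands in for whatever Aliyun CLI call fills that key.
    """
    if mode not in ("auto", "cli", "input"):
        raise ValueError(f"unknown data source mode: {mode}")
    resolved = {}
    for key in EXPECTED_KEYS:
        value = injected.get(key)
        if mode == "input":
            # Injection only: keep what was given and never call OpenAPI.
            empty = [] if key in LIST_KEYS else {}
            resolved[key] = value if value is not None else empty
        elif mode == "cli":
            # Ignore injection entirely.
            resolved[key] = fetch_via_cli(key)
        else:
            # auto: prefer a non-empty injected field, otherwise backfill (assumed rule).
            resolved[key] = value if value else fetch_via_cli(key)
    return resolved
```

In this reading, `input` is the mode that stays fully offline, which suits replaying a saved `diag-input.json` without credentials.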
Manual control-plane CLI backfill
For additional OpenAPI examples, see `references/verification-method.md`.
7. Elasticsearch direct API access (data-plane deep dive)
When feasibility holds (including reachability), execute the REST calls required by any MUST-trigger row (§5). For endpoints not listed in a fired MUST row, call them only when feasibility and necessity both hold per the strategy doc.
`ES_ENDPOINT` may be `host:port` or a full URL. For the samples below, normalize to `http://${ES_ENDPOINT#http://}` (use `https://` consistently when the cluster serves TLS).
Timeouts: every `curl` must use `--connect-timeout 10 --max-time 30`.
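The endpoint normalization described above can be sketched in Python for callers that do not go through the shell. This is a minimal sketch; `normalize_endpoint` and its `tls` flag are hypothetical names. Note that the shell expansion `${ES_ENDPOINT#http://}` strips only a leading `http://`, so a value that already starts with `https://` needs separate handling, which this sketch includes.

```python
def normalize_endpoint(endpoint: str, tls: bool = False) -> str:
    """Return a base URL with exactly one scheme prefix.

    Mirrors the idiom http://${ES_ENDPOINT#http://}, extended so that an
    existing https:// prefix is also stripped before the chosen scheme is added.
    """
    bare = endpoint.strip()
    for prefix in ("http://", "https://"):
        if bare.startswith(prefix):
            bare = bare[len(prefix):]
            break
    scheme = "https" if tls else "http"
    # Drop a trailing slash so callers can append paths such as /_cluster/health.
    return f"{scheme}://{bare.rstrip('/')}"
```

Keeping the scheme decision in one place makes it easier to stay consistent across all calls in a session, which is what avoids the http/https mismatch errors.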
Red / Yellow (MUST) — recommended set
Scope: The cluster-health MUST row uses max `ClusterStatus` ≥ Yellow (includes Red). Use this set for unassigned / misallocated shard root cause on the engine.
```bash
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cluster/health?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  -H "Content-Type: application/json" \
  -X POST "http://${ES_ENDPOINT#http://}/_cluster/allocation/explain?pretty" \
  -d '{}'
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cluster/pending_tasks?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/stats/thread_pool?pretty"
```
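Once the `_cat/shards` output is collected, the unassigned reasons can be tallied to support the root-cause line. This is a minimal sketch, assuming the same request is made with `format=json` added and that the JSON keys match the `h=` column names; the helper name `unassigned_reasons` is hypothetical.

```python
from collections import Counter


def unassigned_reasons(cat_shards_rows: list) -> Counter:
    """Count unassigned.reason values for shards whose state is UNASSIGNED.

    Expects rows parsed from /_cat/shards with format=json and
    h=index,shard,prirep,state,node,unassigned.reason.
    """
    return Counter(
        row.get("unassigned.reason") or "UNKNOWN"
        for row in cat_shards_rows
        if row.get("state") == "UNASSIGNED"
    )
```

The dominant reason is only a pointer; the `_cluster/allocation/explain` response remains the evidence to cite for why a specific shard cannot be allocated.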
Query / write performance (MUST) — recommended set
Include `_cluster/settings` when heap / GC / breaker rules fired in Step 1 or `_nodes/stats/breaker` shows concern; read transient and persistent `indices.breaker.*` / `network.breaker.*`.
```bash
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/hot_threads?threads=3"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/stats/breaker?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cluster/settings?include_defaults=true&pretty"
```
`/_cluster/pending_tasks` and `GET /_nodes/stats/thread_pool` are also listed under Red / Yellow (MUST) above; make one call each per session when both sections apply. If you run only this performance block, add those two `curl` lines from that block.
Resource anomalies without a closed loop (SHOULD) — recommended set
```bash
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cat/nodes?v&s=cpu:desc&h=name,ip,cpu,heap.percent,ram.percent,load_1m"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_nodes/stats/jvm?pretty"
curl -sS --connect-timeout 10 --max-time 30 -u "${ES_USERNAME:-elastic}:${ES_PASSWORD}" \
  "http://${ES_ENDPOINT#http://}/_cat/allocation?v&bytes=gb"
```
`GET /_cluster/settings?include_defaults=true` also appears under Query / write performance (MUST) above; reuse one response when both blocks apply. If you run only this SHOULD block, add the same `curl` line from that block.
Protocol sanity (avoid `WRONG_VERSION_NUMBER`): usually an http/https scheme mismatch on `ES_ENDPOINT`; fix scheme/port and retry.
Scenario → endpoint index: references/es-api-catalog.md.
8. Diagnostic coverage
The knowledge base covers 48+ health-event-style rules and chained scenarios (e.g. disk pressure → allocation → Red). Per-category counts, P0/P1/P2 mix, and event codes: references/health-events-catalog.md. Scenario runbooks: `references/sop-*.md` (index: references/README.md).
9. Best practices
- Read-only: no mutating control-plane APIs; no teardown.
- Layered + evidence-bound: scan → SOP depth; every conclusion cites metrics/logs/events; if ES is unreachable, state limits (es-api-call-failures.md).
- Priority vs narrative: P0→P2 for urgency; Incident timeline when multiple dimensions differ in time (Step 4). Credentials / TLS / parameters: §1–2 and §4.
- Green is not "all clear": watermarks, blocks, and mis-set limits still matter. MUST + reachable ES: do not skip §5/§7 evidence because the cluster is Green or OpenAPI "explains" symptoms.
- Thread-pool `rejected`: cumulative unless you show a delta (sop-query-thread-pool.md §1–2); write/bulk: sop-write-performance.md §2.
10. Reference links
- `references/verification-method.md`: Verification (how to validate diagnosis; metrics, APIs, workflows)
- `references/report-template.md`: Structured diagnosis report skeleton
- `references/README.md`: Language map (reference assets and `sop-*.md` runbooks; English in this repo)
- `references/ram-policies.md`: RAM policy list
- `references/acceptance-criteria.md`: Correct/incorrect patterns and acceptance (includes credential and safety anti-patterns)
- `references/cli-installation-guide.md`: Aliyun CLI installation
- `references/es-api-catalog.md`: Elasticsearch REST API catalog
- `references/health-events-catalog.md`: Health event catalog
- `references/sop-*.md`: Scenario SOPs (e.g. `sop-activating-change-stuck.md` for `activating` / change stuck, cross-layer root cause)
- `references/es-api-diagnosis-strategy.md`: Elasticsearch API diagnosis strategy