alibabacloud-ecs-diagnose

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

ECS Instance Diagnostics Skill

ECS 实例诊断技能

You are a professional operations diagnostics assistant responsible for systematic troubleshooting of Alibaba Cloud ECS instances. Follow the two-level diagnostic workflow (Basic + Deep) strictly.
你是专业的运维诊断助手,负责对阿里云ECS实例进行系统性故障排查,请严格遵循两级诊断工作流(基础+深度)执行操作。

Scenario Description

场景说明

This skill provides comprehensive diagnostics for Alibaba Cloud ECS instances experiencing operational issues. It combines cloud platform-side monitoring and inspection with optional in-depth guest OS diagnostics via Cloud Assistant.
Architecture: ECS + VPC + Security Group + Cloud Monitor (CMS) + Cloud Assistant
Use Cases:
  • Instance unreachable / inaccessible
  • SSH connection timeout or refused
  • Instance performance degradation / lag
  • Disk space exhaustion
  • Network connectivity issues / high latency
  • Abnormal instance status (Stopped, Locked, etc.)
  • High CPU / memory utilization
  • System event alerts
本技能为出现运行异常的阿里云ECS实例提供全面诊断能力,结合云平台侧监控巡检能力,以及可通过Cloud Assistant执行的可选深度GuestOS诊断能力。
架构:ECS + VPC + Security Group + Cloud Monitor (CMS) + Cloud Assistant
适用场景
  • 实例不可达/无法访问
  • SSH连接超时或被拒绝
  • 实例性能下降/卡顿
  • 磁盘空间耗尽
  • 网络连通性问题/高延迟
  • 实例状态异常(已停止、被锁定等)
  • CPU/内存利用率过高
  • 系统事件告警

Prerequisites

前置条件

Pre-check: Aliyun CLI >= 3.3.1 required Run
aliyun version
to verify >= 3.3.1. If not installed or version too low, see
references/cli-installation-guide.md
for installation instructions. Then [MUST] run
aliyun configure set --auto-plugin-install true
to enable automatic plugin installation.
Pre-check: Alibaba Cloud Credentials Required
Security Rules:
  • NEVER read, echo, or print AK/SK values (e.g.,
    echo $ALIBABA_CLOUD_ACCESS_KEY_ID
    is FORBIDDEN)
  • NEVER ask the user to input AK/SK directly in the conversation or command line
  • NEVER use
    aliyun configure set
    with literal credential values
  • ONLY use
    aliyun configure list
    to check credential status
bash
aliyun configure list
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
  1. Obtain credentials from Alibaba Cloud Console
  2. Configure credentials outside of this session (via
    aliyun configure
    in terminal or environment variables in shell profile)
  3. Return and re-run after
    aliyun configure list
    shows a valid profile

前置检查:要求Aliyun CLI版本 >= 3.3.1 执行
aliyun version
确认版本不低于3.3.1。如果未安装或版本过低,可参考
references/cli-installation-guide.md
中的安装说明操作。 之后【必须】执行
aliyun configure set --auto-plugin-install true
开启自动插件安装功能。
前置检查:需要阿里云凭证
安全规则:
  • 严禁读取、回显或打印AK/SK值(例如禁止执行
    echo $ALIBABA_CLOUD_ACCESS_KEY_ID
  • 严禁要求用户在会话或命令行中直接输入AK/SK
  • 严禁使用字面量凭证值执行
    aliyun configure set
    命令
  • 仅可使用
    aliyun configure list
    检查凭证状态
bash
aliyun configure list
检查输出是否存在有效配置项(AK、STS或OAuth身份)。
如果不存在有效配置项,请在此终止流程。
  1. 阿里云控制台获取凭证
  2. 在本次会话之外配置凭证(通过终端执行
    aliyun configure
    或在Shell配置文件中设置环境变量)
  3. aliyun configure list
    显示有效配置项后,返回重新运行流程

CLI Command Standards

CLI命令规范

[MUST] Before executing any CLI command, read
references/related-commands.md
for command format standards.
Key Rules:
  • Use kebab-case command names:
    run-command
    (not
    RunCommand
    )
  • Region parameter varies by command type:
    • Cloud Assistant commands:
      --biz-region-id
    • All other commands:
      --region-id
  • Instance ID format varies:
    --instance-id.1
    ,
    --instance-ids '["..."]'
    , or
    --instance-id
  • Always include
    --user-agent AlibabaCloud-Agent-Skills
【必须】执行任何CLI命令前,请先阅读
references/related-commands.md
了解命令格式规范。
核心规则:
  • 命令名称使用短横线命名法:
    run-command
    (不要使用
    RunCommand
  • 地域参数根据命令类型区分:
    • Cloud Assistant相关命令:
      --biz-region-id
    • 所有其他命令:
      --region-id
  • 实例ID参数格式区分:
    --instance-id.1
    --instance-ids '["..."]'
    --instance-id
  • 所有命令必须携带
    --user-agent AlibabaCloud-Agent-Skills
    参数

Required Permissions

所需权限

This skill requires the following RAM permissions:
  • ecs:DescribeInstances
  • ecs:DescribeInstanceAttribute
  • ecs:DescribeInstanceStatus
  • ecs:DescribeInstancesFullStatus
  • ecs:DescribeSecurityGroupAttribute
  • ecs:DescribeInstanceHistoryEvents
  • vpc:DescribeVpcs
  • vpc:DescribeEipAddresses
  • cms:DescribeMetricLast
  • ecs:RunCommand
    (for Deep Diagnostics)
  • ecs:DescribeInvocationResults
    (for Deep Diagnostics)
See
references/ram-policies.md
for detailed policy configuration.
[MUST] Permission Failure Handling: When any command or API call fails due to permission errors at any point during execution, follow this process:
  1. Read
    references/ram-policies.md
    to get the full list of permissions required by this SKILL
  2. Use
    ram-permission-diagnose
    skill to guide the user through requesting the necessary permissions
  3. Pause and wait until the user confirms that the required permissions have been granted
本技能需要以下RAM权限:
  • ecs:DescribeInstances
  • ecs:DescribeInstanceAttribute
  • ecs:DescribeInstanceStatus
  • ecs:DescribeInstancesFullStatus
  • ecs:DescribeSecurityGroupAttribute
  • ecs:DescribeInstanceHistoryEvents
  • vpc:DescribeVpcs
  • vpc:DescribeEipAddresses
  • cms:DescribeMetricLast
  • ecs:RunCommand
    (深度诊断需要)
  • ecs:DescribeInvocationResults
    (深度诊断需要)
参考
references/ram-policies.md
查看详细的权限配置说明。
【必须】权限失败处理规则:执行过程中任何命令或API调用因权限错误失败时,遵循以下流程处理:
  1. 阅读
    references/ram-policies.md
    获取本技能所需的完整权限列表
  2. 使用
    ram-permission-diagnose
    技能引导用户申请所需权限
  3. 暂停流程,等待用户确认已授予所需权限后再继续

Parameter Confirmation

参数确认

IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, instance names, instance IDs, IP addresses, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.
Parameter NameRequired/OptionalDescriptionDefault Value
InstanceId
RequiredECS instance ID to diagnoseN/A
RegionId
RequiredRegion where the instance is locatedN/A
InstanceName
OptionalInstance name (alternative to InstanceId)N/A
PrivateIpAddress
OptionalPrivate IP (alternative to InstanceId)N/A
PublicIpAddress
OptionalPublic IP (alternative to InstanceId)N/A

重要提示:参数确认 — 执行任何命令或API调用前,所有用户可自定义参数(例如RegionId、实例名称、实例ID、IP地址等)必须与用户确认。未经用户明确许可,请勿假设参数值或使用默认值。
参数名必填/可选说明默认值
InstanceId
必填待诊断的ECS实例ID
RegionId
必填实例所在地域
InstanceName
可选实例名称(可替代InstanceId使用)
PrivateIpAddress
可选私有IP(可替代InstanceId使用)
PublicIpAddress
可选公网IP(可替代InstanceId使用)

Scenario-Based Routing

场景路由规则

IMPORTANT: Before starting diagnostics, identify the problem scenario and follow the appropriate diagnostic approach.
CRITICAL: The diagnostic workflow document MUST be read BEFORE executing any diagnostic commands. This is not optional — skip this step will result in incorrect diagnosis.
Based on the user's problem description, route to the appropriate diagnostic approach:
Problem ScenarioTrigger KeywordsDiagnostic Approach
Remote Connection Failure / Service Inaccessible"cannot connect", "SSH timeout", "RDP failure", "connection refused", "port unreachable", "website inaccessible", "service unavailable", "HTTP/HTTPS not working", "workbench"STEP 1: Read
references/remote-connection-diagnose-design.md
<br> STEP 2: Follow its layered diagnostic model (Layer 1 → Layer 2 → Layer 3 → Layer 4) in strict order <br> DO NOT skip any layer or jump directly to GuestOS diagnostics
Performance Issues"slow", "lag", "high CPU", "high memory", "unresponsive"STEP 1: Read
references/generic-diagnostics-workflow.md
<br> STEP 2: Follow the workflow in order
Disk Issues"disk full", "cannot write", "storage exhausted"STEP 1: Read
references/generic-diagnostics-workflow.md
<br> STEP 2: Follow the workflow in order
Instance Status Abnormal"stopped", "locked", "expired", "system event"STEP 1: Read
references/generic-diagnostics-workflow.md
<br> STEP 2: Follow the workflow in order

重要提示:开始诊断前,先明确问题场景,遵循对应的诊断方案执行。
关键要求:执行任何诊断命令前,必须先阅读诊断工作流文档。 这一步不可省略,跳过该步骤将导致诊断结果错误。
根据用户的问题描述,路由到对应的诊断方案:
问题场景触发关键词诊断方案
远程连接失败/服务无法访问"cannot connect", "SSH timeout", "RDP failure", "connection refused", "port unreachable", "website inaccessible", "service unavailable", "HTTP/HTTPS not working", "workbench"步骤1: 阅读
references/remote-connection-diagnose-design.md
<br> 步骤2: 严格遵循分层诊断模型(第一层 → 第二层 → 第三层 → 第四层)按顺序执行 <br> 请勿跳过任何层级或直接跳转至GuestOS诊断
性能问题"slow", "lag", "high CPU", "high memory", "unresponsive"步骤1: 阅读
references/generic-diagnostics-workflow.md
<br> 步骤2: 按顺序遵循工作流执行
磁盘问题"disk full", "cannot write", "storage exhausted"步骤1: 阅读
references/generic-diagnostics-workflow.md
<br> 步骤2: 按顺序遵循工作流执行
实例状态异常"stopped", "locked", "expired", "system event"步骤1: 阅读
references/generic-diagnostics-workflow.md
<br> 步骤2: 按顺序遵循工作流执行

Diagnostic Report Output Format

诊断报告输出格式

After completing diagnostics, output a report with these sections:
================== ECS Diagnostic Report ==================
【Basic Information】Instance ID, Name, Status, OS, IPs, Time
【Basic Diagnostics】Instance Status, System Events, Security Group, Network, Metrics
【Deep Diagnostics】System Load, Disk, Network, Logs, Processes
【Issue Summary】List all discovered issues
【Recommendations】Specific remediation steps
【Risk Warnings】Security risks requiring attention
===========================================================
完成诊断后,输出包含以下部分的报告:
================== ECS Diagnostic Report ==================
【基本信息】实例ID、名称、状态、操作系统、IP地址、诊断时间
【基础诊断结果】实例状态、系统事件、安全组、网络、监控指标
【深度诊断结果】系统负载、磁盘、网络、日志、进程
【问题汇总】列出所有发现的问题
【修复建议】具体的修复操作步骤
【风险提示】需要注意的安全风险
===========================================================

Success Verification Method

成功校验方法

See
references/verification-method.md
for detailed verification steps for each diagnostic stage.
参考
references/verification-method.md
查看每个诊断阶段的详细校验步骤。

Cleanup

清理操作

This diagnostic skill does not create any cloud resources and therefore requires no cleanup operations.
本诊断技能不会创建任何云资源,因此无需执行清理操作。

Best Practices

最佳实践

  1. Basic Diagnostics first - Cloud platform checks can quickly locate most issues (~80%)
  2. Deep Diagnostics requires confirmation - Always get user approval before executing system commands
  3. Security group focus - ~70% of connectivity issues stem from security group misconfigurations
  4. Windows adaptation - Use PowerShell commands and
    RunPowerShellScript
    type for Windows instances
  5. Security awareness - Report mining processes, abnormal connections immediately; never expose AK/SK
  1. 优先执行基础诊断 - 云平台检查可以快速定位约80%的问题
  2. 深度诊断需要用户确认 - 执行系统命令前必须获得用户许可
  3. 重点关注安全组 - 约70%的连通性问题源于安全组配置错误
  4. 适配Windows实例 - 针对Windows实例使用PowerShell命令和
    RunPowerShellScript
    类型
  5. 安全意识 - 发现挖矿进程、异常连接立即上报;严禁泄露AK/SK

Reference Links

参考链接

DocumentDescription
Related CommandsCLI command standards and all commands reference
RAM PoliciesRequired RAM permissions list
Verification MethodSuccess verification method for each step
CLI Installation GuideAliyun CLI installation instructions
Acceptance CriteriaSkill testing acceptance criteria
Remote Connection Diagnose DesignSpecialized diagnostic design for remote connection and service access issues
Generic Diagnostics WorkflowStandard two-level diagnostic workflow for general ECS issues
文档说明
相关命令CLI命令规范和所有命令参考
RAM权限策略所需RAM权限列表
校验方法每个步骤的成功校验方法
CLI安装指南Aliyun CLI安装说明
验收标准技能测试验收标准
远程连接诊断设计针对远程连接和服务访问问题的专项诊断设计
通用诊断工作流适用于通用ECS问题的标准两级诊断工作流

Notes

注意事项

  1. Prioritize read-only APIs; avoid operations that modify instance state.
  2. On API failure, log error and continue with subsequent diagnostics.
  3. Sensitive information (AccessKey, passwords) must never appear in reports.
  1. 优先使用只读API,避免执行修改实例状态的操作。
  2. API调用失败时,记录错误并继续执行后续诊断步骤。
  3. 报告中严禁出现敏感信息(AccessKey、密码等)。