hyperpod-ssm


HyperPod SSM Access

SSM Target Format

Target: sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>
  • CLUSTER_ID: Last segment of the cluster ARN (NOT the cluster name). Extract via get-cluster-info.sh.
  • GROUP_NAME: Instance group name; retrieve via list-nodes.sh.
  • INSTANCE_ID: EC2 instance ID (e.g., i-0123456789abcdef0).
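
As an illustration of the format above, here is a minimal sketch that assembles a target string from its three parts. The function name `build_target` is hypothetical, and the sketch assumes the cluster ARN ends in `/<CLUSTER_ID>` so that the last segment can be taken with a shell suffix strip.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not one of the scripts in scripts/): build an SSM target.
# Assumption: the cluster ARN has the form ...:cluster/<CLUSTER_ID>.
build_target() {
  local cluster_arn="$1" group="$2" instance_id="$3"
  # Keep only the text after the last "/" of the ARN, i.e. the cluster ID.
  local cluster_id="${cluster_arn##*/}"
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}
```

For example, `build_target "$CLUSTER_ARN" workers i-0123456789abcdef0` prints a string that can be passed to `--target`.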

Scripts

Three scripts live under scripts/. Resolve cluster info and nodes once, then execute per node.

get-cluster-info.sh — Resolve cluster name → ID (call once)

bash
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]

Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}

list-nodes.sh — List all nodes with pagination (call once)

bash
scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]

Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)

`list-cluster-nodes` paginates at 100 nodes. This script handles pagination automatically.
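
The two "call once" outputs can be combined into per-node targets. The sketch below is one possible way to do that, assuming `jq` is installed and that list-nodes.sh prints the JSON array described above; the function name `nodes_to_targets` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the list-nodes.sh JSON array on stdin and print
# one SSM target per node. Assumes jq is available on the machine.
nodes_to_targets() {
  local cluster_id="$1"
  jq -r --arg cid "$cluster_id" \
    '.[] | "sagemaker-cluster:\($cid)_\(.InstanceGroupName)-\(.InstanceId)"'
}

# Possible usage (commented out because it needs a real cluster):
#   cluster_id=$(scripts/get-cluster-info.sh CLUSTER_NAME | jq -r '.cluster_id')
#   scripts/list-nodes.sh CLUSTER_NAME | nodes_to_targets "$cluster_id"
```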

ssm-exec.sh — Execute command on a node (call per node)

bash

Execute — with pre-built target

scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

Execute — with parts

scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

Upload

scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

Read remote file

scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]

Running Commands Across Many Nodes

SSM start-session rate limit: 3 TPS per account. Plan batch size and delay accordingly.
aws ssm send-command does NOT support sagemaker-cluster: targets; only start-session works.
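
One way to respect the rate limit is to run nodes in small batches with a pause between them. The following is a minimal sketch, not part of the provided scripts: `run_on_targets`, `BATCH_SIZE`, and `BATCH_DELAY` are hypothetical names, and the default values are assumptions to tune for your account.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one command on many pre-built targets, pausing
# between batches to stay under the start-session rate limit.
SSM_EXEC="${SSM_EXEC:-scripts/ssm-exec.sh}"  # executor script from the Scripts section
BATCH_SIZE="${BATCH_SIZE:-3}"                # targets per batch (assumed default)
BATCH_DELAY="${BATCH_DELAY:-2}"              # seconds between batches (assumed default)

run_on_targets() {
  local cmd="$1"
  shift
  local count=0 target
  for target in "$@"; do
    # Report a failed node on stderr and keep going with the rest.
    "$SSM_EXEC" --target "$target" "$cmd" || echo "FAILED: $target" >&2
    count=$((count + 1))
    # Pause after every BATCH_SIZE targets.
    if [ $((count % BATCH_SIZE)) -eq 0 ]; then
      sleep "$BATCH_DELAY"
    fi
  done
}
```

Calls run sequentially here, which keeps the request rate low at the cost of wall-clock time; a parallel variant would need its own throttling.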

Manual SSM Commands

When the scripts aren't suitable, use aws ssm start-session directly with AWS-StartNonInteractiveCommand:
bash
cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF

aws ssm start-session \
  --target "sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>" \
  --region REGION \
  --document-name AWS-StartNonInteractiveCommand \
  --parameters file:///tmp/cmd.json
Always use a JSON file for --parameters; inline parameters break with special characters.
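
Hand-writing the JSON file gets error-prone once the command itself contains quotes. As a hedged sketch, assuming `jq` is installed, the file can be generated so that escaping is handled automatically; `write_ssm_params` is a hypothetical name, not one of the provided scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a parameters file for AWS-StartNonInteractiveCommand
# and print its path. Assumes jq is available; jq handles the JSON escaping.
write_ssm_params() {
  local cmd="$1" out
  out="$(mktemp)" || return 1
  jq -n --arg cmd "$cmd" '{command: [$cmd]}' > "$out" || return 1
  printf '%s\n' "$out"
}

# Possible usage (commented out because it needs AWS credentials):
#   params=$(write_ssm_params "bash -c 'echo hello && whoami'")
#   aws ssm start-session --target "$TARGET" --region REGION \
#     --document-name AWS-StartNonInteractiveCommand \
#     --parameters "file://$params"
```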

Common Diagnostic Commands

Task              Command
Lifecycle logs    cat /var/log/provision/provisioning.log
Memory            free -h
Disk/mounts       df -h && lsblk
GPU status        nvidia-smi
GPU memory        nvidia-smi --query-gpu=memory.used,memory.total --format=csv
EFA/network       fi_info -p efa
CloudWatch agent  sudo systemctl status amazon-cloudwatch-agent
Top processes     ps aux --sort=-%mem | head -20

Key Details

  • Default SSM non-interactive user is root.
  • SSM rate limit: 3 TPS per account.
  • For interactive sessions (rare), omit --document-name to get a shell.
  • Interactive commands (vim, top) are not supported via AWS-StartNonInteractiveCommand.
  • Large outputs may be truncated by SSM.
  • For troubleshooting common errors, see references/troubleshooting.md.