hyperpod-ssm
HyperPod SSM Access
SSM Target Format
Target: `sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>`

- `CLUSTER_ID`: Last segment of the cluster ARN (NOT the cluster name). Extract via `get-cluster-info.sh`.
- `GROUP_NAME`: Instance group name. Retrieve via `list-nodes.sh`.
- `INSTANCE_ID`: EC2 instance ID (e.g., `i-0123456789abcdef0`).
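The target format above can be assembled with plain string handling. The following is a minimal sketch, not part of the provided scripts: the helper names `cluster_id_from_arn` and `build_ssm_target` are hypothetical, and the ARN, account number, and group name are placeholder values chosen for illustration.

```bash
#!/usr/bin/env bash

# cluster_id_from_arn ARN
# Prints the last "/"-separated segment of a cluster ARN (hypothetical helper).
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# build_ssm_target CLUSTER_ID GROUP_NAME INSTANCE_ID
# Prints a target in the sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID> form.
build_ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}

# Placeholder values for illustration only.
arn="arn:aws:sagemaker:us-west-2:111122223333:cluster/abcd1234efgh"
build_ssm_target "$(cluster_id_from_arn "$arn")" "worker-group" "i-0123456789abcdef0"
```

In practice the cluster ID and group name would come from `get-cluster-info.sh` and `list-nodes.sh` rather than being hard-coded.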
Scripts
Three scripts under `scripts/`. Resolve cluster info and nodes once, then execute per node.

get-cluster-info.sh: resolve cluster name → ID (call once)

```bash
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
```

Output: `{"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}`
list-nodes.sh: list all nodes with pagination (call once)

```bash
scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
```

Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)
`list-cluster-nodes` paginates at 100 nodes. This script handles pagination automatically.
ssm-exec.sh: execute a command on a node (call per node)

```bash
# Execute, with pre-built target
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

# Execute, with parts
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

# Read remote file
scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]
```
Running Commands Across Many Nodes
SSM rate limit: 3 TPS per account. Plan batch size and delay accordingly.
Use `start-session`, not `aws ssm send-command`: `sagemaker-cluster:` targets work only with `start-session`.
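One way to stay under the rate limit is to pause after every few invocations. The sketch below assumes one SSM call per node; the function name `run_throttled` and the wrapper `run_on_node` are hypothetical, and the batch size and delay are parameters to tune, not values prescribed by the scripts.

```bash
#!/usr/bin/env bash

# run_throttled BATCH_SIZE DELAY_SECONDS COMMAND [ARGS...]
# Reads one target per line from stdin and runs "COMMAND ARGS... TARGET" for each,
# sleeping DELAY_SECONDS after every BATCH_SIZE invocations (hypothetical helper).
run_throttled() {
  local batch_size="$1" delay="$2"
  shift 2
  local count=0 target
  while IFS= read -r target; do
    [ -n "$target" ] || continue          # skip blank lines
    "$@" "$target" </dev/null             # keep the command from consuming the target list
    count=$((count + 1))
    if [ $((count % batch_size)) -eq 0 ]; then
      sleep "$delay"
    fi
  done
}

# Example wiring (assumes scripts/ssm-exec.sh is present and AWS credentials are configured):
#   run_on_node() { scripts/ssm-exec.sh --target "$1" 'whoami'; }
#   printf '%s\n' "${targets[@]}" | run_throttled 3 1 run_on_node
```

With a batch size of 3 and a 1-second delay, at most three calls are started per pause. Commands that themselves take time lower the effective rate further, so this is a conservative upper bound rather than an exact 3 TPS pacing.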
Manual SSM Commands

When the scripts aren't suitable, use `aws ssm start-session` directly with `AWS-StartNonInteractiveCommand`:

```bash
cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF
aws ssm start-session \
  --target sagemaker-cluster:{CLUSTER_ID}_{GROUP_NAME}-{INSTANCE_ID} \
  --region REGION \
  --document-name AWS-StartNonInteractiveCommand \
  --parameters file:///tmp/cmd.json
```

Always use a JSON file for `--parameters`; inline parameters break with special characters.
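When the command is built dynamically, the JSON file can be generated rather than hand-written. This is a minimal sketch under stated assumptions: `json_escape` and `write_ssm_params` are hypothetical helpers, and the escaping covers only backslashes, double quotes, newlines, and tabs, not every control character JSON forbids.

```bash
#!/usr/bin/env bash

# json_escape STRING
# Escapes backslashes, double quotes, newlines, and tabs for use inside a JSON string.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}        # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# write_ssm_params FILE COMMAND
# Writes {"command": ["COMMAND"]} to FILE with COMMAND JSON-escaped.
write_ssm_params() {
  local file="$1" cmd="$2"
  printf '{"command": ["%s"]}\n' "$(json_escape "$cmd")" > "$file"
}

params_file=$(mktemp)
write_ssm_params "$params_file" "bash -c 'echo \"hello\" && whoami'"
cat "$params_file"

# The file can then be passed as in the manual command:
#   aws ssm start-session --target TARGET --region REGION \
#     --document-name AWS-StartNonInteractiveCommand \
#     --parameters "file://$params_file"
```

Using `mktemp` instead of a fixed `/tmp/cmd.json` avoids collisions when several commands are prepared at once.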
Common Diagnostic Commands
| Task | Command |
|---|---|
| Lifecycle logs | |
| Memory | |
| Disk/mounts | |
| GPU status | |
| GPU memory | |
| EFA/network | |
| CloudWatch agent | |
| Top processes | |
Key Details
- Default SSM non-interactive user is `root`.
- SSM rate limit: 3 TPS per account.
- For interactive sessions (rare), omit `--document-name` to get a shell.
- Interactive commands (vim, top) are not supported via `AWS-StartNonInteractiveCommand`.
- Large outputs may be truncated by SSM.
- For troubleshooting common errors, see references/troubleshooting.md.