hyperpod-version-checker
HyperPod Version Checker
Upload to cluster nodes via the `hyperpod-ssm` skill, then execute.
Usage
```bash
# Text report to console + file
bash hyperpod_check_versions.sh

# JSON only to stdout (text report still saved to file); best for piping/parsing
bash hyperpod_check_versions.sh --json

# Custom output file
bash hyperpod_check_versions.sh --output /tmp/versions.txt

# No color (for logging)
bash hyperpod_check_versions.sh --no-color
```

Output file: `component_versions_<hostname>_<timestamp>.txt` (default)
What It Checks
| Component | Detection Method | Applicable When |
|---|---|---|
| NVIDIA Driver | | GPU instances (p3/p4/p5/g5) |
| CUDA Toolkit | | GPU instances |
| cuDNN | Header file, packages | GPU instances doing deep learning |
| NCCL | Library filename, header, packages | Distributed GPU training |
| EFA | | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL | | EFA + NCCL workloads |
| GDRCopy | rpm/dpkg, kernel module | GPU instances with RDMA (p4d+/p5) |
| MPI | | Distributed training |
| Neuron SDK | | Trainium/Inferentia (trn1/trn2/inf1/inf2) |
| Python/PyTorch | | ML workloads |
| Container runtime | | EKS clusters |
Multi-Node Comparison
Run on each node individually via the `hyperpod-ssm` skill. With `--json`, stdout is clean JSON for easy diffing.
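One way to compare nodes, once each node's `--json` output has been saved to a local file, is to diff every report against a baseline. The sketch below is a minimal illustration: `compare_reports` and the file names are hypothetical and not part of `hyperpod_check_versions.sh`, and it assumes the JSON carries no per-node fields (such as hostname or timestamp) that would differ on every node.

```bash
# Hypothetical helper: compare_reports BASELINE.json OTHER.json...
# Prints a unified diff for every report that differs from the baseline
# and returns 1 if at least one report differs.
compare_reports() {
    local baseline=$1 status=0 report
    shift
    for report in "$@"; do
        if ! diff -q "$baseline" "$report" >/dev/null; then
            echo "DIFFERS: $report"
            diff -u "$baseline" "$report"
            status=1
        fi
    done
    return $status
}

# Example (file names are assumptions):
#   bash hyperpod_check_versions.sh --json > versions_node1.json   # on each node
#   compare_reports versions_node1.json versions_node2.json versions_node3.json
```

If the reports do contain per-node fields, those would need to be filtered out before diffing.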
Compatibility Reference
The script automatically analyzes CUDA/driver compatibility. For reference:
| Driver Series | Supported CUDA |
|---|---|
| 580+ | 13.x, 12.x, 11.x |
| 570+ | 12.8+ (Blackwell), 12.x, 11.x |
| 545+ | 12.3-12.7, 11.x |
| 525-535 | 12.0-12.2, 11.x |
| 450+ | 11.x only |
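The table above can be expressed as a small lookup for ad hoc checks. This is only a sketch of the table, not the script's own logic: `supported_cuda` is a hypothetical helper, and it assumes that driver series falling between listed rows (for example 536-544) behave like the nearest lower row.

```bash
# Hypothetical helper: map an NVIDIA driver version to the CUDA series
# listed in the reference table. Accepts "535" or "535.183.01".
supported_cuda() {
    local major=${1%%.*}   # keep only the driver series, e.g. 535
    case $major in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;   # reject non-numeric input
    esac
    if   [ "$major" -ge 580 ]; then echo "13.x, 12.x, 11.x"
    elif [ "$major" -ge 570 ]; then echo "12.8+ (Blackwell), 12.x, 11.x"
    elif [ "$major" -ge 545 ]; then echo "12.3-12.7, 11.x"
    elif [ "$major" -ge 525 ]; then echo "12.0-12.2, 11.x"
    elif [ "$major" -ge 450 ]; then echo "11.x only"
    else echo "unknown"; return 1   # older than any row in the table
    fi
}

# Example on a GPU node:
#   supported_cuda "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
```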
NCCL: Use 2.18+ for CUDA 12.x, 2.12+ for CUDA 11.x. Must be consistent across all nodes.
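The NCCL minimums can be checked the same way. The sketch below assumes versions are given as dotted numbers such as `2.18.3` and `12.2`; `nccl_ok` is a hypothetical helper, not something the script provides.

```bash
# Hypothetical helper: nccl_ok NCCL_VERSION CUDA_VERSION
# Returns 0 if NCCL meets the minimum for that CUDA series,
# 1 if it is too old, 2 if the CUDA series is not covered by the note above.
nccl_ok() {
    local nccl_major nccl_rest nccl_minor min_minor
    nccl_major=${1%%.*}
    nccl_rest=${1#*.}
    nccl_minor=${nccl_rest%%.*}
    case ${2%%.*} in
        12) min_minor=18 ;;   # CUDA 12.x needs NCCL 2.18+
        11) min_minor=12 ;;   # CUDA 11.x needs NCCL 2.12+
        *)  return 2 ;;
    esac
    [ "$nccl_major" -gt 2 ] ||
        { [ "$nccl_major" -eq 2 ] && [ "$nccl_minor" -ge "$min_minor" ]; }
}

# Example:
#   nccl_ok 2.18.3 12.2 && echo "NCCL version is sufficient"
```

This covers only the minimum version; consistency across nodes still has to be confirmed by comparing the per-node reports.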
| EFA Installer | AWS OFI NCCL |
|---|---|
| 1.29+ | v1.7.3+ (recommended) |
| 1.26-1.28 | v1.7.0-v1.7.2 |
| 1.20-1.25 | v1.6.0+ |
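As a rough sketch, the EFA table can also be turned into a lookup. `recommended_ofi_nccl` is a hypothetical name, and the function assumes EFA installer versions of the form `1.<minor>[.<patch>]`.

```bash
# Hypothetical helper: map an EFA installer version to the AWS OFI NCCL
# range listed in the table above.
recommended_ofi_nccl() {
    local major rest minor
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" = 1 ] || { echo "unknown"; return 1; }   # table only covers 1.x
    if   [ "$minor" -ge 29 ]; then echo "v1.7.3+"
    elif [ "$minor" -ge 26 ]; then echo "v1.7.0-v1.7.2"
    elif [ "$minor" -ge 20 ]; then echo "v1.6.0+"
    else echo "unknown"; return 1   # older than any row in the table
    fi
}

# Example:
#   recommended_ofi_nccl 1.27.0
```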