hyperpod-version-checker

HyperPod Version Checker


Upload `hyperpod_check_versions.sh` to cluster nodes via the `hyperpod-ssm` skill, then execute it.

Usage


```bash
# Text report to console + file
bash hyperpod_check_versions.sh

# JSON only to stdout (text report still saved to file); best for piping/parsing
bash hyperpod_check_versions.sh --json

# Custom output file
bash hyperpod_check_versions.sh --output /tmp/versions.txt

# No color (for logging)
bash hyperpod_check_versions.sh --no-color
```

Default output file: `component_versions_<hostname>_<timestamp>.txt`

What It Checks


| Component | Detection Method | Applicable When |
|---|---|---|
| NVIDIA Driver | `nvidia-smi` | GPU instances (p3/p4/p5/g5) |
| CUDA Toolkit | `nvcc`, `/usr/local/cuda` symlink | GPU instances |
| cuDNN | Header file, packages | GPU instances doing deep learning |
| NCCL | Library filename, header, packages | Distributed GPU training |
| EFA | `/opt/amazon/efa_installed_packages`, `fi_info` | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL | `efa_installed_packages`, library search | EFA + NCCL workloads |
| GDRCopy | rpm/dpkg, kernel module | GPU instances with RDMA (p4d+/p5) |
| MPI | `mpirun`, `/opt/amazon/openmpi` | Distributed training |
| Neuron SDK | `neuronx-cc`, `neuron-ls`, packages | Trainium/Inferentia (trn1/trn2/inf1/inf2) |
| Python/PyTorch | `python3`, `torch` import | ML workloads |
| Container runtime | `docker`, `containerd`, `kubectl`, `nvidia-ctk` | EKS clusters |
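Several of the detection methods in the table come down to checking whether a command is on the `PATH`. The sketch below illustrates that idea only; it is not taken from `hyperpod_check_versions.sh`, and the function name `probe_tool` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: a generic "is this tool installed?" probe.
# This is NOT the actual logic of hyperpod_check_versions.sh.

# Print "<name>: found at <location>" and return 0,
# or print "<name>: not found" and return 1.
probe_tool() {
  local name="$1" location
  if location="$(command -v "$name" 2>/dev/null)"; then
    printf '%s: found at %s\n' "$name" "$location"
  else
    printf '%s: not found\n' "$name"
    return 1
  fi
}

# Probe the command-line tools named in the table.
for tool in nvidia-smi nvcc fi_info mpirun neuronx-cc neuron-ls \
            python3 docker containerd kubectl nvidia-ctk; do
  probe_tool "$tool" || true   # a missing tool is expected on many instance types
done
```

A missing tool is not necessarily a problem: `neuron-ls` is only expected on Trainium/Inferentia instances, for example, and `nvidia-smi` only on GPU instances.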

Multi-Node Comparison


Run on each node individually via the `hyperpod-ssm` skill. With `--json`, stdout is clean JSON for easy diffing.
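One way to use the JSON output is to save each node's stdout to its own local file and diff every file against a baseline. The sketch below assumes that workflow; the function name `compare_reports` and the file names in the usage line are hypothetical, and it assumes the script emits keys in a stable order so that a plain text diff is meaningful.

```bash
#!/usr/bin/env bash
# Sketch: compare per-node JSON reports against a baseline (the first file given).
# Assumes each file holds the stdout of `hyperpod_check_versions.sh --json`
# from one node, and that key order is stable across nodes (an assumption).

compare_reports() {
  local baseline="$1" report status=0
  shift
  for report in "$@"; do
    if diff -u "$baseline" "$report" >/dev/null; then
      printf 'MATCH    %s\n' "$report"
    else
      printf 'MISMATCH %s\n' "$report"
      diff -u "$baseline" "$report" || true   # show what differs
      status=1
    fi
  done
  return "$status"   # 0 only if every report matches the baseline
}
```

For example, `compare_reports node-1.json node-2.json node-3.json` treats `node-1.json` as the baseline and returns non-zero if any other node differs. Fields that legitimately vary per node, such as a hostname or timestamp, would show up as differences and may need to be filtered out first.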

Compatibility Reference


The script automatically analyzes CUDA/driver compatibility. For reference:

| Driver Series | Supported CUDA |
|---|---|
| 580+ | 13.x, 12.x, 11.x |
| 570+ | 12.8+ (Blackwell), 12.x, 11.x |
| 545+ | 12.3-12.7, 11.x |
| 525-535 | 12.0-12.2, 11.x |
| 450+ | 11.x only |

NCCL: Use 2.18+ for CUDA 12.x, 2.12+ for CUDA 11.x. Must be consistent across all nodes.

| EFA Installer | AWS OFI NCCL |
|---|---|
| 1.29+ | v1.7.3+ (recommended) |
| 1.26-1.28 | v1.7.0-v1.7.2 |
| 1.20-1.25 | v1.6.0+ |
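The driver table and the NCCL guidance above can be expressed as two small lookup functions. This is a sketch of the reference rules only, not the script's own analysis code; the function names are hypothetical, and it assumes the highest matching driver row wins, so that driver majors between the listed rows fall into the row below them.

```bash
#!/usr/bin/env bash
# Sketch: encode the reference tables as lookups.
# NOT the analysis logic of hyperpod_check_versions.sh.

# Print the CUDA versions listed for a driver version such as 535.104.05.
# Only the major number is used; the highest matching row wins (assumption).
supported_cuda() {
  local major="${1%%.*}"
  if   [ "$major" -ge 580 ]; then echo "13.x, 12.x, 11.x"
  elif [ "$major" -ge 570 ]; then echo "12.8+ (Blackwell), 12.x, 11.x"
  elif [ "$major" -ge 545 ]; then echo "12.3-12.7, 11.x"
  elif [ "$major" -ge 525 ]; then echo "12.0-12.2, 11.x"
  elif [ "$major" -ge 450 ]; then echo "11.x only"
  else echo "unknown"; return 1
  fi
}

# Return 0 if NCCL version $1 meets the minimum listed for CUDA major $2
# (2.18+ for CUDA 12, 2.12+ for CUDA 11). Return 2 for other CUDA majors,
# since the reference gives no NCCL guidance for them.
nccl_ok_for_cuda() {
  local nccl="$1" cuda_major="$2" min_minor
  case "$cuda_major" in
    12) min_minor=18 ;;
    11) min_minor=12 ;;
    *)  return 2 ;;
  esac
  local major="${nccl%%.*}" rest="${nccl#*.}"
  local minor="${rest%%.*}"
  # Both listed minimums are 2.x, so compare against major 2.
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge "$min_minor" ]; }
}
```

For example, `supported_cuda 535.104.05` prints `12.0-12.2, 11.x`, and `nccl_ok_for_cuda 2.17.1 12` returns non-zero because the guidance calls for 2.18+ on CUDA 12.x.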