build-and-dependency
Build and Dependency
Two core principles: build and develop inside containers, and always use uv.
Why Containers
Megatron Bridge depends on CUDA, NCCL, PyTorch with GPU support, Transformer
Engine, and optional components like TRT-LLM, vLLM, and DeepEP. Installing
these on a bare host is fragile and hard to reproduce. The project ships
production-quality Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across developers and CI.
- `uv.lock` resolves the same way locally and in CI (the lockfile is Linux-only; it cannot be regenerated on macOS).
- GPU-dependent operations work out of the box.
Container Options
Option 1: NeMo Framework Container (fastest)
Find available tags at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags
```bash
skopeo list-tags docker://nvcr.io/nvidia/nemo \
  | python3 -c "import sys,json,re; tags=json.load(sys.stdin)['Tags']; [print(t) for t in sorted((t for t in tags if re.match(r'^\d{2}\.\d{2}', t)), reverse=True)]"
```

```bash
docker run --rm -it --gpus all --shm-size=24g \
  nvcr.io/nvidia/nemo:<tag> \
  bash
```
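To pick the newest release tag without reading the list by eye, the same filter can be written in plain shell. This is a minimal sketch, assuming tags arrive one per line on stdin and that release tags start with the `YY.MM` prefix the one-liner matches. The function name `newest_release_tag` and the sample tags are hypothetical.

```bash
# Hypothetical helper: read tags (one per line) on stdin and print the newest
# one that starts with a YY.MM release prefix.
newest_release_tag() {
  # LC_ALL=C forces a byte-wise sort, which orders fixed-width YY.MM prefixes correctly.
  grep -E '^[0-9]{2}\.[0-9]{2}' | LC_ALL=C sort -r | head -n 1
}

# Sample input only; these tag names are made up for illustration.
printf '%s\n' 24.12 25.02 latest 25.02.01 dev | newest_release_tag
```

The printed value can then be substituted for `<tag>` in the `docker run` command.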
Option 2: Build the Megatron Bridge Container
See @docker/README.md for build commands, build arguments, and the full NeMo-FW image stack.
Running the Container
```bash
docker run --rm -it -w /opt/Megatron-Bridge \
  -v $(pwd):/opt/Megatron-Bridge \
  -v $HOME/.cache/uv:/root/.cache/uv \
  --gpus all \
  --shm-size=24g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  megatron-bridge:latest \
  bash
```

Mounting `$HOME/.cache/uv` avoids re-downloading wheels on every run.
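One practical detail about the cache mount: when the host side of a `-v` bind mount does not exist, Docker creates it as a directory, typically owned by root. Creating the cache directory beforehand avoids that. This is a minimal sketch; the function name `ensure_uv_cache_dir` is hypothetical.

```bash
# Hypothetical helper: create the host-side uv cache directory (default
# $HOME/.cache/uv) and print its path, so Docker does not have to create it.
ensure_uv_cache_dir() {
  cache_dir="${1:-$HOME/.cache/uv}"
  mkdir -p "$cache_dir" && printf '%s\n' "$cache_dir"
}

# Example (not executed here):
#   docker run ... -v "$(ensure_uv_cache_dir)":/root/.cache/uv ...
```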
Containers on Slurm
On Slurm clusters with Enroot/Pyxis, pass containers directly to `srun`:

```bash
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd /opt/Megatron-Bridge && uv run --no-sync python ..."
```

If you bind-mount a custom source tree into the container, only local rank 0 on each node should sync while the other ranks wait:

```bash
if [ "$SLURM_LOCALID" -eq 0 ]; then uv sync; else sleep 10; fi
```

Note: `--no-container-mount-home` is an `srun` flag, not an `#SBATCH` directive. Set `UV_CACHE_DIR` to shared storage to avoid filling `/root/.cache/`.
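The fixed `sleep 10` in the sync snippet above can be too short when `uv sync` has real work to do. A marker file is one alternative. This is a minimal sketch, assuming all ranks on a node can see the same marker path and that Slurm sets `SLURM_LOCALID`. The function name, marker path, and timeout are hypothetical.

```bash
# Hypothetical helper: local rank 0 runs the given command and then creates a
# marker file; every other local rank polls for the marker instead of sleeping
# for a fixed time.
# Usage: sync_once_per_node <marker> <timeout_seconds> <command...>
sync_once_per_node() {
  marker="$1"; timeout_s="$2"; shift 2
  if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
    # Only create the marker when the command succeeded.
    "$@" && touch "$marker"
  else
    waited=0
    while [ ! -e "$marker" ]; do
      if [ "$waited" -ge "$timeout_s" ]; then
        return 1   # give up: rank 0 never finished
      fi
      sleep 1
      waited=$((waited + 1))
    done
  fi
}

# Example (not executed here):
#   sync_once_per_node "/tmp/uv-sync-${SLURM_JOB_ID}.done" 600 uv sync
```

The waiting ranks return a non-zero status on timeout, so the job can fail fast instead of starting with a half-synced environment.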
Always Use uv
Never use `pip install`, `conda`, or bare `python`; always go through `uv`.
All `uv` commands must be run inside a container. Never install or upgrade
dependencies outside the CI container.
Essential Commands
| Task | Command |
|---|---|
| Install all deps from lockfile | `uv sync --locked` |
| Install with all extras and dev groups | |
| Run a Python command | `uv run python ...` |
| Run distributed training | `uv run python -m torch.distributed.run --nproc_per_node=<n> ...` |
| Add a new dependency | `uv add <package>` |
| Add an optional dependency | `uv add --optional --extra <group> <package>` |
| Regenerate the lockfile | `uv lock` |
| Install pre-commit hooks | `uv run --group dev pre-commit install` |
Adding Dependencies
Submit dependency changes as a separate PR before the feature PR:

```bash
# Optional dependency (preferred)
uv add --optional --extra <group> <package>

# Required dependency (needs strong justification; affects all downstream users)
uv add <package>
```

Commit both modified files:

```bash
git add pyproject.toml uv.lock
git commit -s -m "[build] chore: add <package>"
```
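Because `pyproject.toml` and `uv.lock` are meant to be committed together, a small check can catch a commit that stages the former without the latter. This is a minimal sketch, assuming the changed paths are supplied one per line on stdin. The function name is hypothetical, and the check is a heuristic: a `pyproject.toml` edit that does not touch dependencies would also trigger it.

```bash
# Hypothetical check: read changed file paths (one per line) on stdin and fail
# when pyproject.toml is in the list but uv.lock is not.
check_lock_updated() {
  changed="$(cat)"
  if printf '%s\n' "$changed" | grep -qxF 'pyproject.toml' &&
     ! printf '%s\n' "$changed" | grep -qxF 'uv.lock'; then
    echo "pyproject.toml changed without uv.lock" >&2
    return 1
  fi
}

# Example (not executed here):
#   git diff --cached --name-only | check_lock_updated
```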
Regenerating uv.lock
The lockfile is Linux-only (it resolves CUDA wheels), so regenerate it inside Docker:

```bash
docker run --gpus all --rm \
  -v $(pwd):/opt/Megatron-Bridge \
  megatron-bridge:latest \
  bash -c 'cd /opt/Megatron-Bridge && uv lock'
```
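A guard in front of `uv lock` makes the Linux-only constraint explicit for anyone who runs it on the wrong host. This is a minimal sketch; the function name `require_linux` is hypothetical, and its optional argument exists only so the check can be tried with other values.

```bash
# Hypothetical guard: succeed only when the kernel name is Linux.
# With no argument it inspects the current host via `uname -s`.
require_linux() {
  os="${1:-$(uname -s)}"
  if [ "$os" != "Linux" ]; then
    echo "uv lock must run on Linux (detected: $os); use the container" >&2
    return 1
  fi
}

# Example (not executed here):
#   require_linux && uv lock
```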
Switching MCore Branches
```bash
# Switch to dev branch
./scripts/switch_mcore.sh dev
uv sync            # without --locked

# Switch back to main
./scripts/switch_mcore.sh main
uv sync --locked   # lockfile matches again
```
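The rule above (use `--locked` only when MCore is on main) can be captured in a small helper so the right `uv sync` form is chosen consistently. This is a minimal sketch; the function name is hypothetical, and treating every branch other than `main` like `dev` is an assumption.

```bash
# Hypothetical helper: print the uv sync command that suits an MCore branch.
# Only "main" matches the committed lockfile, so only it gets --locked.
uv_sync_cmd_for() {
  case "$1" in
    main) echo "uv sync --locked" ;;
    *)    echo "uv sync" ;;          # assumption: any non-main branch behaves like dev
  esac
}

uv_sync_cmd_for dev
uv_sync_cmd_for main
```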
Quick Start
```bash
# 1. Clone and init submodules
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
cd megatron-bridge
git submodule update --init 3rdparty/Megatron-LM

# 2. Build the container
docker build -f docker/Dockerfile.ci --target megatron_bridge -t megatron-bridge:latest .

# 3. Start a dev shell
docker run --rm -it -v $(pwd):/opt/Megatron-Bridge --gpus all --shm-size=24g megatron-bridge:latest bash

# 4. Install pre-commit hooks (inside container)
uv run --group dev pre-commit install

# 5. Sanity check
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  train.train_iters=5 train.global_batch_size=8 train.micro_batch_size=4 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=5 \
  logger.log_interval=1
```
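The batch-size overrides in the sanity check have to be mutually consistent. In Megatron-style training the global batch is split into micro-batches across data-parallel ranks, so the number of micro-batches accumulated per iteration is `global_batch_size / (micro_batch_size * data_parallel_size)`. This is a minimal sketch of that arithmetic, assuming a data-parallel size of 1 (one process, no model parallelism). The function name is hypothetical.

```bash
# Hypothetical helper: number of micro-batches accumulated per training iteration.
# Usage: num_microbatches <global_batch_size> <micro_batch_size> <data_parallel_size>
num_microbatches() {
  global_bs="$1"; micro_bs="$2"; dp_size="$3"
  per_step=$((micro_bs * dp_size))
  # The global batch must divide evenly into micro-batches across all ranks.
  if [ $((global_bs % per_step)) -ne 0 ]; then
    echo "global batch size must be divisible by micro_batch_size * data_parallel_size" >&2
    return 1
  fi
  echo $((global_bs / per_step))
}

num_microbatches 8 4 1   # values from the sanity check; prints 2
```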
Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| Failure when running on macOS | Lockfile resolves CUDA wheels that don't exist on macOS | Run inside Docker or on a Linux machine |
| Error after `pip install` | Package was installed with pip outside the uv-managed venv | Use `uv` instead of `pip install` |
| Failure after switching MCore branch | Lockfile generated against main MCore | On the dev branch, use `uv sync` without `--locked` |
| Error inside the container | Container doesn't have uv | Use the `megatron-bridge:latest` image built in Quick Start |
| Error during uv operations | Cache fills the container's `/root/.cache/` | Set `UV_CACHE_DIR` to shared storage |
| Pre-commit fails with ruff errors | Code style violations | Fix the violations that ruff reports |
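Two of the pitfalls above (missing `uv`, unset cache location) can be detected before running anything. This is a minimal sketch of such a pre-flight check; the function name and messages are hypothetical, and the optional argument exists only so the check can be tried with another command name.

```bash
# Hypothetical pre-flight check: report a missing uv binary and an unset
# UV_CACHE_DIR. Returns non-zero if either problem is found.
preflight() {
  uv_bin="${1:-uv}"
  status=0
  if ! command -v "$uv_bin" >/dev/null 2>&1; then
    echo "$uv_bin not found: use the project container"
    status=1
  fi
  if [ -z "${UV_CACHE_DIR:-}" ]; then
    echo "UV_CACHE_DIR is unset: uv will use its default cache location"
    status=1
  fi
  return "$status"
}

# Example (not executed here):
#   preflight || exit 1
```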