build-and-dependency

Build and Dependency

Two core principles: build and develop inside containers, and always use uv.

Why Containers

Megatron Bridge depends on CUDA, NCCL, PyTorch with GPU support, Transformer Engine, and optional components like TRT-LLM, vLLM, and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships production-quality Dockerfiles that pin every dependency.

Use the container as your development environment. This guarantees:

- Identical CUDA / NCCL / cuDNN versions across developers and CI.
- `uv.lock` resolves the same way locally and in CI (the lockfile is Linux-only; it cannot be regenerated on macOS).
- GPU-dependent operations work out of the box.

Container Options

Option 1: NeMo Framework Container (fastest)

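Find available tags at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags, or list the versioned tags from the command line and then start the container with the tag you picked: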

bash

skopeo list-tags docker://nvcr.io/nvidia/nemo \
  | python3 -c "import sys,json,re; tags=json.load(sys.stdin)['Tags']; [print(t) for t in sorted((t for t in tags if re.match(r'^\d{2}\.\d{2}', t)), reverse=True)]"

bash

docker run --rm -it --gpus all --shm-size=24g \
  nvcr.io/nvidia/nemo:<tag> \
  bash
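
The inline Python filter above is dense. As a rough equivalent, the sketch below keeps only tags that begin with a two-digit `YY.MM` prefix and sorts them newest first. `filter_release_tags` is a hypothetical helper name, and using `jq` to print one tag per line is an assumption that is not part of the original command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Megatron Bridge): keep only tags that start
# with a YY.MM version prefix, newest first. Reads one tag per line on stdin.
filter_release_tags() {
  # LC_ALL=C makes the sort byte-wise, so the order does not depend on locale.
  grep -E '^[0-9]{2}\.[0-9]{2}' | LC_ALL=C sort -r
}

# Example wiring (needs network access, skopeo, and jq, so it is commented out):
# skopeo list-tags docker://nvcr.io/nvidia/nemo | jq -r '.Tags[]' | filter_release_tags
```

Like the Python version, this is a plain string sort, which works as long as the tags keep a zero-padded `YY.MM` prefix.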

Option 2: Build the Megatron Bridge Container

See @docker/README.md for build commands, build arguments, and the full NeMo-FW image stack.

Running the Container

bash

docker run --rm -it -w /opt/Megatron-Bridge \
  -v $(pwd):/opt/Megatron-Bridge \
  -v $HOME/.cache/uv:/root/.cache/uv \
  --gpus all \
  --shm-size=24g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  megatron-bridge:latest \
  bash

Mounting `$HOME/.cache/uv` avoids re-downloading wheels on every run.

Containers on Slurm

On Slurm clusters with Enroot/Pyxis, pass containers directly to `srun`:

bash

srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd /opt/Megatron-Bridge && uv run --no-sync python ..."

If you bind-mount a custom source tree into the container, only local rank 0 on each node (`SLURM_LOCALID` equal to 0) should sync while the other tasks wait:

bash

if [ "$SLURM_LOCALID" -eq 0 ]; then uv sync; else sleep 10; fi
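
The fixed `sleep 10` above assumes the sync finishes within ten seconds. A sturdier variant, sketched here, has the syncing task create a marker file that the other tasks poll for. `run_then_mark`, `wait_for_marker`, and the marker path are all hypothetical, and the sketch assumes the marker lives on a path that every task on the node can see.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Megatron Bridge.

# Run a command, then create the marker file only if the command succeeded.
run_then_mark() {
  local marker="$1"; shift
  "$@" && touch "$marker"
}

# Poll once per second until the marker exists or the timeout (seconds) expires.
wait_for_marker() {
  local marker="$1" timeout="${2:-600}" waited=0
  until [ -e "$marker" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $marker" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example wiring inside the srun command (commented out; the marker path is an
# assumption and must be visible to all tasks on the node):
# marker="/tmp/uv-sync-${SLURM_JOB_ID}.done"
# if [ "$SLURM_LOCALID" -eq 0 ]; then
#   run_then_mark "$marker" uv sync
# else
#   wait_for_marker "$marker"
# fi
```

If the same job runs several steps, a stale marker from an earlier step would let waiters through immediately, so a per-step marker name may be needed.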

Note: `--no-container-mount-home` is an `srun` flag, not an `#SBATCH` directive. Set `UV_CACHE_DIR` to shared storage to avoid filling `/root/.cache/`.
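
As an illustration of the cache advice, the sketch below builds a `CONTAINER_MOUNTS` value that includes a shared uv cache directory. `join_mounts` is a hypothetical helper and every path is a placeholder; the comma-separated `SRC:DST` list is the format Pyxis accepts for `--container-mounts`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join SRC:DST pairs with commas.
join_mounts() {
  local IFS=,
  # "$*" joins the arguments with the first character of IFS.
  echo "$*"
}

# Example wiring (commented out; replace the placeholder paths with real ones):
# export UV_CACHE_DIR="/shared/$USER/uv-cache"
# CONTAINER_MOUNTS="$(join_mounts "$PWD:/opt/Megatron-Bridge" "$UV_CACHE_DIR:$UV_CACHE_DIR")"
```

Mounting the cache at the same path inside the container keeps `UV_CACHE_DIR` valid on both sides, provided the variable is also set in the container environment.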

Always Use uv

Never use `pip install`, `conda`, or bare `python`; always go through `uv`. All `uv` commands must be run inside a container. Never install or upgrade dependencies outside the CI container.

Essential Commands

| Task | Command |
| --- | --- |
| Install all deps from lockfile | `uv sync --locked` |
| Install with all extras and dev groups | `uv sync --locked --all-extras --all-groups` |
| Run a Python command | `uv run python script.py` |
| Run distributed training | `uv run python -m torch.distributed.run --nproc_per_node=N script.py` |
| Add a new dependency | `uv add <package>` |
| Add an optional dependency | `uv add --optional --extra <group> <package>` |
| Regenerate the lockfile | `uv lock` (Linux/container only) |
| Install pre-commit hooks | `uv run --group dev pre-commit install` |

Adding Dependencies

Submit dependency changes as a separate PR before the feature PR:

```bash
# Optional dependency (preferred)
uv add --optional --extra <group> <package>

# Required dependency (needs strong justification; affects all downstream)
uv add <package>
```

Commit both modified files:

```bash
git add pyproject.toml uv.lock
git commit -s -m "[build] chore: add <package>"
```

Regenerating uv.lock

The lockfile is Linux-only (resolves CUDA wheels). Run inside Docker:

bash

docker run --gpus all --rm \
  -v $(pwd):/opt/Megatron-Bridge \
  megatron-bridge:latest \
  bash -c 'cd /opt/Megatron-Bridge && uv lock'

Switching MCore Branches

bash

# Switch to dev branch
./scripts/switch_mcore.sh dev
uv sync  # without --locked

# Switch back to main
./scripts/switch_mcore.sh main
uv sync --locked  # lockfile matches again

Quick Start

bash

# 1. Clone and init submodules
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
cd megatron-bridge
git submodule update --init 3rdparty/Megatron-LM

# 2. Build the container
docker build -f docker/Dockerfile.ci --target megatron_bridge -t megatron-bridge:latest .

# 3. Start a dev shell
docker run --rm -it -v $(pwd):/opt/Megatron-Bridge --gpus all --shm-size=24g megatron-bridge:latest bash

# 4. Install pre-commit hooks (inside container)
uv run --group dev pre-commit install

# 5. Sanity check
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  train.train_iters=5 train.global_batch_size=8 train.micro_batch_size=4 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=5 \
  logger.log_interval=1

Common Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| `uv sync --locked` fails on macOS | Lockfile resolves CUDA wheels that don't exist on macOS | Run inside Docker or on a Linux machine |
| `ModuleNotFoundError` after pip install | pip installed outside uv-managed venv | Use `uv add` + `uv sync`, never bare `pip install` |
| `uv sync --locked` fails after MCore branch switch | Lockfile generated against main MCore | Use `uv sync` (without `--locked`) on dev |
| `uv: command not found` inside container | Container doesn't have uv | Use the `megatron-bridge` image built from `Dockerfile.ci` |
| `No space left on device` during uv ops | Cache fills container's `/root/.cache/` | Set `UV_CACHE_DIR` to shared/persistent storage |
| Pre-commit fails with ruff errors | Code style violations | Run `uv run ruff check --fix . && uv run ruff format .` |
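
Two of the rows above come down to running on the wrong platform or in an image without uv. A small preflight check can catch both before any uv command runs. The sketch below assumes it is executed in the same shell where uv is about to be called; `preflight` is a hypothetical name, not a script shipped with the project.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check, not part of Megatron Bridge.
preflight() {
  local status=0
  # The lockfile resolves CUDA wheels, so anything other than Linux will fail.
  if [ "$(uname -s)" != "Linux" ]; then
    echo "not Linux: run inside the container or on a Linux machine" >&2
    status=1
  fi
  # uv must be on PATH; the image built from Dockerfile.ci provides it.
  if ! command -v uv >/dev/null 2>&1; then
    echo "uv not found on PATH" >&2
    status=1
  fi
  return "$status"
}

# Example: preflight && uv sync --locked
```

Both checks always run, so a single call reports every problem it finds instead of stopping at the first one.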
