Build and Dependency

Two core principles: build and develop inside containers, and always use uv.

Why Containers

Megatron Bridge depends on CUDA, NCCL, PyTorch with GPU support, Transformer Engine, and optional components like TRT-LLM, vLLM, and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships production-quality Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
  • Identical CUDA / NCCL / cuDNN versions across developers and CI.
  • `uv.lock` resolves the same way locally and in CI (the lockfile is Linux-only; it cannot be regenerated on macOS).
  • GPU-dependent operations work out of the box.

Container Options

Option 1: NeMo Framework Container (fastest)

```bash
# List NeMo container tags that look like releases (NN.NN...), newest first
skopeo list-tags docker://nvcr.io/nvidia/nemo \
  | python3 -c "import sys,json,re; tags=json.load(sys.stdin)['Tags']; [print(t) for t in sorted((t for t in tags if re.match(r'^\d{2}\.\d{2}', t)), reverse=True)]"
```

```bash
# Start an interactive shell in the chosen image
docker run --rm -it --gpus all --shm-size=24g \
  nvcr.io/nvidia/nemo:<tag> \
  bash
```
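
The tag filter in the listing command above can also be written as a small shell function, which makes it easier to reuse, for example to pick only the newest tag. This is a minimal sketch: the function name `filter_release_tags` is hypothetical, and it assumes tags arrive one per line on stdin. Like the Python one-liner, it sorts tags as plain strings rather than as parsed versions.

```bash
# Hypothetical helper: keep tags that start with two digits, a dot, and two
# digits (e.g. 25.09), and print them in reverse lexicographic order.
# Reads one tag per line on stdin.
filter_release_tags() {
  grep -E '^[0-9]{2}\.[0-9]{2}' | LC_ALL=C sort -r
}
```

Assuming `jq` is installed, one possible use is `skopeo list-tags docker://nvcr.io/nvidia/nemo | jq -r '.Tags[]' | filter_release_tags | head -n 1`, which prints only the first tag in that order.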

Option 2: Build the Megatron Bridge Container

See @docker/README.md for build commands, build arguments, and the full NeMo-FW image stack.

Running the Container

```bash
docker run --rm -it -w /opt/Megatron-Bridge \
  -v $(pwd):/opt/Megatron-Bridge \
  -v $HOME/.cache/uv:/root/.cache/uv \
  --gpus all \
  --shm-size=24g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  megatron-bridge:latest \
  bash
```

Mounting `$HOME/.cache/uv` avoids re-downloading wheels on every run.

Containers on Slurm

On Slurm clusters with Enroot/Pyxis, pass containers directly to `srun`:

```bash
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd /opt/Megatron-Bridge && uv run --no-sync python ..."
```

If you bind-mount a custom source tree into the container, only local rank 0 on each node (`SLURM_LOCALID` equal to 0) should sync while the other ranks wait:

```bash
if [ "$SLURM_LOCALID" -eq 0 ]; then uv sync; else sleep 10; fi
```

Note: `--no-container-mount-home` is an `srun` flag, not an `#SBATCH` directive. Set `UV_CACHE_DIR` to shared storage to avoid filling `/root/.cache/`.
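
The fixed `sleep 10` in the rank-0 sync snippet above only helps if the sync finishes within ten seconds. One possible alternative is to have the waiting ranks poll for a marker file. This is a sketch, not the project's method: the function name `sync_once_per_node`, the marker path, and the `SYNC_TIMEOUT` variable are all hypothetical, and it assumes every task on a node sees the same filesystem at the marker path.

```bash
# Hypothetical helper: local rank 0 runs the given command and then creates a
# marker file; every other local rank polls until the marker appears.
# Usage: sync_once_per_node <marker-path> <command> [args...]
sync_once_per_node() {
  local marker="$1"
  shift
  if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
    # Create the marker only if the command succeeded.
    "$@" && touch "$marker"
  else
    local waited=0
    until [ -e "$marker" ]; do
      sleep 1
      waited=$((waited + 1))
      # SYNC_TIMEOUT (seconds) is an assumed knob; it defaults to 600.
      if [ "$waited" -ge "${SYNC_TIMEOUT:-600}" ]; then
        echo "timed out waiting for $marker" >&2
        return 1
      fi
    done
  fi
}
```

A call might look like `sync_once_per_node "/tmp/uv-sync.${SLURM_JOB_ID}.$(hostname).done" uv sync`. Including the job ID and hostname in the marker name is meant to avoid picking up a stale marker from an earlier job or another node.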

Always Use uv

Never use `pip install`, `conda`, or bare `python`; always go through `uv`. All `uv` commands must be run inside a container. Never install or upgrade dependencies outside the CI container.

Essential Commands

| Task | Command |
|---|---|
| Install all deps from lockfile | `uv sync --locked` |
| Install with all extras and dev groups | `uv sync --locked --all-extras --all-groups` |
| Run a Python command | `uv run python script.py` |
| Run distributed training | `uv run python -m torch.distributed.run --nproc_per_node=N script.py` |
| Add a new dependency | `uv add <package>` |
| Add an optional dependency | `uv add --optional --extra <group> <package>` |
| Regenerate the lockfile | `uv lock` (Linux/container only) |
| Install pre-commit hooks | `uv run --group dev pre-commit install` |

Adding Dependencies

Submit dependency changes as a separate PR before the feature PR:

```bash
# Optional dependency (preferred)
uv add --optional --extra <group> <package>

# Required dependency (needs strong justification; affects all downstream users)
uv add <package>
```

Commit both modified files:

```bash
git add pyproject.toml uv.lock
git commit -s -m "[build] chore: add <package>"
```
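
Because a dependency change touches both `pyproject.toml` and `uv.lock`, a quick check before committing can catch the case where only one of them was staged. The sketch below is a rough heuristic, not part of the project's tooling: the function name `dep_files_staged_together` is hypothetical, and it will also flag legitimate commits that change only one of the two files (for example, a tool-configuration edit in `pyproject.toml`).

```bash
# Hypothetical check: succeed when pyproject.toml and uv.lock are either both
# staged or both absent; fail when only one of them is staged.
# Reads one staged path per line on stdin.
dep_files_staged_together() {
  local staged has_pyproject=no has_lock=no
  staged="$(cat)"
  printf '%s\n' "$staged" | grep -qx 'pyproject\.toml' && has_pyproject=yes
  printf '%s\n' "$staged" | grep -qx 'uv\.lock' && has_lock=yes
  [ "$has_pyproject" = "$has_lock" ]
}
```

It could be fed from Git with `git diff --cached --name-only | dep_files_staged_together || echo "pyproject.toml and uv.lock are not staged together"`.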

Regenerating uv.lock

The lockfile is Linux-only (resolves CUDA wheels). Run inside Docker:

```bash
docker run --gpus all --rm \
  -v $(pwd):/opt/Megatron-Bridge \
  megatron-bridge:latest \
  bash -c 'cd /opt/Megatron-Bridge && uv lock'
```

Switching MCore Branches

```bash
# Switch to dev branch
./scripts/switch_mcore.sh dev
uv sync  # without --locked

# Switch back to main
./scripts/switch_mcore.sh main
uv sync --locked  # lockfile matches again
```
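
The two steps for each branch can be folded into one wrapper so that the matching `uv sync` variant is never forgotten. This is a minimal sketch under stated assumptions: the function name `switch_and_sync` and the `DRY_RUN` variable are hypothetical, it must be run from the repository root, and it assumes `main` is the only branch whose MCore version matches the committed lockfile.

```bash
# Hypothetical wrapper around scripts/switch_mcore.sh.
# Usage: switch_and_sync <branch>
# Set DRY_RUN=1 to print the commands instead of running them.
switch_and_sync() {
  local branch="$1"
  local run="${DRY_RUN:+echo}"   # expands to "echo" only when DRY_RUN is non-empty
  $run ./scripts/switch_mcore.sh "$branch" || return 1
  if [ "$branch" = "main" ]; then
    $run uv sync --locked        # lockfile matches main
  else
    $run uv sync                 # lockfile does not match; resolve without --locked
  fi
}
```

Running `DRY_RUN=1 switch_and_sync dev` prints the two commands that would be executed, which is a cheap way to confirm the branch-to-flag mapping before touching the environment.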

Quick Start

```bash
# 1. Clone and init submodules
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
cd megatron-bridge
git submodule update --init 3rdparty/Megatron-LM

# 2. Build the container
docker build -f docker/Dockerfile.ci --target megatron_bridge -t megatron-bridge:latest .

# 3. Start a dev shell
docker run --rm -it -v $(pwd):/opt/Megatron-Bridge --gpus all --shm-size=24g megatron-bridge:latest bash

# 4. Install pre-commit hooks (inside container)
uv run --group dev pre-commit install

# 5. Sanity check
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  train.train_iters=5 train.global_batch_size=8 train.micro_batch_size=4 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=5 \
  logger.log_interval=1
```

Common Pitfalls

| Problem | Cause | Fix |
|---|---|---|
| `uv sync --locked` fails on macOS | Lockfile resolves CUDA wheels that don't exist on macOS | Run inside Docker or on a Linux machine |
| `ModuleNotFoundError` after `pip install` | `pip` installed outside the uv-managed venv | Use `uv add` + `uv sync`, never bare `pip install` |
| `uv sync --locked` fails after MCore branch switch | Lockfile generated against main MCore | Use `uv sync` (without `--locked`) on dev |
| `uv: command not found` inside container | Container doesn't have uv | Use the `megatron-bridge` image built from `Dockerfile.ci` |
| `No space left on device` during uv ops | Cache fills container's `/root/.cache/` | Set `UV_CACHE_DIR` to shared/persistent storage |
| Pre-commit fails with ruff errors | Code style violations | Run `uv run ruff check --fix . && uv run ruff format .` |
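
Several of the pitfalls above can be detected before any `uv` command runs. The following pre-flight check is only a sketch: the function name `uv_preflight` is hypothetical, and the optional OS argument exists so the platform check can be exercised without changing machines.

```bash
# Hypothetical pre-flight check mirroring the pitfalls listed above.
# Usage: uv_preflight [os_name]   (os_name defaults to the output of `uname -s`)
# Returns the number of blocking problems found (0 means none).
uv_preflight() {
  local os="${1:-$(uname -s)}"
  local problems=0
  if [ "$os" != "Linux" ]; then
    echo "platform: $os is not Linux; run inside Docker or on a Linux machine"
    problems=$((problems + 1))
  fi
  if ! command -v uv >/dev/null 2>&1; then
    echo "uv: not on PATH; use the megatron-bridge image built from Dockerfile.ci"
    problems=$((problems + 1))
  fi
  if [ -z "${UV_CACHE_DIR:-}" ]; then
    # Advisory only: an unset cache directory is not counted as a problem.
    echo "cache: UV_CACHE_DIR is unset; consider shared/persistent storage"
  fi
  return "$problems"
}
```

Calling `uv_preflight` at the top of a job script and aborting on a non-zero status gives an early, readable message instead of a failure midway through a sync.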