Dev environment setup for Megatron Bridge: container-based development, uv package management, lockfile regeneration, adding dependencies, Slurm container usage, and common build pitfalls.
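
Install the skill:

```bash
npx skill4agent add nvidia/skills build-and-dependency
```

List the release-style tags of the NeMo container, highest first:

```bash
skopeo list-tags docker://nvcr.io/nvidia/nemo \
  | python3 -c "import sys,json,re; tags=json.load(sys.stdin)['Tags']; [print(t) for t in sorted((t for t in tags if re.match(r'^\d{2}\.\d{2}', t)), reverse=True)]"
```

Start an interactive shell in a NeMo container:

```bash
docker run --rm -it --gpus all --shm-size=24g \
  nvcr.io/nvidia/nemo:<tag> \
  bash
```

Start a dev shell in the locally built `megatron-bridge:latest` image. The checkout is mounted at `/opt/Megatron-Bridge`, and mounting `$HOME/.cache/uv` lets the container reuse the host's uv cache:

```bash
docker run --rm -it -w /opt/Megatron-Bridge \
  -v $(pwd):/opt/Megatron-Bridge \
  -v $HOME/.cache/uv:/root/.cache/uv \
  --gpus all \
  --shm-size=24g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  megatron-bridge:latest \
  bash
```

On Slurm, launch through `srun` with the container flags:

```bash
srun --mpi=pmix \
  --container-image="$CONTAINER_IMAGE" \
  --container-mounts="$CONTAINER_MOUNTS" \
  --no-container-mount-home \
  bash -c "cd /opt/Megatron-Bridge && uv run --no-sync python ..."
```

If dependencies have to be synced inside the job, run `uv sync` on local rank 0 only while the other ranks wait:

```bash
if [ "$SLURM_LOCALID" -eq 0 ]; then uv sync; else sleep 10; fi
```

`--no-container-mount-home` is passed to `srun`, not set as an `#SBATCH` directive. Set `UV_CACHE_DIR` to keep the uv cache out of `/root/.cache/`. Avoid `pip install`, `conda`, and bare `python`; install and run through `uv` instead.

| Task | Command |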
|---|---|
| Install all deps from lockfile | |
| Install with all extras and dev groups | |
| Run a Python command | |
| Run distributed training | |
| Add a new dependency | |
| Add an optional dependency | |
| Regenerate the lockfile | |
| Install pre-commit hooks | |
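
The `skopeo list-tags` one-liner earlier on this page packs its tag filter into a single `python3 -c` string. As a readable sketch of the same logic, the function below keeps only tags that start with two digits, a dot, and two digits, then sorts them in descending string order. The function name `release_tags` and the sample payload are hypothetical and exist only for illustration; real tag names will differ.

```python
import json
import re

# Tags that start like "25.04" (two digits, a dot, two digits), as in the one-liner.
RELEASE_TAG = re.compile(r"^\d{2}\.\d{2}")


def release_tags(list_tags_json: str) -> list[str]:
    """Return release-style tags from `skopeo list-tags` JSON, highest first."""
    tags = json.loads(list_tags_json)["Tags"]
    # Plain string sort, exactly like the one-liner; no version parsing.
    return sorted((t for t in tags if RELEASE_TAG.match(t)), reverse=True)


# Hypothetical sample payload, shaped like the JSON that skopeo prints.
sample = json.dumps({"Tags": ["latest", "24.12", "25.04.01", "dev", "25.04"]})
print(release_tags(sample))  # ['25.04.01', '25.04', '24.12']
```

The ordering is a string comparison, which matches release order only as long as the tags keep the zero-padded `YY.MM` prefix.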

Add a dependency:

```bash
# Optional dependency (preferred); <group> is the extra it is added to
uv add --optional <group> <package>
# Required dependency (needs strong justification: it affects every downstream user)
uv add <package>
```

Commit `pyproject.toml` and `uv.lock` together:

```bash
git add pyproject.toml uv.lock
git commit -s -m "[build] chore: add <package>"
```

Regenerate the lockfile inside the container:

```bash
docker run --gpus all --rm \
  -v $(pwd):/opt/Megatron-Bridge \
  megatron-bridge:latest \
  bash -c 'cd /opt/Megatron-Bridge && uv lock'
```

Switch MCore between the `dev` and `main` branches:

```bash
# Switch to dev branch
./scripts/switch_mcore.sh dev
uv sync # without --locked
# Switch back to main
./scripts/switch_mcore.sh main
uv sync --locked # lockfile matches again
```

Set up a dev environment from a fresh clone:

```bash
# 1. Clone and init submodules
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
cd megatron-bridge
git submodule update --init 3rdparty/Megatron-LM
# 2. Build the container
docker build -f docker/Dockerfile.ci --target megatron_bridge -t megatron-bridge:latest .
# 3. Start a dev shell
docker run --rm -it -v $(pwd):/opt/Megatron-Bridge --gpus all --shm-size=24g megatron-bridge:latest bash
# 4. Install pre-commit hooks (inside container)
uv run --group dev pre-commit install
# 5. Sanity check
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  train.train_iters=5 train.global_batch_size=8 train.micro_batch_size=4 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=5 \
  logger.log_interval=1
```

| Problem | Cause | Fix |
|---|---|---|
| | Lockfile resolves CUDA wheels that don't exist on macOS | Run inside Docker or on a Linux machine |
| | pip installed outside uv-managed venv | Use |
| | Lockfile generated against main MCore | Use |
| | Container doesn't have uv | Use the |
| | Cache fills container's | Set |
| Pre-commit fails with ruff errors | Code style violations | Run |
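
The cache row in the table above is easier to diagnose once you know which directory uv is writing to. The sketch below approximates uv's cache lookup on Linux and reports the free space on that filesystem. It is an approximation under stated assumptions: only `UV_CACHE_DIR`, `XDG_CACHE_HOME`, and `HOME` are considered, the `--cache-dir` flag and config-file settings are ignored, and the `/root` fallback assumes the process runs as root in the container. The helper names `resolve_uv_cache_dir` and `free_gib` are hypothetical.

```python
import os
import shutil
from pathlib import Path


def resolve_uv_cache_dir(env=None) -> Path:
    """Approximate where uv keeps its cache on Linux.

    Assumption: only UV_CACHE_DIR, XDG_CACHE_HOME and HOME are consulted;
    the --cache-dir flag and config-file settings are ignored here.
    """
    env = os.environ if env is None else env
    if env.get("UV_CACHE_DIR"):
        return Path(env["UV_CACHE_DIR"])
    if env.get("XDG_CACHE_HOME"):
        return Path(env["XDG_CACHE_HOME"]) / "uv"
    # Assumed fallback: root's home directory, as inside the container.
    return Path(env.get("HOME", "/root")) / ".cache" / "uv"


def free_gib(path: Path) -> float:
    """Free space in GiB on the filesystem holding the nearest existing parent."""
    probe = path
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 2**30


if __name__ == "__main__":
    cache = resolve_uv_cache_dir()
    print(f"uv cache: {cache} ({free_gib(cache):.1f} GiB free)")
```

Running it once inside the container and once with `UV_CACHE_DIR` exported shows whether the cache has moved off the container's own filesystem.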