CI/CD reference for Megatron Bridge — pipeline structure, commit and PR workflow, CI failure investigation, and common failure patterns.
npx skill4agent add nvidia/skills cicd
main
git commit -s -m "message"
[{areas}] {type}: {description}
[model] feat: Add Qwen3 model bridge
push
pull_request
copy-pr-bot
/ok to test <commit-sha>
copy-pr-bot
pull-request/<number>
push
refs/heads/pull-request/<number>
/ok to test
pull-request/<number>
/ok to test <new-sha>
pre-flight
└── lint-check
    └── cicd-wait-in-queue # queues workflows to avoid runner interleaving across PRs
        └── cicd-container-build
            ├── unit-tests-core
            ├── unit-tests-diffusion
            └── functional-tests (L0 always; L1 with needs-more-tests label; L2 on schedule or full-test-suite label)
testing
# Extract PR number from branch name (e.g. pull-request/1234)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr checks
gh run list --repo NVIDIA-NeMo/Megatron-Bridge --branch "pull-request/$PR_NUMBER"
gh run view <run_id> --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > run.log
wc -l run.log
tail -200 run.log # start from the end
sed -n '1,200p' run.log # or scan forward in 200-line chunks
HF_HUB_OFFLINE=1
HF_HOME
HF_HUB_OFFLINE=1
list_repo_files()
trust_remote_code=True

| Symptom | Likely Cause | Action |
|---|---|---|
| CI never started on a PR | Commits not GPG-signed and no | Post |
| Lint job fails | | Run |
| Container build fails | Dependency conflict or stale | Re-run |
| Unit tests fail | Code regression or missing import | Run failing test locally; check the PR diff |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs; reproduce with |
| HF model fixture passes locally but fails in CI with | Test made a live Hugging Face Hub API/download call; CI has | Warm local cache, reproduce with |
| | Many PRs queued; automation serializes runners to avoid interleaving | Wait; or check queue depth in the Actions tab |
| MCore submodule mismatch | Pinned commit out of sync | Update |
| Stale checkpoint auto-resume | | |
| Port collision on Slurm (EADDRINUSE) | | Drop torchrun; use |
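
The investigation steps earlier in this reference (derive the PR number from a `pull-request/<number>` branch name, then page through `run.log` in 200-line chunks) can be sketched as two small bash helpers. This is a minimal sketch, not part of the repository: the function names `pr_number_from_branch` and `log_chunk` are hypothetical, and the branch parser is deliberately stricter than the `grep -oP` one-liner (it rejects trailing non-digits). It uses parameter expansion because `grep -P` is only available in GNU grep.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for CI triage; names are illustrative only.

# Print the PR number encoded in a branch name such as pull-request/1234.
# Returns non-zero for any branch that does not match that shape.
pr_number_from_branch() {
  local branch="$1"
  case "$branch" in
    pull-request/*) ;;          # expected prefix
    *) return 1 ;;              # e.g. main or a feature branch
  esac
  local number="${branch#pull-request/}"
  case "$number" in
    ''|*[!0-9]*) return 1 ;;    # empty or contains a non-digit
  esac
  printf '%s\n' "$number"
}

# Print one fixed-size chunk of a log file.
# Chunk 1 is lines 1-200, chunk 2 is lines 201-400, and so on.
# The third argument (chunk size) is optional and defaults to 200.
log_chunk() {
  local file="$1" chunk="$2" size="${3:-200}"
  local start=$(( (chunk - 1) * size + 1 ))
  local end=$(( chunk * size ))
  sed -n "${start},${end}p" "$file"
}
```

Assumed usage: `PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)")` sets the variable the `gh` commands expect, and `log_chunk run.log 2` prints lines 201-400 of the saved failure log.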