cicd
CI/CD
Commit and PR Workflow
- Never commit directly to `main`; always create a feature branch.
- Always sign commits: `git commit -s -m "message"`.
- PR title format: `[{areas}] {type}: {description}` (e.g., `[model] feat: Add Qwen3 model bridge`).

See @CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO requirements.
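
As a quick local sanity check, the title format above can be tested with a small shell function. This is a minimal sketch: `check_pr_title` is a hypothetical helper, and it only checks the overall `[areas] type: description` shape. The valid area and type names live in CONTRIBUTING.md and are not validated here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verifies only the shape "[areas] type: description".
# Allowed area/type values are defined in CONTRIBUTING.md and are NOT checked.
check_pr_title() {
  local title="$1"
  # "[" + one or more non-bracket characters + "] " + lowercase type + ": " + text
  local pattern='^\[[^][]+\] [[:lower:]]+: .+$'
  [[ "$title" =~ $pattern ]]
}

if check_pr_title "[model] feat: Add Qwen3 model bridge"; then
  echo "title format looks OK"
fi
```

Note that `git commit -s` adds a `Signed-off-by` trailer (the DCO sign-off), which is separate from a GPG signature created with `git commit -S`.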
How CI Is Triggered
The workflow is defined in @.github/workflows/cicd-main.yml and is triggered
on `push`, not on `pull_request`. This is intentional: a bot called
`copy-pr-bot` controls when CI runs.

Mechanism:
- When a PR is opened, `copy-pr-bot` watches for a trust signal.
- Trust is established in one of two ways:
  - All commits on the PR branch are GPG-signed by a verified NVIDIA contributor → bot triggers automatically.
  - An NVIDIAN posts `/ok to test <commit-sha>` as a PR comment → bot triggers manually for that SHA.
- Once trusted, `copy-pr-bot` copies the PR's code into the remote branch `pull-request/<number>` and pushes it.
- That push fires the workflow's `push` trigger on `refs/heads/pull-request/<number>`, launching CI.
Consequences:
- CI never runs on untrusted pushes; external contributors always need `/ok to test`.
- The running workflow branch is `pull-request/<number>`, not the author's feature branch.
- Pushing a new commit to a PR does not automatically re-trigger CI unless the commit is signed or `/ok to test <new-sha>` is posted.
- Concurrent runs for the same PR are cancelled automatically (concurrency group per PR number).
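
The trigger rules above can be modelled in a few lines of shell. This is an illustrative sketch of the decision logic as described here, not the bot's actual code. `ci_would_trigger` and `ci_ref_for_pr` are hypothetical names, and the model assumes an approval only counts when the approved SHA equals the current head SHA exactly.

```bash
#!/usr/bin/env bash
# Hypothetical model of the trust decision described above.
# Args: <all_commits_signed: yes|no> <head_sha> [approved_sha]
ci_would_trigger() {
  local signed="$1" head_sha="$2" approved_sha="${3:-}"
  if [ "$signed" = "yes" ]; then
    echo "auto"      # signed by a verified contributor: bot copies the PR itself
  elif [ -n "$approved_sha" ] && [ "$approved_sha" = "$head_sha" ]; then
    echo "manual"    # "/ok to test <sha>" was posted for exactly this commit
  else
    echo "none"      # untrusted push: nothing is copied, so CI never starts
  fi
}

# The ref the bot pushes to, which is what the workflow's push trigger sees.
ci_ref_for_pr() {
  printf 'refs/heads/pull-request/%s\n' "$1"
}

ci_would_trigger no def456 abc123   # approval was for an older commit
ci_ref_for_pr 1234
```

The third case is why a new unsigned commit needs a fresh `/ok to test <new-sha>`: the earlier approval names a different SHA.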
Pipeline Structure
```
pre-flight
└── lint-check
    └── cicd-wait-in-queue          # queues workflows to avoid runner interleaving across PRs
        └── cicd-container-build
            ├── unit-tests-core
            ├── unit-tests-diffusion
            └── functional-tests    (L0 always; L1 with needs-more-tests label; L2 on schedule or full-test-suite label)
```
- Slack notifications are sent on completion for scheduled and nightly runs.

For functional test tier semantics and job-to-directory mapping, see the `testing` skill.
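
The tier rules in the tree can be expressed as a small function. This is a sketch under stated assumptions: `functional_tiers` is a hypothetical helper, the event name `schedule` stands in for scheduled runs, and each tier is treated independently (whether `full-test-suite` or a scheduled run also implies L1 is not stated here, so the sketch does not assume it).

```bash
#!/usr/bin/env bash
# Hypothetical helper: which functional-test tiers would run.
# Usage: functional_tiers <event> <comma-separated PR labels>
functional_tiers() {
  local event="$1" labels=",${2:-},"   # pad with commas for whole-label matching
  local tiers="L0"                     # L0 always runs
  if [[ "$labels" == *,needs-more-tests,* ]]; then
    tiers="$tiers L1"
  fi
  if [ "$event" = "schedule" ] || [[ "$labels" == *,full-test-suite,* ]]; then
    tiers="$tiers L2"
  fi
  echo "$tiers"
}

functional_tiers push "needs-more-tests"
```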
CI Failure Investigation
Locating the PR from a CI Branch
```bash
# Extract PR number from branch name (e.g. pull-request/1234)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
```
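
`grep -oP` depends on PCRE support, which not every `grep` provides (the BSD `grep` shipped with macOS, for example). Where that is a concern, the same extraction could be done with bash's built-in regex matching. This is a sketch, and `pr_number_from_branch` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the PR number from a CI branch name using
# bash regex matching only. Returns non-zero for any other branch name.
pr_number_from_branch() {
  local branch="$1"
  if [[ "$branch" =~ ^pull-request/([0-9]+)$ ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

pr_number_from_branch "pull-request/1234"
```

In a checkout it would be called as `PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)")`.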
Investigating a Failing Job
- Get the PR number from the branch name (see above).
- Review the changeset:
  ```bash
  gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
  ```
- Identify the failing job from `gh pr checks` output.
- Fetch job logs:
  ```bash
  gh run list --repo NVIDIA-NeMo/Megatron-Bridge --branch "pull-request/$PR_NUMBER"
  gh run view <run_id> --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > run.log
  ```
- Scan logs in chunks. Log files can exceed 10,000 lines, so never load them whole:
  ```bash
  wc -l run.log
  tail -n 200 run.log        # start from the end
  sed -n '1,200p' run.log    # or scan forward in 200-line chunks
  ```
- Cross-reference the changeset against the failing step.
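
To decide which 200-line chunk to open first, one option is to list only the chunks that contain a match. This is a minimal sketch: `chunks_with_matches` is a hypothetical helper, and the `ERROR` pattern is an assumption about what the failing step prints.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list fixed-size chunks of a log that contain a pattern.
# Each output line is "<start>,<end>p <hit count>"; the first field can be
# passed straight to `sed -n`.
# Usage: chunks_with_matches <file> <chunk_size> <pattern>
chunks_with_matches() {
  local file="$1" size="$2" pattern="$3"
  awk -v size="$size" -v pat="$pattern" '
    $0 ~ pat {
      start = int((NR - 1) / size) * size + 1   # first line of this chunk
      hits[start]++
    }
    END {
      for (s in hits) printf "%d,%dp %d\n", s, s + size - 1, hits[s]
    }
  ' "$file" | sort -n
}

# Example (assumes run.log exists in the current directory):
#   chunks_with_matches run.log 200 'ERROR'
#   sed -n '201,400p' run.log
```

The log is streamed through `awk` once, so it is never held in memory or printed in full.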
Hugging Face Model Access In CI
Assume CI functional-test containers run with Hugging Face models offline
(`HF_HUB_OFFLINE=1`) and a pre-populated `HF_HOME`. When reproducing or
fixing CI failures involving HF models, mirror this locally by setting
`HF_HUB_OFFLINE=1` after warming the cache. Test fixtures must not depend on
live Hub API calls such as `list_repo_files()` or uncached downloads during CI.
For `trust_remote_code=True` toy checkpoints, copy custom Python modules from
the already loaded local/cache source files or a local snapshot, not by listing
the remote repo at test time.
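
One way to mirror the offline setup locally is a small wrapper that sets the environment for a single command. This is a sketch: `run_hf_offline` is a hypothetical helper, the fallback cache path is the Hugging Face default location rather than anything CI-specific, and the `pytest` invocation in the comment is a placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one command with Hub access disabled, reading
# models from HF_HOME. The fallback path is the Hugging Face default cache
# location, an assumption about the local machine rather than a CI setting.
run_hf_offline() {
  HF_HUB_OFFLINE=1 HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}" "$@"
}

# Typical use (placeholder path, not a real test file):
#   1. Warm the cache while online, e.g. run the test once without HF_HUB_OFFLINE.
#   2. run_hf_offline pytest <path/to/failing_test.py>
run_hf_offline sh -c 'echo "HF_HUB_OFFLINE=$HF_HUB_OFFLINE"'
```

Because the variables are set only for the wrapped command, the rest of the shell session stays online for the next cache warm-up.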
Common Failure Patterns
| Symptom | Likely Cause | Action |
|---|---|---|
| CI never started on a PR | Commits not GPG-signed and no `/ok to test` | Post `/ok to test <commit-sha>` on the PR |
| Lint job fails | … | Run … locally |
| Container build fails | Dependency conflict or stale … | Re-run … inside Docker |
| Unit tests fail | Code regression or missing import | Run failing test locally; check the PR diff |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs; reproduce with … |
| HF model fixture passes locally but fails in CI with … | Test made a live Hugging Face Hub API/download call; CI has `HF_HUB_OFFLINE=1` | Warm local cache, reproduce with `HF_HUB_OFFLINE=1` |
| … | Many PRs queued; automation serializes runners to avoid interleaving | Wait; or check queue depth in the Actions tab |
| MCore submodule mismatch | Pinned commit out of sync | Update … |
| Stale checkpoint auto-resume | … left over from a previous run | Run … before restarting |
| Port collision on Slurm (EADDRINUSE) | Using … | Drop torchrun; use … |