cicd

CI/CD

Commit and PR Workflow

- Never commit directly to `main`; always create a feature branch.
- Always sign off commits: `git commit -s -m "message"` (`-s` adds the DCO `Signed-off-by` trailer).
- PR title format: `[{areas}] {type}: {description}` (e.g., `[model] feat: Add Qwen3 model bridge`).
- See @CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO requirements.
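
The title format above can be checked before opening a PR. The following is a minimal sketch, not part of the repository: `is_valid_pr_title` is a hypothetical helper, and the regular expression encodes an assumption (a lowercase one-word type, any non-empty bracketed area list). The authoritative list of areas and types is in @CONTRIBUTING.md.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks a title against "[{areas}] {type}: {description}".
# Assumption: {type} is a single lowercase word; the allowed values are not checked here.
is_valid_pr_title() {
  local title="$1"
  # "[...]" with no nested brackets, a space, a lowercase type, ": ", then a description.
  local pattern='^\[[^][]+\] [a-z]+: .+$'
  [[ "$title" =~ $pattern ]]
}

if is_valid_pr_title "[model] feat: Add Qwen3 model bridge"; then
  echo "title ok"
fi
```

The function only returns a status code, so it can be used in a local pre-push script or an `if` guard without producing output of its own.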

How CI Is Triggered

The workflow is defined in @.github/workflows/cicd-main.yml and is triggered on `push`, not on `pull_request`. This is intentional: a bot called `copy-pr-bot` controls when CI runs.

Mechanism:

1. When a PR is opened, `copy-pr-bot` watches for a trust signal.
2. Trust is established in one of two ways:
   - All commits on the PR branch are GPG-signed by a verified NVIDIA contributor → bot triggers automatically.
   - An NVIDIAN posts `/ok to test <commit-sha>` as a PR comment → bot triggers manually for that SHA.
3. Once trusted, `copy-pr-bot` copies the PR's code into the remote branch `pull-request/<number>` and pushes it.
4. That push fires the workflow's `push` trigger on `refs/heads/pull-request/<number>`, launching CI.

Consequences:

- CI never runs on untrusted pushes; external contributors always need `/ok to test`.
- The running workflow branch is `pull-request/<number>`, not the author's feature branch.
- Pushing a new commit to a PR does not automatically re-trigger CI unless the commit is signed or `/ok to test <new-sha>` is posted.
- Concurrent runs for the same PR are cancelled automatically (concurrency group per PR number).
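
Because each new commit needs its own `/ok to test <sha>` comment, it can help to build the comment text from the exact commit. The sketch below is illustrative only: `ok_to_test_comment` is a hypothetical helper, and it assumes 40-character SHA-1 commit IDs. The comment only has an effect when posted by someone the bot trusts, as described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the trust comment for one full commit SHA.
# Assumption: the repository uses 40-character SHA-1 object names.
ok_to_test_comment() {
  local sha="$1"
  local pattern='^[0-9a-f]{40}$'
  if [[ ! "$sha" =~ $pattern ]]; then
    echo "expected a full 40-character commit SHA, got: $sha" >&2
    return 1
  fi
  echo "/ok to test $sha"
}

# Possible usage from a checkout of the PR branch (requires the gh CLI and a known PR_NUMBER):
#   gh pr comment "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge \
#     --body "$(ok_to_test_comment "$(git rev-parse HEAD)")"
```

Rejecting abbreviated SHAs up front avoids posting a comment that does not match the commit the bot is expected to copy.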

Pipeline Structure

pre-flight
  └── lint-check
        └── cicd-wait-in-queue       # queues workflows to avoid runner interleaving across PRs
              └── cicd-container-build
                    ├── unit-tests-core
                    ├── unit-tests-diffusion
                    └── functional-tests (L0 always; L1 with needs-more-tests label; L2 on schedule or full-test-suite label)

Slack notifications are sent on completion for scheduled and nightly runs.

For functional test tier semantics and job-to-directory mapping, see the `testing` skill.
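
The tier rules noted next to `functional-tests` in the diagram can be read as a small decision function. This is a sketch of those rules only, under the assumption that each tier is selected independently and that scheduled runs are identified by the GitHub Actions event name `schedule`. `select_functional_tiers` is a hypothetical name; the real selection logic lives in the workflow file and may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the tier rules from the pipeline diagram.
# Usage: select_functional_tiers <event-name> [pr-label ...]
select_functional_tiers() {
  local event="$1"
  shift
  local run_l1=0 run_l2=0 label
  # L2 runs on scheduled workflows (assumed event name: "schedule").
  if [[ "$event" == "schedule" ]]; then run_l2=1; fi
  for label in "$@"; do
    case "$label" in
      needs-more-tests) run_l1=1 ;;  # label that enables L1
      full-test-suite) run_l2=1 ;;   # label that enables L2
    esac
  done
  local tiers="L0"                   # L0 always runs
  if (( run_l1 )); then tiers+=" L1"; fi
  if (( run_l2 )); then tiers+=" L2"; fi
  echo "$tiers"
}

select_functional_tiers push needs-more-tests
```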

CI Failure Investigation

Locating the PR from a CI Branch

bash

# Extract PR number from branch name (e.g. pull-request/1234)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
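
The extraction above relies on `grep -P`, which is a GNU grep feature and may not be available everywhere (for example with the BSD grep shipped on macOS). A possible alternative, sketched here with a hypothetical function name, uses Bash's built-in regex matching and fails explicitly when the branch is not a CI branch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the PR number for a branch named pull-request/<number>.
# Returns non-zero for any other branch name instead of printing an empty string.
pr_number_from_branch() {
  local branch="$1"
  local pattern='^pull-request/([0-9]+)$'
  if [[ "$branch" =~ $pattern ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Possible usage inside a checkout:
#   PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)") || echo "not a CI branch" >&2
pr_number_from_branch "pull-request/1234"
```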

Investigating a Failing Job

Get the PR number from the branch name (see above).

Review the changeset:

bash

gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge

Identify the failing job from `gh pr checks` output.

Fetch job logs:

bash

gh run list --repo NVIDIA-NeMo/Megatron-Bridge --branch "pull-request/$PR_NUMBER"
gh run view <run_id> --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > run.log

Scan logs in chunks. Log files can exceed 10,000 lines, so never load them whole:

bash

wc -l run.log
tail -200 run.log          # start from the end
sed -n '1,200p' run.log    # or scan forward in 200-line chunks

Cross-reference the changeset against the failing step.
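
The chunked scan from the steps above can be wrapped in a small function so that each call reads one bounded window of the log. This is a sketch with a hypothetical name, `log_chunk`; the 200-line default mirrors the chunk size used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints <size> lines of <file> starting at line <start>.
# Usage: log_chunk <file> <start-line> [size]   (size defaults to 200)
log_chunk() {
  local file="$1" start="$2" size="${3:-200}"
  local end=$(( start + size - 1 ))
  # Print only the requested window, then quit so sed does not read the rest of the file.
  sed -n -e "${start},${end}p" -e "${end}q" "$file"
}

# Possible usage, walking forward through run.log:
#   log_chunk run.log 1
#   log_chunk run.log 201
#   log_chunk run.log 401 100
```

Quitting at the end of the window keeps the cost of each call proportional to the offset being read, which matters when the log runs to tens of thousands of lines.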

Hugging Face Model Access In CI

Assume CI functional-test containers run with Hugging Face models offline (`HF_HUB_OFFLINE=1`) and a pre-populated `HF_HOME`. When reproducing or fixing CI failures involving HF models, mirror this locally by setting `HF_HUB_OFFLINE=1` after warming the cache. Test fixtures must not depend on live Hub API calls such as `list_repo_files()` or on uncached downloads during CI. For `trust_remote_code=True` toy checkpoints, copy custom Python modules from the already loaded local/cache source files or from a local snapshot rather than listing the remote repo at test time.
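
One way to mirror the offline assumption locally is to run the test command through a wrapper that forces the Hub offline. The sketch below is an illustration, not project tooling: `run_offline` is a hypothetical helper, and the fallback `HF_HOME` path is the library's usual default cache location, assumed here in case the variable is unset.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs any command with the Hugging Face Hub forced offline.
# Assumption: the cache was already warmed while online, before calling this.
run_offline() {
  HF_HUB_OFFLINE=1 HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}" "$@"
}

# Possible usage (the test path is a placeholder, not a real file in the repository):
#   run_offline uv run pytest <path-to-failing-test>
run_offline printenv HF_HUB_OFFLINE
```

If a fixture still reaches for the network under this wrapper, it should fail locally in the same way it does in CI, which makes the offending call easier to find.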

Common Failure Patterns

| Symptom | Likely Cause | Action |
|---|---|---|
| CI never started on a PR | Commits not GPG-signed and no `/ok to test` comment | Post `/ok to test <full-sha>` on the PR |
| Lint job fails | `ruff` or `pre-commit` violation | Run `ruff check --fix` + `ruff format` locally |
| Container build fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside Docker and commit updated lock |
| Unit tests fail | Code regression or missing import | Run failing test locally; check the PR diff |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs; reproduce with `L0_Launch_*.sh` |
| HF model fixture passes locally but fails in CI with `OfflineModeIsEnabled` | Test made a live Hugging Face Hub API/download call; CI has `HF_HUB_OFFLINE=1` | Warm local cache, reproduce with `HF_HUB_OFFLINE=1`, and change the fixture to use cached/local artifacts only |
| `cicd-wait-in-queue` running long | Many PRs queued; automation serializes runners to avoid interleaving | Wait; or check queue depth in the Actions tab |
| MCore submodule mismatch | Pinned commit out of sync | Update `3rdparty/Megatron-LM` submodule and re-lock |
| Stale checkpoint auto-resume | `nemo_experiments/` from a previous run exists | `rm -rf nemo_experiments` before starting fresh |
| Port collision on Slurm (EADDRINUSE) | `ntasks-per-node=8` with `torchrun` | Drop torchrun; use `ntasks-per-node=8` with `uv run python script.py` |
