CI/CD

Commit and PR Workflow

  • Never commit directly to `main`; always create a feature branch.
  • Always sign off commits (`-s` adds the DCO `Signed-off-by` trailer): `git commit -s -m "message"`.
  • PR title format: `[{areas}] {type}: {description}` (e.g., `[model] feat: Add Qwen3 model bridge`). See @CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO requirements.
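
The title format above can be sanity-checked before opening a PR. The following is a minimal sketch: the function name `pr_title_has_expected_shape` is hypothetical, and the pattern only checks the overall `[areas] type: description` shape. It does not know which area or type labels are valid; those are listed in @CONTRIBUTING.md.

```shell
# Hypothetical helper: succeeds if the title looks like "[areas] type: description".
# It checks shape only; valid area/type labels are defined in CONTRIBUTING.md.
pr_title_has_expected_shape() {
  printf '%s\n' "$1" | grep -Eq '^\[[^]]+\] [^ :]+: .+$'
}

# Example usage:
#   pr_title_has_expected_shape "[model] feat: Add Qwen3 model bridge" && echo "title ok"
```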

How CI Is Triggered

The workflow is defined in @.github/workflows/cicd-main.yml and is triggered on `push`, not on `pull_request`. This is intentional: a bot called `copy-pr-bot` controls when CI runs.
Mechanism:
  1. When a PR is opened, `copy-pr-bot` watches for a trust signal.
  2. Trust is established in one of two ways:
    • All commits on the PR branch are GPG-signed by a verified NVIDIA contributor → bot triggers automatically.
    • An NVIDIAN posts `/ok to test <commit-sha>` as a PR comment → bot triggers manually for that SHA.
  3. Once trusted, `copy-pr-bot` copies the PR's code into the remote branch `pull-request/<number>` and pushes it.
  4. That push fires the workflow's `push` trigger on `refs/heads/pull-request/<number>`, launching CI.
Consequences:
  • CI never runs on untrusted pushes; external contributors always need `/ok to test`.
  • The branch the workflow runs on is `pull-request/<number>`, not the author's feature branch.
  • Pushing a new commit to a PR does not automatically re-trigger CI unless the commit is signed or `/ok to test <new-sha>` is posted.
  • Concurrent runs for the same PR are cancelled automatically (concurrency group per PR number).
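
The manual trust path can be scripted with the GitHub CLI. This is a sketch under assumptions: the helper names `ok_to_test_body` and `request_ci` are hypothetical, the local `HEAD` is assumed to be the commit already pushed to the PR, and the comment only has an effect when posted by someone `copy-pr-bot` trusts.

```shell
REPO="NVIDIA-NeMo/Megatron-Bridge"

# Hypothetical helper: build the comment text for a given full commit SHA.
ok_to_test_body() {
  printf '/ok to test %s' "$1"
}

# Hypothetical helper: post the comment for the current HEAD on PR number $1.
# Assumes HEAD has already been pushed to the PR branch.
request_ci() {
  _sha=$(git rev-parse HEAD) || return 1
  gh pr comment "$1" --repo "$REPO" --body "$(ok_to_test_body "$_sha")"
}

# Example usage:
#   request_ci 1234
```

Using the full SHA from `git rev-parse HEAD` ties the approval to one specific commit, which matches the per-SHA trust model described above.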

Pipeline Structure

pre-flight
  └── lint-check
        └── cicd-wait-in-queue       # queues workflows to avoid runner interleaving across PRs
              └── cicd-container-build
                    ├── unit-tests-core
                    ├── unit-tests-diffusion
                    └── functional-tests (L0 always; L1 with needs-more-tests label; L2 on schedule or full-test-suite label)
  • Slack notifications are sent on completion for scheduled and nightly runs.
For functional test tier semantics and job-to-directory mapping, see the `testing` skill.
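
The label-gated tiers in the tree above can be requested from the command line. A minimal sketch, assuming the labels are applied to the PR itself and that you have permission to edit labels; the helper names `label_for_tier` and `request_tier` are hypothetical.

```shell
REPO="NVIDIA-NeMo/Megatron-Bridge"

# Hypothetical helper: map a functional-test tier to the PR label that enables it.
# L0 always runs, so it needs no label (empty output).
label_for_tier() {
  case "$1" in
    L0) printf '' ;;
    L1) printf 'needs-more-tests' ;;
    L2) printf 'full-test-suite' ;;
    *)  return 1 ;;
  esac
}

# Hypothetical helper: add the label for tier $2 to PR number $1.
request_tier() {
  _label=$(label_for_tier "$2") || return 1
  [ -n "$_label" ] || return 0   # L0: nothing to add
  gh pr edit "$1" --repo "$REPO" --add-label "$_label"
}

# Example usage:
#   request_tier 1234 L1
```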

CI Failure Investigation

Locating the PR from a CI Branch

```bash
# Extract PR number from branch name (e.g. pull-request/1234)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge --name-only
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
```
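
`grep -P` is a GNU extension and is not available in the default `grep` on macOS. Where that matters, the same extraction can be done with shell parameter expansion alone. The function name `pr_number_from_branch` below is hypothetical; this is a sketch, not part of the repository's tooling.

```shell
# Hypothetical helper: print the PR number for a branch named pull-request/<number>.
# Returns non-zero for any other branch name.
pr_number_from_branch() {
  case "$1" in
    pull-request/*) ;;
    *) return 1 ;;
  esac
  _number="${1#pull-request/}"      # strip the fixed prefix
  case "$_number" in
    ''|*[!0-9]*) return 1 ;;        # reject empty or non-numeric suffixes
  esac
  printf '%s\n' "$_number"
}

# Example usage:
#   PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)")
```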

Investigating a Failing Job

  1. Get the PR number from the branch name (see above).
  2. Review the changeset:
    ```bash
    gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Megatron-Bridge
    ```
  3. Identify the failing job from `gh pr checks` output.
  4. Fetch job logs:
    ```bash
    gh run list --repo NVIDIA-NeMo/Megatron-Bridge --branch "pull-request/$PR_NUMBER"
    gh run view <run_id> --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > run.log
    ```
  5. Scan logs in chunks. Log files can exceed 10,000 lines, so never load them whole:
    ```bash
    wc -l run.log
    tail -200 run.log          # start from the end
    sed -n '1,200p' run.log    # or scan forward in 200-line chunks
    ```
  6. Cross-reference the changeset against the failing step.
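
Chunked scanning as in step 5 can be wrapped in a small helper so the line arithmetic is not repeated by hand. This is an illustrative sketch; `log_chunk` is a hypothetical name, and the 200-line default simply mirrors the chunk size used above.

```shell
# Hypothetical helper: print the Nth chunk of a log file.
#   $1 = log file, $2 = 1-based chunk index, $3 = chunk size (default 200)
log_chunk() {
  _file="$1"
  _index="$2"
  _size="${3:-200}"
  _start=$(( (${_index} - 1) * ${_size} + 1 ))
  _end=$(( ${_index} * ${_size} ))
  # Print only the requested range and stop reading once it ends.
  sed -n "${_start},${_end}p;${_end}q" "$_file"
}

# Example usage:
#   log_chunk run.log 1        # lines 1-200
#   log_chunk run.log 2        # lines 201-400
#   log_chunk run.log 3 500    # lines 1001-1500
```

The trailing `q` makes `sed` quit at the end of the requested range instead of reading the rest of a large file.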

Hugging Face Model Access In CI

Assume CI functional-test containers run with Hugging Face models offline (`HF_HUB_OFFLINE=1`) and a pre-populated `HF_HOME`. When reproducing or fixing CI failures involving HF models, mirror this locally by setting `HF_HUB_OFFLINE=1` after warming the cache. Test fixtures must not depend on live Hub API calls such as `list_repo_files()` or uncached downloads during CI. For `trust_remote_code=True` toy checkpoints, copy custom Python modules from the already loaded local/cache source files or a local snapshot rather than listing the remote repo at test time.
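
A local reproduction of the offline setup might look like the sketch below. The wrapper name `run_hf_offline` is hypothetical, and the cache-warming command, repo id, and test path in the comments are placeholders and assumptions, not values taken from the CI configuration.

```shell
# Hypothetical helper: run any command with the Hub forced offline,
# without changing the environment of the calling shell.
run_hf_offline() {
  env HF_HUB_OFFLINE=1 "$@"
}

# Example usage (placeholders, adjust to the failing test):
#   1. Warm the cache while online, e.g. with the huggingface_hub CLI:
#        huggingface-cli download <repo-id>
#   2. Re-run the failing test with the Hub offline:
#        run_hf_offline uv run pytest <path/to/failing_test.py>
```

Scoping the variable to a single command avoids leaving `HF_HUB_OFFLINE=1` set in the shell, where it would block the next cache-warming download.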

Common Failure Patterns

| Symptom | Likely Cause | Action |
| --- | --- | --- |
| CI never started on a PR | Commits not GPG-signed and no `/ok to test` comment | Post `/ok to test <full-sha>` on the PR |
| Lint job fails | `ruff` or `pre-commit` violation | Run `ruff check --fix` + `ruff format` locally |
| Container build fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside Docker and commit updated lock |
| Unit tests fail | Code regression or missing import | Run failing test locally; check the PR diff |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs; reproduce with `L0_Launch_*.sh` |
| HF model fixture passes locally but fails in CI with `OfflineModeIsEnabled` | Test made a live Hugging Face Hub API/download call; CI has `HF_HUB_OFFLINE=1` | Warm local cache, reproduce with `HF_HUB_OFFLINE=1`, and change the fixture to use cached/local artifacts only |
| `cicd-wait-in-queue` running long | Many PRs queued; automation serializes runners to avoid interleaving | Wait; or check queue depth in the Actions tab |
| MCore submodule mismatch | Pinned commit out of sync | Update `3rdparty/Megatron-LM` submodule and re-lock |
| Stale checkpoint auto-resume | `nemo_experiments/` from a previous run exists | `rm -rf nemo_experiments` before starting fresh |
| Port collision on Slurm (EADDRINUSE) | `ntasks-per-node=8` with `torchrun` | Drop torchrun; use `ntasks-per-node=8` with `uv run python script.py` |
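
Two of the symptoms in the table are literal strings that show up in logs, so a saved log can be checked for them directly. The sketch below is illustrative only: `known_failure_signatures` is a hypothetical name, and the list covers just the two signatures named in the table (`OfflineModeIsEnabled` and `EADDRINUSE`).

```shell
# Hypothetical helper: print each known failure signature found in a saved log.
#   $1 = path to a log file such as run.log
known_failure_signatures() {
  for _sig in OfflineModeIsEnabled EADDRINUSE; do
    if grep -q "$_sig" "$1"; then
      printf '%s\n' "$_sig"
    fi
  done
}

# Example usage:
#   known_failure_signatures run.log
```

An empty result only means neither string appeared; the remaining rows in the table still need to be checked by reading the failing step.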