nightly-sync

Nightly Sync: Main to Dev

This skill is read by the automated sync bot during the nightly-sync-main-to-dev workflow. It contains all domain knowledge for merging main into dev, resolving conflicts, iterating on CI, and shipping the PR.

Phase 1: Create the Sync Branch and Merge

Branch Setup

  1. Create branch $BRANCH from origin/dev
  2. Merge: git merge origin/main -X theirs --no-edit
  3. If conflicts remain (e.g. add/add), resolve by favoring main
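The three steps above can be sketched as a pair of shell functions. This is a minimal sketch, not the workflow's actual script: resolve_favoring_theirs and sync_branch_setup are hypothetical helper names, and BRANCH is assumed to be set by the caller.

```shell
# Hypothetical helper: for every path still unmerged after the merge
# attempt, take main's ("theirs") side and stage it.
resolve_favoring_theirs() {
  git diff --name-only --diff-filter=U | while IFS= read -r f; do
    git checkout --theirs -- "$f" && git add -- "$f"
  done
}

# Hypothetical wrapper for steps 1-3. Assumes $BRANCH is already set.
sync_branch_setup() {
  git fetch origin main dev
  git checkout -b "$BRANCH" origin/dev
  # A non-zero exit from the merge means some paths (e.g. add/add) are
  # still conflicted; resolve them toward main, then conclude the merge.
  git merge origin/main -X theirs --no-edit || {
    resolve_favoring_theirs && git commit --no-edit
  }
}
```

git checkout --theirs fails on paths that main deleted (modify/delete conflicts), so those still need a manual decision.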

Preserving Dev-Only Additions

Do NOT blanket-override all shared files with main's version. Dev has features not yet in main (new classes, new modules, new tests). The merge preserves both sides' non-conflicting additions — only intervene where there is an actual conflict.

Squash-Merge Chain Detection

Dev often develops features as a chain of PRs (PR1 → PR2 → PR3) where each builds on the last. When PR1 is squash-merged to main, git sees main's squashed version and dev's original commits as unrelated changes. -X theirs will pick main's PR1 code and silently discard PR2/PR3's improvements on dev.
After the merge, check for this pattern:
  1. For each file where -X theirs resolved a conflict, run git log --oneline origin/dev -- <file> to see if dev has commits that came AFTER the code main is bringing in.
  2. If dev has follow-up commits (bug fixes, refactors, extensions), favor dev's version for those sections.
  3. If the conflict is just main bringing in a clean copy of what dev already has (no follow-ups), main's version is fine.
Practical check: run git diff origin/dev -- <file> on conflicted files. If dev's code was removed or reverted, investigate whether dev's version is the more evolved one.
Real examples from PR #4291:
  • emerging_optimizers.py: Main's version was MORE complete; it squash-merged dev's PRs plus added more. -X theirs was correct.
  • distrib_optimizer.py: Main overwrote dev's GroupedQuantizedTensor support. Had to restore _is_distopt_quantized_param and the expanded _expand_quantized_param_shard_for_cast loop while keeping main's NVFP4 additions. This required a surgical merge combining sections from both.
Key insight: squash-merge chains can go in EITHER direction. Sometimes main is ahead (it squash-merged dev's work + more), sometimes dev is ahead (it has follow-up PRs). Always diff both ways before deciding which version to favor.
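One way to narrow the git log check to follow-up work is to list only the commits that are on dev but not on main for a given file. This is a sketch under assumptions: dev_only_commits is a hypothetical helper, and the ref defaults simply mirror the branch names used in this skill.

```shell
# Hypothetical helper: commits reachable from dev but not from main that
# touched the given file. Non-empty output suggests dev has follow-ups
# that -X theirs may have discarded.
# Usage: dev_only_commits <file> [main-ref] [dev-ref]
dev_only_commits() {
  git log --oneline "${2:-origin/main}..${3:-origin/dev}" -- "$1"
}

# Example (inside the repo):
#   dev_only_commits megatron/core/optimizer/distrib_optimizer.py
```

Empty output does not prove main's version is complete; it only means no dev-only commits touched that path, so still run the diff against origin/dev.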

Files to Override from Main

These files have known semantic conflicts where dev's versions reference args or APIs that main removed or renamed. Take main's version with git checkout origin/main -- <file>:
  • megatron/training/training.py: references dev-only args
  • megatron/training/initialize.py: references dev-only args
  • megatron/training/utils.py: references dev-only args
  • megatron/training/datasets/data_samplers.py: references dev-only args
  • megatron/core/optimizer/layer_wise_optimizer.py: constructor signature
Caveat for ALL overrides: After taking main's version of any file, you MUST run the API Mismatch Detection procedure (see below) on that file. Taking main's caller code while keeping dev's callee implementations is the #1 source of sync bugs.
IMPORTANT: Do NOT take main's pyproject.toml, uv.lock, or docker/Dockerfile.ci.dev. These three files are a tightly coupled triple: the Dockerfile's uv sync command must match the dependency groups in pyproject.toml, and uv.lock must be consistent with both. Main's versions are missing dev-only dependencies (e.g. fast-hadamard-transform, correct TransformerEngine revision) and the --group no_pypi_wheels flag needed to install them. Keep dev's versions of all three files.
IMPORTANT: .github/CODEOWNERS must NEVER be modified by the sync bot under any circumstances. Dev's CODEOWNERS is intentionally different from main's: do not take main's version, do not merge them, do not touch the file. If the merge produces a conflict or a non-zero diff against origin/dev on this path, restore dev's version verbatim:
git checkout origin/dev -- .github/CODEOWNERS
Then verify with git diff origin/dev -- .github/CODEOWNERS; the output must be empty. Modifying CODEOWNERS triggers spurious reviewer requests and conflicts with the dev team's governance; rolling back a CODEOWNERS change after the PR lands is painful.
NEVER manually edit uv.lock. It is a machine-generated lockfile. If it needs to change, it must be regenerated with uv lock inside a CUDA container (see .claude/skills/build-and-test/SKILL.md).

Git Source Reconciliation (pyproject.toml)

After keeping dev's pyproject.toml, check whether main has added NEW git sources to [tool.uv.sources] that don't exist in dev's version. Main's merged code may import from packages only available at specific git revisions.
  1. Diff the [tool.uv.sources] sections: git show origin/main:pyproject.toml vs git show origin/dev:pyproject.toml
  2. For each git source in main but not dev, add it to dev's pyproject.toml
  3. For sources in both but at different revisions, check whether dev's revision works. If dev's revision is broken (TOML parse errors, missing classes main's code imports), take main's revision instead.
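Step 1 can be reduced to comparing source names. The sketch below assumes both versions of pyproject.toml were first written to local files (for example with git show origin/main:pyproject.toml > main.toml). uv_source_names and main_only_sources are hypothetical helpers, and the awk parser only handles simple one-line name = { ... } entries, not every valid TOML layout.

```shell
# Hypothetical helper: print the key names found under [tool.uv.sources]
# in the TOML text read from stdin, sorted.
uv_source_names() {
  awk '
    /^\[/ { in_sec = ($0 == "[tool.uv.sources]"); next }
    in_sec && /^[A-Za-z0-9_.-]+[[:space:]]*=/ {
      sub(/[[:space:]]*=.*/, "")   # keep only the key left of "="
      print
    }
  ' | sort
}

# Hypothetical helper: names present in main's file ($1) but absent from
# dev's file ($2). These are candidates to add to dev's pyproject.toml.
main_only_sources() {
  local dev_names
  dev_names=$(uv_source_names < "$2")
  uv_source_names < "$1" | grep -Fxv -e "$dev_names"
}
```

Sources that exist on both sides at different revisions are not reported here; step 3 still needs a manual look at the rev values.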
Real examples from PR #4291:
  • nvidia-resiliency-ext: Main's torch.py imports get_write_results_queue which only existed in main's pinned git revision, not on PyPI. Had to add main's git source to dev's pyproject.toml.
  • nemo-run: Dev's pinned revision had a TOML parse error with uv 0.7.2. Had to swap to main's revision.
After any changes to pyproject.toml, regenerate uv.lock inside a CUDA container:
bash
docker run --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:26.02-py3 \
  bash -c "pip install uv==0.7.2 && cd /workspace && \
  uv venv .venv --system-site-packages && uv sync --only-group build && uv lock"

bash
# Clean up root-owned .venv:
docker run --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:26.02-py3 \
  bash -c "rm -rf /workspace/.venv"

API Mismatch Detection (Post-Merge Audit)

The merge can create "Frankenstein" code where main's callers use dev's implementations (or vice versa) with different method signatures. This compiles fine but fails at runtime.
After the merge, audit cross-boundary call sites:
  1. Identify files where main's version was taken (-X theirs or explicit git checkout origin/main)
  2. For each, find all external call sites: classes it instantiates, methods it calls on imported objects, functions from other modules it invokes
  3. Verify method names, parameter counts, and signatures match between the caller and the implementation in the merged tree
  4. Pay special attention to "interface" modules (files defining base classes): if main and dev evolved the interface differently, every caller and implementer must agree
Real examples from PR #4291:
  • multi_latent_attention.py (main) called off_interface.group_commit() but dev's interface only had group_offload(): method renamed
  • mamba_model.py (main) called init_chunk_handler(3 params) but dev's interface required 6 params: signature expanded on dev
  • mamba_model.py called mark_not_offloadable() but dev had mark_not_offload(): method renamed
  • bulk_offload() did .remove() after bulk_offload_group() already .pop()d the same item: double-removal from a list
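Renamed methods like the ones above can often be caught with a plain text search. The following is a rough sketch, not a substitute for reading the call sites: missing_defs is a hypothetical helper that reports which of the given method names have no def anywhere under a directory. It cannot detect signature changes such as the init_chunk_handler case.

```shell
# Hypothetical helper: print each method name that has no "def <name>("
# in any *.py file under the given directory.
# Usage: missing_defs <directory> <name> [<name> ...]
missing_defs() {
  local root="$1" name
  shift
  for name in "$@"; do
    if ! grep -rqE --include='*.py' \
        "def[[:space:]]+${name}[[:space:]]*\(" "$root"; then
      echo "$name"
    fi
  done
}

# Example (inside the repo):
#   missing_defs megatron/ group_commit mark_not_offloadable
```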
Practical detection:
bash
# For each file taken from main, find what it imports and calls
grep -rn "from <module> import\|<module>\." megatron/
# Cross-reference with the actual implementations in the merged tree

File-Specific Merge Lessons

These lessons were learned from PR #4291. They may recur if the same files continue to diverge:
  • gated_delta_net.py: If the merge creates code calling non-existent helper methods (e.g. _resolve_cu_seqlens), take dev's version wholesale.
  • model_chunk_schedule_plan.py: Watch for missing imports (e.g. CudaGraphScope) silently dropped during conflict resolution.
  • fine_grained_activation_offload.py: Critical interface file used by many callers. If main and dev have divergent method names/signatures, prefer dev's implementation and patch main-originated callers to match.
  • distrib_optimizer.py: Dev may have broader type abstractions (e.g. _is_distopt_quantized_param covering both FP8 and GroupedQuantizedTensor). Main may simplify to explicit type checks. Restore dev's abstractions.

Special Handling: data_schedule.py

Main and dev have completely different classes in this file:
  • Main: HybridCPDataLoaderWrapper (imported by main's training.py)
  • Dev: BasePackingScheduler, DpBalancedScheduler, DefaultDynamicCPScheduler, wrap_data_iterator, get_batch_on_this_rank_for_sequence_packing (imported by pretrain_gpt.py and tests)
Do NOT take either version wholesale. Keep dev's file and append main's HybridCPDataLoaderWrapper class (plus any missing imports like BalancedCPScheduler, Any, List) at the end.

Restore Deleted Files

Compare git ls-tree between origin/main and HEAD to find files in main that are missing from the merged tree. For each:
  • Restore if main's code imports/references it and would break without it (e.g. hybrid_cp_schedule.py if data_schedule.py imports from it)
  • Do NOT restore if dev intentionally deleted it: check git log origin/dev -- <file> for the deletion commit to understand intent
  • When in doubt, check whether any file in the merged tree imports from the missing file. If nothing imports it, skip it.
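The ls-tree comparison can be sketched as follows. missing_from_merge is a hypothetical helper; it only lists candidates, and the restore-or-skip decision for each path still follows the rules above.

```shell
# Hypothetical helper: paths that exist in the main ref but not in the
# merged ref.
# Usage: missing_from_merge [main-ref] [merged-ref]
missing_from_merge() {
  local main_ref="${1:-origin/main}" merged_ref="${2:-HEAD}" tmp
  tmp=$(mktemp)
  git ls-tree -r --name-only "$merged_ref" > "$tmp"
  # -F fixed strings, -x whole-line match, -v invert: keep main's paths
  # that do not appear in the merged tree's listing.
  git ls-tree -r --name-only "$main_ref" | grep -Fxv -f "$tmp"
  rm -f "$tmp"
}

# Example: restore one path after deciding it is needed
#   git checkout origin/main -- <file>
```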

Formatting

Run on ALL changed Python files (relative to origin/dev), in this order:
  1. black (version 24, --config pyproject.toml)
  2. isort
  3. Order matters: black first, then isort; the reverse order can undo isort's work
  4. pylint on changed megatron/core/ files: fix missing-docstring and line-too-long violations before pushing
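A possible way to build the file list and apply the first two steps in order is sketched below. changed_py_files and format_changed are hypothetical helpers; the sketch assumes black and isort are installed in the current environment and that no path contains whitespace.

```shell
# Hypothetical helper: Python files that differ from the base ref.
# ACMR = added, copied, modified, renamed; deleted files are skipped.
changed_py_files() {
  git diff --name-only --diff-filter=ACMR "${1:-origin/dev}" -- '*.py'
}

# Hypothetical helper: run black first, then isort, on the changed files.
format_changed() {
  local files
  files=$(changed_py_files "$@")
  [ -n "$files" ] || return 0
  printf '%s\n' "$files" | xargs black --config pyproject.toml
  printf '%s\n' "$files" | xargs isort
}
```

Untracked files do not appear in git diff output, so stage new files (git add) before relying on this list.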

Pre-push invariant checks

Before every git push in this workflow (the initial push in Phase 1 AND every fix-push in Phase 3), run these bash checks. If any fails, fix the condition and re-check before pushing:
bash
# 1. CODEOWNERS must be identical to dev's.
if ! git diff --quiet origin/dev -- .github/CODEOWNERS; then
  echo "ABORT: .github/CODEOWNERS differs from origin/dev. Restore with:"
  echo "  git checkout origin/dev -- .github/CODEOWNERS"
  exit 1
fi

# 2. Dependency-management triple must be identical to dev's.
for f in pyproject.toml uv.lock docker/Dockerfile.ci.dev; do
  if ! git diff --quiet origin/dev -- "$f"; then
    # pyproject.toml is allowed to differ ONLY for git source reconciliation
    # (new [tool.uv.sources] entries from main). If you intentionally edited
    # it for that reason, bypass this check by re-running with $f skipped.
    echo "WARNING: $f differs from origin/dev"
  fi
done
The CODEOWNERS check is a HARD abort: never push if it fails.

Commit and Push

Phase 1 produces a single commit on the sync branch. The merge itself creates the merge commit; fold any post-merge work (formatting, conflict surgery, restored files, regenerated uv.lock) into it rather than stacking a second commit:
bash
git add -A
git commit --amend --no-edit  # rewrites the merge commit's tree;
                              # parents are preserved.
git push -u origin "$BRANCH"  # only non-force push of the run.
Once pushed, this commit is immutable for the rest of the run. Phase 3 fixes go into a separate rolling fix commit on top (see Phase 3 step 4 and the two-commit policy in Rules).


Phase 2: Create the Draft PR

  • Title: chore: nightly sync main into dev ($DATE)
  • Create as draft: gh pr create --draft
  • Body should include:
    1. Summary of what was synced (number of commits from main)
    2. Python-only line-change stats, so reviewers can gauge the real code surface (excluding golden-value JSON, uv.lock, etc.). Compute with:
      bash
      git diff --numstat origin/dev...HEAD -- '*.py' \
        | awk 'BEGIN{a=0;d=0} {a+=$1; d+=$2} END{
            printf "Python lines: +%d / -%d across %d files\n", a, d, NR
          }'
      Include the exact line (e.g. Python lines: +1234 / -567 across 42 files) in the PR body so reviewers see it at a glance.
    3. List of files where main's version was taken over the merge
    4. List of files that were deleted in dev but restored (and why)
    5. The remerge-diff output (git show --remerge-diff HEAD on the merge commit) so reviewers can inspect ONLY the conflict resolutions. If the output is very long, summarize conflicts by file and put the full diff in a collapsed <details> block. If git is too old for --remerge-diff, note the git version and describe the merge strategy used instead.
  • Save the PR number for later phases
  • Add the Run functional tests and Run MBridge tests labels to the PR immediately after creation. The Run functional tests label ensures /ok to test triggers the full CI suite (unit tests + functional/integration tests with 100-step training and golden value comparison). The Run MBridge tests label triggers the MBridge test suite. Without these labels, only a lightweight subset runs.
    bash
    gh pr edit <PR_NUMBER> --repo $REPO \
      --add-label "Run functional tests" \
      --add-label "Run MBridge tests"
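For the "git is too old" case in body item 5, the version can be checked before calling git show. The --remerge-diff option was added in Git 2.36. supports_remerge_diff below is a hypothetical helper that parses the text printed by git --version; treat it as a sketch rather than part of the workflow.

```shell
# Hypothetical helper: succeed if the given `git --version` output names
# a version that has --remerge-diff (2.36 or newer).
# Usage: supports_remerge_diff "$(git --version)"
supports_remerge_diff() {
  local v major minor
  v=${1#git version }     # "2.43.0", "2.39.3 (Apple Git-146)", ...
  major=${v%%.*}
  v=${v#*.}
  minor=${v%%.*}
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 36 ]; }
}

# Example:
#   if supports_remerge_diff "$(git --version)"; then
#     git show --remerge-diff HEAD > remerge.diff
#   else
#     git --version   # record this in the PR body instead
#   fi
```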


Phase 3: CI Iteration

CI Architecture

  • Nemo_CICD_Test is a downstream gate job aggregating unit test, integration test, and other results. If it fails, investigate the upstream jobs it depends on; do NOT debug the gate itself.
  • Integration tests (H100, GB200) may be skipped for non-maintainer PRs. This is expected; the Nemo_CICD_Test gate will fail as a result.
  • tests/unit_tests/conftest.py imports from megatron.training.training, so a broken import in training.py (or anything it transitively imports) cascades to fail ALL test suites. If every test job fails with ImportError, check the training.py import chain first.

Execution model: one step, no background

You run inside ONE GitHub Actions step. The moment you stop emitting tool calls, the step ends and the runner container is destroyed. Any background process you started dies with it. There is NO persistent session and NO future wakeup. See the workflow prompt's "NO background tasks" block for the full ban list.
Practical rule: every wait for CI to resolve is a SINGLE foreground Bash tool call that blocks inline until the wait is resolved.

The Fix-Then-Retrigger Loop

Two nested loops. Do NOT conflate them:
  • The outer loop is YOUR sequence of tool calls (each iteration: one /ok to test, one blocking poll, maybe one fix-and-push). It is NOT a Bash loop. It advances because you make new tool calls.
  • The inner loop is a single blocking Bash tool call using while true; do ... sleep 120; done. It runs during one iteration of the outer loop and ends when CI reaches a terminal state for that iteration.
The outer loop terminates ONLY when Phase 4's gate is satisfied.
Source of truth: gh pr view <PR_NUMBER> --repo $REPO --json statusCheckRollup. This lists every required check, including external status contexts (GitLab CI, copy-pr-bot, etc.) that gh api .../actions/runs/.../jobs does NOT show.
Outer-loop iteration (each iteration is a few tool calls):
  1. latest_sha=$(git rev-parse HEAD) (one Bash call).
  2. Post /ok to test $latest_sha on the PR: gh pr comment <PR_NUMBER> --repo $REPO --body "/ok to test $latest_sha"
  3. ONE blocking Bash tool call. This is the inner loop. Copy this template verbatim, only changing REPO and PR:
    bash
    REPO='NVIDIA/Megatron-LM'
    PR='<PR_NUMBER>'
    # Names matched case-insensitively, anchored to the START of the name.
    EXEMPT='copy-pr-bot|is-not-external-contributor|greptile|coderabbit|codeowners|.*review|.*approval|codecov|coverage|build-docs|doc-build|readthedocs|sphinx'
    # Sentinel check that tells us CI has fully run. Update this if the
    # aggregate gate job is renamed.
    SENTINEL='Nemo_CICD_Test'
    
    while true; do
      # Normalize both CheckRun (.status / .conclusion) and StatusContext
      # (.state) entries into the same {name, status, conclusion} shape.
      rollup=$(gh pr view "$PR" --repo "$REPO" --json statusCheckRollup --jq '
        .statusCheckRollup[] | [
          (.name // .context // "?"),
          (if .__typename == "StatusContext" then
             (if (.state == "PENDING" or .state == "EXPECTED") then "IN_PROGRESS"
              else "COMPLETED" end)
           else (.status // "UNKNOWN") end),
          (if .__typename == "StatusContext" then
             (if .state == "SUCCESS" then "SUCCESS"
              elif (.state == "FAILURE" or .state == "ERROR") then "FAILURE"
              else "NEUTRAL" end)
           else (.conclusion // "UNKNOWN") end)
        ] | @tsv')
    
      # Sentinel: do NOT declare green until the CI aggregate gate has
      # reached a terminal state. Before /ok to test triggers the run,
      # the sentinel is absent; while CI is running, it's IN_PROGRESS.
      sentinel_line=$(printf '%s\n' "$rollup" | awk -F'\t' -v s="$SENTINEL" '$1 == s')
      sentinel_status=$(printf '%s\n' "$sentinel_line" | awk -F'\t' 'NR==1 {print $2}')
      if [ "$sentinel_status" != "COMPLETED" ]; then
        echo "=== $(date -u) waiting for $SENTINEL (status: ${sentinel_status:-absent}) ==="
        sleep 120
        continue
      fi
    
      # Classify non-exempt checks (exempt list applied to the NAME only).
      non_exempt=$(printf '%s\n' "$rollup" | awk -F'\t' -v p="^($EXEMPT)" 'tolower($1) !~ tolower(p)')
      failed=$(printf '%s\n' "$non_exempt" | awk -F'\t' '$2 == "COMPLETED" && $3 !~ /^(SUCCESS|SKIPPED|NEUTRAL)$/')
      pending=$(printf '%s\n' "$non_exempt" | awk -F'\t' '$2 != "COMPLETED"')
    
      if [ -n "$failed" ]; then
        echo "=== NON-EXEMPT FAILURES ==="
        printf '%s\n' "$failed"
        echo "RESULT=FAILURE"
        exit 0
      fi
      if [ -n "$pending" ]; then
        # Sentinel is COMPLETED but a non-exempt check is still pending —
        # rare but possible. Keep waiting; do NOT ship.
        echo "=== $(date -u) sentinel done but non-exempt checks still pending ==="
        printf '%s\n' "$pending"
        sleep 120
        continue
      fi
    
      echo "=== ALL NON-EXEMPT CHECKS COMPLETED GREEN ==="
      printf '%s\n' "$non_exempt"
      echo "RESULT=GREEN"
      exit 0
    done
    This Bash call blocks for as long as CI takes (minutes to hours). Do NOT split it into many short polls interleaved with other tool calls; that wastes --max-turns and creates windows where you could lose track of the loop state.
  4. Read the tool output:
    • If RESULT=FAILURE: diagnose via gh api repos/$REPO/actions/jobs/<JOB_ID>/logs (or the external-context equivalent) and fix the code. The Phase 1 commit is immutable; fixes accumulate in a single rolling fix commit on top of it:
      bash
      git add -A
      if git rev-parse --verify HEAD^2 >/dev/null 2>&1; then
        # HEAD has two parents → still the Phase 1 merge commit.
        # First failure of this run: create the fix commit.
        git commit -m "fix: post-CI corrections"
        git push origin "$BRANCH"
      else
        # HEAD is the existing fix commit → amend it.
        git commit --amend --no-edit
        git push --force-with-lease origin "$BRANCH"
      fi
      --force-with-lease (not --force): if a human pushed onto the branch since the bot last fetched, the lease aborts the push instead of clobbering them; fetch and decide what to do. Start a new outer-loop iteration at step 1 with the new HEAD SHA.
    • If RESULT=GREEN: outer loop is done. Proceed to Phase 4.
Why not wait-for-run-to-register first? gh pr comment with /ok to test <sha> is handled by copy-pr-bot, which takes a few seconds to trigger the CI run. The statusCheckRollup poll in step 3 will initially show checks in PENDING / QUEUED; that's fine, because the inner loop treats those as "keep waiting" and will see them advance as CI progresses. No separate registration poll needed.
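The classification rules of the inner loop can be exercised offline on canned rows before trusting them against a live PR. The sketch below is an assumption-labeled restatement, not a replacement for the template in step 3: classify_rollup is a hypothetical helper that reads name, status, and conclusion as tab-separated columns (the same shape the template's jq filter emits) and omits the sentinel wait.

```shell
# Hypothetical helper: classify normalized rollup rows read from stdin.
# $1 is the exempt-name regex (lowercase, matched at the start of the name).
# Prints FAILURE, PENDING, or GREEN; failures take priority over pending.
classify_rollup() {
  awk -F'\t' -v p="^($1)" '
    NF < 3 || tolower($1) ~ tolower(p) { next }          # skip blank/exempt
    $2 != "COMPLETED" { pending = 1; next }              # still running
    $3 !~ /^(SUCCESS|SKIPPED|NEUTRAL)$/ { failed = 1 }   # completed, not green
    END { print (failed ? "FAILURE" : (pending ? "PENDING" : "GREEN")) }
  '
}

# Example:
#   printf 'unit\tCOMPLETED\tSUCCESS\nlint\tQUEUED\tUNKNOWN\n' \
#     | classify_rollup 'codecov|coverage'
```

PENDING here corresponds to the template's "keep waiting" branch; only GREEN matches the condition under which the outer loop may proceed to Phase 4.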
两个嵌套循环。请勿混淆:
  • 外循环是你的工具调用序列(每次迭代:一次
    /ok to test
    ,一次阻塞轮询,可能一次修复并推送)。它不是Bash循环。通过你发起新的工具调用推进。
  • 内循环是单个阻塞Bash工具调用,使用
    while true; do ... sleep 120; done
    。它在外循环的一次迭代中运行,当CI达到该迭代的终端状态时结束。
外循环仅在满足阶段4的网关条件时终止。
事实来源
gh pr view <PR_NUMBER> --repo $REPO --json statusCheckRollup
。该命令列出所有必填检查,包括
gh api .../actions/runs/.../jobs
未显示的外部状态上下文(GitLab CI、
copy-pr-bot
等)。
外循环迭代(每次迭代包含几个工具调用)
  1. latest_sha=$(git rev-parse HEAD)
    (一个Bash调用)。
  2. 在PR上发布
    /ok to test $latest_sha
    gh pr comment <PR_NUMBER> --repo $REPO --body "/ok to test $latest_sha"
  3. 一个阻塞Bash工具调用。这是内循环。完全复制以下模板,仅修改
    REPO
    PR
    bash
    REPO='NVIDIA/Megatron-LM'
    PR='<PR_NUMBER>'
    # 名称不区分大小写,匹配名称开头。
    EXEMPT='copy-pr-bot|is-not-external-contributor|greptile|coderabbit|codeowners|.*review|.*approval|codecov|coverage|build-docs|doc-build|readthedocs|sphinx'
    # 指示CI已完全运行的哨兵检查。若聚合网关任务重命名,需更新此处。
    SENTINEL='Nemo_CICD_Test'
    
    while true; do
      # 将CheckRun(.status / .conclusion)和StatusContext(.state)条目统一为相同的{name, status, conclusion}格式。
      rollup=$(gh pr view "$PR" --repo "$REPO" --json statusCheckRollup --jq '
        .statusCheckRollup[] | [
          (.name // .context // "?"),
          (if .__typename == "StatusContext" then
             (if (.state == "PENDING" or .state == "EXPECTED") then "IN_PROGRESS"
              else "COMPLETED" end)
           else (.status // "UNKNOWN") end),
          (if .__typename == "StatusContext" then
             (if .state == "SUCCESS" then "SUCCESS"
              elif (.state == "FAILURE" or .state == "ERROR") then "FAILURE"
              else "NEUTRAL" end)
           else (.conclusion // "UNKNOWN") end)
        ] | @tsv')
    
      # 哨兵:在CI聚合网关达到终端状态前,请勿标记为成功。在/ok to test触发运行前,哨兵不存在;CI运行时,它处于IN_PROGRESS状态。
      sentinel_line=$(printf '%s\n' "$rollup" | awk -F'\t' -v s="$SENTINEL" '$1 == s')
      sentinel_status=$(printf '%s\n' "$sentinel_line" | awk -F'\t' 'NR==1 {print $2}')
      if [ "$sentinel_status" != "COMPLETED" ]; then
        echo "=== $(date -u) 等待$SENTINEL(状态:${sentinel_status:-不存在}) ==="
        sleep 120
        continue
      fi
    
      # 分类非豁免检查(仅对名称应用豁免列表)。
      non_exempt=$(printf '%s\n' "$rollup" | awk -F'\t' -v p="^($EXEMPT)" 'tolower($1) !~ tolower(p)')
      failed=$(printf '%s\n' "$non_exempt" | awk -F'\t' '$2 == "COMPLETED" && $3 !~ /^(SUCCESS|SKIPPED|NEUTRAL)$/')
      pending=$(printf '%s\n' "$non_exempt" | awk -F'\t' '$2 != "COMPLETED"')
    
      if [ -n "$failed" ]; then
        echo "=== 非豁免检查失败 ==="
        printf '%s\n' "$failed"
        echo "RESULT=FAILURE"
        exit 0
      fi
      if [ -n "$pending" ]; then
        # 哨兵已完成,但非豁免检查仍在等待——罕见但可能发生。继续等待;请勿提交。
        echo "=== $(date -u) 哨兵已完成,但非豁免检查仍在等待 ==="
        printf '%s\n' "$pending"
        sleep 120
        continue
      fi
    
      echo "=== 所有非豁免检查已完成且通过 ==="
      printf '%s\n' "$non_exempt"
      echo "RESULT=GREEN"
      exit 0
    done
    该Bash调用会阻塞直到CI完成(数分钟到数小时)。请勿将其拆分为多个短轮询并穿插其他工具调用——这会浪费
    --max-turns
    ,并可能导致你丢失循环状态。
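  4. Read the tool output:
    • RESULT=FAILURE: diagnose via gh api repos/$REPO/actions/jobs/<JOB_ID>/logs (or the equivalent for an external context) and fix the code. The Phase 1 commit is immutable; fixes accumulate in a single rolling fix commit on top of it: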
      bash
      git add -A
      if git rev-parse --verify HEAD^2 >/dev/null 2>&1; then
        # HEAD has two parents → it is still the Phase 1 merge commit.
        # First failure of this run: create the fix commit.
        git commit -m "fix: post-CI corrections"
        git push origin "$BRANCH"
      else
        # HEAD is the existing fix commit → fold the new changes into it.
        git commit --amend --no-edit
        git push --force-with-lease origin "$BRANCH"
      fi
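      --force-with-lease (not --force): if someone pushed to the branch since the bot last fetched, the lease aborts the push instead of overwriting their work; fetch and decide how to proceed. Start a new outer-loop iteration from step 1 with the new HEAD SHA.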
    • RESULT=GREEN: the outer loop ends. Proceed to Phase 4.
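Why not wait for the run to register first?
After gh pr comment posts /ok to test <sha>, copy-pr-bot takes a few seconds to process it and trigger the CI run. The statusCheckRollup poll in step 3 will initially show checks as PENDING / QUEUED; that is fine. The inner loop treats this as "keep waiting" and sees the status update as CI progresses. No separate registration poll is needed.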

Anti-Patterns (what went wrong on run 24800621116)

  • Do NOT classify a queued/in-progress job as "infrastructure-blocked" and ship. A stuck queue drains eventually, so wait. If the job eventually passes, great; if it fails, go fix it.
  • Do NOT mark ready while any required check is PENDING / QUEUED / IN_PROGRESS on the HEAD SHA. A push is not a pass; only a COMPLETED + green status is.
  • Do NOT declare an untested job "pre-existing." Pre-existing means the test ran to completion and failed the same way on recent dev CI. A job that never ran on your PR cannot be pre-existing.
  • Do NOT use gh api .../actions/runs/.../jobs alone as the gate signal. External status contexts (GitLab CI pipelines, copy-pr-bot status, etc.) do NOT appear there. Use statusCheckRollup.
  • Do NOT start any background process. No &, no nohup, no run_in_background: true, no ScheduleWakeup. The GitHub Actions step owns your shell; when the step ends, every background process is killed and cannot resume.
  • Do NOT push directly to pull-request/<PR_NUMBER> branches. The community bot manages those branches when it processes /ok to test. Pushing to them directly breaks the CI trigger mechanism. Always push to your own sync branch (e.g. main2dev/<DATE>) instead.
  • Do NOT forget the Run functional tests and Run MBridge tests labels. Without Run functional tests, the internal GitLab functional tests do not run; without Run MBridge tests, the MBridge test suite does not run.

Failure Investigation

  1. Fetch logs: gh api repos/$REPO/actions/jobs/<JOB_ID>/logs
  2. Grep for: ImportError, ModuleNotFoundError, FAILED, would reformat, line-too-long, Traceback
  3. Read the error, understand root cause, fix the code
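
Steps 1 and 2 can be sketched as a small helper that scans a saved job log for the markers listed above. This is a minimal sketch, not the bot's actual tooling: the names LOG_PATTERNS and scan_job_log are hypothetical, and it assumes the log was first saved to a local file.

```shell
# Hypothetical helper (not part of the original skill).
# Assumes the log was saved beforehand, e.g.:
#   gh api repos/$REPO/actions/jobs/<JOB_ID>/logs > job.log
LOG_PATTERNS='ImportError|ModuleNotFoundError|FAILED|would reformat|line-too-long|Traceback'

scan_job_log() {
  # $1: path to the saved log file.
  # Prints every matching line prefixed with its line number;
  # returns non-zero when nothing matches (standard grep behavior).
  grep -nE "$LOG_PATTERNS" "$1"
}
```

Calling scan_job_log job.log prints the matching lines with their line numbers, which gives a starting point for reading the surrounding context in the full log.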

Common Issues

  • ImportError for a class/module: Dev test imports a class from a file where we took main's version. Restore only the missing class/function, not the entire file. If a file's classes are completely different between main and dev, keep both sets of code.
  • Formatting failures (black/pylint): Run black --config pyproject.toml on offending files. For pylint long-line or missing-docstring, edit directly.
  • Circular imports: isort can reorder imports in a way that introduces circular dependencies (e.g. megatron/legacy/model/__init__.py). Check git diff on __init__.py files to see if import order changed.
  • Dependency version mismatches: Taking main's pyproject.toml / uv.lock can change library versions in the CI container. Dev-only code may depend on newer versions (e.g. TransformerEngine's single_grouped_weight). If failures trace to missing kwargs or changed APIs in third-party libs, this is the cause.
  • API mismatch (AttributeError / TypeError at runtime): Main's callers reference methods that don't exist (or have different signatures) in dev's implementations. See "API Mismatch Detection" in Phase 1. Fix by adding shims, renaming methods, or adjusting call signatures.
  • Infrastructure / network failures (apt-get, pip download): Errors like archive.ubuntu.com unreachable or Connection timed out during package installation are transient CI infrastructure issues, not code problems. Retry CI with the same SHA. Do not investigate as code failures.
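
As a rough illustration of the last item, a first-pass triage could flag logs that contain the network markers before any code investigation starts. This is a sketch under stated assumptions: INFRA_PATTERNS and classify_failure_log are hypothetical names, and the patterns only approximate the error text quoted above, so real log wording may differ.

```shell
# Hypothetical triage helper (not part of the original skill).
# The patterns are assumptions derived from the examples in the list above.
INFRA_PATTERNS='archive\.ubuntu\.com|Connection timed out'

classify_failure_log() {
  # $1: path to a saved job log.
  # Prints "infra-suspect" when a network marker is present, else "code".
  if grep -qE "$INFRA_PATTERNS" "$1"; then
    echo "infra-suspect"
  else
    echo "code"
  fi
}
```

An "infra-suspect" result is only a hint to retry CI on the same SHA; a log can contain both a network error and a genuine code failure, so a repeat failure still needs a normal investigation.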

Pre-Existing Failure Verification

You MUST empirically verify before classifying any failure as pre-existing.
  1. gh pr list --repo $REPO --base dev --state merged --limit 3
  2. gh pr checks <PR_NUMBER> --repo $REPO on a recently merged dev PR
  3. If the same test bucket passes on recent dev CI → the failure is sync-caused. You must fix it.
  4. Only if the test also fails on recent dev CI can you classify it as pre-existing. Document with the dev PR number and CI run as evidence.
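
The decision in steps 3 and 4 can be sketched as a helper that takes the check results of a recently merged dev PR and reports how to treat a failing check. This is a minimal sketch: dev_check_verdict is a hypothetical name, and the input format (name, tab, bucket) assumes a gh version whose gh pr checks supports --json name,bucket.

```shell
# Hypothetical helper (not part of the original skill).
# Input on stdin: one "name<TAB>bucket" line per check from a recent dev PR,
# produced for example by (assumes gh supports --json on `gh pr checks`):
#   gh pr checks <PR_NUMBER> --repo "$REPO" --json name,bucket \
#     --jq '.[] | [.name, .bucket] | @tsv'
dev_check_verdict() {
  # $1: exact name of the check that failed on the sync PR.
  awk -F'\t' -v n="$1" '
    $1 == n { seen = 1; b = $2 }
    END {
      if (!seen)            print "unverified"              # never ran on dev
      else if (b == "fail") print "pre-existing-candidate"  # also fails on dev
      else if (b == "pass") print "sync-caused"             # passes on dev: fix it
      else                  print "unverified"              # pending, skipped, etc.
    }'
}
```

A "pre-existing-candidate" verdict still has to be documented with the dev PR number and CI run; "unverified" means the check cannot be classified as pre-existing from this PR alone.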

Internal GitLab Functional Tests

GitHub CI covers unit tests and some integration tests. Internal GitLab (gitlab-master.nvidia.com) runs additional functional tests on H100/GB200 hardware that may reveal issues GitHub CI does not catch. These surface in statusCheckRollup as external status contexts (the bash template already handles them via the __typename == "StatusContext" branch).
  • Fine-grained activation offloading failures, for example, only showed up in GitLab functional tests during PR #4291
  • If GitHub CI passes but a reviewer reports GitLab failures, investigate with the same rigor as GitHub CI failures
  • The sync PR should ideally pass both GitHub and GitLab CI before merge, but GitHub CI passing (i.e. the Phase 4 gate) is the minimum before gh pr ready

Phase 4: Mark PR Ready — Strict Gate

Run gh pr ready ONLY when every non-exempt required check on the latest CI run (against the current HEAD SHA) satisfies BOTH:
  1. status == "completed" (NOT queued, in_progress, pending, waiting, or requested).
  2. conclusion ∈ {"success", "skipped", "neutral"}.
If a non-exempt check is pending/queued/in-progress: keep polling; do not run gh pr ready. If it fails: go back to Phase 3's loop.
The exempt list (approval/coverage/docs) is defined in Phase 3; only those checks may be ignored.
A pre-existing failure (same test failing identically on recent dev CI) may be accepted, but ONLY after it has fully run, been empirically verified against dev, and documented in the PR body with evidence (dev PR number + CI run URL).
gh pr ready <PR_NUMBER> --repo $REPO
Then comment on the PR confirming it is ready for human review. The comment should include:
  • Which non-exempt checks passed (summary from the bash template's final ALL NON-EXEMPT CHECKS COMPLETED GREEN output)
  • Any documented pre-existing failures with evidence (dev PR number + CI run URL showing the same failure on recent dev CI)
  • Which files were taken from main vs. merged manually
  • Any API mismatches detected and fixed
  • Any pyproject.toml git source reconciliation performed
  • Links to the CI runs that validated the fixes
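
The two gate conditions can be re-checked in one pass over the same tab-separated rows the Phase 3 template produces (name, status, conclusion). The following is a minimal sketch, not the template itself: ready_gate is a hypothetical name, and the exempt pattern in the usage note is illustrative since the real list lives in Phase 3. Status and conclusion are lowercased before comparison because the rollup reports them in upper case while the conditions above are written in lower case.

```shell
# Hypothetical helper (not part of the original skill).
# stdin: "name<TAB>status<TAB>conclusion" rows; $1: regex of exempt check names.
# Exit status 0 only if every non-exempt row is completed AND green.
ready_gate() {
  awk -F'\t' -v p="$1" '
    BEGIN { bad = 0 }
    NF == 0 { next }                          # ignore blank lines
    tolower($1) ~ tolower(p) { next }         # exempt checks are skipped
    tolower($2) != "completed" { bad = 1; print "NOT DONE: " $1; next }
    tolower($3) !~ /^(success|skipped|neutral)$/ { bad = 1; print "NOT GREEN: " $1 }
    END { exit bad }'
}
```

For example, printf '%s\n' "$rollup" | ready_gate '^(approval|coverage|docs)' && gh pr ready <PR_NUMBER> --repo $REPO would only mark the PR ready when the gate holds. An empty pattern matches every name and would exempt everything, so the pattern must never be empty.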

Rules

  • Prioritize main over dev on genuine conflicts. Preserve dev-only additions that do not conflict.
  • Two-commit policy: the PR contains at most two bot-authored commits, namely the Phase 1 merge commit (immutable once pushed) and a single rolling fix commit on top. The fix commit is created on the first Phase 3 failure (normal push) and amended on every subsequent failure (git commit --amend --no-edit + git push --force-with-lease). Never modify the Phase 1 commit after pushing it; never let the fix-commit count exceed one.
  • CI triggers via comment: /ok to test <sha>
  • CI runs appear on branch pull-request/<PR_NUMBER>
  • Git committer identity: svcnvidia-nemo-ci
  • After editing imports, run isort on those files
  • Push directly to NVIDIA/Megatron-LM (not a fork). The bot uses a PAT with write access. CLAUDE.md says "never push directly" but that rule is for human contributors; the sync bot is an exception.
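
The two-commit policy can be sanity-checked before pushing by counting the first-parent commits that the sync branch adds on top of dev. This is a sketch under assumptions: check_two_commit_policy is a hypothetical name, and it assumes the sync branch was created from the given base ref, so that the merge commit's first parent is the dev tip.

```shell
# Hypothetical helper (not part of the original skill).
# $1: base ref the sync branch was created from (e.g. origin/dev).
# Counts commits on HEAD's first-parent chain that are not reachable from
# the base: the Phase 1 merge commit plus any fix commits stacked on it.
check_two_commit_policy() {
  n=$(git rev-list --count --first-parent "$1..HEAD") || return 2
  if [ "$n" -le 2 ]; then
    echo "OK ($n)"
  else
    echo "VIOLATION ($n)"
    return 1
  fi
}
```

Using --first-parent keeps the commits brought in from main out of the count; only the merge commit itself and whatever sits on top of it are counted.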