bump-dependency

Bump Dependency

End-to-end workflow for shipping a dependency bump in Megatron Bridge. Optimised for the case where TE, MCore, or another GPU-heavy pin moves forward — which often surfaces flakes that have to be quarantined before the PR can land.

The pipeline is always: edit → relock → push → /ok to test → watchdog → quarantine on red → re-trigger → repeat until green.

When to reach for this skill

- Bumping a git-source pin in `pyproject.toml` `override-dependencies` (e.g. `transformer-engine @ git+...@<ref>`).
- Bumping the `3rdparty/Megatron-LM` submodule.
- Any change that touches `uv.lock` and needs the full L0 + L1 matrix to prove out before merge.

For pure dep additions/removals without a CI loop, the `build-and-dependency` skill is enough.

Required context

Read first, then follow the steps below:

- @CONTRIBUTING.md: PR title/label policy, DCO sign-off
- @skills/build-and-dependency/SKILL.md: `uv lock` mechanics, container choice
- @skills/cicd/SKILL.md: how `copy-pr-bot` and `/ok to test` work
- @skills/testing/SKILL.md: `active/` vs `flaky/` directory layout, `git mv` quarantine recipe

Step 1 — Worktree and edit

Create a worktree off `main` per @CLAUDE.md. Then, before any `uv lock`:

```bash
git submodule update --init 3rdparty/Megatron-LM
```

The submodule must be initialised in the worktree; otherwise `uv lock` fails with "not a Python project" on the MCore path.
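
A pre-flight check can catch the uninitialised-submodule case before `uv lock` does. The sketch below is illustrative only: `submodule_ready` is a hypothetical helper name, and it assumes that the presence of `pyproject.toml` or `setup.py` is a reasonable proxy for "the submodule is checked out".

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): succeed only if the directory
# looks like a Python project. Assumption: pyproject.toml or setup.py being
# present means the submodule has been initialised.
submodule_ready() {
  [ -f "$1/pyproject.toml" ] || [ -f "$1/setup.py" ]
}
```

One possible use is `submodule_ready 3rdparty/Megatron-LM || git submodule update --init 3rdparty/Megatron-LM` at the top of a relock script.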

Edit the pin. For TE the canonical knob is the override line in `pyproject.toml`:

```toml
override-dependencies = [
    ...
    "transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
    ...
]
```

Use a branch name (`release_v2.15`) only when you want to track a moving tip; use a full SHA for reproducibility. TE branches use `release_vX.Y` (underscore), not `release/vX.Y`. Verify with `git ls-remote https://github.com/NVIDIA/TransformerEngine.git`.
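
The naming rule above can be checked locally before asking the remote. This is a minimal sketch under stated assumptions: `is_te_release_branch` is a hypothetical helper, and the pattern assumes release branches always have exactly two numeric components (`X.Y`), which the text implies but does not guarantee.

```bash
#!/usr/bin/env bash
# Hypothetical helper: accept only names shaped like release_vX.Y.
# Assumption: TE release branches carry exactly two numeric components.
is_te_release_branch() {
  printf '%s\n' "$1" | grep -Eq '^release_v[0-9]+\.[0-9]+$'
}
```

A name that passes can then be confirmed against the remote with `git ls-remote --exit-code --heads https://github.com/NVIDIA/TransformerEngine.git <branch>`, which exits non-zero when no matching ref exists.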

Step 2 — Regenerate the lockfile

Run `uv lock` inside the project container per @skills/build-and-dependency/SKILL.md "Regenerating uv.lock". Then confirm only the intended packages moved:

```bash
git diff --stat pyproject.toml uv.lock
```

If the diff carries changes you didn't ask for (transitive movements you can't explain), stop and investigate before pushing. Note that `override-dependencies` carries CVE floors that float, so unrelated packages bumping by a patch version is expected; accept those, don't revert them.
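
Since `--stat` only reports line counts, it can help to list which packages actually changed version. The following is a rough sketch, not the project's tooling: `lock_delta` is a hypothetical helper, and it assumes each `uv.lock` package entry has a `name = "..."` line directly above its `version = "..."` line, so the name shows up as diff context. Git-source pins whose version string stays the same while the commit changes will not be reported.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a unified diff of uv.lock on stdin and print one
# "Updated <name> <old> -> <new>" line per changed version.
# Assumption: the package name line precedes the version line in each entry.
lock_delta() {
  awk '
    # Track the most recent package name from context or changed lines.
    /^[ +-]name = "/   { split($0, a, "\""); name = a[2] }
    # Remember the removed version until the matching added one appears.
    /^-version = "/    { split($0, a, "\""); old = a[2] }
    /^[+]version = "/  { split($0, a, "\"")
                         if (old != "") print "Updated " name " " old " -> " a[2]
                         old = "" }
  '
}
```

Used as `git diff uv.lock | lock_delta`, the output lines have the same shape as the "Lockfile delta" entries in the PR body.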

Step 3 — Commit and push

Sign-off + signed-commit + PR title format per @CONTRIBUTING.md and @skills/cicd/SKILL.md "Commit and PR Workflow". For a bump:

```bash
git add pyproject.toml uv.lock
git commit -S -s -m "[build] chore: bump <package> to <ref>"
git push -u origin <branch-name>
```

A signed commit (`-S`) lets `copy-pr-bot` trigger CI without a manual `/ok to test` for the first push, but you'll still post `/ok to test` on every subsequent SHA in this loop (Step 5).

Step 4 — Open the PR

Title and labels per @CONTRIBUTING.md. Two bump-specific requirements:

- Apply `needs-more-tests`. This is mandatory for a bump; it expands the matrix from L0 to L0+L1.
- For a high-blast-radius bump (TE, MCore submodule, anything that touches CUDA kernels), also apply `full-test-suite` to pull L2 into the PR run. L2 covers VL models, checkpoint conversion, and heavy quantization, which otherwise only run on schedule.

The PR body template below is the durable record of the bump:

```markdown
<details><summary>Claude summary</summary>

What

- Bump <package> to <ref>.
- Regenerate `uv.lock`.

Lockfile delta

Updated <package> <old> -> <new>

Test plan

- L0 CI green
- L1 CI green (label `needs-more-tests` applied)

Quarantined tests (this bump)

None yet; will be appended as flakes are identified during CI iteration.

</details>
```

To update the PR title or body later, use `gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"`; never `gh pr edit`.

Step 5 — Trigger CI on the exact SHA

Trigger mechanics live in @skills/cicd/SKILL.md "How CI Is Triggered". For this loop the rule is simple: on every new SHA you push, post `/ok to test $(git rev-parse HEAD)` as a PR comment, even if your commits are signed. This guarantees the run targets the SHA you actually want exercised and re-fires anything that got cancelled or cached.
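
Because the comment must name the exact SHA, it can be worth refusing abbreviated or mistyped values before posting. A minimal sketch, assuming SHA-1 object names (40 hex characters); `ok_to_test_body` is a hypothetical helper, not something the repo provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the comment body for a full commit SHA,
# and fail for anything that is not 40 lowercase hex characters.
ok_to_test_body() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{40}$' || return 1
  printf '/ok to test %s\n' "$1"
}
```

The result can be passed straight to the comment command, for example `gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge --body "$(ok_to_test_body "$(git rev-parse HEAD)")"`.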

Step 6 — Attach the watchdog (always; never a cronjob)

For a bump PR you want a single live process that emits per-job state changes for the `CICD NeMo` workflow only. Other workflows (docs, wheel, copyright, install-test) are noise here; the gate that decides green-or-red for a bump is `CICD NeMo`.

Always attach a watchdog with the Monitor tool. Never schedule wakeups or cronjobs for this loop. A watchdog gives you:

- Reaction within roughly one poll interval (60 seconds in the watchdog script) on every job transition.
- A single live process, with no scattered scheduled-wakeup state to reason about.
- Natural early termination via `TaskStop` once the run is green.

Watchdog script

Save to `/tmp/watchdog-<PR>.sh` and `chmod +x` it:

```bash
#!/usr/bin/env bash
# Watchdog: monitor "CICD NeMo" runs on pull-request/<PR> and emit
# per-job state changes. Stays alive across re-runs (new commits).

set -u
PR=<PR>
REPO=NVIDIA-NeMo/Megatron-Bridge
BRANCH="pull-request/$PR"

prev_run_id=""
declare -A prev_state

# Prefix every event with a UTC timestamp.
emit() { echo "[$(date -u +%H:%M:%SZ)] $*"; }

while true; do
  # Latest "CICD NeMo" run on the PR branch.
  run_json=$(gh run list --repo "$REPO" --workflow "CICD NeMo" \
    --branch "$BRANCH" --limit 1 \
    --json databaseId,status,conclusion,headSha 2>/dev/null || echo "[]")
  run_id=$(echo "$run_json" | jq -r '.[0].databaseId // empty')
  run_status=$(echo "$run_json" | jq -r '.[0].status // empty')
  run_conclusion=$(echo "$run_json" | jq -r '.[0].conclusion // empty')
  run_sha=$(echo "$run_json" | jq -r '.[0].headSha // empty')

  # No run yet: wait and poll again.
  if [[ -z "$run_id" ]]; then
    sleep 30
    continue
  fi

  # A new run (new commit or re-trigger) resets the per-job state.
  if [[ "$run_id" != "$prev_run_id" ]]; then
    emit "RUN ${run_id} STARTED sha=${run_sha:0:8} status=${run_status}"
    prev_run_id="$run_id"
    unset prev_state
    declare -A prev_state
  fi

  # Emit only the jobs whose status/conclusion changed since the last poll.
  jobs_json=$(gh run view "$run_id" --repo "$REPO" --json jobs 2>/dev/null || echo "{}")
  while IFS=$'\t' read -r name status conclusion; do
    [[ -z "$name" ]] && continue
    cur="${status}/${conclusion}"
    if [[ "${prev_state[$name]:-}" != "$cur" ]]; then
      case "$status" in
        completed) emit "JOB ${name} -> ${conclusion}" ;;
        in_progress)
          if [[ -z "${prev_state[$name]:-}" || "${prev_state[$name]}" == "queued/" ]]; then
            emit "JOB ${name} -> in_progress"
          fi
          ;;
      esac
      prev_state[$name]="$cur"
    fi
  done < <(echo "$jobs_json" | jq -r '.jobs[]? | [.name, .status, (.conclusion // "")] | @tsv')

  if [[ "$run_status" == "completed" ]]; then
    emit "RUN ${run_id} COMPLETED conclusion=${run_conclusion}"
  fi

  sleep 60
done
```

Arming the watchdog

```text
Monitor(
  description="CICD NeMo run state changes on PR <N>",
  command="bash /tmp/watchdog-<N>.sh",
  persistent=true,
  timeout_ms=3600000
)
```

`persistent=true` keeps it alive across re-runs (you'll push more commits when quarantining flakes). Stop it with `TaskStop(<task-id>)` once the run is green.
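
The watchdog prints one line per transition in the shape produced by `emit`, so failed jobs can be pulled out mechanically. A small sketch, assuming the output has also been captured to a file or pipe; `failed_jobs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read watchdog output on stdin and print the name of
# every job reported as "-> failure". Assumes lines look like
# "[HH:MM:SSZ] JOB <name> -> <conclusion>", as emitted by the watchdog script.
failed_jobs() {
  sed -n 's/^\[[^]]*\] JOB \(.*\) -> failure$/\1/p'
}
```

Each printed name is a CI job name, which is what the quarantine step needs as input.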

Why never a cronjob / scheduled wakeup

- Cronjobs run blind: they fire on a clock, not on an event. You'll either over-poll (cache miss every wake-up) or miss long stalls.
- Wakeups can't easily fan out to "tell me whenever a job transitions"; they only resume the agent on a fixed interval.
- A persistent Monitor surfaces every job edge in real time and exits cleanly when the work is done.

Step 7 — Quarantine on red, then iterate

When a `JOB <name> -> failure` event fires:

1. Triage the failure: is it the bump or a flake? Skim the logs:

   ```bash
   RUN_ID=<from "RUN ... STARTED" event>
   gh run view "$RUN_ID" --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > /tmp/run.log
   wc -l /tmp/run.log
   tail -200 /tmp/run.log
   ```

   This is the bump-specific judgement call: only quarantine if the failure reproduces on `main` or is clearly unrelated infrastructure. If the failure is caused by the bump (real regression), stop quarantining and either fix the underlying issue or revert the bump. Quarantining a real regression hides the very signal the bump PR exists to surface.
2. Move the launch script to `flaky/` per @skills/testing/SKILL.md "Moving a Test to Flaky". Map a CI job name to its launch script via:
   - prefix `gb200_` → `gb200/active/`, otherwise `h100/active/`
   - the rest is the script's basename without `.sh`
3. Append to the PR body's Quarantined tests section with a one-line reason and a follow-up tracking link if you have one. This is the durable record of what this bump deferred; the section exists precisely so a reviewer can see at a glance which flakes were side-stepped to land the bump.
4. Commit, push, retrigger:

   ```bash
   git commit -S -s -m "[ci] chore: quarantine flaky <test> for <package> bump"
   git push
   gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
     --body "/ok to test $(git rev-parse HEAD)"
   ```
5. Update the PR body via `gh api -X PATCH` so the quarantine list stays current.

The watchdog is persistent: it picks up the new run automatically and emits `RUN <id> STARTED` for the new attempt. Loop back to item 1 of this list on the next failure.
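
The job-name mapping described in this step is mechanical enough to script. A minimal sketch of that rule: `job_to_script` is a hypothetical helper, and the paths it prints are relative to wherever the launch scripts live, which this section does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a CI job name to its launch script path.
# Rule from the text: a gb200_ prefix selects gb200/active/ and is not part
# of the filename; any other job maps to h100/active/.
job_to_script() {
  case "$1" in
    gb200_*) printf 'gb200/active/%s.sh\n' "${1#gb200_}" ;;
    *)       printf 'h100/active/%s.sh\n' "$1" ;;
  esac
}
```

The printed path is the source argument for the `git mv` into the matching `flaky/` directory.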

Step 8 — Stop when green

`RUN <id> COMPLETED conclusion=success` is the exit condition. Then:

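```bash
# Summarise check states for the PR.
gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | awk '{print $2}' | sort | uniq -c
# Stop the watchdog. This is a tool call, not a shell command:
#   TaskStop(<watchdog-task-id>)
# Push the final PR body.
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"
```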
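
The `awk '{print $2}'` summary splits on any whitespace, so a check whose name contains spaces would be miscounted. The variant below is a sketch that assumes `gh pr checks` emits tab-separated columns (name, state, ...) when piped; `summarize_checks` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count checks per state from tab-separated input on stdin.
# Assumption: column 2 holds the state and columns are separated by tabs.
summarize_checks() {
  awk -F '\t' 'NF >= 2 { count[$2]++ } END { for (s in count) print count[s], s }' | sort
}
```

It would be used as `gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | summarize_checks`.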

Common pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| Wrong TE branch ref (`release/v2.15`) silently resolves to nothing | TE uses `release_vX.Y` with an underscore | Verify with `git ls-remote` before locking |
| Lockfile diff includes unrelated CVE-pinned packages | `override-dependencies` carries floors that float | Re-run lock and accept; don't try to revert those |
| Signed first push triggers CI but later pushes don't | After the first signed commit in this loop, `copy-pr-bot` trusts a new SHA only when `/ok to test` is posted for it | Always re-post `/ok to test $(git rev-parse HEAD)` per Step 5 |
| Watchdog goes silent for 30+ min | `gh` rate-limited or auth expired | Increase the poll interval; run `gh auth status`; restart Monitor |
| Job name doesn't map to a script in `active/` | `gb200_` prefix is the hardware indicator, not part of the filename | Strip `gb200_` and look in `gb200/active/` |

Anti-patterns

- Cron / scheduled wakeups for this loop. Always Monitor.
- Polling all workflows. Filter to `CICD NeMo`; the rest are noise for a bump.
- Quarantining a real regression to "make CI green." That defeats the purpose of the bump PR. Only quarantine if the failure reproduces on `main` or is clearly unrelated infrastructure.
- `gh pr edit` for title/body. Use `gh api -X PATCH`.
- HEREDOC in `gh pr create --body`. Always go through a tmpfile + `--body-file`.
- Bundling unrelated changes (feature work, refactors) into a bump PR. Bumps should stay surgical so CI failures attribute cleanly.
