bump-dependency
Bump Dependency
End-to-end workflow for shipping a dependency bump in Megatron Bridge.
Optimised for the case where TE, MCore, or another GPU-heavy pin moves
forward — which often surfaces flakes that have to be quarantined before
the PR can land.
The pipeline is always: edit → relock → push → /ok to test → watchdog →
quarantine on red → re-trigger → repeat until green.
When to reach for this skill
- Bumping a git-source pin in `pyproject.toml` (e.g. the `transformer-engine @ git+...@<ref>` entry in `override-dependencies`).
- Bumping the `3rdparty/Megatron-LM` submodule.
- Any change that touches `uv.lock` and needs the full L0 + L1 matrix to prove out before merge.

For pure dep additions/removals without a CI loop, the `build-and-dependency` skill is enough.
Required context
Read first, then follow the steps below:
- @CONTRIBUTING.md: PR title/label policy, DCO sign-off
- @skills/build-and-dependency/SKILL.md: `uv lock` mechanics, container choice
- @skills/cicd/SKILL.md: how `copy-pr-bot` and `/ok to test` work
- @skills/testing/SKILL.md: `active/` vs `flaky/` directory layout, `git mv` quarantine recipe
Step 1 — Worktree and edit
Create a worktree off `main` per @CLAUDE.md. Then, before any `uv lock`:

```bash
git submodule update --init 3rdparty/Megatron-LM
```

The submodule must be initialised in the worktree, or `uv lock` errors with "not a Python project" on the MCore path.

Edit the pin. For TE the canonical knob is the override line in `pyproject.toml`:

```toml
override-dependencies = [
    ...
    "transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
    ...
]
```

Use a branch name (`release_v2.15`) only when you want to track a moving tip; use a full SHA for reproducibility. TE branches use `release_vX.Y` (underscore), not `release/vX.Y`. Verify with `git ls-remote https://github.com/NVIDIA/TransformerEngine.git`.
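The branch-versus-SHA rule above can be checked locally before editing the pin. The following is a minimal sketch, not part of the repository: `classify_te_ref` is a hypothetical helper name, and its patterns encode only the conventions stated above (a full 40-character hex SHA, or a `release_vX.Y` branch).

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption, not a repo script): classify a candidate TE ref.
# Prints one of: sha | release-branch | wrong-separator | other
classify_te_ref() {
    local ref="$1"
    if [[ "$ref" =~ ^[0-9a-f]{40}$ ]]; then
        echo "sha"                # full SHA: reproducible pin
    elif [[ "$ref" =~ ^release_v[0-9]+\.[0-9]+$ ]]; then
        echo "release-branch"     # underscore form: tracks a moving tip
    elif [[ "$ref" =~ ^release/v[0-9]+\.[0-9]+$ ]]; then
        echo "wrong-separator"    # slash form, which TE does not use
    else
        echo "other"              # tag, short SHA, or some other branch
    fi
}

classify_te_ref "release_v2.15"   # prints: release-branch
classify_te_ref "release/v2.15"   # prints: wrong-separator
```

This is a format check only. `git ls-remote` remains the authoritative test that the ref actually exists on the remote.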
Step 2 — Regenerate the lockfile
Run `uv lock` inside the project container per @skills/build-and-dependency/SKILL.md "Regenerating uv.lock". Then confirm only the intended packages moved:

```bash
git diff --stat pyproject.toml uv.lock
```

If the diff carries changes you didn't ask for (transitive movements you can't explain), stop and investigate before pushing. Note that `override-dependencies` carries CVE floors that float: unrelated packages bumping by a patch version is expected; accept those, don't revert them.
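`git diff --stat` only reports how many lines changed, so a per-package view can make the "only the intended packages moved" check easier. The sketch below is one possible way to get it, under the assumption that each `uv.lock` package entry has a `name = "..."` line directly followed by a `version = "..."` line, so the name shows up as diff context. `list_version_changes` is a hypothetical helper name, and the sample input is illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption about uv.lock layout, see lead-in):
# read a unified diff on stdin and print "<package> <old> -> <new>"
# for every package whose version line changed.
list_version_changes() {
    awk '
        # Track the most recent top-level package name (context, added, or removed).
        /^[ +-]name = "/ { split($0, parts, "\""); pkg = parts[2] }
        # Remember the removed version for that package.
        /^-version = "/  { split($0, parts, "\""); old[pkg] = parts[2] }
        # On the added version, print the transition.
        /^[+]version = "/ {
            split($0, parts, "\"")
            printf "%s %s -> %s\n", pkg, ((pkg in old) ? old[pkg] : "(new)"), parts[2]
        }
    '
}

# Illustrative input only; in practice: git diff uv.lock | list_version_changes
list_version_changes <<'EOF'
 [[package]]
 name = "example-pkg"
-version = "1.0.0"
+version = "1.0.1"
EOF
# prints: example-pkg 1.0.0 -> 1.0.1
```

Removed packages and git-sourced pins whose version string stays the same are not reported by this sketch, so it complements the raw diff and does not replace reading it.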
Step 3 — Commit and push
Sign-off, signed commit, and PR title format follow @CONTRIBUTING.md and @skills/cicd/SKILL.md "Commit and PR Workflow". For a bump:

```bash
git add pyproject.toml uv.lock
git commit -S -s -m "[build] chore: bump <package> to <ref>"
git push -u origin <branch-name>
```

A signed commit (`-S`) lets `copy-pr-bot` trigger CI without a manual `/ok to test` for the first push, but you'll still post `/ok to test` on every subsequent SHA in this loop (Step 5).
Step 4 — Open the PR
Title and labels per @CONTRIBUTING.md. Two bump-specific requirements:
- Apply `needs-more-tests`: mandatory for a bump; expands the matrix from L0 to L0+L1.
- For a high-blast-radius bump (TE, MCore submodule, anything that touches CUDA kernels), also apply `full-test-suite` to pull L2 into the PR run. L2 covers VL models, checkpoint conversion, and heavy quantization, which otherwise only run on schedule.

The PR body template (this is the durable record of the bump):

````markdown
<details><summary>Claude summary</summary>

## What

- Bump <package> to <ref>.
- Regenerate `uv.lock`.

## Lockfile delta

```
Updated <package> <old> -> <new>
```

## Test plan

- L0 CI green
- L1 CI green (`needs-more-tests` label applied)

## Quarantined tests (this bump)

None yet; will be appended as flakes are identified during CI iteration.

</details>
````

To update the PR title or body later, use `gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"`, never `gh pr edit`.
Step 5 — Trigger CI on the exact SHA
Trigger mechanics live in @skills/cicd/SKILL.md "How CI Is Triggered". For this loop the rule is simple: on every new SHA you push, post `/ok to test $(git rev-parse HEAD)` as a PR comment, even if your commits are signed. Let the shell expand the substitution so the comment carries the full SHA. This guarantees the run targets the SHA you actually want exercised and re-fires anything that got cancelled or cached.
Step 6 — Attach the watchdog (always; never a cronjob)
For a bump PR you want a single live process that emits per-job state changes for the `CICD NeMo` workflow only. Other workflows (docs, wheel, copyright, install-test) are noise here; the gate that decides green-or-red for a bump is `CICD NeMo`.

**Always attach a watchdog with the Monitor tool. Never schedule wakeups or cronjobs for this loop.** A watchdog gives you:
- Reaction to every job transition within about a minute (the script polls every 60 seconds).
- A single live process, with no scattered scheduled-wakeup state to reason about.
- Natural early termination via `TaskStop` once the run is green.
Watchdog script
Save to `/tmp/watchdog-<PR>.sh` and `chmod +x`:

```bash
#!/usr/bin/env bash
# Watchdog: monitor "CICD NeMo" runs on pull-request/<PR> and emit
# per-job state changes. Stays alive across re-runs (new commits).
set -u

PR="<PR>"    # replace <PR> with the PR number
REPO=NVIDIA-NeMo/Megatron-Bridge
BRANCH="pull-request/$PR"
prev_run_id=""
prev_run_status=""
declare -A prev_state

emit() { echo "[$(date -u +%H:%M:%SZ)] $*"; }

while true; do
    # Latest "CICD NeMo" run on the PR branch; fall back to an empty list on error.
    run_json=$(gh run list --repo "$REPO" --workflow "CICD NeMo" \
        --branch "$BRANCH" --limit 1 \
        --json databaseId,status,conclusion,headSha 2>/dev/null || echo "[]")
    run_id=$(echo "$run_json" | jq -r '.[0].databaseId // empty')
    run_status=$(echo "$run_json" | jq -r '.[0].status // empty')
    run_conclusion=$(echo "$run_json" | jq -r '.[0].conclusion // empty')
    run_sha=$(echo "$run_json" | jq -r '.[0].headSha // empty')

    if [[ -z "$run_id" ]]; then
        sleep 30; continue
    fi

    # New run (new commit or re-trigger): announce it and reset per-job state.
    if [[ "$run_id" != "$prev_run_id" ]]; then
        emit "RUN ${run_id} STARTED sha=${run_sha:0:8} status=${run_status}"
        prev_run_id="$run_id"
        prev_run_status=""
        unset prev_state
        declare -A prev_state
    fi

    jobs_json=$(gh run view "$run_id" --repo "$REPO" --json jobs 2>/dev/null || echo "{}")
    while IFS=$'\t' read -r name status conclusion; do
        [[ -z "$name" ]] && continue
        cur="${status}/${conclusion}"
        if [[ "${prev_state[$name]:-}" != "$cur" ]]; then
            case "$status" in
                completed)
                    emit "JOB ${name} -> ${conclusion}" ;;
                in_progress)
                    if [[ -z "${prev_state[$name]:-}" || "${prev_state[$name]}" == "queued/" ]]; then
                        emit "JOB ${name} -> in_progress"
                    fi ;;
            esac
            prev_state[$name]="$cur"
        fi
    done < <(echo "$jobs_json" | jq -r '.jobs[]? | [.name, .status, (.conclusion // "")] | @tsv')

    # Emit run completion once per transition instead of on every poll.
    if [[ "$run_status" == "completed" && "$prev_run_status" != "completed" ]]; then
        emit "RUN ${run_id} COMPLETED conclusion=${run_conclusion}"
    fi
    prev_run_status="$run_status"

    sleep 60
done
```
Arming the watchdog
```text
Monitor(
  description="CICD NeMo run state changes on PR <N>",
  command="bash /tmp/watchdog-<N>.sh",
  persistent=true,
  timeout_ms=3600000
)
```

`persistent: true` keeps the monitor attached across re-runs until you stop it with `TaskStop(<task-id>)`.
Why never a cronjob / scheduled wakeup
- Cronjobs run blind — they fire on a clock, not on an event. You'll either over-poll (cache miss every wake-up) or miss long stalls.
- Wakeups can't easily fan out to "tell me whenever a job transitions" — they only resume the agent on a fixed interval.
- A persistent Monitor surfaces every job edge in real time and exits cleanly when the work is done.
Step 7 — Quarantine on red, then iterate
When a `JOB <name> -> failure` event fires:

1. Triage the failure: is it the bump or a flake? Skim the logs:

   ```bash
   RUN_ID=<from "RUN ... STARTED" event>
   gh run view "$RUN_ID" --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > /tmp/run.log
   wc -l /tmp/run.log
   tail -200 /tmp/run.log
   ```

   This is the bump-specific judgement call: only quarantine if the failure reproduces on `main` or is clearly unrelated infrastructure. If the failure is caused by the bump (real regression), stop quarantining; fix the underlying issue or revert the bump. Quarantining a real regression hides the very signal the bump PR exists to surface.

2. Move the launch script to `flaky/` per @skills/testing/SKILL.md "Moving a Test to Flaky". Map a CI job name to its launch script via:
   - prefix `gb200_` → `gb200/active/`, otherwise `h100/active/`
   - the rest is the script's basename without `.sh`

3. Append to the PR body's Quarantined tests section with a one-line reason and a follow-up tracking link if you have one. This is the durable record of what this bump deferred. The section exists precisely so a reviewer can see at a glance which flakes were side-stepped to land the bump.

4. Commit, push, retrigger:

   ```bash
   git commit -S -s -m "[ci] chore: quarantine flaky <test> for <package> bump"
   git push
   gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
       --body "/ok to test $(git rev-parse HEAD)"
   ```

5. Update the PR body via `gh api PATCH` so the quarantine list stays current.

The watchdog is persistent: it picks up the new run automatically and emits `RUN <id> STARTED` for the new attempt. Loop back to item 1 of this list.
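The job-name mapping described above can be sketched as a small function. This is illustrative only: `job_to_script` is a hypothetical helper, the `gb200_` prefix is assumed to be stripped before the basename is formed, and the `flaky/` sibling of `active/` is assumed from the directory layout named in @skills/testing/SKILL.md. Paths are relative to the launch-script root, which is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumptions in the lead-in): map a CI job name to its
# launch script. The second argument selects the directory: active (default) or flaky.
job_to_script() {
    local job="$1" state="${2:-active}"
    if [[ "$job" == gb200_* ]]; then
        # gb200_ prefix selects the gb200 tree; the remainder is the basename.
        echo "gb200/${state}/${job#gb200_}.sh"
    else
        # Everything else lives under the h100 tree.
        echo "h100/${state}/${job}.sh"
    fi
}

job_to_script "gb200_example_job"        # prints: gb200/active/example_job.sh
job_to_script "example_job" flaky        # prints: h100/flaky/example_job.sh
```

With both paths available, the quarantine move becomes `git mv "$(job_to_script "$job")" "$(job_to_script "$job" flaky)"`, run from the launch-script root. Check the result against the recipe in @skills/testing/SKILL.md before committing.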
Step 8 — Stop when green
`RUN <id> COMPLETED conclusion=success` is the exit condition. Then:

```bash
# Tally check states (gh output is tab-separated when piped).
gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | awk -F'\t' '{print $2}' | sort | uniq -c
# Stop the watchdog (agent tool call, not a shell command).
TaskStop(<watchdog-task-id>)
# Push the final PR body.
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"
```
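Steps 7 and 8 each react to one kind of watchdog line. As a rough sketch of that dispatch, the hypothetical helper `next_action` below maps a single line, in the format produced by the watchdog's `emit` calls, to the step that handles it. The action names are invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide what the loop does for one watchdog output line.
# Prints one of: triage | stop | wait
next_action() {
    local line="$1"
    case "$line" in
        *"JOB "*" -> failure")
            echo "triage" ;;   # Step 7: bump regression or flake?
        *"RUN "*" COMPLETED conclusion=success")
            echo "stop" ;;     # Step 8: exit condition reached
        *"RUN "*" COMPLETED conclusion="*)
            echo "triage" ;;   # run ended without success
        *)
            echo "wait" ;;     # in_progress, success of a single job, etc.
    esac
}

next_action "[12:00:00Z] JOB example_job -> failure"               # prints: triage
next_action "[12:30:00Z] RUN 123 COMPLETED conclusion=success"     # prints: stop
```

The success pattern is listed before the generic `COMPLETED` pattern because `case` takes the first match.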
Common pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| Wrong TE branch ref (`release/vX.Y`) | TE uses `release_vX.Y` (underscore) | Verify with `git ls-remote https://github.com/NVIDIA/TransformerEngine.git` before locking |
| Lockfile diff includes unrelated CVE-pinned packages | `override-dependencies` carries CVE floors that float | Re-run lock and accept; don't try to revert those |
| Signed first push triggers CI but later pushes don't | Only the first push is triggered automatically; later SHAs need an explicit `/ok to test` | Always re-post `/ok to test $(git rev-parse HEAD)` after every push (Step 5) |
| Watchdog goes silent for 30+ min | | Bump poll interval |
| Job name doesn't map to a script in `active/` | | Strip the `gb200_` prefix; the rest is the script's basename without `.sh` (Step 7) |
Anti-patterns
- Cron / scheduled wakeups for this loop. Always Monitor.
- Polling all workflows. Filter to `CICD NeMo`; the rest are noise for a bump.
- Quarantining a real regression to "make CI green." That defeats the purpose of the bump PR. Only quarantine if the failure reproduces on `main` or is clearly unrelated infrastructure.
- `gh pr edit` for title/body. Use `gh api PATCH`.
- HEREDOC in `gh pr create --body`. Always go through a tmpfile + `--body-file`.
- Bundling unrelated changes (feature work, refactors) into a bump PR. Bumps should stay surgical so CI failures attribute cleanly.