bump-dependency

Bump Dependency

End-to-end workflow for shipping a dependency bump in Megatron Bridge. Optimised for the case where TE, MCore, or another GPU-heavy pin moves forward — which often surfaces flakes that have to be quarantined before the PR can land.
The pipeline is always: edit → relock → push → /ok to test → watchdog → quarantine on red → re-trigger → repeat until green.

When to reach for this skill

  • Bumping a git-source pin in
    pyproject.toml
    override-dependencies
    (e.g.
    transformer-engine @ git+...@<ref>
    ).
  • Bumping the
    3rdparty/Megatron-LM
    submodule.
  • Any change that touches
    uv.lock
    and needs the full L0 + L1 matrix to prove out before merge.
For pure dep additions/removals without a CI loop, the
build-and-dependency
skill is enough.

Required context

Read first, then follow the steps below:
  • @CONTRIBUTING.md — PR title/label policy, DCO sign-off
  • @skills/build-and-dependency/SKILL.md —
    uv lock
    mechanics, container choice
  • @skills/cicd/SKILL.md — how
    copy-pr-bot
    and
    /ok to test
    work
  • @skills/testing/SKILL.md —
    active/
    vs
    flaky/
    directory layout,
    git mv
    quarantine recipe

Step 1 — Worktree and edit

Create a worktree off
main
per @CLAUDE.md. Then, before any
uv lock
:
bash
git submodule update --init 3rdparty/Megatron-LM
The submodule must be initialised in the worktree or
uv lock
errors with "not a Python project" on the MCore path.
Edit the pin. For TE the canonical knob is the override line in
pyproject.toml
:
toml
override-dependencies = [
    ...
    "transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
    ...
]
Use a branch name (
release_v2.15
) only when you want to track a moving tip; use a full SHA for reproducibility. TE branches use
release_vX.Y
(underscore), not
release/vX.Y
. Verify with
git ls-remote https://github.com/NVIDIA/TransformerEngine.git
.
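One way to run that check before pinning is to filter the `git ls-remote` listing for the exact ref name. The sketch below is a hypothetical helper (`has_branch` is not part of the repository); it reads `git ls-remote --heads` output on stdin and succeeds only on an exact branch match, so `release/v2.15` fails where `release_v2.15` passes.

```bash
# Hypothetical helper: exit 0 if the ls-remote listing on stdin contains
# refs/heads/<branch>, exit 1 otherwise. Each input line is "<sha><TAB><ref>".
has_branch() {
  awk -v ref="refs/heads/$1" '$2 == ref { found = 1 } END { exit !found }'
}

# Usage (needs network access):
#   git ls-remote --heads https://github.com/NVIDIA/TransformerEngine.git \
#     | has_branch release_v2.15 && echo "branch exists"
```

An exact match on the second column avoids false positives from branches that merely share a prefix.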

Step 2 — Regenerate the lockfile

Run
uv lock
inside the project container per @skills/build-and-dependency/SKILL.md "Regenerating uv.lock". Then confirm only the intended packages moved:
bash
git diff --stat pyproject.toml uv.lock
If the diff carries changes you didn't ask for (transitive movements you can't explain), stop and investigate before pushing. Note that
override-dependencies
carries CVE floors that float — unrelated packages bumping by a patch version is expected; accept those, don't revert them.
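To see which packages actually moved, rather than only how many lines changed, the lockfile diff can be reduced to a list of package names. The sketch below is a hypothetical helper, not something the repository ships. It assumes `uv.lock` stores each package as a `[[package]]` table whose `name = "..."` line sits directly above its `version` and `source` lines, so the name appears as diff context.

```bash
# Hypothetical helper: read a unified diff of uv.lock on stdin and print the
# name of every package whose version or source line changed.
changed_packages() {
  awk '
    # Remember the most recent package name (context, added, or removed line).
    /^[ +-]name = "/ { split($0, parts, "\""); name = parts[2] }
    # A changed version/source line belongs to the package seen last.
    /^[+-](version|source) = / && name != "" { print name }
  ' | sort -u
}

# Usage:
#   git diff -U5 uv.lock | changed_packages
```

Any name in the output that is neither the bumped package nor an expected floating CVE floor is a candidate for investigation before pushing.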

Step 3 — Commit and push

Sign-off + signed-commit + PR title format per @CONTRIBUTING.md and @skills/cicd/SKILL.md "Commit and PR Workflow". For a bump:
bash
git add pyproject.toml uv.lock
git commit -S -s -m "[build] chore: bump <package> to <ref>"
git push -u origin <branch-name>
A signed commit (
-S
) lets
copy-pr-bot
trigger CI without manual
/ok to test
for the first push — but you'll still post
/ok to test
on every subsequent SHA in this loop (Step 5).

Step 4 — Open the PR

Title and labels per @CONTRIBUTING.md. Two bump-specific requirements:
  • Apply
    needs-more-tests
    . This label is mandatory for a bump; it expands the matrix from L0 to L0+L1.
  • For a high-blast-radius bump (TE, MCore submodule, anything that touches CUDA kernels), also apply
    full-test-suite
    to pull L2 into the PR run. L2 covers VL models, checkpoint conversion, and heavy quantization which otherwise only run on schedule.
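The label rule above can be written down as a small decision helper. This is a hypothetical sketch: `bump_labels` does not exist in the repository, and the package patterns it treats as high-blast-radius are illustrative assumptions drawn from the examples in the list (TE and the MCore submodule), not an authoritative set.

```bash
# Hypothetical helper: print the labels a bump PR needs, one per line.
# The package patterns below are illustrative assumptions, not a repo list.
bump_labels() {
  echo "needs-more-tests"            # mandatory for every bump (L0 + L1)
  case "$1" in
    transformer-engine|megatron-core|3rdparty/Megatron-LM)
      echo "full-test-suite" ;;      # high blast radius: also pull in L2
  esac
}

# Usage:
#   bump_labels transformer-engine
```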
The PR body template — this is the durable record of the bump:
markdown
<details><summary>Claude summary</summary>

What

  • Bump <package> to <ref>.
  • Regenerate
    uv.lock
    .

Lockfile delta

Updated <package> <old> -> <new>
Updated <package> <old> -> <new>

Test plan

  • L0 CI green
  • L1 CI green (label
    needs-more-tests
    applied)

Quarantined tests (this bump)

None yet; will be appended as flakes are identified during CI iteration.
</details>
To update the PR title or body later, use
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"
and never
gh pr edit
.

Step 5 — Trigger CI on the exact SHA

Trigger mechanics live in @skills/cicd/SKILL.md "How CI Is Triggered". For this loop the rule is simple: on every new SHA you push, post
/ok to test $(git rev-parse HEAD)
as a PR comment, even if your commits are signed. This guarantees the run targets the SHA you actually want exercised and re-fires anything that got cancelled or cached.
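Because the comment must name the exact commit, it can help to build the comment text through a helper that rejects abbreviated or malformed SHAs. The sketch below is hypothetical (`ok_to_test_comment` is not a repository script) and assumes the full 40-character SHA printed by `git rev-parse HEAD` is what should be posted.

```bash
# Hypothetical helper: print the trigger comment for a commit, refusing
# anything that is not a full 40-character lowercase hex SHA.
ok_to_test_comment() {
  sha="$1"
  case "$sha" in
    ""|*[!0-9a-f]*) echo "not a full SHA: $sha" >&2; return 1 ;;
  esac
  [ "${#sha}" -eq 40 ] || { echo "not a full SHA: $sha" >&2; return 1; }
  printf '/ok to test %s\n' "$sha"
}

# Usage:
#   gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
#     --body "$(ok_to_test_comment "$(git rev-parse HEAD)")"
```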

Step 6 — Attach the watchdog (always; never a cronjob)

For a bump PR you want a single live process that emits per-job state changes for the CICD NeMo workflow only. Other workflows (docs, wheel, copyright, install-test) are noise here — the gate that decides green-or-red for a bump is
CICD NeMo
.
Always attach a watchdog with the Monitor tool. Never schedule wakeups or cronjobs for this loop. A watchdog gives you:
  • Reaction to every job transition within about a minute (the script below polls every 60 seconds).
  • A single live process — no scattered scheduled-wakeup state to reason about.
  • Natural early termination via
    TaskStop
    once the run is green.
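The idea behind per-job state changes is edge detection: compare the job states from the previous poll with the current poll and report only the differences. The sketch below illustrates that idea offline with two snapshot files. It is a simplified, hypothetical helper (`job_transitions` and the snapshot file format are assumptions) and reports every change, including jobs that are merely queued.

```bash
# Hypothetical helper: print "JOB <name> -> <state>" for every job whose state
# differs between two snapshots. Each snapshot file holds one job per line as
# name<TAB>status<TAB>conclusion.
job_transitions() {
  awk -F'\t' '
    # First file: remember each job state from the previous poll.
    FILENAME == ARGV[1] { prev[$1] = $2 "/" $3; next }
    # Second file: report jobs whose status/conclusion pair changed.
    prev[$1] != ($2 "/" $3) {
      print "JOB " $1 " -> " ($2 == "completed" ? $3 : $2)
    }
  ' "$1" "$2"
}

# Usage:
#   job_transitions /tmp/jobs.prev.tsv /tmp/jobs.cur.tsv
```

Reporting only the edges is what keeps the event stream quiet while nothing changes.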

Watchdog script

Save to
/tmp/watchdog-<PR>.sh
and chmod +x:
bash
#!/usr/bin/env bash
# Watchdog: monitor "CICD NeMo" runs on pull-request/<PR> and emit
# per-job state changes. Stays alive across re-runs (new commits).

set -u
PR=<PR>
REPO=NVIDIA-NeMo/Megatron-Bridge
BRANCH="pull-request/$PR"

prev_run_id=""
reported_run_id=""   # last run whose completion was already announced
declare -A prev_state

emit() { echo "[$(date -u +%H:%M:%SZ)] $*"; }

while true; do
  # Latest "CICD NeMo" run on the PR branch; fall back to an empty list on error.
  run_json=$(gh run list --repo "$REPO" --workflow "CICD NeMo" \
    --branch "$BRANCH" --limit 1 \
    --json databaseId,status,conclusion,headSha 2>/dev/null || echo "[]")
  run_id=$(echo "$run_json" | jq -r '.[0].databaseId // empty')
  run_status=$(echo "$run_json" | jq -r '.[0].status // empty')
  run_conclusion=$(echo "$run_json" | jq -r '.[0].conclusion // empty')
  run_sha=$(echo "$run_json" | jq -r '.[0].headSha // empty')

  # No run yet: wait and poll again.
  if [[ -z "$run_id" ]]; then
    sleep 30
    continue
  fi

  # A new run appeared (new commit or re-trigger): reset per-job state.
  if [[ "$run_id" != "$prev_run_id" ]]; then
    emit "RUN ${run_id} STARTED sha=${run_sha:0:8} status=${run_status}"
    prev_run_id="$run_id"
    unset prev_state
    declare -A prev_state
  fi

  # Emit one event per job whose status/conclusion changed since the last poll.
  jobs_json=$(gh run view "$run_id" --repo "$REPO" --json jobs 2>/dev/null || echo "{}")
  while IFS=$'\t' read -r name status conclusion; do
    [[ -z "$name" ]] && continue
    cur="${status}/${conclusion}"
    if [[ "${prev_state[$name]:-}" != "$cur" ]]; then
      case "$status" in
        completed) emit "JOB ${name} -> ${conclusion}" ;;
        in_progress)
          if [[ -z "${prev_state[$name]:-}" || "${prev_state[$name]}" == "queued/" ]]; then
            emit "JOB ${name} -> in_progress"
          fi
          ;;
      esac
      prev_state[$name]="$cur"
    fi
  done < <(echo "$jobs_json" | jq -r '.jobs[]? | [.name, .status, (.conclusion // "")] | @tsv')

  # Announce completion once per run instead of on every poll.
  if [[ "$run_status" == "completed" && "$run_id" != "$reported_run_id" ]]; then
    emit "RUN ${run_id} COMPLETED conclusion=${run_conclusion}"
    reported_run_id="$run_id"
  fi

  sleep 60
done

Arming the watchdog

text
Monitor(
  description="CICD NeMo run state changes on PR <N>",
  command="bash /tmp/watchdog-<N>.sh",
  persistent=true,
  timeout_ms=3600000
)
persistent: true
keeps it alive across re-runs (you'll push more commits when quarantining flakes). Stop it with
TaskStop(<task-id>)
once the run is green.

Why never a cronjob / scheduled wakeup

  • Cronjobs run blind — they fire on a clock, not on an event. You'll either over-poll (cache miss every wake-up) or miss long stalls.
  • Wakeups can't easily fan out to "tell me whenever a job transitions" — they only resume the agent on a fixed interval.
  • A persistent Monitor surfaces every job edge in real time and exits cleanly when the work is done.

Step 7 — Quarantine on red, then iterate

When a
JOB <name> -> failure
event fires:
  1. Triage the failure — is it the bump or a flake? Skim the logs:
    bash
    RUN_ID=<from "RUN ... STARTED" event>
    gh run view "$RUN_ID" --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > /tmp/run.log
    wc -l /tmp/run.log
    tail -200 /tmp/run.log
    This is the bump-specific judgement call: only quarantine if the failure reproduces on
    main
    or is clearly unrelated infrastructure. If the failure is caused by the bump (real regression), stop quarantining — fix the underlying issue or revert the bump. Quarantining a real regression hides the very signal the bump PR exists to surface.
  2. Move the launch script to
    flaky/
    per @skills/testing/SKILL.md "Moving a Test to Flaky". Map a CI job name to its launch script via:
    • prefix
      gb200_
      maps to
      gb200/active/
      ; any other job maps to
      h100/active/
    • the rest is the script's basename without
      .sh
  3. Append to the PR body's Quarantined tests section with a one-line reason and a follow-up tracking link if you have one. This is the durable record of what this bump deferred — the section exists precisely so a reviewer can see at a glance which flakes were side-stepped to land the bump.
  4. Commit, push, retrigger:
    bash
    git commit -S -s -m "[ci] chore: quarantine flaky <test> for <package> bump"
    git push
    gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
      --body "/ok to test $(git rev-parse HEAD)"
  5. Update the PR body via
    gh api PATCH
    so the quarantine list stays current.
The watchdog is persistent — it picks up the new run automatically and emits
RUN <id> STARTED
for the new attempt. Loop back to step 1.
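The job-name mapping from item 2 can be sketched as a small function. It is a hypothetical helper (`job_to_script` is not part of the repository), and the paths it prints are relative to whichever directory holds the `gb200/` and `h100/` trees, which this document does not spell out.

```bash
# Hypothetical helper: map a CI job name to its launch script path, relative
# to the directory that holds the gb200/ and h100/ trees.
job_to_script() {
  case "$1" in
    gb200_*) printf 'gb200/active/%s.sh\n' "${1#gb200_}" ;;  # strip hardware prefix
    *)       printf 'h100/active/%s.sh\n' "$1" ;;
  esac
}

# Usage (the job name is a made-up example):
#   job_to_script gb200_example_job
```

The printed path is the source argument for the `git mv` into `flaky/` described in @skills/testing/SKILL.md.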

Step 8 — Stop when green

RUN <id> COMPLETED conclusion=success
is the exit condition. Then:
bash
gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | awk -F'\t' '{print $2}' | sort | uniq -c
# TaskStop is an agent tool call, not a shell command: TaskStop(<watchdog-task-id>)
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"
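The final tally can also be done with a helper that splits on tabs, so check names containing spaces do not shift the columns. This is a sketch under one assumption: that `gh pr checks` prints tab-separated `name`, `state`, and further columns when its output is piped. `tally_checks` is a hypothetical name.

```bash
# Hypothetical helper: count checks per state from tab-separated
# "name<TAB>state<TAB>..." lines on stdin; prints "<count> <state>" lines
# sorted by state name.
tally_checks() {
  awk -F'\t' '{ n[$2]++ } END { for (s in n) print n[s], s }' | sort -k2
}

# Usage:
#   gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | tally_checks
```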

Common pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| Wrong TE branch ref (`release/v2.15`) silently resolves nothing | TE uses `release_vX.Y` with an underscore | Verify with `git ls-remote` before locking |
| Lockfile diff includes unrelated CVE-pinned packages | `override-dependencies` carries floors that float | Re-run lock and accept; don't try to revert those |
| Signed first push triggers CI but later pushes don't | `copy-pr-bot` re-trusts on each new SHA only via `/ok to test` once you're past the first signed commit in this loop | Always re-post `/ok to test $(git rev-parse HEAD)` per Step 5 |
| Watchdog goes silent for 30+ min | `gh` rate-limited or auth expired | Bump poll interval; `gh auth status`; restart Monitor |
| Job name doesn't map to a script in `active/` | `gb200_` prefix is the hardware indicator, not part of the filename | Strip `gb200_` and look in `gb200/active/` |

Anti-patterns

  • Cron / scheduled wakeups for this loop. Always Monitor.
  • Polling all workflows. Filter to
    CICD NeMo
    — the rest are noise for a bump.
  • Quarantining a real regression to "make CI green." That defeats the purpose of the bump PR. Only quarantine if the failure reproduces on
    main
    or is clearly unrelated infrastructure.
  • gh pr edit
    for title/body. Use
    gh api PATCH
    .
  • HEREDOC in
    gh pr create --body
    .
    Always go through a tmpfile +
    --body-file
    .
  • Bundling unrelated changes (feature work, refactors) into a bump PR. Bumps should stay surgical so CI failures attribute cleanly.
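For the tmpfile plus `--body-file` route, one possible shape is to write the body to a file first and hand that file to `gh`. The sketch below is hypothetical: `write_pr_body` is not a repository script, and the body it writes is a shortened stand-in for the full template in Step 4.

```bash
# Hypothetical helper: write a minimal bump PR body to a file.
#   $1 = output file, $2 = package, $3 = ref
write_pr_body() {
  {
    printf '<details><summary>Claude summary</summary>\n\n'
    printf 'Bump %s to %s.\n' "$2" "$3"
    printf 'Regenerate uv.lock.\n'
    printf '</details>\n'
  } > "$1"
}

# Usage:
#   body_file=$(mktemp)
#   write_pr_body "$body_file" <package> <ref>
#   gh pr create --title "[build] chore: bump <package> to <ref>" \
#     --body-file "$body_file"
```

Keeping the body in a file also means the same file can be edited and re-sent later with the `gh api -X PATCH` call shown in Step 4.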