Bump Dependency
End-to-end workflow for shipping a dependency bump in Megatron Bridge.
Optimised for the case where TE, MCore, or another GPU-heavy pin moves
forward — which often surfaces flakes that have to be quarantined before
the PR can land.
The pipeline is always: edit → relock → push → /ok to test → watchdog →
quarantine on red → re-trigger → repeat until green.
When to reach for this skill
- Bumping a git-source pin in `pyproject.toml` (e.g. `transformer-engine @ git+...@<ref>`).
- Bumping the `3rdparty/Megatron-LM` submodule.
- Any change that touches and needs the full L0 + L1 matrix to
prove out before merge.
For pure dep additions/removals without a CI loop, the
skill is enough.
Required context
Read first, then follow the steps below:
- @CONTRIBUTING.md — PR title/label policy, DCO sign-off
- @skills/build-and-dependency/SKILL.md — mechanics, container choice
- @skills/cicd/SKILL.md — how and work
- @skills/testing/SKILL.md — vs directory layout, quarantine recipe
Step 1 — Worktree and edit
Create a worktree off
per @CLAUDE.md. Then,
before any :
bash
git submodule update --init 3rdparty/Megatron-LM
The submodule must be initialised in the worktree or
errors
with "not a Python project" on the MCore path.
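A quick pre-flight check can catch an uninitialised submodule before it turns into the "not a Python project" error. The sketch below relies on the fact that `git submodule status` prefixes a line with `-` when the submodule is not initialised; the helper name `submodule_initialised` is hypothetical and not part of the repository.

```shell
# Hypothetical helper: reads one line of `git submodule status <path>`
# output on stdin and succeeds only if the submodule is initialised.
# git prints a leading "-" for a submodule that has not been initialised.
submodule_initialised() {
  IFS= read -r line || true
  # No output at all means the path is not a known submodule.
  [ -n "$line" ] || return 1
  case "$line" in
    -*) return 1 ;;
    *)  return 0 ;;
  esac
}

# Usage inside the worktree:
#   git submodule status 3rdparty/Megatron-LM | submodule_initialised \
#     || git submodule update --init 3rdparty/Megatron-LM
```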
Edit the pin. For TE the canonical knob is the override line in `pyproject.toml`:
toml
override-dependencies = [
...
"transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
...
]
Use a
branch name (
) only when you want to track a
moving tip; use a full SHA for reproducibility. TE branches use
(underscore), not
. Verify with
git ls-remote https://github.com/NVIDIA/TransformerEngine.git
.
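The `git ls-remote` check can be scripted so a mistyped ref fails loudly before you relock. This is a minimal sketch: `ref_in_listing` is a hypothetical helper that scans `git ls-remote` output for a branch, a tag, or a SHA. Note that a SHA only appears in that listing when it is the tip of some ref, so a miss on a SHA is not proof that the commit does not exist.

```shell
# Hypothetical helper: reads `git ls-remote <url>` output on stdin
# ("<sha>\t<refname>" per line) and succeeds if REF matches a branch
# name, a tag name, or the SHA at the tip of any ref.
ref_in_listing() {
  awk -v ref="$1" '
    $1 == ref                    { found = 1 }
    $2 == ("refs/heads/" ref)    { found = 1 }
    $2 == ("refs/tags/" ref)     { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
#   git ls-remote https://github.com/NVIDIA/TransformerEngine.git \
#     | ref_in_listing "<new-ref>" || echo "ref not found on remote"
```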
Step 2 — Regenerate the lockfile
Run
inside the project container per
@skills/build-and-dependency/SKILL.md "Regenerating uv.lock". Then
confirm only the intended packages moved:
bash
git diff --stat pyproject.toml uv.lock
If the diff carries changes you didn't ask for (transitive movements you
can't explain), stop and investigate before pushing. Note that
carries CVE floors that float — unrelated
packages bumping by a patch version is expected; accept those, don't
revert them.
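`git diff --stat` only shows how many lines moved. To see which packages moved, one option is to compare name/version pairs from the old and new lockfiles. The sketch below assumes each package entry in `uv.lock` carries top-level `name = "..."` and `version = "..."` lines; the helper names `lock_versions` and `lock_delta` are hypothetical.

```shell
# Hypothetical helper: print "name version" for every package entry in a
# uv.lock-style file. Assumes top-level `name = "..."` / `version = "..."`
# lines per entry; indented dependency tables are ignored.
lock_versions() {
  awk -F'"' '
    /^name = /                { n = $2 }
    /^version = / && n != ""  { print n, $2; n = "" }
  ' "$1" | sort
}

# Hypothetical helper: diff the package/version lists of two lockfiles.
# Returns diff's exit status (1 when the lists differ).
lock_delta() {
  old_list=$(mktemp) && new_list=$(mktemp) || return 2
  lock_versions "$1" > "$old_list"
  lock_versions "$2" > "$new_list"
  diff "$old_list" "$new_list"
  status=$?
  rm -f "$old_list" "$new_list"
  return "$status"
}

# Usage, before committing:
#   git show HEAD:uv.lock > /tmp/uv.lock.old
#   lock_delta /tmp/uv.lock.old uv.lock
```

Lines starting with `<` are the old versions and lines starting with `>` are the new ones, which makes unexplained transitive movements easier to spot than in the raw lockfile diff.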
Step 3 — Commit and push
Sign-off + signed-commit + PR title format per @CONTRIBUTING.md and
@skills/cicd/SKILL.md "Commit and PR Workflow". For a bump:
bash
git add pyproject.toml uv.lock
git commit -S -s -m "[build] chore: bump <package> to <ref>"
git push -u origin <branch-name>
A signed commit (`-S`) lets CI trigger without a manual `/ok to test` for the first push, but you'll still post `/ok to test` on every subsequent SHA in this loop (Step 5).
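Before pushing, it can help to confirm that the commit actually carries the DCO trailer that `-s` adds. The sketch below checks a commit message for a `Signed-off-by:` line; `has_signoff` is a hypothetical helper name.

```shell
# Hypothetical helper: reads a commit message on stdin and succeeds if it
# contains a DCO trailer of the form "Signed-off-by: Name <email>".
has_signoff() {
  grep -Eq '^Signed-off-by: .+ <.+>$'
}

# Usage:
#   git log -1 --format=%B | has_signoff || echo "missing sign-off"
#   git log -1 --format='%G?'   # signature status; "N" means unsigned
```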
Step 4 — Open the PR
Title and labels per @CONTRIBUTING.md. Two bump-specific requirements:
- Apply `needs-more-tests`: mandatory for a bump; it expands the matrix from L0 to L0+L1.
- For a high-blast-radius bump (TE, MCore submodule, anything that
touches CUDA kernels), also apply to pull L2 into
the PR run. L2 covers VL models, checkpoint conversion, and heavy
quantization which otherwise only run on schedule.
The PR body template — this is the durable record of the bump:
markdown
<details><summary>Claude summary</summary>
## What
- Bump <package> to <ref>.
- Regenerate `uv.lock`.
## Lockfile delta
Updated <package> <old> -> <new>
## Test plan
- [ ] L0 CI green
- [ ] L1 CI green (label `needs-more-tests` applied)
## Quarantined tests (this bump)
_None yet — will be appended as flakes are identified during CI iteration._
</details>
To update the PR title or body later, use
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"
— never
.
Step 5 — Trigger CI on the exact SHA
Trigger mechanics live in @skills/cicd/SKILL.md "How CI Is Triggered".
For this loop the rule is simple: on every new SHA you push, post `/ok to test $(git rev-parse HEAD)` as a PR comment, even if your commits are signed. This guarantees the run targets the SHA you actually want exercised and re-fires anything that got cancelled or cached.
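Because the comment has to name the exact SHA, a small guard against posting an abbreviated or empty value can save a wasted CI cycle. The sketch below builds the comment body and rejects anything that is not a 40-character lowercase hex string. That strictness is an assumption of this sketch, based on `git rev-parse HEAD` printing the full SHA; `ok_to_test_body` is a hypothetical helper name.

```shell
# Hypothetical helper: print the "/ok to test <sha>" comment body, or fail
# if the argument is not a full 40-character lowercase hex SHA.
ok_to_test_body() {
  case "$1" in
    ""|*[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ] || return 1
  printf '/ok to test %s\n' "$1"
}

# Usage:
#   body=$(ok_to_test_body "$(git rev-parse HEAD)") \
#     && gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge --body "$body"
```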
Step 6 — Attach the watchdog (always; never a cronjob)
For a bump PR you want a single live process that emits per-job state changes for the CICD NeMo workflow only. Other workflows (docs, wheel, copyright, install-test) are noise here; the gate that decides green-or-red for a bump is CICD NeMo.
Always attach a watchdog with the Monitor tool. Never schedule wakeups
or cronjobs for this loop. A watchdog gives you:
- Sub-minute reaction time on every job transition.
- A single live process — no scattered scheduled-wakeup state to reason
about.
- Natural early termination via TaskStop once the run is green.
Watchdog script
bash
#!/usr/bin/env bash
# Watchdog: monitor "CICD NeMo" runs on pull-request/<PR> and emit
# per-job state changes. Stays alive across re-runs (new commits).
set -u

PR=<PR>
REPO=NVIDIA-NeMo/Megatron-Bridge
BRANCH="pull-request/$PR"

prev_run_id=""
completed_run_id=""   # run whose COMPLETED event has already been emitted
declare -A prev_state

emit() { echo "[$(date -u +%H:%M:%SZ)] $*"; }

while true; do
  # Latest "CICD NeMo" run on the PR branch; fall back to an empty list on error.
  run_json=$(gh run list --repo "$REPO" --workflow "CICD NeMo" \
    --branch "$BRANCH" --limit 1 \
    --json databaseId,status,conclusion,headSha 2>/dev/null || echo "[]")
  run_id=$(echo "$run_json" | jq -r '.[0].databaseId // empty')
  run_status=$(echo "$run_json" | jq -r '.[0].status // empty')
  run_conclusion=$(echo "$run_json" | jq -r '.[0].conclusion // empty')
  run_sha=$(echo "$run_json" | jq -r '.[0].headSha // empty')

  # No run yet (or the query failed): retry sooner than the normal interval.
  if [[ -z "$run_id" ]]; then
    sleep 30; continue
  fi

  # New run: announce it and forget the previous run's job states.
  if [[ "$run_id" != "$prev_run_id" ]]; then
    emit "RUN ${run_id} STARTED sha=${run_sha:0:8} status=${run_status}"
    prev_run_id="$run_id"
    unset prev_state
    declare -A prev_state
  fi

  jobs_json=$(gh run view "$run_id" --repo "$REPO" --json jobs 2>/dev/null || echo "{}")
  while IFS=$'\t' read -r name status conclusion; do
    [[ -z "$name" ]] && continue
    cur="${status}/${conclusion}"
    # Emit only on a state change for this job.
    if [[ "${prev_state[$name]:-}" != "$cur" ]]; then
      case "$status" in
        completed)
          emit "JOB ${name} -> ${conclusion}" ;;
        in_progress)
          if [[ -z "${prev_state[$name]:-}" || "${prev_state[$name]}" == "queued/" ]]; then
            emit "JOB ${name} -> in_progress"
          fi ;;
      esac
      prev_state[$name]="$cur"
    fi
  done < <(echo "$jobs_json" | jq -r '.jobs[]? | [.name, .status, (.conclusion // "")] | @tsv')

  # Report completion once per run rather than on every poll.
  if [[ "$run_status" == "completed" && "$completed_run_id" != "$run_id" ]]; then
    emit "RUN ${run_id} COMPLETED conclusion=${run_conclusion}"
    completed_run_id="$run_id"
  fi

  sleep 60
done
Arming the watchdog
text
Monitor(
description="CICD NeMo run state changes on PR <N>",
command="bash /tmp/watchdog-<N>.sh",
persistent=true,
timeout_ms=3600000
)
`persistent=true` keeps it alive across re-runs (you'll push more commits when quarantining flakes). Stop it with TaskStop once the run is green.
Why never a cronjob / scheduled wakeup
- Cronjobs run blind — they fire on a clock, not on an event. You'll
either over-poll (cache miss every wake-up) or miss long stalls.
- Wakeups can't easily fan out to "tell me whenever a job transitions"
— they only resume the agent on a fixed interval.
- A persistent Monitor surfaces every job edge in real time and exits
cleanly when the work is done.
Step 7 — Quarantine on red, then iterate
- Triage the failure: is it the bump or a flake? Skim the logs:
bash
RUN_ID=<from "RUN ... STARTED" event>
gh run view "$RUN_ID" --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > /tmp/run.log
wc -l /tmp/run.log
tail -200 /tmp/run.log
This is the bump-specific judgement call: only quarantine if the
failure reproduces on
or is clearly unrelated infrastructure.
If the failure is caused by the bump (a real regression), stop quarantining: fix the underlying issue or revert the bump.
Quarantining a real regression hides the very signal the bump PR
exists to surface.
- Move the launch script to per @skills/testing/SKILL.md
"Moving a Test to Flaky". Map a CI job name to its launch script via:
- prefix → , otherwise
- the rest is the script's basename without
- Append to the PR body's Quarantined tests section with a one-line
reason and a follow-up tracking link if you have one. This is the
durable record of what this bump deferred — the section exists
precisely so a reviewer can see at a glance which flakes were
side-stepped to land the bump.
- Commit, push, retrigger:
bash
git commit -S -s -m "[ci] chore: quarantine flaky <test> for <package> bump"
git push
gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
--body "/ok to test $(git rev-parse HEAD)"
- Update the PR body via the `gh api -X PATCH` call from Step 4 so the quarantine list stays current.
The watchdog is persistent: it picks up the new run automatically and emits `RUN <id> STARTED` for the new attempt. Loop back to the triage step at the top of this list.
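For the triage step above, `tail -200` shows only the end of the log, which can hide how many jobs actually failed. The sketch below counts log lines per job in the saved log. It assumes `gh run view --log-failed` writes tab-separated lines with the job name in the first field; `failing_jobs` is a hypothetical helper name.

```shell
# Hypothetical helper: summarise a saved `gh run view --log-failed` log by
# job. Assumes tab-separated lines with the job name in the first field.
# Output: "<line count> <job name>", busiest job first.
failing_jobs() {
  cut -f1 "$1" | sort | uniq -c | sort -rn
}

# Usage:
#   failing_jobs /tmp/run.log
```

A single failing job with an infrastructure-looking error points towards a flake; many jobs failing in the same way after the bump points towards a real regression.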
Step 8 — Stop when green
`RUN <id> COMPLETED conclusion=success` is the exit condition. Then:
bash
gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | awk -F'\t' '{print $2}' | sort | uniq -c
TaskStop(<watchdog-task-id>)
gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"
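The final PATCH only helps if `/tmp/pr-body.md` already lists every quarantined test. One way to keep that file current during Step 7 is sketched below. It assumes the body follows the Step 4 template, with the Quarantined tests section last and a closing `</details>` line; `add_quarantine_entry` is a hypothetical helper name.

```shell
# Hypothetical helper: add "- <test>: <reason>" to the Quarantined tests
# section of a PR body file that follows the Step 4 template.
# Assumptions: the section is the last one before the closing </details>
# line, and the "_None yet ..." placeholder should disappear on first use.
add_quarantine_entry() {
  body_file=$1
  entry="- $2: $3"
  awk -v entry="$entry" '
    /^_None yet/   { next }          # drop the template placeholder
    /^<\/details>/ { print entry }   # insert just before the closing tag
    { print }
  ' "$body_file" > "$body_file.tmp" && mv "$body_file.tmp" "$body_file"
}

# Usage:
#   add_quarantine_entry /tmp/pr-body.md "<test>" "<one-line reason>"
```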
Common pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| Wrong TE branch ref () silently resolves nothing | TE uses with an underscore | Verify with before locking |
| Lockfile diff includes unrelated CVE-pinned packages | carries floors that float | Re-run lock and accept; don't try to revert those |
| Signed first push triggers CI but later pushes don't | Each new SHA is re-trusted only via `/ok to test` once you're past the first signed commit in this loop | Always re-post `/ok to test $(git rev-parse HEAD)` per Step 5 |
| Watchdog goes silent for 30+ min | rate-limited or auth expired | Bump poll interval; ; restart Monitor |
| Job name doesn't map to a script in | prefix is the hardware indicator, not part of the filename | Strip and look in |
Anti-patterns
- Cron / scheduled wakeups for this loop. Always Monitor.
- Polling all workflows. Filter to CICD NeMo; the rest are noise for a bump.
- Quarantining a real regression to "make CI green." That defeats
the purpose of the bump PR. Only quarantine if the failure reproduces
on or is clearly unrelated infrastructure.
- for title/body. Use .
- HEREDOC in . Always go through a tmpfile +
.
- Bundling unrelated changes (feature work, refactors) into a bump
PR. Bumps should stay surgical so CI failures attribute cleanly.