bump-base-image

Bump the PyTorch base image

End-to-end workflow for moving Megatron-LM's CI to a newer `nvcr.io/nvidia/pytorch:<YY.MM>-py3` container. The most common failure mode is forgetting that GitHub CI and GitLab CI have separate pins: a bump that only touches the former lands green, then breaks GitLab CI on `main` and forces an immediate follow-up PR. Always update both in the same PR.

Inputs to gather from the user

- Target tag, e.g. `26.04-py3`. NVIDIA NGC PyTorch containers are released as `nvcr.io/nvidia/pytorch:YY.MM-py3`.
- Scope: usually `dev` only. The `lts` pin (`docker/.ngc_version.lts`, plus the `IMAGE_TYPE: lts` rows in GitLab) is bumped on a different cadence; only touch it if the user explicitly asks.
- Workflow run ID (optional but typical): after the first CI run, the user will provide a GitHub Actions run ID for golden-value refresh.

Workflow

- [ ] Step 1: Update the GitHub CI pin (docker/.ngc_version.dev)
- [ ] Step 2: Update the GitLab CI pin (.gitlab/stages/01.build.yml)
- [ ] Step 3: Open the PR with the `Run functional tests` label
- [ ] Step 4: Re-run failing tests via `/ok to test <commit-sha>`
- [ ] Step 5: For golden-value drift → refresh with the `update-golden-values` skill
- [ ] Step 6: For hangs / real regressions → mark tests `mr-broken` and file tracking issues
- [ ] Step 7: Verify both pins are in sync before merging

Step 1 — GitHub CI pin

`docker/.ngc_version.dev` is a single-line file consumed by `docker/Dockerfile.ci.dev` (via `FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev)`). Overwrite it:

```bash
echo 'nvcr.io/nvidia/pytorch:<YY.MM>-py3' > docker/.ngc_version.dev
```

The file has no trailing newline historically; preserving or adding one is fine, because the build args read the value through `$(cat ...)` and command substitution strips trailing newlines. Do not touch `docker/.ngc_version.lts` unless bumping LTS too.
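
If the overwrite is scripted rather than typed, it may be worth validating the tag first. The following is a minimal Python sketch of that idea, not part of the repository: the helper names `build_image_ref` and `write_dev_pin` are hypothetical, and the tag pattern is an assumption inferred from the `26.04-py3` example above.

```python
import re
from pathlib import Path

# Assumed tag shape, inferred from the examples in this guide (e.g. 26.04-py3).
TAG_PATTERN = re.compile(r"\d{2}\.\d{2}-py3")


def build_image_ref(tag: str) -> str:
    """Return the full NGC image reference for a YY.MM-py3 tag."""
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"expected a tag like 26.04-py3, got {tag!r}")
    return f"nvcr.io/nvidia/pytorch:{tag}"


def write_dev_pin(repo_root: Path, tag: str) -> str:
    """Overwrite docker/.ngc_version.dev under repo_root (hypothetical helper)."""
    image_ref = build_image_ref(tag)
    pin_file = repo_root / "docker" / ".ngc_version.dev"
    # A trailing newline is harmless: $(cat ...) strips it when the pin is read.
    pin_file.write_text(image_ref + "\n")
    return image_ref
```

Calling `write_dev_pin(Path("."), "26.04-py3")` from the repository root would have the same effect as the `echo` command, while rejecting a malformed tag such as `latest` before anything is written.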

Step 2 — GitLab CI pin

GitLab CI does not read `docker/.ngc_version.dev`. It hardcodes `BASE_IMAGE` in a `parallel: matrix:` block. Update the two `IMAGE_TYPE: dev` rows (one per platform):

```yaml
# .gitlab/stages/01.build.yml, under test:pre_build_image -> parallel.matrix
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3 # amd64 row
  PLATFORM: amd64
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3 # arm64 row
  PLATFORM: arm64
```

Leave the `IMAGE_TYPE: lts` rows alone. Quick sanity check before commit:

```bash
rg -n '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml
# expect: lts pin × 2 unchanged, dev pin × 2 == new tag
```
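
The edit itself can also be sketched as a small text transform. The Python below is an illustration under stated assumptions, not the project's tooling: `bump_dev_rows` is a hypothetical helper, and it assumes each matrix entry starts with `- ` and lists `IMAGE_TYPE` before `BASE_IMAGE`, as in the snippet above. It works line by line so that comments and formatting survive.

```python
import re

# Matches a BASE_IMAGE line, capturing the key prefix and any trailing comment.
BASE_IMAGE_LINE = re.compile(r"^(\s*BASE_IMAGE:\s*)nvcr\.io/nvidia/pytorch:\S+(.*)$")


def bump_dev_rows(yaml_text: str, tag: str) -> str:
    """Rewrite BASE_IMAGE only in matrix entries whose IMAGE_TYPE is dev."""
    image_type = None
    out = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            image_type = None  # a new matrix entry begins
            stripped = stripped[2:].strip()
        if stripped.startswith("IMAGE_TYPE:"):
            image_type = stripped.split(":", 1)[1].strip()
        match = BASE_IMAGE_LINE.match(line)
        if match and image_type == "dev":
            line = f"{match.group(1)}nvcr.io/nvidia/pytorch:{tag}{match.group(2)}"
        out.append(line)
    trailing = "\n" if yaml_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

Because the function keys off `IMAGE_TYPE` rather than the old tag, the `lts` rows are left untouched even if they happen to carry the same version string.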

Step 3 — Open the PR

- Title convention: `chore: Update Docker image version to <YY.MM>-py3` (see #4611).
- Apply the `Run functional tests` label before the first push. This unlocks the full functional matrix on the PR; without it the bump only runs the standard GH PR checks and you'll miss the drift.
- Push as draft first if you're still iterating; the bot will auto-draft otherwise.

Step 4 — Re-running CI on a new commit

For PRs from forks (the typical contributor case), each new commit needs an explicit `/ok to test <commit-sha>` PR comment to authorize NVIDIA runners (see the `copy-pr-bot` flow in #4611). One comment per commit. If `copy-pr-bot` reports "had a problem deploying to test", just push another commit (or re-issue the comment after the next push); the deploy is per-commit, not per-comment.

Step 5 — Golden-value drift

Container bumps shift CUDA / cuBLAS / cuDNN / kernel autotuning, which moves `lm loss`, `num-zeros`, `iteration-time`, and `mem-*` metrics on a large fraction of functional tests. This is expected and is not a correctness regression, so refresh the golden values rather than chasing each test.

Hand off to the `update-golden-values` skill with:

- `--source github`
- `--pipeline-id <WORKFLOW_RUN_ID>` from the failing CI run
- `--only-failing` (refresh just the trajectories that drifted)

PR #4611 refreshed 78 golden-value files across `dev_dgx_h100` and `dev_dgx_gb200` for GPT / MoE / MIMO / hybrid suites in a single pass via this exact flow. The per-metric relative-difference summary the skill produces is the recommended PR description blurb; reviewers expect to see it.
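
To make "relative difference" concrete, here is a minimal Python sketch of one way such a per-metric summary could be computed. It is an illustration only: the skill's actual formula and output format are not shown in this guide, and the function names and the choice of reporting the maximum per metric are assumptions.

```python
def relative_difference(golden: float, observed: float) -> float:
    """Return |observed - golden| / |golden|, guarding against a zero baseline."""
    if golden == 0.0:
        return 0.0 if observed == 0.0 else float("inf")
    return abs(observed - golden) / abs(golden)


def summarize_drift(
    golden: dict[str, list[float]], observed: dict[str, list[float]]
) -> dict[str, float]:
    """Largest relative difference per metric present in both runs (assumed summary)."""
    summary = {}
    for metric in sorted(golden.keys() & observed.keys()):
        pairs = zip(golden[metric], observed[metric])
        summary[metric] = max(
            (relative_difference(g, o) for g, o in pairs), default=0.0
        )
    return summary
```

A summary of this kind gives reviewers a quick sense of scale: small per-metric differences are consistent with drift from the new container, while a large outlier points at a test that may belong in Step 6 instead.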

Step 6 — Real regressions: mark broken, don't block the bump

A small number of tests will genuinely break (hangs, OOM, real numerical regressions). Don't gate the base-image bump on fixing them; that conflates two changes. Instead:

1. File a GitHub issue describing the failure mode and linking the failing CI run.
2. Flip the test's scope to the `-broken` variant in the recipe YAML under `tests/test_utils/recipes/<arch>/`, with an inline comment that references the issue. Pattern:

   ```yaml
   - test_case: [hybrid_dynamic_inference_tp1_ep8_nanov3_chunked_prefill]
     products:
       - environment: [dev]
         # Broken: hangs on repeat iter 3, exceeds 1h job limit — see issue #<N>.
         scope: [mr-broken, mr-github-broken]      # was: [mr, mr-github]
         platforms: [dgx_h100]
   ```

Scope mapping (replace, don't append):

| Before | After |
| --- | --- |
| `mr` | `mr-broken` |
| `mr-github` | `mr-github-broken` |
| `nightly` | `nightly-broken` |

The recipe still runs in the `-broken` scope, but failures stop blocking PR merges.
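
The "replace, don't append" rule can be expressed as a lookup. The Python sketch below is illustrative only; `mark_broken` is a hypothetical helper, and the mapping is copied from the table above.

```python
# Mapping copied from the scope table in this step.
BROKEN_SCOPES = {
    "mr": "mr-broken",
    "mr-github": "mr-github-broken",
    "nightly": "nightly-broken",
}


def mark_broken(scopes: list[str]) -> list[str]:
    """Replace each gating scope with its -broken variant, keeping order and length."""
    return [BROKEN_SCOPES.get(scope, scope) for scope in scopes]
```

For the recipe shown above, `mark_broken(["mr", "mr-github"])` yields `["mr-broken", "mr-github-broken"]`. Scopes that are already `-broken`, or that have no entry in the table, pass through unchanged, so applying the helper twice is harmless.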

Step 7 — Sync check before merging

The single biggest failure mode of this workflow is shipping #4611 without #4688, that is, bumping the GitHub pin without the GitLab one. Before you ask for the merge, confirm both pins resolve to the same tag:

```bash
echo -n "ngc_version.dev: " && cat docker/.ngc_version.dev
echo
echo "gitlab dev rows:"
# Assumes BASE_IMAGE directly follows IMAGE_TYPE in each matrix row (as in Step 2),
# so one line of trailing context is enough to pull it in.
rg -n -A1 'IMAGE_TYPE: dev' .gitlab/stages/01.build.yml \
  | rg 'BASE_IMAGE'
```

All three lines (the pin file plus the two GitLab dev rows) should show `nvcr.io/nvidia/pytorch:<YY.MM>-py3`. If they don't, fix it before merge; otherwise GitLab CI keeps building on the old container and the next person hits the same trap.
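
The same check can be written as a pass/fail function, which is easier to wire into a pre-merge script than eyeballing grep output. This is a minimal Python sketch, not an existing repository tool: `dev_base_images` and `pins_in_sync` are hypothetical names, and the parsing assumes matrix entries start with `- ` and list `IMAGE_TYPE` before `BASE_IMAGE`.

```python
import re

BASE_IMAGE = re.compile(r"^\s*BASE_IMAGE:\s*(\S+)")


def dev_base_images(yaml_text: str) -> list[str]:
    """Collect BASE_IMAGE values from matrix entries whose IMAGE_TYPE is dev."""
    images = []
    image_type = None
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            image_type = None  # a new matrix entry begins
            stripped = stripped[2:].strip()
        if stripped.startswith("IMAGE_TYPE:"):
            image_type = stripped.split(":", 1)[1].strip()
        match = BASE_IMAGE.match(line)
        if match and image_type == "dev":
            images.append(match.group(1))
    return images


def pins_in_sync(pin_text: str, yaml_text: str) -> bool:
    """True when every GitLab dev row matches the GitHub pin file's content."""
    pin = pin_text.strip()
    rows = dev_base_images(yaml_text)
    return bool(rows) and all(row == pin for row in rows)
```

Passing the contents of `docker/.ngc_version.dev` and `.gitlab/stages/01.build.yml` returns `False` both when a dev row is stale and when no dev rows are found at all, since an empty result more likely means the parsing assumption broke than that the pins agree.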

File-touch cheat sheet

| Path | Edit |
| --- | --- |
| `docker/.ngc_version.dev` | Overwrite with new `nvcr.io/nvidia/pytorch:<YY.MM>-py3` |
| `.gitlab/stages/01.build.yml` | Update both `IMAGE_TYPE: dev` `BASE_IMAGE:` rows (amd64 + arm64) |
| `tests/functional_tests/test_cases/**/golden_values_dev_dgx_{h100,gb200}.json` | Refresh via the `update-golden-values` skill |
| `tests/test_utils/recipes/<arch>/<suite>.yaml` | Flip hanging / genuinely regressed cases to `mr-broken` / `mr-github-broken` with an issue link |
| `docker/.ngc_version.lts`, `.gitlab/stages/01.build.yml` `IMAGE_TYPE: lts` rows | Skip unless explicitly bumping LTS. LTS has its own release cadence. |

Gotchas

- GitHub vs GitLab pins are independent. `docker/.ngc_version.dev` only drives GitHub CI's local container build via `Dockerfile.ci.dev`. GitLab CI has its own hardcoded `BASE_IMAGE:` matrix in `.gitlab/stages/01.build.yml`. PR #4688 existed solely because #4611 forgot the second one, so don't repeat this.
- Don't bump LTS along with dev. The `IMAGE_TYPE: lts` rows and `docker/.ngc_version.lts` are stability-pinned for the `container::lts` label path. Bump them in a dedicated PR with its own LTS validation.
- Don't fix golden-value drift by hand. Use `tests/test_utils/python_scripts/download_golden_values.py` via the `update-golden-values` skill. Hand-editing the JSONs invites diff noise and relative-difference regressions on subsequent bumps.
- `mr-broken` is a real scope, not a comment marker. It keeps the recipe wired into the matrix (so it stays discoverable and runnable on demand) without gating merges. Don't delete the test case from the recipe.
- `/ok to test` is per-commit. A new force-push or fixup commit needs a fresh `/ok to test <sha>` comment to re-trigger NVIDIA-runner CI on a fork PR.
- Don't merge until the GitLab pin matches. Run the Step 7 check before requesting review.
