bump-base-image

Bump the PyTorch base image

End-to-end workflow for moving Megatron-LM's CI to a newer `nvcr.io/nvidia/pytorch:<YY.MM>-py3` container. The most common failure mode is forgetting that GitHub CI and GitLab CI have separate pins: a bump that only touches the former lands green, then breaks GitLab CI on `main` and forces an immediate follow-up PR. Always update both in the same PR.

Inputs to gather from the user

  1. Target tag, e.g. `26.04-py3`. NVIDIA NGC PyTorch containers are released as `nvcr.io/nvidia/pytorch:YY.MM-py3`.
  2. Scope: usually `dev` only. The `lts` pin (`docker/.ngc_version.lts`, plus the `IMAGE_TYPE: lts` rows in GitLab) is bumped on a different cadence; only touch it if the user explicitly asks.
  3. Workflow run ID (optional but typical): after the first CI run, the user will provide a GitHub Actions run ID for golden-value refresh.
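
A malformed tag is cheap to catch before it reaches either pin. The sketch below checks only the `YY.MM-py3` shape described above (it does not confirm that the tag exists on NGC); `is_ngc_tag` and `ngc_image_ref` are hypothetical helper names, not part of the repository.

```shell
# Hypothetical helper: succeeds only if the argument has the YY.MM-py3 shape.
is_ngc_tag() {
  case "$1" in
    [0-9][0-9].[0-9][0-9]-py3) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: expands a validated tag into the full image reference.
ngc_image_ref() {
  is_ngc_tag "$1" || { echo "unexpected tag format: $1" >&2; return 1; }
  printf 'nvcr.io/nvidia/pytorch:%s\n' "$1"
}
```

For example, `ngc_image_ref 26.04-py3` prints `nvcr.io/nvidia/pytorch:26.04-py3`, while `ngc_image_ref latest` fails with a message on stderr.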

Workflow

- [ ] Step 1: Update the GitHub CI pin (docker/.ngc_version.dev)
- [ ] Step 2: Update the GitLab CI pin (.gitlab/stages/01.build.yml)
- [ ] Step 3: Open the PR with the `Run functional tests` label
- [ ] Step 4: Re-run failing tests via `/ok to test <commit-sha>`
- [ ] Step 5: For golden-value drift → refresh with the `update-golden-values` skill
- [ ] Step 6: For hangs / real regressions → mark tests `mr-broken` and file tracking issues
- [ ] Step 7: Verify both pins are in sync before merging

Step 1 — GitHub CI pin

`docker/.ngc_version.dev` is a single-line file consumed by `docker/Dockerfile.ci.dev` (via `FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev)`). Overwrite it:

```bash
echo 'nvcr.io/nvidia/pytorch:<YY.MM>-py3' > docker/.ngc_version.dev
```

Historically the file has had no trailing newline. Preserving or adding one is fine, because the build arg reads the value through `$(cat ...)` and command substitution strips trailing newlines. Do not touch `docker/.ngc_version.lts` unless bumping LTS too.
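
The trailing-newline point can be checked in isolation. This is a small demonstration against a temporary file rather than the real pin; the tag `26.04-py3` is only the example tag used earlier, and the variable names are illustrative.

```shell
pin_demo="$(mktemp)"

# Write the pin without a trailing newline, then read it the way the build arg does.
printf 'nvcr.io/nvidia/pytorch:26.04-py3' > "$pin_demo"
without_newline="$(cat "$pin_demo")"

# Write it again with a trailing newline (what `echo` produces) and read it back.
printf 'nvcr.io/nvidia/pytorch:26.04-py3\n' > "$pin_demo"
with_newline="$(cat "$pin_demo")"

rm -f "$pin_demo"

# Command substitution strips trailing newlines, so both reads yield the same value.
[ "$without_newline" = "$with_newline" ] && echo "same build-arg value either way"
```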

Step 2 — GitLab CI pin

GitLab CI does not read `docker/.ngc_version.dev`. It hardcodes `BASE_IMAGE` in a `parallel: matrix:` block. Update the two `IMAGE_TYPE: dev` rows (one per platform):
```yaml
# .gitlab/stages/01.build.yml, under test:pre_build_image -> parallel.matrix
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3   # amd64 row
  PLATFORM: amd64
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3   # arm64 row
  PLATFORM: arm64
```

Leave the `IMAGE_TYPE: lts` rows alone. Quick sanity check before commit:

```bash
rg -n '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml
# expect: lts pin × 2 unchanged, dev pin × 2 == new tag
```
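
If you would rather script the edit than make it by hand, the following is a minimal sketch. It assumes each matrix row lists `IMAGE_TYPE` before `BASE_IMAGE`, as in the row layout shown for this file; `bump_gitlab_dev_rows` is a hypothetical helper, not an existing repository script.

```shell
# Hypothetical helper: rewrites BASE_IMAGE only in rows whose most recent
# IMAGE_TYPE line says dev. Rows with any other IMAGE_TYPE are left untouched.
# Usage: bump_gitlab_dev_rows <yaml-file> <new-image-reference>
bump_gitlab_dev_rows() {
  bump_file="$1"
  bump_image="$2"
  bump_tmp="$(mktemp)"
  awk -v image="$bump_image" '
    # Remember whether the current row is a dev row.
    /IMAGE_TYPE:/ { is_dev = ($0 ~ /IMAGE_TYPE:[ \t]*dev([ \t]|$)/) }
    # Swap only the image value; any trailing comment on the line is kept.
    /^[ \t]*BASE_IMAGE:/ && is_dev {
      sub(/BASE_IMAGE:[ \t]*[^ \t]+/, "BASE_IMAGE: " image)
    }
    { print }
  ' "$bump_file" > "$bump_tmp" || { rm -f "$bump_tmp"; return 1; }
  # Write back through redirection so the original file keeps its permissions.
  cat "$bump_tmp" > "$bump_file"
  rm -f "$bump_tmp"
}
```

A typical call would be `bump_gitlab_dev_rows .gitlab/stages/01.build.yml 'nvcr.io/nvidia/pytorch:<YY.MM>-py3'`, followed by a review of `git diff` to confirm that only the two dev rows changed.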

Step 3 — Open the PR

  • Title convention: `chore: Update Docker image version to <YY.MM>-py3` (see #4611).
  • Apply the `Run functional tests` label before the first push. This unlocks the full functional matrix on the PR; without it the bump only runs the standard GH PR checks and you'll miss the drift.
  • Push as draft first if you're still iterating; the bot will auto-draft otherwise.

Step 4 — Re-running CI on a new commit

For PRs from forks (the typical contributor case), each new commit needs an explicit `/ok to test <commit-sha>` PR comment to authorize NVIDIA runners (see the `copy-pr-bot` flow in #4611). One comment per commit. If `copy-pr-bot` reports "had a problem deploying to test", just push another commit (or re-issue the comment after the next push); the deploy is per-commit, not per-comment.
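
Because the comment has to name the exact commit, it can help to generate it rather than type it. A minimal sketch, where `ok_to_test_comment` is a hypothetical helper name:

```shell
# Hypothetical helper: formats the authorization comment for one commit SHA.
ok_to_test_comment() {
  [ -n "$1" ] || { echo "usage: ok_to_test_comment <commit-sha>" >&2; return 1; }
  printf '/ok to test %s\n' "$1"
}
```

Assuming the GitHub CLI is installed and authenticated, one way to post it is `gh pr comment <PR-number> --body "$(ok_to_test_comment "$(git rev-parse HEAD)")"`, run after each push.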

Step 5 — Golden-value drift

Container bumps shift CUDA / cuBLAS / cuDNN / kernel autotuning, which moves `lm loss`, `num-zeros`, `iteration-time`, and `mem-*` metrics on a large fraction of functional tests. This is expected and is not a correctness regression: refresh the golden values rather than chasing each test.

Hand off to the `update-golden-values` skill with:
  • `--source github`
  • `--pipeline-id <WORKFLOW_RUN_ID>` from the failing CI run
  • `--only-failing` (refresh just the trajectories that drifted)

PR #4611 refreshed 78 golden-value files across `dev_dgx_h100` and `dev_dgx_gb200` for GPT / MoE / MIMO / hybrid suites in a single pass via this exact flow. The per-metric relative-difference summary the skill produces is the recommended PR description blurb; reviewers expect to see it.

Step 6 — Real regressions: mark broken, don't block the bump

A small number of tests will genuinely break (hangs, OOM, real numerical regressions). Don't gate the base-image bump on fixing them, because that conflates two changes. Instead:
  1. File a GitHub issue describing the failure mode and linking the failing CI run.
  2. Flip the test's scope to the `-broken` variant in the recipe YAML under `tests/test_utils/recipes/<arch>/`, with an inline comment that references the issue. Pattern:

    ```yaml
    - test_case: [hybrid_dynamic_inference_tp1_ep8_nanov3_chunked_prefill]
      products:
        - environment: [dev]
          # Broken: hangs on repeat iter 3, exceeds 1h job limit. See issue #<N>.
          scope: [mr-broken, mr-github-broken]      # was: [mr, mr-github]
          platforms: [dgx_h100]
    ```

    Scope mapping (replace, don't append):

    | Before | After |
    | --- | --- |
    | `mr` | `mr-broken` |
    | `mr-github` | `mr-github-broken` |
    | `nightly` | `nightly-broken` |

    The recipe still runs in the `-broken` scope, but failures stop blocking PR merges.

Step 7 — Sync check before merging

The single biggest failure mode of this workflow is shipping the GitHub pin without the GitLab pin, which is what happened with #4611 and its follow-up #4688. Before you ask for the merge, confirm both pins resolve to the same tag:

```bash
echo -n "ngc_version.dev: " && cat docker/.ngc_version.dev
echo
echo "gitlab dev rows:"
# BASE_IMAGE sits on the line after IMAGE_TYPE in each matrix row,
# so match the dev rows and keep one line of trailing context.
rg -n -A1 'IMAGE_TYPE: dev' .gitlab/stages/01.build.yml \
  | rg 'BASE_IMAGE'
```

All three lines (the pin file plus the two GitLab dev rows) should show `nvcr.io/nvidia/pytorch:<YY.MM>-py3`. If they don't, fix it before merge; otherwise GitLab CI keeps building on the old container and the next person hits the same trap.
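
The same comparison can be turned into a pass/fail check, which is easier to trust than reading three lines by eye. This sketch assumes `IMAGE_TYPE` precedes `BASE_IMAGE` within each matrix row; `gitlab_dev_images` and `check_pins_in_sync` are hypothetical helper names.

```shell
# Hypothetical helper: prints the BASE_IMAGE value of every row whose most
# recent IMAGE_TYPE line says dev.
gitlab_dev_images() {
  awk '
    /IMAGE_TYPE:/ { is_dev = ($0 ~ /IMAGE_TYPE:[ \t]*dev([ \t]|$)/) }
    /^[ \t]*BASE_IMAGE:/ && is_dev { print $2 }
  ' "$1"
}

# Hypothetical helper: returns non-zero if any GitLab dev row differs from the pin.
# Usage: check_pins_in_sync <pin-file> <gitlab-yaml>
check_pins_in_sync() {
  sync_pin="$(cat "$1")"
  sync_rows="$(gitlab_dev_images "$2")"
  [ -n "$sync_rows" ] || { echo "no dev rows found in $2" >&2; return 1; }
  sync_status=0
  for sync_row in $sync_rows; do
    if [ "$sync_row" != "$sync_pin" ]; then
      echo "mismatch: GitLab has $sync_row, pin file has $sync_pin" >&2
      sync_status=1
    fi
  done
  return "$sync_status"
}
```

Run it as `check_pins_in_sync docker/.ngc_version.dev .gitlab/stages/01.build.yml`; a zero exit status means every dev row matches the pin file.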

File-touch cheat sheet

| Path | Edit |
| --- | --- |
| `docker/.ngc_version.dev` | Overwrite with new `nvcr.io/nvidia/pytorch:<YY.MM>-py3` |
| `.gitlab/stages/01.build.yml` | Update both `IMAGE_TYPE: dev` `BASE_IMAGE:` rows (amd64 + arm64) |
| `tests/functional_tests/test_cases/**/golden_values_dev_dgx_{h100,gb200}.json` | Refresh via the `update-golden-values` skill |
| `tests/test_utils/recipes/<arch>/<suite>.yaml` | Flip drifting / hanging cases to `mr-broken` / `mr-github-broken` with an issue link |
| `docker/.ngc_version.lts`, `.gitlab/stages/01.build.yml` `IMAGE_TYPE: lts` rows | Skip unless explicitly bumping LTS. LTS has its own release cadence. |

Gotchas

  • GitHub vs GitLab pins are independent. `docker/.ngc_version.dev` only drives GitHub CI's local container build via `Dockerfile.ci.dev`. GitLab CI has its own hardcoded `BASE_IMAGE:` matrix in `.gitlab/stages/01.build.yml`. PR #4688 existed solely because #4611 forgot the second one; don't repeat this.
  • Don't bump LTS along with dev. The `IMAGE_TYPE: lts` rows and `docker/.ngc_version.lts` are stability-pinned for the `container::lts` label path. Bump them in a dedicated PR with its own LTS validation.
  • Don't fix golden-value drift by hand. Use `tests/test_utils/python_scripts/download_golden_values.py` via the `update-golden-values` skill. Hand-editing the JSONs invites diff noise and relative-difference regressions on subsequent bumps.
  • `mr-broken` is a real scope, not a comment marker. It keeps the recipe wired into the matrix (so it stays discoverable and runnable on demand) without gating merges. Don't delete the test case from the recipe.
  • `/ok to test` is per-commit. A new force-push or fixup commit needs a fresh `/ok to test <sha>` comment to re-trigger NVIDIA-runner CI on a fork PR.
  • Don't merge until the GitLab pin matches. Use the Step 7 grep before requesting review.

Related skills

  • `update-golden-values`: call this as soon as the first post-bump CI run finishes and you have a workflow run ID with failing golden checks. Produces the per-metric relative-difference summary you paste into the PR description.
  • `build-and-dependency`: for verifying the new image builds locally before opening the PR (`docker build --target main --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) ...`).
  • `cicd`: for the PR scope-label semantics (`Run functional tests`, `complexity::*`) and the `copy-pr-bot` flow.