bump-base-image
Bump the PyTorch base image
End-to-end workflow for moving Megatron-LM's CI to a newer `nvcr.io/nvidia/pytorch:<YY.MM>-py3` container. The most common failure mode is forgetting that GitHub CI and GitLab CI have separate pins: a bump that only touches the former lands green, then breaks GitLab CI on `main` and forces an immediate follow-up PR. Always update both in the same PR.

Inputs to gather from the user
- Target tag, e.g. `26.04-py3`. NVIDIA NGC PyTorch containers are released as `nvcr.io/nvidia/pytorch:YY.MM-py3`.
- Scope: usually `dev` only. The `lts` pin (`docker/.ngc_version.lts`, plus the `IMAGE_TYPE: lts` rows in GitLab) is bumped on a different cadence; only touch it if the user explicitly asks.
- Workflow run ID (optional but typical): after the first CI run, the user will provide a GitHub Actions run ID for golden-value refresh.
Workflow
- [ ] Step 1: Update the GitHub CI pin (docker/.ngc_version.dev)
- [ ] Step 2: Update the GitLab CI pin (.gitlab/stages/01.build.yml)
- [ ] Step 3: Open the PR with the `Run functional tests` label
- [ ] Step 4: Re-run failing tests via `/ok to test <commit-sha>`
- [ ] Step 5: For golden-value drift → refresh with the `update-golden-values` skill
- [ ] Step 6: For hangs / real regressions → mark tests `mr-broken` and file tracking issues
- [ ] Step 7: Verify both pins are in sync before merging

Step 1: GitHub CI pin
`docker/.ngc_version.dev` holds the GitHub CI pin. The GitHub CI build of `docker/Dockerfile.ci.dev` receives it as `FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev)`.

```bash
echo 'nvcr.io/nvidia/pytorch:<YY.MM>-py3' > docker/.ngc_version.dev
```

The file has no trailing newline historically; preserving or adding one is fine, because the build args read the value through `$(cat ...)` and command substitution strips trailing newlines. Do not touch `docker/.ngc_version.lts` unless bumping LTS too.
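The Step 1 edit can be wrapped in a small guard so a malformed tag never reaches the pin file. This is a minimal sketch: the function name `write_dev_pin` and the `YY.MM` format check are illustrative assumptions, not something the repository provides. Only the file path and the `nvcr.io/nvidia/pytorch:<YY.MM>-py3` format come from the text above.

```bash
# Hypothetical helper: validate a YY.MM tag, then overwrite the dev pin.
# Usage: write_dev_pin 26.04 [path-to-pin-file]
write_dev_pin() {
  tag="$1"
  pin_file="${2:-docker/.ngc_version.dev}"
  # Assumed format check: two-digit year, dot, month 01-12 (e.g. 26.04).
  if ! printf '%s\n' "$tag" | grep -Eq '^[0-9]{2}\.(0[1-9]|1[0-2])$'; then
    echo "invalid tag: $tag (expected YY.MM)" >&2
    return 1
  fi
  # printf writes no trailing newline, matching the file's historical form.
  printf 'nvcr.io/nvidia/pytorch:%s-py3' "$tag" > "$pin_file"
}
```

Calling `write_dev_pin 26.04` from the repository root would overwrite the dev pin; an invalid tag such as `26.13` leaves the file untouched and returns a non-zero status.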
Step 2: GitLab CI pin
GitLab CI does not read `docker/.ngc_version.dev`. It hardcodes `BASE_IMAGE` in a `parallel: matrix:` block. Update the two `IMAGE_TYPE: dev` rows (one per platform):

```yaml
# .gitlab/stages/01.build.yml, under test:pre_build_image -> parallel.matrix
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3  # amd64 row
  PLATFORM: amd64
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3  # arm64 row
  PLATFORM: arm64
```

Leave the `IMAGE_TYPE: lts` rows alone. Quick sanity check before commit:

```bash
rg -n '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml
# expect: lts pin × 2 unchanged, dev pin × 2 == new tag
```
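If editing the two dev rows by hand feels error-prone, the change can be scripted. The sketch below is hypothetical: `bump_gitlab_dev_rows` is not a repository tool, and it assumes `IMAGE_TYPE` appears before `BASE_IMAGE` inside each matrix entry, as in the snippet above. It prints the edited file to stdout and never touches `IMAGE_TYPE: lts` entries.

```bash
# Hypothetical helper: rewrite BASE_IMAGE only in IMAGE_TYPE: dev matrix entries.
# Assumes IMAGE_TYPE precedes BASE_IMAGE within each "- IMAGE: ..." entry.
# Usage: bump_gitlab_dev_rows 26.04 [path-to-yaml]
bump_gitlab_dev_rows() {
  awk -v tag="$1" '
    /^[ \t]*- /   { is_dev = 0 }                          # a new matrix entry starts
    /IMAGE_TYPE:/ { is_dev = ($0 ~ /IMAGE_TYPE: dev/) }   # remember the entry type
    is_dev && /BASE_IMAGE: nvcr\.io\/nvidia\/pytorch:/ {
      sub(/pytorch:[^ ]*/, "pytorch:" tag "-py3")         # swap only the tag
    }
    { print }
  ' "${2:-.gitlab/stages/01.build.yml}"
}
```

One way to apply it is to redirect the output to a temporary file, move that file over `.gitlab/stages/01.build.yml`, and then run the `rg` sanity check to confirm the result.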
Step 3: Open the PR
- Title convention: `chore: Update Docker image version to <YY.MM>-py3` (see #4611).
- Apply the `Run functional tests` label before the first push. This unlocks the full functional matrix on the PR; without it the bump only runs the standard GH PR checks and you'll miss the drift.
- Push as draft first if you're still iterating; the bot will auto-draft otherwise.
Step 4: Re-running CI on a new commit

For PRs from forks (the typical contributor case), each new commit needs an explicit `/ok to test <commit-sha>` PR comment to authorize NVIDIA runners (see the `copy-pr-bot` flow in #4611). One comment per commit. If `copy-pr-bot` reports "had a problem deploying to test", just push another commit (or re-issue the comment after the next push); the deploy is per-commit, not per-comment.
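Because the comment has to name the exact commit, it helps to build it from the SHA rather than typing it. A minimal sketch follows; `ok_to_test_comment` is a hypothetical helper, and passing the full SHA from `git rev-parse HEAD` is an assumption about what the bot accepts.

```bash
# Hypothetical helper: build the authorization comment for one commit.
# Usage: ok_to_test_comment <commit-sha>
ok_to_test_comment() {
  printf '/ok to test %s' "$1"
}

# Example (not run here): post it on the current branch's PR with the GitHub CLI.
# gh pr comment --body "$(ok_to_test_comment "$(git rev-parse HEAD)")"
```

Re-run the commented `gh pr comment` line after every push, since each new commit needs its own comment.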
Step 5: Golden-value drift
Container bumps shift CUDA / cuBLAS / cuDNN / kernel autotuning, which moves `lm loss`, `num-zeros`, `iteration-time`, and `mem-*` metrics on a large fraction of functional tests. This is expected and is not a correctness regression, so refresh the golden values rather than chasing each test.

Hand off to the `update-golden-values` skill with:

- `--source github`
- `--pipeline-id <WORKFLOW_RUN_ID>` from the failing CI run
- `--only-failing` (refresh just the trajectories that drifted)

PR #4611 refreshed 78 golden-value files across `dev_dgx_h100` and `dev_dgx_gb200` for GPT / MoE / MIMO / hybrid suites in a single pass via this exact flow. The per-metric relative-difference summary the skill produces is the recommended PR description blurb; reviewers expect to see it.
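For readers unfamiliar with the term, the relative difference of a metric is |new - old| / |old|. The sketch below only illustrates that arithmetic for a single value; `rel_diff` is a hypothetical helper and is not how the `update-golden-values` skill computes its summary.

```bash
# Illustrative only: relative difference |new - old| / |old| for one metric value.
# Usage: rel_diff <old> <new>
rel_diff() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    if (old == 0) { print "nan"; exit }   # undefined when the old value is zero
    d = new - old; if (d < 0) d = -d      # absolute change
    a = old;       if (a < 0) a = -a      # absolute baseline
    printf "%.4f\n", d / a
  }'
}
```

For example, `rel_diff 2.0 2.1` reports a 5% shift as `0.0500`.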
Step 6: Real regressions (mark broken, don't block the bump)
A small number of tests will genuinely break (hangs, OOM, real numerical regressions). Don't gate the base-image bump on fixing them, because that conflates two changes. Instead:

- File a GitHub issue describing the failure mode and linking the failing CI run.
- Flip the test's scope to the `-broken` variant in the recipe YAML under `tests/test_utils/recipes/<arch>/`, with an inline comment that references the issue. Pattern:

  ```yaml
  - test_case: [hybrid_dynamic_inference_tp1_ep8_nanov3_chunked_prefill]
    products:
      - environment: [dev]
        # Broken: hangs on repeat iter 3, exceeds 1h job limit; see issue #<N>.
        scope: [mr-broken, mr-github-broken]  # was: [mr, mr-github]
        platforms: [dgx_h100]
  ```

Scope mapping (replace, don't append):

| Before | After |
|---|---|
| `mr` | `mr-broken` |
| `mr-github` | `mr-github-broken` |
| `nightly` | `nightly-broken` |

The recipe still runs in the `-broken` scope, but failures stop blocking PR merges.
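The scope mapping is mechanical, so it can be expressed as a lookup. This is a sketch only: `to_broken_scope` is a hypothetical helper, and rejecting scopes outside the three listed ones is an assumption made to keep the example strict.

```bash
# Hypothetical helper: map a blocking scope to its -broken variant (Step 6 table).
# Usage: to_broken_scope <scope>
to_broken_scope() {
  case "$1" in
    mr|mr-github|nightly) printf '%s-broken\n' "$1" ;;
    *-broken)             printf '%s\n' "$1" ;;   # already flipped, leave as-is
    *) echo "unknown scope: $1" >&2; return 1 ;;
  esac
}
```

Each original scope is replaced by its mapped value in the recipe's `scope:` list; the original is not kept alongside it.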
Step 7: Sync check before merging
The single biggest failure mode of this workflow is shipping the GitHub pin without the GitLab pin, which is what happened when #4611 merged and #4688 had to follow it. Before you ask for the merge, confirm both pins resolve to the same tag:

```bash
echo -n "ngc_version.dev: " && cat docker/.ngc_version.dev
echo
echo "gitlab dev rows:"
# BASE_IMAGE sits on the line after IMAGE_TYPE: dev (see Step 2), so take one
# line of trailing context and keep only the BASE_IMAGE lines.
rg -n -A1 'IMAGE_TYPE: dev' .gitlab/stages/01.build.yml \
  | rg 'BASE_IMAGE'
```

All three lines (the pin file plus the two GitLab dev rows) should show `nvcr.io/nvidia/pytorch:<YY.MM>-py3`. If they don't, fix it before merge; otherwise GitLab CI keeps building on the old container and the next person hits the same trap.
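The same check can be made to fail loudly instead of relying on a visual comparison. The function below is a hypothetical sketch, not a repository script; it assumes `BASE_IMAGE` follows `IMAGE_TYPE: dev` inside each matrix entry and that the image reference is the second whitespace-separated field on its line.

```bash
# Hypothetical helper: fail unless every dev BASE_IMAGE row equals the dev pin file.
# Usage: check_pins_in_sync [pin-file] [gitlab-yaml]
check_pins_in_sync() {
  pin_file="${1:-docker/.ngc_version.dev}"
  yaml_file="${2:-.gitlab/stages/01.build.yml}"
  want=$(cat "$pin_file")          # command substitution drops any trailing newline
  rows=$(awk '
    /^[ \t]*- /   { is_dev = 0 }                         # new matrix entry
    /IMAGE_TYPE:/ { is_dev = ($0 ~ /IMAGE_TYPE: dev/) }  # track the entry type
    is_dev && /BASE_IMAGE:/ { print $2 }                 # image reference only
  ' "$yaml_file")
  [ -n "$rows" ] || { echo "no dev rows found in $yaml_file" >&2; return 1; }
  for got in $rows; do
    if [ "$got" != "$want" ]; then
      echo "mismatch: GitLab has $got, pin file has $want" >&2
      return 1
    fi
  done
  echo "in sync: $want"
}
```

Run from the repository root with no arguments, it returns non-zero on any mismatch, which makes it usable as a pre-merge gate in a local script.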
File-touch cheat sheet
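| Path | Edit |
|---|---|
| `docker/.ngc_version.dev` | Overwrite with new `nvcr.io/nvidia/pytorch:<YY.MM>-py3` |
| `.gitlab/stages/01.build.yml` | Update both `IMAGE_TYPE: dev` rows |
| Golden-value JSON files | Refresh via the `update-golden-values` skill |
| Recipe YAML under `tests/test_utils/recipes/<arch>/` | Flip drifting / hanging cases to the `-broken` scope |
| `docker/.ngc_version.lts` and the `IMAGE_TYPE: lts` rows | Skip unless explicitly bumping LTS. LTS has its own release cadence. |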
Gotchas
- GitHub vs GitLab pins are independent. `docker/.ngc_version.dev` only drives GitHub CI's local container build via `Dockerfile.ci.dev`. GitLab CI has its own hardcoded `BASE_IMAGE:` matrix in `.gitlab/stages/01.build.yml`. PR #4688 existed solely because #4611 forgot the second one. Don't repeat this.
- Don't bump LTS along with dev. The `IMAGE_TYPE: lts` rows and `docker/.ngc_version.lts` are stability-pinned for the `container::lts` label path. Bump them in a dedicated PR with its own LTS validation.
- Don't fix golden-value drift by hand. Use `tests/test_utils/python_scripts/download_golden_values.py` via the `update-golden-values` skill. Hand-editing the JSONs invites diff noise and relative-difference regressions on subsequent bumps.
- `mr-broken` is a real scope, not a comment marker. It keeps the recipe wired into the matrix (so it stays discoverable and runnable on demand) without gating merges. Don't delete the test case from the recipe.
- `/ok to test` is per-commit. A new force-push or fixup commit needs a fresh `/ok to test <sha>` comment to re-trigger NVIDIA-runner CI on a fork PR.
- Don't merge until the GitLab pin matches. Use the Step 7 grep before requesting review.
Related skills
- update-golden-values: call this as soon as the first post-bump CI run finishes and you have a workflow run ID with failing golden checks. Produces the per-metric relative-difference summary you paste into the PR description.
- build-and-dependency: for verifying the new image builds locally before opening the PR (`docker build --target main --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) ...`).
- cicd: for the PR scope-label semantics (`Run functional tests`, `complexity::*`) and the `copy-pr-bot` flow.