Bump the NVIDIA PyTorch base image (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI. Covers the two pin sites (GitHub CI in `docker/.ngc_version.dev` and GitLab CI in `.gitlab/stages/01.build.yml`), the post-bump CI loop (re-run functional tests, refresh golden values, mark broken tests), and the gotchas that bit PRs
`npx skill4agent add nvidia/skills bump-base-image`

`nvcr.io/nvidia/pytorch:<YY.MM>-py3`, `main`, `26.04-py3`, `nvcr.io/nvidia/pytorch:YY.MM-py3`, `dev`, `lts`, `docker/.ngc_version.lts`, `IMAGE_TYPE: lts`

- [ ] Step 1: Update the GitHub CI pin (docker/.ngc_version.dev)
- [ ] Step 2: Update the GitLab CI pin (.gitlab/stages/01.build.yml)
- [ ] Step 3: Open the PR with the `Run functional tests` label
- [ ] Step 4: Re-run failing tests via `/ok to test <commit-sha>`
- [ ] Step 5: For golden-value drift → refresh with the `update-golden-values` skill
- [ ] Step 6: For hangs / real regressions → mark tests `mr-broken` and file tracking issues
- [ ] Step 7: Verify both pins are in sync before merging

`docker/.ngc_version.dev`, `docker/Dockerfile.ci.dev`, `FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev)`

    echo 'nvcr.io/nvidia/pytorch:<YY.MM>-py3' > docker/.ngc_version.dev

`$(cat ...)`, `docker/.ngc_version.lts`, `docker/.ngc_version.dev`, `BASE_IMAGE`, `parallel: matrix:`, `IMAGE_TYPE: dev`

    # .gitlab/stages/01.build.yml: under test:pre_build_image -> parallel.matrix
    - IMAGE: CI_MCORE_DEV_IMAGE
      FILE: Dockerfile.ci.dev
      IMAGE_TYPE: dev
      BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3 # amd64 row
      PLATFORM: amd64
    - IMAGE: CI_MCORE_DEV_IMAGE
      FILE: Dockerfile.ci.dev
      IMAGE_TYPE: dev
      BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3 # arm64 row
      PLATFORM: arm64

`IMAGE_TYPE: lts`

    rg -n '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml
    # expect: lts pin × 2 unchanged, dev pin × 2 == new tag

`chore: Update Docker image version to <YY.MM>-py3`, `Run functional tests`, `/ok to test <commit-sha>`, `copy-pr-bot`, `lm loss`, `num-zeros`, `iteration-time`, `mem-*`, `update-golden-values`, `--source github`, `--pipeline-id <WORKFLOW_RUN_ID>`, `--only-failing`, `dev_dgx_h100`, `dev_dgx_gb200`, `-broken`, `tests/test_utils/recipes/<arch>/`

    - test_case: [hybrid_dynamic_inference_tp1_ep8_nanov3_chunked_prefill]
      products:
        - environment: [dev]
          # Broken: hangs on repeat iter 3, exceeds 1h job limit; see issue #<N>.
          scope: [mr-broken, mr-github-broken] # was: [mr, mr-github]
          platforms: [dgx_h100]

| Before | After |
|---|---|
| |
| |
| |
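
The scope flip shown in the recipe above could be scripted roughly as follows. This is a minimal sketch, not part of the skill: `mark_test_broken` is a hypothetical helper name, and it assumes each `scope:` value is an inline list on a single line inside the matching `- test_case: [...]` item. It does not add the tracking comment or file the issue; those still need to be done by hand.

```shell
#!/usr/bin/env bash
# mark_test_broken RECIPE_FILE TEST_CASE
# Hypothetical helper: appends "-broken" to every scope entry of one test case.
mark_test_broken() {
  local file="$1" test_case="$2" tmp
  tmp="$(mktemp)"
  awk -v tc="$test_case" '
    # A new "- test_case: [...]" item starts: note whether it is the target.
    /^[ \t]*-[ \t]*test_case:/ { in_target = (index($0, "[" tc "]") > 0) }

    # Inside the target item, rewrite the inline scope list.
    in_target && /^[ \t]*scope:[ \t]*\[/ {
      lb = index($0, "[")
      rb = index($0, "]")
      n = split(substr($0, lb + 1, rb - lb - 1), items, ",")
      out = ""
      for (i = 1; i <= n; i++) {
        gsub(/^[ \t]+|[ \t]+$/, "", items[i])            # trim whitespace
        if (items[i] !~ /-broken$/) items[i] = items[i] "-broken"
        out = out (i > 1 ? ", " : "") items[i]
      }
      # Keep everything before "[" and everything from "]" on (e.g. comments).
      $0 = substr($0, 1, lb) out substr($0, rb)
    }
    { print }
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}
```

Running it twice on the same test case is harmless, since entries that already end in `-broken` are left alone.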
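
`-broken`

    echo -n "ngc_version.dev: " && cat docker/.ngc_version.dev
    echo
    echo "gitlab dev rows:"
    # Pull each BASE_IMAGE row together with the line above it (-B1), keep the
    # pairs whose preceding line is IMAGE_TYPE: dev, then print only BASE_IMAGE.
    rg -n -B1 '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml \
      | rg -A1 'IMAGE_TYPE: dev' \
      | rg 'BASE_IMAGE'

`nvcr.io/nvidia/pytorch:<YY.MM>-py3`

| Path | Edit |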
|---|---|
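| `docker/.ngc_version.dev` | Overwrite with the new `nvcr.io/nvidia/pytorch:<YY.MM>-py3` tag. |
| `.gitlab/stages/01.build.yml` | Update both `IMAGE_TYPE: dev` rows (amd64 and arm64). |
| | Refresh via the `update-golden-values` skill. |
| `tests/test_utils/recipes/<arch>/` | Flip drifting / hanging cases to `mr-broken`. |
| `docker/.ngc_version.lts` | Skip unless explicitly bumping LTS. LTS has its own release cadence. |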
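
The two pin edits (the GitHub CI file and the GitLab CI dev rows) can be sketched as one function. This is an illustration under stated assumptions rather than the skill's own tooling: `bump_dev_pins` is a hypothetical name, the tag is assumed to look like `26.04-py3`, and each matrix row is assumed to list `IMAGE_TYPE` before `BASE_IMAGE`, as in the snippet from `.gitlab/stages/01.build.yml`. LTS rows and `docker/.ngc_version.lts` are deliberately left untouched.

```shell
#!/usr/bin/env bash
# bump_dev_pins TAG [REPO_ROOT]
# Hypothetical helper: writes the new image to both dev pin sites.
bump_dev_pins() {
  local tag="$1" root="${2:-.}" image tmp
  local re='^[0-9]{2}\.[0-9]{2}-py3$'
  # Reject anything that is not shaped like YY.MM-py3.
  if ! [[ "$tag" =~ $re ]]; then
    echo "unexpected tag format: $tag" >&2
    return 1
  fi
  image="nvcr.io/nvidia/pytorch:${tag}"

  # GitHub CI pin: the file holds just the image reference.
  echo "$image" > "$root/docker/.ngc_version.dev"

  # GitLab CI pin: rewrite BASE_IMAGE only in rows whose IMAGE_TYPE is dev.
  tmp="$(mktemp)"
  awk -v image="$image" '
    /^[ \t]*-[ \t]/ { type = "" }                 # a new matrix row begins
    /^[ \t-]*IMAGE_TYPE:/ {
      t = $0
      sub(/.*IMAGE_TYPE:[ \t]*/, "", t)           # drop the key
      sub(/[ \t#].*/, "", t)                      # drop any trailing comment
      type = t
    }
    type == "dev" && /^[ \t]*BASE_IMAGE:[ \t]*nvcr\.io\/nvidia\/pytorch:/ {
      sub(/nvcr\.io\/nvidia\/pytorch:[^ \t#]+/, image)
    }
    { print }
  ' "$root/.gitlab/stages/01.build.yml" > "$tmp" \
    && mv "$tmp" "$root/.gitlab/stages/01.build.yml"
}
```

From the repository root, `bump_dev_pins 26.04-py3` would update both sites. Review the resulting diff before committing, since the row layout is an assumption.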

`docker/.ngc_version.dev`, `Dockerfile.ci.dev`, `BASE_IMAGE:`, `.gitlab/stages/01.build.yml`, `IMAGE_TYPE: lts`, `docker/.ngc_version.lts`, `container::lts`, `tests/test_utils/python_scripts/download_golden_values.py`, `update-golden-values`, `mr-broken`, `/ok to test`, `/ok to test <sha>`, `docker build --target main --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) ...`, `Run functional tests`, `complexity::*`, `copy-pr-bot`
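
The Step 7 check (both pins in sync before merging) could be turned into a pass/fail function, for example as a local pre-merge check. This is a sketch only: `check_dev_pins_in_sync` is a hypothetical name, and it assumes `docker/.ngc_version.dev` contains nothing but the image reference and that each GitLab matrix row lists `IMAGE_TYPE` before `BASE_IMAGE`.

```shell
#!/usr/bin/env bash
# check_dev_pins_in_sync [REPO_ROOT]
# Hypothetical helper: succeeds only if every GitLab dev row matches the
# image stored in docker/.ngc_version.dev.
check_dev_pins_in_sync() {
  local root="${1:-.}" github_pin gitlab_pins
  github_pin="$(cat "$root/docker/.ngc_version.dev")"

  # Collect the distinct BASE_IMAGE values of rows whose IMAGE_TYPE is dev.
  gitlab_pins="$(awk '
    /^[ \t]*-[ \t]/ { type = "" }                 # a new matrix row begins
    /^[ \t-]*IMAGE_TYPE:/ {
      t = $0
      sub(/.*IMAGE_TYPE:[ \t]*/, "", t)
      sub(/[ \t#].*/, "", t)
      type = t
    }
    type == "dev" && /^[ \t]*BASE_IMAGE:/ {
      v = $0
      sub(/.*BASE_IMAGE:[ \t]*/, "", v)           # drop the key
      sub(/[ \t#].*/, "", v)                      # drop any trailing comment
      print v
    }
  ' "$root/.gitlab/stages/01.build.yml" | sort -u)"

  # In sync means exactly one distinct dev value, equal to the GitHub pin.
  if [ "$gitlab_pins" = "$github_pin" ]; then
    echo "in sync: $github_pin"
  else
    echo "out of sync: docker/.ngc_version.dev=${github_pin}" >&2
    echo "gitlab dev rows: ${gitlab_pins:-<none found>}" >&2
    return 1
  fi
}
```

A non-zero return covers three cases: the dev rows disagree with each other, they disagree with the file, or no dev rows were found at all.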