Test system for Megatron-LM. Covers test layout, recipe YAML structure, adding and running unit and functional tests, golden values, marker filters, and CI parity.
npx skill4agent add nvidia/skills mcore-testing

-broken
scope: [mr-github]
scope: [mr-github-broken]
@pytest.mark.flaky_in_dev
@pytest.mark.flaky

tests/
├── unit_tests/                  # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/            # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml                    # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/                # YAML recipes for H100 jobs
    │   └── gb200/               # YAML recipes for GB200 jobs
    └── python_scripts/          # helpers (recipe_parser, golden-value download, …)

launch_nemo_run_workload.py
DockerExecutor
/opt/megatron-lm
/mnt/artifacts
torch.distributed.run
{assets_dir}/logs/1/
tests/functional_tests/shell_test_utils/run_ci_test.sh
launch_nemo_run_workload.py
tests/test_utils/recipes/
tests/test_utils/python_scripts/recipe_parser.py
products

type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt  # maps to tests/functional_tests/test_cases/{model}/
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 8
  n_repeat: 5
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
products:
  - test_case: [my_test]
    products:
      - environment: [dev, lts]
        scope: [mr-github]
        platforms: [dgx_h100]

{assets_dir}
{artifacts_dir}
{test_case}
{environment}
{platforms}
{n_repeat}
scope
-broken

# before (test runs in CI)
scope: [mr-github]
# after (test is skipped; entry preserved for easy re-enable)
scope: [mr-github-broken]

torch.distributed
torch.distributed.run

# Full suite
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
tests/unit_tests
# Single file
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
tests/unit_tests/models/test_gpt_model.py
# Single test
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor
# Filter by name substring
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
tests/unit_tests -k optimizer

# Exclude flaky tests during development
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
tests/unit_tests -m "not flaky and not flaky_in_dev"
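
The marker expression in the command above keeps only tests that carry neither marker. The following is a minimal pure-Python sketch of that selection rule, not pytest's own implementation: the `selected` helper and the test names are hypothetical, and only the `flaky` and `flaky_in_dev` marker names come from this document.

```python
# Illustrative stand-in for a pytest marker expression of the form
# "not A and not B". This is NOT pytest's implementation.

def selected(test_markers: dict[str, set[str]], excluded: set[str]) -> list[str]:
    """Return the names of tests that carry none of the excluded markers."""
    return [name for name, marks in test_markers.items() if not (marks & excluded)]


# Hypothetical test names; the marker names are the ones used in the command.
tests = {
    "test_constructor": set(),
    "test_slow_allreduce": {"flaky"},
    "test_new_optimizer": {"flaky_in_dev"},
    "test_both": {"flaky", "flaky_in_dev"},
}

kept = selected(tests, {"flaky", "flaky_in_dev"})
print(kept)  # ['test_constructor']
```

A test marked with either `flaky` or `flaky_in_dev` is deselected, so only unmarked tests remain in this example.
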
# Include experimental tests
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
tests/unit_tests --experimental

tests/unit_tests/run_ci_test.sh
torch.distributed.run
pyproject.toml
addopts = --durations=15 -s -rA
-s
--capture=fd
tests/unit_tests/conftest.py
/opt/data
tests/unit_tests/<category>/test_<name>.py
tests/unit_tests/conftest.py

@pytest.mark.internal        legacy
@pytest.mark.flaky_in_dev    dev
@pytest.mark.flaky           lts
@pytest.mark.experimental    latest

tests/test_utils/recipes/h100/unit-tests.yaml
tests/functional_tests/test_cases/<model>/<test_name>/model_config.yaml
MODEL_ARGS
ENV_VARS
TEST_TYPE
tests/test_utils/recipes/h100/
gb200/
scope
environment
platform
n_repeat
time_limit

python tests/test_utils/python_scripts/download_golden_values.py \
--source github --pipeline-id <run-id>

| Problem | Cause | Fix |
|---|---|---|
| Test passes locally but fails in CI | Different environment or data path | Check |
| Golden value mismatch after a code change | Numerical regression | Download new golden values via |
| GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the |
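
To make the "golden value mismatch" row concrete, here is a rough sketch of what a golden-value comparison might look like. It is not the repository's actual checker. The JSON layout (metric name mapped to a list of per-step values), the `compare_to_golden` helper, and the relative tolerance are all assumptions made for illustration; this document does not specify the real schema or tolerance.

```python
import json
import math


def compare_to_golden(actual, golden, rel_tol=1e-5):
    """Return (metric, step, got, expected) tuples for every mismatch.

    Assumed layout: {metric_name: [value_at_step_0, value_at_step_1, ...]}.
    A length mismatch is reported with step=None and the two lengths.
    """
    mismatches = []
    for metric, expected_values in golden.items():
        actual_values = actual.get(metric, [])
        if len(actual_values) != len(expected_values):
            mismatches.append((metric, None, len(actual_values), len(expected_values)))
            continue
        for step, (got, want) in enumerate(zip(actual_values, expected_values)):
            if not math.isclose(got, want, rel_tol=rel_tol):
                mismatches.append((metric, step, got, want))
    return mismatches


# Hypothetical golden file content and a run that drifts at the last step.
golden = json.loads('{"lm loss": [10.5, 9.8, 9.1]}')
actual = {"lm loss": [10.5, 9.8, 9.4]}

mismatches = compare_to_golden(actual, golden)
print(mismatches)  # [('lm loss', 2, 9.4, 9.1)]
```

If a mismatch like this follows an intentional numerical change, refreshing the golden values is the appropriate response. If the change was not meant to affect numerics, the mismatch is worth investigating first.
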