mcore-testing
Testing Guide
Answer-First Testing Facts
For questions about disabling tests without deleting them:
- Functional recipe entries stay in YAML; disable by suffixing scope with `-broken`, for example `scope: [mr-github]` -> `scope: [mr-github-broken]`.
- Unit-test skips use pytest markers instead: `@pytest.mark.flaky_in_dev` skips in the default dev environment, and `@pytest.mark.flaky` skips in LTS.
- Do not delete the test case or recipe entry when the goal is discoverability and easy re-enable.
Test Layout
```text
tests/
├── unit_tests/                  # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/            # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml    # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/                # YAML recipes for H100 jobs
    │   └── gb200/               # YAML recipes for GB200 jobs
    └── python_scripts/          # helpers (recipe_parser, golden-value download, …)
```

How Tests Execute
The GitHub Actions runner invokes `launch_nemo_run_workload.py`, which uses nemo-run to launch a `DockerExecutor` container. The repo is bind-mounted at `/opt/megatron-lm`; training data is mounted at `/mnt/artifacts`.

Unit tests are dispatched through `torch.distributed.run`:
- Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
- Per-rank log files land at `{assets_dir}/logs/1/` and are uploaded as a GitHub artifact after the run.

Functional tests are driven by `tests/functional_tests/shell_test_utils/run_ci_test.sh`. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Flaky-failure auto-retry: `launch_nemo_run_workload.py` retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
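
The auto-retry behavior can be pictured with a small sketch. This is not the code in `launch_nemo_run_workload.py`: the regular expressions, the `run_once` callable, and the reading of "up to 3 times" as three retries after the first attempt are all assumptions made for illustration.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns for illustration; the real list is maintained
# in launch_nemo_run_workload.py and may look quite different.
TRANSIENT_PATTERNS = [
    re.compile(r"NCCL.*timeout", re.IGNORECASE),
    re.compile(r"ECC error", re.IGNORECASE),
    re.compile(r"segmentation fault", re.IGNORECASE),
]


def is_transient(log_text: str) -> bool:
    """Return True if the log matches any known transient failure pattern."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)


def run_with_retry(
    run_once: Callable[[], Tuple[int, str]], max_retries: int = 3
) -> Tuple[int, int]:
    """Run a workload, retrying only when the failure looks transient.

    run_once is a hypothetical callable returning (exit_code, log_text).
    Returns (final_exit_code, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code, log_text = run_once()
        if exit_code == 0:
            return exit_code, attempts
        # Stop on a genuine failure, or once the retry budget is spent.
        if attempts > max_retries or not is_transient(log_text):
            return exit_code, attempts
```

The point of the design is that a non-matching failure is reported immediately, so genuine regressions are not hidden behind repeated runs.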
Recipe YAML Structure

Recipes live in `tests/test_utils/recipes/` and are parsed by `tests/test_utils/python_scripts/recipe_parser.py`. Each file expands a cartesian `products` block into individual workload specs:

```yaml
type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt          # maps to tests/functional_tests/test_cases/{model}/
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 8
  n_repeat: 5
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
products:
  - test_case: [my_test]
    products:
      - environment: [dev, lts]
        scope: [mr-github]
        platforms: [dgx_h100]
```

Key runtime placeholders: `{assets_dir}`, `{artifacts_dir}`, `{test_case}`, `{environment}`, `{platforms}`, `{n_repeat}`.
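
To make the cartesian expansion concrete, here is a minimal sketch of how one `products` entry could turn into workload names. It is not `recipe_parser.py`: the `expand_products` helper is hypothetical, and the nested `products` block is flattened into a single dictionary to keep the example short.

```python
import itertools
from typing import Dict, Iterator, List


def expand_products(product: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one dict per combination of the listed values (cartesian product)."""
    keys = list(product)
    for combo in itertools.product(*(product[key] for key in keys)):
        yield dict(zip(keys, combo))


# Flattened version of the products block from the recipe above.
product = {
    "test_case": ["my_test"],
    "environment": ["dev", "lts"],
    "scope": ["mr-github"],
    "platforms": ["dgx_h100"],
}

# Same template as spec.name; unused keys such as scope are simply ignored.
name_template = "{test_case}_{environment}_{platforms}"
names = [name_template.format(**combo) for combo in expand_products(product)]
print(names)  # ['my_test_dev_dgx_h100', 'my_test_lts_dgx_h100']
```

Two environments and one value for every other key give two workloads; adding a second platform would double that.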
Disabling a Test Without Deleting It

To temporarily disable a test case in a recipe YAML, suffix its `scope` value with `-broken`; do not delete the entry:

```yaml
# before (test runs in CI)
scope: [mr-github]

# after (test is skipped; entry preserved for easy re-enable)
scope: [mr-github-broken]
```
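
The suffix works because a renamed scope no longer matches the scope a pipeline asks for. The sketch below illustrates that idea under the assumption that entries are selected by exact scope match; the helper names are hypothetical and are not part of the repository.

```python
from typing import List

BROKEN_SUFFIX = "-broken"


def disable_scopes(scopes: List[str]) -> List[str]:
    """Append the -broken suffix to every scope that does not already have it."""
    return [s if s.endswith(BROKEN_SUFFIX) else s + BROKEN_SUFFIX for s in scopes]


def enable_scopes(scopes: List[str]) -> List[str]:
    """Strip the -broken suffix, restoring the original scope names."""
    return [
        s[: -len(BROKEN_SUFFIX)] if s.endswith(BROKEN_SUFFIX) else s for s in scopes
    ]


def is_selected(entry_scopes: List[str], requested_scope: str) -> bool:
    """Assumed selection rule: an entry runs only on an exact scope match."""
    return requested_scope in entry_scopes


disabled = disable_scopes(["mr-github"])
print(disabled)                             # ['mr-github-broken']
print(is_selected(disabled, "mr-github"))   # False
print(enable_scopes(disabled))              # ['mr-github']
```

Because the entry stays in the file, searching for `-broken` lists every disabled test, and re-enabling is a one-word edit.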
---

Running Unit Tests Locally

All unit tests initialize a `torch.distributed` group, so every invocation requires GPU access and must go through `torch.distributed.run`:

```bash
# Full suite
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests

# Single file
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests/models/test_gpt_model.py

# Single test
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor

# Filter by name substring
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests -k optimizer
```
Marker filters

```bash
# Exclude flaky tests during development
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests -m "not flaky and not flaky_in_dev"

# Include experimental tests
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests --experimental
```
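
All of the invocations above share one prefix and differ only in the target and the trailing pytest options. For scripting local runs, that pattern could be captured in a small helper; `build_unit_test_command` below is a hypothetical convenience function, not something shipped in the repository, and it only assembles the argument list without running anything.

```python
import shlex
from typing import List, Optional


def build_unit_test_command(
    target: str = "tests/unit_tests",
    nproc_per_node: int = 8,
    keyword: Optional[str] = None,
    marker_expr: Optional[str] = None,
    experimental: bool = False,
) -> List[str]:
    """Assemble the torch.distributed.run + pytest argv used in this guide."""
    cmd = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        "--nproc-per-node", str(nproc_per_node),
        "-m", "pytest", "-q", target,
    ]
    if keyword:
        cmd += ["-k", keyword]        # filter by name substring
    if marker_expr:
        cmd += ["-m", marker_expr]    # pytest marker expression
    if experimental:
        cmd.append("--experimental")  # include experimental tests
    return cmd


# Print a copy-pasteable command line with shell quoting applied.
print(shlex.join(build_unit_test_command(marker_expr="not flaky and not flaky_in_dev")))
```

Returning a list rather than a string keeps the marker expression as a single argument, so it can be passed straight to `subprocess.run` without extra quoting.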
CI parity

Use `tests/unit_tests/run_ci_test.sh` to reproduce a CI bucket failure exactly. For ad-hoc runs, prefer the direct `torch.distributed.run` invocations above.

Gotchas
- `pyproject.toml` sets `addopts = --durations=15 -s -rA`: stdout is not captured (`-s`), so ranks interleave during multi-rank runs. Override with `--capture=fd` when debugging a specific rank.
- `tests/unit_tests/conftest.py` looks for test data under `/opt/data` and attempts a download if missing. Supply it manually or skip data-dependent tests when running outside the canonical container.
Adding a Unit Test
1. Create `tests/unit_tests/<category>/test_<name>.py`.
2. Use fixtures from `tests/unit_tests/conftest.py`.
3. Apply markers as needed:
   - `@pytest.mark.internal`: skipped on the `legacy` tag
   - `@pytest.mark.flaky_in_dev`: skipped in the `dev` environment (CI default; use this to disable a flaky test without blocking the standard pipeline)
   - `@pytest.mark.flaky`: skipped in the `lts` environment
   - `@pytest.mark.experimental`: runs on the `latest` tag only
4. Verify locally (see Running Unit Tests Locally above).
5. If the test needs a dedicated CI bucket, add an entry to `tests/test_utils/recipes/h100/unit-tests.yaml`.
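
The marker rules in step 3 can be restated as a small decision function, which makes it easier to see which marker applies in which run. This is only a restatement of the list above, not the logic in `conftest.py`; the `skip_reason` name and its parameters are hypothetical.

```python
from typing import Optional, Set


def skip_reason(markers: Set[str], environment: str, tag: str) -> Optional[str]:
    """Return why a test would be skipped, or None if it would run.

    markers: marker names on the test, e.g. {"flaky_in_dev"}
    environment: "dev" or "lts"
    tag: "latest" or "legacy"
    """
    if "internal" in markers and tag == "legacy":
        return "internal tests are skipped on the legacy tag"
    if "flaky_in_dev" in markers and environment == "dev":
        return "flaky_in_dev tests are skipped in the dev environment"
    if "flaky" in markers and environment == "lts":
        return "flaky tests are skipped in the lts environment"
    if "experimental" in markers and tag != "latest":
        return "experimental tests run on the latest tag only"
    return None


print(skip_reason({"flaky_in_dev"}, environment="dev", tag="latest"))
print(skip_reason({"flaky_in_dev"}, environment="lts", tag="latest"))  # None
```

Note the asymmetry: `flaky_in_dev` does not skip the test in `lts`, and `flaky` does not skip it in `dev`, so a test that is unstable everywhere needs both markers.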
Adding a Functional / Integration Test
1. Create `tests/functional_tests/test_cases/<model>/<test_name>/`.
2. Write `model_config.yaml` with `MODEL_ARGS`, `ENV_VARS`, and `TEST_TYPE`.
3. Add a YAML recipe under `tests/test_utils/recipes/h100/` (and `gb200/` if needed). Required fields: `scope`, `environment`, `platform`, `n_repeat`, `time_limit`.
4. Push the PR and add the label **"Run functional tests"** to trigger a full run.
5. After a successful run, download golden values:
   ```bash
   python tests/test_utils/python_scripts/download_golden_values.py \
       --source github --pipeline-id <run-id>
   ```
6. Commit the downloaded golden values.
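
Conceptually, a functional test passes when the metrics from a run agree with the committed golden values. The sketch below shows one way such a comparison could work. The data layout (metric name mapped to a list of per-step floats), the relative tolerance, and the `find_mismatches` helper are all assumptions for illustration; the actual golden-value JSON schema and validation logic in the repository may differ.

```python
import math
from typing import Dict, List


def find_mismatches(
    golden: Dict[str, List[float]],
    actual: Dict[str, List[float]],
    rel_tol: float = 1e-5,
) -> List[str]:
    """Report the first disagreement per metric between golden and actual values."""
    problems = []
    for metric, expected in golden.items():
        observed = actual.get(metric)
        if observed is None:
            problems.append(f"{metric}: missing from run output")
            continue
        if len(observed) != len(expected):
            problems.append(
                f"{metric}: expected {len(expected)} values, got {len(observed)}"
            )
            continue
        for step, (exp, obs) in enumerate(zip(expected, observed)):
            if not math.isclose(exp, obs, rel_tol=rel_tol):
                problems.append(f"{metric}[{step}]: expected {exp}, got {obs}")
                break  # one report per metric is enough
    return problems


golden = {"lm loss": [10.5, 9.8]}
print(find_mismatches(golden, {"lm loss": [10.5, 9.8]}))  # []
print(find_mismatches(golden, {"lm loss": [10.5, 9.9]}))
```

A mismatch after an intentional numerical change is the case where regenerating and committing new golden values, as in steps 5 and 6, is appropriate.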
Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| Test passes locally but fails in CI | Different environment or data path | Check … |
| Golden value mismatch after a code change | Numerical regression | Download new golden values via … after a clean run |
| … | GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the … |