mcore-testing


Testing Guide

Answer-First Testing Facts

For questions about disabling tests without deleting them:
  • Functional recipe entries stay in YAML; disable by suffixing scope with -broken, for example scope: [mr-github] -> scope: [mr-github-broken].
  • Unit-test skips use pytest markers instead: @pytest.mark.flaky_in_dev skips in the default dev environment, and @pytest.mark.flaky skips in LTS.
  • Do not delete the test case or recipe entry when the goal is discoverability and easy re-enable.

Test Layout

text
tests/
├── unit_tests/          # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/    # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml          # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/        # YAML recipes for H100 jobs
    │   └── gb200/       # YAML recipes for GB200 jobs
    └── python_scripts/  # helpers (recipe_parser, golden-value download, …)
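
As a rough illustration of the layout above, the following sketch builds the two per-test-case file paths from their parts. The helper name `case_files` and the example values (`gpt`, `my_test`, `dev`, `dgx_h100`) are assumptions made for this example and are not taken from the repository's tooling.

```python
from pathlib import PurePosixPath

# Root of the functional test cases, as shown in the tree above.
TEST_CASES_ROOT = PurePosixPath("tests/functional_tests/test_cases")


def case_files(model, test_case, env, platform):
    """Return the expected config and golden-value paths for one test case.

    Hypothetical helper: it only mirrors the naming pattern in the tree.
    """
    case_dir = TEST_CASES_ROOT / model / test_case
    return {
        "config": case_dir / "model_config.yaml",
        "golden": case_dir / f"golden_values_{env}_{platform}.json",
    }


# Example values are placeholders chosen for illustration.
files = case_files("gpt", "my_test", "dev", "dgx_h100")
print(files["config"])
print(files["golden"])
```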

How Tests Execute

The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses nemo-run to launch a DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm; training data is mounted at /mnt/artifacts.

Unit tests are dispatched through torch.distributed.run:
  • Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
  • Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.

Functional tests are driven by tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
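
The retry behaviour described above can be pictured with a minimal sketch. This is not the code in launch_nemo_run_workload.py: the regular expressions, the `run_with_retry` helper, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real list lives in launch_nemo_run_workload.py.
TRANSIENT_PATTERNS = [r"NCCL.*timeout", r"ECC error", r"Segmentation fault"]
MAX_ATTEMPTS = 3  # assumption: "up to 3 times" read as three attempts in total


def is_transient(log_text):
    """True if the log matches any known transient failure pattern."""
    return any(re.search(p, log_text, re.IGNORECASE) for p in TRANSIENT_PATTERNS)


def run_with_retry(run_once, max_attempts=MAX_ATTEMPTS):
    """Call run_once() -> (exit_code, log_text) until success or a genuine failure.

    Returns (exit_code, attempts_used).
    """
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0:
            return 0, attempt
        if not is_transient(log_text):
            # Unknown failure pattern: treat it as genuine and stop retrying.
            return exit_code, attempt
    return exit_code, max_attempts


# Usage: a fake workload that fails once with a transient error, then passes.
results = iter([(1, "NCCL watchdog timeout"), (0, "ok")])
print(run_with_retry(lambda: next(results)))
```

Matching on log text keeps the retry decision separate from the exit code, so an ordinary assertion failure is reported on the first attempt rather than being retried.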

Recipe YAML Structure

Recipes live in tests/test_utils/recipes/ and are parsed by tests/test_utils/python_scripts/recipe_parser.py. Each file expands a cartesian products block into individual workload specs:
yaml
type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt              # maps to tests/functional_tests/test_cases/{model}/
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 8
  n_repeat: 5
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
products:
  - test_case: [my_test]
    products:
      - environment: [dev, lts]
        scope: [mr-github]
        platforms: [dgx_h100]

Key runtime placeholders: {assets_dir}, {artifacts_dir}, {test_case}, {environment}, {platforms}, {n_repeat}.
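
To make the cartesian expansion concrete, here is a minimal sketch that expands the products block from the recipe above into individual workloads and fills in the name template. It is an illustration only and is not the implementation in recipe_parser.py; the function `expand_products` and the plain-dict representation of the YAML are assumptions.

```python
from itertools import product


def expand_products(entry, inherited=None):
    """Recursively expand one `products` entry into flat key/value combinations."""
    inherited = dict(inherited or {})
    # Every key except the nested `products` list is an axis of the product.
    axes = {k: v for k, v in entry.items() if k != "products"}
    keys = list(axes)
    combos = []
    for values in product(*(axes[k] for k in keys)):
        combo = {**inherited, **dict(zip(keys, values))}
        nested = entry.get("products")
        if nested:
            for child in nested:
                combos.extend(expand_products(child, combo))
        else:
            combos.append(combo)
    return combos


# The `products` block from the recipe above, written as Python data.
recipe_products = [
    {
        "test_case": ["my_test"],
        "products": [
            {"environment": ["dev", "lts"], "scope": ["mr-github"], "platforms": ["dgx_h100"]},
        ],
    }
]
name_template = "{test_case}_{environment}_{platforms}"

workloads = [c for entry in recipe_products for c in expand_products(entry)]
names = [name_template.format(**w) for w in workloads]
print(names)  # ['my_test_dev_dgx_h100', 'my_test_lts_dgx_h100']
```

Listing two environments therefore yields two workloads from a single entry, which is why one recipe line can fan out into several CI jobs.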

Disabling a Test Without Deleting It

To temporarily disable a test case in a recipe YAML, suffix its scope value with -broken; do not delete the entry:
yaml
# before (test runs in CI)
scope: [mr-github]

# after (test is skipped; entry preserved for easy re-enable)
scope: [mr-github-broken]
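
The suffix convention is simple enough to script. The sketch below toggles the -broken suffix on a list of scope values; the helper names `disable_scope` and `enable_scope` are hypothetical and do not exist in the repository, and editing the YAML by hand works just as well.

```python
BROKEN_SUFFIX = "-broken"


def disable_scope(scopes):
    """Append -broken to every scope that does not already carry it."""
    return [s if s.endswith(BROKEN_SUFFIX) else s + BROKEN_SUFFIX for s in scopes]


def enable_scope(scopes):
    """Strip a trailing -broken from every scope that carries it."""
    return [s[: -len(BROKEN_SUFFIX)] if s.endswith(BROKEN_SUFFIX) else s for s in scopes]


disabled = disable_scope(["mr-github"])
print(disabled)                # ['mr-github-broken']
print(enable_scope(disabled))  # ['mr-github']
```

Both helpers are idempotent, so applying one twice leaves the scope list unchanged.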

Running Unit Tests Locally

All unit tests initialize a torch.distributed group, so every invocation requires GPU access and must go through torch.distributed.run:
bash
# Full suite
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests

# Single file
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests/models/test_gpt_model.py

# Single test
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor

# Filter by name substring
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests -k optimizer

Marker filters

bash
# Exclude flaky tests during development
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests -m "not flaky and not flaky_in_dev"

# Include experimental tests
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests --experimental
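
Because every local invocation shares the same prefix, the command lines in this section can be assembled programmatically, for example from a small wrapper script. The sketch below is one way to do that; the function `unit_test_cmd` and its parameters are hypothetical, while the flags themselves are the ones shown in this guide.

```python
import shlex

# Common prefix used by every local unit-test invocation in this guide.
BASE_CMD = [
    "uv", "run", "python", "-m", "torch.distributed.run",
    "--nproc-per-node", "8", "-m", "pytest", "-q",
]


def unit_test_cmd(target="tests/unit_tests", keyword=None, markers=None, experimental=False):
    """Build the argument list for one unit-test run (hypothetical helper)."""
    cmd = [*BASE_CMD, target]
    if keyword:
        cmd += ["-k", keyword]  # filter by name substring
    if markers:
        cmd += ["-m", markers]  # pytest marker expression
    if experimental:
        cmd.append("--experimental")
    return cmd


# Print a copy-pasteable command; shlex.join quotes the marker expression.
print(shlex.join(unit_test_cmd(markers="not flaky and not flaky_in_dev")))
```

Returning a list rather than a string means the result can be passed directly to subprocess.run without shell quoting concerns.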

CI parity

Use tests/unit_tests/run_ci_test.sh to reproduce a CI bucket failure exactly. For ad-hoc runs, prefer the direct torch.distributed.run invocations above.

Gotchas

  • pyproject.toml sets addopts = --durations=15 -s -rA, so stdout is not captured (-s) and ranks interleave during multi-rank runs. Override with --capture=fd when debugging a specific rank.
  • tests/unit_tests/conftest.py looks for test data under /opt/data and attempts a download if missing. Supply it manually or skip data-dependent tests when running outside the canonical container.

Adding a Unit Test

  1. Create tests/unit_tests/<category>/test_<name>.py.
  2. Use fixtures from tests/unit_tests/conftest.py.
  3. Apply markers as needed:
    • @pytest.mark.internal: skipped on the legacy tag
    • @pytest.mark.flaky_in_dev: skipped in the dev environment (CI default; use this to disable a flaky test without blocking the standard pipeline)
    • @pytest.mark.flaky: skipped in the lts environment
    • @pytest.mark.experimental: runs on the latest tag only
  4. Verify locally (see Running Unit Tests Locally above).
  5. If the test needs a dedicated CI bucket, add an entry to tests/test_utils/recipes/h100/unit-tests.yaml.
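
The marker rules in step 3 can be summarised as a small decision function. This sketch only restates the rules listed above; it is not the logic in tests/unit_tests/conftest.py, and the function `should_skip` with its `environment` and `tag` arguments is a hypothetical model of how CI selects them.

```python
def should_skip(markers, environment, tag):
    """Return a skip reason for a test, or None if the test should run.

    markers: set of marker names applied to the test
    environment: "dev" or "lts"
    tag: "legacy" or "latest"
    """
    if "flaky_in_dev" in markers and environment == "dev":
        return "flaky_in_dev is skipped in the dev environment"
    if "flaky" in markers and environment == "lts":
        return "flaky is skipped in the lts environment"
    if "internal" in markers and tag == "legacy":
        return "internal is skipped on the legacy tag"
    if "experimental" in markers and tag != "latest":
        return "experimental runs on the latest tag only"
    return None


# A test marked flaky_in_dev is skipped in dev but still runs in lts.
print(should_skip({"flaky_in_dev"}, environment="dev", tag="latest"))
print(should_skip({"flaky_in_dev"}, environment="lts", tag="latest"))
```

Under this reading, a test that is unstable in both environments would need both flaky_in_dev and flaky to be skipped everywhere.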

Adding a Functional / Integration Test

  1. Create tests/functional_tests/test_cases/<model>/<test_name>/.
  2. Write model_config.yaml with MODEL_ARGS, ENV_VARS, and TEST_TYPE.
  3. Add a YAML recipe under tests/test_utils/recipes/h100/ (and gb200/ if needed). Required fields: scope, environment, platform, n_repeat, time_limit.
  4. Push the PR and add the label "Run functional tests" to trigger a full run.
  5. After a successful run, download golden values:
    bash
    python tests/test_utils/python_scripts/download_golden_values.py \
      --source github --pipeline-id <run-id>
  6. Commit the downloaded golden values.

Common Pitfalls

Problem | Cause | Fix
Test passes locally but fails in CI | Different environment or data path | Check DATA_PATH, DATA_CACHE_PATH, and the environment tag (dev vs lts)
Golden value mismatch after a code change | Numerical regression | Download new golden values via download_golden_values.py after a clean run
cicd-integration-tests-gb200 not triggered | GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the Run functional tests label