mcore-testing

Testing Guide

Answer-First Testing Facts

For questions about disabling tests without deleting them:

- Functional recipe entries stay in YAML; disable one by suffixing its scope with `-broken`, for example `scope: [mr-github]` -> `scope: [mr-github-broken]`.
- Unit-test skips use pytest markers instead: `@pytest.mark.flaky_in_dev` skips in the default dev environment, and `@pytest.mark.flaky` skips in LTS.
- Do not delete the test case or recipe entry when the goal is discoverability and easy re-enable.

Test Layout

tests/
├── unit_tests/          # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/    # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml          # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/        # YAML recipes for H100 jobs
    │   └── gb200/       # YAML recipes for GB200 jobs
    └── python_scripts/  # helpers (recipe_parser, golden-value download, …)

How Tests Execute

The GitHub Actions runner invokes `launch_nemo_run_workload.py`, which uses nemo-run to launch a `DockerExecutor` container. The repo is bind-mounted at `/opt/megatron-lm`; training data is mounted at `/mnt/artifacts`.

Unit tests are dispatched through `torch.distributed.run`:

- Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
- Per-rank log files land at `{assets_dir}/logs/1/` and are uploaded as a GitHub artifact after the run.

Functional tests are driven by `tests/functional_tests/shell_test_utils/run_ci_test.sh`. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Flaky-failure auto-retry: `launch_nemo_run_workload.py` retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
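
The auto-retry behavior can be pictured with a minimal sketch. This is not the code in `launch_nemo_run_workload.py`: the pattern list, `is_transient`, and `run_with_retries` are hypothetical names invented for illustration, and the sketch assumes that "up to 3 times" means at most three attempts in total.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical patterns for illustration only; the real script's list
# is not reproduced here.
TRANSIENT_PATTERNS: List["re.Pattern[str]"] = [
    re.compile(r"NCCL.*timeout", re.IGNORECASE),
    re.compile(r"ECC error", re.IGNORECASE),
    re.compile(r"segmentation fault", re.IGNORECASE),
]


def is_transient(log_text: str) -> bool:
    """Return True if the log matches any known transient-failure pattern."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)


def run_with_retries(
    run_once: Callable[[], Tuple[int, str]], max_attempts: int = 3
) -> Tuple[int, int]:
    """Call run_once until it succeeds, fails non-transiently, or runs out
    of attempts. Returns (exit_code, attempts_used)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0:
            return 0, attempt
        if not is_transient(log_text):
            # A genuine failure: report it immediately instead of retrying.
            return exit_code, attempt
    return exit_code, max_attempts
```

The point of the design is that only failures matching a known transient pattern are retried; anything else surfaces on the first attempt, so real regressions are not hidden behind retries.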

Recipe YAML Structure

Recipes live in `tests/test_utils/recipes/` and are parsed by `tests/test_utils/python_scripts/recipe_parser.py`. Each file expands a cartesian `products` block into individual workload specs:

type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt              # maps to tests/functional_tests/test_cases/{model}/
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 8
  n_repeat: 5
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
products:
  - test_case: [my_test]
    products:
      - environment: [dev, lts]
        scope: [mr-github]
        platforms: [dgx_h100]

Key runtime placeholders: `{assets_dir}`, `{artifacts_dir}`, `{test_case}`, `{environment}`, `{platforms}`, `{n_repeat}`.
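
The cartesian expansion of a `products` block can be sketched as follows. This is an illustrative approximation, not the logic of `recipe_parser.py`: the function name `expand_products` and the merge rule (nested entries are combined with their parent's values) are assumptions made for the example.

```python
import itertools
from typing import Any, Dict, List


def expand_products(products: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand list-valued keys into one dict per combination, recursing
    into any nested 'products' list and merging with the parent values."""
    expanded: List[Dict[str, Any]] = []
    for entry in products:
        nested = entry.get("products")
        axes = {key: values for key, values in entry.items() if key != "products"}
        keys = list(axes)
        for combination in itertools.product(*(axes[key] for key in keys)):
            combo = dict(zip(keys, combination))
            if nested:
                for child in expand_products(nested):
                    expanded.append({**combo, **child})
            else:
                expanded.append(combo)
    return expanded


# Mirrors the products block of the recipe shown in this section.
products = [
    {
        "test_case": ["my_test"],
        "products": [
            {
                "environment": ["dev", "lts"],
                "scope": ["mr-github"],
                "platforms": ["dgx_h100"],
            }
        ],
    }
]

for workload in expand_products(products):
    # Fill the name template from the recipe's spec section.
    print("{test_case}_{environment}_{platforms}".format(**workload))
```

With the block above, one test case and two environments yield two workloads, one for `dev` and one for `lts`.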

Disabling a Test Without Deleting It

To temporarily disable a test case in a recipe YAML, suffix its `scope` value with `-broken`; do not delete the entry:

# before (test runs in CI)
scope: [mr-github]

# after (test is skipped; entry preserved for easy re-enable)
scope: [mr-github-broken]
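
One way to see why the suffix disables a test is the sketch below. It assumes, without confirmation from the source, that CI selects workloads by exact match on the scope string, so a renamed scope simply stops matching. All names here (`BROKEN_SUFFIX`, `disable`, `enable`, `select`) are hypothetical.

```python
from typing import Dict, List

BROKEN_SUFFIX = "-broken"


def disable(scope: str) -> str:
    """Append the suffix unless the scope is already disabled."""
    return scope if scope.endswith(BROKEN_SUFFIX) else scope + BROKEN_SUFFIX


def enable(scope: str) -> str:
    """Strip the suffix to restore the original scope."""
    if scope.endswith(BROKEN_SUFFIX):
        return scope[: -len(BROKEN_SUFFIX)]
    return scope


def select(workloads: List[Dict[str, str]], scope: str) -> List[Dict[str, str]]:
    """Assumed selection rule: keep workloads whose scope matches exactly."""
    return [w for w in workloads if w["scope"] == scope]


workloads = [
    {"test_case": "healthy_test", "scope": "mr-github"},
    {"test_case": "flaky_test", "scope": disable("mr-github")},
]

# Only the healthy test is picked up; the disabled entry is still listed
# in the recipe and can be found by searching for the suffix.
print([w["test_case"] for w in select(workloads, "mr-github")])
```

Because the entry stays in the file, re-enabling is a one-token edit and the disabled tests remain searchable.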

Running Unit Tests Locally

All unit tests initialize a `torch.distributed` group, so every invocation requires GPU access and must go through `torch.distributed.run`:

# Full suite
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests

# Single file
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests/models/test_gpt_model.py

# Single test
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor

# Filter by name substring
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests -k optimizer

Marker filters

# Exclude flaky tests during development
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests -m "not flaky and not flaky_in_dev"

# Include experimental tests
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
    tests/unit_tests --experimental
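
When these invocations are scripted, it can help to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of the repository; it only composes the flags shown in the commands above and prints the result rather than running it.

```python
import shlex
from typing import List, Optional


def build_unit_test_cmd(
    target: str = "tests/unit_tests",
    nproc_per_node: int = 8,
    keyword: Optional[str] = None,
    marker_expr: Optional[str] = None,
) -> List[str]:
    """Compose the torch.distributed.run + pytest command as an argv list."""
    cmd = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        "--nproc-per-node", str(nproc_per_node),
        "-m", "pytest", "-q", target,
    ]
    if keyword:
        cmd += ["-k", keyword]  # filter by name substring
    if marker_expr:
        cmd += ["-m", marker_expr]  # filter by marker expression
    return cmd


# Print a shell-quoted command line for copy/paste or logging.
print(shlex.join(build_unit_test_cmd(marker_expr="not flaky and not flaky_in_dev")))
```

Returning a list rather than a string keeps the marker expression as a single argument, so it can be passed to `subprocess.run` without extra quoting.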

CI parity

Use `tests/unit_tests/run_ci_test.sh` to reproduce a CI bucket failure exactly. For ad-hoc runs, prefer the direct `torch.distributed.run` invocations above.

Gotchas

- `pyproject.toml` sets `addopts = --durations=15 -s -rA`. Stdout is not captured (`-s`), so ranks interleave during multi-rank runs. Override with `--capture=fd` when debugging a specific rank.
- `tests/unit_tests/conftest.py` looks for test data under `/opt/data` and attempts a download if missing. Supply it manually or skip data-dependent tests when running outside the canonical container.

Adding a Unit Test

- Create `tests/unit_tests/<category>/test_<name>.py`.
- Use fixtures from `tests/unit_tests/conftest.py`.
- Apply markers as needed:
  - `@pytest.mark.internal`: skipped on the `legacy` tag
  - `@pytest.mark.flaky_in_dev`: skipped in the `dev` environment (CI default; use this to disable a flaky test without blocking the standard pipeline)
  - `@pytest.mark.flaky`: skipped in the `lts` environment
  - `@pytest.mark.experimental`: runs on the `latest` tag only
- Verify locally (see Running Unit Tests Locally above).
- If the test needs a dedicated CI bucket, add an entry to `tests/test_utils/recipes/h100/unit-tests.yaml`.
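
As a rough illustration of the marker step, a new test file might look like the sketch below. The test names and bodies are placeholders invented for this example; a real test in this suite would build on the fixtures in `tests/unit_tests/conftest.py` and exercise distributed code, which is omitted here so the snippet stays self-contained.

```python
import pytest


@pytest.mark.flaky_in_dev
def test_placeholder_skipped_in_dev():
    # Placeholder body; per the marker list above, this marker causes the
    # test to be skipped in the dev environment.
    assert 1 + 1 == 2


@pytest.mark.experimental
def test_placeholder_experimental_only():
    # Placeholder body; per the marker list above, this marker limits the
    # test to the latest tag.
    assert sorted([3, 1, 2]) == [1, 2, 3]
```

The markers only attach metadata to the test functions; which ones lead to a skip is decided by the project's pytest configuration, not by the decorator itself.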

Adding a Functional / Integration Test

- Create `tests/functional_tests/test_cases/<model>/<test_name>/`.
- Write `model_config.yaml` with `MODEL_ARGS`, `ENV_VARS`, and `TEST_TYPE`.
- Add a YAML recipe under `tests/test_utils/recipes/h100/` (and `gb200/` if needed). Required fields: `scope`, `environment`, `platform`, `n_repeat`, `time_limit`.
- Push the PR and add the label "Run functional tests" to trigger a full run.
- After a successful run, download golden values:

python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <run-id>

- Commit the downloaded golden values.

Common Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| Test passes locally but fails in CI | Different environment or data path | Check `DATA_PATH`, `DATA_CACHE_PATH`, and the `environment` tag (`dev` vs `lts`) |
| Golden value mismatch after a code change | Numerical regression | Download new golden values via `download_golden_values.py` after a clean run |
| `cicd-integration-tests-gb200` not triggered | GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the `Run functional tests` label |
