Testing

Directory Layout

```
tests/
  unit_tests/          # fast, isolated, no GPU required
  functional_tests/
    launch_scripts/
      h100/
        active/        # H100 tests that run in CI automatically
        flaky/         # H100 tests quarantined from blocking CI
      gb200/
        active/        # GB200 tests that run in CI automatically
        flaky/         # GB200 tests quarantined from blocking CI
```

Unit tests are independent of the launch script layout. Functional test scripts are named `{Tier}_{Description}.sh` (e.g., `L0_Launch_training.sh`).
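
As a rough illustration of the naming scheme, the sketch below splits a script name into its tier and description. The regular expression and the `parse_script_name` helper are assumptions made for this example and are not part of the repository.

```python
import re
from typing import NamedTuple

# Assumed pattern: the tier is "L" followed by digits, then "_", then the description.
SCRIPT_NAME = re.compile(r"^(?P<tier>L\d+)_(?P<description>.+)\.sh$")


class ScriptName(NamedTuple):
    tier: str
    description: str


def parse_script_name(filename: str) -> ScriptName:
    """Split '{Tier}_{Description}.sh' into its two parts (hypothetical helper)."""
    match = SCRIPT_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow {{Tier}}_{{Description}}.sh")
    return ScriptName(match["tier"], match["description"])


print(parse_script_name("L0_Launch_training.sh"))
# ScriptName(tier='L0', description='Launch_training')
```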

Tier Semantics

| Tier | Trigger | Blocking |
|------|---------|----------|
| L0 | Every PR, every push to `main`, schedule | Yes; a PR cannot merge if L0 fails |
| L1 | Push to `main`, schedule, PRs with `needs-more-tests` label | Yes |
| L2 | Schedule, `workflow_dispatch`, PRs with `full-test-suite` label | Yes (when triggered) |
| flaky | `workflow_dispatch` with `test_suite=all` only | No; failures are informational |

H100 and GB200 each have independent L0/L1/L2/flaky jobs. Moving a script to `flaky/` removes it from blocking CI on that hardware target only.

Prefer unit tests over functional tests. CI GPU resources are limited; every functional test slot has a real cost.
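
The trigger table can be read as a small decision function. The sketch below encodes only the rows of that table; the `tiers_to_run` function, its parameters, and the event strings are hypothetical names chosen for illustration, and the real logic lives in the CI workflow.

```python
def tiers_to_run(event: str, labels: frozenset = frozenset(),
                 test_suite: str = "") -> set:
    """Return the tiers selected for a CI event, following the trigger table.

    `event` is one of "pull_request", "push_main", "schedule", or
    "workflow_dispatch" (names assumed for this sketch).
    """
    tiers = set()
    # L0: every PR, every push to main, schedule.
    if event in {"pull_request", "push_main", "schedule"}:
        tiers.add("L0")
    # L1: push to main, schedule, or a PR carrying the needs-more-tests label.
    if event in {"push_main", "schedule"} or (
        event == "pull_request" and "needs-more-tests" in labels
    ):
        tiers.add("L1")
    # L2: schedule, workflow_dispatch, or a PR carrying the full-test-suite label.
    if event in {"schedule", "workflow_dispatch"} or (
        event == "pull_request" and "full-test-suite" in labels
    ):
        tiers.add("L2")
    # flaky: only on workflow_dispatch with test_suite=all.
    if event == "workflow_dispatch" and test_suite == "all":
        tiers.add("flaky")
    return tiers


print(sorted(tiers_to_run("pull_request")))
print(sorted(tiers_to_run("pull_request", frozenset({"needs-more-tests"}))))
```

The table does not say whether a manual dispatch also runs L0 or L1, so the sketch leaves them out for that event rather than guessing.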

Running Tests Locally

Unit Tests

No GPU required:

```bash
uv run pytest tests/unit_tests/ -x -v
```

Or inside Docker:

```bash
docker run --rm --gpus all -v $(pwd):/workdir/ -w /workdir/ megatron-bridge \
  uv run pytest tests/unit_tests/
```

Functional Tests

Run the corresponding launch script directly on a GPU node:

```bash
bash tests/functional_tests/launch_scripts/h100/active/L0_Launch_training.sh
```

Adding a Unit Test

1. Place the file under `tests/unit_tests/<domain>/test_<name>.py`.
2. Mark it with `@pytest.mark.unit`.
3. Keep configs tiny: small hidden dims, 1-2 layers, short sequences.
4. Run locally: `uv run python -m pytest tests/unit_tests/<your_test>.py`
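
A minimal sketch of what such a test file might look like. `TinyConfig` and `count_layer_params` are hypothetical stand-ins for whatever the test actually exercises; only the `unit` marker and the small dimensions come from the guidance above.

```python
from dataclasses import dataclass

import pytest


@dataclass
class TinyConfig:
    """Hypothetical config, kept deliberately small so the test stays fast."""
    hidden_size: int = 16
    num_layers: int = 2
    seq_length: int = 8


def count_layer_params(config: TinyConfig) -> int:
    """Hypothetical function under test: weights plus biases of square linear layers."""
    per_layer = config.hidden_size * config.hidden_size + config.hidden_size
    return config.num_layers * per_layer


@pytest.mark.unit
def test_count_layer_params_tiny_config():
    config = TinyConfig()
    # 2 layers * (16 * 16 + 16) = 544
    assert count_layer_params(config) == 544
```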

No foreign `setattr` on config dataclasses. When applying overrides via `setattr(config_obj, key, value)`, always guard first:

```python
if not hasattr(config_obj, key):
    raise ValueError(f"Config has no field '{key}'")
setattr(config_obj, key, value)
```

Setting a non-existent attribute silently creates a phantom field: the test passes but the recipe fails for a real user.
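
To make the failure mode concrete, the sketch below wraps the guard in a helper and contrasts it with an unguarded `setattr`. `RecipeConfig` and `apply_overrides` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class RecipeConfig:
    """Hypothetical config dataclass."""
    learning_rate: float = 1e-3
    num_layers: int = 2


def apply_overrides(config_obj, overrides: dict) -> None:
    """Apply overrides, rejecting keys the config does not define."""
    for key, value in overrides.items():
        if not hasattr(config_obj, key):
            raise ValueError(f"Config has no field '{key}'")
        setattr(config_obj, key, value)


config = RecipeConfig()

# Unguarded: the misspelled key is accepted and the real field keeps its default.
setattr(config, "learning_rat", 5e-4)
print(config.learning_rate)  # still 0.001

# Guarded: the same misspelling is reported immediately.
try:
    apply_overrides(RecipeConfig(), {"learning_rat": 5e-4})
except ValueError as err:
    print(err)
```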

Adding a Functional Test

1. Create the script under `tests/functional_tests/launch_scripts/{h100,gb200}/active/`.
2. Start the file with a timeout header: `# CI_TIMEOUT=<minutes>`
3. Name it `{Tier}_{CamelDescription}.sh`. The tier prefix controls which CI matrix includes it.
4. Make it executable: `chmod +x <file>`.
5. Functional tests must use at most 2 GPUs.

No workflow file changes are needed: the matrix is generated dynamically by scanning the directory.
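
The checklist above could be verified before pushing with a small script. This is a sketch under stated assumptions: the `check_launch_script` helper is hypothetical, the tier prefix is assumed to be `L0`, `L1`, or `L2`, and the header is assumed to be a `# CI_TIMEOUT=<minutes>` comment within the first few lines of the file. It does not check the GPU limit.

```python
import os
import re
from pathlib import Path

NAME_RE = re.compile(r"^L[0-2]_.+\.sh$")             # assumed tier prefixes
TIMEOUT_RE = re.compile(r"^#\s*CI_TIMEOUT=\d+\s*$")  # assumed header format


def check_launch_script(path: Path, header_lines: int = 3) -> list:
    """Return a list of problems found in a launch script (empty if none)."""
    problems = []
    if not NAME_RE.match(path.name):
        problems.append("name does not match {Tier}_{CamelDescription}.sh")
    # Only the first few lines are inspected for the timeout header.
    head = path.read_text().splitlines()[:header_lines]
    if not any(TIMEOUT_RE.match(line) for line in head):
        problems.append("missing '# CI_TIMEOUT=<minutes>' header")
    if not os.access(path, os.X_OK):
        problems.append("file is not executable (chmod +x)")
    return problems
```

Calling `check_launch_script(Path("tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh"))` on a new script would then return an empty list when all three checks pass.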

Moving a Test to Flaky

```bash
# H100
git mv tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh \
       tests/functional_tests/launch_scripts/h100/flaky/L0_Foo.sh

# GB200 (if the test also exists there)
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Foo.sh \
       tests/functional_tests/launch_scripts/gb200/flaky/L0_Foo.sh
```

Flaky tests still run on manual dispatches (`test_suite=all`) so failures remain visible. Move back to `active/` once the underlying issue is fixed.

Removing a Test

Delete the script file and commit. No other changes required.

Pytest Conventions

- Use pytest fixtures for common setup.
- Available markers: `unit`, `integration`, `system`, `acceptance`, `docs`, `skipduringci`, `pleasefixme`.
- Functional tests are capped at 2 GPUs. Set `CUDA_VISIBLE_DEVICES` explicitly for multi-GPU tests.
- Use `uv run python -m pytest`, never bare `pytest`.
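
One way to combine these conventions is a shared fixture that pins the visible GPUs. The sketch below is an example under assumptions: `visible_devices` and `two_gpus` are hypothetical names, and the device indices are placeholders.

```python
import os

import pytest

MAX_GPUS = 2  # functional tests are capped at 2 GPUs


def visible_devices(requested: int) -> str:
    """Build a CUDA_VISIBLE_DEVICES value for `requested` GPUs (hypothetical helper)."""
    if not 1 <= requested <= MAX_GPUS:
        raise ValueError(f"tests may use 1 to {MAX_GPUS} GPUs, got {requested}")
    return ",".join(str(index) for index in range(requested))


@pytest.fixture
def two_gpus(monkeypatch):
    """Expose exactly two GPUs to the test; monkeypatch restores the environment afterwards."""
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", visible_devices(2))


def test_sees_two_devices(two_gpus):
    assert os.environ["CUDA_VISIBLE_DEVICES"] == "0,1"
```

Note that `CUDA_VISIBLE_DEVICES` only takes effect if it is set before CUDA is initialized in the process, so a fixture like this suits tests that launch their GPU work afterwards or in a subprocess.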

CI Job Reference

| GitHub Actions job | Hardware | Directory scanned |
|--------------------|----------|-------------------|
| `cicd-functional-tests-l0` | H100 | `h100/active/L0_*.sh` |
| `cicd-functional-tests-l1` | H100 | `h100/active/L1_*.sh` |
| `cicd-functional-tests-l2` | H100 | `h100/active/L2_*.sh` |
| `cicd-functional-tests-flaky` | H100 | `h100/flaky/L*.sh` |
| `cicd-functional-tests-gb200-l0` | GB200 | `gb200/active/L0_*.sh` |
| `cicd-functional-tests-gb200-l1` | GB200 | `gb200/active/L1_*.sh` |
| `cicd-functional-tests-gb200-l2` | GB200 | `gb200/active/L2_*.sh` |
| `cicd-functional-tests-gb200-flaky` | GB200 | `gb200/flaky/L*.sh` |

Hardware runners: H100 uses `nemo-ci-{azure,aws}-gpu-x2`; GB200 uses `nemo-ci-gcp-gpu-x2`.
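
The job table amounts to one glob pattern per job. The sketch below shows how such a directory scan might work; it is not the workflow's actual implementation, and `JOB_PATTERNS` and `build_matrix` are hypothetical names.

```python
from pathlib import Path

# Job name -> glob pattern relative to the launch scripts root (taken from the job table).
JOB_PATTERNS = {
    "cicd-functional-tests-l0": "h100/active/L0_*.sh",
    "cicd-functional-tests-l1": "h100/active/L1_*.sh",
    "cicd-functional-tests-l2": "h100/active/L2_*.sh",
    "cicd-functional-tests-flaky": "h100/flaky/L*.sh",
    "cicd-functional-tests-gb200-l0": "gb200/active/L0_*.sh",
    "cicd-functional-tests-gb200-l1": "gb200/active/L1_*.sh",
    "cicd-functional-tests-gb200-l2": "gb200/active/L2_*.sh",
    "cicd-functional-tests-gb200-flaky": "gb200/flaky/L*.sh",
}


def build_matrix(root: Path) -> dict:
    """Map each job to the sorted script names its pattern matches under `root`."""
    return {
        job: sorted(path.name for path in root.glob(pattern))
        for job, pattern in JOB_PATTERNS.items()
    }
```

Running `build_matrix(Path("tests/functional_tests/launch_scripts"))` from the repository root would list which scripts each job picks up, which is a quick way to confirm that a renamed or moved script lands in the intended job.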

Code Anchors

| Component | Path |
|-----------|------|
| Matrix generation (H100) | @.github/workflows/cicd-main.yml job `generate-test-matrix` |
| Matrix generation (GB200) | @.github/workflows/cicd-main.yml job `generate-gb200-test-matrix` |
| Test runner action | @.github/actions/test-template/action.yml |
| Launch scripts root | `tests/functional_tests/launch_scripts/` |
