Testing

Directory Layout

tests/
  unit_tests/          # fast, isolated, no GPU required
  functional_tests/
    launch_scripts/
      h100/
        active/        # H100 tests that run in CI automatically
        flaky/         # H100 tests quarantined from blocking CI
      gb200/
        active/        # GB200 tests that run in CI automatically
        flaky/         # GB200 tests quarantined from blocking CI
Unit tests are independent of the launch script layout. Functional test scripts are named `{Tier}_{Description}.sh` (e.g., `L0_Launch_training.sh`).
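The naming convention can be checked mechanically. The sketch below is illustrative only: the regular expression and the `parse_script_name` helper are assumptions, not code from the repository, and the pattern assumes tiers are limited to `L0`, `L1`, and `L2`.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: a tier (L0, L1, or L2), an underscore, a free-form
# description, and the .sh suffix. The real CI may accept other names.
SCRIPT_NAME = re.compile(r"^(?P<tier>L[0-2])_(?P<description>[A-Za-z0-9_]+)\.sh$")


class ScriptName(NamedTuple):
    tier: str
    description: str


def parse_script_name(filename: str) -> Optional[ScriptName]:
    """Split a launch script filename into tier and description, or return None."""
    match = SCRIPT_NAME.match(filename)
    if match is None:
        return None
    return ScriptName(match["tier"], match["description"])


print(parse_script_name("L0_Launch_training.sh"))
# ScriptName(tier='L0', description='Launch_training')
print(parse_script_name("launch_training.sh"))
# None
```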

Tier Semantics

| Tier | Trigger | Blocking |
|---|---|---|
| L0 | Every PR, every push to `main`, schedule | Yes: a PR cannot merge if L0 fails |
| L1 | Push to `main`, schedule, PRs with the `needs-more-tests` label | Yes |
| L2 | Schedule, `workflow_dispatch`, PRs with the `full-test-suite` label | Yes (when triggered) |
| flaky | `workflow_dispatch` with `test_suite=all` only | No: failures are informational |

H100 and GB200 each have independent L0/L1/L2/flaky jobs. Moving a script to `flaky/` removes it from blocking CI on that hardware target only.

Prefer unit tests over functional tests. CI GPU resources are limited; every functional test slot has a real cost.
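To make the trigger rules easier to reason about, here is a small sketch that encodes the tier table literally. It is not the workflow's actual logic: `tiers_for_event` is a hypothetical helper, the event names are standard GitHub Actions event names, and `"push"` is assumed to mean a push to `main`. Combinations the table does not mention (for example, whether a manual dispatch also runs L0) are left out rather than guessed.

```python
from typing import FrozenSet, Set


def tiers_for_event(
    event: str,
    labels: FrozenSet[str] = frozenset(),
    test_suite: str = "",
) -> Set[str]:
    """Return the tiers that the tier table lists for a given trigger."""
    tiers: Set[str] = set()
    # L0: every PR, every push to main, schedule.
    if event in {"pull_request", "push", "schedule"}:
        tiers.add("L0")
    # L1: push to main, schedule, or a PR labeled needs-more-tests.
    if event in {"push", "schedule"} or (
        event == "pull_request" and "needs-more-tests" in labels
    ):
        tiers.add("L1")
    # L2: schedule, manual dispatch, or a PR labeled full-test-suite.
    if event in {"schedule", "workflow_dispatch"} or (
        event == "pull_request" and "full-test-suite" in labels
    ):
        tiers.add("L2")
    # flaky: manual dispatch with test_suite=all only.
    if event == "workflow_dispatch" and test_suite == "all":
        tiers.add("flaky")
    return tiers


print(sorted(tiers_for_event("pull_request")))
# ['L0']
print(sorted(tiers_for_event("pull_request", frozenset({"needs-more-tests"}))))
# ['L0', 'L1']
```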

Running Tests Locally

Unit Tests

No GPU required:
bash
uv run pytest tests/unit_tests/ -x -v
Or inside Docker:
bash
docker run --rm --gpus all -v $(pwd):/workdir/ -w /workdir/ megatron-bridge \
  uv run pytest tests/unit_tests/

Functional Tests

Run the corresponding launch script directly on a GPU node:
bash
bash tests/functional_tests/launch_scripts/h100/active/L0_Launch_training.sh

Adding a Unit Test

  1. Place the file under `tests/unit_tests/<domain>/test_<name>.py`.
  2. Mark it with `@pytest.mark.unit`.
  3. Keep configs tiny: small hidden dims, 1-2 layers, short sequences.
  4. Run locally: `uv run python -m pytest tests/unit_tests/<your_test>.py`
No foreign `setattr` on config dataclasses. When applying overrides via `setattr(config_obj, key, value)`, always guard first:
python
if not hasattr(config_obj, key):
    raise ValueError(f"Config has no field '{key}'")
setattr(config_obj, key, value)
Setting a non-existent attribute silently creates a phantom field: the test passes but the recipe fails for a real user.
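As an illustration of the guard in a test, the sketch below wraps it in a helper and exercises both the accepted and the rejected case. `TinyConfig` and `apply_overrides` are hypothetical names invented for this example; a real test would import the project's own config dataclass and would be decorated with `@pytest.mark.unit`. Plain `assert` statements are used so the snippet runs without pytest installed.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class TinyConfig:
    """Hypothetical stand-in for a real config dataclass, kept tiny on purpose."""

    hidden_size: int = 64
    num_layers: int = 2
    seq_length: int = 128


def apply_overrides(config_obj: Any, overrides: Dict[str, Any]) -> None:
    """Apply overrides, refusing keys that are not existing attributes."""
    for key, value in overrides.items():
        if not hasattr(config_obj, key):
            raise ValueError(f"Config has no field '{key}'")
        setattr(config_obj, key, value)


# In a real test file these functions would carry @pytest.mark.unit.
def test_known_field_is_overridden() -> None:
    config = TinyConfig()
    apply_overrides(config, {"num_layers": 1})
    assert config.num_layers == 1


def test_unknown_field_is_rejected() -> None:
    config = TinyConfig()
    try:
        apply_overrides(config, {"num_layer": 1})  # typo: missing "s"
    except ValueError as err:
        assert "num_layer" in str(err)
    else:
        raise AssertionError("expected ValueError for an unknown field")
    # The guard ran before setattr, so no phantom field was created.
    assert not hasattr(config, "num_layer")
```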

Adding a Functional Test

  1. Create the script under `tests/functional_tests/launch_scripts/{h100,gb200}/active/`.
  2. Start the file with a timeout header:
    bash
    # CI_TIMEOUT=<minutes>
  3. Name it `{Tier}_{CamelDescription}.sh`; the tier prefix controls which CI matrix includes it.
  4. Make it executable: `chmod +x <file>`.
  5. Functional tests must use at most 2 GPUs.
No workflow file changes are needed: the matrix is generated dynamically by scanning the directory.
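Before pushing a new launch script, the checklist above can be verified locally. The following is a sketch under stated assumptions, not a tool shipped with the repository: `check_launch_script` is a hypothetical helper, the name pattern assumes tiers `L0` to `L2`, and the `CI_TIMEOUT` header is searched for in the first five lines because the exact parsing rules used by CI are not described here.

```python
import re
import stat
from pathlib import Path
from typing import List

# Assumed patterns; the real CI parsing may be stricter or looser.
NAME_PATTERN = re.compile(r"^L[0-2]_\w+\.sh$")
TIMEOUT_PATTERN = re.compile(r"^#\s*CI_TIMEOUT=(\d+)\s*$")


def check_launch_script(path: Path) -> List[str]:
    """Return a list of problems found in a functional test launch script."""
    problems: List[str] = []
    if not NAME_PATTERN.match(path.name):
        problems.append(f"name '{path.name}' does not look like {{Tier}}_{{Description}}.sh")
    # Check the owner-executable bit, which is what chmod +x sets.
    if not path.stat().st_mode & stat.S_IXUSR:
        problems.append("file is not executable (run chmod +x)")
    head = path.read_text().splitlines()[:5]
    if not any(TIMEOUT_PATTERN.match(line) for line in head):
        problems.append("missing '# CI_TIMEOUT=<minutes>' header")
    return problems
```

An empty list means the script passed all three checks; each returned string names one item to fix.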

Moving a Test to Flaky

bash
# H100
git mv tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh \
  tests/functional_tests/launch_scripts/h100/flaky/L0_Foo.sh

# GB200 (if the test also exists there)
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Foo.sh \
  tests/functional_tests/launch_scripts/gb200/flaky/L0_Foo.sh
Flaky tests still run on manual dispatches (`test_suite=all`), so failures remain visible. Move the script back to `active/` once the underlying issue is fixed.

Removing a Test

Delete the script file and commit. No other changes required.

Pytest Conventions

  • Use pytest fixtures for common setup.
  • Available markers: `unit`, `integration`, `system`, `acceptance`, `docs`, `skipduringci`, `pleasefixme`.
  • Functional tests are capped at 2 GPUs. Set `CUDA_VISIBLE_DEVICES` explicitly for multi-GPU tests.
  • Use `uv run python -m pytest`, never bare `pytest`.
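One way to set `CUDA_VISIBLE_DEVICES` explicitly while respecting the 2-GPU cap is to build the environment in a small helper and pass it to the process that launches the test. This is a minimal sketch: `gpu_env` and `MAX_FUNCTIONAL_TEST_GPUS` are hypothetical names, not part of the project.

```python
import os
from typing import Dict, Sequence

MAX_FUNCTIONAL_TEST_GPUS = 2  # cap stated in the conventions above


def gpu_env(device_ids: Sequence[int]) -> Dict[str, str]:
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set explicitly."""
    if not 1 <= len(device_ids) <= MAX_FUNCTIONAL_TEST_GPUS:
        raise ValueError(
            f"functional tests may use 1 to {MAX_FUNCTIONAL_TEST_GPUS} GPUs, "
            f"got {len(device_ids)}"
        )
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


# Example use (not executed here):
#   subprocess.run(["bash", script_path], env=gpu_env([0, 1]), check=True)
print(gpu_env([0, 1])["CUDA_VISIBLE_DEVICES"])
# 0,1
```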

CI Job Reference

| GitHub Actions job | Hardware | Directory scanned |
|---|---|---|
| `cicd-functional-tests-l0` | H100 | `h100/active/L0_*.sh` |
| `cicd-functional-tests-l1` | H100 | `h100/active/L1_*.sh` |
| `cicd-functional-tests-l2` | H100 | `h100/active/L2_*.sh` |
| `cicd-functional-tests-flaky` | H100 | `h100/flaky/L*.sh` |
| `cicd-functional-tests-gb200-l0` | GB200 | `gb200/active/L0_*.sh` |
| `cicd-functional-tests-gb200-l1` | GB200 | `gb200/active/L1_*.sh` |
| `cicd-functional-tests-gb200-l2` | GB200 | `gb200/active/L2_*.sh` |
| `cicd-functional-tests-gb200-flaky` | GB200 | `gb200/flaky/L*.sh` |

Hardware runners: H100 uses `nemo-ci-{azure,aws}-gpu-x2`; GB200 uses `nemo-ci-gcp-gpu-x2`.
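The glob patterns in the job table suggest how a directory scan could produce the test matrix. The sketch below is a guess at the general shape, not the workflow's real implementation: `scan_matrix` and the `script`/`path` keys are hypothetical, and the actual matrix generation lives in the workflow file listed under Code Anchors.

```python
from pathlib import Path
from typing import Dict, List


def scan_matrix(root: Path, hardware: str, tier: str) -> List[Dict[str, str]]:
    """List launch scripts for one hardware target and tier.

    `root` is the launch scripts directory, `hardware` is "h100" or "gb200",
    and `tier` is "L0", "L1", "L2", or "flaky".
    """
    if tier == "flaky":
        pattern = f"{hardware}/flaky/L*.sh"
    else:
        pattern = f"{hardware}/active/{tier}_*.sh"
    # Sort for a stable matrix order across runs.
    return [
        {"script": path.stem, "path": str(path)}
        for path in sorted(root.glob(pattern))
    ]
```

Because the result depends only on which files exist, adding, moving, or deleting a script changes the matrix without touching any workflow file.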

Code Anchors

| Component | Path |
|---|---|
| Matrix generation (H100) | `@.github/workflows/cicd-main.yml` job `generate-test-matrix` |
| Matrix generation (GB200) | `@.github/workflows/cicd-main.yml` job `generate-gb200-test-matrix` |
| Test runner action | `@.github/actions/test-template/action.yml` |
| Launch scripts root | `tests/functional_tests/launch_scripts/` |