Testing
Directory Layout
```
tests/
  unit_tests/                 # fast, isolated, no GPU required
  functional_tests/
    launch_scripts/
      h100/
        active/               # H100 tests that run in CI automatically
        flaky/                # H100 tests quarantined from blocking CI
      gb200/
        active/               # GB200 tests that run in CI automatically
        flaky/                # GB200 tests quarantined from blocking CI
```

Unit tests are independent of the launch script layout. Functional test scripts are named `{Tier}_{Description}.sh` (e.g., `L0_Launch_training.sh`).
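
The naming rule can be made concrete with a small parser. This is only a sketch: the regular expression is an assumed reading of `{Tier}_{Description}.sh` (a tier such as `L0`, an underscore, then a description), and `parse_script_name` is a hypothetical helper, not something shipped in the repository.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: tier prefix (L0, L1, L2, ...), underscore, description, ".sh".
SCRIPT_NAME = re.compile(r"^(?P<tier>L\d+)_(?P<description>[A-Za-z0-9_]+)\.sh$")


class ScriptName(NamedTuple):
    tier: str
    description: str


def parse_script_name(filename: str) -> Optional[ScriptName]:
    """Split a launch script filename into tier and description, or return None."""
    match = SCRIPT_NAME.match(filename)
    if match is None:
        return None
    return ScriptName(match.group("tier"), match.group("description"))


if __name__ == "__main__":
    print(parse_script_name("L0_Launch_training.sh"))
    print(parse_script_name("launch.sh"))
```

A filename that does not parse would presumably not be picked up by any tier, so a check like this is a cheap way to catch a misnamed script before pushing.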
Tier Semantics
| Tier | Trigger | Blocking |
|---|---|---|
| L0 | Every PR, every push to | Yes — PR cannot merge if L0 fails |
| L1 | Push to | Yes |
| L2 | Schedule, | Yes (when triggered) |
| flaky | | No — failures are informational |
H100 and GB200 each have independent L0/L1/L2/flaky jobs. Moving a script to `flaky/` removes it from blocking CI on that hardware target only.

Prefer unit tests over functional tests. CI GPU resources are limited; every functional test slot has a real cost.
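
Going by the table above, whether a failure blocks depends on the `active/` versus `flaky/` directory, and the decision is made per hardware target. The sketch below derives that from a script path; `classify` is a hypothetical helper, and it assumes paths follow the `launch_scripts/<hardware>/<status>/<script>` layout shown earlier.

```python
from pathlib import PurePosixPath
from typing import Dict, Union


def classify(script_path: str) -> Dict[str, Union[str, bool]]:
    """Derive hardware, tier, and blocking status from a launch script path."""
    parts = PurePosixPath(script_path).parts
    if "launch_scripts" not in parts:
        raise ValueError(f"not a launch script path: {script_path}")
    idx = parts.index("launch_scripts")
    # Expect exactly <hardware>/<status>/<script> after launch_scripts/.
    if len(parts) != idx + 4:
        raise ValueError(f"unexpected layout: {script_path}")
    hardware, status, filename = parts[idx + 1], parts[idx + 2], parts[idx + 3]
    if status not in ("active", "flaky"):
        raise ValueError(f"unknown status directory: {status}")
    tier = filename.split("_", 1)[0]
    # active/ scripts block CI; flaky/ scripts are informational only.
    return {"hardware": hardware, "tier": tier, "blocking": status == "active"}


if __name__ == "__main__":
    root = "tests/functional_tests/launch_scripts"
    print(classify(f"{root}/h100/active/L0_Foo.sh"))
    print(classify(f"{root}/gb200/flaky/L0_Foo.sh"))
```

The same script name can therefore be blocking on H100 and informational on GB200 at the same time.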
Running Tests Locally
Unit Tests

No GPU required:

```bash
uv run pytest tests/unit_tests/ -x -v
```

Or inside Docker:

```bash
docker run --rm --gpus all -v $(pwd):/workdir/ -w /workdir/ megatron-bridge \
  uv run pytest tests/unit_tests/
```

Functional Tests
Run the corresponding launch script directly on a GPU node:

```bash
bash tests/functional_tests/launch_scripts/h100/active/L0_Launch_training.sh
```

Adding a Unit Test
- Place the file under `tests/unit_tests/<domain>/test_<name>.py`.
- Mark it: `@pytest.mark.unit`.
- Keep configs tiny: small hidden dims, 1-2 layers, short sequences.
- Run locally: `uv run python -m pytest tests/unit_tests/<your_test>.py`

No foreign `setattr` on config dataclasses. When applying overrides via `setattr(config_obj, key, value)`, always guard first:

```python
if not hasattr(config_obj, key):
    raise ValueError(f"Config has no field '{key}'")
setattr(config_obj, key, value)
```

Setting a non-existent attribute silently creates a phantom field: the test passes, but the recipe fails for a real user.
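
To see why the guard matters, the sketch below wraps it in a helper and contrasts it with an unguarded `setattr`. `TinyConfig` and `apply_overrides` are hypothetical names made up for illustration, not project classes. In the repository each test would also carry `@pytest.mark.unit`; the marker is left out here so the snippet runs without pytest installed.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class TinyConfig:
    """Hypothetical stand-in for a recipe config, kept deliberately small."""
    hidden_size: int = 64
    num_layers: int = 2
    seq_length: int = 128


def apply_overrides(config_obj: Any, overrides: Dict[str, Any]) -> Any:
    """Apply overrides, rejecting any key the config does not already define."""
    for key, value in overrides.items():
        if not hasattr(config_obj, key):
            raise ValueError(f"Config has no field '{key}'")
        setattr(config_obj, key, value)
    return config_obj


def test_override_known_field():
    config = apply_overrides(TinyConfig(), {"num_layers": 1})
    assert config.num_layers == 1


def test_typo_is_rejected():
    try:
        apply_overrides(TinyConfig(), {"num_layer": 1})  # typo: missing "s"
    except ValueError as err:
        assert "num_layer" in str(err)
    else:
        raise AssertionError("expected ValueError for an unknown field")


def test_unguarded_setattr_creates_phantom_field():
    config = TinyConfig()
    setattr(config, "num_layer", 1)  # no guard: silently accepted
    assert config.num_layers == 2    # the real field is unchanged
    assert config.num_layer == 1     # a phantom attribute now exists


if __name__ == "__main__":
    test_override_known_field()
    test_typo_is_rejected()
    test_unguarded_setattr_creates_phantom_field()
```

The last test shows the failure mode: the override appears to succeed, yet the field the model actually reads keeps its default.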
Adding a Functional Test
- Create the script under `tests/functional_tests/launch_scripts/{h100,gb200}/active/`.
- Start the file with a timeout header:

  ```bash
  # CI_TIMEOUT=<minutes>
  ```
- Name it `{Tier}_{CamelDescription}.sh`; the tier prefix controls which CI matrix includes it.
- Make it executable: `chmod +x <file>`.
- Functional tests must use at most 2 GPUs.

No workflow file changes are needed: the matrix is generated dynamically by scanning the directory.
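
The checklist above lends itself to a quick local lint before opening a PR. This is a sketch under assumptions: `check_launch_script` is a hypothetical helper, the name regex is one reading of `{Tier}_{CamelDescription}.sh`, and the header is accepted on the first line or directly after a shebang. The 2-GPU cap is not checked, since it depends on what the script launches.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Assumed reading of the naming rule: L<digits>, underscore, word characters, ".sh".
NAME_RE = re.compile(r"^L\d+_\w+\.sh$")
TIMEOUT_RE = re.compile(r"^#\s*CI_TIMEOUT=\d+\s*$")


def check_launch_script(path: Path) -> List[str]:
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    if not NAME_RE.match(path.name):
        problems.append("name does not match {Tier}_{CamelDescription}.sh")
    lines = path.read_text(encoding="utf-8").splitlines()
    # Skip an optional shebang so the header may follow it (assumption).
    header = next((line for line in lines if not line.startswith("#!")), "")
    if not TIMEOUT_RE.match(header):
        problems.append("missing '# CI_TIMEOUT=<minutes>' header")
    if not path.stat().st_mode & 0o111:
        problems.append("file is not executable (run chmod +x)")
    return problems


def demo_check() -> List[List[str]]:
    """Check one well-formed and one malformed script in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "L0_Launch_training.sh"
        good.write_text("# CI_TIMEOUT=30\necho run\n", encoding="utf-8")
        good.chmod(0o755)
        bad = Path(tmp) / "launch.sh"
        bad.write_text("echo run\n", encoding="utf-8")
        bad.chmod(0o644)
        return [check_launch_script(good), check_launch_script(bad)]


if __name__ == "__main__":
    for result in demo_check():
        print(result or "ok")
```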
Moving a Test to Flaky
```bash
# H100
git mv tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh \
       tests/functional_tests/launch_scripts/h100/flaky/L0_Foo.sh

# GB200 (if the test also exists there)
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Foo.sh \
       tests/functional_tests/launch_scripts/gb200/flaky/L0_Foo.sh
```

Flaky tests still run on manual dispatches (`test_suite=all`), so failures remain visible. Move the script back to `active/` once the underlying issue is fixed.

Removing a Test
Delete the script file and commit. No other changes required.
Pytest Conventions
- Use pytest fixtures for common setup.
- Available markers: `unit`, `integration`, `system`, `acceptance`, `docs`, `skipduringci`, `pleasefixme`.
- Functional tests are capped at 2 GPUs. Set `CUDA_VISIBLE_DEVICES` explicitly for multi-GPU tests.
- Use `uv run python -m pytest`, never bare `pytest`.
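
One way to make the GPU cap explicit is to build the environment for a multi-GPU test in a single place. The helper below is a sketch: `gpu_env` and `MAX_FUNCTIONAL_GPUS` are hypothetical names, and the only facts taken from this document are the 2-GPU limit and the `CUDA_VISIBLE_DEVICES` variable.

```python
import os
from typing import Dict, Optional, Sequence

MAX_FUNCTIONAL_GPUS = 2  # cap stated in this document


def gpu_env(device_ids: Sequence[int],
            base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set explicitly."""
    if not device_ids:
        raise ValueError("at least one device id is required")
    if len(device_ids) > MAX_FUNCTIONAL_GPUS:
        raise ValueError(
            f"functional tests may use at most {MAX_FUNCTIONAL_GPUS} GPUs, "
            f"got {len(device_ids)}"
        )
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


if __name__ == "__main__":
    print(gpu_env([0, 1], base_env={})["CUDA_VISIBLE_DEVICES"])
```

The returned mapping can be passed as `env=` to `subprocess.run`, or the same logic can live in a shared pytest fixture that calls `monkeypatch.setenv`, in line with the fixture convention above.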
CI Job Reference
| GitHub Actions job | Hardware | Directory scanned |
|---|---|---|
| | H100 | |
| | H100 | |
| | H100 | |
| | H100 | |
| | GB200 | |
| | GB200 | |
| | GB200 | |
| | GB200 | |
Hardware runners: H100 uses `nemo-ci-{azure,aws}-gpu-x2`; GB200 uses `nemo-ci-gcp-gpu-x2`.
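
Each job scans one directory, so the dynamic matrix can be pictured as a directory listing turned into a GitHub Actions `include` list. The sketch below shows only the idea: `build_matrix`, its parameters, and the shape of each matrix entry are assumptions, and the real logic lives in the workflow file and may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def build_matrix(launch_root: Path, hardware: str, status: str,
                 tier: Optional[str] = None) -> Dict[str, List[Dict[str, str]]]:
    """List launch scripts for one hardware/status directory as a matrix."""
    directory = launch_root / hardware / status
    # With a tier, keep only scripts carrying that prefix; otherwise take all.
    pattern = f"{tier}_*.sh" if tier else "*.sh"
    scripts = sorted(directory.glob(pattern)) if directory.is_dir() else []
    return {"include": [{"script": p.stem, "hardware": hardware} for p in scripts]}


def demo_matrix() -> str:
    """Build an L0 matrix from a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as tmp:
        active = Path(tmp) / "h100" / "active"
        active.mkdir(parents=True)
        for name in ("L0_Launch_training.sh", "L1_Foo.sh"):
            (active / name).touch()
        return json.dumps(build_matrix(Path(tmp), "h100", "active", tier="L0"))


if __name__ == "__main__":
    print(demo_matrix())
```

Because the matrix comes from the filesystem, adding, moving, or deleting a script changes CI without touching any workflow file.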
Code Anchors
| Component | Path |
|---|---|
| Matrix generation (H100) | @.github/workflows/cicd-main.yml job |
| Matrix generation (GB200) | @.github/workflows/cicd-main.yml job |
| Test runner action | @.github/actions/test-template/action.yml |
| Launch scripts root | |