onboard-gb200-1node-tests

Onboard GB200 1-Node GitHub MR Tests

Create 1-node (`mr-github`) variants of existing 2-node (`mr`-scoped) GB200 functional tests. Each GB200 node has 4 GPUs. A 2-node test uses 8 GPUs total; the 1-node variant uses 4.

Background

GB200 functional tests live in `tests/test_utils/recipes/gb200/`:

| Recipe file | Notes |
| --- | --- |
| `gpt.yaml` | GPT dense tests, `nodes: 2, gpus: 4` (8 total) |
| `moe.yaml` | MoE tests, `nodes: 2, gpus: 4` (8 total) |
| `moe-1node.yaml` | Existing 1-node MoE tests, `nodes: 1, gpus: 4` (4 total) |
| `gpt-1node.yaml` | 1-node GPT tests (create if not present) |

Model configs live at:

tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml

1-node test cases use the `_1node` suffix:

tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml

Workflow

Step 1 — Find candidate tests

Scan the `products:` block in `gpt.yaml` and `moe.yaml` for entries with `scope: [mr, ...]` or `scope: [mr-slim, ...]`. These are the 2-node tests that need 1-node `mr-github` counterparts.

Ignore tests already covered in `*-1node.yaml` files, and ignore the `nightly`, `weekly`, and `mr-broken` scopes.
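The scan can be sketched in Python. This is a minimal illustration, not part of the repository tooling: it assumes each recipe has already been loaded into a dict (for example with a YAML parser) and that its `products:` block has the shape shown in Step 6. The function name `find_candidates` and the test-case names are hypothetical.

```python
from __future__ import annotations

# Scopes that mark a 2-node test as needing a 1-node counterpart.
WANTED_SCOPES = {"mr", "mr-slim"}


def find_candidates(recipe: dict, covered_1node: set[str]) -> list[str]:
    """Return test cases with an mr / mr-slim scope and no _1node variant yet.

    `recipe` is the parsed content of gpt.yaml or moe.yaml.
    `covered_1node` holds the test-case names already listed in *-1node.yaml.
    """
    candidates: list[str] = []
    for entry in recipe.get("products", []):
        scopes = {
            scope
            for product in entry.get("products", [])
            for scope in product.get("scope", [])
        }
        # Entries scoped only to nightly, weekly, or mr-broken are skipped here.
        if not scopes & WANTED_SCOPES:
            continue
        for test_case in entry.get("test_case", []):
            already_done = f"{test_case}_1node" in covered_1node
            if not already_done and test_case not in candidates:
                candidates.append(test_case)
    return candidates


# Hypothetical recipe content, for illustration only.
example_recipe = {
    "products": [
        {"test_case": ["gpt_a"], "products": [{"scope": ["mr", "nightly"]}]},
        {"test_case": ["gpt_b"], "products": [{"scope": ["nightly"]}]},
        {"test_case": ["gpt_c"], "products": [{"scope": ["mr-slim"]}]},
    ]
}
print(find_candidates(example_recipe, covered_1node={"gpt_c_1node"}))
```

Here `gpt_b` is dropped because it has no `mr` or `mr-slim` scope, and `gpt_c` is dropped because a `_1node` variant already exists.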

Step 2 — Read each model config

For each candidate, read its `model_config.yaml` and extract the key parallelism arguments:

--tensor-model-parallel-size   (TP)
--pipeline-model-parallel-size (PP)
--expert-model-parallel-size   (EP)
--expert-tensor-parallel-size  (ETP)
--context-parallel-size        (CP)
--global-batch-size
--micro-batch-size
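As a rough sketch of the extraction, the helper below pulls those seven flags out of a flag-to-value mapping. It assumes the arguments have already been loaded into a plain dict; where exactly they sit inside `model_config.yaml` is not stated here, so the loading step is left out. The name `extract_parallelism` and the sample values are hypothetical. Flags that are absent are reported as `None` rather than guessed.

```python
from __future__ import annotations

# Long flag name -> short label used in the rest of this guide.
FLAG_LABELS = {
    "--tensor-model-parallel-size": "tp",
    "--pipeline-model-parallel-size": "pp",
    "--expert-model-parallel-size": "ep",
    "--expert-tensor-parallel-size": "etp",
    "--context-parallel-size": "cp",
    "--global-batch-size": "gbs",
    "--micro-batch-size": "mbs",
}


def extract_parallelism(model_args: dict) -> dict[str, int | None]:
    """Collect the key parallelism and batch-size flags from parsed arguments."""
    summary: dict[str, int | None] = {}
    for flag, label in FLAG_LABELS.items():
        value = model_args.get(flag)
        summary[label] = int(value) if value is not None else None
    return summary


# Hypothetical arguments, for illustration only.
sample_args = {
    "--tensor-model-parallel-size": 4,
    "--pipeline-model-parallel-size": 2,
    "--global-batch-size": 32,
    "--micro-batch-size": 1,
    "--seq-length": 1024,  # unrelated flags are ignored
}
print(extract_parallelism(sample_args))
```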

Step 3 — Classify: trivial copy vs. needs adaptation

The world size formula is `world_size = TP × PP × DP`, where `DP ≥ EP`.

Going from 8 GPUs → 4 GPUs:

| Condition | Action |
| --- | --- |
| `TP × PP ≤ 4` | Trivial copy. Config unchanged; DP is halved automatically. |
| `TP × PP = 8` (e.g. tp4 pp2) | Reduce PP. Set `PP = PP / 2` (e.g. pp2→1). Verify `TP × PP_new ≤ 4`. |
| `EP > 4` (e.g. ep8 with tp1 pp1) | Reduce EP. Set `EP = 4`. Experts stay at `num-experts` (each EP rank holds more experts). |
| `EP > 4` and `TP × PP > 4` | Reduce both PP and EP as above. |
| ETP test (`EP × ETP ≤ TP × DP`) | Check `EP × ETP ≤ TP × DP_new` after PP reduction. Usually satisfied when pp→1. |

Do not change GBS; let gradient accumulation absorb the reduced DP.
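The classification rules can be expressed as a small Python helper. This is a sketch under stated assumptions, not the project's tooling: `Plan`, `adapt_to_one_node`, and `accumulation_steps` are hypothetical names, context parallelism is taken to be 1, and the accumulation count is computed as `GBS / (MBS × DP)`. The batch sizes in the example are made up.

```python
from dataclasses import dataclass, field

GPUS_PER_NODE = 4  # each GB200 node has 4 GPUs


@dataclass
class Plan:
    tp: int
    pp: int
    ep: int
    dp: int
    changes: list = field(default_factory=list)  # empty means trivial copy


def adapt_to_one_node(tp: int, pp: int, ep: int = 1, gpus: int = GPUS_PER_NODE) -> Plan:
    """Apply the table above: halve PP if TP x PP > gpus, cap EP at gpus."""
    changes = []
    if tp * pp > gpus:
        if pp % 2 != 0 or tp * (pp // 2) > gpus:
            raise ValueError(f"tp{tp} pp{pp} cannot fit on {gpus} GPUs by halving PP")
        changes.append(f"pp {pp}->{pp // 2}")
        pp //= 2
    if ep > gpus:
        changes.append(f"ep {ep}->{gpus}")
        ep = gpus
    if gpus % (tp * pp) != 0:
        raise ValueError(f"tp{tp} pp{pp} does not divide {gpus} GPUs evenly")
    return Plan(tp=tp, pp=pp, ep=ep, dp=gpus // (tp * pp), changes=changes)


def accumulation_steps(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches accumulated per step when GBS is left unchanged."""
    steps, remainder = divmod(gbs, mbs * dp)
    if remainder:
        raise ValueError("GBS must be divisible by MBS x DP")
    return steps


print(adapt_to_one_node(tp=1, pp=1, ep=8))   # EP is capped at 4
print(adapt_to_one_node(tp=4, pp=2, ep=2))   # PP is halved
print(accumulation_steps(gbs=32, mbs=1, dp=8), accumulation_steps(gbs=32, mbs=1, dp=4))
```

With GBS fixed at 32 and MBS at 1, halving DP from 8 to 4 doubles the accumulation steps from 4 to 8, which is how the unchanged GBS is absorbed. For ETP tests, the `EP × ETP ≤ TP × DP_new` condition still needs a separate check against the returned `Plan`.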

Step 4 — Create `_1node` model config directories

```bash
# Trivial copy
mkdir -p tests/functional_tests/test_cases/{model}/{test_case}_1node
cp tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml \
   tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml

# Then apply any parallelism changes (EP or PP) with the Edit tool
```

Step 5 — Create or update recipe files

For GPT tests: create `tests/test_utils/recipes/gb200/gpt-1node.yaml` (if absent) by cloning the spec block of `gpt.yaml` with `nodes: 1`. Use this template for the spec:

```yaml
type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt          # or moe
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 4
  n_repeat: 5
  platforms: dgx_gb200
  script_setup: |    # copy verbatim from gpt.yaml / moe.yaml
    ...
  script: |-         # copy verbatim from gpt.yaml / moe.yaml
    ...
```

For MoE tests: append entries to the existing `moe-1node.yaml`.

Step 6 — Add products entries

Scope convention:

- 1–2 most representative tests per recipe: `scope: [mr-github, mr-github-slim]`
- All other tests: `scope: [mr-github]`

```yaml
products:
  - test_case: [<test_case>_1node]
    products:
      - environment: [dev]
        scope: [mr-github, mr-github-slim]   # or [mr-github]
        platforms: [dgx_gb200]
```
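The scope convention can be applied mechanically. The sketch below builds the `products` entries as plain Python structures that mirror the YAML above; `build_products` and the test-case names are hypothetical, and serializing the result back to YAML is left to whatever tooling is in use.

```python
from __future__ import annotations


def build_products(test_cases: list[str], representative: set[str]) -> list[dict]:
    """Build 1-node products entries following the scope convention.

    `representative` names the 1-2 tests per recipe that also join the slim suite.
    """
    if not 1 <= len(representative) <= 2:
        raise ValueError("pick 1-2 representative tests per recipe")
    entries = []
    for test_case in test_cases:
        scope = ["mr-github"]
        if test_case in representative:
            scope.append("mr-github-slim")
        entries.append(
            {
                "test_case": [f"{test_case}_1node"],
                "products": [
                    {"environment": ["dev"], "scope": scope, "platforms": ["dgx_gb200"]}
                ],
            }
        )
    return entries


# Hypothetical test-case names, for illustration only.
for entry in build_products(["gpt_a", "gpt_b", "gpt_c"], representative={"gpt_a"}):
    print(entry["test_case"], entry["products"][0]["scope"])
```

Rejecting more than two representatives keeps the `mr-github-slim` suite small, in line with the checklist.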

Quick parallelism reference

| Original (8 GPUs) | 1-node config (4 GPUs) | Notes |
| --- | --- | --- |
| tp1 pp1 ep1 → dp8 | tp1 pp1 ep1 → dp4 | trivial |
| tp2 pp1 ep1 → dp4 | tp2 pp1 ep1 → dp2 | trivial |
| tp1 pp2 ep1 → dp4 | tp1 pp2 ep1 → dp2 | trivial |
| tp4 pp1 ep1 → dp2 | tp4 pp1 ep1 → dp1 | trivial |
| tp1 pp4 ep1 → dp2 | tp1 pp4 ep1 → dp1 | trivial |
| tp1 pp1 ep8 → dp8 | tp1 pp1 ep4 → dp4 | ep 8→4 |
| tp4 pp2 ep2 etp2 → dp1 | tp4 pp1 ep2 etp2 → dp1 | pp 2→1 |

Checklist

- [ ] Identified all `mr`-scoped tests in `gpt.yaml` and `moe.yaml` not yet in `*-1node.yaml`
- [ ] Read model config for each candidate
- [ ] Classified trivial vs. adaptation needed
- [ ] Created `_1node/model_config.yaml` for each test
- [ ] Applied EP or PP reductions where needed
- [ ] Created/updated recipe YAML with `nodes: 1, gpus: 4`
- [ ] Assigned `mr-github` scope (+ `mr-github-slim` for 1–2 representative tests per recipe)
- [ ] Verified no `mr-github-slim` overload (slim suite should stay small)
