onboard-gb200-1node-tests
Onboard GB200 1-Node GitHub MR Tests
Create 1-node (`mr-github`) variants of existing 2-node (`mr`-scoped) GB200 functional tests.
Each GB200 node has 4 GPUs. A 2-node test uses 8 GPUs total; the 1-node variant uses 4.
Background
GB200 functional tests live in `tests/test_utils/recipes/gb200/`:

| Recipe file | Notes |
|---|---|
| `gpt.yaml` | GPT dense tests |
| `moe.yaml` | MoE tests |
| `moe-1node.yaml` | Existing 1-node MoE tests |
| `gpt-1node.yaml` | 1-node GPT tests (create if not present) |

Model configs live at:
`tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml`
1-node test cases use the `_1node` suffix:
`tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml`
Workflow
Step 1 - Find candidate tests
Scan the `products:` block in `gpt.yaml` and `moe.yaml` for entries with `scope: [mr, ...]` or `scope: [mr-slim, ...]`. These are the 2-node tests that need 1-node `mr-github` counterparts.
Ignore tests already covered in `*-1node.yaml` files, and ignore the `nightly`, `weekly`, and `mr-broken` scopes.
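The filter described in Step 1 can be sketched in Python. This is a minimal sketch, assuming the recipe's `products:` block has already been parsed into a list of dicts shaped like the entries shown in Step 6. The names `find_candidates` and `TARGET_SCOPES` are hypothetical helpers, not part of the repository.

```python
# Scopes that mark a 2-node test as needing a 1-node counterpart (from Step 1).
TARGET_SCOPES = {"mr", "mr-slim"}


def find_candidates(products, existing_1node=()):
    """Return test cases scoped mr / mr-slim that have no _1node variant yet.

    `products` is assumed to be the parsed `products:` list of a recipe;
    `existing_1node` holds test case names already present in *-1node.yaml.
    """
    covered = set(existing_1node)
    candidates = []
    for entry in products:
        # Collect every scope mentioned under this entry.
        scopes = set()
        for product in entry.get("products", []):
            scopes.update(product.get("scope", []))
        if not scopes & TARGET_SCOPES:
            continue  # only nightly, weekly, mr-broken, and similar scopes
        for name in entry.get("test_case", []):
            if f"{name}_1node" not in covered and name not in candidates:
                candidates.append(name)
    return candidates
```

Passing the test case names already listed in the `*-1node.yaml` files as `existing_1node` keeps the result limited to tests that still need onboarding.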
Step 2 - Read each model config
For each candidate, read its `model_config.yaml` and extract the key parallelism arguments:
- `--tensor-model-parallel-size` (TP)
- `--pipeline-model-parallel-size` (PP)
- `--expert-model-parallel-size` (EP)
- `--expert-tensor-parallel-size` (ETP)
- `--context-parallel-size` (CP)
- `--global-batch-size`
- `--micro-batch-size`
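The extraction in Step 2 might look like the following. This is a sketch under two assumptions: the arguments from `model_config.yaml` have already been loaded into a flat `{flag: value}` mapping, and an absent TP, PP, EP, or CP flag means a size of 1. `Parallelism` and `extract_parallelism` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Flags listed in Step 2, keyed by the short names used in this document.
FLAGS = {
    "tp": "--tensor-model-parallel-size",
    "pp": "--pipeline-model-parallel-size",
    "ep": "--expert-model-parallel-size",
    "etp": "--expert-tensor-parallel-size",
    "cp": "--context-parallel-size",
    "gbs": "--global-batch-size",
    "mbs": "--micro-batch-size",
}


@dataclass
class Parallelism:
    # Assumption: an absent TP/PP/EP/CP flag means a size of 1.
    tp: int = 1
    pp: int = 1
    ep: int = 1
    cp: int = 1
    # Left as None when the config does not set them.
    etp: Optional[int] = None
    gbs: Optional[int] = None
    mbs: Optional[int] = None


def extract_parallelism(model_args: Mapping[str, object]) -> Parallelism:
    """Pick the Step 2 flags out of a flat {flag: value} mapping."""
    found = {
        field: int(model_args[flag])
        for field, flag in FLAGS.items()
        if flag in model_args
    }
    return Parallelism(**found)
```

Unrelated flags in the mapping are ignored, so the whole argument section of a config can be passed in unchanged.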
Step 3 - Classify: trivial copy vs. needs adaptation
The world size formula is `world_size = TP × PP × DP`, where `DP ≥ EP`.
Going from 8 GPUs → 4 GPUs:

| Condition | Action |
|---|---|
| … | Trivial copy. Config unchanged; DP is halved automatically. |
| … | Reduce PP. Set … |
| … | Reduce EP. Set … |
| … | Reduce both PP and EP as above. |
| ETP test (ep × etp ≤ TP × DP) | Check … after the PP reduction. |

Do not change GBS; let gradient accumulation absorb the reduced DP.
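One way to mechanize this classification is sketched below. It is derived only from the formula `world_size = TP × PP × DP`, the `DP ≥ EP` constraint, and the ETP constraint `ep × etp ≤ TP × DP` stated in this step. The specific reductions it picks (`PP = 4 // TP`, `EP = DP`) are assumptions chosen because they reproduce the rows of the quick parallelism reference, and CP is ignored because the formula above does not include it. `adapt_to_one_node` is a hypothetical helper.

```python
GPUS_PER_NODE = 4  # from the text: each GB200 node has 4 GPUs


def adapt_to_one_node(tp, pp, ep=1, etp=None, gpus=GPUS_PER_NODE):
    """Return (pp, ep, dp, changes) for a single-node run.

    `changes` lists which of "pp" / "ep" had to be reduced; an empty list
    means the config is a trivial copy.
    """
    changes = []
    if tp > gpus or gpus % tp:
        raise ValueError(f"TP={tp} does not fit on {gpus} GPUs")
    if tp * pp > gpus:
        # Assumption: shrink PP until TP x PP exactly fills the node.
        pp = gpus // tp
        changes.append("pp")
    if gpus % (tp * pp):
        raise ValueError(f"TP x PP = {tp * pp} does not divide {gpus} GPUs")
    dp = gpus // (tp * pp)
    if etp is None:
        # Plain EP test: enforce DP >= EP.
        if ep > dp:
            ep = dp  # assumption: use the largest EP that still fits
            changes.append("ep")
    elif ep * etp > tp * dp:
        # ETP test: the constraint from the table no longer holds.
        raise ValueError("ep x etp > TP x DP after reduction; adapt by hand")
    return pp, ep, dp, changes
```

For example, `adapt_to_one_node(1, 1, ep=8)` reports an EP reduction to 4, while `adapt_to_one_node(4, 2, ep=2, etp=2)` reports only a PP reduction to 1.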
Step 4 - Create `_1node` model config directories
```bash
# Trivial copy
mkdir -p tests/functional_tests/test_cases/{model}/{test_case}_1node
cp tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml \
   tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml

# Then apply any parallelism changes (EP or PP) with the Edit tool
```
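If the parallelism change is scripted instead of made by hand, a text-level edit keeps comments and key order in the copied file intact. The sketch below assumes each flag appears on its own line in the form `--flag: value`; `set_flag` is a hypothetical helper, not an existing tool in the repository.

```python
import re


def set_flag(config_text: str, flag: str, value: int) -> str:
    """Replace the value of `flag` in YAML text, leaving other lines untouched."""
    # Match "<indent>--flag: <old value>" at the start of a line.
    pattern = re.compile(
        rf"^([ \t]*{re.escape(flag)}:[ \t]*)\S+", flags=re.MULTILINE
    )
    new_text, count = pattern.subn(rf"\g<1>{value}", config_text)
    if count != 1:
        raise ValueError(f"expected exactly one {flag!r} line, found {count}")
    return new_text
```

Raising when the flag is missing or duplicated avoids silently producing a `_1node` config that still carries the 2-node value.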
Step 5 - Create or update recipe files
For GPT tests: create `tests/test_utils/recipes/gb200/gpt-1node.yaml` (if absent) by cloning `gpt.yaml`'s spec block with `nodes: 1`. Use this template for the spec:
```yaml
type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt # or moe
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 4
  n_repeat: 5
  platforms: dgx_gb200
  script_setup: | # copy verbatim from gpt.yaml / moe.yaml
    ...
  script: |- # copy verbatim from gpt.yaml / moe.yaml
    ...
```
For MoE tests: append entries to the existing `moe-1node.yaml`.
Step 6 - Add products entries
Scope convention:
- 1–2 most representative tests per recipe: `scope: [mr-github, mr-github-slim]`
- All other tests: `scope: [mr-github]`
```yaml
products:
  - test_case: [<test_case>_1node]
    products:
      - environment: [dev]
        scope: [mr-github, mr-github-slim] # or [mr-github]
        platforms: [dgx_gb200]
```
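The scope convention can be expressed as a small builder plus a count of slim entries. This is a sketch, assuming entries are assembled as plain dicts mirroring the YAML in Step 6 before being written out; `make_product_entry`, `count_slim`, and `MAX_SLIM_PER_RECIPE` are hypothetical names.

```python
MAX_SLIM_PER_RECIPE = 2  # from the scope convention: 1-2 representative tests


def make_product_entry(test_case: str, slim: bool = False) -> dict:
    """Build one products entry for the 1-node variant of `test_case`."""
    scope = ["mr-github", "mr-github-slim"] if slim else ["mr-github"]
    return {
        "test_case": [f"{test_case}_1node"],
        "products": [
            {"environment": ["dev"], "scope": scope, "platforms": ["dgx_gb200"]}
        ],
    }


def count_slim(entries) -> int:
    """Count entries that carry the mr-github-slim scope."""
    return sum(
        1
        for entry in entries
        if any("mr-github-slim" in p["scope"] for p in entry["products"])
    )
```

Comparing `count_slim(entries)` against `MAX_SLIM_PER_RECIPE` before writing a recipe gives an early warning when too many tests have been marked as representative.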
Quick parallelism reference
| Original (8 GPUs) | 1-node config (4 GPUs) | Notes |
|---|---|---|
| tp1 pp1 ep1 → dp8 | tp1 pp1 ep1 → dp4 | trivial |
| tp2 pp1 ep1 → dp4 | tp2 pp1 ep1 → dp2 | trivial |
| tp1 pp2 ep1 → dp4 | tp1 pp2 ep1 → dp2 | trivial |
| tp4 pp1 ep1 → dp2 | tp4 pp1 ep1 → dp1 | trivial |
| tp1 pp4 ep1 → dp2 | tp1 pp4 ep1 → dp1 | trivial |
| tp1 pp1 ep8 → dp8 | tp1 pp1 ep4 → dp4 | ep 8→4 |
| tp4 pp2 ep2 etp2 → dp1 | tp4 pp1 ep2 etp2 → dp1 | pp 2→1 |
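The DP column follows from `world_size = TP × PP × DP`, and the Step 3 rule to keep GBS fixed means the number of gradient accumulation steps grows as DP shrinks. The sketch below works through that arithmetic, assuming the usual relation GBS = micro batch size × DP × accumulation steps; the function names and the batch sizes in the usage note are illustrative only.

```python
def data_parallel_size(gpus: int, tp: int, pp: int) -> int:
    """DP implied by world_size = TP x PP x DP."""
    if gpus % (tp * pp):
        raise ValueError(f"TP x PP = {tp * pp} does not divide {gpus} GPUs")
    return gpus // (tp * pp)


def accumulation_steps(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches accumulated per rank so that GBS stays unchanged."""
    if gbs % (mbs * dp):
        raise ValueError("GBS must be a multiple of micro batch size x DP")
    return gbs // (mbs * dp)
```

For the `tp2 pp1` row, `data_parallel_size(8, 2, 1)` is 4 and `data_parallel_size(4, 2, 1)` is 2. With a hypothetical GBS of 32 and micro batch size of 1, the accumulation steps go from 8 to 16 while GBS itself stays the same.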
Checklist
- Identified all `mr`-scoped tests in `gpt.yaml` and `moe.yaml` not yet in `*-1node.yaml`
- Read model config for each candidate
- Classified trivial vs. adaptation needed
- Created `_1node/model_config.yaml` for each test
- Applied EP or PP reductions where needed
- Created/updated recipe YAML with `nodes: 1, gpus: 4`
- Assigned `mr-github` scope (+ `mr-github-slim` for 1–2 representative tests per recipe)
- Verified no `mr-github-slim` overload (slim suite should stay small)