onboard-gb200-1node-tests

Onboard GB200 1-Node GitHub MR Tests

Create 1-node (`mr-github`) variants of existing 2-node (`mr`-scoped) GB200 functional tests. Each GB200 node has 4 GPUs. A 2-node test uses 8 GPUs total; the 1-node variant uses 4.

Background

GB200 functional tests live in `tests/test_utils/recipes/gb200/`:

| Recipe file | Notes |
|---|---|
| `gpt.yaml` | GPT dense tests, `nodes: 2, gpus: 4` (8 total) |
| `moe.yaml` | MoE tests, `nodes: 2, gpus: 4` (8 total) |
| `moe-1node.yaml` | Existing 1-node MoE tests, `nodes: 1, gpus: 4` (4 total) |
| `gpt-1node.yaml` | 1-node GPT tests (create if not present) |

Model configs live at:
`tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml`

1-node test cases use the `_1node` suffix:
`tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml`

Workflow

Step 1 — Find candidate tests

Scan the `products:` block in `gpt.yaml` and `moe.yaml` for entries with `scope: [mr, ...]` or `scope: [mr-slim, ...]`. These are the 2-node tests that need 1-node `mr-github` counterparts.
Ignore tests already covered in `*-1node.yaml` files, and ignore the `nightly`, `weekly`, and `mr-broken` scopes.
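
The scan above can be sketched in Python. This is a minimal illustration, not part of the repository: it assumes the recipe has already been parsed into a dict (for example with a YAML loader) and that the 2-node recipes use the same `products:` layout shown in Step 6. The function name `find_candidates` and the `covered` argument are hypothetical.

```python
WANTED_SCOPES = {"mr", "mr-slim"}


def find_candidates(recipe, covered=frozenset()):
    """Return test cases scoped mr / mr-slim that have no 1-node variant yet.

    recipe  -- parsed recipe dict (assumed layout: products -> test_case, products -> scope)
    covered -- names already present in the *-1node.yaml recipes
    """
    found = []
    for entry in recipe.get("products", []):
        scopes = set()
        for product in entry.get("products", []):
            scopes.update(product.get("scope", []))
        # Entries that only carry nightly, weekly or mr-broken never match here.
        if not scopes & WANTED_SCOPES:
            continue
        for name in entry.get("test_case", []):
            if f"{name}_1node" not in covered:
                found.append(name)
    return found


# Hypothetical recipe content, for illustration only.
example_recipe = {
    "products": [
        {"test_case": ["test_a"], "products": [{"scope": ["mr", "nightly"]}]},
        {"test_case": ["test_b"], "products": [{"scope": ["nightly"]}]},
        {"test_case": ["test_c"], "products": [{"scope": ["mr-slim"]}]},
    ]
}
```

With `covered={"test_c_1node"}`, only `test_a` remains as a candidate, since `test_b` is nightly-only and `test_c` already has a 1-node variant.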
扫描
gpt.yaml
moe.yaml
products:
块内,带有
scope: [mr, ...]
scope: [mr-slim, ...]
的条目。这些是需要对应单节点
mr-github
版本的双节点测试。
忽略已在
*-1node.yaml
文件中覆盖的测试,以及带有
nightly
weekly
mr-broken
范围的测试。

Step 2 — Read each model config

For each candidate, read its `model_config.yaml` and extract the key parallelism arguments:
--tensor-model-parallel-size   (TP)
--pipeline-model-parallel-size (PP)
--expert-model-parallel-size   (EP)
--expert-tensor-parallel-size  (ETP)
--context-parallel-size        (CP)
--global-batch-size
--micro-batch-size
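
As a rough sketch, the extraction can be expressed as a lookup over the flags listed above. It assumes the arguments are available as a flag-to-value mapping (how that mapping is nested inside `model_config.yaml` is not specified here); `extract_parallelism` and the short keys are hypothetical names. Flags that are absent are reported as `None` rather than guessed.

```python
FLAGS = {
    "tp": "--tensor-model-parallel-size",
    "pp": "--pipeline-model-parallel-size",
    "ep": "--expert-model-parallel-size",
    "etp": "--expert-tensor-parallel-size",
    "cp": "--context-parallel-size",
    "gbs": "--global-batch-size",
    "mbs": "--micro-batch-size",
}


def extract_parallelism(model_args):
    """Pick the parallelism and batch-size flags out of a flag -> value mapping."""
    summary = {}
    for short_name, flag in FLAGS.items():
        value = model_args.get(flag)
        # Values may be stored as ints or strings; normalise to int when present.
        summary[short_name] = int(value) if value is not None else None
    return summary


# Hypothetical argument mapping, for illustration only.
example_args = {
    "--tensor-model-parallel-size": 4,
    "--pipeline-model-parallel-size": "2",
    "--global-batch-size": 32,
}
summary = extract_parallelism(example_args)
```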

Step 3 — Classify: trivial copy vs. needs adaptation

The world size formula is `world_size = TP × PP × DP`, where `DP ≥ EP` (this assumes CP = 1; a larger CP multiplies into the product as well).

Going from 8 GPUs → 4 GPUs:

| Condition | Action |
|---|---|
| `TP × PP ≤ 4` and `EP ≤ 4` | Trivial copy. Config unchanged; DP is halved automatically. |
| `TP × PP = 8` (e.g. tp4 pp2) | Reduce PP. Set `PP = PP / 2` (e.g. pp2→1). Verify `TP × PP_new ≤ 4`. |
| `EP > 4` (e.g. ep8 with tp1 pp1) | Reduce EP. Set `EP = 4`. The total expert count (`num-experts`) is unchanged, so each EP rank holds more experts. |
| `EP > 4` and `TP × PP > 4` | Reduce both PP and EP as above. |
| ETP test (`EP × ETP ≤ TP × DP`) | Check `EP × ETP ≤ TP × DP_new` after PP reduction. Usually satisfied when pp→1. |

Do not change GBS (`--global-batch-size`); let gradient accumulation absorb the reduced DP.
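
The decision table can be written down as a small function. This is a sketch under the table's own assumptions (4 GPUs per node, CP = 1, a single halving of PP); `plan_one_node` is a hypothetical helper, and it raises for combinations the table does not cover instead of guessing an adaptation.

```python
def plan_one_node(tp, pp, ep=1, etp=None, gpus=4):
    """Apply the 8 -> 4 GPU rules and return the adapted parallelism."""
    new_pp, new_ep, changes = pp, ep, []

    # TP x PP must fit on one node; the table only halves PP.
    if tp * pp > gpus:
        if pp % 2:
            raise ValueError("PP cannot be halved")
        new_pp = pp // 2
        changes.append(f"pp {pp}->{new_pp}")
    if tp * new_pp > gpus or gpus % (tp * new_pp):
        raise ValueError("TP x PP does not fit the node after adaptation")

    # EP cannot exceed the number of GPUs on the node.
    if ep > gpus:
        new_ep = gpus
        changes.append(f"ep {ep}->{new_ep}")

    dp = gpus // (tp * new_pp)
    if etp is not None:
        if new_ep * etp > tp * dp:
            raise ValueError("EP x ETP exceeds TP x DP")
    elif new_ep > dp:
        raise ValueError("EP exceeds DP; not covered by the table")

    return {
        "tp": tp,
        "pp": new_pp,
        "ep": new_ep,
        "dp": dp,
        "action": ", ".join(changes) or "trivial",
    }
```

For example, `plan_one_node(1, 1, ep=8)` reduces EP to 4 with DP 4, and `plan_one_node(4, 2, ep=2, etp=2)` reduces PP to 1 with DP 1.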

Step 4 — Create `_1node` model config directories

bash
# Trivial copy
mkdir -p tests/functional_tests/test_cases/{model}/{test_case}_1node
cp tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml \
   tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml

# Then apply any parallelism changes (EP or PP) with the Edit tool
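
The copy-then-edit step could also be scripted. The sketch below is one possible approach, not the workflow's prescribed tool: it assumes each flag appears exactly once in `model_config.yaml` as a `flag: value` line, and `make_one_node_config` is a hypothetical helper name.

```python
import re
import shutil
from pathlib import Path


def make_one_node_config(test_case_dir, overrides=None):
    """Copy model_config.yaml into a sibling *_1node directory and patch flags.

    overrides -- mapping such as {"--pipeline-model-parallel-size": 1}
    """
    src = Path(test_case_dir)
    dst = src.with_name(src.name + "_1node")
    dst.mkdir(parents=True, exist_ok=True)
    target = dst / "model_config.yaml"
    shutil.copyfile(src / "model_config.yaml", target)

    text = target.read_text()
    for flag, value in (overrides or {}).items():
        # Replace only the value on the "flag: value" line, keeping indentation.
        pattern = rf"^(\s*{re.escape(flag)}:\s*)\S+"
        text, count = re.subn(pattern, rf"\g<1>{value}", text, flags=re.MULTILINE)
        if count != 1:
            raise ValueError(f"expected exactly one '{flag}' line, found {count}")
    target.write_text(text)
    return target
```

Calling it with no overrides reproduces the trivial copy; the original config is never modified.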

Step 5 — Create or update recipe files

For GPT tests: create `tests/test_utils/recipes/gb200/gpt-1node.yaml` (if absent) by cloning the spec block of `gpt.yaml` with `nodes: 1`. Use this template for the spec:
yaml
type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt          # or moe
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 4
  n_repeat: 5
  platforms: dgx_gb200
  script_setup: |    # copy verbatim from gpt.yaml / moe.yaml
    ...
  script: |-         # copy verbatim from gpt.yaml / moe.yaml
    ...
For MoE tests: append entries to the existing `moe-1node.yaml`.
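
If the recipe is handled programmatically, cloning the spec amounts to a deep copy with `nodes` overridden. The sketch below assumes the recipe is already parsed into a dict with the top-level keys shown in the template; `one_node_recipe` is a hypothetical helper, and it starts the clone with an empty `products` list so entries can be added separately.

```python
import copy


def one_node_recipe(recipe):
    """Return a copy of a parsed 2-node recipe with nodes set to 1."""
    clone = copy.deepcopy(recipe)
    if clone["spec"].get("gpus") != 4:
        raise ValueError("expected gpus: 4 per GB200 node")
    clone["spec"]["nodes"] = 1
    # script_setup and script are carried over verbatim by the deep copy.
    clone["products"] = []
    return clone


# Hypothetical parsed recipe, trimmed for illustration.
two_node = {
    "type": "basic",
    "spec": {"model": "gpt", "nodes": 2, "gpus": 4, "platforms": "dgx_gb200"},
    "products": [{"test_case": ["test_a"]}],
}
single = one_node_recipe(two_node)
```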

Step 6 — Add products entries

Scope convention:
  • 1–2 most representative tests per recipe:
    scope: [mr-github, mr-github-slim]
  • All other tests:
    scope: [mr-github]
yaml
products:
  - test_case: [<test_case>_1node]
    products:
      - environment: [dev]
        scope: [mr-github, mr-github-slim]   # or [mr-github]
        platforms: [dgx_gb200]
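
The scope convention can be applied mechanically when there are many entries. A minimal sketch, assuming the entry layout shown above; `build_products` and its `slim` argument are hypothetical names, and the cap of two slim tests mirrors the "1–2 most representative tests" rule.

```python
def build_products(test_cases, slim=()):
    """Build products entries; names listed in slim also get mr-github-slim."""
    slim = set(slim)
    if len(slim) > 2:
        raise ValueError("keep mr-github-slim to 1-2 tests per recipe")
    unknown = slim - set(test_cases)
    if unknown:
        raise ValueError(f"slim tests not in recipe: {sorted(unknown)}")

    entries = []
    for name in test_cases:
        scope = ["mr-github"]
        if name in slim:
            scope.append("mr-github-slim")
        entries.append(
            {
                "test_case": [f"{name}_1node"],
                "products": [
                    {"environment": ["dev"], "scope": scope, "platforms": ["dgx_gb200"]}
                ],
            }
        )
    return entries


# Hypothetical test-case names, for illustration only.
entries = build_products(["test_a", "test_b"], slim=["test_a"])
```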

Quick parallelism reference

| Original (8 GPUs) | 1-node config (4 GPUs) | Notes |
|---|---|---|
| tp1 pp1 ep1 → dp8 | tp1 pp1 ep1 → dp4 | trivial |
| tp2 pp1 ep1 → dp4 | tp2 pp1 ep1 → dp2 | trivial |
| tp1 pp2 ep1 → dp4 | tp1 pp2 ep1 → dp2 | trivial |
| tp4 pp1 ep1 → dp2 | tp4 pp1 ep1 → dp1 | trivial |
| tp1 pp4 ep1 → dp2 | tp1 pp4 ep1 → dp1 | trivial |
| tp1 pp1 ep8 → dp8 | tp1 pp1 ep4 → dp4 | ep 8→4 |
| tp4 pp2 ep2 etp2 → dp1 | tp4 pp1 ep2 etp2 → dp1 | pp 2→1 |
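
The dp values in this table follow from `world_size = TP × PP × DP`, and keeping GBS fixed means the number of gradient accumulation steps grows as DP shrinks. The sketch below works through that arithmetic, assuming CP = 1 and the usual relation steps = GBS / (MBS × DP); the batch sizes in the example are hypothetical.

```python
def data_parallel_size(gpus, tp, pp):
    """DP implied by world_size = TP x PP x DP (CP assumed to be 1)."""
    model_parallel = tp * pp
    if gpus % model_parallel:
        raise ValueError("TP x PP must divide the GPU count")
    return gpus // model_parallel


def accumulation_steps(gbs, mbs, dp):
    """Micro-batches each DP rank processes per global batch."""
    per_step = mbs * dp
    if gbs % per_step:
        raise ValueError("GBS must be divisible by MBS x DP")
    return gbs // per_step


# tp2 pp1 with hypothetical GBS 32 and MBS 1:
dp_two_nodes = data_parallel_size(8, tp=2, pp=1)           # 4
dp_one_node = data_parallel_size(4, tp=2, pp=1)            # 2
steps_two_nodes = accumulation_steps(32, 1, dp_two_nodes)  # 8
steps_one_node = accumulation_steps(32, 1, dp_one_node)    # 16
```

Halving DP doubles the accumulation steps, which is why GBS can stay unchanged in the 1-node variant.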

Checklist

  • Identified all `mr`-scoped tests in `gpt.yaml` and `moe.yaml` not yet in `*-1node.yaml`
  • Read model config for each candidate
  • Classified trivial vs. adaptation needed
  • Created `_1node/model_config.yaml` for each test
  • Applied EP or PP reductions where needed
  • Created/updated recipe YAML with `nodes: 1, gpus: 4`
  • Assigned `mr-github` scope (+ `mr-github-slim` for 1–2 representative tests per recipe)
  • Verified no `mr-github-slim` overload (slim suite should stay small)