nemo-mbridge-perf-hierarchical-context-parallel
Hierarchical Context Parallel Skill
This skill covers hierarchical context parallelism: nested context-parallel process groups used by `cp_comm_type="a2a+p2p"` and configured with `hierarchical_context_parallel_sizes`.

For what hierarchical CP is, when to use it, and the decision tree (`a2a+p2p` vs pure `a2a` vs `p2p`), see:
- @docs/training/hierarchical-context-parallel.md
- @skills/nemo-mbridge-perf-hierarchical-context-parallel/card.yaml
Enablement
Minimal Bridge override:
python
cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
cfg.dist.use_decentralized_pg = False

Required constraints:
- `prod(hierarchical_context_parallel_sizes) == context_parallel_size`
- `seq_length % (2 * context_parallel_size) == 0`
- Transformer Engine `>= 1.12.0`
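The two arithmetic constraints above can be checked before launching a job. The following is a minimal sketch, not part of Bridge or MCore: the helper name `check_hcp_constraints` and its return format are hypothetical, and the Transformer Engine version requirement is left out because it depends on the installed environment.

```python
from math import prod
from typing import Sequence


def check_hcp_constraints(
    context_parallel_size: int,
    hierarchical_context_parallel_sizes: Sequence[int],
    seq_length: int,
) -> list[str]:
    """Return human-readable constraint violations (empty list if none).

    Hypothetical pre-flight helper; it mirrors the documented constraints
    and does not call into Megatron-Core.
    """
    problems = []
    # Constraint 1: the per-level sizes must multiply to the flat CP size.
    if prod(hierarchical_context_parallel_sizes) != context_parallel_size:
        problems.append(
            f"prod({list(hierarchical_context_parallel_sizes)}) "
            f"!= context_parallel_size ({context_parallel_size})"
        )
    # Constraint 2: the sequence must split evenly into 2 * CP chunks.
    if seq_length % (2 * context_parallel_size) != 0:
        problems.append(
            f"seq_length ({seq_length}) is not divisible by "
            f"2 * context_parallel_size ({2 * context_parallel_size})"
        )
    return problems
```

For the override shown above (`context_parallel_size = 4`, sizes `[2, 2]`), any sequence length that is a multiple of 8 passes both checks.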
Code Anchors
Upstream config and validation:
45
context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""
hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify
the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
groups of two levels, so the first value of the list indicates the group size of the a2a
communication type, and the second value indicates the group size of the p2p communication
type.
"""
428
if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
        "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"

Bridge MPU path:
613
parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()

Bridge decentralized-PG path:
503
pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)
Implementation Map
Config definition
`hierarchical_context_parallel_sizes` is declared in `ModelParallelConfig`:
3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
hierarchical_context_parallel_sizes: Optional[list[int]] = None
For a2a+p2p, first value = a2a group size, second value = p2p group size.
Product must equal context_parallel_size.

`cp_comm_type` is declared in `TransformerConfig`:
3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
cp_comm_type: Optional[Union[str, List[str]]] = None
Can be per-layer (List[str]) or uniform (str).
Values: "p2p", "all_gather", "a2a", "a2a+p2p"
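Because `cp_comm_type` can be either a single string or a per-layer list, tooling that inspects it usually has to normalize both forms. The sketch below shows one way to do that. The function `per_layer_cp_comm_types` is hypothetical, and the rule that a list must have exactly `num_layers` entries is an assumption made for this example, not something quoted from `TransformerConfig`.

```python
from typing import List, Optional, Union

# Values listed in this document for cp_comm_type.
ALLOWED_CP_COMM_TYPES = {"p2p", "all_gather", "a2a", "a2a+p2p"}


def per_layer_cp_comm_types(
    cp_comm_type: Optional[Union[str, List[str]]],
    num_layers: int,
) -> List[Optional[str]]:
    """Expand a uniform or per-layer cp_comm_type into one entry per layer."""
    if cp_comm_type is None or isinstance(cp_comm_type, str):
        # Uniform form: every layer uses the same communication type.
        per_layer = [cp_comm_type] * num_layers
    else:
        per_layer = list(cp_comm_type)
        # Assumption: a per-layer list needs one entry per layer.
        if len(per_layer) != num_layers:
            raise ValueError(
                f"expected {num_layers} entries, got {len(per_layer)}"
            )
    for value in per_layer:
        if value is not None and value not in ALLOWED_CP_COMM_TYPES:
            raise ValueError(f"unknown cp_comm_type: {value!r}")
    return per_layer
```

A caller can then test `"a2a+p2p" in per_layer_cp_comm_types(...)` to decide whether hierarchical sizes are needed at all.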
Validation (MCore)
`TransformerConfig.__post_init__` validates the `a2a+p2p` configuration.

Process group creation
`create_hierarchical_groups` creates the hierarchical CP subgroups when HCP sizes are provided via `parallel_state.initialize_model_parallel`. Bridge currently obtains these groups through the MPU-based `ProcessGroupCollection`.

TE integration
When `a2a+p2p` is used, `TEDotProductAttention` passes the hierarchical groups to Transformer Engine. Requires Transformer Engine >= 1.12.0.
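To make the two-level grouping from the process group creation step concrete, the sketch below splits one flat CP group into an a2a level and a p2p level. It is illustrative only: the function `hierarchical_cp_groups` is hypothetical, and the layout (contiguous ranks form a2a groups, strided ranks form p2p groups) is an assumption made for this example rather than a transcription of `create_hierarchical_groups`.

```python
from typing import List, Sequence, Tuple


def hierarchical_cp_groups(
    cp_ranks: Sequence[int],
    sizes: Tuple[int, int],
) -> Tuple[List[List[int]], List[List[int]]]:
    """Split one flat CP group into (a2a_groups, p2p_groups).

    sizes follows the documented order: (a2a group size, p2p group size).
    """
    ranks = list(cp_ranks)
    a2a_size, p2p_size = sizes
    if a2a_size * p2p_size != len(ranks):
        raise ValueError("product of sizes must equal the CP group size")
    # Assumed layout: each a2a group is a contiguous block of ranks.
    a2a_groups = [
        ranks[start:start + a2a_size]
        for start in range(0, len(ranks), a2a_size)
    ]
    # Assumed layout: each p2p group takes the same slot from every block.
    p2p_groups = [ranks[offset::a2a_size] for offset in range(a2a_size)]
    return a2a_groups, p2p_groups


if __name__ == "__main__":
    a2a, p2p = hierarchical_cp_groups([0, 1, 2, 3], (2, 2))
    print("a2a groups:", a2a)  # [[0, 1], [2, 3]]
    print("p2p groups:", p2p)  # [[0, 2], [1, 3]]
```

Under this assumed layout every rank belongs to exactly one group per level, which is why the product of the sizes has to equal `context_parallel_size`.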
Pitfalls
- Bridge HCP is MPU-only today: If `use_decentralized_pg=True`, Bridge initializes flat CP groups and leaves HCP unset.
- No checked-in Bridge recipe currently exercises HCP directly.
- Single-GPU load helpers clear `hierarchical_context_parallel_sizes`.
- Silent broken training on old stacks: If you use `a2a+p2p` without setting `hierarchical_context_parallel_sizes`, MCore now asserts. Older versions would silently disable CP communication, so each rank attended only to its local chunk and produced artificially high throughput with broken gradients.
- Product must match: `prod(hierarchical_context_parallel_sizes)` must exactly equal `context_parallel_size`. A mismatch triggers an assertion.
- Verify in logs: Look for the process group initialization output. You should see `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created. If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active.
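Two of the pitfalls above (decentralized process groups and a missing size list) can be caught from the config alone. The sketch below is a hypothetical pre-flight check. It uses `SimpleNamespace` stand-ins for `cfg.model` and `cfg.dist`, so the attribute names follow the override in this document, but the helper names are assumptions and not Bridge APIs.

```python
from types import SimpleNamespace
from typing import List


def uses_a2a_p2p(cp_comm_type) -> bool:
    """True if a uniform or per-layer cp_comm_type selects a2a+p2p."""
    if cp_comm_type is None:
        return False
    if isinstance(cp_comm_type, str):
        return cp_comm_type == "a2a+p2p"
    return "a2a+p2p" in cp_comm_type


def hcp_pitfalls(model, dist) -> List[str]:
    """Return warnings for config combinations that leave HCP inactive."""
    warnings = []
    if not uses_a2a_p2p(model.cp_comm_type):
        return warnings
    if model.hierarchical_context_parallel_sizes is None:
        warnings.append("a2a+p2p needs hierarchical_context_parallel_sizes")
    if dist.use_decentralized_pg:
        warnings.append("use_decentralized_pg=True yields flat CP groups only")
    return warnings


if __name__ == "__main__":
    # Stand-in config objects; real Bridge configs are not constructed here.
    model = SimpleNamespace(
        cp_comm_type="a2a+p2p",
        hierarchical_context_parallel_sizes=[2, 2],
    )
    dist = SimpleNamespace(use_decentralized_pg=True)
    for message in hcp_pitfalls(model, dist):
        print("warning:", message)
```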
Verification
No dedicated Bridge end-to-end test exists yet for HCP (see `follow_up_validation` in @skills/nemo-mbridge-perf-hierarchical-context-parallel/card.yaml). Use the existing unit tests and log inspection instead.

Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved:
bash
uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q

For a manual smoke check, launch a 4-GPU run with a small recipe and `cp_comm_type=a2a+p2p` plus `hierarchical_context_parallel_sizes=[2,2]`:
bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.context_parallel_size=4 \
  model.cp_comm_type=a2a+p2p \
  "model.hierarchical_context_parallel_sizes=[2,2]" \
  train.train_iters=2

Success criteria:
- Logs show `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created
- Training completes at least one step without error
- If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active
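The log-based success criteria can be scripted. The sketch below classifies captured log text using only the two marker strings named above. The function name `classify_cp_log` and its return labels are hypothetical, and the exact format of the surrounding log lines is not assumed. Since `CONTEXT_PARALLEL_GROUP` is a substring of `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS`, the hierarchical marker has to be tested first.

```python
HCP_MARKER = "HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"
FLAT_CP_MARKER = "CONTEXT_PARALLEL_GROUP"


def classify_cp_log(log_text: str) -> str:
    """Classify training log text by which CP group markers it mentions."""
    # Check the hierarchical marker first: it contains the flat marker.
    if HCP_MARKER in log_text:
        return "hcp-active"
    if FLAT_CP_MARKER in log_text:
        return "flat-cp-only"
    return "no-cp-marker"


if __name__ == "__main__":
    import sys

    # Usage: python check_hcp_log.py < train.log
    result = classify_cp_log(sys.stdin.read())
    print(result)
    sys.exit(0 if result == "hcp-active" else 1)
```

The non-zero exit code for anything other than `hcp-active` makes the script usable as a gate after the smoke run.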