nemo-mbridge-perf-hierarchical-context-parallel
Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
Hierarchical Context Parallel Skill
This skill covers hierarchical context parallelism: nested context-parallel process
groups used by cp_comm_type="a2a+p2p" and configured with
hierarchical_context_parallel_sizes.
For what hierarchical CP is, when to use it, and the decision tree
(a2a+p2p vs pure a2a vs p2p), see:
- @docs/training/hierarchical-context-parallel.md
- @skills/nemo-mbridge-perf-hierarchical-context-parallel/card.yaml
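To make "nested" concrete, the following sketch splits a flat list of CP ranks into one set of groups per level. The helper name hierarchical_groups and the rank layout (contiguous ranks at the first level, strided ranks at later levels) are assumptions made for illustration; the layout the library actually uses is defined in parallel_state and should be checked there.

```python
from math import prod


def hierarchical_groups(cp_ranks, sizes):
    """Split CP ranks into one list of groups per hierarchy level.

    Illustrative assumption: level 0 groups hold contiguous ranks, and each
    later level strides across the groups formed by the earlier levels.
    """
    if prod(sizes) != len(cp_ranks):
        raise ValueError("product of sizes must equal the number of CP ranks")
    levels = []
    stride = 1  # distance between neighbouring ranks inside a group at this level
    for size in sizes:
        block = stride * size  # ranks covered by one bundle of groups at this level
        groups = []
        for start in range(0, len(cp_ranks), block):
            for offset in range(stride):
                groups.append(
                    [cp_ranks[start + offset + i * stride] for i in range(size)]
                )
        levels.append(groups)
        stride = block
    return levels


# With sizes [2, 2], the first entry is the a2a level and the second the p2p level.
a2a_groups, p2p_groups = hierarchical_groups([0, 1, 2, 3], [2, 2])
print(a2a_groups)  # [[0, 1], [2, 3]]
print(p2p_groups)  # [[0, 2], [1, 3]]
```

Under this assumed layout, every rank belongs to exactly one group per level, which is why the product of the sizes has to equal the total CP size.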
Enablement
Minimal Bridge override:
python
cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
cfg.dist.use_decentralized_pg = False

Required constraints:
- prod(hierarchical_context_parallel_sizes) == context_parallel_size
- seq_length % (2 * context_parallel_size) == 0
- Transformer Engine >= 1.12.0
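The two numeric constraints can be checked before launching a job. The sketch below is a hypothetical helper (check_hcp_constraints is not part of Bridge or MCore) that operates on plain values; it does not check the Transformer Engine version.

```python
from math import prod


def check_hcp_constraints(context_parallel_size, hierarchical_sizes, seq_length):
    """Return a list of constraint violations; an empty list means both checks pass."""
    problems = []
    if prod(hierarchical_sizes) != context_parallel_size:
        problems.append(
            f"prod({hierarchical_sizes}) = {prod(hierarchical_sizes)} "
            f"!= context_parallel_size = {context_parallel_size}"
        )
    if seq_length % (2 * context_parallel_size) != 0:
        problems.append(
            f"seq_length = {seq_length} is not divisible by "
            f"2 * context_parallel_size = {2 * context_parallel_size}"
        )
    return problems


print(check_hcp_constraints(4, [2, 2], 8192))  # []
print(len(check_hcp_constraints(4, [2, 4], 8190)))  # 2
```

Returning messages instead of asserting makes it possible to report every violation in one pass.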
Code Anchors
Upstream config and validation:
context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""
hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify
the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
groups of two levels, so the first value of the list indicates the group size of the a2a
communication type, and the second value indicates the group size of the p2p communication
type.
"""

if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
        "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"

Bridge MPU path:
parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()

Bridge decentralized-PG path:
pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)

Implementation Map
The code anchors above show the config declarations and argument validation.
- Validation (MCore): TransformerConfig.__post_init__, a2a+p2p
- Process group creation: parallel_state.initialize_model_parallel, create_hierarchical_groups, ProcessGroupCollection
- TE integration: TEDotProductAttention, a2a+p2p

Pitfalls
- Bridge HCP is MPU-only today: If use_decentralized_pg=True, Bridge initializes flat CP groups and leaves HCP unset.
- No checked-in Bridge recipe currently exercises HCP directly.
- Single-GPU load helpers clear hierarchical_context_parallel_sizes.
- Silent broken training on old stacks: If you use a2a+p2p without setting hierarchical_context_parallel_sizes, MCore now asserts. Older versions would silently disable CP communication, so each rank attended only to its local chunk and produced artificially high throughput with broken gradients.
- Product must match: prod(hierarchical_context_parallel_sizes) must exactly equal context_parallel_size. A mismatch triggers an assertion.
- Verify in logs: Look for the process group initialization output. You should see HIERARCHICAL_CONTEXT_PARALLEL_GROUPS being created. If you only see CONTEXT_PARALLEL_GROUP, HCP is not active.
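Two of the pitfalls above depend only on configuration values, so they can be flagged before any process group is created. The following is a minimal sketch; hcp_preflight is a hypothetical helper that takes plain values rather than a real Bridge config object.

```python
def hcp_preflight(cp_comm_type, hierarchical_sizes, use_decentralized_pg):
    """Return warnings for configuration combinations described in the pitfalls."""
    warnings = []
    wants_hcp = "a2a+p2p" in cp_comm_type
    if wants_hcp and hierarchical_sizes is None:
        warnings.append(
            "a2a+p2p requested but hierarchical_context_parallel_sizes is unset"
        )
    if use_decentralized_pg and (wants_hcp or hierarchical_sizes is not None):
        warnings.append(
            "use_decentralized_pg=True: Bridge builds flat CP groups and leaves HCP unset"
        )
    return warnings


print(hcp_preflight("a2a+p2p", [2, 2], False))  # []
for message in hcp_preflight("a2a+p2p", None, True):
    print(message)
```

A check like this is most useful on older stacks, where the missing-sizes case is described as failing silently instead of asserting.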
Verification
No dedicated Bridge end-to-end test exists yet for HCP (see follow_up_validation in
@skills/nemo-mbridge-perf-hierarchical-context-parallel/card.yaml). Use the existing unit tests and log inspection instead.

Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved:
bash
uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q

For a manual smoke check, launch a 4-GPU run with a small recipe and
cp_comm_type=a2a+p2p plus hierarchical_context_parallel_sizes=[2,2]:
bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.context_parallel_size=4 \
  model.cp_comm_type=a2a+p2p \
  "model.hierarchical_context_parallel_sizes=[2,2]" \
  train.train_iters=2

Success criteria:
- Logs show HIERARCHICAL_CONTEXT_PARALLEL_GROUPS being created
- Training completes at least one step without error
- If you only see CONTEXT_PARALLEL_GROUP, HCP is not active
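The log criteria can be automated with a small scan over captured output. This sketch assumes only the two marker strings named above; classify_cp_log is a hypothetical helper and the sample strings are made up, not real log lines. Because CONTEXT_PARALLEL_GROUP is a substring of HIERARCHICAL_CONTEXT_PARALLEL_GROUPS, the hierarchical marker is tested first.

```python
HCP_MARKER = "HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"
FLAT_MARKER = "CONTEXT_PARALLEL_GROUP"


def classify_cp_log(log_text):
    """Classify captured training output as 'hcp', 'flat-only', or 'unknown'."""
    # Check the longer marker first: the flat marker is a substring of it.
    if HCP_MARKER in log_text:
        return "hcp"
    if FLAT_MARKER in log_text:
        return "flat-only"
    return "unknown"


# Made-up sample strings, used only to exercise the function.
print(classify_cp_log("init HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"))  # hcp
print(classify_cp_log("init CONTEXT_PARALLEL_GROUP"))  # flat-only
print(classify_cp_log("no parallel state output"))  # unknown
```

A result of "flat-only" corresponds to the last criterion: context parallelism is running, but HCP is not active.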