perf-memory-tuning
Memory Tuning
Stable docs: @docs/parallelisms.md
Card: @skills/perf-memory-tuning/card.yaml
What It Is
GPU OOM failures during training often stem from memory fragmentation rather
than raw capacity. PyTorch's default CUDA allocator can leave unusable gaps
between allocations. The single most effective fix is:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

This tells PyTorch to use expandable (non-fixed-size) memory segments, which
dramatically reduces fragmentation and often eliminates borderline OOM without
any model or parallelism changes.

Beyond fragmentation, actual peak memory is determined by:
- Parameter + optimizer state memory — controlled by TP, PP, DP sharding (distributed optimizer, FSDP)
- Activation memory — controlled by activation recompute, sequence length, micro-batch size
- Temporary / workspace memory — CUDA kernels, NCCL buffers, CUDA graphs
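
As a rough illustration of the first item, the sketch below estimates per-GPU parameter, gradient, and optimizer-state memory. The byte counts (bf16 weights, an fp32 gradient buffer, fp32 master weights plus two Adam moments) and the even split across TP × PP are assumptions for illustration, not measured values, and the function name is hypothetical. Activations and workspace memory are not included.

```python
def estimate_static_memory_gb(
    num_params: float,
    tp: int,
    pp: int,
    dp: int,
    distributed_optimizer: bool = True,
    weight_bytes: int = 2,   # assumption: bf16 weights
    grad_bytes: int = 4,     # assumption: fp32 gradient buffer
    optim_bytes: int = 12,   # assumption: fp32 master weights + two Adam moments
) -> float:
    """Rough per-GPU weights + grads + optimizer state, in GB (1e9 bytes)."""
    # Assumption: parameters are split evenly across TP and PP ranks.
    params_per_gpu = num_params / (tp * pp)
    # The distributed optimizer shards optimizer state across DP ranks.
    optim_shards = dp if distributed_optimizer else 1
    bytes_per_param = weight_bytes + grad_bytes + optim_bytes / optim_shards
    return params_per_gpu * bytes_per_param / 1e9


if __name__ == "__main__":
    for dist_opt in (True, False):
        gb = estimate_static_memory_gb(
            70e9, tp=4, pp=4, dp=2, distributed_optimizer=dist_opt
        )
        print(f"distributed_optimizer={dist_opt}: ~{gb:.1f} GB per GPU")
```

Treat the result as an order-of-magnitude guide only; real layouts (embeddings on the first and last pipeline stages, FP8 buffers) shift the numbers.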
Quick Decision
When a training run OOMs or is close to the memory limit:
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. This fixes fragmentation-induced OOM with zero performance cost. Most Slurm launch templates already include it.
- Add selective activation recompute (`recompute_modules=[core_attn]`) if not already enabled. See @skills/perf-activation-recompute/SKILL.md.
- Avoid increasing TP as a memory fix: doubling TP dramatically increases NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
- Avoid increasing PP at the cost of DP: halving DP doubles gradient accumulation steps and hurts throughput (~6%).
- Consider `mlp` recompute if still OOM. Saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
- CPU offloading is blocked when PP > 1.
Enablement
Expandable segments (recommended first step)
Set in the job's environment before launching:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

In Slurm scripts this is typically placed alongside other env vars:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

No model config changes needed. Zero throughput cost.

Parallelism resizing
If the model genuinely does not fit (not fragmentation), adjust parallelism:
| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/perf-megatron-fsdp/SKILL.md |
Activation recompute
See @skills/perf-activation-recompute/SKILL.md for full details.
CPU offloading

```python
cfg.model.cpu_offloading = True
```

Incompatible with PP > 1. Only usable when `pipeline_model_parallel_size = 1`.

A Note on VPP
Virtual pipeline parallelism (VPP) is primarily a throughput optimization
that reduces pipeline bubble overhead by interleaving smaller model chunks. Its
effect on peak memory is minimal — changing VPP does not meaningfully change
the total activation, parameter, or optimizer memory on a GPU.
In earlier experiments we incorrectly attributed an OOM fix to VPP tuning
(VPP 5→10). The actual fix was
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which eliminated memory
fragmentation. The VPP=10 run actually used slightly more peak memory
(60.2 GB vs 58.8 GB) but did not OOM because expandable segments prevented
fragmentation.

VPP should be tuned for pipeline bubble reduction (see @docs/parallelisms.md),
not as a memory fix.
Compatibility and Constraints
- `expandable_segments:True` is incompatible with `--use-nccl-ub` (NCCL user-buffer registration). See Megatron-FSDP docs.
- When using CUDA graphs with `expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (required on pre-Blackwell GPUs, enforced by MCore `CudaGraphManager`).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- Distributed optimizer requires `use_distributed_optimizer = True` in the optimizer config.
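
The constraints above can be turned into a pre-launch check. The following is a minimal sketch, assuming the relevant settings are available as plain values; `find_memory_config_conflicts` and its parameters are hypothetical names, not part of any library. For simplicity it applies the `NCCL_GRAPH_REGISTER=0` rule on all GPUs, although the requirement is stated only for pre-Blackwell hardware.

```python
def find_memory_config_conflicts(
    env: dict,
    pipeline_model_parallel_size: int,
    cpu_offloading: bool = False,
    use_nccl_ub: bool = False,
    cuda_graphs: bool = False,
) -> list:
    """Return human-readable conflicts among the memory-related settings."""
    conflicts = []
    alloc_conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    expandable = "expandable_segments:True" in alloc_conf

    if expandable and use_nccl_ub:
        conflicts.append(
            "expandable_segments:True is incompatible with --use-nccl-ub"
        )
    # Simplification: applied regardless of GPU generation.
    if expandable and cuda_graphs and env.get("NCCL_GRAPH_REGISTER") != "0":
        conflicts.append(
            "set NCCL_GRAPH_REGISTER=0 when using CUDA graphs with "
            "expandable_segments:True"
        )
    if cpu_offloading and pipeline_model_parallel_size > 1:
        conflicts.append(
            "CPU offloading requires pipeline_model_parallel_size = 1"
        )
    return conflicts


if __name__ == "__main__":
    env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
    problems = find_memory_config_conflicts(
        env, pipeline_model_parallel_size=4, cpu_offloading=True
    )
    for problem in problems:
        print("conflict:", problem)
```

In a real launcher, `env` would typically be `os.environ` and the flags would come from the parsed training config.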
Measured Results
Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
- Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
- Golden GPU utilization: 709.93 TFLOP/s/GPU
- Regression threshold: 5%
Strategy comparison: parallelism changes for memory reduction
| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|---|---|---|
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | Passed |
Key takeaways:
- `expandable_segments:True` is the winner. The baseline OOM was caused by memory fragmentation, not insufficient capacity. Setting this env var eliminated the OOM with zero throughput cost and no parallelism changes.
- PP=8 works for memory but loses DP (2→1), meaning 32 gradient accumulation steps per batch, which hurts throughput by ~6%.
- TP=8 is catastrophic (-28%) because doubling TP increases all-reduce communication volume proportionally across NVLink, and DP=1 means no micro-batch overlap.
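
The arithmetic behind the DP and gradient-accumulation figures can be reproduced with a short sketch. The helper names are hypothetical; the inputs (32 GPUs, GBS=32, MBS=1, golden value 709.93 TFLOP/s/GPU, 5% threshold) come from the setup above.

```python
GOLDEN_TFLOPS = 709.93       # golden value from the measured setup
REGRESSION_THRESHOLD = 5.0   # percent


def data_parallel_size(num_gpus: int, tp: int, pp: int) -> int:
    """DP is whatever is left after TP and PP claim their GPUs."""
    return num_gpus // (tp * pp)


def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches each DP rank must process per global batch."""
    return gbs // (mbs * dp)


def pct_vs_golden(tflops: float, golden: float = GOLDEN_TFLOPS) -> float:
    """Signed throughput change relative to the golden value, in percent."""
    return (tflops / golden - 1.0) * 100.0


if __name__ == "__main__":
    for name, tp, pp in [("Baseline", 4, 4), ("More PP", 4, 8), ("More TP", 8, 4)]:
        dp = data_parallel_size(32, tp, pp)
        steps = grad_accum_steps(gbs=32, mbs=1, dp=dp)
        print(f"{name}: DP={dp}, gradient accumulation steps={steps}")

    delta = pct_vs_golden(668.0)
    print(f"More PP: {delta:.1f}% vs golden, "
          f"exceeds threshold: {abs(delta) > REGRESSION_THRESHOLD}")
```

The baseline runs 16 accumulation steps per global batch; both DP=1 configurations double that to 32.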
CPU offloading: blocked
| Experiment | offload_layers | Result |
|---|---|---|
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |

ValueError: Currently there is no support for Pipeline parallelism with CPU offloading.

Activation recompute: expensive alternative
Selective activation recompute with `mlp` saved ~3 GB peak memory but cost
~16% GPU utilization on this workload. See
@skills/perf-activation-recompute/SKILL.md for full results.

Code Anchors
CPU offloading PP incompatibility (MCore)
Line 1303:

```python
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
```

VPP config and layer divisibility validation (MCore)
Line 1581:

```python
if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
    num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
    if (
        not num_layers_per_middle_pipeline_rank
        % self.virtual_pipeline_model_parallel_size
        == 0
    ):
        raise ValueError(
            f"number of layers on each middle pipeline rank:"
            f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual"
            f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
        )
```

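
To see how this divisibility rule plays out before launching, the check can be mirrored in a standalone helper. This is a simplified sketch, not the MCore implementation: `vpp_is_valid` is a hypothetical name, it assumes layers are split evenly across pipeline ranks, and the 80-layer count used for Llama3 70B is an assumption not stated in this document.

```python
def vpp_is_valid(num_layers: int, pp: int, vpp: int) -> bool:
    """True if the layers on each pipeline rank divide evenly into VPP chunks."""
    layers_per_rank = num_layers // pp
    return layers_per_rank % vpp == 0


if __name__ == "__main__":
    NUM_LAYERS = 80  # assumption: Llama3 70B layer count
    for pp, vpp in [(4, 5), (4, 10), (8, 5), (4, 3)]:
        status = "ok" if vpp_is_valid(NUM_LAYERS, pp, vpp) else "rejected"
        print(f"PP={pp}, VPP={vpp}: {status}")
```

Under that assumption, PP=4 leaves 20 layers per rank, so both VPP=5 and VPP=10 pass, while VPP=3 would be rejected.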
Parallelism docs on interleaved pipeline schedule
Line 116:

To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:

```python
model_config = GPTModelProvider(
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # 2 model chunks per pipeline stage
    # ... other model parameters
)
```

Failure Diagnosis
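| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| OOM on a single rank despite headroom on others | Memory fragmentation | check if `PYTORCH_CUDA_ALLOC_CONF` is set | set `expandable_segments:True` |
| OOM with `expandable_segments:True` set | Genuine capacity limit | check peak memory usage | increase PP, use distributed optimizer, or add recompute |
| `ValueError: Currently there is no support for Pipeline parallelism with CPU offloading` | using cpu_offloading with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| Failure with `--use-nccl-ub` and `expandable_segments:True` both set | NCCL UB incompatible with expandable allocator | check env vars | remove one of the two settings |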
Known Limitations
- CPU offloading is blocked when PP > 1
- Parallelism resizing (TP/PP) often has significant throughput costs
- No automatic memory profiling to recommend the optimal strategy
Verification
Quick check that `expandable_segments:True` is active:

```python
import os
assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
```

For Slurm jobs, verify the env var is exported before the training command
in the launch script.
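
When `PYTORCH_CUDA_ALLOC_CONF` carries several options, parsing the comma-separated `key:value` pairs gives a clearer failure message than a substring test. The sketch below is one way to do that; `parse_alloc_conf` and `expandable_segments_enabled` are hypothetical helper names, and the parser only inspects the environment variable, not the allocator's actual runtime state.

```python
import os


def parse_alloc_conf(value: str) -> dict:
    """Split a PYTORCH_CUDA_ALLOC_CONF string into an option dictionary."""
    options = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition(":")
        options[key.strip()] = val.strip()
    return options


def expandable_segments_enabled(value: str) -> bool:
    """True only if the option is present and set exactly to 'True'."""
    return parse_alloc_conf(value).get("expandable_segments") == "True"


if __name__ == "__main__":
    conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    if expandable_segments_enabled(conf):
        print("expandable_segments is enabled")
    else:
        print(f"expandable_segments is NOT enabled (PYTORCH_CUDA_ALLOC_CONF={conf!r})")
```

Running this at the top of the training entry point, before any CUDA allocation, surfaces a missing export early.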