Memory Tuning
Stable docs: @docs/parallelisms.md
Card: @skills/perf-memory-tuning/card.yaml
What It Is
GPU OOM failures during training often stem from memory fragmentation rather
than raw capacity. PyTorch's default CUDA allocator can leave unusable gaps
between allocations. The single most effective fix is:
bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
This tells PyTorch to use expandable (non-fixed-size) memory segments, which
dramatically reduces fragmentation and often eliminates borderline OOM without
any model or parallelism changes.
Beyond fragmentation, actual peak memory is determined by:
- Parameter + optimizer state memory — controlled by TP, PP, DP sharding
(distributed optimizer, FSDP)
- Activation memory — controlled by activation recompute, sequence length,
micro-batch size
- Temporary / workspace memory — CUDA kernels, NCCL buffers, CUDA graphs
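The first bullet can be turned into a rough back-of-the-envelope estimate. The sketch below is an illustration under stated assumptions, not a profiler: the function name and the byte counts (2-byte weights plus 4-byte gradients, 12 bytes of Adam state per parameter) are assumptions, and it ignores activations, workspace memory, FP8 details, and uneven layer placement.

```python
def param_and_optimizer_gb(num_params, tp, pp, dp,
                           distributed_optimizer=True,
                           weight_grad_bytes=6.0,
                           optimizer_bytes=12.0):
    """Rough per-GPU memory (GB, 1e9 bytes) for weights, grads, and optimizer state.

    Assumed byte counts (not taken from this document):
      weight_grad_bytes: 2-byte weights + 4-byte gradients per parameter
      optimizer_bytes:   fp32 master weight + two Adam moments per parameter
    """
    # TP and PP both split the model weights across GPUs.
    params_per_gpu = num_params / (tp * pp)
    # The distributed optimizer shards optimizer state across DP ranks.
    shard = dp if distributed_optimizer else 1
    bytes_per_param = weight_grad_bytes + optimizer_bytes / shard
    return params_per_gpu * bytes_per_param / 1e9


# Hypothetical 8B-parameter model on TP=2, PP=1, DP=4.
with_dist_opt = param_and_optimizer_gb(8e9, tp=2, pp=1, dp=4)
without_dist_opt = param_and_optimizer_gb(8e9, tp=2, pp=1, dp=4,
                                          distributed_optimizer=False)
print(f"with distributed optimizer:    {with_dist_opt:.1f} GB")
print(f"without distributed optimizer: {without_dist_opt:.1f} GB")
```

Comparing the two calls shows why sharding optimizer state across DP ranks is usually the cheapest capacity lever: the weight and gradient term is unchanged, and only the optimizer term shrinks with DP.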
Quick Decision
When a training run OOMs or is close to the memory limit:
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. This fixes
fragmentation-induced OOM with zero performance cost. Most Slurm launch
templates already include it.
- Add selective activation recompute (`recompute_modules=[core_attn]`) if
not already enabled. See @skills/perf-activation-recompute/SKILL.md.
- Avoid increasing TP as a memory fix — doubling TP dramatically increases
NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
- Avoid increasing PP at the cost of DP — halving DP doubles gradient
accumulation steps and hurts throughput (~6%).
- Consider recompute if still OOM. Saves ~3 GB but costs ~16% GPU
utilization on large dense models (Llama3 70B).
- CPU offloading is blocked when PP > 1.
Enablement
Expandable segments (recommended first step)
Set in the job's environment before launching:
bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
In Slurm scripts this is typically placed alongside other env vars:
bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
No model config changes needed. Zero throughput cost.
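If a job already sets other allocator options, a plain `export` would overwrite them. The following is a minimal sketch of merging the flag into an existing value from Python. The helper name `with_expandable_segments` is hypothetical, and the parser assumes simple comma-separated `option:value` pairs (it does not handle bracketed list values that contain commas). To be safe, it should run before `torch` is imported so the value is in place before any CUDA memory is allocated.

```python
import os


def with_expandable_segments(existing: str = "") -> str:
    """Return an allocator config string with expandable_segments:True set.

    Other comma-separated option:value pairs in `existing` are preserved.
    Assumption: no option value contains a comma.
    """
    options = {}
    for item in existing.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so the value keeps any later colons.
        key, _, value = item.partition(":")
        options[key.strip()] = value.strip()
    options["expandable_segments"] = "True"
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Apply to the current process environment before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```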
Parallelism resizing
If the model genuinely does not fit (not fragmentation), adjust parallelism:
| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/perf-megatron-fsdp/SKILL.md |
Activation recompute
See @skills/perf-activation-recompute/SKILL.md for full details.
CPU offloading
python
cfg.model.cpu_offloading = True
Incompatible with PP > 1. Only usable when `pipeline_model_parallel_size = 1`.
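A launch script can catch this combination before any GPU time is spent. The sketch below mirrors the rule stated above. The function name `validate_cpu_offloading` is hypothetical and is not part of MCore or any other library.

```python
def validate_cpu_offloading(cpu_offloading: bool,
                            pipeline_model_parallel_size: int) -> None:
    """Fail fast if CPU offloading is requested together with PP > 1."""
    if cpu_offloading and pipeline_model_parallel_size > 1:
        raise ValueError(
            "CPU offloading requires pipeline_model_parallel_size = 1, "
            f"got {pipeline_model_parallel_size}"
        )


# Allowed: offloading with a single pipeline stage.
validate_cpu_offloading(cpu_offloading=True, pipeline_model_parallel_size=1)

# Rejected: offloading with PP=4.
try:
    validate_cpu_offloading(cpu_offloading=True, pipeline_model_parallel_size=4)
except ValueError as err:
    print(f"rejected: {err}")
```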
A Note on VPP
Virtual pipeline parallelism (VPP) is primarily a throughput optimization
that reduces pipeline bubble overhead by interleaving smaller model chunks. Its
effect on peak memory is minimal — changing VPP does not meaningfully change
the total activation, parameter, or optimizer memory on a GPU.
In earlier experiments we incorrectly attributed an OOM fix to VPP tuning
(VPP 5→10). The actual fix was
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, which eliminated memory
fragmentation. The VPP=10 run actually used slightly more peak memory
(60.2 GB vs 58.8 GB) but did not OOM because expandable segments prevented
fragmentation.
VPP should be tuned for pipeline bubble reduction (see @docs/parallelisms.md),
not as a memory fix.
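To see what VPP does change, the idealized bubble fraction of an interleaved schedule can be sketched as `(PP - 1) / (VPP * m)`, where `m` is the number of micro-batches per step. This formula is the standard analytical estimate for interleaved pipeline schedules and is an assumption here, not a measurement from this document. The function name is hypothetical, and `m = 16` is derived from the baseline in Measured Results (GBS=32, MBS=1, DP=2).

```python
def pipeline_bubble_fraction(pp: int, num_microbatches: int, vpp: int = 1) -> float:
    """Idealized pipeline bubble time as a fraction of useful compute time.

    Assumes equal-cost stages and ignores communication overhead.
    """
    return (pp - 1) / (vpp * num_microbatches)


# Baseline-like settings: PP=4, 16 micro-batches per step.
for vpp in (1, 5, 10):
    fraction = pipeline_bubble_fraction(pp=4, num_microbatches=16, vpp=vpp)
    print(f"VPP={vpp:>2}: bubble fraction = {fraction:.4f}")
```

The estimate shrinks with VPP while nothing in it touches parameter, optimizer, or activation sizes, which is consistent with treating VPP as a throughput knob only.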
Compatibility and Constraints
- Expandable segments (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`) are
incompatible with NCCL user-buffer registration. See Megatron-FSDP docs.
- When using CUDA graphs with , set
(required on pre-Blackwell GPUs, enforced by MCore
).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- Distributed optimizer requires `use_distributed_optimizer = True` in the
optimizer config.
Measured Results
Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
- Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
- Golden GPU utilization: 709.93 TFLOP/s/GPU
- Regression threshold: 5%
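The "vs Golden" percentages reported for these runs follow from the golden value and threshold above. A minimal sketch of that arithmetic (the function and constant names are illustrative, not from any tooling referenced here):

```python
GOLDEN_TFLOPS = 709.93        # golden GPU utilization, TFLOP/s/GPU
REGRESSION_THRESHOLD = 0.05   # 5% regression threshold


def vs_golden(tflops: float, golden: float = GOLDEN_TFLOPS) -> float:
    """Relative change against the golden value (negative means slower)."""
    return (tflops - golden) / golden


def is_regression(tflops: float,
                  golden: float = GOLDEN_TFLOPS,
                  threshold: float = REGRESSION_THRESHOLD) -> bool:
    """True when throughput falls more than `threshold` below golden."""
    return vs_golden(tflops, golden) < -threshold


for name, tflops in [("baseline", 704.0), ("PP=8", 668.0)]:
    print(f"{name}: {vs_golden(tflops) * 100:+.1f}%  "
          f"regression={is_regression(tflops)}")
```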
Strategy comparison: parallelism changes for memory reduction
| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|---|---|---|
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | Passed |
Key takeaways:
- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is the winner. The
baseline OOM was caused by memory fragmentation, not insufficient capacity.
Setting this env var eliminated the OOM with zero throughput cost and no
parallelism changes.
- PP=8 works for memory but loses DP (2→1), meaning 32 gradient accumulation
steps per batch, which hurts throughput by ~6%.
- TP=8 is catastrophic (-28%) because doubling TP increases all-reduce
communication volume proportionally across NVLink, and DP=1 means no
micro-batch overlap.
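The DP and gradient-accumulation numbers in these takeaways follow from the GPU count and batch settings. A short sketch of the arithmetic, with hypothetical helper names, assuming the world size factors exactly into TP x PP x DP:

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """DP degree left over after TP and PP are chosen."""
    dp, remainder = divmod(world_size, tp * pp)
    if remainder or dp == 0:
        raise ValueError(f"{world_size} GPUs cannot be split into TP={tp} x PP={pp}")
    return dp


def grad_accumulation_steps(global_batch_size: int,
                            micro_batch_size: int,
                            dp: int) -> int:
    """Micro-batches each DP rank processes per optimizer step."""
    steps, remainder = divmod(global_batch_size, micro_batch_size * dp)
    if remainder:
        raise ValueError("global batch size must be divisible by MBS * DP")
    return steps


# 32 GPUs, GBS=32, MBS=1, as in the runs above.
for tp, pp in [(4, 4), (4, 8), (8, 4)]:
    dp = data_parallel_size(32, tp, pp)
    steps = grad_accumulation_steps(32, 1, dp)
    print(f"TP={tp} PP={pp} -> DP={dp}, accumulation steps={steps}")
```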
CPU offloading: blocked
| Experiment | offload_layers | Result |
|---|---|---|
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading.
This approach is blocked for any model using PP > 1.
Activation recompute: expensive alternative
Selective activation recompute with `recompute_modules=[core_attn]` saved
~3 GB peak memory but cost ~16% GPU utilization on this workload. See
@skills/perf-activation-recompute/SKILL.md for full results.
Code Anchors
CPU offloading PP incompatibility (MCore)
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
VPP config and layer divisibility validation (MCore)
if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
    num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
    if (
        not num_layers_per_middle_pipeline_rank
        % self.virtual_pipeline_model_parallel_size
        == 0
    ):
        raise ValueError(
            f"number of layers on each middle pipeline rank:"
            f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual"
            f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
        )
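The divisibility rule in this check can be used to list the VPP values worth trying for a given model. The sketch below mirrors only the quoted condition, assuming layers are split evenly across pipeline ranks. The helper name is hypothetical, and the 80-layer count used for Llama3 70B is an assumption not stated in this document.

```python
def valid_vpp_sizes(num_layers: int, pipeline_parallel_size: int) -> list[int]:
    """VPP degrees that evenly divide the layers held by each pipeline rank."""
    layers_per_rank = num_layers // pipeline_parallel_size
    return [
        vpp
        for vpp in range(1, layers_per_rank + 1)
        if layers_per_rank % vpp == 0
    ]


# Assumed 80 transformer layers with PP=4: 20 layers per pipeline rank.
print(valid_vpp_sizes(num_layers=80, pipeline_parallel_size=4))
```

Under these assumptions both VPP=5 and VPP=10 pass the check, which matches the two settings compared in the VPP note.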
Parallelism docs on interleaved pipeline schedule
To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:
model_config = GPTModelProvider(
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # 2 model chunks per pipeline stage
    # ... other model parameters
)
Failure Diagnosis
| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| OOM on a single rank despite headroom on others | Memory fragmentation | check if `PYTORCH_CUDA_ALLOC_CONF` is set | set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| OOM with `expandable_segments:True` already set | Genuine capacity limit | check for param/optimizer memory | increase PP, use distributed optimizer, or add recompute |
| `ValueError`: PP + CPU offloading | using `cpu_offloading` with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| Failure with NCCL user-buffer registration + expandable segments | NCCL UB incompatible with expandable allocator | check env vars | remove `expandable_segments:True` or disable NCCL user-buffer registration |
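For the fragmentation row, one rough way to confirm the cause is to compare how much memory the allocator has reserved against how much live tensors actually use. The sketch below is a heuristic under stated assumptions: the helper names and the 10% threshold are illustrative choices, and a large gap is a hint rather than proof of fragmentation. The inputs can come from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`.

```python
def unused_reserved_gb(allocated_bytes: float, reserved_bytes: float) -> float:
    """Memory held by the caching allocator but not used by tensors, in GB."""
    return (reserved_bytes - allocated_bytes) / 1e9


def looks_fragmented(allocated_bytes: float,
                     reserved_bytes: float,
                     threshold: float = 0.10) -> bool:
    """Heuristic: flag when more than `threshold` of reserved memory is unused."""
    if reserved_bytes <= 0:
        return False
    return (reserved_bytes - allocated_bytes) / reserved_bytes > threshold


# With live numbers (requires torch and a CUDA device):
#   import torch
#   looks_fragmented(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

# Illustrative values only: 50 GB in use, 70 GB reserved.
print(unused_reserved_gb(50e9, 70e9), looks_fragmented(50e9, 70e9))
```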
Known Limitations
- CPU offloading is blocked when PP > 1
- Parallelism resizing (TP/PP) often has significant throughput costs
- No automatic memory profiling to recommend the optimal strategy
Verification
Quick check that `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is active:
python
import os
assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
For Slurm jobs, verify the env var is exported before the training command
in the launch script.