llm-serving-capacity-planner

LLM Serving Capacity Planner

Overview

Use this tool when a serving startup log contains enough memory-related lines to account for how GPU HBM is used. The analyzer reads SGLang/vLLM startup logs, extracts the weight-load, KV pool, CUDA graph, framework-overhead, and token-capacity lines, then estimates how many concurrent requests fit at common request token lengths.

Confirmation Required

Before running analysis, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
| --- | --- | --- | --- |
| Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | None (required) |
| GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible |
| nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with: nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt | None (optional, but recommended) |
| Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | None (optional; log data used instead) |
| Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |

Workflow

Step 1: Collect the serving log

The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:
  • Load weight begin. avail mem=XX GB
  • Memory profiling: available_gpu_memory=XX GB, ... (newer SGLang)
  • SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX (SWA models like DeepSeek-V4)
  • Memory pool end. avail mem=XX GB
  • Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.
  • max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB
  • server_args=ServerArgs(...) (for serving parameters)
These lines are printed only at startup. To have them available for an instance that is already serving, redirect stdout/stderr to a file at launch time.
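
As a rough illustration of how the memory checkpoints listed above can be pulled out of a log, here is a minimal sketch using regular expressions. It is not the analyzer's actual implementation: the sample log text and its numbers are invented, and the names `CHECKPOINTS` and `extract_checkpoints` are hypothetical.

```python
import re

# Invented sample lines in the shape of the patterns listed above;
# the numbers are placeholders, not measurements.
SAMPLE_LOG = """\
Load weight begin. avail mem=138.50 GB
Memory pool end. avail mem=20.25 GB
Capture cuda graph end. Time elapsed: 12.3 s. mem usage=1.75 GB. avail mem=18.50 GB.
"""

# One regex per checkpoint; each captures the "avail mem" value on that line.
CHECKPOINTS = {
    "load_weight_begin": re.compile(r"Load weight begin\..*?avail mem=([\d.]+) GB"),
    "memory_pool_end": re.compile(r"Memory pool end\..*?avail mem=([\d.]+) GB"),
    "cuda_graph_end": re.compile(r"Capture cuda graph end\..*?avail mem=([\d.]+) GB"),
}


def extract_checkpoints(log_text: str) -> dict[str, float]:
    """Return the first 'avail mem' value (in the log's GB units) per checkpoint."""
    found = {}
    for name, pattern in CHECKPOINTS.items():
        match = pattern.search(log_text)
        if match:
            found[name] = float(match.group(1))
    return found


checkpoints = extract_checkpoints(SAMPLE_LOG)

# A drop in available memory between two checkpoints is the memory
# consumed by whatever ran in between.
weights_plus_pool = checkpoints["load_weight_begin"] - checkpoints["memory_pool_end"]
cuda_graph_cost = checkpoints["memory_pool_end"] - checkpoints["cuda_graph_end"]
print(checkpoints)
print(f"weights + KV pool: {weights_plus_pool:.2f} GB, CUDA graph: {cuda_graph_cost:.2f} GB")
```

Differencing consecutive checkpoints only attributes memory to a phase; splitting weights from the KV pool needs the additional lines listed above.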

Step 2: Optionally capture nvidia-smi data

For per-rank memory comparison:
```bash
docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
```
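
The captured file holds one comma-separated line per GPU. The sketch below shows one way to read it for a per-rank comparison. It is illustrative rather than the analyzer's own parser: the sample values are invented, `parse_smi` is a hypothetical helper, and it assumes the values carry the "MiB" unit suffix that nvidia-smi prints when `nounits` is not requested.

```python
import csv
import io

# Invented sample in the layout produced by
# --query-gpu=index,memory.used,memory.free --format=csv,noheader
SAMPLE_SMI = """\
0, 130000 MiB, 13000 MiB
1, 129500 MiB, 13500 MiB
"""


def parse_smi(text: str) -> list[dict]:
    """Parse 'index, used MiB, free MiB' lines into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, used, free = (field.strip() for field in record)
        rows.append({
            "gpu": int(index),
            "used_mib": int(used.split()[0]),  # "130000 MiB" -> 130000
            "free_mib": int(free.split()[0]),
        })
    return rows


rows = parse_smi(SAMPLE_SMI)
used = [row["used_mib"] for row in rows]
# A large spread between ranks hints at uneven allocation across GPUs.
spread_mib = max(used) - min(used)
print(rows)
print(f"used-memory spread across ranks: {spread_mib} MiB")
```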

Step 3: Run the analyzer

```bash
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json
```
For JSON output (automation):
```bash
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json
```
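
For automation, the JSON mode can be driven from a script. The sketch below assembles the command line using only the flags shown above. The helper names are hypothetical, and it assumes the analyzer writes nothing but the JSON document to stdout; the JSON schema is not described here, so the code does not rely on any particular keys.

```python
import json
import subprocess

ANALYZER = "skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py"


def build_command(log_file, smi_file=None, gpu=None, config_json=None):
    """Assemble the analyzer command line with JSON output enabled."""
    cmd = ["python3", ANALYZER, "--log-file", log_file, "--format", "json"]
    if smi_file:
        cmd += ["--nvidia-smi-file", smi_file]
    if gpu:
        cmd += ["--gpu", gpu]
    if config_json:
        cmd += ["--config-json", config_json]
    return cmd


def run_analyzer(log_file, **options):
    """Run the analyzer and parse its stdout as JSON.

    Assumes stdout contains only the JSON document; raises
    subprocess.CalledProcessError if the analyzer exits non-zero.
    """
    result = subprocess.run(
        build_command(log_file, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


print(build_command("/path/to/sglang.log", gpu="h200"))
```

Calling `run_analyzer("/path/to/sglang.log", gpu="h200")` would then return the parsed report, which can be inspected before any keys are hard-coded into downstream tooling.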

Step 4: Review and interpret results

The analyzer prints:
  1. Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
  2. Per-rank comparison: nvidia-smi data across all TP ranks
  3. KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
  4. Concurrency estimate: max concurrent requests for different token lengths
  5. Tuning notes: configuration changes that may increase capacity
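
The concurrency estimate can be reasoned about as the tighter of two limits: how many requests of a given length fit in the token pool, and the scheduler's request cap. The sketch below is an assumption about that arithmetic, not the analyzer's confirmed method; `estimate_concurrency` is a hypothetical helper and the input numbers are invented.

```python
def estimate_concurrency(max_total_num_tokens, max_running_requests,
                         token_lengths=(4096, 6144, 8192)):
    """Return one row per request token length with both limits and their minimum."""
    rows = []
    for length in token_lengths:
        # How many requests of this length fit in the KV token pool.
        token_limit = max_total_num_tokens // length
        rows.append({
            "token_length": length,
            "token_limit": token_limit,
            "request_limit": max_running_requests,
            "max_concurrent": min(token_limit, max_running_requests),
        })
    return rows


# Invented values standing in for max_total_num_tokens and
# max_running_requests from the startup log.
rows = estimate_concurrency(max_total_num_tokens=1_000_000, max_running_requests=128)
for row in rows:
    print(row)
```

With these inputs the request cap is the binding limit at 4096 and 6144 tokens, while the token pool becomes the limit at 8192 tokens.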

When To Use It

  • After launching an LLM serving instance, to understand how GPU memory is distributed
  • When comparing different --mem-fraction-static values and their impact on KV pool capacity
  • When planning deployment capacity: how many concurrent requests can a given GPU configuration support
  • When investigating OOM issues: identifying which memory category is consuming the most
  • When evaluating whether an fp8 KV cache or EP (expert parallelism) can improve concurrency

Key Concepts

mem-fraction-static

Sets the fraction of GPU memory that SGLang uses for static allocation, meaning model weights plus the KV cache pool. Once the weights are loaded, what remains of that budget becomes the KV pool. Higher values give more KV capacity but leave less headroom for CUDA graphs and other runtime buffers.
  • 0.88 (a typical default): aggressive; only about 12% of GPU memory is left for runtime allocations
  • 0.60: conservative; more free memory is left for runtime, but KV capacity is significantly smaller
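
To make the trade-off concrete, the sketch below compares the two settings. It assumes the definition in the SGLang documentation, where the static fraction covers model weights plus the KV pool, and it approximates the KV pool as that budget minus the weights. The GPU size and weight size are invented, and `kv_pool_gib` and `runtime_headroom_gib` are hypothetical helpers, not part of the analyzer.

```python
def kv_pool_gib(total_gib, weights_gib, mem_fraction_static):
    """Approximate KV pool size: static budget minus model weights (never negative)."""
    return max(0.0, total_gib * mem_fraction_static - weights_gib)


def runtime_headroom_gib(total_gib, mem_fraction_static):
    """Memory left outside the static budget for CUDA graphs and activations."""
    return total_gib * (1 - mem_fraction_static)


# Invented example: a 140 GiB GPU holding 80 GiB of weights per rank.
TOTAL_GIB, WEIGHTS_GIB = 140.0, 80.0
for fraction in (0.88, 0.60):
    pool = kv_pool_gib(TOTAL_GIB, WEIGHTS_GIB, fraction)
    headroom = runtime_headroom_gib(TOTAL_GIB, fraction)
    print(f"fraction={fraction:.2f}  KV pool ~{pool:.1f} GiB  headroom ~{headroom:.1f} GiB")
```

Under these assumptions, lowering the fraction from 0.88 to 0.60 shrinks the KV pool by roughly a factor of ten, because the weights take a fixed share of the static budget first.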

KV Head Replication

When num_key_value_heads < tp_size, the KV cache is replicated across TP ranks rather than split. For example, with kv_heads=1 and tp=8, each of the 8 cards stores a full copy of the KV cache, so per-card KV memory is 8x what it would be if the cache were split evenly.
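
The effect of replication on per-rank KV bytes can be sketched as follows. This assumes standard multi-head or grouped-query attention (separate K and V per layer), with tp_size and num_kv_heads dividing evenly; it does not apply to compressed or sliding-window schemes. The function names and the model dimensions are hypothetical.

```python
def replication_factor(num_kv_heads, tp_size):
    """Number of ranks holding a copy of each KV head (1 means cleanly split)."""
    return max(1, tp_size // num_kv_heads)


def kv_bytes_per_token_per_rank(num_layers, num_kv_heads, head_dim, tp_size, dtype_bytes):
    """Per-rank KV bytes for one token: K and V, per layer, for the heads on this rank."""
    heads_per_rank = max(1, num_kv_heads // tp_size)  # a rank always holds at least one head
    return 2 * num_layers * heads_per_rank * head_dim * dtype_bytes


# Invented dimensions: 32 layers, head_dim 128, bf16 KV cache (2 bytes), tp=8.
split = kv_bytes_per_token_per_rank(32, num_kv_heads=8, head_dim=128, tp_size=8, dtype_bytes=2)
replicated = kv_bytes_per_token_per_rank(32, num_kv_heads=1, head_dim=128, tp_size=8, dtype_bytes=2)
ideal_if_split = 2 * 32 * 1 * 128 * 2 / 8  # what each rank would hold if one head could be split 8 ways

print(replication_factor(8, 8), split)
print(replication_factor(1, 8), replicated, replicated / ideal_if_split)
```

With one KV head and tp=8, each rank stores the whole head, so its per-token cost is 8x the evenly split ideal. Passing dtype_bytes=1 models an fp8 KV cache and halves the per-rank figure.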

SWA (Sliding Window Attention) Compression

Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA (Hierarchical Context Attention) with sliding windows. This reduces per-token KV cache bytes compared to the theoretical full-attention calculation. The bytes_per_full_token value reported in the log already accounts for this compression.

Reporting Checklist

Include:
  1. Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
  2. Memory breakdown table: category / GiB / MiB / percentage / derivation source
  3. Per-rank nvidia-smi comparison: used and free memory per TP rank
  4. KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
  5. Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
  6. Tuning notes based on free memory and configuration
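
As an illustration of the memory breakdown table in item 2, the sketch below converts per-category GiB figures into MiB and a percentage of total HBM, and assigns whatever is unaccounted for to "other". All numbers are invented, `breakdown_rows` is a hypothetical helper, and the derivation column is omitted because it depends on which log lines were available.

```python
def breakdown_rows(categories_gib: dict[str, float], total_gib: float):
    """Return (category, GiB, MiB, percent of total) rows plus an 'other' remainder."""
    rows = [
        (name, gib, gib * 1024, 100 * gib / total_gib)
        for name, gib in categories_gib.items()
    ]
    other = total_gib - sum(categories_gib.values())
    rows.append(("other", other, other * 1024, 100 * other / total_gib))
    return rows


# Invented numbers for illustration only.
rows = breakdown_rows(
    {"model weights": 80.0, "KV pool": 40.0, "CUDA graph": 2.0, "framework": 4.0},
    total_gib=140.0,
)
print(f"{'category':<14} {'GiB':>8} {'MiB':>10} {'%':>7}")
for name, gib, mib, pct in rows:
    print(f"{name:<14} {gib:>8.2f} {mib:>10.0f} {pct:>6.1f}%")
```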

Known Limitations

| Limitation | Detail | Workaround |
| --- | --- | --- |
| SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered |
| SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention, because the framework's internal SWA window parameters are needed | Use bytes_per_full_token from the log directly |
| DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting |
| PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify --target-rank for each PP stage |
| MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |

References

  • references/log-patterns.md: log line patterns and their semantics for memory analysis.
  • references/gpu-specs.json: GPU HBM specifications for the h20, h100, h200, and b200 aliases.
  • scripts/capacity_analyzer.py: the core analysis script.