llm-serving-capacity-planner
LLM Serving Capacity Planner
Overview
Use this when a serving log has enough memory lines to explain where GPU HBM
went. The analyzer reads SGLang/vLLM startup logs, extracts weight load, KV
pool, CUDA graph, framework overhead, and token-capacity lines, then estimates
concurrent requests for common token lengths.
Confirmation Required
Before running analysis, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | — (required) |
| GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible |
| nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with `nvidia-smi` (see Step 2) | — (optional, but recommended) |
| Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | — (optional, log data used instead) |
| Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |
Workflow
Step 1: Collect the serving log
The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:
- `Load weight begin. avail mem=XX GB`
- `Memory profiling: available_gpu_memory=XX GB, ...` (newer sglang)
- `SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX` (SWA models like DeepSeek-V4)
- `Memory pool end. avail mem=XX GB`
- `Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.`
- `max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB`
- `server_args=ServerArgs(...)` (for serving parameters)
If the log is from a running instance, capture it by redirecting stdout/stderr to a file at launch time.
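Before running the full analyzer, it can help to check that a captured log actually contains the lines listed above. The following is a minimal sketch, not the analyzer's own implementation: the regular expressions, the `extract_memory_lines` helper, and the sample values are all assumptions modeled on the line shapes shown in this step, and real logs may need adjusted patterns.

```python
import re

# Hypothetical patterns modeled on the log lines listed above; adjust to the real log.
PATTERNS = {
    "load_weight_begin_gb": re.compile(r"Load weight begin\. avail mem=([\d.]+) GB"),
    "memory_pool_end_gb": re.compile(r"Memory pool end\. avail mem=([\d.]+) GB"),
    "cuda_graph_end_gb": re.compile(r"Capture cuda graph end\..*?avail mem=([\d.]+) GB"),
    "max_total_num_tokens": re.compile(r"max_total_num_tokens=(\d+)"),
}


def extract_memory_lines(log_text: str) -> dict:
    """Return the first captured number for each pattern that appears in the log."""
    found = {}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match and name not in found:
                found[name] = float(match.group(1))
    return found


# Made-up sample values, for illustration only.
SAMPLE_LOG = """\
Load weight begin. avail mem=139.20 GB
Memory pool end. avail mem=12.50 GB
Capture cuda graph end. mem usage=1.20 GB. avail mem=11.30 GB.
max_total_num_tokens=500000, max_running_requests=64, available_gpu_mem=11.30 GB
"""

parsed = extract_memory_lines(SAMPLE_LOG)
missing = sorted(set(PATTERNS) - set(parsed))
print(parsed)
print("missing lines:", missing)
```

If `missing` is non-empty, the log was probably captured after startup or truncated, and the memory breakdown will be incomplete.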
Step 2: Optionally capture nvidia-smi data
For per-rank memory comparison:
```bash
docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
```
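To sanity-check the captured file before passing it to the analyzer, the CSV rows can be parsed directly. This is a rough sketch under the assumption that each row has the `index, memory.used, memory.free` shape produced by the query above (for example `0, 70000 MiB, 10000 MiB`); the `parse_smi_csv` helper and the sample numbers are hypothetical and not part of the analyzer.

```python
import csv
import io


def parse_smi_csv(text: str) -> list:
    """Parse `index, memory.used, memory.free` rows such as `0, 70000 MiB, 10000 MiB`."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, used, free = row
        rows.append({
            "gpu": int(index),
            "used_mib": int(used.split()[0]),  # drop the trailing "MiB" unit
            "free_mib": int(free.split()[0]),
        })
    return rows


# Made-up sample in the shape nvidia-smi prints for --format=csv,noheader.
SAMPLE_SMI = """\
0, 130000 MiB, 13000 MiB
1, 129500 MiB, 13500 MiB
"""

ranks = parse_smi_csv(SAMPLE_SMI)
used_spread_mib = max(r["used_mib"] for r in ranks) - min(r["used_mib"] for r in ranks)
print(ranks)
print("used-memory spread across ranks (MiB):", used_spread_mib)
```

A large spread between ranks is a hint that memory is not evenly distributed, which matters when reading a breakdown derived from a single rank's log.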
Step 3: Run the analyzer
```bash
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json
```
For JSON output (automation):
```bash
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json
```
Step 4: Review and interpret results
The analyzer prints:
- Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
- Per-rank comparison: nvidia-smi data across all TP ranks
- KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
- Concurrency estimate: max concurrent requests for different token lengths
- Tuning notes: configuration changes that may increase capacity
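The concurrency estimate can be reproduced by hand from two values in the log. The sketch below assumes the estimate is the smaller of a token limit (`max_total_num_tokens` divided by the request token length) and a request limit (`max_running_requests`); this is an illustrative reading of the report columns, not necessarily the analyzer's exact method, and the `estimate_concurrency` helper and input values are hypothetical.

```python
def estimate_concurrency(max_total_num_tokens: int,
                         max_running_requests: int,
                         token_lengths=(4096, 6144, 8192)) -> list:
    """Estimate max concurrent requests per token length (assumed min-of-two-limits model)."""
    rows = []
    for length in token_lengths:
        token_limit = max_total_num_tokens // length  # requests that fit in the KV pool
        rows.append({
            "token_length": length,
            "token_limit": token_limit,
            "request_limit": max_running_requests,
            "max_concurrent": min(token_limit, max_running_requests),
        })
    return rows


# Made-up inputs, for illustration only.
estimates = estimate_concurrency(max_total_num_tokens=500_000, max_running_requests=64)
for row in estimates:
    print(row)
```

With these sample inputs the request limit binds at 4096 and 6144 tokens, while the KV pool becomes the bottleneck at 8192 tokens.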
When To Use It
- After launching an LLM serving instance, to understand how GPU memory is distributed
- When comparing different `--mem-fraction-static` values and their impact on KV pool capacity
- When planning deployment capacity: how many concurrent requests a given GPU configuration can support
- When investigating OOM issues: identifying which memory category is consuming the most
- When evaluating whether fp8 KV cache or EP can improve concurrency
Key Concepts
mem-fraction-static
Controls what fraction of available GPU memory after weight loading is reserved for the KV cache pool. Higher values give more KV capacity but less headroom for CUDA graph and other runtime buffers.
- `0.88` (default): aggressive; 88% of post-weight memory goes to the KV pool
- `0.60`: conservative; more free memory left for runtime, but significantly less KV capacity
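The trade-off between the two settings can be made concrete with a little arithmetic. The sketch below uses the simplified model described in this section (KV pool = fraction × memory available after weight loading); the serving framework's actual formula may differ, so treat the values reported in the log as authoritative. The `split_post_weight_memory` helper and the 60 GiB figure are hypothetical.

```python
def split_post_weight_memory(post_weight_avail_gib: float,
                             mem_fraction_static: float) -> dict:
    """Split post-weight memory into KV pool and runtime headroom (simplified model)."""
    if not 0.0 < mem_fraction_static < 1.0:
        raise ValueError("mem_fraction_static must be between 0 and 1")
    kv_pool = post_weight_avail_gib * mem_fraction_static
    return {
        "kv_pool_gib": kv_pool,
        "headroom_gib": post_weight_avail_gib - kv_pool,
    }


# Assumed example: 60 GiB still available after weights are loaded.
for fraction in (0.88, 0.60):
    split = split_post_weight_memory(60.0, fraction)
    print(f"{fraction:.2f}: KV pool {split['kv_pool_gib']:.1f} GiB, "
          f"headroom {split['headroom_gib']:.1f} GiB")
```

Under this model, lowering the fraction from 0.88 to 0.60 moves roughly a quarter of the post-weight memory from the KV pool to runtime headroom.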
KV Head Replication
When `num_key_value_heads < tp_size`, the KV cache is replicated across all TP ranks rather than split. For example, a model with `kv_heads=1, tp=8` stores a full copy of the KV cache on each of the 8 cards, 8x the per-card KV memory compared to a split scenario.
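The 8x figure follows from the standard full-attention KV size formula, `2 × layers × kv_heads × head_dim × dtype_bytes` per token. The sketch below is an illustration only: it assumes the common Hugging Face `config.json` keys (`num_hidden_layers`, `num_key_value_heads`, `head_dim` or `hidden_size / num_attention_heads`), assumes each rank holds at least one KV head, and does not apply to compressed-attention models. The `kv_bytes_per_token` helper and the example config are hypothetical.

```python
def kv_bytes_per_token(config: dict, tp_size: int, dtype_bytes: int = 2) -> dict:
    """Theoretical full-attention KV bytes per token, per TP rank (assumed model)."""
    layers = config["num_hidden_layers"]
    kv_heads = config["num_key_value_heads"]
    head_dim = config.get("head_dim") or config["hidden_size"] // config["num_attention_heads"]

    # Assumption: a rank always stores at least one KV head, so heads are
    # replicated once tp_size exceeds the number of KV heads.
    heads_per_rank = max(1, kv_heads // tp_size)
    replication_factor = max(1, tp_size // kv_heads)

    per_rank = 2 * layers * heads_per_rank * head_dim * dtype_bytes  # K and V
    ideal_split = 2 * layers * kv_heads * head_dim * dtype_bytes / tp_size
    return {
        "per_rank_bytes": per_rank,
        "ideal_split_bytes": ideal_split,
        "replication_factor": replication_factor,
    }


# Made-up config with a single KV head, served with tp=8 and a 2-byte KV dtype.
example_config = {
    "num_hidden_layers": 32,
    "num_key_value_heads": 1,
    "num_attention_heads": 32,
    "hidden_size": 4096,
}
replicated = kv_bytes_per_token(example_config, tp_size=8)
print(replicated)
```

Here each rank stores 16384 bytes per token instead of the 2048 bytes an even split would need, which is the 8x overhead described above. Switching to a 1-byte KV dtype (`dtype_bytes=1`) halves the per-rank figure under the same assumptions.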
SWA (Sliding Window Attention) Compression
Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA
(Hierarchical Context Attention) with sliding windows. This reduces per-token
KV cache bytes compared to the theoretical full-attention calculation. The
`bytes_per_full_token` reported in the log already accounts for this
compression.
Reporting Checklist
Include:
- Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
- Memory breakdown table: category / GiB / MiB / percentage / derivation source
- Per-rank nvidia-smi comparison: used and free memory per TP rank
- KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
- Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
- Tuning notes based on free memory and configuration
Known Limitations
| Limitation | Detail | Workaround |
|---|---|---|
| SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered |
| SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention; the framework's internal SWA window parameters are needed | Use `bytes_per_full_token` from the log directly |
| DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting |
| PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify |
| MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |
References
- `references/log-patterns.md`: log line patterns and their semantics for memory analysis.
- `references/gpu-specs.json`: GPU HBM specifications for `h20`, `h100`, `h200`, and `b200` aliases.
- `scripts/capacity_analyzer.py`: the core analysis script.