perf-host-analysis
Host Performance Analysis
Analyze host/CPU overhead in TensorRT-LLM inference workloads from nsys traces. This skill operates in two phases:
| Phase | Question | Input | Output |
|---|---|---|---|
| Detection | Is host overhead the bottleneck? | Single nsys trace | YES/NO verdict with metric evidence |
| Root Cause | What specifically regressed? | One or two nsys traces | NVTX per-step breakdown, regression sources, optional kernel-level drill-down |
When to Use
- Before starting host optimization work -- confirms the bottleneck is real (Detection)
- As a sub-step of `perf-analysis` for bottleneck classification (Detection)
- When GPU utilization is suspiciously low and you need to know why (Detection)
- When throughput regressed but GPU kernel execution times are unchanged (Root Cause)
- When the gap between forward step iterations has increased (Root Cause)
- To compare inter-iteration overhead between two versions of the inference engine (Root Cause)
- When you need sub-operation granularity on inter-kernel gaps or graph coverage (Root Cause, kernel-level drill-down)
- When piecewise CUDA graph coverage is unexpectedly low (Root Cause, kernel-level drill-down)
- When multi-rank inference shows unexplained performance asymmetry (Root Cause, kernel-level drill-down)
Do NOT use when:
- The regression is in individual kernel performance (use `perf-nsight-compute-analysis`)
- You need to profile a workload from scratch (use `workload-instrumentation` first)
- The issue is NCCL communication (use distributed analysis)
Prerequisites
- An nsys trace file (`.sqlite` or `.nsys-rep`) from a TRT-LLM benchmark run
- For Root Cause comparison: two traces (baseline and target)
- Python 3 with sqlite3 support
Key Concepts
Host Overhead in LLM Inference
In an LLM inference loop, each iteration consists of:
```
[inter-step gap] -> [_forward_step] -> [inter-step gap] -> [_forward_step] -> ...
```
The forward step includes GPU kernel execution (GEMM, attention, normalization, allreduce) plus host-side preparation. The inter-step gap includes host-side work between forward steps (scheduling, request fetching, broadcasting, sampling, response handling).
See references/trtllm-nvtx-ranges.md for the full per-operation breakdown and timing ranges.
Hidden vs Exposed Host Overhead
Host overhead only hurts performance when it is exposed -- the GPU is idle waiting for work. When host prep overlaps with GPU execution, it is hidden and free. See references/metrics.md (M3 section) for diagrams and the exposed/hidden computation.
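The exposed/hidden split can be illustrated with plain interval arithmetic. The sketch below is an assumption-laden illustration, not the M3 formula from references/metrics.md: `merge_intervals`, `overlap_ns`, and `exposed_ratio` are hypothetical helper names, and the ratio is simply the share of host prep time during which no GPU kernel is running.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals; returns a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_ns(host_ranges, gpu_busy):
    """Total time during which host ranges overlap merged GPU-busy intervals."""
    total = 0
    for h_start, h_end in merge_intervals(host_ranges):
        for g_start, g_end in gpu_busy:
            total += max(0, min(h_end, g_end) - max(h_start, g_start))
    return total


def exposed_ratio(host_ranges, gpu_kernels):
    """Fraction of host prep time NOT covered by GPU execution."""
    host_total = sum(end - start for start, end in merge_intervals(host_ranges))
    if host_total == 0:
        return 0.0
    hidden = overlap_ns(host_ranges, merge_intervals(gpu_kernels))
    return (host_total - hidden) / host_total


# Host prep runs from 0 to 100; the GPU is busy only from 60 to 150,
# so 40 units of host prep are hidden and 60 units are exposed.
ratio = exposed_ratio([(0, 100)], [(60, 150)])
print(f"exposed ratio: {ratio:.2f}")
```

Merging intervals first keeps overlapping kernels on different streams from being counted twice.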
Forward Step Isolation
In TP configurations, forward steps are isolated via allreduce kernel grouping (deterministic count per transformer layer). For TP=1, `_forward_step` NVTX ranges are used directly. See references/iteration-isolation-techniques.md for the full algorithm.
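As a rough illustration of allreduce-based grouping, the sketch below chunks time-ordered allreduce kernels into fixed-size groups. It is a simplification and not the algorithm from references/iteration-isolation-techniques.md: `group_forward_steps` is a hypothetical helper, and the default of two allreduces per transformer layer is an assumption that depends on the model and parallelism setup.

```python
def group_forward_steps(allreduce_kernels, num_layers, allreduces_per_layer=2):
    """Chunk time-ordered allreduce kernels into forward steps.

    allreduce_kernels: list of (start_ns, end_ns) tuples.
    Returns one (first_start, last_end) window per complete step;
    a trailing incomplete group is dropped.
    """
    per_step = num_layers * allreduces_per_layer  # assumed deterministic count
    ordered = sorted(allreduce_kernels)
    steps = []
    for i in range(0, len(ordered) - per_step + 1, per_step):
        chunk = ordered[i:i + per_step]
        steps.append((chunk[0][0], max(end for _, end in chunk)))
    return steps


# Nine synthetic allreduce kernels, each 5 ns long, for a 2-layer model:
# two complete steps of four kernels plus one leftover kernel.
starts = (0, 10, 20, 30, 100, 110, 120, 130, 200)
steps = group_forward_steps([(t, t + 5) for t in starts], num_layers=2)
print(steps)
```

The returned windows cover only the span between the first and last allreduce of a step, so kernels before the first allreduce of a layer stack would need extra handling in a real analysis.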
Phase Classification (Context vs Generation)
Iterations are classified by NVTX marker text into context (eager, no CUDA graphs) and generation (CUDA graph replay). Per-phase analysis is critical because aggregate metrics can mask phase-specific bottlenecks. See references/phase-classification.md.
Phase 1: Detection (YES/NO Verdict)
Determine whether host overhead is the primary bottleneck.
Detection Metrics
Six metrics in four categories (M1 through M4), plus M5 as a caveat check for communication. See references/metrics.md for full definitions, formulas, and SQL queries.
| # | Metric | Threshold | What it answers |
|---|---|---|---|
| M1 | GPU idle ratio | > 0.30 | Is the GPU starved for work? |
| M2 | Launch overhead ratio | > 0.10 | Is kernel launch itself expensive? |
| M3a | Host prep exposed ratio | > 0.50 | How well is host prep pipelined? |
| M3b | Host prep perf impact | > 0.05 | How much throughput does exposed prep cost? |
| M3c | Host prep idle attribution | > 0.50 | Is host prep the main cause of GPU idle? |
| M4 | GPU utilization | < 0.60 | Is GPU utilization too low? |
| M5 | NCCL ratio (caveat) | > 0.20 | Is communication a confounding factor? |
Host prep confirmation rule: Host prep is a confirmed bottleneck only when both M3b AND M3c cross their thresholds.
Thresholds are configurable with per-phase variants. See references/thresholds.md.
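The threshold table and the host prep confirmation rule can be expressed as a small lookup. This is a minimal sketch using the aggregate thresholds listed above; `THRESHOLDS`, `crossed_metrics`, and `host_prep_confirmed` are hypothetical names, the sample values are made up, and the sketch does not reproduce how `detect_host_overhead.py` combines metrics into a verdict.

```python
import operator

# Aggregate thresholds copied from the table above: (comparison, limit).
THRESHOLDS = {
    "M1": (operator.gt, 0.30),
    "M2": (operator.gt, 0.10),
    "M3a": (operator.gt, 0.50),
    "M3b": (operator.gt, 0.05),
    "M3c": (operator.gt, 0.50),
    "M4": (operator.lt, 0.60),
    "M5": (operator.gt, 0.20),
}


def crossed_metrics(values):
    """Return the sorted names of metrics whose value crosses its threshold."""
    return sorted(
        name
        for name, (compare, limit) in THRESHOLDS.items()
        if name in values and compare(values[name], limit)
    )


def host_prep_confirmed(values):
    """Host prep is confirmed only when both M3b and M3c cross."""
    crossed = crossed_metrics(values)
    return "M3b" in crossed and "M3c" in crossed


# Made-up metric values for illustration.
sample = {"M1": 0.42, "M2": 0.04, "M3a": 0.70, "M3b": 0.08, "M3c": 0.35, "M4": 0.55}
print(crossed_metrics(sample))
print(host_prep_confirmed(sample))
```

In this sample M3b crosses but M3c does not, so host prep stays unconfirmed even though the GPU idle ratio is high.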
Detection Workflow
Step 1: Input Validation
```bash
# Accept .sqlite or .nsys-rep
ls -la <trace_file>

# If .nsys-rep, export to SQLite first
nsys export -t sqlite -o <output.sqlite> <input.nsys-rep>
```
Step 2: Run Detection Script
```bash
python scripts/detect_host_overhead.py \
  --trace /path/to/trace.sqlite \
  --output /path/to/verdict.json
```
The script computes M1, M2, M4, M5 from SQL, optionally M3 via range intersection, applies the verdict logic, and outputs structured JSON. See references/output-format.md for the output schema.
For manual metric extraction via SQL, see references/nsys-schema.md.
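For a sense of what manual extraction looks like, the sketch below computes a GPU idle ratio from the `start` and `end` columns of the `CUPTI_ACTIVITY_KIND_KERNEL` table using Python's `sqlite3` module. It is an illustration under assumptions: `gpu_idle_ratio` is a hypothetical helper, it measures the whole span from first to last kernel rather than isolated forward steps, and the exact M1 definition is the one in references/metrics.md. An in-memory table stands in for a real exported trace.

```python
import sqlite3


def gpu_idle_ratio(conn):
    """Idle fraction of the span between the first kernel start and last kernel end."""
    rows = conn.execute(
        'SELECT start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL ORDER BY start'
    ).fetchall()
    if not rows:
        return None
    busy = 0
    cur_start, cur_end = rows[0]
    for start, end in rows[1:]:
        if start <= cur_end:  # overlapping kernels, e.g. on different streams
            cur_end = max(cur_end, end)
        else:
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
    busy += cur_end - cur_start
    span = cur_end - rows[0][0]
    return 1.0 - busy / span if span else 0.0


# Tiny in-memory stand-in for an exported trace (timestamps in ns).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER)')
conn.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?)",
    [(0, 40), (30, 50), (80, 100)],
)
idle = gpu_idle_ratio(conn)
print(f"GPU idle ratio: {idle:.2f}")
```

Against a real trace, replace the in-memory connection with `sqlite3.connect("/path/to/trace.sqlite")` and restrict the query to the steady-state window first.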
Step 3: Interpret Verdict
Overall Verdict:
```
if aggregate_verdict == YES or context_verdict == YES or generation_verdict == YES:
    overall_verdict = YES
```
Per-phase analysis can elevate the verdict but never demote it.
Format using the template in references/output-format.md.
Next Steps:
- If YES -> Proceed to Phase 2 (Root Cause) below, then use the `perf-host-optimization` skill
- If NO -> Use `perf-nsight-compute-analysis` for kernel SOL% or `trace-interpretation` for full classification
Phase 2: Root Cause Analysis
Identify which specific host operations regressed and by how much. Works with a single trace (breakdown) or two traces (comparison).
Principles
- Isolate forward steps, not the full trace. nsys traces contain warmup, JIT, model loading, and teardown.
- Use structural kernel patterns for iteration detection. Allreduce grouping is more robust than kernel density.
- Compare steady-state iterations. Filter to identical workload (same batch size, same ctx/gen mix).
- Per-step metrics, not totals. Always compare per-step averages.
Root Cause Workflow
Step 1: Collect nsys Traces
Profile both versions (if comparing) with identical settings:
```bash
nsys profile -o /path/to/trace \
  -t cuda,nvtx,osrt \
  --force-overwrite=true \
  --cuda-memory-usage=true \
  -w true \
  <benchmark_command> --num_requests 500
```
Step 2: Export to SQLite
```bash
nsys export --type=sqlite --force-overwrite=true -o trace.sqlite trace.nsys-rep
```
Step 3: Run Host Overhead Analysis
```bash
# Two-trace comparison
python scripts/analyze_host_overhead.py \
  --baseline /path/to/baseline/trace.sqlite \
  --target /path/to/target/trace.sqlite \
  --baseline-label "v1.1" \
  --target-label "main" \
  --output /path/to/output/analysis.txt

# Single-trace breakdown
python scripts/analyze_host_overhead.py \
  --baseline /path/to/trace.sqlite \
  --baseline-label "current"
```
Step 4: Interpret Results
The script produces:
- Allreduce-based iteration detection -- confirms forward step boundaries
- Per-step wall time comparison -- quantifies the regression
- NVTX per-step breakdown -- identifies which host operations regressed
- GPU kernel comparison -- confirms GPU execution is unchanged
- CUDA API comparison -- detects kernel launch overhead changes
Reading the Output
Per-Step Wall Time:
```
Avg wall time per step: 3,317 us (baseline) vs 3,978 us (target) +19.9%
```
This is the primary regression metric.
NVTX Breakdown:
```
Operation | baseline (us/step) | target (us/step) | Delta | Status
_fetch_new_requests | 36 | 270 | +234 | REGRESSION
broadcast_requests | - | 250 | +250 | NEW
_update_requests | 413 | 723 | +310 | REGRESSION
```
Focus on operations with large absolute deltas.
GPU Kernel Comparison:
```
Kernels per step (launched): 6.2 (baseline) vs 21.9 (target) +253%
```
More individual launches = more host-side launch overhead.
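The deltas and percentages in the sample output can be reproduced with a few lines of arithmetic. This sketch is illustrative only: `classify` and `pct_change` are hypothetical helpers, and the 50 us cutoff for labelling a REGRESSION (as well as the "OK" label) is an assumption, not the rule used by `analyze_host_overhead.py`.

```python
def classify(baseline_us, target_us, min_delta_us=50):
    """Label one operation; baseline_us is None when the range is absent."""
    if baseline_us is None:
        return target_us, "NEW"
    delta = target_us - baseline_us
    return delta, "REGRESSION" if delta >= min_delta_us else "OK"


def pct_change(baseline, target):
    """Relative change from baseline to target, in percent."""
    return (target - baseline) / baseline * 100.0


# Per-step averages (us/step) taken from the sample output above.
operations = {
    "_fetch_new_requests": (36, 270),
    "broadcast_requests": (None, 250),
    "_update_requests": (413, 723),
}
report = {name: classify(b, t) for name, (b, t) in operations.items()}

# Largest absolute delta first, matching the triage advice above.
for name, (delta, status) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:22s} {delta:+d} us/step  {status}")
print(f"wall time: {pct_change(3317, 3978):+.1f}%")
print(f"kernels per step: {pct_change(6.2, 21.9):+.0f}%")
```

Sorting by absolute delta rather than by ratio keeps small operations with large relative changes from crowding out the ones that cost the most time per step.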
Step 5: Kernel-Level Drill-Down (Optional)
When the NVTX breakdown identifies a regressing operation but does not reveal why (the overhead is inside the GPU dispatch, not between NVTX ranges), drill below NVTX operations into individual GPU kernel launches.
See references/kernel-level-analysis.md for full technique details, SQL queries, and examples.
When to drill down:
- An operation has high wall time but the overhead is inside GPU dispatch, not between NVTX ranges
- You need to understand how much of the forward pass is graph-captured vs eager
- Per-layer overhead is significant and you need to map kernels to functional groups
- Multi-rank inference shows unexplained performance asymmetry
Kernel-Level Techniques
| Technique | Question | Key Output |
|---|---|---|
| Inter-Kernel Gap Analysis | Where is the GPU idle between kernels? | Gap bucket distribution, top-N largest gaps with source mapping |
| Eager vs Graph Classification | What fraction of kernels are graph-captured? | Graph coverage ratio, list of eager kernels with source attribution |
| Repeating-Pattern Mapping | Which functional group within a layer has the most overhead? | Per-group gap totals, priority ranking |
| Straggler Detection | Is one rank consistently slower? | Straggler rank ID, root cause (extra host work, queue depth feedback loop) |
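One simple way to read "consistently slower" in the Straggler Detection row is to count how often each rank has the longest step. The sketch below takes that view; it is not the method from references/kernel-level-analysis.md. `find_straggler` is a hypothetical helper, the 0.8 cutoff is an assumption, and the per-step wall times are made up.

```python
def find_straggler(step_times_by_rank, min_fraction=0.8):
    """Return the rank that is slowest in at least min_fraction of steps, else None."""
    ranks = sorted(step_times_by_rank)
    num_steps = min(len(times) for times in step_times_by_rank.values())
    if num_steps == 0:
        return None
    slowest_counts = dict.fromkeys(ranks, 0)
    for i in range(num_steps):
        slowest = max(ranks, key=lambda r: step_times_by_rank[r][i])
        slowest_counts[slowest] += 1
    rank, count = max(slowest_counts.items(), key=lambda kv: kv[1])
    return rank if count / num_steps >= min_fraction else None


# Hypothetical per-step wall times in microseconds for a TP=2 run.
skewed = {0: [3400, 3420, 3390, 3410], 1: [3300, 3310, 3305, 3300]}
balanced = {0: [3300, 3320, 3300, 3310], 1: [3310, 3300, 3305, 3300]}
print(find_straggler(skewed))
print(find_straggler(balanced))
```

Counting per-step winners instead of comparing means makes the check less sensitive to a single outlier step on an otherwise healthy rank.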
Workflow
- Start with Inter-Kernel Gap Analysis — bucket the gap distribution to understand the dominant overhead type (graph dispatch, Python interpreter, host-device sync)
- If piecewise graph is in use, run Eager vs Graph Classification to measure graph coverage and identify unnecessary eager kernels
- For per-layer overhead, use Repeating-Pattern Mapping to isolate the highest-overhead functional group within a single layer
- For multi-rank setups, run Straggler Detection if per-step wall time varies across ranks
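The first step of this workflow, bucketing the gap distribution, might look like the sketch below. The bucket edges are hypothetical and should be tuned per workload; `kernel_gaps_us` and `bucket_gaps` are illustrative names, and the mapping from a bucket to an overhead type (graph dispatch, Python interpreter, host-device sync) is left to references/kernel-level-analysis.md.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges in microseconds; tune them for the workload.
EDGES_US = [5, 20, 100]
LABELS = ["<5us", "5-20us", "20-100us", ">=100us"]


def kernel_gaps_us(kernels):
    """Idle gaps (us) between consecutive kernels given (start_ns, end_ns) tuples."""
    gaps = []
    ordered = sorted(kernels)
    if not ordered:
        return gaps
    latest_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > latest_end:  # GPU was idle between the two kernels
            gaps.append((start - latest_end) / 1000.0)
        latest_end = max(latest_end, end)
    return gaps


def bucket_gaps(gaps_us):
    """Count gaps per bucket."""
    return Counter(LABELS[bisect_right(EDGES_US, gap)] for gap in gaps_us)


# Synthetic kernel intervals in nanoseconds.
kernels = [(0, 10_000), (12_000, 20_000), (50_000, 60_000), (260_000, 270_000)]
gaps = kernel_gaps_us(kernels)
print(gaps)
print(bucket_gaps(gaps))
```

Tracking the latest end time instead of the previous kernel's end avoids reporting a gap while a longer overlapping kernel is still running.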
Kernel-Level Findings to Optimization Patterns
| Finding | Optimization Pattern |
|---|---|
| Large gaps from Python tensor view chains | CUSTOM_OP — replace with C++ custom op |
| Graph-capturable kernels running eagerly | GRAPH_EXPAND — fix partition poisoning |
| Monolithic custom op blocking graph capture | GRAPH_SPLIT — split into capturable + eager parts |
| Host-device sync in per-layer code | SYNC (Pattern 1: pre-compute on CPU) + HOIST (Variant B: pass from step level) |
| Per-layer buffer allocation | ALLOC — pre-allocate at init |
| Straggler rank with extra host work | Apply targeted optimization to coordinator-only code paths |
Common Patterns and Root Causes
Pattern 1: Request Management Refactor
Symptom: `_fetch_new_requests` regressed 5-10x, new `broadcast_requests` operation.
Cause: Request fetching refactored for multi-rank broadcasting in TP.
Mitigation: Optimize broadcast path; batch request state updates.
Pattern 2: Increased Kernel Launch Count
Symptom: 3-5x more `cudaLaunchKernel` calls per step, similar GPU time.
Cause: Operations that were fused or graph-captured are now individual launches.
Mitigation: Re-fuse kernels; extend CUDA graph capture scope.
Pattern 3: New Bookkeeping Operations
Symptom: New NVTX ranges like `_write_finish_reasons`, `handle_additional_outputs`.
Cause: New features added to the inference loop without overhead budgeting.
Mitigation: Defer non-critical bookkeeping to async paths; batch updates.
Pattern 4: Flashinfer JIT Warmup Masquerading as Inference
Symptom: Massive elementwise/reduce kernel counts in "steady state" analysis.
Cause: Analysis window includes flashinfer JIT compilation phase.
Fix: Use allreduce-based iteration isolation, not kernel density or time windows.
Pattern 5: Context-Only Bottleneck (Masked by Aggregate)
Symptom: Aggregate metrics below threshold, but context iterations have 50% GPU idle.
Cause: Generation iterations dilute the context-phase bottleneck.
Fix: Per-phase analysis in Detection phase catches this.
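A small worked example shows how dilution hides a context-phase problem. The numbers are invented, only the GPU idle ratio is considered, and the aggregate M1 threshold of 0.30 is reused for each phase for simplicity; the per-phase thresholds in references/thresholds.md may differ.

```python
M1_THRESHOLD = 0.30  # aggregate GPU idle ratio threshold from the Detection table


def idle_ratio(idle_ms, total_ms):
    """GPU idle time as a fraction of wall time."""
    return idle_ms / total_ms


# Hypothetical trace: context iterations are short but half idle;
# generation iterations dominate the wall time and are mostly busy.
context = {"idle_ms": 100.0, "total_ms": 200.0}
generation = {"idle_ms": 360.0, "total_ms": 1800.0}

context_ratio = idle_ratio(**context)
generation_ratio = idle_ratio(**generation)
aggregate_ratio = idle_ratio(
    context["idle_ms"] + generation["idle_ms"],
    context["total_ms"] + generation["total_ms"],
)

verdicts = {
    "aggregate": aggregate_ratio > M1_THRESHOLD,
    "context": context_ratio > M1_THRESHOLD,
    "generation": generation_ratio > M1_THRESHOLD,
}
# Any phase crossing its threshold elevates the overall verdict.
overall = "YES" if any(verdicts.values()) else "NO"
print(f"aggregate={aggregate_ratio:.2f} context={context_ratio:.2f} "
      f"generation={generation_ratio:.2f} overall={overall}")
```

Here the aggregate ratio stays under the threshold because generation accounts for 90% of the wall time, yet the context phase alone is enough to flag the trace.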
Pitfalls
1. shortName is an Integer ID
In `CUPTI_ACTIVITY_KIND_KERNEL`, `shortName` is an integer referencing `StringIds.id`. Always join. See references/nsys-schema.md.
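A minimal sketch of the join, run against an in-memory database that mimics the two tables. The kernel names and rows are made up for illustration; only the `shortName` to `StringIds.id` relationship comes from the text above.

```python
import sqlite3

# Resolve integer shortName IDs to kernel names and count launches per kernel.
QUERY = """
SELECT s.value AS kernel_name, COUNT(*) AS launches
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY launches DESC, kernel_name
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER, shortName INTEGER);
INSERT INTO StringIds VALUES (1, 'gemm_kernel'), (2, 'allreduce_kernel');
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (0, 10, 1), (10, 20, 1), (20, 25, 2);
""")
kernel_counts = conn.execute(QUERY).fetchall()
print(kernel_counts)
```

Filtering on `k.shortName = 'gemm_kernel'` without the join would silently match nothing, because the column holds IDs rather than text.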
2. NVTX textId vs text
Most NVTX events have `textId` (integer) but NULL `text`. Join with `StringIds`. See references/nsys-schema.md.
3. Duplicate NVTX Ranges from TP Ranks
In TP configurations, each rank reports NVTX ranges independently. De-duplicate by grouping entries within 100us of each other.
4. Negative Inter-Step Gaps
When TP ranks report overlapping NVTX ranges, `gap = next_start - prev_end` can be negative. Use the maximum end time when de-duplicating.
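The two de-duplication pitfalls can be handled together. The sketch below groups ranges whose start times fall within 100 us of the first entry in a group, keeps the maximum end time, and then computes gaps. `dedup_ranges` and `inter_step_gaps` are hypothetical helpers, and anchoring the tolerance to the first start in each group is an assumption about how "within 100us of each other" is applied.

```python
def dedup_ranges(ranges, tolerance_ns=100_000):
    """Collapse per-rank copies of a range whose starts are within tolerance_ns.

    Keeps the earliest start and the maximum end of each group.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][0] <= tolerance_ns:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def inter_step_gaps(steps):
    """gap = next_start - prev_end for consecutive steps."""
    return [nxt[0] - prev[1] for prev, nxt in zip(steps, steps[1:])]


# Two TP ranks report the same two forward steps with small skew (ns).
raw = [
    (0, 3_000_000), (20_000, 3_050_000),
    (3_100_000, 6_000_000), (3_120_000, 6_010_000),
]
naive_gaps = inter_step_gaps(sorted(raw))  # duplicates produce negative gaps
steps = dedup_ranges(raw)
print(naive_gaps)
print(steps)
print(inter_step_gaps(steps))
```

Without de-duplication, consecutive entries from different ranks are treated as separate steps and most computed gaps come out negative.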
5. Benchmark Window Selection
The allreduce-based window captures context+generation phases; steady-state NVTX filtering captures generation-only. Both are valid; use the appropriate one for your comparison goal.
Handoff to Optimization
When analysis is complete and the verdict is YES, hand off to the `perf-host-optimization` skill with:
- Detection verdict and evidence: Which metrics crossed thresholds (M1-M5), whether host prep was confirmed (M3b+M3c), and per-phase breakdown.
- NVTX-based triage (from Root Cause): Top regressing operations by absolute delta (us/step). Map NVTX range names to source functions -- see references/trtllm-nvtx-ranges.md.
- Handoff data block: Include structured data from references/output-format.md (see "Handoff to Optimization" section).
- Kernel-level findings (from drill-down, if performed): Inter-kernel gap distribution, graph coverage ratio, per-group overhead map, and straggler rank identification. Map findings to optimization patterns using the table in the Root Cause kernel-level drill-down section above.
Reference
| File | Contents |
|---|---|
| references/metrics.md | Full metric definitions, formulas, SQL queries, M3 sub-metric analysis |
| references/thresholds.md | Aggregate and per-phase threshold tables |
| references/phase-classification.md | NVTX marker parsing, iteration classification, per-phase aggregation |
| references/output-format.md | Report template and integration JSON schema |
| references/examples.md | Worked scenarios (aggregate, phase-specific, and case study) |
| references/iteration-isolation-techniques.md | Allreduce, NVTX, and kernel-density iteration isolation techniques |
| references/trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference with per-operation timings |
| references/kernel-level-analysis.md | Kernel-level drill-down techniques: gap analysis, graph classification, pattern mapping, straggler detection |
| references/nsys-schema.md | nsys SQLite schema reference and useful queries |
| scripts/analyze_host_overhead.py | Python script for Phase 2 root cause analysis |
| scripts/detect_host_overhead.py | Python script for Phase 1 detection verdict |