Names referenced: `perf-host-analysis`, `_prepare_tp_inputs`, `top_regressing_ops`, `_fetch_new_requests`, `broadcast_requests`, `_update_requests`

Line profiler environment variables: `TLLM_LINE_PROFILER_ENABLED=True`, `TLLM_LINE_PROFILER_PATH`, `TLLM_LINE_PROFILER_FUNCTIONS`

CPU affinity and NUMA checks: `taskset -p <pid>`, `numactl --show`, `numactl --cpunodebind=<node> --membind=<node>`

Profiling runs, one suffix per round:

    EXTRA_SUFFIX=round0_baseline bash profile.sh
    EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh

    FOR round = 1 to MAX_ROUNDS:
      1. PROFILE (with Drill-Down)
      2. ANALYZE (Multi-Option)
      3. OPTIMIZE (Apply Change; prefer zero-risk first)
      4. TEST (Unit Test Validation)
      5. VALIDATE (Re-Profile; expect bottleneck landscape to shift)
      6. REPORT
    END FOR → FINAL SUMMARY
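
The round loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the skill's implementation: `RoundReport`, `run_rounds`, and all five callables are hypothetical names, and reverting a change that passes tests but does not improve the profile is an assumption rather than something the loop spells out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RoundReport:
    round_no: int
    change: str
    before_ms: float
    after_ms: float
    kept: bool


def run_rounds(
    profile: Callable[[], float],             # hypothetical: measured host time in ms
    propose: Callable[[int], Optional[str]],  # hypothetical: next change, or None when out of options
    apply: Callable[[str], None],
    rollback: Callable[[str], None],
    tests_pass: Callable[[], bool],
    max_rounds: int = 5,
) -> List[RoundReport]:
    reports: List[RoundReport] = []
    baseline = profile()                      # round 0 baseline (1. PROFILE)
    for round_no in range(1, max_rounds + 1):
        change = propose(round_no)            # 2. ANALYZE
        if change is None:
            break
        apply(change)                         # 3. OPTIMIZE
        if not tests_pass():                  # 4. TEST
            rollback(change)
            reports.append(RoundReport(round_no, change, baseline, baseline, False))
            continue
        after = profile()                     # 5. VALIDATE (also the next round's profile)
        kept = after < baseline
        if not kept:
            rollback(change)                  # assumption: revert changes that do not help
        reports.append(RoundReport(round_no, change, baseline, after, kept))  # 6. REPORT
        if kept:
            baseline = after
    return reports
```

Passing the phases in as callables keeps the sketch independent of any particular profiler or test runner.
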

    Line #      Hits           Time     Per Hit   % Time  Line Contents
    ==============================================================
      2848      4100  59200000000.0  14439024.4     98.7  output = self.model_engine.forward(...)

`tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs`, `TLLM_LINE_PROFILER_FUNCTIONS`

| Type | Indicators | Severity |
|---|---|---|
| HOST_SYNC | … in the per-layer forward path | Critical |
| SYNC | … in step-level code | Critical |
| CUSTOM_OP | Chain of Python tensor ops (view/slice/cast) before kernel launch | Critical |
| GRAPH_BREAK | Op that prevents CUDA graph capture of surrounding code (fix via GRAPH_EXPAND / GRAPH_SPLIT) | High |
| ALLOC | … inside loops | High |
| HOIST | Per-layer recomputation of step-invariant values | High |
| PYLOOP | … with high iteration counts | High |
| REDUNDANT_ITER | Multiple passes over the same collection | High |
| DEAD_WORK | Object construction whose results are always discarded | High |
| CONTAINER | Dict/set lookups in hot loops | Medium |
| FUNCALL | Repeated method/property calls | Medium |
| COMM | … in TP/PP paths | Medium |
| GIL | Lock/queue contention | Medium |
| SERIALIZE | … in request processing | Medium |
| GC | Periodic latency spikes, non-deterministic pauses (tail latency) | Low |
| COMPUTE | Actual computation (may not be optimizable) | Low |
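
As one concrete instance of the table above, REDUNDANT_ITER (multiple passes over the same collection) can be removed by folding the passes into a single loop. The sketch below is illustrative only: `Request` and both function names are hypothetical and are not taken from TensorRT-LLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:  # hypothetical stand-in for a scheduled request
    request_id: int
    num_tokens: int
    is_context: bool


def summarize_three_passes(requests: List[Request]) -> Tuple[List[int], List[int], int]:
    # REDUNDANT_ITER: the same list is walked three times.
    ctx_ids = [r.request_id for r in requests if r.is_context]
    gen_ids = [r.request_id for r in requests if not r.is_context]
    total_tokens = sum(r.num_tokens for r in requests)
    return ctx_ids, gen_ids, total_tokens


def summarize_one_pass(requests: List[Request]) -> Tuple[List[int], List[int], int]:
    # Same result, one walk over the list.
    ctx_ids: List[int] = []
    gen_ids: List[int] = []
    total_tokens = 0
    for r in requests:
        (ctx_ids if r.is_context else gen_ids).append(r.request_id)
        total_tokens += r.num_tokens
    return ctx_ids, gen_ids, total_tokens
```

Both functions return identical results, so a unit test comparing them is a cheap correctness check before re-profiling.
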
| Option | Description | Estimated Savings | Risk | Complexity |
|---|---|---|---|---|
| A | ... | ... | Low/Med/High | ... |
| B | ... | ... | ... | ... |
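
One way to apply the "prefer zero-risk first" rule to an option table like the one above is to sort by risk and break ties by estimated savings. This is a sketch under assumptions: the `Option` fields, the `RISK_ORDER` mapping, and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumed ordering of the Low/Med/High risk labels used in the table.
RISK_ORDER = {"Low": 0, "Med": 1, "High": 2}


@dataclass
class Option:
    name: str
    description: str
    est_savings_pct: float  # hypothetical unit: percent of host time saved
    risk: str               # "Low", "Med", or "High"


def rank_options(options: List[Option]) -> List[Option]:
    # Lowest risk first; within the same risk, largest estimated savings first.
    return sorted(options, key=lambda o: (RISK_ORDER[o.risk], -o.est_savings_pct))


# Made-up options, only to show the ordering.
candidates = [
    Option("A", "rewrite hot loop", 30.0, "High"),
    Option("B", "cache a property lookup", 5.0, "Low"),
    Option("C", "merge two passes over the list", 12.0, "Low"),
]
ranked_names = [o.name for o in rank_options(candidates)]
```

If a round fails its tests, the next entry in the ranked list is the natural fallback.
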

**Running tests:**
For the full UT-to-file mapping, see [references/hot-path-files.md](references/hot-path-files.md).
**If tests fail:**
1. Read the failure message
2. Roll back immediately (`git checkout -- <file>`)
3. Analyze why the optimization broke correctness
4. Try the next-best option from Phase 2
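
The failure-handling steps above can be wrapped in a small helper. This is a minimal sketch, assuming the tests run as a single command and the change is confined to a known list of files. `validate_or_rollback` is a hypothetical name, and the `run` parameter exists only so the helper can be exercised without touching a real repository.

```python
import subprocess
from typing import Callable, Sequence


def validate_or_rollback(
    test_cmd: Sequence[str],
    changed_files: Sequence[str],
    run: Callable = subprocess.run,
) -> bool:
    """Return True if the tests pass; otherwise restore the changed files."""
    result = run(list(test_cmd), capture_output=True, text=True)
    if result.returncode == 0:
        return True
    # 1. Surface the tail of the failure output so it can be read.
    print((result.stdout or "")[-2000:])
    # 2. Roll back immediately, one file at a time.
    for path in changed_files:
        run(["git", "checkout", "--", path], check=True)
    # 3 and 4 (analyze, then try the next-best option) are left to the caller.
    return False


# Hypothetical usage; both placeholders must be filled in by the caller:
# ok = validate_or_rollback(["pytest", "<test_file>"], ["<file>"])
```
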

`round<N>_<description>`

    Timer unit: 1e-06 s

    Total time: 1.234 s
    File: /path/to/file.py
    Function: my_function at line 100

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       100                                           def my_function(self):
       101       500     890000.0   1780.0     72.1      result = tensor.item()
       102       500     234567.0    469.1     19.0      return result

`for x in range(1):`
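
A report in this layout can be ranked mechanically to find the line to drill into next. The parser below is a sketch that assumes the six-column row format shown in the sample output. `LineStat`, `parse_rows`, and `hottest` are hypothetical helper names, not part of any profiler API.

```python
import re
from typing import List, NamedTuple

# Matches rows that carry statistics: line number, hits, time, per hit, % time, source.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(.*)$")


class LineStat(NamedTuple):
    line_no: int
    hits: int
    time: float
    per_hit: float
    pct_time: float
    source: str


def parse_rows(report: str) -> List[LineStat]:
    stats = []
    for line in report.splitlines():
        m = ROW.match(line)
        if m:  # header, separator, and never-executed lines do not match
            stats.append(LineStat(int(m[1]), int(m[2]), float(m[3]),
                                  float(m[4]), float(m[5]), m[6].strip()))
    return stats


def hottest(report: str, top: int = 3) -> List[LineStat]:
    return sorted(parse_rows(report), key=lambda s: s.pct_time, reverse=True)[:top]


REPORT = """\
Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           def my_function(self):
   101       500     890000.0   1780.0     72.1      result = tensor.item()
   102       500     234567.0    469.1     19.0      return result
"""

top_line = hottest(REPORT, top=1)[0]
```

In the sample, the top entry is the `tensor.item()` call on line 101, which is the line a drill-down would target first.
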

| File | Contents |
|---|---|
| references/optimization-patterns.md | Pattern index — links to 6 sub-files: sync-alloc, loop-iteration, python-overhead, gpu-graph, system, compound-pitfalls |
| references/optimization-strategy.md | Zero-risk-first ordering, metric traps, three scopes of host overhead, pattern selection guide |
| references/hotspot-classification.md | Extended per-type indicators and code examples (including CUSTOM_OP, GRAPH_BREAK, HOST_SYNC) |
| references/communication-patterns.md | Communication overhead patterns (NCCL batching, barrier removal, async overlap, reduce_scatter) |
| references/hot-path-files.md | Key file tables, drill-down targets, UT mapping |
| references/examples.md | Usage examples and multi-round walkthrough |
| trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference (from analysis skill) — maps range names to source functions |