Host Performance Optimization Skill
Automates detection and optimization of host-side (CPU) overhead in TensorRT-LLM's PyTorch backend.
When to Use
- GPU utilization is low during inference (CPU bottleneck suspected)
- User asks to reduce host overhead or CPU latency
- Optimizing PyExecutor throughput (requests/sec)
- Need line-by-line profiling of specific Python functions
Confirming the Bottleneck
line_profiler measures
where CPU time is spent but not
whether CPU is the bottleneck.
If you need to confirm CPU is the limiting factor, run the
skill first -- it provides a YES/NO verdict with metric evidence.
As a rough heuristic without nsys: if doubling the batch size does not proportionally increase GPU utilization or throughput, CPU overhead is likely the bottleneck.
Using Analysis Skill Results
If the
skill has already been run, use its output to skip the confirmation step and prioritize targets:
- Detection verdict: If YES with host_prep_confirmed, start with .
- NVTX triage (from Root Cause): The in the handoff data block maps NVTX range names to source functions. Profile the function with the largest absolute delta first.
- Cross-function triage: When the top NVTX regression is NOT in (e.g., , , ), target that function's source file directly instead of defaulting to . See references/trtllm-nvtx-ranges.md for the NVTX-to-source mapping.
Profiling Setup
line_profiler (Primary Method)
Environment Variables:
TLLM_LINE_PROFILER_ENABLED=True
— Enable the profiler
- — Output file path
TLLM_LINE_PROFILER_FUNCTIONS
— Additional functions to profile (comma-separated)
Function specification format:
bash
# Class methods: module.path.ClassName.method_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs"
# Standalone functions: module.path::function_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"
# Multiple functions (comma-separated)
TLLM_LINE_PROFILER_FUNCTIONS="module.Class.method1,module.Class.method2"
CPU Affinity (Environment Factor)
CPU core affinity can significantly affect host overhead measurements,
especially on multi-socket systems (e.g., B300). Pinning processes to cores
near the GPU's NUMA node reduces cross-socket memory access latency.
- Check current affinity: or
- Pin to local NUMA node:
numactl --cpunodebind=<node> --membind=<node>
- Impact: Up to 2x difference in host overhead on B300 systems
When comparing profiling results across runs, ensure CPU affinity is consistent.
Do not externally modify the affinity, unless user requires to do this to examine the affects of this part.
Document the affinity setting in each round's report if it varies.
Workspace & Suffix Management
Each profiling run should have a unique suffix to track progress across rounds:
bash
EXTRA_SUFFIX=round0_baseline bash profile.sh
EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh
Autonomous Optimization Loop
Before starting the loop, review references/optimization-strategy.md for strategic guidance on ordering (zero-risk-first), measurement traps, and overhead scoping.
Key insight: Optimizations are NOT independent. Fixing a 50ms bottleneck may reveal a 30ms bottleneck that was previously masked (hidden behind the larger one). Always re-profile after each significant change — the bottleneck landscape shifts.
Ordering principle: Within each round, prefer zero-risk optimizations (caching, pre-allocation, hoisting invariants) over medium/high-risk ones (graph partition changes, algorithm fusion). Zero-risk changes provide free gains and make subsequent profiling cleaner.
Run N rounds (default 3) of the following cycle:
FOR round = 1 to MAX_ROUNDS:
1. PROFILE (with Drill-Down)
2. ANALYZE (Multi-Option)
3. OPTIMIZE (Apply Change — prefer zero-risk first)
4. TEST (Unit Test Validation)
5. VALIDATE (Re-Profile — expect bottleneck landscape to shift)
6. REPORT
END FOR → FINAL SUMMARY
Phase 1: PROFILE (with Drill-Down)
- Run workload with profiler enabled
- Parse output: identify functions with highest Total time and lines with highest % Time
- CRITICAL: Drill down into sub-functions that are not yet profiled (see below)
Drill-Down Profiling
The default profiler covers top-level executor functions but not all sub-functions. When a profiled function shows most time in a single sub-call, you must drill down.
When: A single line consumes >80% of a function's time calling an unprofiled sub-function:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2848 4100 59200000000.0 14439024.4 98.7 output = self.model_engine.forward(...)
How:
- Identify the sub-function's full qualified path (e.g.,
tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs
)
- Add it to
TLLM_LINE_PROFILER_FUNCTIONS
- Re-profile to get line-level data inside it
- Now analyze the inner hotspots
For common drill-down targets, see references/hot-path-files.md.
Phase 2: ANALYZE (Multi-Option)
For the chosen hotspot:
- Identify the top hotspots by absolute time (not just %) within the target function
- Classify each hotspot by type. Summary table:
| Type | Indicators | Severity |
|---|
| HOST_SYNC | , in per-layer forward path | Critical |
| SYNC | , , in step-level code | Critical |
| CUSTOM_OP | Chain of Python tensor ops (view/slice/cast) before kernel launch | Critical |
| GRAPH_BREAK | Op that prevents CUDA graph capture of surrounding code (fix via GRAPH_EXPAND / GRAPH_SPLIT) | High |
| ALLOC | torch.zeros/empty/tensor()
in loops, | High |
| HOIST | Per-layer recomputation of step-invariant values | High |
| PYLOOP | with many iterations | High |
| REDUNDANT_ITER | Multiple passes over the same collection | High |
| DEAD_WORK | Object construction whose results are always discarded | High |
| CONTAINER | Dict/set lookups in hot loops | Medium |
| FUNCALL | Repeated method/property calls | Medium |
| COMM | , , NCCL overhead in TP/PP paths | Medium |
| GIL | Lock/queue contention | Medium |
| SERIALIZE | , in request processing | Medium |
| GC | Periodic latency spikes, non-deterministic pauses (tail latency) | Low |
| COMPUTE | Actual computation (may not be optimizable) | Low |
For detailed classification with code examples, see references/hotspot-classification.md.
- Propose 2-4 optimization options in a table:
| Option | Description | Estimated Savings | Risk | Complexity |
|---|
| A | ... | ... | Low/Med/High | ... |
| B | ... | ... | ... | ... |
- Select the best option and explain reasoning (prefer high-savings + low-risk; follow zero-risk-first ordering from references/optimization-strategy.md)
For optimization patterns by type, see references/optimization-patterns.md (index) — it links to the relevant sub-file for each hotspot type. For GPU-specific patterns (CUSTOM_OP, GRAPH_SPLIT, GRAPH_EXPAND), see references/patterns/gpu-graph.md.
Phase 3: OPTIMIZE (Apply Change)
- Apply the selected code change with Edit tool
- One optimization per round — keep changes minimal and targeted
- Record the exact change (file, line range, before/after) for potential rollback
Phase 4: TEST (Unit Test Validation)
Mandatory after each optimization. Find and run related UTs to verify correctness.
Finding related tests:
bash
# Search by modified file name
grep -rl "model_engine\|PyTorchModelEngine" tests/unittest/_torch/executor/
# Search by modified function name
grep -rl "_prepare_tp_inputs\|prepare_inputs" tests/
Running tests:
bash
# Run specific test file with stop-on-first-failure
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py -v -x --timeout=120
# Run specific test method
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase::test_position_id_preparation -v -x
For the full UT-to-file mapping, see references/hot-path-files.md.
If tests fail:
- Read the failure message
- Rollback immediately ()
- Analyze why the optimization broke correctness
- Try the next-best option from Phase 2
Phase 5: VALIDATE (Re-Profile)
- Re-run profiler with identical workload, using suffix
- Compare three things:
- Did the target hotspot time decrease?
- Did the overall function Total time decrease?
- Did benchmark metrics (TPOT, throughput) improve?
If regression detected (function time increased or metrics worsened):
- The "optimization" may have triggered a CPython pitfall — see references/patterns/compound-pitfalls.md (CPython Pitfalls section)
- Rollback and try the next-best option from Phase 2
Phase 6: REPORT
Log for this round:
- Round number
- Hotspot location (file:line) and classification
- Optimization applied (with before/after code summary)
- Time delta: function Total time before → after
- Benchmark delta: TPOT, throughput before → after
Reading Profile Output
Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100
Line # Hits Time Per Hit % Time Line Contents
==============================================================
100 def my_function(self):
101 500 890000.0 1780.0 72.1 result = tensor.item()
102 500 234567.0 469.1 19.0 return result
How to read effectively:
- Start with Total time for each function — this is the overall budget
- Sort lines mentally by absolute Time, not just % Time (3% of a 60s function = 1.8s)
- Check Hits count to understand iteration patterns:
- Hits = 2 × expected count → loop overhead (2 hits = enter + exit check)
- Hits ≫ expected → the line is inside a nested loop
- Look for repeated patterns: if 10 lines each take 3% in a loop body, the loop itself costs 30%
Stopping Criteria
Stop the optimization loop when:
- Iteration limit reached: Completed N rounds (default 3)
- No actionable hotspots: Top hotspots are pure GPU compute (COMPUTE type)
- Diminishing returns: < 5% improvement in last 2 rounds
- Risk threshold: Further optimizations require architectural changes (e.g., Cython, struct-of-arrays)
- Test failures: Cannot find an optimization that passes UTs
Primary success metric: Benchmark throughput (requests/sec or tokens/sec) as measured by the profiling script. line_profiler time reductions are leading indicators, but throughput is the ground truth — a function-level speedup that doesn't improve throughput is not a real win.
Final Summary Output
The final report should include:
- Rounds executed: Number of profile-optimize cycles completed
- Cumulative improvement: Total host time reduction (percentage and absolute)
- Benchmark metrics: Before/after comparison table (TPOT, throughput, ITL, E2EL)
- Optimizations applied: List of changes with file:line locations and classification
- Failed attempts: Any optimizations tried and reverted (with why)
- Remaining hotspots: Top bottlenecks that couldn't be optimized (with classification)
- Recommendations: Suggested follow-up for architectural changes if needed
For a concrete multi-round example, see references/examples.md.
Reference Files
| File | Contents |
|---|
| references/optimization-patterns.md | Pattern index — links to 6 sub-files: sync-alloc, loop-iteration, python-overhead, gpu-graph, system, compound-pitfalls |
| references/optimization-strategy.md | Zero-risk-first ordering, metric traps, three scopes of host overhead, pattern selection guide |
| references/hotspot-classification.md | Extended per-type indicators and code examples (including CUSTOM_OP, GRAPH_BREAK, HOST_SYNC) |
| references/communication-patterns.md | Communication overhead patterns (NCCL batching, barrier removal, async overlap, reduce_scatter) |
| references/hot-path-files.md | Key file tables, drill-down targets, UT mapping |
| references/examples.md | Usage examples and multi-round walkthrough |
| trtllm-nvtx-ranges.md | TRT-LLM NVTX range reference (from analysis skill) — maps range names to source functions |