Host Performance Optimization Skill

Automates detection and optimization of host-side (CPU) overhead in TensorRT-LLM's PyTorch backend.

When to Use

GPU utilization is low during inference (CPU bottleneck suspected)
User asks to reduce host overhead or CPU latency
Optimizing PyExecutor throughput (requests/sec)
Need line-by-line profiling of specific Python functions

Confirming the Bottleneck

line_profiler measures where CPU time is spent but not whether CPU is the bottleneck. If you need to confirm CPU is the limiting factor, run the

perf-host-analysis

skill first -- it provides a YES/NO verdict with metric evidence.

As a rough heuristic without nsys: if doubling the batch size does not proportionally increase GPU utilization or throughput, CPU overhead is likely the bottleneck.

Using Analysis Skill Results

If the

perf-host-analysis

skill has already been run, use its output to skip the confirmation step and prioritize targets:

Detection verdict: If YES with host_prep_confirmed, start with
```
_prepare_tp_inputs
```
.
NVTX triage (from Root Cause): The
```
top_regressing_ops
```
in the handoff data block maps NVTX range names to source functions. Profile the function with the largest absolute delta first.
Cross-function triage: When the top NVTX regression is NOT in
```
_prepare_tp_inputs
```
(e.g.,
```
_fetch_new_requests
```
,
```
broadcast_requests
```
,
```
_update_requests
```
), target that function's source file directly instead of defaulting to
```
_prepare_tp_inputs
```
. See references/trtllm-nvtx-ranges.md for the NVTX-to-source mapping.

Profiling Setup

line_profiler (Primary Method)

Environment Variables:

```
TLLM_LINE_PROFILER_ENABLED=True
```
— Enable the profiler
```
TLLM_LINE_PROFILER_PATH
```
— Output file path
```
TLLM_LINE_PROFILER_FUNCTIONS
```
— Additional functions to profile (comma-separated)

Function specification format:

bash

# Class methods: module.path.ClassName.method_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs"

# Standalone functions: module.path::function_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"

# Multiple functions (comma-separated)
TLLM_LINE_PROFILER_FUNCTIONS="module.Class.method1,module.Class.method2"

CPU Affinity (Environment Factor)

CPU core affinity can significantly affect host overhead measurements, especially on multi-socket systems (e.g., B300). Pinning processes to cores near the GPU's NUMA node reduces cross-socket memory access latency.

Check current affinity:
```
taskset -p <pid>
```
or
```
numactl --show
```

Pin to local NUMA node:

numactl --cpunodebind=<node> --membind=<node>

Impact: Up to 2x difference in host overhead on B300 systems

When comparing profiling results across runs, ensure CPU affinity is consistent. Do not externally modify the affinity, unless user requires to do this to examine the affects of this part. Document the affinity setting in each round's report if it varies.

Workspace & Suffix Management

Each profiling run should have a unique suffix to track progress across rounds:

bash

EXTRA_SUFFIX=round0_baseline bash profile.sh
EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh

Autonomous Optimization Loop

Before starting the loop, review references/optimization-strategy.md for strategic guidance on ordering (zero-risk-first), measurement traps, and overhead scoping.

Key insight: Optimizations are NOT independent. Fixing a 50ms bottleneck may reveal a 30ms bottleneck that was previously masked (hidden behind the larger one). Always re-profile after each significant change — the bottleneck landscape shifts.

Ordering principle: Within each round, prefer zero-risk optimizations (caching, pre-allocation, hoisting invariants) over medium/high-risk ones (graph partition changes, algorithm fusion). Zero-risk changes provide free gains and make subsequent profiling cleaner.

Run N rounds (default 3) of the following cycle:

FOR round = 1 to MAX_ROUNDS:

  1. PROFILE (with Drill-Down)
  2. ANALYZE (Multi-Option)
  3. OPTIMIZE (Apply Change — prefer zero-risk first)
  4. TEST (Unit Test Validation)
  5. VALIDATE (Re-Profile — expect bottleneck landscape to shift)
  6. REPORT

END FOR → FINAL SUMMARY

Phase 1: PROFILE (with Drill-Down)

Run workload with profiler enabled
Parse output: identify functions with highest Total time and lines with highest % Time
CRITICAL: Drill down into sub-functions that are not yet profiled (see below)

Drill-Down Profiling

The default profiler covers top-level executor functions but not all sub-functions. When a profiled function shows most time in a single sub-call, you must drill down.

When: A single line consumes >80% of a function's time calling an unprofiled sub-function:

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0  14439024.4   98.7      output = self.model_engine.forward(...)

How:

Identify the sub-function's full qualified path (e.g.,

tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs

)

Add it to
```
TLLM_LINE_PROFILER_FUNCTIONS
```
Re-profile to get line-level data inside it
Now analyze the inner hotspots

For common drill-down targets, see references/hot-path-files.md.

Phase 2: ANALYZE (Multi-Option)

For the chosen hotspot:

Identify the top hotspots by absolute time (not just %) within the target function
Classify each hotspot by type. Summary table:

Type	Indicators	Severity
HOST_SYNC	`.item()` , `.cpu()` in per-layer forward path	Critical
SYNC	`.item()` , `.cpu()` , `synchronize()` in step-level code	Critical
CUSTOM_OP	Chain of Python tensor ops (view/slice/cast) before kernel launch	Critical
GRAPH_BREAK	Op that prevents CUDA graph capture of surrounding code (fix via GRAPH_EXPAND / GRAPH_SPLIT)	High
ALLOC	`torch.zeros/empty/tensor()` in loops, `.clone()`	High
HOIST	Per-layer recomputation of step-invariant values	High
PYLOOP	`for x in collection:` with many iterations	High
REDUNDANT_ITER	Multiple passes over the same collection	High
DEAD_WORK	Object construction whose results are always discarded	High
CONTAINER	Dict/set lookups in hot loops	Medium
FUNCALL	Repeated method/property calls	Medium
COMM	`dist.all_reduce` , `dist.barrier` , NCCL overhead in TP/PP paths	Medium
GIL	Lock/queue contention	Medium
SERIALIZE	`pickle.dumps/loads` , `json.dumps/loads` in request processing	Medium
GC	Periodic latency spikes, non-deterministic pauses (tail latency)	Low
COMPUTE	Actual computation (may not be optimizable)	Low

For detailed classification with code examples, see references/hotspot-classification.md.

Propose 2-4 optimization options in a table:

Option	Description	Estimated Savings	Risk	Complexity
A	...	...	Low/Med/High	...
B	...	...	...	...

Select the best option and explain reasoning (prefer high-savings + low-risk; follow zero-risk-first ordering from references/optimization-strategy.md)

For optimization patterns by type, see references/optimization-patterns.md (index) — it links to the relevant sub-file for each hotspot type. For GPU-specific patterns (CUSTOM_OP, GRAPH_SPLIT, GRAPH_EXPAND), see references/patterns/gpu-graph.md.

Phase 3: OPTIMIZE (Apply Change)

Apply the selected code change with Edit tool
One optimization per round — keep changes minimal and targeted
Record the exact change (file, line range, before/after) for potential rollback

Phase 4: TEST (Unit Test Validation)

Mandatory after each optimization. Find and run related UTs to verify correctness.

Finding related tests:

bash

# Search by modified file name
grep -rl "model_engine\|PyTorchModelEngine" tests/unittest/_torch/executor/

# Search by modified function name
grep -rl "_prepare_tp_inputs\|prepare_inputs" tests/

Running tests:

bash

# Run specific test file with stop-on-first-failure
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py -v -x --timeout=120

# Run specific test method
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase::test_position_id_preparation -v -x

For the full UT-to-file mapping, see references/hot-path-files.md.

If tests fail:

Read the failure message
Rollback immediately (
```
git checkout -- <file>
```
)
Analyze why the optimization broke correctness
Try the next-best option from Phase 2

Phase 5: VALIDATE (Re-Profile)

Re-run profiler with identical workload, using suffix
```
round<N>_<description>
```
Compare three things:
1. Did the target hotspot time decrease?
2. Did the overall function Total time decrease?
3. Did benchmark metrics (TPOT, throughput) improve?

If regression detected (function time increased or metrics worsened):

The "optimization" may have triggered a CPython pitfall — see references/patterns/compound-pitfalls.md (CPython Pitfalls section)
Rollback and try the next-best option from Phase 2

Phase 6: REPORT

Log for this round:

Round number
Hotspot location (file:line) and classification
Optimization applied (with before/after code summary)
Time delta: function Total time before → after
Benchmark delta: TPOT, throughput before → after

Reading Profile Output

Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           def my_function(self):
   101       500    890000.0   1780.0     72.1       result = tensor.item()
   102       500    234567.0    469.1     19.0       return result

How to read effectively:

Start with Total time for each function — this is the overall budget
Sort lines mentally by absolute Time, not just % Time (3% of a 60s function = 1.8s)
Check Hits count to understand iteration patterns:
- Hits = 2 × expected count →
```
for x in range(1):
```
  loop overhead (2 hits = enter + exit check)
- Hits ≫ expected → the line is inside a nested loop
Look for repeated patterns: if 10 lines each take 3% in a loop body, the loop itself costs 30%

Stopping Criteria

Stop the optimization loop when:

Iteration limit reached: Completed N rounds (default 3)
No actionable hotspots: Top hotspots are pure GPU compute (COMPUTE type)
Diminishing returns: < 5% improvement in last 2 rounds
Risk threshold: Further optimizations require architectural changes (e.g., Cython, struct-of-arrays)
Test failures: Cannot find an optimization that passes UTs

Primary success metric: Benchmark throughput (requests/sec or tokens/sec) as measured by the profiling script. line_profiler time reductions are leading indicators, but throughput is the ground truth — a function-level speedup that doesn't improve throughput is not a real win.

Final Summary Output

The final report should include:

Rounds executed: Number of profile-optimize cycles completed
Cumulative improvement: Total host time reduction (percentage and absolute)
Benchmark metrics: Before/after comparison table (TPOT, throughput, ITL, E2EL)
Optimizations applied: List of changes with file:line locations and classification
Failed attempts: Any optimizations tried and reverted (with why)
Remaining hotspots: Top bottlenecks that couldn't be optimized (with classification)
Recommendations: Suggested follow-up for architectural changes if needed

For a concrete multi-round example, see references/examples.md.

Reference Files

File	Contents
references/optimization-patterns.md	Pattern index — links to 6 sub-files: sync-alloc, loop-iteration, python-overhead, gpu-graph, system, compound-pitfalls
references/optimization-strategy.md	Zero-risk-first ordering, metric traps, three scopes of host overhead, pattern selection guide
references/hotspot-classification.md	Extended per-type indicators and code examples (including CUSTOM_OP, GRAPH_BREAK, HOST_SYNC)
references/communication-patterns.md	Communication overhead patterns (NCCL batching, barrier removal, async overlap, reduce_scatter)
references/hot-path-files.md	Key file tables, drill-down targets, UT mapping
references/examples.md	Usage examples and multi-round walkthrough
trtllm-nvtx-ranges.md	TRT-LLM NVTX range reference (from analysis skill) — maps range names to source functions

perf-host-optimization

NPX Install

Tags

SKILL.md Content

Host Performance Optimization Skill

When to Use

Confirming the Bottleneck

Using Analysis Skill Results

Profiling Setup

line_profiler (Primary Method)

CPU Affinity (Environment Factor)

Workspace & Suffix Management

Autonomous Optimization Loop

Phase 1: PROFILE (with Drill-Down)

Drill-Down Profiling

Phase 2: ANALYZE (Multi-Option)

Phase 3: OPTIMIZE (Apply Change)

Phase 4: TEST (Unit Test Validation)

Phase 5: VALIDATE (Re-Profile)

Phase 6: REPORT

Reading Profile Output

Stopping Criteria

Final Summary Output

Reference Files