Systematic approach to exploring the TensorRT-LLM codebase before implementing new features or optimizations. Teaches how to discover existing infrastructure, trace code paths, and avoid reimplementing what already exists. Derived from real mistakes where ~250 lines of code were written and deleted because existing forward methods weren't discovered upfront. Use when starting any new feature, optimization, or code modification in TRT-LLM.
```bash
npx skill4agent add nvidia/skills trtllm-codebase-exploration
```

`forward_context_default`

```bash
# List all methods (not just forward*)
grep -n "def " tensorrt_llm/_torch/modules/attention.py | head -50
# List all attributes set in __init__ (scan the 200 lines after each "def __init__")
grep -n -A 200 "def __init__" tensorrt_llm/_torch/modules/attention.py | grep "self\." | head -80
# Find the class hierarchy
grep -n "class MLA\|class Attention\|class TrtllmAttention" tensorrt_llm/_torch/modules/attention.py
```

```bash
# Find all forward methods
grep -n "def forward" tensorrt_llm/_torch/modules/attention.py
# For each one, read the full implementation (not just the signature)
```

| What you need | Search for | Common hits |
|---|---|---|
| Attention computation | | Handles packed seqs, variable lengths, KV cache natively |
| Compiled fusion | | Already in |
| RoPE application | | Multiple implementations exist; check which one the current code path uses |
| KV cache management | | Fused RoPE + cache operations in C++ kernels |
| Sparse attention | | DSA-specific backend with sparse routing |
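
The lookup table above pairs each need with search terms. As a minimal sketch, several spellings of one concept can be searched in a single pass and ranked by how often each file mentions them. `search_concept` is a hypothetical helper, not part of TRT-LLM, and the keywords in the usage line are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TRT-LLM): rank Python files by how many
# lines mention any spelling of a concept.
# Usage: search_concept ROOT KEYWORD [KEYWORD...]
search_concept() {
    local root="$1"; shift
    local pattern
    # Join the remaining arguments into one alternation: a|b|c
    pattern="$(IFS='|'; echo "$*")"
    # -r recurse, -c count matching lines per file, -i ignore case, -E extended regex.
    # grep -c also reports files with zero matches, so drop those, then sort
    # by the count (second ':'-separated field), highest first.
    grep -rciE --include='*.py' "$pattern" "$root" \
        | grep -v ':0$' \
        | sort -t: -k2,2nr
}
```

For example, `search_concept tensorrt_llm/_torch rope rotary` would list the files to read first. The output is `path:count`, so paths that contain a colon would confuse the sort.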
```bash
# Generic search pattern
grep -rn "KEYWORD" tensorrt_llm/_torch/ --include="*.py" | head -20
```

```bash
# Find what the attention kernel handles internally
grep -rn "latent_cache\|rope.*fuse\|rope_fusion" tensorrt_llm/_torch/attention_backend/
```

`rope_fusion=True`, `apply_rotary_emb=False`, `latent_cache`

```bash
# Find assertions in the class
grep -n "assert " tensorrt_llm/_torch/modules/attention.py
```

`assert self.mha is None`, `self.mha`, `assert self.mqa is not None`

```bash
# Find weight loading/transformation code
grep -rn "load_.*weight\|weight.*transform\|load_kv_b_proj" tensorrt_llm/_torch/models/
# Check how weights are laid out after loading
grep -n "def load_" tensorrt_llm/_torch/models/modeling_deepseekv3.py
```

```bash
# Find all callers of the method to see its dispatch context
grep -rn "forward_context_default\|forward_context(" tensorrt_llm/_torch/modules/attention.py
# Look for the dispatcher that routes to this method
# Often named similarly but without a suffix (e.g., forward_context dispatches to forward_context_default)
```

`forward_context_default()`, `forward_context()`, `forward_context_default`, `forward_context_with_cached_kv`, `forward_context_with_chunked_prefill`, `forward_context_default`, `forward_context_default`, `__init__`, `apply_rotary_emb`, `rope_fusion`, `latent_cache`

```bash
# Find what calls forward_context_default to discover the dispatch chain
grep -n "forward_context_default" tensorrt_llm/_torch/modules/attention.py
```

`tensorrt_llm/_torch/utils.py`, `tensorrt_llm/_torch/modules/`, `tests/unittest/_torch/`

| Mistake | Consequence | Prevention |
|---|---|---|
| Reading only the method you're modifying | Miss that another method does what you need | Read ALL methods in the class |
| Searching only for the exact function name | Miss equivalent implementations | Search for the concept (e.g., "attention", "rope", "expand kv") |
| Assuming assertions are immutable | Work around them with hacks (separate attributes) | Question whether the assertion's intent still applies |
| Not reading the fused kernel's capabilities | Reimplement what it already does | Check what |
| Only reading Python code | Miss C++ implementations called via bindings | Check |
| Calling a method directly instead of through its dispatcher | Miss edge cases (cached KV, chunked prefill, SM-version gating) | Search for callers of the method to find the dispatch chain |
| Assuming hardware-uniform numerical behavior | Silent accuracy degradation on specific SM versions | Check for |
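
The dispatcher row above recommends searching for a method's callers. A rough sketch of that search, which filters out the method's own definition so only call sites remain, might look like this. `find_callers` is a hypothetical name, and the regex assumes ordinary Python call syntax (`name(`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TRT-LLM): show call sites of a method,
# excluding its own "def" line.
# Usage: find_callers METHOD PATH
find_callers() {
    local method="$1" path="$2"
    # Require a non-identifier character (or line start) before the name so
    # "forward_context" does not match "my_forward_context", and require "("
    # right after it so it does not match "forward_context_default(".
    grep -rnE --include='*.py' "(^|[^[:alnum:]_])${method}\(" "$path" \
        | grep -vE "def[[:space:]]+${method}\("
}
```

Running `find_callers forward_context_default tensorrt_llm/_torch/modules/` should surface the dispatcher that routes to it. Calls made indirectly, for example through `getattr` or a stored function reference, will not show up, so treat an empty result as inconclusive.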
| Area | Key files to read |
|---|---|
| Attention modules | |
| Attention backends | |
| Model definitions | |
| Utilities | |
| RoPE | |
| Test fixtures | |
| Weight loading | |
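
Before modifying any file in these areas, the individual greps from this skill can be bundled into one inventory pass. The sketch below assumes a single Python file as input. `inventory` is a hypothetical name, and the output is only a starting list of what to read in full.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TRT-LLM): one-pass inventory of a Python
# file's classes, methods, and assertions, each with its line number.
# Usage: inventory FILE
inventory() {
    local file="$1"
    echo "== classes =="
    grep -nE '^[[:space:]]*class[[:space:]]+[A-Za-z_]' "$file"
    echo "== methods =="
    grep -nE '^[[:space:]]*def[[:space:]]+[A-Za-z_]' "$file"
    echo "== assertions =="
    grep -nE '^[[:space:]]*assert[[:space:]]' "$file"
    # grep exits non-zero when a section is empty; that is not an error here.
    return 0
}
```

For example, `inventory tensorrt_llm/_torch/modules/attention.py` prints every class, every method (not only `forward*`), and every assertion in the file.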