Loading...
Loading...
Found 34 Skills
Best practices for contributing code to TensorRT-LLM. Covers the official contribution process (issue tracking, fork workflow, DCO signing), coding guidelines, implementation workflow, common mistakes, testing strategy, commit hygiene, and review readiness. Incorporates rules from CONTRIBUTING.md and CODING_GUIDELINES.md plus lessons distilled from real PR retrospectives. Use when implementing new features, optimizations, or bug fixes in the TensorRT-LLM codebase.
Use when reducing model size, improving inference speed, or deploying to edge devices - covers quantization, pruning, knowledge distillation, ONNX export, and TensorRT optimizationUse when ", " mentioned.
CLIP vision-language model for image-text retrieval, zero-shot classification, embedding extraction, ONNX export, and TensorRT deployment. Use when fine-tuning or training CLIP, running zero-shot classification, computing image embeddings, or deploying CLIP to ONNX/TensorRT.
Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node.
Systematic approach to exploring the TensorRT-LLM codebase before implementing new features or optimizations. Teaches how to discover existing infrastructure, trace code paths, and avoid reimplementing what already exists. Derived from real mistakes where ~250 lines of code were written and deleted because existing forward methods weren't discovered upfront. Use when starting any new feature, optimization, or code modification in TRT-LLM.
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.
Compile TensorRT-LLM on a compute node inside a Docker container. Use this when already on a compute node with GPUs visible.
Enable and interpret TensorRT-LLM AutoDeploy FX graph text dumps via AD_DUMP_GRAPHS_DIR. Use when you need before/after graphs per transform, to locate subgraphs, or to confirm a rewrite ran. Paths and behavior are grounded in tensorrt_llm/_torch/auto_deploy (GraphWriter, BaseTransform). Complements ad-add-fusion-transformation.
This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.
Upgrade flashinfer-python version in TensorRT-LLM. Fetches the latest releases from GitHub (stable and nightly), compares with the current pinned version, lets the user pick a target version, and updates all version references across the repo. Use when the user wants to bump or upgrade flashinfer.