Search Results: tensorrt

Found 36 Skills

trtllm-moe-develop

Review, design, and refactor TensorRT-LLM PyTorch MoE code for architecture fit, clean code, maintainability, and testability. Always use for any modification, review, refactor, or design planning that touches MoE modules, including tensorrt_llm/_torch/modules/fused_moe, ConfigurableMoE, MoE backends, MoEScheduler/moe_scheduler.py, forward execution/chunking, communication strategies, EPLB, quantization/weight handling, routing, factories, MoE docs, or MoE tests. Also use when the user asks whether a MoE design follows the current architecture or whether a MoE refactor is reasonable.

🇺🇸|EnglishTranslated

AI & Machine Learningomer-metin/skills-for-ant...

model-optimization

Use when reducing model size, improving inference speed, or deploying to edge devices - covers quantization, pruning, knowledge distillation, ONNX export, and TensorRT optimizationUse when ", " mentioned.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

trtllm-code-contribution

Best practices for contributing code to TensorRT-LLM. Covers the official contribution process (issue tracking, fork workflow, DCO signing), coding guidelines, implementation workflow, common mistakes, testing strategy, commit hygiene, and review readiness. Incorporates rules from CONTRIBUTING.md and CODING_GUIDELINES.md plus lessons distilled from real PR retrospectives. Use when implementing new features, optimizations, or bug fixes in the TensorRT-LLM codebase.

🇺🇸|EnglishTranslated

AI & Machine Learningpromptingcompany/nv-skill...

tao-finetune-clip

CLIP vision-language model for image-text retrieval, zero-shot classification, embedding extraction, ONNX export, and TensorRT deployment. Use when fine-tuning or training CLIP, running zero-shot classification, computing image embeddings, or deploying CLIP to ONNX/TensorRT.

🇺🇸|EnglishTranslated

AI & Machine Learningbbuf/sglang-auto-driven-s...

llm-serving-auto-benchmark

Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.

🇺🇸|EnglishTranslated

2 scripts/Checked

AI & Machine Learningnvidia/skills

tao-port-huggingface-model

Integrate a HuggingFace Computer Vision model into the NVIDIA TAO Toolkit ecosystem (tao-core config, tao-pytorch trainer, tao-deploy TensorRT pipeline). Use when the user asks to "integrate a HuggingFace model into TAO", "add an HF model to TAO Toolkit", "wire a HuggingFace ViT/DETR/ SegFormer into tao-pytorch", "build a TAO trainer + deploy pipeline for an HF CV model", or pastes a HuggingFace model URL/ID and wants it turned into a TAO model. Covers the full 7-phase loop: prerequisites check, HuggingFace inspection and validation, codebase exploration, tao-core configuration and native trainer implementation, ONNX export plus TensorRT deploy integration, packaging and L0 testing, container-based end-to-end validation, and (conditional) accuracy/latency tuning. Supports classification, object detection, semantic / instance / panoptic segmentation, zero-shot detection, and depth estimation.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

tao-launch-workflow

Shared launch intake for any TAO workflow or action. Use when the user wants to run TAO AutoML, train, evaluate, infer, export, generate TensorRT engines, or launch DEFT/workflow jobs on an execution platform.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

ptq

This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

ad-graph-dump

Enable and interpret TensorRT-LLM AutoDeploy FX graph text dumps via AD_DUMP_GRAPHS_DIR. Use when you need before/after graphs per transform, to locate subgraphs, or to confirm a rewrite ran. Paths and behavior are grounded in tensorrt_llm/_torch/auto_deploy (GraphWriter, BaseTransform). Complements ad-add-fusion-transformation.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

trtllm-codebase-exploration

Systematic approach to exploring the TensorRT-LLM codebase before implementing new features or optimizations. Teaches how to discover existing infrastructure, trace code paths, and avoid reimplementing what already exists. Derived from real mistakes where ~250 lines of code were written and deleted because existing forward methods weren't discovered upfront. Use when starting any new feature, optimization, or code modification in TRT-LLM.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

trtllm-serve-config-guide

Generate a source-backed starting `trtllm-serve --config` YAML for basic aggregate single-node PyTorch serving, aligned with checked-in TensorRT-LLM configs and deployment docs. Preserves explicit latency / balanced / throughput objectives. Excludes disaggregated, multi-node, and non-MTP speculative configs.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

exec-slurm-compile

Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node.

🇺🇸|EnglishTranslated

4 scripts/Checked