Search Results: tensorrt

Found 36 Skills

deepstream-dev

NVIDIA DeepStream SDK 9.0 development with Python pyservicemaker API. Use when building video analytics pipelines, GStreamer-based video processing, TensorRT inference integration, object detection/tracking, or Kafka/message broker integration.

AI & Machine Learning | nvidia/skills

perf-host-analysis

Analyze host/CPU overhead in TensorRT-LLM inference from nsys traces. Detect whether host overhead is the bottleneck using GPU idle ratio, host prep exposed ratio, and per-phase evidence. For regressions, isolate forward steps via allreduce/NVTX patterns, compare host operation breakdowns across versions, and identify scheduling or request-management overhead. Supports optional inter-kernel gap, eager-vs-graph, pattern mapping, and multi-rank straggler drill-down. Use standalone or within perf-analysis. Triggers: host overhead, inter-step gap, scheduling overhead, forward step isolation, nsys iteration analysis, NVTX breakdown, request management overhead, GPU idle, host bottleneck, host prep exposed, inter-kernel gap, bubble analysis, graph coverage, eager kernel, rank imbalance, straggler detection.

2 scripts/Attention

AI & Machine Learning | nvidia/skills

perf-host-optimization

Profiles and optimizes TensorRT-LLM host/CPU overhead using line_profiler (with nsys support planned). Runs iterative profile-analyze-optimize-validate rounds. Use when GPU utilization is low or optimizing PyExecutor throughput.

AI & Machine Learning | kiterlin/intelligent-dete...

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

AI & Machine Learning | nvidia/skills

ad-add-fusion-transformation

Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.

AI & Machine Learning | nvidia/skills

exec-local-compile

Compile TensorRT-LLM on a compute node inside a Docker container. Use this when already on a compute node with GPUs visible.

AI & Machine Learning | nvidia/skills

trtllm-flashinfer-upgrade

Upgrade flashinfer-python version in TensorRT-LLM. Fetches the latest releases from GitHub (stable and nightly), compares with the current pinned version, lets the user pick a target version, and updates all version references across the repo. Use when the user wants to bump or upgrade flashinfer.

AI & Machine Learning | nvidia/skills

jetson-inference-mem-tune

Pick the serving stack and per-runtime memory flags (vLLM, SGLang, llama.cpp, TensorRT Edge-LLM) for an LLM/VLM workload on any NVIDIA Jetson.

1 scripts/Checked

AI & Machine Learning | ancoleman/ai-design-compo...

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

5 scripts/Attention

AI & Machine Learning | nvidia/skills

nemoclaw-user-configure-inference

Connects NemoClaw to a local inference server. Use when setting up Ollama, vLLM, TensorRT-LLM, NIM, or any OpenAI-compatible local model server with NemoClaw. Trigger keywords - nemoclaw local inference, ollama nemoclaw, vllm nemoclaw, local model server, openai compatible endpoint, switch nemoclaw inference model, change inference runtime, nemoclaw additional model, nemoclaw sub-agent model, openclaw sub-agent, agents.list, sessions_spawn, vlm-demo, nemoclaw tool calling, ollama tool calls, vllm tool-call-parser, raw json in tui, nemoclaw inference options, nemoclaw onboarding providers, nemoclaw inference routing.

AI & Machine Learning | bbuf/sglang-auto-driven-s...

model-pr-history-knowledge

Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.

1 scripts/Checked

AI & Machine Learning | bbuf/sglang-auto-driven-s...

sglang-sota-performance

End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.