Search Results: tensorrt-llm

Found 21 Skills

trtllm-flashinfer-upgrade

Upgrade flashinfer-python version in TensorRT-LLM. Fetches the latest releases from GitHub (stable and nightly), compares with the current pinned version, lets the user pick a target version, and updates all version references across the repo. Use when the user wants to bump or upgrade flashinfer.

🇺🇸|EnglishTranslated

AI & Machine Learningancoleman/ai-design-compo...

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

🇺🇸|EnglishTranslated

5 scripts/Attention

AI & Machine Learningbbuf/sglang-auto-driven-s...

model-pr-history-knowledge

Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningnvidia/skills

nemoclaw-user-configure-inference

Connects NemoClaw to a local inference server. Use when setting up Ollama, vLLM, TensorRT-LLM, NIM, or any OpenAI-compatible local model server with NemoClaw. Trigger keywords - nemoclaw local inference, ollama nemoclaw, vllm nemoclaw, local model server, openai compatible endpoint, switch nemoclaw inference model, change inference runtime, nemoclaw additional model, nemoclaw sub-agent model, openclaw sub-agent, agents.list, sessions_spawn, vlm-demo, nemoclaw tool calling, ollama tool calls, vllm tool-call-parser, raw json in tui, nemoclaw inference options, nemoclaw onboarding providers, nemoclaw inference routing.

🇺🇸|EnglishTranslated

AI & Machine Learningbbuf/sglang-auto-driven-s...

sglang-sota-performance

End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.

🇺🇸|EnglishTranslated

AI & Machine Learningbbuf/sglang-auto-driven-s...

llm-torch-profiler-analysis

Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables.

🇺🇸|EnglishTranslated

13 scripts/Attention

AI & Machine Learningnvidia/skills

ad-conf-check

Check whether AutoDeploy YAML configs were actually applied by analyzing server logs and optionally graph dumps (AD_DUMP_GRAPHS_DIR). Use when the user wants to verify config application, debug config issues, or check if AutoDeploy transforms (piecewise CUDA graph, multi-stream, sharding, fusion, etc.) were applied or fell back. Triggers on: "check config", "verify config", "ad-conf-check", "were my configs applied", "config not working", "check if piecewise is enabled", "check log for config", or any request to compare AD YAML settings against runtime behavior.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningnvidia/skills

ad-accuracy-debug

Debug AutoDeploy accuracy regressions vs a reference score (PyTorch backend or published baseline). Use when an AutoDeploy model's eval score is significantly below the reference and the root cause is unknown.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

ad-model-onboard

Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, validates with hierarchical equivalence tests.

🇺🇸|EnglishTranslated