Search Results: sglang

Found 25 Skills

AI & Machine Learning | slowlyc/agent-gpu-skills

sglang-skill

Develop, debug, and optimize the SGLang LLM serving engine. Use when the user mentions SGLang, sglang, srt, sgl-kernel, LLM serving, model inference, KV cache, attention backend, FlashInfer, MLA, MoE routing, speculative decoding, disaggregated serving, TP/PP/EP, radix cache, continuous batching, chunked prefill, CUDA graph, model loading, quantization FP8/GPTQ/AWQ, JIT kernel, triton kernel SGLang, or asks about serving LLMs with SGLang.

1 script/Attention

AI & Machine Learning | bbuf/sglang-auto-driven-s...

sglang-sota-performance

End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.

Code Quality | bbuf/sglang-auto-driven-s...

sglang-humanize-review

Perform SGLang code review in the style of human maintainers by consulting the 2024-2025 non-agent PR review corpus, including inline code snippets, original multilingual comments, and discussion threads. Use when reviewing SGLang PRs, diffs, patches, or local changes for correctness, tests, performance, GPU/runtime risks, API compatibility, and maintainability.

2 scripts/Checked

AI & Machine Learning | bbuf/sglang-auto-driven-s...

h100

SSH into host `h100_sglang`, enter Docker container `sglang_bbuf`, work in `/sgl-workspace/sglang`, and use the ready H100 remote environment for SGLang development and validation. Use when a task needs remote CUDA work, GPU-backed smoke tests, diffusion checks, or a safe remote copy instead of local-only execution.

AI & Machine Learning | bbuf/sglang-auto-driven-s...

model-architecture-diagram

Return public original model architecture diagrams for user-specified LLM, VLM, MoE, diffusion, OCR, and SGLang/sgl-cookbook model families. Use when the user asks for a model structure chart, architecture diagram, or rendered image link for a specific model such as DeepSeek, GLM, Qwen, Kimi, MiniMax, Step, Hunyuan, or Qwen3-VL.

1 script/Checked

AI & Machine Learning | bbuf/sglang-auto-driven-s...

llm-serving-capacity-planner

Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.

1 script/Checked

AI & Machine Learning | kiterlin/intelligent-dete...

slime-rl-training

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

AI & Machine Learning | nvidia/skills

deployment

Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when the user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).

1 script/Attention

AI & Machine Learning | bbuf/sglang-auto-driven-s...

llm-pipeline-analysis

Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.

4 scripts/Checked

AI & Machine Learning | sgl-project/sglang

add-sgl-kernel

Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel, including tests and benchmarks.

AI & Machine Learning | bbuf/sglang-auto-driven-s...

model-compute-simulation

Build an operator-level compute template for an LLM and estimate FLOPs/MFU for a serving shape. Use when you need tensor shapes, per-op FLOPs, kernel-to-op MFU mapping, or parallelism what-if analysis.

2 scripts/Attention

AI & Machine Learning | aradotso/trending-skills

club-3090-llm-serving

Recipes and configs for serving LLMs locally on RTX 3090 GPUs using vLLM, llama.cpp, and SGLang with an OpenAI-compatible API.
