Loading...
Loading...
Found 8 Skills
Use when designing prompts for LLMs, optimizing model performance, building evaluation frameworks, or implementing advanced prompting techniques like chain-of-thought, few-shot learning, or structured outputs.
Optimize Ollama configuration for maximum performance on the current machine. Use when asked to "optimize Ollama", "configure Ollama", "speed up Ollama", "tune LLM performance", "setup local LLM", "fix Ollama performance", "Ollama running slow", or when users want to maximize inference speed, reduce memory usage, or select appropriate models for their hardware. Analyzes system hardware (GPU, RAM, CPU) and provides tailored recommendations.
Use when the workflow is too slow, too expensive, or both and needs latency, cost, or token usage optimization.
This is a skill for benchmarking the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.
Use when diagnosing agent failures, debugging lost-in-middle issues, understanding context poisoning, or asking about "context degradation", "lost in middle", "context poisoning", "attention patterns", "context clash", "agent performance drops"
Use whenever the user mentions LLM prompt/prefix cache misses, cached_tokens=0, cache_read_input_tokens/cache_creation_input_tokens, prompt_cache_key, cache_control/cachePoint placement, stable prefixes, tool/schema stability, TTFT/prefill latency, OpenAI/Claude/Bedrock/OpenRouter routing, vLLM/SGLang KV reuse, or LLM cost/speed regressions on repeated long prompts. Use when reviewing LLM request shape changes: prompt text, message order, request builders, tools, schemas, response_format, provider API surface, model/router settings, agent loop structure, context compaction, or inference deployment. Use for speeding up agents only when prompt-cache stability, TTFT, or cache cost is central. Do not use for generic prompt writing, generic RAG design, token counting, or non-LLM performance.
Integrate TileGym kernels into Hugging Face `transformers` models by replacing the library's submodule(s) and certain class(es)' implementations, and patching certain class(es)' init/forward/load weight methods prior to instantiating models. Used when the user requires integrating TileGym kernels into `transformers` models.
This skill should be used when the user asks to "optimize a DSPy program", "use MIPROv2", "tune instructions and demos", "get best DSPy performance", "run Bayesian optimization", mentions "state-of-the-art DSPy optimizer", "joint instruction tuning", or needs maximum performance from a DSPy program with substantial training data (200+ examples).