# Ollama Optimizer

Optimize Ollama configuration for maximum performance on the current machine. Use when asked to "optimize Ollama", "configure Ollama", "speed up Ollama", "tune LLM performance", "setup local LLM", "fix Ollama performance", "Ollama running slow", or when users want to maximize inference speed, reduce memory usage, or select appropriate models for their hardware. The skill analyzes system hardware (GPU, RAM, CPU) and provides tailored recommendations.
## Installation

```shell
npx skill4agent add luongnv89/skills ollama-optimizer
```

## System Detection

Run the bundled detection script to identify the hardware tier:

```shell
python3 scripts/detect_system.py
```

## Hardware Tiers

| Tier | Criteria | Max Model | Key Optimizations |
|---|---|---|---|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |
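The tier table above can be sketched as a small selection helper. This is illustrative only (the function name `select_tier` and the handling of VRAM sizes that fall between listed ranges are assumptions, not taken from `scripts/detect_system.py`); in-between sizes are rounded down to the more conservative tier:

```python
def select_tier(vram_gb, has_gpu):
    """Map detected hardware to a (tier, max model size) pair.

    Thresholds follow the tier table; VRAM amounts between listed
    ranges (e.g. 9GB or 14GB) round down to the conservative tier.
    """
    if not has_gpu:
        return ("CPU-only", "3B")
    if vram_gb < 6:
        return ("Low VRAM", "3B")
    if vram_gb < 10:
        return ("Entry", "8B")
    if vram_gb < 16:
        return ("Prosumer", "14B")
    if vram_gb < 48:
        return ("Workstation", "32B")
    return ("High-end", "70B+")
```

Rounding down trades peak model size for headroom: it is cheaper to discover a larger model fits later than to debug OOM-driven offloading on first run.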
## Environment Variables

```shell
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for severe constraints
```

Check which models are installed:

```shell
ollama list
```

## Modelfile Parameters

```
PARAMETER num_gpu <layers>    # Partial offload for limited VRAM
PARAMETER num_thread <cores>  # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>      # Reduce context for memory savings
```

Watch live generation stats while testing:

```shell
ollama run <model> --verbose
```

## Benchmarking

```shell
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi
```
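To compare runs numerically, the eval rate can be extracted from the `--verbose` timing stats with a small parser. This is a sketch that assumes the stats block contains an `eval rate: … tokens/s` line, as printed by recent Ollama releases; adjust the pattern if your version formats it differently:

```python
import re

def parse_eval_rate(verbose_output):
    """Extract generation speed (tokens/s) from `ollama run --verbose` stats.

    Anchored at line start so the separate `prompt eval rate:` line is
    not matched by mistake. Returns None if no eval rate line is found.
    """
    match = re.search(r"^eval rate:\s*([\d.]+)\s*tokens/s",
                      verbose_output, re.MULTILINE)
    return float(match.group(1)) if match else None
```

Feeding it the captured stderr of a verbose run gives a single number suitable for before/after comparisons when toggling flash attention or KV cache quantization.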
```shell
# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```

## Output

The skill writes its report to `ollama-optimization-guide.md`, structured as:

# Ollama Optimization Guide
**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>
## System Overview
<hardware summary and constraints>
## Current Configuration
<existing Ollama setup and env vars>
## Recommendations
### Environment Variables
<shell commands to set vars>
### Model Selection
<recommended models with rationale>
### Performance Tuning
<Modelfile adjustments if needed>
## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...
## Verification
<benchmark commands and expected results>
## Rollback
<commands to revert changes if needed>

## Example Configurations

**8GB system (small, fast model):**

```shell
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b  # Safe for 8GB, fast
```

**Entry GPU (6-8GB VRAM):**

```shell
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```

**CPU-only:**

```shell
export CUDA_VISIBLE_DEVICES=-1  # Hide GPUs to force CPU inference
ollama pull llama3.2:3b
# Create a Modelfile with: PARAMETER num_thread 4
```
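Since `num_thread` should match physical cores rather than hyperthreads, a best-effort count can be obtained as below. This is a sketch: `psutil` is an optional third-party dependency, and the fallback assumes 2-way SMT (it will undercount on CPUs without hyperthreading):

```python
import os

def physical_cores():
    """Best-effort physical core count for PARAMETER num_thread."""
    try:
        import psutil  # optional dependency; reports physical cores directly
        cores = psutil.cpu_count(logical=False)
        if cores:
            return cores
    except ImportError:
        pass
    # Fallback: assume 2-way SMT, i.e. half the logical CPU count.
    return max(1, (os.cpu_count() or 2) // 2)
```

Using the physical count avoids oversubscribing the cores: scheduling one inference thread per hyperthread typically adds contention without adding throughput.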