# Ollama Optimizer
Optimize Ollama configuration based on system hardware analysis.
## Workflow
### Phase 1: System Detection
Run the detection script to gather hardware information:

```bash
python3 scripts/detect_system.py
```

Parse the JSON output to identify:

- OS and version
- CPU model and core count
- Total RAM / unified memory
- GPU type, VRAM, and driver version
- Current Ollama installation and environment variables
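In script form, the parse step might look like the sketch below. The JSON field names (`os`, `cpu_cores`, `total_ram_gb`, `gpu`) are illustrative assumptions; the actual schema is whatever `scripts/detect_system.py` emits.

```python
import json
import subprocess

def detect_system() -> dict:
    """Run the detection script and parse its JSON output."""
    raw = subprocess.run(
        ["python3", "scripts/detect_system.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(raw)

def summarize(info: dict) -> str:
    """Condense parsed hardware info into a one-line summary.

    The field names here are assumptions for illustration; adjust
    them to the script's real output schema.
    """
    gpu = info.get("gpu") or "no GPU"
    return (f"{info.get('os', 'unknown OS')} | "
            f"{info.get('cpu_cores', '?')} cores | "
            f"{info.get('total_ram_gb', '?')}GB RAM | {gpu}")

# Stand-in payload (in place of the script's real output):
sample = '{"os": "macOS 14", "cpu_cores": 8, "total_ram_gb": 16, "gpu": "Apple M2"}'
print(summarize(json.loads(sample)))  # macOS 14 | 8 cores | 16GB RAM | Apple M2
```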
### Phase 2: Analyze and Recommend
Based on detected hardware, determine the optimization profile:

**Hardware Tier Classification:**

| Tier | Criteria | Max Model | Key Optimizations |
|---|---|---|---|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |

**Apple Silicon Special Case:**

- Unified memory = shared CPU/GPU RAM
- 8GB Mac → treat as 6GB VRAM tier
- 16GB Mac → treat as 12GB VRAM tier
- 32GB+ Mac → treat as workstation tier
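The tier table translates mechanically into code. A minimal sketch with thresholds taken from the table (the table leaves gaps, e.g. 9GB or 30GB, so the boundary handling below is a judgment call), plus the Apple Silicon unified-memory mapping:

```python
def effective_vram_gb(memory_gb: float, apple_silicon: bool = False) -> float:
    """On Apple Silicon, map unified memory to an effective VRAM figure
    per the special case above (8GB -> 6GB, 16GB -> 12GB, 32GB+ -> workstation)."""
    if not apple_silicon:
        return memory_gb
    if memory_gb >= 32:
        return 16  # workstation tier
    if memory_gb >= 16:
        return 12
    return 6

def classify_tier(vram_gb: float, has_gpu: bool = True) -> str:
    """Return the hardware tier from the classification table.

    Values falling in the table's gaps (e.g. 9GB) are rounded into
    the adjacent tier -- an assumption, not part of the table.
    """
    if not has_gpu:
        return "CPU-only"
    if vram_gb < 6:
        return "Low VRAM"
    if vram_gb <= 8:
        return "Entry"
    if vram_gb <= 12:
        return "Prosumer"
    if vram_gb <= 24:
        return "Workstation"
    return "High-end"

# e.g. a 16GB MacBook is treated as a 12GB-VRAM (Prosumer) machine:
tier = classify_tier(effective_vram_gb(16, apple_silicon=True))
```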
### Phase 3: Generate Optimization Plan
Create a structured optimization guide with these sections:
#### 1. System Overview
Present the detected hardware specs and highlight constraints (e.g., "8GB unified memory limits you to 7B-class models").
#### 2. Dependency Assessment
List what's needed based on the platform:

- macOS: Ollama only (Metal automatic)
- Linux NVIDIA: Ollama + NVIDIA driver 450+
- Linux AMD: Ollama + ROCm 5.0+
- Windows: Ollama + NVIDIA driver 452+
#### 3. Configuration Recommendations
Essential environment variables:

```bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for severe constraints
```
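The same choices can be expressed as a small helper. The <12GB threshold and cache types come from the guidance above; the "severe constraints" cutoff (here 8GB) is an assumption:

```python
def recommend_env(vram_gb: float) -> dict:
    """Suggest Ollama environment variables for the given memory budget."""
    env = {"OLLAMA_FLASH_ATTENTION": "1"}  # always recommended
    if vram_gb < 8:       # severe constraints (cutoff is an assumption)
        env["OLLAMA_KV_CACHE_TYPE"] = "q4_0"
    elif vram_gb < 12:    # memory-constrained per the guide
        env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"
    return env

# e.g. a 10GB card gets flash attention plus a q8_0 KV cache:
settings = recommend_env(10)
```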
**Model selection guidance:**
- Recommend specific models from `ollama list` output
- Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
- Warn if current models exceed hardware capacity
**Modelfile tuning (when needed):**

```
PARAMETER num_gpu <layers>    # Partial offload for limited VRAM
PARAMETER num_thread <cores>  # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>      # Reduce context for memory savings
```
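Choosing `num_gpu` for partial offload amounts to estimating how many layers fit in free VRAM. A rough sketch; the equal-layer-size model and the overhead reserve are simplifying assumptions, not measured values:

```python
def estimate_num_gpu(free_vram_gb: float, n_layers: int,
                     model_size_gb: float, overhead_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit on the GPU.

    Assumes layers are roughly equal in size (model_size_gb / n_layers)
    and reserves overhead_gb for the KV cache and scratch buffers --
    both simplifying assumptions; verify with nvidia-smi after loading.
    """
    per_layer_gb = model_size_gb / n_layers
    budget = max(free_vram_gb - overhead_gb, 0)
    return min(n_layers, int(budget / per_layer_gb))

# e.g. an 8B model at Q4_K_M (~4.9GB, 32 layers) on a card with 4GB free:
layers = estimate_num_gpu(free_vram_gb=4, n_layers=32, model_size_gb=4.9)
# -> 19, i.e. PARAMETER num_gpu 19
```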
#### 4. Execution Checklist
Provide copy-paste commands in order:

- Set environment variables
- Restart the Ollama service
- Pull recommended models
- Test with `ollama run <model> --verbose`
#### 5. Verification Commands
```bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```
## Reference Files

- VRAM Requirements - Model sizing and quantization guide
- Environment Variables - Complete env var reference
- Platform-Specific Setup - OS-specific installation and configuration
## Output Format
Generate an `ollama-optimization-guide.md` file in the current directory with:

```markdown
# Ollama Optimization Guide

Generated: <timestamp>
System: <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- <step 1>
- <step 2> ...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>
```
## Quick Optimization Commands
For users who want immediate results without full analysis:

**macOS (Apple Silicon):**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b  # Safe for 8GB, fast
```

**Linux/Windows with 8GB NVIDIA GPU:**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```

**CPU-only systems:**

```bash
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
```

Create a Modelfile with: `PARAMETER num_thread 4`
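Writing the Modelfile can itself be scripted. A minimal sketch that renders the file from a parameter dict; the base model name and `my-tuned-model` tag are placeholders:

```python
from pathlib import Path

def write_modelfile(base: str, params: dict, path: str = "Modelfile") -> str:
    """Render a Modelfile with a FROM line plus PARAMETER directives."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text)
    return text

text = write_modelfile("llama3.2:3b", {"num_thread": 4})
# Then register it with:  ollama create my-tuned-model -f Modelfile
```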