
Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.

Workflow

Phase 1: System Detection

Run the detection script to gather hardware information:
```bash
python3 scripts/detect_system.py
```
Parse the JSON output to identify:
  • OS and version
  • CPU model and core count
  • Total RAM / unified memory
  • GPU type, VRAM, and driver version
  • Current Ollama installation and environment variables
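The parsing step above can be sketched as follows. This is a minimal sketch, not the repository's actual code; the JSON field names (`os`, `cpu_cores`, `ram_gb`, `gpu`, `vram_gb`) are assumptions and should be matched to whatever `detect_system.py` actually emits.

```python
import json
import subprocess


def load_system_info() -> dict:
    """Run the detection script and parse its JSON output."""
    raw = subprocess.run(
        ["python3", "scripts/detect_system.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(raw)


def summarize(info: dict) -> dict:
    """Pull out the fields the workflow needs.

    Field names here are illustrative assumptions, not the
    script's documented schema.
    """
    return {
        "os": info.get("os"),
        "cpu_cores": info.get("cpu_cores"),
        "ram_gb": info.get("ram_gb"),
        "gpu": info.get("gpu"),
        "vram_gb": info.get("vram_gb"),
    }
```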

Phase 2: Analyze and Recommend

Based on detected hardware, determine the optimization profile:
Hardware Tier Classification:
| Tier | Criteria | Max Model | Key Optimizations |
|------|----------|-----------|-------------------|
| CPU-only | No GPU detected | 3B | `num_thread` tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |
Apple Silicon Special Case:
  • Unified memory = shared CPU/GPU RAM
  • 8GB Mac → treat as 6GB VRAM tier
  • 16GB Mac → treat as 12GB VRAM tier
  • 32GB+ Mac → treat as workstation tier
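The tier table and the Apple Silicon mapping can be sketched as a small classifier. This is an illustrative sketch: how VRAM sizes that fall between the listed ranges (e.g. 9GB or 30GB) round down to the lower tier is this sketch's assumption, not part of the table.

```python
def effective_vram_gb(ram_gb: float, vram_gb: float, unified: bool) -> float:
    """Apple Silicon special case: map unified memory to a VRAM-tier equivalent."""
    if not unified:
        return vram_gb
    if ram_gb >= 32:
        return 16.0   # treat as workstation tier
    if ram_gb >= 16:
        return 12.0   # treat as 12GB VRAM tier
    return 6.0        # 8GB Macs: treat as 6GB VRAM tier


def classify_tier(vram_gb: float, has_gpu: bool) -> tuple[str, str]:
    """Return (tier, max model) per the classification table."""
    if not has_gpu:
        return ("CPU-only", "3B")
    if vram_gb < 6:
        return ("Low VRAM", "3B")
    if vram_gb < 10:
        return ("Entry", "8B")
    if vram_gb < 16:
        return ("Prosumer", "14B")
    if vram_gb < 48:
        return ("Workstation", "32B")
    return ("High-end", "70B+")
```

For example, a 16GB M-series Mac maps to 12GB effective VRAM and therefore lands in the Prosumer tier with a 14B ceiling.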

Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

1. System Overview

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits to 7B models").

2. Dependency Assessment

List what's needed based on the platform:
  • macOS: Ollama only (Metal automatic)
  • Linux NVIDIA: Ollama + NVIDIA driver 450+
  • Linux AMD: Ollama + ROCm 5.0+
  • Windows: Ollama + NVIDIA driver 452+

3. Configuration Recommendations

Essential environment variables:

```bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for severe constraints
```
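Note that `export` only affects the current shell. On Linux, the Ollama server typically runs as a systemd service, so the variables must be set on the service itself. A sketch following the pattern described in Ollama's FAQ (run `sudo systemctl edit ollama` and add an override):

```
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
```

Then apply with `sudo systemctl daemon-reload && sudo systemctl restart ollama`. On macOS, `launchctl setenv OLLAMA_FLASH_ATTENTION 1` followed by an app restart serves the same purpose.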

**Model selection guidance:**
- Recommend specific models from `ollama list` output
- Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
- Warn if current models exceed hardware capacity

**Modelfile tuning (when needed):**

```
PARAMETER num_gpu <layers>     # Partial offload for limited VRAM
PARAMETER num_thread <cores>   # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>       # Reduce context for memory savings
```

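As a hedged illustration, a complete Modelfile for partially offloading an 8B model on a VRAM-limited card might look like the sketch below; the base model tag and all three numbers are placeholders, not tuned recommendations.

```
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_gpu 24      # offload 24 layers to GPU; the rest run on CPU
PARAMETER num_thread 8    # match the machine's physical core count
PARAMETER num_ctx 4096    # smaller context window to save memory
```

Build it with `ollama create llama3.1-tuned -f Modelfile`, then run `ollama run llama3.1-tuned` as usual.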

4. Execution Checklist

Provide copy-paste commands in order:
  1. Set environment variables
  2. Restart Ollama service
  3. Pull recommended models
  4. Test with `ollama run <model> --verbose`

5. Verification Commands

```bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```

Reference Files

  • VRAM Requirements - Model sizing and quantization guide
  • Environment Variables - Complete env var reference
  • Platform-Specific Setup - OS-specific installation and configuration

Output Format

Generate an `ollama-optimization-guide.md` file in the current directory with:

```markdown
# Ollama Optimization Guide

Generated: <timestamp>
System: <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- <step 1>
- <step 2> ...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>
```

Quick Optimization Commands


For users who want immediate results without full analysis:

macOS (Apple Silicon):

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b  # Safe for 8GB, fast
```

Linux/Windows with 8GB NVIDIA GPU:

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```

CPU-only systems:

```bash
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
# Create Modelfile with: PARAMETER num_thread 4
```