# GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
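For orientation, a GGUF file opens with a small fixed header: the magic bytes `GGUF`, a format version, a tensor count, and a metadata key/value count, all little-endian. A minimal parsing sketch (the header values below are synthetic, not read from a real model):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: 4-byte magic, u32 version,
    u64 tensor count, u64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for illustration: version 3, 291 tensors, 24 metadata keys
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))  # {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```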
## When to use GGUF
Use GGUF when:
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference with no GPU available
- Needing flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality
Use alternatives instead:
- AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
- HQQ: Fast calibration-free quantization for HuggingFace
- bitsandbytes: Simple integration with transformers library
- TensorRT-LLM: Production NVIDIA deployment with maximum speed
## Quick start

### Installation
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```
### Convert model to GGUF
```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```
### Quantize model
```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
## Quantization types

### K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
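The sizes in the table follow roughly from bits per weight. A back-of-the-envelope sketch (the ~5% overhead factor is an assumption covering metadata and the higher-precision embedding/output tensors):

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float,
                        overhead: float = 1.05) -> float:
    """Estimate GGUF file size: parameters * bits / 8, plus a
    rough overhead factor for metadata and mixed-precision tensors."""
    return n_params_billion * 1e9 * bits_per_weight / 8 * overhead / 1e9

# 7B model at Q4_K_M (~4.5 bits/weight) -> about 4.1 GB, matching the table
print(f"{approx_gguf_size_gb(7.0, 4.5):.1f} GB")
```

Real K-quant files deviate slightly, since different tensor types within one model get different bit widths.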
### Legacy methods
| Type | Description |
|---|---|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
**Recommendation:** Use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.
## Conversion workflows

### Workflow 1: HuggingFace to GGUF
```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
### Workflow 2: With importance matrix (better quality)
```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35   # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```
### Workflow 3: Multiple quantizations
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
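The same loop can be driven from Python when you want to build the commands programmatically. This sketch only constructs the `llama-quantize` invocations (nothing is executed here, and the file names mirror the bash example above):

```python
from pathlib import Path

def quant_commands(model: str, imatrix: str, quants: list[str]) -> list[list[str]]:
    """Build one llama-quantize command per quant type, deriving the
    output name from the FP16 model name (e.g. *-q4_k_m.gguf)."""
    commands = []
    for quant in quants:
        out = f"{Path(model).stem.replace('-f16', '')}-{quant.lower()}.gguf"
        commands.append(["./llama-quantize", "--imatrix", imatrix, model, out, quant])
    return commands

for cmd in quant_commands("llama-3.1-8b-f16.gguf", "llama-3.1-8b.imatrix",
                          ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]):
    print(" ".join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```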
## Python usage

### llama-cpp-python
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
### Chat completion
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming
```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode

### Start OpenAI-compatible server
```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```
### Use with OpenAI client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization
### Apple Silicon (Metal)
```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```
### NVIDIA CUDA
```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
### CPU optimization
```bash
# Build with AVX2/AVX512
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
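The right `n_threads` is the physical core count, which Python's standard library cannot report directly. A rough heuristic, assuming 2-way SMT (adjust for your hardware):

```python
import os

def suggested_threads() -> int:
    """Guess physical cores: os.cpu_count() reports logical CPUs,
    so halve it assuming 2-way SMT/hyper-threading."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)

print(suggested_threads())
```

On machines without SMT, or with efficiency/performance core splits (Apple Silicon, recent Intel), benchmark a few values around this guess instead of trusting it.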
## Integration with tools

### Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```
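If you script model setup, the Modelfile can be rendered programmatically. A small sketch (the prompt TEMPLATE is omitted for brevity; the parameters mirror the example above):

```python
def render_modelfile(gguf_path: str, temperature: float = 0.7,
                     num_ctx: int = 4096) -> str:
    """Render a minimal Ollama Modelfile for a local GGUF file."""
    return (f"FROM {gguf_path}\n"
            f"PARAMETER temperature {temperature}\n"
            f"PARAMETER num_ctx {num_ctx}\n")

print(render_modelfile("./model-q4_k_m.gguf"))
```

Write the result to a file named `Modelfile`, then run `ollama create` as shown above.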
### LM Studio
- Place the GGUF file in `~/.cache/lm-studio/models/`
- Open LM Studio and select the model
- Configure context length and GPU offload
- Start inference
### text-generation-webui
```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
## Best practices
- Use K-quants: Q4_K_M offers best quality/size balance
- Use imatrix: Always use importance matrix for Q4 and below
- GPU offload: Offload as many layers as VRAM allows
- Context length: Start with 4096, increase if needed
- Thread count: Match physical CPU cores, not logical
- Batch size: Increase n_batch for faster prompt processing
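Choosing `-ngl` by trial and error works, but a rough estimate from VRAM and model size is a reasonable starting point. A heuristic sketch (the 1.5 GB reserve for KV cache and scratch buffers is an assumption, and layer sizes are treated as uniform):

```python
def max_gpu_layers(vram_gb: float, model_gb: float, n_layers: int,
                   reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, keeping a reserve
    for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb / per_layer_gb))

# 8 GB card, 4.1 GB Q4_K_M model with 32 layers: everything fits
print(max_gpu_layers(8.0, 4.1, 32))  # 32
```

If inference still runs out of memory at the suggested value, lower `-ngl` a few layers at a time; longer contexts grow the KV cache and shrink the usable budget.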
## Common issues

**Model loads slowly:**

```bash
# Use mmap for faster loading
./llama-cli -m model.gguf --mmap
```

**Out of memory:**

```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**

```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
## References

- Advanced Usage - Batching, speculative decoding, custom builds
- Troubleshooting - Common issues, debugging, benchmarks
Resources
资源
- Repository: https://github.com/ggml-org/llama.cpp
- Python Bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized Models: https://huggingface.co/TheBloke
- GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT