# GGUF - Quantization Format for llama.cpp

The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
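As a concrete illustration of the container format: per the published GGUF specification, every file begins with the 4-byte ASCII magic `GGUF` followed by a little-endian `uint32` format version. A minimal header check might look like this (a sketch based on the spec, not llama.cpp's own loader):

```python
import struct

def read_gguf_header(buf: bytes) -> int:
    """Validate the GGUF magic and return the format version.

    GGUF files start with the ASCII magic b"GGUF" followed by a
    little-endian uint32 version (3 for current files).
    """
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    (version,) = struct.unpack_from("<I", buf, 4)
    return version

# Synthetic header for demonstration
header = b"GGUF" + struct.pack("<I", 3)
print(read_gguf_header(header))  # 3
```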

## When to use GGUF

Use GGUF when:

- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference with no GPU requirement
- Wanting flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

- Universal hardware: CPU, Apple Silicon, NVIDIA, and AMD support
- No Python runtime: pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: importance matrix for better low-bit quality

Use alternatives instead when:

- AWQ/GPTQ: maximum accuracy with calibration on NVIDIA GPUs
- HQQ: fast calibration-free quantization for HuggingFace
- bitsandbytes: simple integration with the transformers library
- TensorRT-LLM: production NVIDIA deployment with maximum speed
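The guidance above amounts to a simple decision rule. A hypothetical helper (the hardware labels and function name are illustrative, not from any library):

```python
def pick_quant_backend(hardware: str, needs_calibration_accuracy: bool = False) -> str:
    """Illustrative decision helper mirroring the guidance above.

    hardware: "cpu", "apple_silicon", or "nvidia_gpu" (hypothetical labels).
    """
    if hardware in ("cpu", "apple_silicon"):
        return "GGUF"  # llama.cpp covers CPU and Metal directly
    if hardware == "nvidia_gpu":
        # On NVIDIA, calibrated methods can squeeze out more accuracy
        return "AWQ/GPTQ" if needs_calibration_accuracy else "GGUF"
    raise ValueError(f"unknown hardware: {hardware}")

print(pick_quant_backend("apple_silicon"))  # GGUF
```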

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
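The size column roughly tracks bits-per-weight: as a back-of-the-envelope check, file size ≈ parameter count × bits ÷ 8. A sketch (the helper name is ours; real files run slightly larger than this lower bound because embeddings and some tensors stay at higher precision):

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound size estimate: params * bpw / 8 bytes, in GB.

    Real GGUF files are somewhat larger because some tensors are
    kept at higher precision than the nominal quant level.
    """
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M's ~4.5 bits/weight -> close to the ~4.1 GB in the table
print(round(approx_gguf_size_gb(7e9, 4.5), 2))  # 3.94
```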

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation:** use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.

## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35   # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```

## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # GPU offload (0 for CPU only)
    n_threads=8        # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

## Server mode

### Start an OpenAI-compatible server

```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```

### Use with the OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

Python with Metal:

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512 (detected automatically)
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

Python CPU config:

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
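The "match physical cores" advice can be approximated in pure stdlib Python by assuming a fixed SMT factor of two hardware threads per core (an assumption; `psutil.cpu_count(logical=False)` reports the real physical count):

```python
import os

def suggested_threads(smt_per_core: int = 2) -> int:
    """Guess physical core count from logical CPUs, assuming a fixed
    SMT factor. psutil.cpu_count(logical=False) is more accurate."""
    logical = os.cpu_count() or 1
    return max(1, logical // smt_per_core)

print(suggested_threads())
```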

## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```

### LM Studio

1. Place the GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place the model in the models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants:** Q4_K_M offers the best quality/size balance
2. **Use imatrix:** always use an importance matrix for Q4 and below
3. **GPU offload:** offload as many layers as VRAM allows
4. **Context length:** start with 4096 and increase if needed
5. **Thread count:** match physical CPU cores, not logical
6. **Batch size:** increase `n_batch` for faster prompt processing
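Picking a quant level for a given memory budget can be mechanized. A sketch using the approximate 7B file sizes from the quantization table earlier (the helper and its headroom factor are illustrative assumptions):

```python
# Approximate 7B file sizes (GB) from the quantization table
SIZES_7B_GB = {
    "Q2_K": 2.8, "Q3_K_M": 3.3, "Q4_K_M": 4.1,
    "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.2,
}

def best_quant_for_budget(budget_gb, headroom=1.2):
    """Largest quant whose file size, scaled by a headroom factor for
    KV cache and activations, fits in budget_gb. None if nothing fits."""
    fitting = [(size, q) for q, size in SIZES_7B_GB.items()
               if size * headroom <= budget_gb]
    return max(fitting)[1] if fitting else None

print(best_quant_for_budget(6.0))  # Q5_K_M
```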

## Common issues

**Model loads slowly:**

```bash
# Use mmap for faster loading
./llama-cli -m model.gguf --mmap
```

**Out of memory:**

```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduced from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**

```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
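For the out-of-memory case, the `-ngl` value can be estimated rather than guessed: divide the VRAM you can spare by the approximate per-layer weight size. A rough sketch (the function name and the uniform-layer-size assumption are ours):

```python
def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Rough -ngl estimate, assuming layers are uniform in size and
    reserving some VRAM for the KV cache and scratch buffers."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# 4.1 GB Q4_K_M file, 32 layers, 8 GB card -> all 32 layers fit
print(estimate_gpu_layers(4.1, 32, 8.0))  # 32
```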

## References

  • Advanced Usage - Batching, speculative decoding, custom builds
  • Troubleshooting - Common issues, debugging, benchmarks

## Resources