# GGUF - Quantization Format for llama.cpp

The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
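As a concrete illustration of the container format: per the published GGUF specification, every file begins with the 4-byte ASCII magic `GGUF` followed by a little-endian `uint32` format version. A minimal header check might look like this (a sketch based on the spec, not llama.cpp's own loader):

```python
import struct

def read_gguf_header(buf: bytes) -> int:
    """Validate the GGUF magic and return the format version.

    GGUF files start with the ASCII magic b"GGUF" followed by a
    little-endian uint32 version (3 for current files).
    """
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    (version,) = struct.unpack_from("<I", buf, 4)
    return version

# Synthetic header for demonstration
header = b"GGUF" + struct.pack("<I", 3)
print(read_gguf_header(header))  # 3
```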

## When to use GGUF

Use GGUF when:

- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference with no GPU requirement
- Wanting flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

- Universal hardware: CPU, Apple Silicon, NVIDIA, and AMD support
- No Python runtime: pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: importance matrix for better low-bit quality

Use alternatives instead when:

- AWQ/GPTQ: maximum accuracy with calibration on NVIDIA GPUs
- HQQ: fast calibration-free quantization for HuggingFace
- bitsandbytes: simple integration with the transformers library
- TensorRT-LLM: production NVIDIA deployment with maximum speed
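The guidance above amounts to a simple decision rule. A hypothetical helper (the hardware labels and function name are illustrative, not from any library):

```python
def pick_quant_backend(hardware: str, needs_calibration_accuracy: bool = False) -> str:
    """Illustrative decision helper mirroring the guidance above.

    hardware: "cpu", "apple_silicon", or "nvidia_gpu" (hypothetical labels).
    """
    if hardware in ("cpu", "apple_silicon"):
        return "GGUF"  # llama.cpp covers CPU and Metal directly
    if hardware == "nvidia_gpu":
        # On NVIDIA, calibrated methods can squeeze out more accuracy
        return "AWQ/GPTQ" if needs_calibration_accuracy else "GGUF"
    raise ValueError(f"unknown hardware: {hardware}")

print(pick_quant_backend("apple_silicon"))  # GGUF
```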

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
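The size column roughly tracks bits-per-weight: as a back-of-the-envelope check, file size ≈ parameter count × bits ÷ 8. A sketch (the helper name is ours; real files run slightly larger than this lower bound because embeddings and some tensors stay at higher precision):

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound size estimate: params * bpw / 8 bytes, in GB.

    Real GGUF files are somewhat larger because some tensors are
    kept at higher precision than the nominal quant level.
    """
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M's ~4.5 bits/weight -> close to the ~4.1 GB in the table
print(round(approx_gguf_size_gb(7e9, 4.5), 2))  # 3.94
```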

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation:** use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.

## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35   # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```

## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # GPU offload (0 for CPU only)
    n_threads=8        # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

## Server mode

### Start an OpenAI-compatible server

```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```

### Use with the OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

Python with Metal:

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512 (detected automatically)
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

Python CPU config:

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
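The "match physical cores" advice can be approximated in pure stdlib Python by assuming a fixed SMT factor of two hardware threads per core (an assumption; `psutil.cpu_count(logical=False)` reports the real physical count):

```python
import os

def suggested_threads(smt_per_core: int = 2) -> int:
    """Guess physical core count from logical CPUs, assuming a fixed
    SMT factor. psutil.cpu_count(logical=False) is more accurate."""
    logical = os.cpu_count() or 1
    return max(1, logical // smt_per_core)

print(suggested_threads())
```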

## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```

### LM Studio

1. Place the GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place the model in the models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants:** Q4_K_M offers the best quality/size balance
2. **Use imatrix:** always use an importance matrix for Q4 and below
3. **GPU offload:** offload as many layers as VRAM allows
4. **Context length:** start with 4096 and increase if needed
5. **Thread count:** match physical CPU cores, not logical
6. **Batch size:** increase `n_batch` for faster prompt processing
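Picking a quant level for a given memory budget can be mechanized. A sketch using the approximate 7B file sizes from the quantization table earlier (the helper and its headroom factor are illustrative assumptions):

```python
# Approximate 7B file sizes (GB) from the quantization table
SIZES_7B_GB = {
    "Q2_K": 2.8, "Q3_K_M": 3.3, "Q4_K_M": 4.1,
    "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.2,
}

def best_quant_for_budget(budget_gb, headroom=1.2):
    """Largest quant whose file size, scaled by a headroom factor for
    KV cache and activations, fits in budget_gb. None if nothing fits."""
    fitting = [(size, q) for q, size in SIZES_7B_GB.items()
               if size * headroom <= budget_gb]
    return max(fitting)[1] if fitting else None

print(best_quant_for_budget(6.0))  # Q5_K_M
```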

## Common issues

**Model loads slowly:**

```bash
# Use mmap for faster loading
./llama-cli -m model.gguf --mmap
```

**Out of memory:**

```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduced from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**

```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
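For the out-of-memory case, the `-ngl` value can be estimated rather than guessed: divide the VRAM you can spare by the approximate per-layer weight size. A rough sketch (the function name and the uniform-layer-size assumption are ours):

```python
def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Rough -ngl estimate, assuming layers are uniform in size and
    reserving some VRAM for the KV cache and scratch buffers."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# 4.1 GB Q4_K_M file, 32 layers, 8 GB card -> all 32 layers fit
print(estimate_gpu_layers(4.1, 32, 8.0))  # 32
```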

## References

  • Advanced Usage - Batching, speculative decoding, custom builds
  • Troubleshooting - Common issues, debugging, benchmarks

## Resources