# Using MLX for LLMs on Apple Silicon

MLX-LM is a Python package for running large language models on Apple Silicon. It builds on the MLX framework and Apple's unified memory architecture for optimized performance.


## Core Concepts

### Why MLX

| Aspect | PyTorch on Mac | MLX |
|---|---|---|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |

MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.

### Supported Models

MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community on Hugging Face for pre-converted models.

## Installation

```bash
pip install mlx-lm
```

Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).

## Text Generation

### Python API

```python
from mlx_lm import load, generate

# Load model (from HF hub or local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temp=0.7,
)
print(response)
```

### Streaming Generation

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()
```

### Batch Generation

```python
from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]

responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)

for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
```

### CLI Generation

```bash
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Explain recursion:" \
    --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Write a poem about AI:" \
    --temp 0.8 \
    --top-p 0.95
```
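To clarify what `--temp` and `--top-p` control: temperature rescales the logits before the softmax, and nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p`. The following is a toy illustration of that idea, not mlx_lm's internal implementation:

```python
import math
import random

def top_p_sample(logits, temperature=0.8, top_p=0.95, rng=None):
    """Toy nucleus sampling, for illustration only."""
    rng = rng or random
    # Temperature rescales the logits before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    # Rank token probabilities in descending order.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    norm = sum(p for p, _ in kept)
    r = rng.random() * norm
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

# With one dominant logit, the nucleus collapses to a single token:
print(top_p_sample([10.0, 1.0, 0.0, -5.0]))  # 0
```

Lower temperatures sharpen the distribution (more deterministic output); higher `top_p` admits more of the tail (more diverse output).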

## Interactive Chat

### CLI Chat

```bash
# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
```

### Python Chat

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```

## Model Conversion

Convert Hugging Face models to MLX format:

### CLI Conversion

```bash
# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
    -q  # Quantize to 4-bit

# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --q-bits 8 \
    --q-group-size 64

# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
    -q \
    --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
```

### Python Conversion

```python
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```

### Conversion Options

| Option | Default | Description |
|---|---|---|
| `--q-bits` | 4 | Quantization bits (4 or 8) |
| `--q-group-size` | 64 | Group size for quantization |
| `--dtype` | float16 | Data type for non-quantized weights |

## Quantization

MLX supports multiple quantization methods for different use cases:

| Method | Best For | Command |
|---|---|---|
| Basic | Quick conversion | `mlx_lm.convert -q` |
| DWQ | Quality-preserving | `mlx_lm.dwq` |
| AWQ | Activation-aware | `mlx_lm.awq` |
| Dynamic | Per-layer precision | `mlx_lm.dynamic_quant` |
| GPTQ | Established method | `mlx_lm.gptq` |
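As a rough back-of-envelope check (an approximation, not an mlx_lm API), group quantization packs each weight into `q_bits` bits and amortizes a small per-group overhead for the shared scale and bias (assumed fp16 here), so a 7B-parameter model at 4-bit with group size 64 lands near 3.7 GB:

```python
def quantized_size_gb(n_params, q_bits=4, group_size=64, scale_bytes=2):
    """Rough size estimate for group-quantized weights (approximation only)."""
    per_weight = q_bits / 8  # packed weight bytes
    # Each group of `group_size` weights shares a scale and bias (fp16 assumed),
    # which amortizes to 2 * scale_bytes / group_size bytes per weight.
    per_group_overhead = 2 * scale_bytes / group_size
    return n_params * (per_weight + per_group_overhead) / 1024**3

# A 7B-parameter model at 4-bit, group size 64:
print(f"{quantized_size_gb(7e9):.1f} GB")  # ≈ 3.7 GB
```

This ignores embeddings, activations, and the KV cache, so treat it as a lower bound when judging whether a model fits in RAM.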

### Quick Quantization

```bash
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
```

For detailed coverage of each method, see `reference/quantization.md`.

## Fine-tuning with LoRA

MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.

### Quick Start

Prepare training data in JSONL format, one record per line:

```json
{"text": "Your training text here"}
```

or

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

```bash
# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 1000

# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --prompt "Your prompt here"
```
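Writing the JSONL training data needs only the standard library. A minimal sketch, with hypothetical example records (the `data/train.jsonl` layout assumes the directory passed to `--data`; a real dataset should stick to a single format, `text` or `messages`, throughout):

```python
import json
from pathlib import Path

# Hypothetical chat-format records for illustration.
samples = [
    {"messages": [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "An array framework for Apple Silicon."},
    ]},
    {"messages": [
        {"role": "user", "content": "What does the -q flag do?"},
        {"role": "assistant", "content": "It enables quantization during conversion."},
    ]},
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
with open(data_dir / "train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")  # one JSON object per line
```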

### Fuse Adapter into Model

```bash
# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --export-gguf
```

For detailed LoRA configuration and training patterns, see `reference/fine-tuning.md`.

## Serving Models

### OpenAI-Compatible Server

```bash
# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Use with any OpenAI-compatible client
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 256}'
```

### Python Client

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
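If the `openai` package is unavailable, the server can also be called with only the standard library. A sketch, assuming a server already running on port 8080 (the actual network call is left commented out):

```python
import json
import urllib.request

def build_chat_request(base_url, messages, model="default", max_tokens=256):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080",
    [{"role": "user", "content": "Hello!"}],
)

# Sending requires mlx_lm.server to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```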

## Best Practices

1. **Use pre-quantized models**: download from mlx-community on Hugging Face for immediate use.
2. **Match quantization to your hardware**: on an 8 GB M1/M2, use 4-bit; on M2/M3 Pro/Max, use 8-bit for higher quality.
3. **Leverage unified memory**: unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower, but it works).
4. **Use streaming for UX**: `stream_generate` provides responsive output for interactive applications.
5. **Cache prompt prefixes**: use `mlx_lm.cache_prompt` for repeated prompts with varying suffixes.
6. **Batch similar requests**: `batch_generate` is more efficient than sequential generation.
7. **Start with 4-bit quantization**: it is a good quality/size tradeoff; upgrade to 8-bit if quality suffers.
8. **Fuse adapters for deployment**: after fine-tuning, fuse adapters for faster inference without loading them separately.
9. **Monitor memory with Activity Monitor**: watch memory pressure to avoid swap thrashing.
10. **Use chat templates**: always apply `tokenizer.apply_chat_template()` for instruction-tuned models.

## References

See `reference/` for detailed documentation:

- `quantization.md` - detailed quantization methods and when to use each
- `fine-tuning.md` - complete LoRA/QLoRA training guide with data formats and configuration