# Using MLX for LLMs on Apple Silicon
MLX-LM is a Python package for running large language models on Apple Silicon, leveraging the MLX framework for optimized performance with unified memory architecture.
## Table of Contents

- [Core Concepts](#core-concepts)
- [Installation](#installation)
- [Text Generation](#text-generation)
- [Interactive Chat](#interactive-chat)
- [Model Conversion](#model-conversion)
- [Quantization](#quantization)
- [Fine-tuning with LoRA](#fine-tuning-with-lora)
- [Serving Models](#serving-models)
- [Best Practices](#best-practices)
- [References](#references)

## Core Concepts

### Why MLX
| Aspect | PyTorch on Mac | MLX |
|---|---|---|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |
MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.
### Supported Models

MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community organization on Hugging Face for pre-converted models.
## Installation

```bash
pip install mlx-lm
```

Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).
## Text Generation

### Python API

```python
from mlx_lm import load, generate

# Load model (from the HF hub or a local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temp=0.7,
)
print(response)
```
### Streaming Generation

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()
```

### Batch Generation

```python
from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]
responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
```

### CLI Generation
```bash
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Explain recursion:" \
    --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Write a poem about AI:" \
    --temp 0.8 \
    --top-p 0.95
```
## Interactive Chat

### CLI Chat

```bash
# Start a chat REPL (context is preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
```

### Python Chat
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```
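Under the hood, `apply_chat_template` simply renders the message list into a single string containing model-specific special tokens. As a rough illustration (the token names below follow the Llama-3 style and are an assumption for demonstration; the authoritative template ships inside each model's tokenizer config):

```python
# Illustrative sketch of what a chat template expands to for a
# Llama-3-style model. The special tokens here are assumptions;
# always rely on tokenizer.apply_chat_template() in real code.
def render_llama3_style(messages, add_generation_prompt=True):
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Leave the prompt open for the assistant's reply
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]
print(render_llama3_style(messages))
```

This is why skipping the chat template on an instruction-tuned model degrades output quality: the model was trained to see these delimiters around every turn.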
## Model Conversion
Convert Hugging Face models to MLX format:
### CLI Conversion

```bash
# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct -q

# With specific quantization settings
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --q-bits 8 \
    --q-group-size 64

# Upload to the Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
    -q \
    --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
```
### Python Conversion

```python
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```

### Conversion Options
| Option | Default | Description |
|---|---|---|
| `--q-bits` | 4 | Quantization bits (4 or 8) |
| `--q-group-size` | 64 | Group size for quantization |
| `--dtype` | float16 | Data type for non-quantized weights |
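The group size above determines how much per-group metadata the quantized weights carry, which is what separates the nominal bit width from the real memory footprint. A back-of-the-envelope sketch, assuming one fp16 scale and one fp16 bias per group (a simplification; exact storage varies by implementation):

```python
# Effective bits per weight for group-wise quantization:
# q_bits plus the amortized per-group overhead.
def effective_bits(q_bits, group_size, overhead_bytes_per_group=4):
    return q_bits + overhead_bytes_per_group * 8 / group_size

# Approximate weight memory in GB for a model with n_params parameters.
def approx_gb(n_params, q_bits, group_size):
    return n_params * effective_bits(q_bits, group_size) / 8 / 1e9

n = 7e9  # a 7B-parameter model
print(f"4-bit, group 64: {approx_gb(n, 4, 64):.2f} GB")  # 3.94 GB
print(f"8-bit, group 64: {approx_gb(n, 8, 64):.2f} GB")  # 7.44 GB
print(f"fp16 baseline:   {n * 2 / 1e9:.2f} GB")          # 14.00 GB
```

Smaller groups track the weight distribution more closely (better quality) at the cost of more metadata per weight; group size 64 is the default tradeoff.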
## Quantization

MLX supports multiple quantization methods for different use cases:

| Method | Best For |
|---|---|
| Basic | Quick conversion |
| DWQ | Quality-preserving |
| AWQ | Activation-aware |
| Dynamic | Per-layer precision |
| GPTQ | Established method |
### Quick Quantization

```bash
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
```

For detailed coverage of each method, see `reference/quantization.md`.

## Fine-tuning with LoRA
MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.
### Quick Start

Prepare training data in JSONL format, either plain text:

```json
{"text": "Your training text here"}
```

or chat messages:

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
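A minimal sketch of preparing a data directory with Python's standard library, assuming `mlx_lm.lora` looks for `train.jsonl` and `valid.jsonl` inside the `--data` directory (use one format consistently per file):

```python
# Write small train/valid JSONL splits into ./data for LoRA fine-tuning.
# File names and directory layout are assumptions based on the defaults.
import json
from pathlib import Path

examples = [
    {"messages": [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "An array framework for Apple Silicon."},
    ]},
    {"messages": [
        {"role": "user", "content": "What does LoRA train?"},
        {"role": "assistant", "content": "Small low-rank adapter matrices."},
    ]},
]

def write_jsonl(path, rows):
    # One JSON object per line, UTF-8, no ASCII escaping
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)
write_jsonl(data_dir / "train.jsonl", examples)
write_jsonl(data_dir / "valid.jsonl", examples[:1])
```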
```bash
# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 1000

# Generate with the trained adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --prompt "Your prompt here"
```

### Fuse Adapter into Model
```bash
# Merge LoRA weights into the base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --export-gguf
```

For detailed LoRA configuration and training patterns, see `reference/fine-tuning.md`.

## Serving Models
### OpenAI-Compatible Server

```bash
# Start the server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Call it via the OpenAI-style REST API
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "default", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 256 }'
```

### Python Client
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
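If you would rather not depend on the `openai` package, the same endpoint can be reached with the standard library. A sketch, assuming the server above is running on localhost:8080 (the network call is kept inside a function so nothing fires on import):

```python
# Stdlib client for the OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

def build_payload(prompt, max_tokens=100):
    # Same request shape the openai client sends
    return {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8080/v1", max_tokens=100):
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Explain MLX in one sentence.")  # requires a running mlx_lm.server
```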
## Best Practices

- **Use pre-quantized models**: Download from mlx-community on Hugging Face for immediate use
- **Match quantization to your hardware**: M1/M2 with 8GB: use 4-bit; M2/M3 Pro/Max: 8-bit for quality
- **Leverage unified memory**: Unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower, but it works)
- **Use streaming for UX**: `stream_generate` provides responsive output for interactive applications
- **Cache prompt prefixes**: Use `mlx_lm.cache_prompt` for repeated prompts with varying suffixes
- **Batch similar requests**: `batch_generate` is more efficient than sequential generation
- **Start with 4-bit quantization**: Good quality/size tradeoff; upgrade to 8-bit if you hit quality issues
- **Fuse adapters for deployment**: After fine-tuning, fuse adapters for faster inference without loading them separately
- **Monitor memory with Activity Monitor**: Watch memory pressure to avoid swap thrashing
- **Use chat templates**: Always apply `tokenizer.apply_chat_template()` for instruction-tuned models
## References

See `reference/` for detailed documentation:

- `quantization.md`: Detailed quantization methods and when to use each
- `fine-tuning.md`: Complete LoRA/QLoRA training guide with data formats and configuration