# Using MLX for LLMs on Apple Silicon

MLX-LM is a Python package for running large language models on Apple Silicon. It builds on the MLX framework and Apple's unified memory architecture for optimized performance.


## Core Concepts

### Why MLX

| Aspect | PyTorch on Mac | MLX |
|---|---|---|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |

MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.

### Supported Models

MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community on Hugging Face for pre-converted models.

## Installation

```bash
pip install mlx-lm
```

Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).

## Text Generation

### Python API

```python
from mlx_lm import load, generate

# Load model (from HF hub or local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temp=0.7,
)
print(response)
```

### Streaming Generation

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()
```

### Batch Generation

```python
from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]

responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)

for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
```

### CLI Generation

```bash
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Explain recursion:" \
    --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Write a poem about AI:" \
    --temp 0.8 \
    --top-p 0.95
```
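To clarify what `--temp` and `--top-p` control: temperature rescales the logits before the softmax, and nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p`. The following is a toy illustration of that idea, not mlx_lm's internal implementation:

```python
import math
import random

def top_p_sample(logits, temperature=0.8, top_p=0.95, rng=None):
    """Toy nucleus sampling, for illustration only."""
    rng = rng or random
    # Temperature rescales the logits before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    # Rank token probabilities in descending order.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    norm = sum(p for p, _ in kept)
    r = rng.random() * norm
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

# With one dominant logit, the nucleus collapses to a single token:
print(top_p_sample([10.0, 1.0, 0.0, -5.0]))  # 0
```

Lower temperatures sharpen the distribution (more deterministic output); higher `top_p` admits more of the tail (more diverse output).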

## Interactive Chat

### CLI Chat

```bash
# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
```

### Python Chat

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```

## Model Conversion

Convert Hugging Face models to MLX format:

### CLI Conversion

```bash
# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
    -q  # Quantize to 4-bit

# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --q-bits 8 \
    --q-group-size 64

# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
    -q \
    --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
```

### Python Conversion

```python
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```

### Conversion Options

| Option | Default | Description |
|---|---|---|
| `--q-bits` | 4 | Quantization bits (4 or 8) |
| `--q-group-size` | 64 | Group size for quantization |
| `--dtype` | float16 | Data type for non-quantized weights |

## Quantization

MLX supports multiple quantization methods for different use cases:

| Method | Best For | Command |
|---|---|---|
| Basic | Quick conversion | `mlx_lm.convert -q` |
| DWQ | Quality-preserving | `mlx_lm.dwq` |
| AWQ | Activation-aware | `mlx_lm.awq` |
| Dynamic | Per-layer precision | `mlx_lm.dynamic_quant` |
| GPTQ | Established method | `mlx_lm.gptq` |
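As a rough back-of-envelope check (an approximation, not an mlx_lm API), group quantization packs each weight into `q_bits` bits and amortizes a small per-group overhead for the shared scale and bias (assumed fp16 here), so a 7B-parameter model at 4-bit with group size 64 lands near 3.7 GB:

```python
def quantized_size_gb(n_params, q_bits=4, group_size=64, scale_bytes=2):
    """Rough size estimate for group-quantized weights (approximation only)."""
    per_weight = q_bits / 8  # packed weight bytes
    # Each group of `group_size` weights shares a scale and bias (fp16 assumed),
    # which amortizes to 2 * scale_bytes / group_size bytes per weight.
    per_group_overhead = 2 * scale_bytes / group_size
    return n_params * (per_weight + per_group_overhead) / 1024**3

# A 7B-parameter model at 4-bit, group size 64:
print(f"{quantized_size_gb(7e9):.1f} GB")  # ≈ 3.7 GB
```

This ignores embeddings, activations, and the KV cache, so treat it as a lower bound when judging whether a model fits in RAM.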

### Quick Quantization

```bash
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
```

For detailed coverage of each method, see `reference/quantization.md`.

## Fine-tuning with LoRA

MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.

### Quick Start

Prepare training data in JSONL format, one record per line:

```json
{"text": "Your training text here"}
```

or

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

```bash
# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 1000

# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --prompt "Your prompt here"
```
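Writing the JSONL training data needs only the standard library. A minimal sketch, with hypothetical example records (the `data/train.jsonl` layout assumes the directory passed to `--data`; a real dataset should stick to a single format, `text` or `messages`, throughout):

```python
import json
from pathlib import Path

# Hypothetical chat-format records for illustration.
samples = [
    {"messages": [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "An array framework for Apple Silicon."},
    ]},
    {"messages": [
        {"role": "user", "content": "What does the -q flag do?"},
        {"role": "assistant", "content": "It enables quantization during conversion."},
    ]},
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
with open(data_dir / "train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")  # one JSON object per line
```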

### Fuse Adapter into Model

```bash
# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --export-gguf
```

For detailed LoRA configuration and training patterns, see `reference/fine-tuning.md`.

## Serving Models

### OpenAI-Compatible Server

```bash
# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Use with any OpenAI-compatible client
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 256}'
```

### Python Client

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```
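If the `openai` package is unavailable, the server can also be called with only the standard library. A sketch, assuming a server already running on port 8080 (the actual network call is left commented out):

```python
import json
import urllib.request

def build_chat_request(base_url, messages, model="default", max_tokens=256):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080",
    [{"role": "user", "content": "Hello!"}],
)

# Sending requires mlx_lm.server to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```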

## Best Practices

1. **Use pre-quantized models**: download from mlx-community on Hugging Face for immediate use.
2. **Match quantization to your hardware**: on an 8 GB M1/M2, use 4-bit; on M2/M3 Pro/Max, use 8-bit for higher quality.
3. **Leverage unified memory**: unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower, but it works).
4. **Use streaming for UX**: `stream_generate` provides responsive output for interactive applications.
5. **Cache prompt prefixes**: use `mlx_lm.cache_prompt` for repeated prompts with varying suffixes.
6. **Batch similar requests**: `batch_generate` is more efficient than sequential generation.
7. **Start with 4-bit quantization**: it is a good quality/size tradeoff; upgrade to 8-bit if quality suffers.
8. **Fuse adapters for deployment**: after fine-tuning, fuse adapters for faster inference without loading them separately.
9. **Monitor memory with Activity Monitor**: watch memory pressure to avoid swap thrashing.
10. **Use chat templates**: always apply `tokenizer.apply_chat_template()` for instruction-tuned models.

## References

See `reference/` for detailed documentation:

- `quantization.md` - detailed quantization methods and when to use each
- `fine-tuning.md` - complete LoRA/QLoRA training guide with data formats and configuration