# mac-code — Free Local AI Agent on Apple Silicon
Skill by ara.so — Daily 2026 Skills collection.
Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.
## What It Does
- LLM-as-router: the model classifies every prompt as `chat`, `search`, or `shell` and routes accordingly
- 35B MoE at 30 tok/s via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
- 35B full Q4 on 16 GB via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
- 9B at 64K context via `q4_0`-quantized KV cache (keys/values)
- MLX backend adds persistent KV cache save/load, context compression, R2 sync
- Tools: DuckDuckGo search, shell execution, file read/write
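The router step can be sketched against the OpenAI-compatible endpoint both backends expose. This is a hedged illustration, not mac-code's actual implementation: the system prompt wording and the `normalize_route` helper are assumptions.

```python
VALID_ROUTES = {"search", "shell", "chat"}

def normalize_route(raw: str) -> str:
    """Map the model's raw reply onto a valid route, defaulting to chat."""
    label = raw.strip().strip('"').strip("'").lower()
    return label if label in VALID_ROUTES else "chat"

def route(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    """Ask the local model to classify a prompt as search | shell | chat."""
    import requests  # deferred so the pure helper above has no dependencies
    r = requests.post(f"{base}/chat/completions", json={
        "model": "local",
        "messages": [
            {"role": "system", "content":
             "Reply with exactly one word: search, shell, or chat."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 4,
        "temperature": 0,
    })
    return normalize_route(r.json()["choices"][0]["message"]["content"])
```

Defaulting to `chat` on an unparseable label keeps the agent usable even when a small model ignores the one-word instruction.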
## Installation

### Prerequisites
```bash
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
```

### Clone the repo
bash
git clone https://github.com/walter-grace/mac-code
cd mac-codebash
git clone https://github.com/walter-grace/mac-code
cd mac-codeDownload models
下载模型
35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):

```bash
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/'
)
"
```

9B — 64K context, long documents (5.3 GB):
```bash
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf',
    local_dir='$HOME/models/'
)
"
```
## Starting the Backend
Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)
选项A:llama.cpp + 35B MoE(推荐,速度30 tokens/秒)
```bash
llama-server \
  --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --host 127.0.0.1 \
  --flash-attn on --ctx-size 12288 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-gpu-layers 99 --reasoning off -np 1 -t 4
```

### Option B: llama.cpp + 9B (64K context)
bash
llama-server \
--model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 65536 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -t 4bash
llama-server \
--model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 65536 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -t 4Option C: MLX backend (persistent context, 9B)
选项C:MLX后端(持久化上下文,适配9B模型)
```bash
# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py
```

### Start the agent (all options)

```bash
python3 agent.py
```

## Agent CLI Commands
Inside the agent REPL, type `/` for all commands:

| Command | Action |
|---|---|
| | Agent mode with tools (default) |
| | Direct streaming, no tools |
| /model 9b | Switch to 9B model (64K context) |
| | Switch to 35B MoE |
| | Quick DuckDuckGo search |
| | Run speed benchmark |
| | Session statistics |
| | Show cost savings vs cloud |
| | Grade the last response |
| | View response grading stats |
| | Reset conversation |
| | Exit |
### Example prompts
```
> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7

> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes

> explain how attention works
→ routes to "chat", streams directly
```
## MLX Backend — Persistent KV Cache API
The MLX engine exposes a REST API on `localhost:8000`.

### Save context after processing a large codebase
```bash
# Double quotes so $(cat README.md) is actually substituted into the payload
curl -X POST localhost:8000/v1/context/save \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"my-project\", \"prompt\": \"$(cat README.md)\"}"
```

### Load saved context instantly (0.0003s)
```bash
curl -X POST localhost:8000/v1/context/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-project"}'
```
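For scripting, the save/load calls above can be wrapped in a small client. A minimal sketch assuming only the `/v1/context/save` and `/v1/context/load` routes shown above; the helper names are mine.

```python
from typing import Optional

BASE = "http://localhost:8000/v1"

def context_payload(name: str, prompt: Optional[str] = None) -> dict:
    """Build the JSON body used by the context save/load routes."""
    body = {"name": name}
    if prompt is not None:
        body["prompt"] = prompt
    return body

def save_context(name: str, prompt: str) -> dict:
    import requests  # deferred so the payload helper stays dependency-free
    return requests.post(f"{BASE}/context/save",
                         json=context_payload(name, prompt)).json()

def load_context(name: str) -> dict:
    import requests
    return requests.post(f"{BASE}/context/load",
                         json=context_payload(name)).json()
```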
### Download context from Cloudflare R2 (cross-Mac sync)
```bash
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name

curl -X POST localhost:8000/v1/context/download \
  -H "Content-Type: application/json" \
  -d '{"name": "my-project"}'
```

### Standard OpenAI-compatible chat
```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": False
})
print(response.json()["choices"][0]["message"]["content"])
```

### Streaming chat
```python
import requests, json

with requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Explain transformers"}],
    "stream": True
}, stream=True) as r:
    for line in r.iter_lines():
        if line == b"data: [DONE]":  # end-of-stream sentinel, not JSON
            break
        if line.startswith(b"data: "):
            chunk = json.loads(line[6:])
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
```

## KV Cache Compression (MLX)
Compress context 4x with 99.3% similarity:

```python
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache

# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4)  # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")

# Load later
kv = load_kv_cache("my-project-compressed")
```
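The 4x figure follows from storing 4-bit integers plus a per-tensor scale in place of 16-bit floats. Below is a minimal symmetric int4 round-trip for illustration only; `turboquant`'s actual scheme is not documented here, and the function names are mine.

```python
import numpy as np

def quantize_q4(x: np.ndarray):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = float(np.abs(x).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale round-trips exactly
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Packed two nibbles per byte, int4 + scale is ~4x smaller than fp16,
# consistent with the 26.6 MB → 6.7 MB figure above.
```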
---

## Flash Streaming — Out-of-Core Inference
For models larger than your RAM (research mode):

```bash
cd research/flash-streaming

# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py

# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
```

### How F_NOCACHE direct I/O works
```python
import os, fcntl

# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass page cache

# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
```
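The alignment arithmetic above generalizes to a small helper. A sketch (the name `aligned_pread_window` is mine): it rounds the offset down to the 16 KB boundary and returns the slack so the caller can slice the wanted bytes back out. Note the snippet above over-reads a full extra block for simplicity; this computes the exact span.

```python
ALIGN = 16384  # 16 KB boundary for DART IOMMU transfers

def aligned_pread_window(offset: int, size: int, align: int = ALIGN):
    """Return (aligned_offset, read_length, slack) for an aligned pread."""
    aligned = (offset // align) * align  # round down to the boundary
    slack = offset - aligned             # bytes before the wanted region
    length = slack + size                # must cover slack plus payload
    return aligned, length, slack

# Caller pattern, mirroring the snippet above:
#   data = os.pread(fd, length, aligned)
#   weights = data[slack : slack + size]
```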
### MoE Expert Sniper pattern
```python
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state)  # returns [8] indices

# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor

def load_expert(expert_idx):
    offset = expert_offsets[expert_idx]
    return os.pread(fd, expert_size, offset)

with ThreadPoolExecutor(max_workers=8) as pool:
    expert_weights = list(pool.map(load_expert, active_experts))

# ~14 MB loaded per layer instead of 221 MB (dense)
```

---

## Common Patterns
### Use as a Python library (direct API calls)
```python
import requests

BASE = "http://localhost:8000/v1"

def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": "local",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    })
    return r.json()["choices"][0]["message"]["content"]

# Examples
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
```

### Process a large file with paged inference
```python
from mlx.paged_inference import PagedInference

engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")
with open("large_codebase.txt") as f:
    content = f.read()  # beyond single context window

# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
```
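`PagedInference`'s internals aren't shown in this document; a minimal paging scheme over an oversized document might split it into overlapping windows like this (the window sizes and the function name are assumptions, not the library's API):

```python
def page_chunks(text: str, window: int = 8000, overlap: int = 500) -> list:
    """Split text into overlapping windows so no chunk exceeds the context."""
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    step = window - overlap  # advance by the non-overlapping portion
    return [text[i:i + window]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap lets each window carry a little trailing context from the previous one, so summaries don't lose sentences cut at a chunk boundary.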
### Monitor server performance
```bash
python3 dashboard.py
```

## Model Selection Guide
| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | |
| 48 GB | 35B Q4_K_M native | Download full Q4, |
| 192 GB | 397B frontier | Any large GGUF, full offload |
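The table maps directly onto a small selection helper. A sketch only: the function is mine, while the RAM thresholds and model tiers are the table's.

```python
def pick_model(ram_gb: int, prefer_quality: bool = False) -> str:
    """Choose a model tier from available RAM, per the table above."""
    if ram_gb >= 192:
        return "397B frontier (any large GGUF, full offload)"
    if ram_gb >= 48:
        return "35B Q4_K_M native"
    if ram_gb >= 16:
        # On 16 GB, trade speed for quality via the Expert Sniper path
        return "35B Q4 Expert Sniper" if prefer_quality else "35B IQ2_M (30 tok/s)"
    return "9B Q4_K_M"
```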
## Troubleshooting

### Server not responding on port 8000
```bash
# Check if the server is running and what's on port 8000
lsof -i :8000

# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --verbose
```

### Model download fails / incomplete
```bash
# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/',
    resume_download=True
)
"
```

### Slow inference / RAM pressure on 16 GB Mac
```bash
# Reduce context size to free RAM (4096, down from 12288)
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
  --port 8000 --ctx-size 4096 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --n-gpu-layers 99 -t 4

# Or switch to 9B for lower RAM usage
python3 agent.py
# Then: /model 9b
```

### MLX engine crashes with memory error
```bash
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"

# Reduce batch size in mlx_engine.py
# Edit: max_batch_size = 512 → max_batch_size = 128
```

### F_NOCACHE not bypassing page cache (macOS Sonoma+)
```python
# Verify F_NOCACHE is active
import fcntl, os

fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
```

### `ddgs` search fails
```bash
pip3 install --upgrade ddgs --break-system-packages
```

ddgs uses DuckDuckGo — no API key required, but may rate-limit. Retry after 60 seconds if you get a 202 response.
### Wrong reshape on GGUF dequantization

```python
# GGUF tensors are column-major — correct reshape:
weights = dequantized_flat.reshape(ne[1], ne[0])  # CORRECT
# NOT: dequantized_flat.reshape(ne[0], ne[1]).T   # WRONG
```
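A toy numpy check makes the difference concrete. GGUF stores `ne[0]` as the fastest-varying dimension, so a C-order reshape must reverse the dims; the transposed variant produces the same shape but scrambles element order:

```python
import numpy as np

ne = (3, 2)                      # ne[0]=3 varies fastest in the flat buffer
flat = np.arange(ne[0] * ne[1])  # [0, 1, 2, 3, 4, 5]

correct = flat.reshape(ne[1], ne[0])   # rows of length ne[0]: [[0,1,2],[3,4,5]]
wrong = flat.reshape(ne[0], ne[1]).T   # same shape, scrambled: [[0,2,4],[1,3,5]]

assert correct.shape == wrong.shape
assert not np.array_equal(correct, wrong)
```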
---

## Architecture Summary
```
agent.py
├── Intent classification → "search" | "shell" | "chat"
├── search → ddgs.DDGS().text() → summarize
├── shell → generate command → subprocess.run()
└── chat → stream directly

Backends (both expose OpenAI-compatible API on :8000)
├── llama.cpp → fast, standard, no persistence
└── mlx/ → KV cache save/load/compress/sync

Flash Streaming (research/)
├── moe_expert_sniper.py → 35B Q4, 1.42 GB RAM
└── flash_stream_v2.py → 32B dense, 4.5 GB RAM
    └── F_NOCACHE + pread + 16KB alignment
```
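The `shell` branch of the tree can be sketched as follows. This is a minimal illustration, not mac-code's actual code: the confirmation prompt and the helper name are assumptions, added because executing model-generated commands without review is risky.

```python
import subprocess

def run_generated_command(cmd: str, confirm: bool = True, timeout: int = 30) -> str:
    """Execute a model-generated shell command and capture its output."""
    if confirm:
        # Let the user veto the generated command before it runs
        answer = input(f"Run `{cmd}`? [y/N] ")
        if answer.strip().lower() != "y":
            return "(skipped)"
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr
```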