mac-code-local-ai-agent


mac-code — Free Local AI Agent on Apple Silicon

Skill by ara.so — Daily 2026 Skills collection.
Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.

What It Does

  • LLM-as-router: the model classifies every prompt as `search`, `shell`, or `chat` and routes accordingly
  • 35B MoE at 30 tok/s via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
  • 35B full Q4 on 16 GB via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
  • 9B at 64K context via quantized KV cache (`q4_0` keys/values)
  • MLX backend adds persistent KV cache save/load, context compression, R2 sync
  • Tools: DuckDuckGo search, shell execution, file read/write
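The LLM-as-router idea above amounts to one classification call against the local OpenAI-compatible endpoint. A minimal sketch, assuming the endpoint and label set from this README; the helper names, prompt wording, and fallback policy are illustrative, not mac-code's actual code:

```python
# Sketch of LLM-as-router: ask the local model for exactly one label,
# then dispatch on it. (Hypothetical helpers; only the /v1 endpoint and
# the search/shell/chat label set come from this README.)
import json
import urllib.request

ROUTES = ("search", "shell", "chat")

def parse_label(text):
    """Normalize the model's one-word answer; fall back to plain chat."""
    label = text.strip().lower().rstrip(".")
    return label if label in ROUTES else "chat"

def classify(prompt, base="http://localhost:8000/v1"):
    body = json.dumps({
        "model": "local",
        "messages": [
            {"role": "system",
             "content": "Classify the request as exactly one word: "
                        "search, shell, or chat."},
            {"role": "user", "content": prompt},
        ],
    }).encode()
    req = urllib.request.Request(f"{base}/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return parse_label(json.load(r)["choices"][0]["message"]["content"])
```

Routing everything through one tiny classification turn is what lets a single local model act as search engine frontend, shell assistant, and chatbot at once.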

Installation

Prerequisites

```bash
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
```

Clone the repo

```bash
git clone https://github.com/walter-grace/mac-code
cd mac-code
```

Download models

35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):

```bash
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/'
)
"
```

9B — 64K context, long documents (5.3 GB):

```bash
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf',
    local_dir='$HOME/models/'
)
"
```

Starting the Backend

Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)

```bash
llama-server \
    --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 12288 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -np 1 -t 4
```

Option B: llama.cpp + 9B (64K context)

```bash
llama-server \
    --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 65536 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -t 4
```

Option C: MLX backend (persistent context, 9B)

```bash
# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py
```

Start the agent (all options)

```bash
python3 agent.py
```

Agent CLI Commands

Inside the agent REPL, type `/` for all commands:

| Command | Action |
|---|---|
| `/agent` | Agent mode with tools (default) |
| `/raw` | Direct streaming, no tools |
| `/model 9b` | Switch to 9B model (64K context) |
| `/model 35b` | Switch to 35B MoE |
| `/search <query>` | Quick DuckDuckGo search |
| `/bench` | Run speed benchmark |
| `/stats` | Session statistics |
| `/cost` | Show cost savings vs cloud |
| `/good` / `/bad` | Grade the last response |
| `/improve` | View response grading stats |
| `/clear` | Reset conversation |
| `/quit` | Exit |
Example prompts

> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7

> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes

> explain how attention works
→ routes to "chat", streams directly
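As a rough illustration of the shell route above, the generated command is handed to a subprocess and its output captured. This is a minimal sketch with a hard-coded command, not mac-code's implementation, which may add its own safeguards:

```python
# Hypothetical sketch of the shell route: run a generated command, capture stdout.
import subprocess

def run_generated(cmd):
    # In real use, never execute untrusted model output without review.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout

out = run_generated("echo hello")  # → "hello\n"
```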

MLX Backend — Persistent KV Cache API

The MLX engine exposes a REST API on `localhost:8000`.

Save context after processing a large codebase

```bash
# jq builds valid JSON from the file; inside single quotes, $(cat ...)
# would be sent as a literal string rather than the file contents
curl -X POST localhost:8000/v1/context/save \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg name my-project --rawfile prompt README.md \
          '{name: $name, prompt: $prompt}')"
```

Load saved context instantly (0.0003s)

```bash
curl -X POST localhost:8000/v1/context/load \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project"}'
```

Download context from Cloudflare R2 (cross-Mac sync)

```bash
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name

curl -X POST localhost:8000/v1/context/download \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project"}'
```

Standard OpenAI-compatible chat

```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": False
})
print(response.json()["choices"][0]["message"]["content"])
```

Streaming chat

```python
import requests, json

with requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Explain transformers"}],
    "stream": True
}, stream=True) as r:
    for line in r.iter_lines():
        if line.startswith(b"data: "):
            payload = line[6:]
            if payload == b"[DONE]":  # end-of-stream sentinel, not JSON
                break
            chunk = json.loads(payload)
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
```

KV Cache Compression (MLX)

Compress context 4x with 99.3% similarity:

```python
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache

# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4)  # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")

# Load later
kv = load_kv_cache("my-project-compressed")
```

---

Flash Streaming — Out-of-Core Inference

For models larger than your RAM (research mode):

```bash
cd research/flash-streaming

# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py

# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
```

How F_NOCACHE direct I/O works

```python
import os, fcntl

# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass page cache

# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
```
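The alignment arithmetic can be factored into a small, testable helper. `aligned_window` is our own name, and unlike the snippet above it computes the exact read length instead of padding by a full extra block:

```python
# Round an (offset, size) request down to a 16 KB boundary, as the DART IOMMU
# alignment requires; returns where to read, how much, and how many leading
# bytes to discard. (Helper name is ours, not from mac-code.)
ALIGN = 16384

def aligned_window(offset, size, align=ALIGN):
    start = (offset // align) * align   # round down to the boundary
    skip = offset - start               # leading bytes we did not ask for
    return start, size + skip, skip

# e.g. a 100-byte read at offset 20000 becomes a 3716-byte read at 16384
start, length, skip = aligned_window(20000, 100)
```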

MoE Expert Sniper pattern

```python
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state)  # returns [8] indices

# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor

def load_expert(expert_idx):
    offset = expert_offsets[expert_idx]
    return os.pread(fd, expert_size, offset)

with ThreadPoolExecutor(max_workers=8) as pool:
    expert_weights = list(pool.map(load_expert, active_experts))

# ~14 MB loaded per layer instead of 221 MB (dense)
```


---


Common Patterns

Use as a Python library (direct API calls)

```python
import requests

BASE = "http://localhost:8000/v1"

def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": "local",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    })
    return r.json()["choices"][0]["message"]["content"]
```

Examples

```python
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
```

Process a large file with paged inference

```python
from mlx.paged_inference import PagedInference

engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")

with open("large_codebase.txt") as f:
    content = f.read()  # beyond single context window

# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
```

Monitor server performance

```bash
python3 dashboard.py
```

Model Selection Guide

| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | `--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096` |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | `python3 research/flash-streaming/moe_expert_sniper.py` |
| 48 GB | 35B Q4_K_M native | Download full Q4, `--n-gpu-layers 99` |
| 192 GB | 397B frontier | Any large GGUF, full offload |

Troubleshooting

Server not responding on port 8000

```bash
# Check if the server is running / what's on port 8000
lsof -i :8000

# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --verbose
```

Model download fails / incomplete

```bash
# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/',
    resume_download=True
)
"
```

Slow inference / RAM pressure on 16 GB Mac

```bash
# Reduce context size to free RAM (down from 12288)
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --ctx-size 4096 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 -t 4
```

Or switch to 9B for lower RAM usage:

```bash
python3 agent.py
# Then: /model 9b
```

MLX engine crashes with memory error

```bash
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"
```

Reduce batch size in `mlx_engine.py`: edit `max_batch_size = 512` → `max_batch_size = 128`.

F_NOCACHE not bypassing page cache (macOS Sonoma+)

```python
# Verify F_NOCACHE is active
import fcntl, os

fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
```

`ddgs` search fails

```bash
pip3 install --upgrade ddgs --break-system-packages
```

ddgs uses DuckDuckGo — no API key required, but it may rate-limit. Retry after 60 seconds if you get a 202 response.
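A simple backoff wrapper for the rate-limit case might look like this. The retry policy is our own suggestion; you would pass in the actual search call:

```python
# Hypothetical retry-with-delay wrapper for rate-limited search calls.
import time

def with_retry(fn, retries=2, delay=60):
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise          # out of retries, surface the error
            time.sleep(delay)  # e.g. 60 s for a DuckDuckGo 202 backoff

# Usage sketch (assumes the ddgs package):
#   from ddgs import DDGS
#   results = with_retry(lambda: DDGS().text("NBA finals", max_results=5))
```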

Wrong reshape on GGUF dequantization

GGUF tensors are column-major — correct reshape:

```python
weights = dequantized_flat.reshape(ne[1], ne[0])    # CORRECT

# NOT: dequantized_flat.reshape(ne[0], ne[1]).T     # WRONG
```


---


Architecture Summary

```
agent.py
  ├── Intent classification → "search" | "shell" | "chat"
  ├── search → ddgs.DDGS().text() → summarize
  ├── shell  → generate command → subprocess.run()
  └── chat   → stream directly

Backends (both expose OpenAI-compatible API on :8000)
  ├── llama.cpp  → fast, standard, no persistence
  └── mlx/       → KV cache save/load/compress/sync

Flash Streaming (research/)
  ├── moe_expert_sniper.py  → 35B Q4, 1.42 GB RAM
  └── flash_stream_v2.py    → 32B dense, 4.5 GB RAM
      └── F_NOCACHE + pread + 16KB alignment
```