mac-code-local-ai-agent


mac-code — Free Local AI Agent on Apple Silicon

Skill by ara.so — Daily 2026 Skills collection.
Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (Claude Code alternative) that routes tasks — web search, shell commands, file edits, chat — through a local LLM. Supports llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache) backends on Apple Silicon.

What It Does

  • LLM-as-router: the model classifies every prompt as `search`, `shell`, or `chat` and routes accordingly
  • 35B MoE at 30 tok/s via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
  • 35B full Q4 on 16 GB via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
  • 9B at 64K context via quantized KV cache (`q4_0` keys/values)
  • MLX backend adds persistent KV cache save/load, context compression, R2 sync
  • Tools: DuckDuckGo search, shell execution, file read/write
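The LLM-as-router idea above amounts to one classification call against the local OpenAI-compatible endpoint. A minimal sketch, assuming the endpoint and label set from this README; the helper names, prompt wording, and fallback policy are illustrative, not mac-code's actual code:

```python
# Sketch of LLM-as-router: ask the local model for exactly one label,
# then dispatch on it. (Hypothetical helpers; only the /v1 endpoint and
# the search/shell/chat label set come from this README.)
import json
import urllib.request

ROUTES = ("search", "shell", "chat")

def parse_label(text):
    """Normalize the model's one-word answer; fall back to plain chat."""
    label = text.strip().lower().rstrip(".")
    return label if label in ROUTES else "chat"

def classify(prompt, base="http://localhost:8000/v1"):
    body = json.dumps({
        "model": "local",
        "messages": [
            {"role": "system",
             "content": "Classify the request as exactly one word: "
                        "search, shell, or chat."},
            {"role": "user", "content": prompt},
        ],
    }).encode()
    req = urllib.request.Request(f"{base}/chat/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return parse_label(json.load(r)["choices"][0]["message"]["content"])
```

Routing everything through one tiny classification turn is what lets a single local model act as search engine frontend, shell assistant, and chatbot at once.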

Installation

Prerequisites

```bash
brew install llama.cpp
pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages
```

Clone the repo

```bash
git clone https://github.com/walter-grace/mac-code
cd mac-code
```

Download models

35B MoE — fast daily driver (10.6 GB, fits in 16 GB RAM):

```bash
mkdir -p ~/models
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/'
)
"
```

9B — 64K context, long documents (5.3 GB):

```bash
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-9B-GGUF',
    'Qwen3.5-9B-Q4_K_M.gguf',
    local_dir='$HOME/models/'
)
"
```

Starting the Backend

Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)

```bash
llama-server \
    --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 12288 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -np 1 -t 4
```

Option B: llama.cpp + 9B (64K context)

```bash
llama-server \
    --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
    --port 8000 --host 127.0.0.1 \
    --flash-attn on --ctx-size 65536 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 --reasoning off -t 4
```

Option C: MLX backend (persistent context, 9B)

```bash
# Starts server on port 8000, downloads model on first run
python3 mlx/mlx_engine.py
```

Start the agent (all options)

```bash
python3 agent.py
```

Agent CLI Commands

Inside the agent REPL, type `/` for all commands:

| Command | Action |
|---|---|
| `/agent` | Agent mode with tools (default) |
| `/raw` | Direct streaming, no tools |
| `/model 9b` | Switch to 9B model (64K context) |
| `/model 35b` | Switch to 35B MoE |
| `/search <query>` | Quick DuckDuckGo search |
| `/bench` | Run speed benchmark |
| `/stats` | Session statistics |
| `/cost` | Show cost savings vs cloud |
| `/good` / `/bad` | Grade the last response |
| `/improve` | View response grading stats |
| `/clear` | Reset conversation |
| `/quit` | Exit |
Example prompts

> find all Python files modified in the last 7 days
→ routes to "shell", generates: find . -name "*.py" -mtime -7

> who won the NBA finals
→ routes to "search", queries DuckDuckGo, summarizes

> explain how attention works
→ routes to "chat", streams directly
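As a rough illustration of the shell route above, the generated command is handed to a subprocess and its output captured. This is a minimal sketch with a hard-coded command, not mac-code's implementation, which may add its own safeguards:

```python
# Hypothetical sketch of the shell route: run a generated command, capture stdout.
import subprocess

def run_generated(cmd):
    # In real use, never execute untrusted model output without review.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout

out = run_generated("echo hello")  # → "hello\n"
```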

MLX Backend — Persistent KV Cache API

The MLX engine exposes a REST API on `localhost:8000`.

Save context after processing a large codebase

```bash
# jq builds valid JSON from the file; inside single quotes, $(cat ...)
# would be sent as a literal string rather than the file contents
curl -X POST localhost:8000/v1/context/save \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg name my-project --rawfile prompt README.md \
          '{name: $name, prompt: $prompt}')"
```

Load saved context instantly (0.0003s)

```bash
curl -X POST localhost:8000/v1/context/load \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project"}'
```

Download context from Cloudflare R2 (cross-Mac sync)

```bash
# Requires R2 credentials in environment
export R2_ACCOUNT_ID=your_account_id
export R2_ACCESS_KEY_ID=your_key_id
export R2_SECRET_ACCESS_KEY=your_secret
export R2_BUCKET=your_bucket_name

curl -X POST localhost:8000/v1/context/download \
    -H "Content-Type: application/json" \
    -d '{"name": "my-project"}'
```

Standard OpenAI-compatible chat

```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Write a Python quicksort"}],
    "stream": False
})
print(response.json()["choices"][0]["message"]["content"])
```

Streaming chat

```python
import requests, json

with requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "local",
    "messages": [{"role": "user", "content": "Explain transformers"}],
    "stream": True
}, stream=True) as r:
    for line in r.iter_lines():
        if line.startswith(b"data: "):
            payload = line[6:]
            if payload == b"[DONE]":  # end-of-stream sentinel, not JSON
                break
            chunk = json.loads(payload)
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
```

KV Cache Compression (MLX)

Compress context 4x with 99.3% similarity:

```python
from mlx.turboquant import compress_kv_cache
from mlx.kv_cache import save_kv_cache, load_kv_cache

# After building a KV cache from a long document
compressed = compress_kv_cache(kv_cache, bits=4)  # 26.6 MB → 6.7 MB
save_kv_cache(compressed, "my-project-compressed")

# Load later
kv = load_kv_cache("my-project-compressed")
```

---

Flash Streaming — Out-of-Core Inference

For models larger than your RAM (research mode):

```bash
cd research/flash-streaming

# Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
python3 moe_expert_sniper.py

# Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
python3 flash_stream_v2.py
```

How F_NOCACHE direct I/O works

```python
import os, fcntl

# Open model file bypassing macOS Unified Buffer Cache
fd = os.open("model.bin", os.O_RDONLY)
fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass page cache

# Aligned read (16KB boundary for DART IOMMU)
ALIGN = 16384
offset = (layer_offset // ALIGN) * ALIGN
data = os.pread(fd, layer_size + ALIGN, offset)
weights = data[layer_offset - offset : layer_offset - offset + layer_size]
```
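The alignment arithmetic can be factored into a small, testable helper. `aligned_window` is our own name, and unlike the snippet above it computes the exact read length instead of padding by a full extra block:

```python
# Round an (offset, size) request down to a 16 KB boundary, as the DART IOMMU
# alignment requires; returns where to read, how much, and how many leading
# bytes to discard. (Helper name is ours, not from mac-code.)
ALIGN = 16384

def aligned_window(offset, size, align=ALIGN):
    start = (offset // align) * align   # round down to the boundary
    skip = offset - start               # leading bytes we did not ask for
    return start, size + skip, skip

# e.g. a 100-byte read at offset 20000 becomes a 3716-byte read at 16384
start, length, skip = aligned_window(20000, 100)
```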

MoE Expert Sniper pattern

```python
# Router predicts which 8 of 256 experts activate per token
active_experts = router_forward(hidden_state)  # returns [8] indices

# Load only those experts from SSD (8 threads, parallel pread)
from concurrent.futures import ThreadPoolExecutor

def load_expert(expert_idx):
    offset = expert_offsets[expert_idx]
    return os.pread(fd, expert_size, offset)

with ThreadPoolExecutor(max_workers=8) as pool:
    expert_weights = list(pool.map(load_expert, active_experts))

# ~14 MB loaded per layer instead of 221 MB (dense)
```


---


Common Patterns

Use as a Python library (direct API calls)

```python
import requests

BASE = "http://localhost:8000/v1"

def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
    r = requests.post(f"{BASE}/chat/completions", json={
        "model": "local",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    })
    return r.json()["choices"][0]["message"]["content"]
```

Examples

```python
print(ask("Write a Python function to parse JSON safely"))
print(ask("Explain this error: AttributeError: NoneType has no attribute split"))
```

Process a large file with paged inference

```python
from mlx.paged_inference import PagedInference

engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")

with open("large_codebase.txt") as f:
    content = f.read()  # beyond single context window

# Automatically pages through content
result = engine.summarize(content, question="What does this codebase do?")
print(result)
```

Monitor server performance

```bash
python3 dashboard.py
```

Model Selection Guide

| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | `--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096` |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | `python3 research/flash-streaming/moe_expert_sniper.py` |
| 48 GB | 35B Q4_K_M native | Download full Q4, `--n-gpu-layers 99` |
| 192 GB | 397B frontier | Any large GGUF, full offload |

Troubleshooting

Server not responding on port 8000

```bash
# Check if the server is running / what's on port 8000
lsof -i :8000

# Restart llama-server with verbose logging
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --verbose
```

Model download fails / incomplete

```bash
# Resume interrupted download
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
    local_dir='$HOME/models/',
    resume_download=True
)
"
```

Slow inference / RAM pressure on 16 GB Mac

```bash
# Reduce context size to free RAM (down from 12288)
llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
    --port 8000 --ctx-size 4096 \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --n-gpu-layers 99 -t 4
```

Or switch to 9B for lower RAM usage:

```bash
python3 agent.py
# Then: /model 9b
```

MLX engine crashes with memory error

```bash
# MLX uses unified memory — check pressure
vm_stat | grep "Pages free"
```

Reduce batch size in `mlx_engine.py`: edit `max_batch_size = 512` → `max_batch_size = 128`.

F_NOCACHE not bypassing page cache (macOS Sonoma+)

```python
# Verify F_NOCACHE is active
import fcntl, os

fd = os.open(model_path, os.O_RDONLY)
result = fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
assert result == 0, "F_NOCACHE failed — check macOS version and SIP status"
```

`ddgs` search fails

```bash
pip3 install --upgrade ddgs --break-system-packages
```

ddgs uses DuckDuckGo — no API key required, but it may rate-limit. Retry after 60 seconds if you get a 202 response.
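A simple backoff wrapper for the rate-limit case might look like this. The retry policy is our own suggestion; you would pass in the actual search call:

```python
# Hypothetical retry-with-delay wrapper for rate-limited search calls.
import time

def with_retry(fn, retries=2, delay=60):
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise          # out of retries, surface the error
            time.sleep(delay)  # e.g. 60 s for a DuckDuckGo 202 backoff

# Usage sketch (assumes the ddgs package):
#   from ddgs import DDGS
#   results = with_retry(lambda: DDGS().text("NBA finals", max_results=5))
```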

Wrong reshape on GGUF dequantization

GGUF tensors are column-major — correct reshape:

```python
weights = dequantized_flat.reshape(ne[1], ne[0])    # CORRECT

# NOT: dequantized_flat.reshape(ne[0], ne[1]).T     # WRONG
```


---


Architecture Summary

```
agent.py
  ├── Intent classification → "search" | "shell" | "chat"
  ├── search → ddgs.DDGS().text() → summarize
  ├── shell  → generate command → subprocess.run()
  └── chat   → stream directly

Backends (both expose OpenAI-compatible API on :8000)
  ├── llama.cpp  → fast, standard, no persistence
  └── mlx/       → KV cache save/load/compress/sync

Flash Streaming (research/)
  ├── moe_expert_sniper.py  → 35B Q4, 1.42 GB RAM
  └── flash_stream_v2.py    → 32B dense, 4.5 GB RAM
      └── F_NOCACHE + pread + 16KB alignment
```