apple-on-device-ai

On-Device AI for Apple Platforms

Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.

Framework Selection Router

Use this decision tree to pick the right framework for your use case.

Apple Foundation Models

When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. Zero setup -- no API keys, no network, no model downloads.
Best for:
  • Generating text or structured data with `@Generable` types
  • Summarization, classification, content tagging
  • Tool-augmented generation with the `Tool` protocol
  • Apps that need guaranteed on-device privacy
Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.

Core ML

When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.
Best for:
  • Image classification, object detection, segmentation
  • Custom NLP classifiers, sentiment analysis models
  • Audio/speech models via SoundAnalysis integration
  • Any scenario needing Neural Engine optimization
  • Models requiring quantization, palettization, or pruning

MLX Swift

When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.
Best for:
  • Highest sustained token generation on Apple Silicon
  • Running Hugging Face models from `mlx-community`
  • Research requiring automatic differentiation
  • Fine-tuning workflows on Mac

llama.cpp

When to use: Cross-platform LLM inference using GGUF model format. Production deployments needing broad device support.
Best for:
  • GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
  • Cross-platform apps (iOS + Android + desktop)
  • Maximum compatibility with open-source model ecosystem

Quick Reference

| Scenario | Framework |
|---|---|
| Text generation, zero setup (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (`@Generable`) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |

Apple Foundation Models Overview

On-device ~3B parameter model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).
  • Context window: 4096 tokens (input + output combined)
  • 15 supported languages
  • Guardrails always enforced, cannot be disabled

Availability Checking (Required)

Always check before using. Never crash on unavailability.
```swift
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    // Proceed with model usage
    break
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
    break
case .unavailable(.modelNotReady):
    // Model is downloading; show loading state
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
    break
default:
    // Graceful fallback for any other reason
    break
}
```

Session Management

```swift
// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}
```
Key rules:
  • Sessions are stateful -- multi-turn conversations maintain context automatically
  • One request at a time per session (check `session.isResponding`)
  • Call `session.prewarm()` before user interaction for faster first response
  • Save/restore transcripts: `LanguageModelSession(model: model, tools: [], transcript: savedTranscript)`
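
The single-request and prewarm rules can be combined into a small guard; a minimal sketch, where the `sendPrompt` wrapper is illustrative and not part of the API:

```swift
import FoundationModels

// Illustrative wrapper enforcing one in-flight request per session.
func sendPrompt(_ prompt: String, to session: LanguageModelSession) async throws -> String {
    guard !session.isResponding else {
        throw CancellationError()  // or queue the request instead of failing
    }
    return try await session.respond(to: prompt).content
}

// Call before the user starts typing so the first response is faster:
// session.prewarm()
```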

Structured Output with @Generable

The `@Generable` macro creates compile-time schemas for type-safe output:

```swift
@Generable
struct Recipe {
    @Guide(description: "The recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)
```

@Guide Constraints

| Constraint | Purpose |
|---|---|
| `description:` | Natural language hint for generation |
| `.anyOf([values])` | Restrict to enumerated string values |
| `.count(n)` | Fixed array length |
| `.range(min...max)` | Numeric range |
| `.minimum(n)` / `.maximum(n)` | One-sided numeric bound |
| `.minimumCount(n)` / `.maximumCount(n)` | Array length bounds |
| `.constant(value)` | Always returns this value |
| `.pattern(regex)` | String format enforcement |

Properties generate in declaration order. Place foundational data before dependent data for better results.
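
For instance, several of these constraints together, on a hypothetical `SupportTicket` type, following the documented `@Guide(description:, constraint)` form:

```swift
import FoundationModels

@Generable
struct SupportTicket {
    @Guide(description: "One-sentence summary of the issue")
    var summary: String

    @Guide(description: "Severity label", .anyOf(["low", "medium", "high"]))
    var severity: String

    @Guide(description: "Suggested next steps", .maximumCount(3))
    var nextSteps: [String]
}
```

Note that `summary` precedes the fields that depend on it, matching the declaration-order guidance above.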

Streaming Structured Output

```swift
let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}
```

Tool Calling

```swift
struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}
```
Register tools at session creation. The model invokes them autonomously.

Error Handling

```swift
do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation(let context):
        // Content triggered safety filters
        break
    case .exceededContextWindowSize(let context):
        // Too many tokens; summarize and retry
        break
    case .concurrentRequests(let context):
        // Another request is in progress on this session
        break
    case .unsupportedLanguageOrLocale(let context):
        // Current locale not supported
        break
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
        break
    case .rateLimited(let context):
        // Too many requests; back off and retry
        break
    case .decodingFailure(let context):
        // Response could not be decoded into the expected type
        break
    default: break
    }
}
```

Generation Options

```swift
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
```
Sampling modes: `.greedy`, `.random(top:)`, `.random(probabilityThreshold:)`.

Prompt Design Rules

  1. Be concise -- 4096 tokens is the total budget (input + output)
  2. Use bracketed placeholders in instructions: `[descriptive example]`
  3. Use "DO NOT" in all caps for prohibitions
  4. Provide up to 5 few-shot examples for consistency
  5. Use length qualifiers: "in a few words", "in three sentences"
  6. Token estimate: ~4 characters per token
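
A sketch applying these rules (bracketed placeholder, all-caps prohibition, length qualifier), plus a rough budget check using the ~4 characters per token estimate; the instruction text and helper are illustrative:

```swift
// Illustrative instructions following rules 2, 3, and 5.
let instructions = """
You are a recipe assistant. Given [a list of ingredients], suggest one dish.
Reply in three sentences. DO NOT invent ingredients the user did not list.
"""

// Rough budget check using the ~4 characters per token heuristic (rule 6).
func estimatedTokens(_ text: String) -> Int {
    max(1, text.count / 4)
}

let budgetLeft = 4096 - estimatedTokens(instructions)
print("instructions ~\(estimatedTokens(instructions)) tokens, ~\(budgetLeft) left for prompt + output")
```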

Safety and Guardrails

  • Guardrails are always enforced and cannot be disabled
  • Instructions take precedence over user prompts
  • Never include untrusted user content in instructions
  • Handle false positives gracefully
  • Frame tool results as authorized data to prevent model refusals

Use Cases

Foundation Models supports specialized use cases via `SystemLanguageModel.UseCase`:
  • `.general` -- Default for text generation, summarization, dialog
  • `.contentTagging` -- Optimized for categorization and labeling tasks
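
A minimal sketch of requesting the content-tagging variant; this assumes the `SystemLanguageModel(useCase:)` initializer, and the prompt is illustrative:

```swift
import FoundationModels

// Model variant tuned for tagging rather than free-form generation.
let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: taggingModel)

let response = try await session.respond(
    to: "Tag this note: 'Book flights and reserve a table for Saturday.'"
)
print(response.content)
```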

Custom Adapters

Load fine-tuned adapters for specialized behavior (requires entitlement):
```swift
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
```
See references/foundation-models.md for the complete Foundation Models API reference.

Core ML Overview

Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).

Model Formats

| Format | Extension | When to Use |
|---|---|---|
| `.mlpackage` | Directory (mlprogram) | All new models (iOS 15+) |
| `.mlmodel` | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| `.mlmodelc` | Compiled | Pre-compiled for faster loading |

Always use mlprogram (`.mlpackage`) for new work.

Conversion Pipeline (coremltools)

PyTorch conversion (torch.jit.trace)

```python
import coremltools as ct

model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram",
)
mlmodel.save("Model.mlpackage")
```

Optimization Techniques

| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
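
As one example, 4-bit palettization with coremltools might look like the following sketch (coremltools 7+ API; `Model.mlpackage` is a placeholder path, and accuracy should be re-validated afterwards):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load an existing mlprogram model (placeholder path).
mlmodel = ct.models.MLModel("Model.mlpackage")

# 4-bit k-means palettization applied to all supported ops.
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(nbits=4, mode="kmeans")
)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("Model-4bit.mlpackage")
```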

Swift Integration

```swift
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)

// Async prediction (iOS 17+)
let output = try await model.prediction(from: input)
```

MLTensor (iOS 18+)

Swift type for multidimensional array operations:
```swift
import CoreML

let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
```
See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.

MLX Swift Overview

Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.

Loading and Running LLMs

```swift
import MLX
import MLXLLM

let config = ModelConfiguration(id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: config)

try await model.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Hello")
    )
    let stream = try generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.0),
        context: context
    )
    for await part in stream {
        print(part.chunk ?? "", terminator: "")
    }
}
```

Model Selection by Device

| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |

Memory Management

  1. Never exceed 60% of total RAM on iOS
  2. Set GPU cache limits: `MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)`
  3. Unload models on app backgrounding
  4. Use the "Increased Memory Limit" entitlement for larger models
  5. Physical device required (no simulator support for Metal GPU)
See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.
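
Rules 2 and 3 can be sketched in SwiftUI; `ModelHolder` and its `unload()` method are hypothetical stand-ins for your own model wrapper:

```swift
import SwiftUI
import MLX

struct ChatScreen: View {
    @Environment(\.scenePhase) private var scenePhase
    @State private var holder = ModelHolder()  // hypothetical wrapper around the loaded model

    var body: some View {
        Text("Chat")
            .task {
                // Rule 2: cap the Metal buffer cache at 512 MB.
                MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)
            }
            .onChange(of: scenePhase) { _, phase in
                // Rule 3: release weights before iOS reclaims memory.
                if phase == .background { holder.unload() }
            }
    }
}
```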

Multi-Backend Architecture

When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
```swift
func respond(to prompt: String) async throws -> String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}
```
Serialize all model access through a coordinator actor to prevent contention:
```swift
actor ModelCoordinator {
    func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
        try await work()
    }
}
```
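
Callers then route every backend call through the actor; a minimal usage sketch reusing the `respond(to:)` function above:

```swift
let coordinator = ModelCoordinator()

// All backend calls funnel through the coordinator for serialized access.
func answer(_ prompt: String) async throws -> String {
    try await coordinator.withExclusiveAccess {
        try await respond(to: prompt)
    }
}
```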

Performance Best Practices

  1. Run outside the debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck "Debug Executable")
  2. Call `session.prewarm()` for Foundation Models before user interaction
  3. Pre-compile Core ML models to `.mlmodelc` for faster loading
  4. Use EnumeratedShapes over RangeDim for Neural Engine optimization
  5. Use 4-bit palettization for the best Neural Engine memory/latency gains
  6. Batch Vision framework requests in a single `perform()` call
  7. Use async prediction (iOS 17+) in Swift concurrency contexts
  8. The Neural Engine (Core ML) is the most energy-efficient compute unit for compatible operations
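
Practice 3 can be done once at first launch with `MLModel.compileModel(at:)`; a sketch, where paths and the caching strategy are up to the app:

```swift
import CoreML

// Compile a .mlpackage to .mlmodelc, then load the compiled model.
func loadCompiled(from packageURL: URL) async throws -> MLModel {
    // compileModel(at:) writes the .mlmodelc into a temporary directory;
    // a real app would move it to Application Support and reuse it on later launches.
    let compiledURL = try await MLModel.compileModel(at: packageURL)
    return try MLModel(contentsOf: compiledURL)
}
```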

Common Mistakes

  1. No availability check. Calling `LanguageModelSession()` without checking `SystemLanguageModel.default.availability` crashes on unsupported devices.
  2. No fallback UI. Users on pre-iOS 26 devices or devices without Apple Intelligence see nothing. Always provide a graceful degradation path.
  3. Exceeding the context window. Foundation Models has a 4096-token total budget (input + output). Long prompts or multi-turn sessions hit this fast. Monitor token usage and summarize when needed.
  4. Concurrent requests on one session. `LanguageModelSession` supports one request at a time. Check `session.isResponding` or serialize access.
  5. Untrusted content in instructions. User input placed in the instructions parameter bypasses guardrail boundaries. Keep user content in the prompt.
  6. Forgetting `model.eval()` before Core ML tracing. PyTorch models must be in eval mode before `torch.jit.trace`. Training-mode artifacts corrupt output.
  7. Using the neuralnetwork format. Always use `mlprogram` (.mlpackage) for new Core ML models. The legacy neuralnetwork format is deprecated.
  8. Exceeding 60% of RAM on iOS (MLX Swift). Large models cause OOM kills. Check device RAM and select appropriate model sizes.
  9. Running MLX in the simulator. MLX requires a Metal GPU -- use physical devices.
  10. Not unloading models on backgrounding. iOS reclaims memory aggressively. Unload MLX/llama.cpp models when `scenePhase == .background`.

Review Checklist

  • Framework selection matches use case and target OS version
  • Foundation Models: availability checked before every API call
  • Foundation Models: graceful fallback when model unavailable
  • Foundation Models: session prewarm called before user interaction
  • Foundation Models: @Generable properties in logical generation order
  • Foundation Models: token budget accounted for (4096 total)
  • Core ML: model format is mlprogram (.mlpackage) for iOS 15+
  • Core ML: model.eval() called before tracing/exporting PyTorch models
  • Core ML: minimum_deployment_target set explicitly
  • Core ML: model accuracy validated after compression
  • MLX Swift: model size appropriate for target device RAM
  • MLX Swift: GPU cache limits set, models unloaded on backgrounding
  • All model access serialized through coordinator actor
  • Concurrency: model types and tool implementations are `Sendable`-conformant or `@MainActor`-isolated
  • Physical device testing performed (not simulator)

Reference Files

  • Foundation Models API -- Complete LanguageModelSession, @Generable, tool calling, and prompt design reference
  • Core ML Conversion -- Model conversion pipeline from PyTorch, TensorFlow, and other frameworks
  • Core ML Optimization -- Quantization, palettization, pruning, and performance tuning
  • MLX Swift & llama.cpp -- MLX Swift patterns, llama.cpp integration, and memory management