apple-on-device-ai

On-Device AI for Apple Platforms

Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.

Framework Selection Router

Use this decision tree to pick the right framework for your use case.

Apple Foundation Models

When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. Zero setup -- no API keys, no network, no model downloads.
Best for:
  • Generating text or structured data with `@Generable` types
  • Summarization, classification, content tagging
  • Tool-augmented generation with the `Tool` protocol
  • Apps that need guaranteed on-device privacy
Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.

Core ML

When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.
Best for:
  • Image classification, object detection, segmentation
  • Custom NLP classifiers, sentiment analysis models
  • Audio/speech models via SoundAnalysis integration
  • Any scenario needing Neural Engine optimization
  • Models requiring quantization, palettization, or pruning

MLX Swift

When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.
Best for:
  • Highest sustained token generation on Apple Silicon
  • Running Hugging Face models from `mlx-community`
  • Research requiring automatic differentiation
  • Fine-tuning workflows on Mac

llama.cpp

When to use: Cross-platform LLM inference using GGUF model format. Production deployments needing broad device support.
Best for:
  • GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
  • Cross-platform apps (iOS + Android + desktop)
  • Maximum compatibility with open-source model ecosystem

Quick Reference

| Scenario | Framework |
|---|---|
| Text generation, zero setup (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (`@Generable`) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |

Apple Foundation Models Overview

On-device ~3B parameter model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).
  • Context window: 4096 tokens (input + output combined)
  • 15 supported languages
  • Guardrails always enforced, cannot be disabled

Availability Checking (Required)

Always check before using. Never crash on unavailability.
```swift
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    // Proceed with model usage
    break
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
    break
case .unavailable(.modelNotReady):
    // Model is downloading; show loading state
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
    break
default:
    // Graceful fallback for any other reason
    break
}
```

Session Management

```swift
// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}
```
Key rules:
  • Sessions are stateful -- multi-turn conversations maintain context automatically
  • One request at a time per session (check `session.isResponding`)
  • Call `session.prewarm()` before user interaction for faster first response
  • Save/restore transcripts: `LanguageModelSession(model: model, tools: [], transcript: savedTranscript)`
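
The single-request and prewarm rules can be combined into a small guard; a minimal sketch, where the `sendPrompt` wrapper is illustrative and not part of the API:

```swift
import FoundationModels

// Illustrative wrapper enforcing one in-flight request per session.
func sendPrompt(_ prompt: String, to session: LanguageModelSession) async throws -> String {
    guard !session.isResponding else {
        throw CancellationError()  // or queue the request instead of failing
    }
    return try await session.respond(to: prompt).content
}

// Call before the user starts typing so the first response is faster:
// session.prewarm()
```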

Structured Output with @Generable

The `@Generable` macro creates compile-time schemas for type-safe output:

```swift
@Generable
struct Recipe {
    @Guide(description: "The recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)
```

@Guide Constraints

| Constraint | Purpose |
|---|---|
| `description:` | Natural language hint for generation |
| `.anyOf([values])` | Restrict to enumerated string values |
| `.count(n)` | Fixed array length |
| `.range(min...max)` | Numeric range |
| `.minimum(n)` / `.maximum(n)` | One-sided numeric bound |
| `.minimumCount(n)` / `.maximumCount(n)` | Array length bounds |
| `.constant(value)` | Always returns this value |
| `.pattern(regex)` | String format enforcement |

Properties generate in declaration order. Place foundational data before dependent data for better results.
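
For instance, several of these constraints together, on a hypothetical `SupportTicket` type, following the documented `@Guide(description:, constraint)` form:

```swift
import FoundationModels

@Generable
struct SupportTicket {
    @Guide(description: "One-sentence summary of the issue")
    var summary: String

    @Guide(description: "Severity label", .anyOf(["low", "medium", "high"]))
    var severity: String

    @Guide(description: "Suggested next steps", .maximumCount(3))
    var nextSteps: [String]
}
```

Note that `summary` precedes the fields that depend on it, matching the declaration-order guidance above.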

Streaming Structured Output

```swift
let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}
```

Tool Calling

```swift
struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}
```
Register tools at session creation. The model invokes them autonomously.

Error Handling

```swift
do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation(let context):
        // Content triggered safety filters
        break
    case .exceededContextWindowSize(let context):
        // Too many tokens; summarize and retry
        break
    case .concurrentRequests(let context):
        // Another request is in progress on this session
        break
    case .unsupportedLanguageOrLocale(let context):
        // Current locale not supported
        break
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
        break
    case .rateLimited(let context):
        // Too many requests; back off and retry
        break
    case .decodingFailure(let context):
        // Response could not be decoded into the expected type
        break
    default: break
    }
}
```

Generation Options

```swift
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
```
Sampling modes: `.greedy`, `.random(top:)`, `.random(probabilityThreshold:)`.

Prompt Design Rules

  1. Be concise -- 4096 tokens is the total budget (input + output)
  2. Use bracketed placeholders in instructions: `[descriptive example]`
  3. Use "DO NOT" in all caps for prohibitions
  4. Provide up to 5 few-shot examples for consistency
  5. Use length qualifiers: "in a few words", "in three sentences"
  6. Token estimate: ~4 characters per token
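
A sketch applying these rules (bracketed placeholder, all-caps prohibition, length qualifier), plus a rough budget check using the ~4 characters per token estimate; the instruction text and helper are illustrative:

```swift
// Illustrative instructions following rules 2, 3, and 5.
let instructions = """
You are a recipe assistant. Given [a list of ingredients], suggest one dish.
Reply in three sentences. DO NOT invent ingredients the user did not list.
"""

// Rough budget check using the ~4 characters per token heuristic (rule 6).
func estimatedTokens(_ text: String) -> Int {
    max(1, text.count / 4)
}

let budgetLeft = 4096 - estimatedTokens(instructions)
print("instructions ~\(estimatedTokens(instructions)) tokens, ~\(budgetLeft) left for prompt + output")
```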

Safety and Guardrails

  • Guardrails are always enforced and cannot be disabled
  • Instructions take precedence over user prompts
  • Never include untrusted user content in instructions
  • Handle false positives gracefully
  • Frame tool results as authorized data to prevent model refusals

Use Cases

Foundation Models supports specialized use cases via `SystemLanguageModel.UseCase`:
  • `.general` -- Default for text generation, summarization, dialog
  • `.contentTagging` -- Optimized for categorization and labeling tasks
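
A minimal sketch of requesting the content-tagging variant; this assumes the `SystemLanguageModel(useCase:)` initializer, and the prompt is illustrative:

```swift
import FoundationModels

// Model variant tuned for tagging rather than free-form generation.
let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: taggingModel)

let response = try await session.respond(
    to: "Tag this note: 'Book flights and reserve a table for Saturday.'"
)
print(response.content)
```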

Custom Adapters

Load fine-tuned adapters for specialized behavior (requires entitlement):
```swift
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
```
See references/foundation-models.md for the complete Foundation Models API reference.

Core ML Overview

Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).

Model Formats

| Format | Extension | When to Use |
|---|---|---|
| `.mlpackage` | Directory (mlprogram) | All new models (iOS 15+) |
| `.mlmodel` | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| `.mlmodelc` | Compiled | Pre-compiled for faster loading |

Always use mlprogram (`.mlpackage`) for new work.

Conversion Pipeline (coremltools)

PyTorch conversion (torch.jit.trace)

```python
import coremltools as ct

model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram",
)
mlmodel.save("Model.mlpackage")
```

Optimization Techniques

| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
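
As one example, 4-bit palettization with coremltools might look like the following sketch (coremltools 7+ API; `Model.mlpackage` is a placeholder path, and accuracy should be re-validated afterwards):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load an existing mlprogram model (placeholder path).
mlmodel = ct.models.MLModel("Model.mlpackage")

# 4-bit k-means palettization applied to all supported ops.
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(nbits=4, mode="kmeans")
)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("Model-4bit.mlpackage")
```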

Swift Integration

```swift
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)

// Async prediction (iOS 17+)
let output = try await model.prediction(from: input)
```

MLTensor (iOS 18+)

Swift type for multidimensional array operations:
```swift
import CoreML

let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
```
See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.

MLX Swift Overview

Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.

Loading and Running LLMs

```swift
import MLX
import MLXLLM

let config = ModelConfiguration(id: "mlx-community/Mistral-7B-Instruct-v0.3-4bit")
let model = try await LLMModelFactory.shared.loadContainer(configuration: config)

try await model.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Hello")
    )
    let stream = try generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.0),
        context: context
    )
    for await part in stream {
        print(part.chunk ?? "", terminator: "")
    }
}
```

Model Selection by Device

| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |

Memory Management

  1. Never exceed 60% of total RAM on iOS
  2. Set GPU cache limits: `MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)`
  3. Unload models on app backgrounding
  4. Use the "Increased Memory Limit" entitlement for larger models
  5. Physical device required (no simulator support for Metal GPU)
See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.
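
Rules 2 and 3 can be sketched in SwiftUI; `ModelHolder` and its `unload()` method are hypothetical stand-ins for your own model wrapper:

```swift
import SwiftUI
import MLX

struct ChatScreen: View {
    @Environment(\.scenePhase) private var scenePhase
    @State private var holder = ModelHolder()  // hypothetical wrapper around the loaded model

    var body: some View {
        Text("Chat")
            .task {
                // Rule 2: cap the Metal buffer cache at 512 MB.
                MLX.GPU.set(cacheLimit: 512 * 1024 * 1024)
            }
            .onChange(of: scenePhase) { _, phase in
                // Rule 3: release weights before iOS reclaims memory.
                if phase == .background { holder.unload() }
            }
    }
}
```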

Multi-Backend Architecture

When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
```swift
func respond(to prompt: String) async throws -> String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}
```
Serialize all model access through a coordinator actor to prevent contention:
```swift
actor ModelCoordinator {
    func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
        try await work()
    }
}
```
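
Callers then route every backend call through the actor; a minimal usage sketch reusing the `respond(to:)` function above:

```swift
let coordinator = ModelCoordinator()

// All backend calls funnel through the coordinator for serialized access.
func answer(_ prompt: String) async throws -> String {
    try await coordinator.withExclusiveAccess {
        try await respond(to: prompt)
    }
}
```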

Performance Best Practices

  1. Run outside the debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck "Debug Executable")
  2. Call `session.prewarm()` for Foundation Models before user interaction
  3. Pre-compile Core ML models to `.mlmodelc` for faster loading
  4. Use EnumeratedShapes over RangeDim for Neural Engine optimization
  5. Use 4-bit palettization for the best Neural Engine memory/latency gains
  6. Batch Vision framework requests in a single `perform()` call
  7. Use async prediction (iOS 17+) in Swift concurrency contexts
  8. The Neural Engine (Core ML) is the most energy-efficient compute unit for compatible operations
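
Practice 3 can be done once at first launch with `MLModel.compileModel(at:)`; a sketch, where paths and the caching strategy are up to the app:

```swift
import CoreML

// Compile a .mlpackage to .mlmodelc, then load the compiled model.
func loadCompiled(from packageURL: URL) async throws -> MLModel {
    // compileModel(at:) writes the .mlmodelc into a temporary directory;
    // a real app would move it to Application Support and reuse it on later launches.
    let compiledURL = try await MLModel.compileModel(at: packageURL)
    return try MLModel(contentsOf: compiledURL)
}
```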

Common Mistakes

  1. No availability check. Calling `LanguageModelSession()` without checking `SystemLanguageModel.default.availability` crashes on unsupported devices.
  2. No fallback UI. Users on pre-iOS 26 devices or devices without Apple Intelligence see nothing. Always provide a graceful degradation path.
  3. Exceeding the context window. Foundation Models has a 4096-token total budget (input + output). Long prompts or multi-turn sessions hit this fast. Monitor token usage and summarize when needed.
  4. Concurrent requests on one session. `LanguageModelSession` supports one request at a time. Check `session.isResponding` or serialize access.
  5. Untrusted content in instructions. User input placed in the instructions parameter bypasses guardrail boundaries. Keep user content in the prompt.
  6. Forgetting `model.eval()` before Core ML tracing. PyTorch models must be in eval mode before `torch.jit.trace`. Training-mode artifacts corrupt output.
  7. Using the neuralnetwork format. Always use `mlprogram` (.mlpackage) for new Core ML models. The legacy neuralnetwork format is deprecated.
  8. Exceeding 60% of RAM on iOS (MLX Swift). Large models cause OOM kills. Check device RAM and select appropriate model sizes.
  9. Running MLX in the simulator. MLX requires a Metal GPU -- use physical devices.
  10. Not unloading models on backgrounding. iOS reclaims memory aggressively. Unload MLX/llama.cpp models when `scenePhase == .background`.

Review Checklist

  • Framework selection matches use case and target OS version
  • Foundation Models: availability checked before every API call
  • Foundation Models: graceful fallback when model unavailable
  • Foundation Models: session prewarm called before user interaction
  • Foundation Models: @Generable properties in logical generation order
  • Foundation Models: token budget accounted for (4096 total)
  • Core ML: model format is mlprogram (.mlpackage) for iOS 15+
  • Core ML: model.eval() called before tracing/exporting PyTorch models
  • Core ML: minimum_deployment_target set explicitly
  • Core ML: model accuracy validated after compression
  • MLX Swift: model size appropriate for target device RAM
  • MLX Swift: GPU cache limits set, models unloaded on backgrounding
  • All model access serialized through coordinator actor
  • Concurrency: model types and tool implementations are `Sendable`-conformant or `@MainActor`-isolated
  • Physical device testing performed (not simulator)

Reference Files

  • Foundation Models API -- Complete LanguageModelSession, @Generable, tool calling, and prompt design reference
  • Core ML Conversion -- Model conversion pipeline from PyTorch, TensorFlow, and other frameworks
  • Core ML Optimization -- Quantization, palettization, pruning, and performance tuning
  • MLX Swift & llama.cpp -- MLX Swift patterns, llama.cpp integration, and memory management