axiom-vision

Vision Framework Computer Vision

Guides you through implementing computer vision: subject segmentation, hand/body pose detection, person detection, text recognition, barcode detection, document scanning, and combining Vision APIs to solve complex problems.

When to Use This Skill

Use when you need to:
  • ☑ Isolate subjects from backgrounds (subject lifting)
  • ☑ Detect and track hand poses for gestures
  • ☑ Detect and track body poses for fitness/action classification
  • ☑ Segment multiple people separately
  • ☑ Exclude hands from object bounding boxes (combining APIs)
  • ☑ Choose between VisionKit and Vision framework
  • ☑ Combine Vision with CoreImage for compositing
  • ☑ Decide which Vision API solves your problem
  • ☑ Recognize text in images (OCR)
  • ☑ Detect barcodes and QR codes
  • ☑ Scan documents with perspective correction
  • ☑ Extract structured data from documents (iOS 26+)
  • ☑ Build live scanning experiences (DataScannerViewController)

Example Prompts

  • "How do I isolate a subject from the background?"
  • "I need to detect hand gestures like pinch"
  • "How can I get a bounding box around an object without including the hand holding it?"
  • "Should I use VisionKit or Vision framework for subject lifting?"
  • "How do I segment multiple people separately?"
  • "I need to detect body poses for a fitness app"
  • "How do I preserve HDR when compositing subjects on new backgrounds?"
  • "How do I recognize text in an image?"
  • "I need to scan QR codes from camera"
  • "How do I extract data from a receipt?"
  • "Should I use DataScannerViewController or Vision directly?"
  • "How do I scan documents and correct perspective?"
  • "I need to extract table data from a document"

Red Flags

Signs you're making this harder than it needs to be:
  • ❌ Manually implementing subject segmentation with CoreML models
  • ❌ Using ARKit just for body pose (Vision also works offline on still images and recorded video, with no AR session)
  • ❌ Writing gesture recognition from scratch (use hand pose + simple distance checks)
  • ❌ Processing on main thread (blocks UI - Vision is resource intensive)
  • ❌ Training custom models when Vision APIs already exist
  • ❌ Not checking confidence scores (low confidence = unreliable landmarks)
  • ❌ Forgetting to convert coordinates (lower-left origin vs UIKit top-left)
  • ❌ Building custom text recognizer when VNRecognizeTextRequest exists
  • ❌ Using AVFoundation + Vision when DataScannerViewController suffices
  • ❌ Processing every camera frame for scanning (skip frames, use region of interest)
  • ❌ Enabling all barcode symbologies when you only need one (performance hit)
  • ❌ Ignoring RecognizeDocumentsRequest when you need table/list structure (iOS 26+)
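The coordinate conversion red flag in the list above is easy to get subtly wrong, so here is a minimal sketch of it. It assumes the image fills a view of the given size with no letterboxing or rotation; the function name `viewRect(forNormalized:in:)` is illustrative and not part of Vision. Vision's own `VNImageRectForNormalizedRect` scales to pixels but keeps the lower-left origin, so the vertical flip still has to be done by hand.

```swift
import Foundation

/// Converts a normalized rect (0...1, origin at lower left, as Vision reports it)
/// into view coordinates with a top-left origin (as UIKit expects).
/// Hypothetical helper: assumes the image exactly fills `viewSize`.
func viewRect(forNormalized rect: CGRect, in viewSize: CGSize) -> CGRect {
    CGRect(
        x: rect.minX * viewSize.width,
        // Flip vertically: the top edge in UIKit is 1 - maxY in Vision space.
        y: (1 - rect.maxY) * viewSize.height,
        width: rect.width * viewSize.width,
        height: rect.height * viewSize.height
    )
}

let normalized = CGRect(x: 0.1, y: 0.2, width: 0.3, height: 0.4)
let converted = viewRect(forNormalized: normalized, in: CGSize(width: 100, height: 200))
print(converted)
```

If the image is aspect-fit or aspect-filled inside the view, the same flip applies, but the scale and offset of the displayed image rect have to be accounted for first.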

Mandatory First Steps

Before implementing any Vision feature:

1. Choose the Right API (Decision Tree)

What do you need to do?

┌─ Isolate subject(s) from background?
│  ├─ Need system UI + out-of-process → VisionKit
│  │  └─ ImageAnalysisInteraction (iOS/iPadOS)
│  │  └─ ImageAnalysisOverlayView (macOS)
│  ├─ Need custom pipeline / HDR / large images → Vision
│  │  └─ VNGenerateForegroundInstanceMaskRequest
│  └─ Need to EXCLUDE hands from object → Combine APIs
│     └─ Subject mask + Hand pose + custom masking (see Pattern 1)
├─ Segment people?
│  ├─ All people in one mask → VNGeneratePersonSegmentationRequest
│  └─ Separate mask per person (up to 4) → VNGeneratePersonInstanceMaskRequest
├─ Detect hand pose/gestures?
│  ├─ Just hand location → VNDetectHumanHandPoseRequest (bounding box from landmarks)
│  └─ 21 hand landmarks → VNDetectHumanHandPoseRequest
│     └─ Gesture recognition → Hand pose + distance checks
├─ Detect body pose?
│  ├─ 2D normalized landmarks → VNDetectHumanBodyPoseRequest
│  ├─ 3D real-world coordinates → VNDetectHumanBodyPose3DRequest
│  └─ Action classification → Body pose + CreateML model
├─ Face detection?
│  ├─ Just bounding boxes → VNDetectFaceRectanglesRequest
│  └─ Detailed landmarks → VNDetectFaceLandmarksRequest
├─ Person detection (location only)?
│  └─ VNDetectHumanRectanglesRequest
├─ Recognize text in images?
│  ├─ Real-time from camera + need UI → DataScannerViewController (iOS 16+)
│  ├─ Processing captured image → VNRecognizeTextRequest
│  │  ├─ Need speed (real-time camera) → recognitionLevel = .fast
│  │  └─ Need accuracy (documents) → recognitionLevel = .accurate
│  └─ Need structured documents (iOS 26+) → RecognizeDocumentsRequest
├─ Detect barcodes/QR codes?
│  ├─ Real-time camera + need UI → DataScannerViewController (iOS 16+)
│  └─ Processing image → VNDetectBarcodesRequest
└─ Scan documents?
   ├─ Need built-in UI + perspective correction → VNDocumentCameraViewController
   ├─ Need structured data (tables, lists) → RecognizeDocumentsRequest (iOS 26+)
   └─ Custom pipeline → VNDetectDocumentSegmentationRequest + perspective correction

2. Set Up Background Processing

NEVER run Vision on main thread:
swift
let processingQueue = DispatchQueue(label: "com.yourapp.vision", qos: .userInitiated)

processingQueue.async {
    do {
        let request = VNGenerateForegroundInstanceMaskRequest()
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])

        // Process observations...

        DispatchQueue.main.async {
            // Update UI
        }
    } catch {
        // Handle error
    }
}

3. Verify Platform Availability

| API | Minimum Version |
| --- | --- |
| Subject segmentation (instance masks) | iOS 17+ |
| VisionKit subject lifting | iOS 16+ |
| Hand pose | iOS 14+ |
| Body pose (2D) | iOS 14+ |
| Body pose (3D) | iOS 17+ |
| Person instance segmentation | iOS 17+ |
| VNRecognizeTextRequest (basic) | iOS 13+ |
| VNRecognizeTextRequest (accurate, multi-lang) | iOS 14+ |
| VNDetectBarcodesRequest | iOS 11+ |
| VNDetectBarcodesRequest (revision 2: Codabar, MicroQR) | iOS 15+ |
| VNDetectBarcodesRequest (revision 3: ML-based) | iOS 16+ |
| DataScannerViewController | iOS 16+ |
| VNDocumentCameraViewController | iOS 13+ |
| VNDetectDocumentSegmentationRequest | iOS 15+ |
| RecognizeDocumentsRequest | iOS 26+ |
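One way to act on these minimum versions at runtime is an `#available` check that picks a code path. The sketch below is only an illustration: the `SubjectLiftingPath` enum and `subjectLiftingPath()` function are hypothetical names, and the macOS 14 pairing is an assumption added here, since the list above only covers iOS.

```swift
/// Hypothetical result type describing which subject lifting path to use.
enum SubjectLiftingPath {
    case visionInstanceMask   // VNGenerateForegroundInstanceMaskRequest
    case unavailable          // fall back to your own handling
}

/// Picks a path based on the OS version the app is running on.
/// The macOS 14.0 value is an assumption, not taken from the list above.
func subjectLiftingPath() -> SubjectLiftingPath {
    if #available(iOS 17.0, macOS 14.0, *) {
        return .visionInstanceMask
    }
    return .unavailable
}

print(subjectLiftingPath())
```

Keeping the check in one function means call sites switch over an enum instead of repeating availability conditions.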

Common Patterns

Pattern 1: Isolate Object While Excluding Hand

User's original problem: Getting a bounding box around an object held in hand, without including the hand.
Root cause: VNGenerateForegroundInstanceMaskRequest is class-agnostic and treats hand+object as one subject.
Solution: Combine the subject mask with hand pose to create an exclusion mask.
swift
// Runs inside a throwing function that returns the object's bounds (CGRect?).
// calculateConvexHull, createMaskFromRegion, subtractMasks, and
// calculateBoundingBox are helpers you implement.

// 1. Run subject segmentation and hand pose detection in one pass
let subjectRequest = VNGenerateForegroundInstanceMaskRequest()
let handRequest = VNDetectHumanHandPoseRequest()
handRequest.maximumHandCount = 2

let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([subjectRequest, handRequest])

guard let subjectObservation = subjectRequest.results?.first as? VNInstanceMaskObservation else {
    return nil  // No subject detected
}

// 2. Full-resolution mask covering every detected instance
let subjectMask = try subjectObservation.generateScaledMaskForImage(
    forInstances: subjectObservation.allInstances,
    from: handler
)
let subjectCIMask = CIImage(cvPixelBuffer: subjectMask)

guard let handObservation = handRequest.results?.first as? VNHumanHandPoseObservation else {
    // No hand detected - use full subject mask
    return calculateBoundingBox(from: subjectCIMask)
}

// 3. Create hand exclusion region from landmarks (normalized, lower-left origin)
let handPoints = try handObservation.recognizedPoints(.all)
let handBounds = calculateConvexHull(from: handPoints)  // Your implementation

// 4. Subtract hand region from subject mask using CoreImage
// CGImage has no `size` property, so build the size from width and height
let imageSize = CGSize(width: sourceImage.width, height: sourceImage.height)
let handMask = createMaskFromRegion(handBounds, size: imageSize)
let finalMask = subtractMasks(handMask: handMask, from: subjectCIMask)

// 5. Calculate bounding box from final mask
let objectBounds = calculateBoundingBox(from: finalMask)
Helper: Convex Hull
swift
func calculateConvexHull(from points: [VNHumanHandPoseObservation.JointName: VNRecognizedPoint]) -> CGRect {
    // Get high-confidence points
    let validPoints = points.values.filter { $0.confidence > 0.5 }

    guard !validPoints.isEmpty else { return .zero }

    // Simple bounding rect (for more accuracy, use actual convex hull algorithm)
    let xs = validPoints.map { $0.location.x }
    let ys = validPoints.map { $0.location.y }

    let minX = xs.min()!
    let maxX = xs.max()!
    let minY = ys.min()!
    let maxY = ys.max()!

    return CGRect(
        x: minX,
        y: minY,
        width: maxX - minX,
        height: maxY - minY
    )
}
Cost: 2-5 hours initial implementation, 30 min ongoing maintenance
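The helper above returns a bounding rectangle and notes that an actual convex hull would be more accurate. As a rough sketch of that option, the following uses Andrew's monotone chain algorithm on plain points. `Point2D` and `convexHull(of:)` are hypothetical names introduced for this example; mapping each high-confidence `VNRecognizedPoint.location` into `Point2D` is left to the caller.

```swift
/// Minimal point type so the sketch has no framework dependencies.
struct Point2D: Equatable {
    var x: Double
    var y: Double
}

/// Cross product of vectors OA and OB; positive means a counter-clockwise turn.
func cross(_ o: Point2D, _ a: Point2D, _ b: Point2D) -> Double {
    (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x)
}

/// Convex hull via Andrew's monotone chain, returned in counter-clockwise order.
/// Inputs with fewer than three points are returned sorted and unchanged.
func convexHull(of points: [Point2D]) -> [Point2D] {
    let sorted = points.sorted { $0.x != $1.x ? $0.x < $1.x : $0.y < $1.y }
    guard sorted.count > 2 else { return sorted }

    // Build one half of the hull, dropping points that make a non-left turn.
    func halfHull(_ input: [Point2D]) -> [Point2D] {
        var hull: [Point2D] = []
        for p in input {
            while hull.count >= 2,
                  cross(hull[hull.count - 2], hull[hull.count - 1], p) <= 0 {
                hull.removeLast()
            }
            hull.append(p)
        }
        hull.removeLast()  // the last point repeats as the first of the other half
        return hull
    }

    return halfHull(sorted) + halfHull(Array(sorted.reversed()))
}

let hull = convexHull(of: [
    Point2D(x: 0, y: 0), Point2D(x: 1, y: 0), Point2D(x: 1, y: 1),
    Point2D(x: 0, y: 1), Point2D(x: 0.5, y: 0.5)
])
print(hull)
```

A polygon hull excludes less of the held object than a rectangle does, at the cost of having to rasterize a polygon instead of a rect when building the hand mask.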

Pattern 2: VisionKit Simple Subject Lifting

Use case: Add system-like subject lifting UI with minimal code.
swift
// iOS
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)

// macOS
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
When to use:
  • ✓ Want system behavior (long-press to select, drag to share)
  • ✓ Don't need custom processing pipeline
  • ✓ Image size within VisionKit limits (out-of-process)
Cost: 15 min implementation, 5 min ongoing

Pattern 3: Programmatic Subject Access (VisionKit)

Use case: Need subject images/bounds without UI interaction.
swift
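let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])

let analysis = try await analyzer.analyze(sourceImage, configuration: configuration)

// Subjects are exposed through ImageAnalysisInteraction (iOS 17+),
// so attach the analysis to an interaction on the image view first
let interaction = ImageAnalysisInteraction()
interaction.analysis = analysis
interaction.preferredInteractionTypes = .imageSubject
imageView.addInteraction(interaction)

// Get all subjects
for subject in await interaction.subjects {
    let subjectImage = try await subject.image
    let subjectBounds = subject.bounds

    // Process subject...
}

// Tap-based lookup
if let subject = await interaction.subject(at: tapPoint) {
    let compositeImage = try await interaction.image(for: [subject])
}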
Cost: 30 min implementation, 10 min ongoing

Pattern 4: Vision Instance Mask for Custom Pipeline

Use case: HDR preservation, large images, custom compositing.
swift
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: sourceImage)
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

// Get soft segmentation mask, scaled to the source image's resolution for compositing
let mask = try observation.generateScaledMaskForImage(
    forInstances: observation.allInstances,
    from: handler
)

// Use with CoreImage for HDR preservation
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(CIImage(cgImage: sourceImage), forKey: kCIInputImageKey)
filter.setValue(CIImage(cvPixelBuffer: mask), forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)

let compositedImage = filter.outputImage
Cost: 1 hour implementation, 15 min ongoing

Pattern 5: Tap-to-Select Instance

Use case: User taps to select which subject/person to lift.
swift
// 1. Read the instance label under the tap from the instance mask.
// The mask stores one byte per pixel: 0 is background, 1...n are instances.
let instanceMask = observation.instanceMask

CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }

guard let baseAddress = CVPixelBufferGetBaseAddress(instanceMask) else {
    return
}
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
let maskWidth = CVPixelBufferGetWidth(instanceMask)
let maskHeight = CVPixelBufferGetHeight(instanceMask)

// Convert normalized tap to mask pixel coordinates.
// Assumes tapPoint is normalized (0...1) with a top-left origin, matching the
// buffer's row order; flip y first if it is in Vision's lower-left space.
let pixelPoint = VNImagePointForNormalizedPoint(tapPoint, maskWidth - 1, maskHeight - 1)

let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
let label = baseAddress.load(fromByteOffset: offset, as: UInt8.self)

// 2. Background tapped: select all instances. Otherwise select the tapped one.
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))

// handler is the VNImageRequestHandler that produced the observation.
// For a cropped cut-out instead of a mask, use
// generateMaskedImage(ofInstances:from:croppedToInstancesExtent:).
let mask = try observation.generateScaledMaskForImage(
    forInstances: instances,
    from: handler
)
Cost: 45 min implementation, 10 min ongoing
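The offset arithmetic in the label lookup is where off-by-one and origin mistakes tend to hide, so it can help to model it without a pixel buffer and unit test it. The sketch below is a hypothetical pure-Swift stand-in: it assumes one byte per pixel, rows stored top to bottom, possible row padding (`bytesPerRow` larger than `width`), and a tap expressed in Vision's lower-left normalized space. `label(atNormalized:in:width:height:bytesPerRow:)` is an illustrative name, not a Vision API.

```swift
/// Returns the label under a normalized point (0...1, lower-left origin)
/// in a row-major, one-byte-per-pixel buffer whose rows run top to bottom.
/// Hypothetical helper for testing the index math only.
func label(atNormalized point: (x: Double, y: Double),
           in bytes: [UInt8],
           width: Int,
           height: Int,
           bytesPerRow: Int) -> UInt8 {
    // Clamp so that x == 1.0 or y == 0.0 still land inside the buffer.
    let column = min(max(Int(point.x * Double(width)), 0), width - 1)
    // Flip y: row 0 is the top of the image, but y == 1 is the top in Vision space.
    let row = min(max(Int((1 - point.y) * Double(height)), 0), height - 1)
    // Use bytesPerRow, not width, because rows may be padded.
    return bytes[row * bytesPerRow + column]
}

// A 4x3 mask with rows padded to 8 bytes.
var mask = [UInt8](repeating: 0, count: 8 * 3)
mask[0 * 8 + 2] = 1   // instance 1 at top row, third column
mask[2 * 8 + 0] = 2   // instance 2 at bottom row, first column

print(label(atNormalized: (x: 0.6, y: 0.9), in: mask, width: 4, height: 3, bytesPerRow: 8))
```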

Pattern 6: Hand Gesture Recognition (Pinch)

Use case: Detect pinch gesture for custom camera trigger or UI control.
swift
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 1

try handler.perform([request])

guard let observation = request.results?.first as? VNHumanHandPoseObservation else {
    return
}

let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)

// Check confidence
guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
    return
}

// Calculate distance (normalized coordinates)
let dx = thumbTip.location.x - indexTip.location.x
let dy = thumbTip.location.y - indexTip.location.y
let distance = sqrt(dx * dx + dy * dy)

let isPinching = distance < 0.05  // Adjust threshold

// State machine for evidence accumulation
if isPinching {
    pinchFrameCount += 1
    if pinchFrameCount >= 3 {
        state = .pinched
    }
} else {
    pinchFrameCount = max(0, pinchFrameCount - 1)
    if pinchFrameCount == 0 {
        state = .apart
    }
}
Cost: 2 hours implementation, 20 min ongoing
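The distance check and frame counter above can be pulled into a small value type so the gesture logic is testable without a camera. This is a sketch under stated assumptions: `PinchDetector` and `PinchState` are hypothetical names, the 0.05 threshold and 3-frame requirement are copied from the snippet above, and capping the counter at `requiredFrames` is an addition made here so that a long pinch does not take equally long to release.

```swift
import Foundation

enum PinchState {
    case apart
    case pinched
}

/// Accumulates evidence across frames before switching state,
/// which filters out single-frame landmark jitter.
struct PinchDetector {
    var threshold = 0.05      // normalized distance; tune per use case
    var requiredFrames = 3    // consecutive-ish frames needed to switch
    private(set) var state: PinchState = .apart
    private var evidence = 0

    /// Feed one frame's thumb tip and index tip locations (normalized coordinates).
    mutating func update(thumb: (x: Double, y: Double),
                         index: (x: Double, y: Double)) -> PinchState {
        let distance = hypot(thumb.x - index.x, thumb.y - index.y)
        if distance < threshold {
            evidence = min(requiredFrames, evidence + 1)  // cap is an assumption
            if evidence >= requiredFrames { state = .pinched }
        } else {
            evidence = max(0, evidence - 1)
            if evidence == 0 { state = .apart }
        }
        return state
    }
}

var detector = PinchDetector()
for _ in 0..<3 {
    _ = detector.update(thumb: (x: 0.50, y: 0.50), index: (x: 0.51, y: 0.51))
}
print(detector.state)
```

Only feed the detector frames where both fingertips passed the confidence check; skipping low-confidence frames entirely keeps them from counting as evidence either way.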

Pattern 7: Separate Multiple People

Use case: Apply different effects to each person or count people.
swift
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

let peopleCount = observation.allInstances.count  // Up to 4

for personIndex in observation.allInstances {
    let personMask = try observation.generateScaledMaskForImage(
        forInstances: IndexSet(integer: personIndex),
        from: handler
    )

    // Apply effect to this person only
    applyEffect(to: personMask, personIndex: personIndex)
}
Crowded scenes (>4 people):
swift
// Count faces to detect crowding
let faceRequest = VNDetectFaceRectanglesRequest()
try handler.perform([faceRequest])

let faceCount = faceRequest.results?.count ?? 0

if faceCount > 4 {
    // Fallback: Use single mask for all people
    let singleMaskRequest = VNGeneratePersonSegmentationRequest()
    try handler.perform([singleMaskRequest])
}
Cost: 1.5 hours implementation, 15 min ongoing

Pattern 8: Body Pose for Action Classification

Use case: Fitness app that recognizes exercises (jumping jacks, squats, etc.)
swift
// 1. Collect body pose observations
// (keep this array outside the per-frame callback so it persists across frames)
var poseObservations: [VNHumanBodyPoseObservation] = []

let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])

if let observation = request.results?.first as? VNHumanBodyPoseObservation {
    poseObservations.append(observation)
}

// 2. When you have 60 frames of poses, prepare for CreateML model
if poseObservations.count == 60 {
    // keypointsMultiArray() returns one [1, 3, 18] array per frame:
    // (x, y, confidence) for 18 joints in the fixed order Create ML expects.
    // Enumerating the recognizedPoints dictionary gives no stable joint order.
    let frames = try poseObservations.map { try $0.keypointsMultiArray() }
    let multiArray = MLMultiArray(
        concatenating: frames,
        axis: 0,  // Result shape: [60, 3, 18]
        dataType: .float
    )

    // 3. Run inference with CreateML model
    let input = YourActionClassifierInput(poses: multiArray)
    let output = try actionClassifier.prediction(input: input)

    let action = output.label  // "jumping_jacks", "squats", etc.

    poseObservations.removeAll()  // Start collecting the next window
}
Cost: 3-4 hours implementation, 1 hour ongoing
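The snippet above classifies once 60 poses have been collected. If predictions need to arrive more often than once per 60 frames, one option is an overlapping sliding window. The following is a sketch, not part of Vision or Core ML: `SlidingWindow` is a hypothetical generic type, and the size and stride values are placeholders to tune.

```swift
/// Buffers elements and emits a full window every `stride` pushes once `size` is reached.
/// Hypothetical helper; stores any element type (for example pose observations).
struct SlidingWindow<Element> {
    let size: Int
    let stride: Int
    private var buffer: [Element] = []

    init(size: Int, stride: Int) {
        precondition(size > 0 && stride > 0 && stride <= size,
                     "stride must be in 1...size")
        self.size = size
        self.stride = stride
    }

    /// Adds an element; returns a complete window when one is ready, otherwise nil.
    mutating func push(_ element: Element) -> [Element]? {
        buffer.append(element)
        guard buffer.count == size else { return nil }
        let window = buffer
        buffer.removeFirst(stride)  // keep the overlap for the next window
        return window
    }
}

var window = SlidingWindow<Int>(size: 4, stride: 2)
for frame in 1...6 {
    if let ready = window.push(frame) {
        print(ready)
    }
}
```

With a size of 60 and a stride of 30, each prediction would reuse the most recent 30 frames, trading extra inference work for lower latency.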

Pattern 9: Text Recognition (OCR)

Use case: Extract text from images, receipts, signs, documents.
swift
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate  // Or .fast for real-time
request.recognitionLanguages = ["en-US"]  // Specify known languages
request.usesLanguageCorrection = true  // Helps accuracy

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observations = request.results as? [VNRecognizedTextObservation] else {
    return
}

for observation in observations {
    // Get top candidate (most likely)
    guard let candidate = observation.topCandidates(1).first else { continue }

    let text = candidate.string
    let confidence = candidate.confidence

    // Get bounding box for specific substring
    if let range = text.range(of: searchTerm) {
        if let boundingBox = try? candidate.boundingBox(for: range) {
            // Use for highlighting
        }
    }
}
Fast vs Accurate:
  • Fast: Real-time camera, large legible text (signs, billboards), character-by-character
  • Accurate: Documents, receipts, small text, handwriting, ML-based word/line recognition
Language tips:
  • Order matters: first language determines ML model for accurate path
  • Use automaticallyDetectsLanguage = true only when language unknown
  • Query supportedRecognitionLanguages for current revision
Cost: 30 min basic implementation, 2 hours with language handling
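After recognition, the results usually need filtering and ordering before they are useful as a block of text. The sketch below shows one way to do that on plain values. It assumes you have already copied each top candidate's string, confidence, and normalized bounding box position (lower-left origin) into a `RecognizedLine`; that type, `readingOrderText(from:minimumConfidence:)`, and the 0.5 cutoff are all hypothetical choices for this example.

```swift
/// Plain-value copy of one recognized line, so the logic is testable without Vision.
struct RecognizedLine {
    let text: String
    let confidence: Float
    let midY: Double   // normalized, lower-left origin: larger means higher on the page
    let minX: Double   // normalized left edge
}

/// Drops low-confidence lines, then orders the rest top-to-bottom and left-to-right.
func readingOrderText(from lines: [RecognizedLine],
                      minimumConfidence: Float = 0.5) -> String {
    lines
        .filter { $0.confidence >= minimumConfidence }
        .sorted { a, b in
            // Higher midY first (top of page), then leftmost first on ties.
            a.midY != b.midY ? a.midY > b.midY : a.minX < b.minX
        }
        .map(\.text)
        .joined(separator: "\n")
}

let text = readingOrderText(from: [
    RecognizedLine(text: "Total", confidence: 0.90, midY: 0.2, minX: 0.1),
    RecognizedLine(text: "Receipt", confidence: 0.95, midY: 0.9, minX: 0.1),
    RecognizedLine(text: "noise", confidence: 0.10, midY: 0.5, minX: 0.1)
])
print(text)
```

Exact `midY` comparison is a simplification; for multi-column layouts, grouping lines whose vertical positions fall within a small tolerance tends to work better.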

Pattern 10: Barcode/QR Code Detection

Use case: Scan product barcodes, QR codes, healthcare codes.
swift
let request = VNDetectBarcodesRequest()
request.revision = VNDetectBarcodesRequestRevision3  // ML-based, iOS 16+
request.symbologies = [.qr, .ean13]  // Specify only what you need!

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observations = request.results as? [VNBarcodeObservation] else {
    return
}

for barcode in observations {
    let payload = barcode.payloadStringValue  // Decoded content
    let symbology = barcode.symbology  // Type of barcode
    let bounds = barcode.boundingBox  // Location (normalized)

    print("Found \(symbology): \(payload ?? "no string")")
}
Performance tip: Specifying fewer symbologies = faster scanning
Revision differences:
  • Revision 1: One code at a time, 1D codes return lines
  • Revision 2: Codabar, GS1Databar, MicroPDF, MicroQR, better with ROI
  • Revision 3: ML-based, multiple codes at once, better bounding boxes, fewer duplicates
Cost: 15 min implementation

Pattern 11: DataScannerViewController (Live Scanning)

Use case: Camera-based text/barcode scanning with built-in UI (iOS 16+).
swift
import VisionKit

// Check support
guard DataScannerViewController.isSupported,
      DataScannerViewController.isAvailable else {
    // Not supported or camera access denied
    return
}

// Configure what to scan
let recognizedDataTypes: Set<DataScannerViewController.RecognizedDataType> = [
    .barcode(symbologies: [.qr]),
    .text(textContentType: .URL)  // Or nil for all text
]

// Create and present
let scanner = DataScannerViewController(
    recognizedDataTypes: recognizedDataTypes,
    qualityLevel: .balanced,  // Or .fast, .accurate
    recognizesMultipleItems: false,  // Center-most if false
    isHighFrameRateTrackingEnabled: true,  // For smooth highlights
    isPinchToZoomEnabled: true,
    isGuidanceEnabled: true,
    isHighlightingEnabled: true
)

scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}
Delegate methods:
swift
func dataScanner(_ scanner: DataScannerViewController,
                 didTapOn item: RecognizedItem) {
    switch item {
    case .text(let text):
        print("Tapped text: \(text.transcript)")
    case .barcode(let barcode):
        print("Tapped barcode: \(barcode.payloadStringValue ?? "")")
    @unknown default: break
    }
}

// For custom highlights
func dataScanner(_ scanner: DataScannerViewController,
                 didAdd addedItems: [RecognizedItem],
                 allItems: [RecognizedItem]) {
    for item in addedItems {
        let highlight = createHighlight(for: item)
        scanner.overlayContainerView.addSubview(highlight)
    }
}
Async stream alternative:
swift
for await items in scanner.recognizedItems {
    // Process current items
}
Cost: 45 min implementation with custom highlights

Pattern 12: Document Scanning with VNDocumentCameraViewController

Use case: Scan paper documents with automatic edge detection and perspective correction.
swift
import Vision
import VisionKit

let documentCamera = VNDocumentCameraViewController()
documentCamera.delegate = self
present(documentCamera, animated: true)

// In delegate
func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                   didFinishWith scan: VNDocumentCameraScan) {
    controller.dismiss(animated: true)

    // Process each page
    for pageIndex in 0..<scan.pageCount {
        let image = scan.imageOfPage(at: pageIndex)
        // Skip the page instead of crashing if no CGImage backs this UIImage
        guard let cgImage = image.cgImage else { continue }

        // Now run text recognition on the corrected image
        let handler = VNImageRequestHandler(cgImage: cgImage)
        let textRequest = VNRecognizeTextRequest()
        try? handler.perform([textRequest])
    }
}
Cost: 30 min implementation

Pattern 13: Document Segmentation (Custom Pipeline)

Use case: Detect document edges programmatically for custom camera UI.
swift
let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: inputImage)
try handler.perform([request])

guard let observation = request.results?.first,
      let document = observation as? VNRectangleObservation else {
    return
}

// Get corner points (normalized coordinates)
let topLeft = document.topLeft
let topRight = document.topRight
let bottomLeft = document.bottomLeft
let bottomRight = document.bottomRight

// Apply perspective correction with CoreImage
let correctedImage = inputImage
    .cropped(to: document.boundingBox.scaled(to: imageSize))
    .applyingFilter("CIPerspectiveCorrection", parameters: [
        "inputTopLeft": CIVector(cgPoint: topLeft.scaled(to: imageSize)),
        "inputTopRight": CIVector(cgPoint: topRight.scaled(to: imageSize)),
        "inputBottomLeft": CIVector(cgPoint: bottomLeft.scaled(to: imageSize)),
        "inputBottomRight": CIVector(cgPoint: bottomRight.scaled(to: imageSize))
    ])
VNDetectDocumentSegmentationRequest vs VNDetectRectanglesRequest:
  • Document: ML-based, trained on documents, handles non-rectangles, returns one document
  • Rectangle: Edge-based, finds any quadrilateral, returns multiple, CPU-only
Cost: 1-2 hours implementation
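The `scaled(to:)` calls in the snippet above are not part of CoreGraphics or Vision, so they are presumably helper extensions defined elsewhere. A minimal sketch of what such helpers might look like, assuming `imageSize` is the image extent in pixels (the names and signatures are assumptions made for illustration, not the original author's code):

```swift
import Foundation

// Hypothetical helpers: map Vision's normalized (0...1) coordinates to pixel coordinates.
// No Y flip is applied, because Vision and Core Image both use a lower-left origin.
extension CGPoint {
    func scaled(to imageSize: CGSize) -> CGPoint {
        CGPoint(x: x * imageSize.width, y: y * imageSize.height)
    }
}

extension CGRect {
    func scaled(to imageSize: CGSize) -> CGRect {
        CGRect(x: origin.x * imageSize.width,
               y: origin.y * imageSize.height,
               width: size.width * imageSize.width,
               height: size.height * imageSize.height)
    }
}

// Example values chosen for illustration only
let imageSize = CGSize(width: 4000, height: 3000)
let topLeft = CGPoint(x: 0.25, y: 0.75).scaled(to: imageSize)
let box = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5).scaled(to: imageSize)
print(topLeft, box)
```

Vision also ships `VNImagePointForNormalizedPoint` and `VNImageRectForNormalizedRect`, which perform the same scaling and may be preferable to hand-written helpers.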

Pattern 14: Structured Document Extraction (iOS 26+)

Use case: Extract tables, lists, paragraphs with semantic understanding.
swift
// iOS 26+
let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)

guard let document = observations.first?.document else {
    return
}

// Extract tables
for table in document.tables {
    for row in table.rows {
        for cell in row {
            let text = cell.content.text.transcript
            print("Cell: \(text)")
        }
    }
}

// Get detected data (emails, phones, URLs, dates)
let allDetectedData = document.text.detectedData
for data in allDetectedData {
    switch data.match.details {
    case .emailAddress(let email):
        print("Email: \(email.emailAddress)")
    case .phoneNumber(let phone):
        print("Phone: \(phone.phoneNumber)")
    case .link(let url):
        print("URL: \(url)")
    default: break
    }
}
Document hierarchy:
  • Document → containers (text, tables, lists, barcodes)
  • Table → rows → cells → content
  • Content → text (transcript, lines, paragraphs, words, detectedData)
Cost: 1 hour implementation

Pattern 15: Real-time Phone Number Scanner

Use case: Scan phone numbers from camera like barcode scanner (from WWDC 2019).
swift
// 1. Use region of interest to guide user
let textRequest = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

    for observation in observations {
        guard let candidate = observation.topCandidates(1).first else { continue }

        // Use domain knowledge to filter
        if let phoneNumber = self.extractPhoneNumber(from: candidate.string) {
            self.stringTracker.add(phoneNumber)
        }
    }

    // Build evidence over frames
    if let stableNumber = self.stringTracker.getStableString(threshold: 10) {
        self.foundPhoneNumber(stableNumber)
    }
}

textRequest.recognitionLevel = .fast  // Real-time
textRequest.usesLanguageCorrection = false  // Codes, not natural text
textRequest.regionOfInterest = guidanceBox  // Crop to user's focus area

// 2. String tracker for stability
class StringTracker {
    private var seenStrings: [String: Int] = [:]

    func add(_ string: String) {
        seenStrings[string, default: 0] += 1
    }

    func getStableString(threshold: Int) -> String? {
        seenStrings.first { $0.value >= threshold }?.key
    }
}
Key techniques from WWDC 2019:
  • Use .fast recognition level for real-time
  • Disable language correction for codes/numbers
  • Use region of interest to improve speed and focus
  • Build evidence over multiple frames (string tracker)
  • Apply domain knowledge (phone number regex)
Cost: 2 hours implementation
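The snippet above calls `extractPhoneNumber(from:)` without showing it, and its `getStableString` returns whichever qualifying entry the dictionary happens to yield first. The sketch below fills both gaps under stated assumptions: the regular expression targets US-style 10-digit numbers only (an illustrative choice, not the pattern used in the WWDC sample), and the tracker variant returns the most frequently seen string and adds a `reset()` method that is not in the original.

```swift
import Foundation

// Hypothetical helper: pull a US-style 10-digit phone number out of recognized text.
// The pattern is an illustrative assumption; real apps need locale-aware rules.
func extractPhoneNumber(from text: String) -> String? {
    let pattern = #"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"#
    guard let range = text.range(of: pattern, options: .regularExpression) else {
        return nil
    }
    // Keep digits only so "(555) 123-4567" and "555.123.4567" count as the same string.
    return String(text[range].filter { $0.isNumber })
}

// Tracker variant: returns the most frequently seen string once it reaches the threshold.
final class StringTracker {
    private var seenStrings: [String: Int] = [:]

    func add(_ string: String) {
        seenStrings[string, default: 0] += 1
    }

    func getStableString(threshold: Int) -> String? {
        guard let best = seenStrings.max(by: { $0.value < $1.value }),
              best.value >= threshold else { return nil }
        return best.key
    }

    // Call after a number has been accepted so stale counts do not leak into the next scan.
    func reset() {
        seenStrings.removeAll()
    }
}

// Simulated per-frame OCR output, including one misread frame
let tracker = StringTracker()
let frames = ["Call (555) 123-4567", "Call 555.123.4567", "Ca11 555-123-4561", "555 123 4567"]
for frame in frames {
    if let number = extractPhoneNumber(from: frame) {
        tracker.add(number)
    }
}
let stable = tracker.getStableString(threshold: 3)
print(stable ?? "not stable yet")
```

Normalizing to digits before counting lets differently formatted readings of the same number reinforce each other, while the single misread frame never reaches the threshold.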

Anti-Patterns

Anti-Pattern 1: Processing on Main Thread

Wrong:
swift
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])  // Blocks UI!
Right:
swift
DispatchQueue.global(qos: .userInitiated).async {
    let request = VNGenerateForegroundInstanceMaskRequest()
    let handler = VNImageRequestHandler(cgImage: image)
    do {
        // The async closure cannot throw, so handle the error here
        try handler.perform([request])
    } catch {
        print("Vision request failed: \(error)")
        return
    }

    DispatchQueue.main.async {
        // Update UI
    }
}
Why it matters: Vision is resource-intensive. Blocking the main thread freezes the UI.

Anti-Pattern 2: Ignoring Confidence Scores

Wrong:
swift
let thumbTip = try observation.recognizedPoint(.thumbTip)
let location = thumbTip.location  // May be unreliable!
Right:
swift
let thumbTip = try observation.recognizedPoint(.thumbTip)
guard thumbTip.confidence > 0.5 else {
    // Low confidence - landmark unreliable
    return
}
let location = thumbTip.location
Why it matters: Low confidence points are inaccurate (occlusion, blur, edge of frame).
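As an illustration of gating on confidence before using landmarks, here is a minimal pinch check. `Landmark` is a hypothetical stand-in for the `location` and `confidence` fields of a Vision recognized point so the sketch runs without a camera, and the 0.05 distance threshold is an assumed value that would need tuning.

```swift
import Foundation

// Hypothetical stand-in for the fields of a Vision recognized point used here.
struct Landmark {
    let location: CGPoint   // normalized, lower-left origin
    let confidence: Float
}

// Returns nil when either landmark is too uncertain to trust,
// so callers can tell "not pinching" apart from "unknown".
func isPinching(thumbTip: Landmark,
                indexTip: Landmark,
                minimumConfidence: Float = 0.5,
                maximumDistance: CGFloat = 0.05) -> Bool? {
    guard thumbTip.confidence > minimumConfidence,
          indexTip.confidence > minimumConfidence else {
        return nil
    }
    let dx = thumbTip.location.x - indexTip.location.x
    let dy = thumbTip.location.y - indexTip.location.y
    return (dx * dx + dy * dy).squareRoot() < maximumDistance
}

let thumb = Landmark(location: CGPoint(x: 0.50, y: 0.50), confidence: 0.9)
let index = Landmark(location: CGPoint(x: 0.52, y: 0.51), confidence: 0.8)
let blurry = Landmark(location: CGPoint(x: 0.52, y: 0.51), confidence: 0.2)

let confident = isPinching(thumbTip: thumb, indexTip: index)
let uncertain = isPinching(thumbTip: thumb, indexTip: blurry)
print(String(describing: confident), String(describing: uncertain))
```

Returning an optional keeps a low-confidence frame from being read as "the pinch ended", which would otherwise make gestures flicker.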

Anti-Pattern 3: Forgetting Coordinate Conversion

Wrong (mixing coordinate systems):
swift
// Vision uses lower-left origin
let visionPoint = recognizedPoint.location  // (0, 0) = bottom-left

// UIKit uses top-left origin
let uiPoint = CGPoint(x: visionPoint.x, y: visionPoint.y)  // WRONG!
Right:
swift
let visionPoint = recognizedPoint.location

// Convert to UIKit coordinates
let uiPoint = CGPoint(
    x: visionPoint.x * imageWidth,
    y: (1 - visionPoint.y) * imageHeight  // Flip Y axis
)
Why it matters: Mismatched origins cause UI overlays to appear in wrong positions.
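The conversion can be wrapped in small helpers so the Y flip is written once. This is a sketch that assumes the overlay is drawn in the image's own pixel space; if the image is displayed aspect-fit or aspect-fill, an additional view transform is needed. The function names are illustrative.

```swift
import Foundation

// Convert a Vision normalized point (lower-left origin, 0...1) into
// top-left-origin pixel coordinates of the kind UIKit expects.
func uiKitPoint(fromNormalized point: CGPoint, imageSize: CGSize) -> CGPoint {
    CGPoint(x: point.x * imageSize.width,
            y: (1 - point.y) * imageSize.height)
}

// Same idea for a normalized bounding box: the flipped origin is the box's top edge.
func uiKitRect(fromNormalized rect: CGRect, imageSize: CGSize) -> CGRect {
    CGRect(x: rect.origin.x * imageSize.width,
           y: (1 - rect.origin.y - rect.size.height) * imageSize.height,
           width: rect.size.width * imageSize.width,
           height: rect.size.height * imageSize.height)
}

// Example values chosen for illustration only
let size = CGSize(width: 400, height: 200)
let convertedPoint = uiKitPoint(fromNormalized: CGPoint(x: 0.25, y: 0.75), imageSize: size)
let convertedRect = uiKitRect(fromNormalized: CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25),
                              imageSize: size)
print(convertedPoint, convertedRect)
```

Rectangles need the extra `height` term because flipping the axis turns the box's bottom edge into its top edge.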

Anti-Pattern 4: Setting maximumHandCount Too High

Wrong:
swift
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 10  // "Just in case"
Right:
swift
let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2  // Only compute what you need
Why it matters: Performance scales with maximumHandCount. Pose is computed for every detected hand up to that maximum.

Anti-Pattern 5: Using ARKit When Vision Suffices

Wrong (if you don't need AR):
swift
// Requires an AR session just for body pose
let configuration = ARBodyTrackingConfiguration()
Right:
swift
// Vision works offline on still images
let request = VNDetectHumanBodyPoseRequest()
Why it matters: ARKit body tracking requires the rear camera, a running AR session, and a supported device. Vision has no such requirements and also works offline on still images.

Pressure Scenarios

Scenario 1: "Just Ship the Feature"

Context: Product manager wants subject lifting "like in Photos app" by Friday. You're considering skipping background processing.
Pressure: "It's working on my iPhone 15 Pro, let's ship it."
Reality: Vision blocks the UI on older devices. Users on an iPhone 12 will experience a frozen app.
Correct action:
  1. Implement background queue (15 min)
  2. Add loading indicator (10 min)
  3. Test on iPhone 12 or earlier (5 min)
Push-back template: "Subject lifting works, but it freezes the UI on older devices. I need 30 minutes to add background processing and prevent 1-star reviews."

Scenario 2: "Training Our Own Model"

Context: Designer wants to exclude hands from subject bounding box. Engineer suggests training custom CoreML model for specific object detection.
Pressure: "We need perfect bounds, let's train a model."
Reality: Training requires labeled dataset (weeks), ongoing maintenance, and still won't generalize to new objects. Built-in Vision APIs + hand pose solve it in 2-5 hours.
Correct action:
  1. Explain Pattern 1 (combine subject mask + hand pose)
  2. Prototype in 1 hour to demonstrate
  3. Compare against training timeline (weeks vs hours)
Push-back template: "Training a model takes weeks and only works for specific objects. I can combine Vision APIs to solve this in a few hours and it'll work for any object."

Scenario 3: "We Can't Wait for iOS 17"

Context: You need instance masks but app supports iOS 15+.
Pressure: "Just use iOS 15 person segmentation and ship it."
Reality: VNGeneratePersonSegmentationRequest (iOS 15) returns a single mask for all people. It doesn't solve the multi-person use case.
Correct action:
  1. Raise minimum deployment target to iOS 17 (best UX)
  2. OR implement fallback: use iOS 15 API but disable multi-person features
  3. OR use @available to conditionally enable features
Push-back template: "Person segmentation on iOS 15 combines all people into one mask. We can either require iOS 17 for the best experience, or disable multi-person features on older OS versions. Which do you prefer?"
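For the conditional-enable option, note that `@available` annotates declarations while `#available` is the runtime check used at call sites; the two are typically used together. A minimal sketch of the runtime side, where `SegmentationStrategy` and `chooseSegmentationStrategy()` are hypothetical names introduced for illustration and the actual Vision requests are left out so the snippet stays self-contained:

```swift
// Hypothetical strategy type for illustration; not part of Vision.
enum SegmentationStrategy {
    case perPersonMasks   // iOS 17+ instance masks
    case combinedMask     // iOS 15 person segmentation (one mask for everyone)

    var supportsMultiPerson: Bool { self == .perPersonMasks }
}

func chooseSegmentationStrategy() -> SegmentationStrategy {
    // #available is evaluated at runtime against the OS the app is running on.
    if #available(iOS 17.0, macOS 14.0, *) {
        return .perPersonMasks
    } else {
        return .combinedMask
    }
}

let strategy = chooseSegmentationStrategy()
print(strategy.supportsMultiPerson ? "Enable multi-person UI" : "Hide multi-person UI")
```

Centralizing the check in one function keeps the UI code free of scattered availability conditions, and the multi-person features can be hidden rather than failing on older systems.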

Checklist

Before shipping Vision features:
Performance:
  • ☑ All Vision requests run on background queue
  • ☑ UI shows loading indicator during processing
  • ☑ Tested on iPhone 12 or earlier (not just latest devices)
  • ☑ maximumHandCount set to minimum needed value
Accuracy:
  • ☑ Confidence scores checked before using landmarks
  • ☑ Fallback behavior for low confidence observations
  • ☑ Handles case where no subjects/hands/people detected
Coordinates:
  • ☑ Vision coordinates (lower-left origin) converted to UIKit (top-left)
  • ☑ Normalized coordinates scaled to pixel dimensions
  • ☑ UI overlays aligned correctly with image
Platform Support:
  • ☑ @available checks for iOS 17+ APIs (instance masks)
  • ☑ Fallback for iOS 14-16 (or raised deployment target)
  • ☑ Tested on actual devices, not just simulator
Edge Cases:
  • ☑ Handles images with no detectable subjects
  • ☑ Handles partially occluded hands/bodies
  • ☑ Handles hands/bodies near image edges
  • ☑ Handles >4 people for person instance segmentation
CoreImage Integration (if applicable):
  • ☑ HDR preservation verified with high dynamic range images
  • ☑ Mask resolution matches source image
  • ☑ croppedToInstancesContent set appropriately (false for compositing)
Text/Barcode Recognition (if applicable):
  • ☑ Recognition level matches use case (fast for real-time, accurate for documents)
  • ☑ Language correction disabled for codes/serial numbers
  • ☑ Barcode symbologies limited to actual needs (performance)
  • ☑ Region of interest used to focus scanning area
  • ☑ Multiple candidates checked (not just top candidate)
  • ☑ Evidence accumulated over frames for real-time (string tracker)
  • ☑ DataScannerViewController availability checked before presenting

Resources

WWDC: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2020-10653
Docs: /vision, /visionkit, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest
Skills: axiom-vision-ref, axiom-vision-diag