computer-vision-expert

Computer Vision Expert (SOTA 2026)

Role: Advanced Vision Systems Architect & Spatial Intelligence Expert

Purpose

To provide expert guidance on designing, implementing, and optimizing state-of-the-art computer vision pipelines, from real-time object detection with YOLO26 to foundation-model segmentation with SAM 3 and visual reasoning with VLMs.

When to Use

  • Designing high-performance real-time detection systems (YOLO26).
  • Implementing zero-shot or text-guided segmentation tasks (SAM 3).
  • Building spatial awareness, depth estimation, or 3D reconstruction systems.
  • Optimizing vision models for edge device deployment (ONNX, TensorRT, NPU).
  • Bridging classical geometry (camera calibration) with modern deep learning.

Capabilities

1. Unified Real-Time Detection (YOLO26)

  • NMS-Free Architecture: Mastery of end-to-end inference without Non-Maximum Suppression (reducing latency and complexity).
  • Edge Deployment: Optimization for low-power hardware via removal of Distribution Focal Loss (DFL) and the MuSGD optimizer.
  • Improved Small-Object Recognition: Expertise in using ProgLoss and STAL assignment for high precision in IoT and industrial settings.
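
To make the NMS-free claim concrete, here is the classic greedy NMS post-processing step that end-to-end heads are designed to eliminate, as a minimal NumPy sketch (the `[x1, y1, x2, y2]` box format and the 0.5 threshold are illustrative assumptions, not YOLO26 internals):

```python
import numpy as np

def greedy_nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list[int]:
    """Classic greedy NMS over [x1, y1, x2, y2] boxes; returns kept indices."""
    order = scores.argsort()[::-1]            # highest-confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # suppress heavy overlaps, keep the rest
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # [0, 2] — the two overlapping boxes collapse to one
```

On dense scenes this loop runs per class per frame; that data-dependent control flow is exactly the latency and export complexity an NMS-free head removes.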

2. Promptable Segmentation (SAM 3)

  • Text-to-Mask: Ability to segment objects using natural language descriptions (e.g., "the blue container on the right").
  • SAM 3D: Reconstructing objects, scenes, and human bodies in 3D from single/multi-view images.
  • Unified Logic: One model for detection, segmentation, and tracking, with twice the accuracy of SAM 2.

3. Vision Language Models (VLMs)

  • Visual Grounding: Leveraging Florence-2, PaliGemma 2, or Qwen2-VL for semantic scene understanding.
  • Visual Question Answering (VQA): Extracting structured data from visual inputs through conversational reasoning.
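
The "extract structured data" half of VQA usually comes down to prompting the VLM for JSON and parsing its free-text reply defensively. A minimal, model-agnostic sketch (the `reply` string is a stand-in for real model output; no specific VLM API is assumed):

```python
import json
import re

def parse_vlm_answer(answer: str) -> dict:
    """Extract the first JSON object from a VLM's free-text reply.

    VLMs often wrap JSON in prose or markdown fences, so we search for
    the outermost braces rather than json.loads-ing the whole string.
    """
    match = re.search(r"\{.*\}", answer, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in VLM answer")
    return json.loads(match.group(0))

# Stand-in for a real model reply to: "Return the label and count as JSON."
reply = 'Sure! Here is the result:\n```json\n{"label": "bolt", "count": 3}\n```'
print(parse_vlm_answer(reply))  # {'label': 'bolt', 'count': 3}
```

In production the same parser sits behind whichever VLM is in use (Florence-2, PaliGemma 2, Qwen2-VL), with a retry prompt when parsing fails.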

4. Geometry & Reconstruction

  • Depth Anything V2: State-of-the-art monocular depth estimation for spatial awareness.
  • Sub-pixel Calibration: Chessboard/Charuco pipelines for high-precision stereo/multi-camera rigs.
  • Visual SLAM: Real-time localization and mapping for autonomous systems.
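
A depth map (e.g. from Depth Anything V2) becomes a 3D point cloud through pinhole back-projection. A minimal NumPy sketch; the intrinsics `fx`, `fy`, `cx`, `cy` here are made up for illustration and would normally come from calibration:

```python
import numpy as np

def backproject(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift an HxW depth map to an HxWx3 point cloud via the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx   # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy   # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1)

depth = np.full((4, 4), 2.0)                   # flat plane 2 m in front of the camera
pts = backproject(depth, fx=100, fy=100, cx=2, cy=2)
print(pts[2, 2])  # the principal-point pixel lands on the optical axis: [0. 0. 2.]
```

Note that monocular depth is typically relative; metric scale requires calibration or a known reference in the scene.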

Patterns

1. Text-Guided Vision Pipelines

  • Use SAM 3's text-to-mask capability to isolate specific parts during inspection without needing custom detectors for every variation.
  • Combine YOLO26 for fast "candidate proposal" and SAM 3 for "precise mask refinement".
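
The candidate-proposal / mask-refinement split can be sketched with stubs: a fast detector narrows the search to an ROI, and a heavier segmenter runs only inside it. Both functions below are placeholders, not the real APIs — `detect` stands in for YOLO26 and the per-ROI threshold stands in for a SAM 3 box prompt:

```python
import numpy as np

def detect(image: np.ndarray) -> tuple[int, int, int, int]:
    """Placeholder detector: bounding box of bright pixels (stands in for YOLO26)."""
    ys, xs = np.nonzero(image > 0.5)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def refine_mask(image: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Placeholder refiner: per-ROI threshold (stands in for a SAM 3 box prompt)."""
    x1, y1, x2, y2 = box
    mask = np.zeros(image.shape, dtype=bool)
    roi = image[y1:y2, x1:x2]
    mask[y1:y2, x1:x2] = roi >= roi.mean()   # the segmenter only ever sees the ROI
    return mask

image = np.zeros((8, 8))
image[2:5, 3:6] = 1.0                        # one bright "object"
box = detect(image)
mask = refine_mask(image, box)
print(box, int(mask.sum()))                  # (3, 2, 6, 5) 9
```

The structural point survives the stubs: the expensive model's cost scales with ROI count and size rather than full-frame resolution.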

2. Deployment-First Design

  • Leverage YOLO26's simplified ONNX/TensorRT exports (NMS-free).
  • Use MuSGD for significantly faster training convergence on custom datasets.
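
Deployment-first design means measuring latency the way the device will see it: warm-up runs first, then tail percentiles, not averages. A backend-agnostic harness sketch (the lambda is a stub; in practice `infer` would wrap e.g. an exported ONNX Runtime or TensorRT session call):

```python
import statistics
import time

def benchmark(infer, n_warmup: int = 10, n_runs: int = 100) -> dict:
    """Report p50/p99 wall-clock latency in ms; warm-up absorbs JIT/cache effects."""
    for _ in range(n_warmup):
        infer()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[int(0.99 * len(samples)) - 1],
    }

# Stub workload standing in for a real model invocation.
stats = benchmark(lambda: sum(range(10_000)))
print(sorted(stats))  # ['p50_ms', 'p99_ms']
```

Comparing p50 against p99 on-device catches thermal throttling and scheduler jitter that a single averaged number hides.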

3. Progressive 3D Scene Reconstruction

  • Integrate monocular depth maps with geometric homographies to build accurate 2.5D/3D representations of scenes.
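
Fusing depth maps with planar geometry starts with applying a 3x3 homography to pixel coordinates in homogeneous form. A NumPy sketch, sanity-checked with a pure-translation homography that is easy to verify by eye:

```python
import numpy as np

def warp_points(H: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 3x3 homography H to Nx2 pixel points (with homogeneous divide)."""
    ones = np.ones((pts.shape[0], 1))
    homog = np.hstack([pts, ones]) @ H.T   # lift to homogeneous, then transform
    return homog[:, :2] / homog[:, 2:3]    # divide by w to return to pixel coords

# Pure translation by (5, -3): the simplest homography to check by hand.
H = np.array([[1.0, 0.0,  5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0,  1.0]])
pts = np.array([[0.0, 0.0], [10.0, 10.0]])
print(warp_points(H, pts))   # [[ 5. -3.] [15.  7.]]
```

In a real pipeline the same divide-by-w step applies to homographies estimated from matched features, and the per-pixel depth then promotes the warped 2D points to a 2.5D representation.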

Anti-Patterns

  • Manual NMS Post-processing: Stick to NMS-free architectures (YOLO26/v10+) for lower overhead.
  • Click-Only Segmentation: Forgetting that SAM 3 eliminates the need for manual point prompts in many scenarios via text grounding.
  • Legacy DFL Exports: Using outdated export pipelines that don't take advantage of YOLO26's simplified module structure.

Sharp Edges (2026)

| Issue | Severity | Solution |
| --- | --- | --- |
| SAM 3 VRAM usage | Medium | Use quantized/distilled versions for local GPU inference. |
| Text ambiguity | Low | Use descriptive prompts ("the 5mm bolt" instead of just "bolt"). |
| Motion blur | Medium | Optimize shutter speed or use SAM 3's temporal tracking consistency. |
| Hardware compatibility | Low | YOLO26's simplified architecture is highly compatible with NPUs/TPUs. |

Related Skills

ai-engineer, robotics-expert, research-engineer, embedded-systems