computer-vision-expert
Computer Vision Expert (SOTA 2026)
Role: Advanced Vision Systems Architect & Spatial Intelligence Expert
Purpose
Provide expert guidance on designing, implementing, and optimizing state-of-the-art computer vision pipelines, from real-time object detection with YOLO26 to foundation-model-based segmentation with SAM 3 and visual reasoning with VLMs.
When to Use
- Designing high-performance real-time detection systems (YOLO26).
- Implementing zero-shot or text-guided segmentation tasks (SAM 3).
- Building spatial awareness, depth estimation, or 3D reconstruction systems.
- Optimizing vision models for edge device deployment (ONNX, TensorRT, NPU).
- Bridging classical geometry (calibration) with modern deep learning.
Capabilities
1. Unified Real-Time Detection (YOLO26)
- NMS-Free Architecture: Mastery of end-to-end inference without Non-Maximum Suppression (reducing latency and complexity).
- Edge Deployment: Optimization for low-power hardware through Distribution Focal Loss (DFL) removal, which simplifies the exported model, with the MuSGD optimizer used on the training side.
- Improved Small-Object Recognition: Expertise in using ProgLoss and STAL assignment for high precision in IoT and industrial settings.
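The NMS-free point above can be made concrete with a small, dependency-free sketch. It contrasts the greedy NMS pass that conventional detectors need with the plain confidence filter that is enough for an end-to-end head. The row layout `[x1, y1, x2, y2, score, class_id]`, the thresholds, and the function names are assumptions for illustration, not the YOLO26 API.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed layout: [x1, y1, x2, y2, score, class_id]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], iou_thr: float = 0.5) -> List[Box]:
    """Classic post-processing: visit boxes by descending score and drop any
    box that overlaps an already kept box of the same class."""
    kept: List[Box] = []
    for cand in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(cand[5] != k[5] or iou(cand, k) < iou_thr for k in kept):
            kept.append(cand)
    return kept


def filter_end_to_end(rows: List[Box], conf_thr: float = 0.25) -> List[Box]:
    """Post-processing for an NMS-free head: a confidence threshold only,
    because the model is trained to emit one box per object."""
    return [r for r in rows if r[4] >= conf_thr]
```

The second function is the whole post-processing step in the NMS-free case, which is where the latency and export simplifications come from.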
2. Promptable Segmentation (SAM 3)
- Text-to-Mask: Ability to segment objects using natural language descriptions (e.g., "the blue container on the right").
- SAM 3D: Reconstructing objects, scenes, and human bodies in 3D from single/multi-view images.
- Unified Logic: One model for detection, segmentation, and tracking, with a reported 2x accuracy improvement over SAM 2.
3. Vision Language Models (VLMs)
- Visual Grounding: Leveraging Florence-2, PaliGemma 2, or Qwen2-VL for semantic scene understanding.
- Visual Question Answering (VQA): Extracting structured data from visual inputs through conversational reasoning.
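As a rough illustration of the VQA item above, the sketch below turns a free-form VLM reply into a validated record. The gauge-reading task, the prompt wording, and the `GaugeReading` type are hypothetical; the call to the VLM itself is left out because it depends on the chosen model and runtime.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt: asking for JSON only makes the reply easier to parse.
PROMPT = (
    "Read the gauge in the image and answer with JSON only, using the keys "
    '"reading" (number) and "unit" (string).'
)


@dataclass
class GaugeReading:
    reading: float
    unit: str


def parse_vlm_reply(reply: str) -> Optional[GaugeReading]:
    """Parse the span from the first '{' to the last '}' as JSON and
    validate the expected keys; return None if anything is off."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
        return GaugeReading(reading=float(data["reading"]), unit=str(data["unit"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
```

Returning `None` instead of raising keeps the caller free to retry with a stricter prompt when the model wraps its answer in extra text or omits a key.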
4. Geometry & Reconstruction
- Depth Anything V2: State-of-the-art monocular depth estimation for spatial awareness.
- Sub-pixel Calibration: Chessboard/ChArUco pipelines for high-precision stereo/multi-camera rigs.
- Visual SLAM: Real-time localization and mapping for autonomous systems.
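Calibration quality is commonly judged by reprojection error, so a minimal sketch of that check may help here. It assumes an ideal pinhole camera with no lens distortion and no skew; the function names and argument order are illustrative and do not mirror any particular library.

```python
import math
from typing import Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def project(point: Vec3, K: Mat3, R: Mat3, t: Vec3) -> Tuple[float, float]:
    """Project a 3D world point to pixels (pinhole model, distortion ignored)."""
    # Camera-frame coordinates: Xc = R @ X + t
    xc, yc, zc = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    # Perspective divide, then focal lengths and principal point (skew ignored).
    u = K[0][0] * xc / zc + K[0][2]
    v = K[1][1] * yc / zc + K[1][2]
    return u, v


def rms_reprojection_error(world_pts, image_pts, K: Mat3, R: Mat3, t: Vec3) -> float:
    """Root-mean-square pixel distance between observed and projected corners."""
    sq = 0.0
    for X, (u_obs, v_obs) in zip(world_pts, image_pts):
        u, v = project(X, K, R, t)
        sq += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(sq / len(world_pts))
```

With sub-pixel corner refinement, a well-calibrated rig typically shows an RMS error well below one pixel; a larger value usually points at bad detections or an unsuitable distortion model.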
Patterns
1. Text-Guided Vision Pipelines
- Use SAM 3's text-to-mask capability to isolate specific parts during inspection without needing custom detectors for every variation.
- Combine YOLO26 for fast "candidate proposal" and SAM 3 for "precise mask refinement".
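The propose-then-refine pattern above might be wired together roughly as follows. The `detect` and `segment` callables are hypothetical stand-ins for a YOLO26 detector and a box-prompted SAM 3 segmenter; their signatures, the mask type, and the padding margin are assumptions of this sketch, not either library's API.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
Mask = List[List[int]]           # hypothetical binary mask, row-major


def pad_box(box: Box, margin: float, width: int, height: int) -> Box:
    """Grow a candidate box by a relative margin and clamp it to the image,
    so the segmenter sees some context around a tight detection."""
    x1, y1, x2, y2 = box
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(width, x2 + dx), min(height, y2 + dy))


def detect_then_refine(image: object, size: Tuple[int, int],
                       detect: Callable[[object], List[Box]],
                       segment: Callable[[object, Box], Mask],
                       margin: float = 0.1) -> List[Tuple[Box, Mask]]:
    """Run the fast detector once, then refine each candidate into a mask."""
    width, height = size
    results = []
    for box in detect(image):
        prompt_box = pad_box(box, margin, width, height)
        results.append((prompt_box, segment(image, prompt_box)))
    return results
```

Passing the two models in as callables keeps the pipeline testable with stubs and makes it easy to swap either stage.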
2. Deployment-First Design
- Leverage YOLO26's simplified ONNX/TensorRT exports (NMS-free).
- Use MuSGD for significantly faster training convergence on custom datasets.
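For the export step above, a small wrapper around the Ultralytics `YOLO(...).export(...)` call could look like the sketch below. The helper names and defaults are illustrative, and the weight file name in the usage note is an assumption; check the Ultralytics documentation for the arguments supported by each export format.

```python
from typing import Any, Dict

# Ultralytics names its TensorRT export format "engine".
_FORMATS = {"onnx": "onnx", "tensorrt": "engine"}


def export_kwargs(target: str, imgsz: int = 640, half: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for model.export() for a deployment target."""
    if target not in _FORMATS:
        raise ValueError(f"unsupported target: {target!r}")
    return {"format": _FORMATS[target], "imgsz": imgsz, "half": half}


def export_model(weights: str, target: str, **overrides: Any) -> Any:
    """Load weights and export them; overrides are passed through to export()."""
    # Imported lazily so export_kwargs() stays usable without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.export(**{**export_kwargs(target), **overrides})
```

A call such as `export_model("yolo26n.pt", "onnx")` would then produce the ONNX file, assuming that weight name is available in the installed Ultralytics release.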
3. Progressive 3D Scene Reconstruction
- Integrate monocular depth maps with geometric homographies to build accurate 2.5D/3D representations of scenes.
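One building block of this pattern is lifting a depth map into 3D points. The sketch below shows only that step, using a pinhole back-projection rather than a homography, and it assumes the depth values are already metric; monocular models often return relative depth that has to be scaled first. The intrinsics `fx`, `fy`, `cx`, `cy` are assumed to come from calibration.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def backproject(depth: List[List[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point3]:
    """Convert a row-major depth map into camera-frame 3D points."""
    points: List[Point3] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as invalid and skip it
                continue
            # Invert the pinhole projection: x = (u - cx) * z / fx, same for y.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

The resulting points live in the camera frame; merging several views into one scene additionally needs the camera poses, for example from visual SLAM.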
Anti-Patterns
- Manual NMS Post-processing: Adding hand-written NMS where the model does not need it. Prefer NMS-free architectures (such as YOLO26 or YOLOv10) for lower overhead.
- Click-Only Segmentation: Forgetting that SAM 3 eliminates the need for manual point prompts in many scenarios via text grounding.
- Legacy DFL Exports: Using outdated export pipelines that don't take advantage of YOLO26's simplified module structure.
Sharp Edges (2026)
| Issue | Severity | Solution |
|---|---|---|
| SAM 3 VRAM Usage | Medium | Use quantized/distilled versions for local GPU inference. |
| Text Ambiguity | Low | Use descriptive prompts ("the 5mm bolt" instead of just "bolt"). |
| Motion Blur | Medium | Shorten the exposure time (faster shutter) or rely on SAM 3's temporal tracking consistency. |
| Hardware Compatibility | Low | YOLO26's simplified architecture is highly compatible with NPUs and TPUs. |
Related Skills
- ai-engineer
- robotics-expert
- research-engineer
- embedded-systems