omni-vu


omni.vu - Visual Understanding & Automation

Overview

omni.vu gives you eyes and hands on the user's macOS screen. Use it to:
  • See what the user sees (screen capture)
  • Understand UI state with AI vision
  • Detect changes and wait for events
  • Act with mouse and keyboard automation

When to Use This Skill

Proactively Use omni.vu When:

  1. Debugging UI issues - "The button isn't working" → capture and analyze
  2. Verifying changes - After modifying UI code, check if it rendered correctly
  3. Waiting for operations - Build finishing, deployment completing, tests running
  4. Understanding context - User describes something on screen you can't see
  5. Automating repetitive tasks - Clicking through UI flows, filling forms
  6. Documentation - Capturing screenshots of features

Do NOT Use When:

  • Reading/writing files (use Read/Write tools)
  • Running terminal commands (use Bash)
  • Making API calls (use appropriate tools)
  • User hasn't granted screen recording permission

Tool Reference

Capture Tools

vu - Full Screen Capture

vu(monitor=0, save_to_history=True)
Captures the entire screen. Returns a base64-encoded image.
Use when: You need to see everything on screen.
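Because the capture comes back as base64, it has to be decoded before it can be saved or inspected. The following is a minimal sketch, assuming the tool hands back a plain base64 string; the exact response shape and the image format are not stated here, and `decode_capture` and `looks_like_png` are hypothetical helpers.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_capture(image_b64: str) -> bytes:
    """Decode a base64-encoded capture into raw image bytes."""
    return base64.b64decode(image_b64, validate=True)


def looks_like_png(data: bytes) -> bool:
    """Check the 8-byte PNG signature (PNG output is an assumption)."""
    return data.startswith(PNG_SIGNATURE)


# Stand-in payload for illustration; a real call would pass vu()'s image data.
sample_b64 = base64.b64encode(PNG_SIGNATURE + b"fake-image-body").decode("ascii")
raw = decode_capture(sample_b64)
```

Checking the signature before writing the bytes to a file is a cheap way to notice a blank or malformed capture early.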

vu_window - Window Capture

vu_window(window_id=None, title="VS Code", include_frame=False)
Captures a specific window by ID or title (partial match).
Use when: You only need one application's content.
Workflow:
  1. Call vu_list_windows to see available windows
  2. Find the window_id or use title matching
  3. Call vu_window with that ID/title
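Step 2 of this workflow can be sketched as a small lookup over the window list. This assumes the list entries are dictionaries with the fields named for vu_list_windows; the case-insensitive match, the integer IDs, and the helper name `find_window_id` are illustrative assumptions.

```python
from typing import Optional


def find_window_id(windows: list[dict], title_part: str) -> Optional[int]:
    """Return the window_id of the first window whose title contains title_part."""
    needle = title_part.lower()
    for window in windows:
        if needle in window.get("title", "").lower():
            return window["window_id"]
    return None


# Sample data shaped like a vu_list_windows result; the values are made up.
windows = [
    {"window_id": 11, "title": "README.md - VS Code", "owner": "Code",
     "x": 0, "y": 25, "width": 1200, "height": 800},
    {"window_id": 12, "title": "GitHub - Google Chrome", "owner": "Google Chrome",
     "x": 1200, "y": 25, "width": 720, "height": 800},
]
target_id = find_window_id(windows, "vs code")
```

The resulting ID would then be passed as `vu_window(window_id=target_id)`.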

vu_region - Region Capture

vu_region(x=100, y=100, width=500, height=300)
Captures a specific rectangle of the screen.
Use when: You need a precise area (error message, specific component).
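A region is often easiest to express relative to a window. The sketch below converts a window-relative rectangle into screen coordinates and clips it to the window, assuming window bounds and vu_region share one coordinate space (not confirmed here); `region_in_window` is a hypothetical helper.

```python
def region_in_window(window: dict, rel_x: int, rel_y: int,
                     width: int, height: int) -> dict:
    """Translate a window-relative rectangle to screen coordinates, clipped to the window."""
    x = window["x"] + max(0, rel_x)
    y = window["y"] + max(0, rel_y)
    # Space left between the rectangle's origin and the window's far edges.
    max_w = window["x"] + window["width"] - x
    max_h = window["y"] + window["height"] - y
    return {
        "x": x,
        "y": y,
        "width": max(0, min(width, max_w)),
        "height": max(0, min(height, max_h)),
    }


window = {"x": 100, "y": 50, "width": 800, "height": 600}
region = region_in_window(window, 20, 30, 500, 300)
```

The result can be unpacked into the call, for example `vu_region(**region)`.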

vu_list_windows - List Windows

vu_list_windows(filter_app="Chrome")
Lists all visible windows with metadata.
Returns: List of {window_id, title, owner, x, y, width, height}

vu_list_monitors - List Displays

vu_list_monitors()
Lists connected displays with resolution and scale factor.

Vision Tools

vu_describe - AI Vision Analysis

vu_describe(
    prompt="What errors are visible?",
    provider="claude",  # or openai, gemini, ollama
    max_tokens=1024,
    capture_first=True
)
Captures the screen and analyzes it with AI.
Best prompts:
  • "Describe what you see on screen"
  • "Are there any error messages visible?"
  • "What is the state of the build/test output?"
  • "Is the login form filled correctly?"
  • "What color is the status indicator?"
Providers:
Provider  Speed   Quality    Cost
claude    Medium  Excellent  $$
openai    Fast    Very Good  $$
gemini    Fast    Good       $
ollama    Slow    Varies     Free
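With several providers available, one option is to try them in order of preference and fall back when one fails. This is a sketch only: it assumes a failed analysis raises an exception and that the result is a string, neither of which is specified here. `describe_with_fallback` and `fake_describe` are hypothetical names.

```python
from typing import Callable, Sequence


def describe_with_fallback(describe: Callable[..., str], prompt: str,
                           providers: Sequence[str]) -> tuple[str, str]:
    """Try each provider in order; return (provider, answer) for the first success."""
    errors = []
    for provider in providers:
        try:
            return provider, describe(prompt=prompt, provider=provider)
        except Exception as exc:  # a real caller would catch narrower errors
            errors.append(f"{provider}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Fake describe function for illustration: only "gemini" succeeds.
def fake_describe(prompt: str, provider: str) -> str:
    if provider != "gemini":
        raise ConnectionError("no API key")
    return "No errors visible."


used, answer = describe_with_fallback(fake_describe, "Any errors?", ["claude", "gemini"])
```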

Utility Tools

vu_diff - Change Detection

vu_diff(threshold=0.02, monitor=0)
Detects if screen changed since last capture.
Returns: {changed: bool, diff_percentage: float}
Use for polling patterns:
import time

# Wait for build to finish
while True:
    result = vu_diff(threshold=0.05)
    if result["changed"]:
        # Screen changed, check what happened
        analysis = vu_describe(prompt="Did the build succeed or fail?")
        break
    time.sleep(1)  # Wait before checking again (interval is illustrative)
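A polling loop of this kind never returns if the screen stays static, so a timeout is usually worth adding. The sketch below takes the diff call as a parameter so it can be exercised without the real tool; `wait_for_change`, its defaults, and the injected `sleep` and `clock` functions are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_change(diff: Callable[[], dict], timeout_s: float = 60.0,
                    interval_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll diff() until it reports a change; return False if the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if diff()["changed"]:
            return True
        sleep(interval_s)
    return False


# Stand-in for vu_diff: reports a change on the third poll.
results = iter([{"changed": False}, {"changed": False}, {"changed": True}])
changed = wait_for_change(lambda: next(results), timeout_s=5, sleep=lambda s: None)
```

In real use the first argument would wrap the tool call, for example `lambda: vu_diff(threshold=0.05)`.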

vu_history - Capture History

vu_history(limit=10, capture_type="full_screen")
Gets recent capture metadata.

vu_last - Last Capture

vu_last()
Returns the most recent capture with image data.
Use when: You need to re-analyze without re-capturing.

vu_status - System Status

vu_status()
Returns system info: monitors, providers, safety settings.

Automation Tools

vu_click - Mouse Click

vu_click(x=500, y=300, button="left", count=1)
Clicks at screen coordinates.
Buttons: left, right, middle
Count: 1 (single), 2 (double), 3 (triple)
Safety: Coordinates validated against screen bounds.

vu_move - Move Cursor

vu_move(x=500, y=300, duration_ms=0)
Moves the cursor to a position. Use duration_ms for animated movement.

vu_drag - Drag Operation

vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500)
Drags from start to end position.
Safety: Maximum 2000px drag distance.
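The distance limit can be checked before issuing a drag. This sketch assumes the limit applies to the straight-line (Euclidean) distance between the two points, which is not stated here; `drag_distance` and `drag_allowed` are hypothetical helpers.

```python
import math

MAX_DRAG_PX = 2000  # limit quoted above


def drag_distance(start_x: float, start_y: float,
                  end_x: float, end_y: float) -> float:
    """Straight-line distance between the drag start and end points."""
    return math.hypot(end_x - start_x, end_y - start_y)


def drag_allowed(start_x: float, start_y: float,
                 end_x: float, end_y: float) -> bool:
    """True if the drag stays within the assumed distance limit."""
    return drag_distance(start_x, start_y, end_x, end_y) <= MAX_DRAG_PX


distance = drag_distance(100, 100, 500, 300)
```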

vu_scroll - Scroll

vu_scroll(direction="down", amount=3, x=None, y=None)
Scrolls at the current or a specified position.
Directions: up, down, left, right
Amount: Lines to scroll (1-20)
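Since a single call accepts at most 20 lines, a longer scroll has to be split into several calls. A minimal sketch of that split follows; `scroll_amounts` is a hypothetical helper, not part of the tool.

```python
def scroll_amounts(total_lines: int, max_per_call: int = 20) -> list[int]:
    """Split a total scroll distance into per-call amounts of at most max_per_call."""
    if total_lines < 0:
        raise ValueError("total_lines must be non-negative")
    amounts = []
    remaining = total_lines
    while remaining > 0:
        step = min(remaining, max_per_call)
        amounts.append(step)
        remaining -= step
    return amounts


steps = scroll_amounts(45)
```

Each entry would then become one call, for example `vu_scroll(direction="down", amount=step)`.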

vu_type - Type Text

vu_type(text="Hello World", delay_between_ms=0)
Types text character by character.
Note: Click on the target field first!

vu_hotkey - Keyboard Shortcut

vu_hotkey(keys="cmd+s")
Executes keyboard shortcuts.
Format: modifier+key (e.g., cmd+c, ctrl+shift+s, cmd+opt+i)
Modifiers: cmd, ctrl, alt/opt, shift
Blocked hotkeys (for safety):
  • cmd+q (quit)
  • cmd+shift+q (logout)
  • cmd+opt+esc (force quit)
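A caller can check a shortcut against the blocked list before sending it. The sketch below normalizes case, modifier order, and the alt/opt alias first. How the tool itself parses and compares shortcuts is not described here, so this normalization is an assumption, and `normalize_hotkey` and `is_blocked` are hypothetical helpers.

```python
MODIFIER_ORDER = ["cmd", "ctrl", "opt", "shift"]
ALIASES = {"alt": "opt"}
BLOCKED = {"cmd+q", "cmd+shift+q", "cmd+opt+esc"}


def normalize_hotkey(keys: str) -> str:
    """Lowercase, map alt to opt, and put modifiers in a fixed order before the key."""
    parts = [ALIASES.get(p, p) for p in (s.strip().lower() for s in keys.split("+"))]
    modifiers = [m for m in MODIFIER_ORDER if m in parts]
    others = [p for p in parts if p not in MODIFIER_ORDER]
    return "+".join(modifiers + others)


def is_blocked(keys: str) -> bool:
    """True if the normalized shortcut is on the blocked list above."""
    return normalize_hotkey(keys) in BLOCKED


canonical = normalize_hotkey("Shift+Cmd+Q")
```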

Common Workflows

1. Debug UI Issue

User: "The submit button doesn't do anything when I click it"

1. vu_describe(prompt="Describe the form and submit button state")
2. Analyze: Is the button disabled? Is there a validation error?
3. If needed: vu_window(title="Chrome") for focused capture
4. Check console: vu_describe(prompt="Are there any errors in the developer console?")

2. Verify Build/Deploy

User: "Run the build and let me know when it's done"

1. Run: npm run build (via Bash)
2. Loop: vu_diff(threshold=0.05)
3. When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?")
4. Report result

3. Fill a Form

User: "Fill out the registration form with test data"

1. vu_describe(prompt="What form fields are visible?")
2. vu_click(x=field_x, y=field_y)  # Click first field
3. vu_type(text="testuser@example.com")
4. vu_hotkey(keys="tab")  # Move to next field
5. vu_type(text="Test User")
6. Continue for each field...
7. vu_click(x=submit_x, y=submit_y)  # Submit
8. vu_describe(prompt="Was the form submitted successfully?")
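The click, type, tab sequence in steps 2 to 6 can be sketched as one loop. The tool calls are passed in as parameters so the sequence can be tested with stand-ins; `fill_form` and the sample coordinates are hypothetical.

```python
from typing import Callable, Sequence


def fill_form(values: Sequence[str], first_field: tuple[int, int],
              click: Callable[..., None], type_text: Callable[..., None],
              hotkey: Callable[..., None]) -> None:
    """Click the first field, then type each value, pressing tab between fields."""
    click(x=first_field[0], y=first_field[1])
    for index, value in enumerate(values):
        if index > 0:
            hotkey(keys="tab")  # move focus to the next field
        type_text(text=value)


# Recording stand-ins for vu_click, vu_type and vu_hotkey.
calls = []
fill_form(
    ["testuser@example.com", "Test User"], (400, 220),
    click=lambda **kw: calls.append(("click", kw)),
    type_text=lambda **kw: calls.append(("type", kw)),
    hotkey=lambda **kw: calls.append(("hotkey", kw)),
)
```

This relies on tab order matching the visual field order, which is worth confirming with a capture before and after.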

4. Navigate UI

User: "Open the settings panel in VS Code"

1. vu_list_windows(filter_app="Code")
2. vu_window(title="Code")  # Capture VS Code
3. vu_hotkey(keys="cmd+,")  # Open settings
4. vu_diff()  # Wait for settings to open
5. vu_describe(prompt="Are the VS Code settings now visible?")

5. Monitor for Changes

User: "Watch the dashboard and tell me when the status changes"

1. vu()  # Initial capture
2. Loop every 5 seconds:
   - vu_diff(threshold=0.02)
   - If changed: vu_describe(prompt="What changed on the dashboard?")
3. Report changes to user

Best Practices

1. Capture Before Acting

Always capture and analyze before clicking/typing:
❌ Bad: vu_click(x=500, y=300)  # Hope it's the right spot

✅ Good:
1. vu_describe(prompt="Where is the submit button?")
2. Parse coordinates from description
3. vu_click(x=parsed_x, y=parsed_y)
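Step 2 depends on the description containing coordinates in a predictable form. One way to get that is to ask for them explicitly in the prompt (for example, "Answer with the button center as (x, y)") and extract them with a regular expression. The answer format and the helper `parse_point` are assumptions, and model-estimated coordinates can be imprecise, so verifying after the click remains important.

```python
import re
from typing import Optional

POINT_PATTERN = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_point(description: str) -> Optional[tuple[int, int]]:
    """Extract the first "(x, y)" pair of non-negative integers from a description."""
    match = POINT_PATTERN.search(description)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


point = parse_point("The submit button is centered at (512, 384).")
```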

2. Use Appropriate Capture Scope

Full screen → General context, multi-app workflows
Window → Single app focus, cleaner output
Region → Specific element, faster/smaller

3. Specific Vision Prompts

❌ Vague: "What do you see?"

✅ Specific:
- "Is there an error message visible? If so, what does it say?"
- "What is the status of the test runner? Passing, failing, or running?"
- "List all form fields visible and their current values"

4. Rate Limit Automation

Don't spam clicks. Wait between actions:
vu_click(x=100, y=100)
# Wait for UI response
vu_diff(threshold=0.01)
vu_click(x=200, y=200)

5. Verify After Actions

Always verify automation succeeded:
vu_type(text="hello@example.com")
vu_describe(prompt="What text is now in the email field?")
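The type-then-verify pair can be wrapped into one step that reports success. This sketch assumes the vision answer is a string that quotes the field contents verbatim, which a model may not always do; `type_and_verify` is a hypothetical helper and the lambdas stand in for vu_type and vu_describe.

```python
from typing import Callable


def type_and_verify(text: str, type_text: Callable[..., None],
                    describe: Callable[..., str], prompt: str) -> bool:
    """Type text, then ask the vision tool about the field and check the answer."""
    type_text(text=text)
    answer = describe(prompt=prompt)
    return text.lower() in answer.lower()


typed = []
ok = type_and_verify(
    "hello@example.com",
    type_text=lambda text: typed.append(text),
    describe=lambda prompt: 'The email field contains "hello@example.com".',
    prompt="What text is now in the email field?",
)
```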

Troubleshooting

"Permission denied" or blank captures

→ Grant Screen Recording permission: System Settings > Privacy & Security > Screen Recording

Automation not working

→ Grant Accessibility permission: System Settings > Privacy & Security > Accessibility

Vision analysis failing

→ Check API keys in environment variables
→ Try a different provider: vu_describe(provider="openai")

Coordinates seem off

→ Retina display? Coordinates are in logical points, not pixels
→ Multi-monitor? Check vu_list_monitors() for display arrangement
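If a position was measured in image pixels on a Retina display, dividing by the display's scale factor gives the logical point to click. This is a sketch under the assumption that captures are returned at full pixel resolution and that the scale factor is the one reported by vu_list_monitors; `pixels_to_points` is a hypothetical helper.

```python
def pixels_to_points(px: float, py: float, scale: float) -> tuple[int, int]:
    """Convert pixel coordinates to logical points for a given display scale factor."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return round(px / scale), round(py / scale)


# A pixel position on a 2x display maps to half the value in points.
click_at = pixels_to_points(1000, 600, 2.0)
```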

Hotkey blocked

→ Some dangerous hotkeys (such as cmd+q) are blocked for safety
→ Unblock in config if absolutely necessary

Configuration

Environment variables:
OMNI_VU_ANTHROPIC_API_KEY=sk-ant-...    # For Claude vision
OMNI_VU_OPENAI_API_KEY=sk-...           # For GPT-4 vision
OMNI_VU_DEFAULT_PROVIDER=claude         # Default AI provider
OMNI_VU_SAFETY_LEVEL=medium             # low/medium/high
History stored at: ~/.omni.vu/captures/
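A script that wraps these tools may want to check the environment before calling the vision features. The sketch below reads the variables listed above from any mapping. The fallback values ("claude", "medium") simply mirror the example values and are assumptions, not documented defaults; `load_config` is a hypothetical helper.

```python
from typing import Mapping

VALID_SAFETY_LEVELS = {"low", "medium", "high"}


def load_config(env: Mapping[str, str]) -> dict:
    """Read omni.vu settings from an environment-like mapping."""
    provider = env.get("OMNI_VU_DEFAULT_PROVIDER", "claude")  # assumed fallback
    safety = env.get("OMNI_VU_SAFETY_LEVEL", "medium")        # assumed fallback
    if safety not in VALID_SAFETY_LEVELS:
        raise ValueError(f"unknown safety level: {safety!r}")
    return {
        "provider": provider,
        "safety_level": safety,
        "has_anthropic_key": bool(env.get("OMNI_VU_ANTHROPIC_API_KEY")),
        "has_openai_key": bool(env.get("OMNI_VU_OPENAI_API_KEY")),
    }


config = load_config({"OMNI_VU_DEFAULT_PROVIDER": "openai",
                      "OMNI_VU_OPENAI_API_KEY": "placeholder"})
```

In real use the mapping would be `os.environ`.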