omni.vu - Visual Understanding & Automation

Overview

omni.vu gives you eyes and hands on the user's macOS screen. Use it to:

- See what the user sees (screen capture)
- Understand UI state with AI vision
- Detect changes and wait for events
- Act with mouse and keyboard automation

When to Use This Skill

Proactively Use omni.vu When:

- Debugging UI issues: "The button isn't working" → capture and analyze
- Verifying changes: after modifying UI code, check that it rendered correctly
- Waiting for operations: build finishing, deployment completing, tests running
- Understanding context: the user describes something on screen that you can't see
- Automating repetitive tasks: clicking through UI flows, filling forms
- Documentation: capturing screenshots of features

Do NOT Use When:

- Reading/writing files (use Read/Write tools)
- Running terminal commands (use Bash)
- Making API calls (use appropriate tools)
- The user hasn't granted screen recording permission

Tool Reference

Capture Tools

vu - Full Screen Capture

vu(monitor=0, save_to_history=True)

Captures the entire screen and returns a base64-encoded image.

Use when: You need to see everything on screen.

vu_window - Window Capture

vu_window(window_id=None, title="VS Code", include_frame=False)

Captures a specific window by ID or title (partial match).

Use when: You only need one application's content.

Workflow:

1. Call vu_list_windows to see available windows
2. Find the window_id or use title matching
3. Call vu_window with that ID/title
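The lookup step in this workflow can be sketched in a few lines. This is only an illustration: `find_window` is a hypothetical helper, the sample data is made up to match the fields listed for vu_list_windows, and the case-insensitive comparison is an assumption (the tool's own "partial match" rules may differ).

```python
from typing import Optional


def find_window(windows: list, title: str) -> Optional[dict]:
    """Return the first window whose title contains `title`, else None.

    Matching is case-insensitive here, which is an assumption about how
    a partial title match should behave.
    """
    needle = title.lower()
    for window in windows:
        if needle in window.get("title", "").lower():
            return window
    return None


# Made-up sample shaped like the documented vu_list_windows result.
windows = [
    {"window_id": 101, "title": "README.md - VS Code", "owner": "Code",
     "x": 0, "y": 25, "width": 1280, "height": 775},
    {"window_id": 202, "title": "Inbox - Chrome", "owner": "Google Chrome",
     "x": 100, "y": 100, "width": 1024, "height": 768},
]

match = find_window(windows, "VS Code")
print(match["window_id"] if match else "no matching window")
```

Passing the resulting window_id to vu_window avoids ambiguity when several windows share part of a title.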

vu_region - Region Capture

vu_region(x=100, y=100, width=500, height=300)

Captures a specific rectangle of the screen.

Use when: You need a precise area (error message, specific component).

vu_list_windows - List Windows

vu_list_windows(filter_app="Chrome")

Lists all visible windows with metadata.

Returns: List of {window_id, title, owner, x, y, width, height}

vu_list_monitors - List Displays

vu_list_monitors()

Lists connected displays with resolution and scale factor.

Vision Tools

vu_describe - AI Vision Analysis

vu_describe(
    prompt="What errors are visible?",
    provider="claude",  # or openai, gemini, ollama
    max_tokens=1024,
    capture_first=True
)

Captures screen and analyzes with AI.

Best prompts:

"Describe what you see on screen"
"Are there any error messages visible?"
"What is the state of the build/test output?"
"Is the login form filled correctly?"
"What color is the status indicator?"

Providers:

Provider	Speed	Quality	Cost
claude	Medium	Excellent	$$
openai	Fast	Very Good	$$
gemini	Fast	Good	$
ollama	Slow	Varies	Free

Utility Tools

vu_diff - Change Detection

vu_diff(threshold=0.02, monitor=0)

Detects if screen changed since last capture.

Returns:

{changed: bool, diff_percentage: float}

Use for polling patterns:

import time

# Wait for the build to finish
while True:
    result = vu_diff(threshold=0.05)
    if result["changed"]:
        # Screen changed, so check what happened
        analysis = vu_describe(prompt="Did the build succeed or fail?")
        break
    # Pause before checking again to avoid a busy loop
    time.sleep(5)
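A polling loop like the one above usually also needs a timeout so it cannot run forever. The sketch below is one way to add that. `wait_for_change` is a hypothetical helper, not part of omni.vu; it takes the diff call as a parameter (`diff_fn`) and assumes only the documented return shape {changed, diff_percentage}.

```python
import time
from typing import Callable, Optional


def wait_for_change(
    diff_fn: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Call diff_fn until it reports a change or the timeout expires.

    Returns the diff result that reported the change, or None on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        result = diff_fn()
        if result["changed"]:
            return result
        if clock() >= deadline:
            return None
        sleep(interval_s)


# Demo with a fake diff function standing in for vu_diff (assumption):
# it reports "no change" twice, then a change.
fake_results = iter([
    {"changed": False, "diff_percentage": 0.0},
    {"changed": False, "diff_percentage": 0.0},
    {"changed": True, "diff_percentage": 0.12},
])
outcome = wait_for_change(lambda: next(fake_results), sleep=lambda s: None)
print(outcome)
```

In real use, `diff_fn` would wrap the vu_diff call with the chosen threshold, and a None result would be reported to the user as a timeout rather than a failure.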

vu_history - Capture History

vu_history(limit=10, capture_type="full_screen")

Gets recent capture metadata.

vu_last - Last Capture

vu_last()

Returns the most recent capture with image data.

Use when: You need to re-analyze without re-capturing.

vu_status - System Status

vu_status()

Returns system info: monitors, providers, safety settings.

Automation Tools

vu_click - Mouse Click

vu_click(x=500, y=300, button="left", count=1)

Clicks at screen coordinates.

Buttons: left, right, middle

Count: 1 (single), 2 (double), 3 (triple)

Safety: Coordinates validated against screen bounds.
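It can be useful to run the same kind of check before calling vu_click, so an out-of-range coordinate is caught early. The sketch below is an assumption about what such a check looks like, not the tool's actual validation: `is_on_screen` is a hypothetical helper, and it assumes a single display whose origin is the top-left corner at (0, 0).

```python
def is_on_screen(x: int, y: int, screen_width: int, screen_height: int) -> bool:
    """Return True if (x, y) lies inside a display of the given size.

    Assumes the origin is the top-left corner and that the right and
    bottom edges are exclusive.
    """
    return 0 <= x < screen_width and 0 <= y < screen_height


# Example with a hypothetical 1440x900 logical resolution.
print(is_on_screen(500, 300, 1440, 900))
print(is_on_screen(1440, 300, 1440, 900))
```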

vu_move - Move Cursor

vu_move(x=500, y=300, duration_ms=0)

Moves the cursor to a position. Use duration_ms for animated movement.

vu_drag - Drag Operation

vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500)

Drags from start to end position.

Safety: Maximum 2000px drag distance.
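To check a planned drag against this limit before issuing it, something like the following sketch could be used. The source does not say how the distance is measured; straight-line (Euclidean) distance is assumed here, and `drag_distance` and `is_drag_allowed` are hypothetical helper names.

```python
import math

MAX_DRAG_PX = 2000  # limit stated in the tool reference


def drag_distance(start_x: float, start_y: float, end_x: float, end_y: float) -> float:
    """Straight-line distance between the drag start and end points."""
    return math.hypot(end_x - start_x, end_y - start_y)


def is_drag_allowed(start_x: float, start_y: float, end_x: float, end_y: float) -> bool:
    """Return True if the drag stays within the assumed Euclidean limit."""
    return drag_distance(start_x, start_y, end_x, end_y) <= MAX_DRAG_PX


# The example drag from the reference above: (100, 100) to (500, 300).
print(round(drag_distance(100, 100, 500, 300), 1))
print(is_drag_allowed(100, 100, 500, 300))
```

A longer move can be split into several shorter drags, or replaced with vu_move plus clicks, if it exceeds the limit.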

vu_scroll - Scroll

vu_scroll(direction="down", amount=3, x=None, y=None)

Scrolls at current or specified position.

Directions: up, down, left, right

Amount: Lines to scroll (1-20)

vu_type - Type Text

vu_type(text="Hello World", delay_between_ms=0)

Types text character by character.

Note: Click on target field first!

vu_hotkey - Keyboard Shortcut

vu_hotkey(keys="cmd+s")

Executes keyboard shortcuts.

Format: modifier+key (e.g., cmd+c, ctrl+shift+s, cmd+opt+i)

Modifiers: cmd, ctrl, alt, opt, shift

Blocked hotkeys (for safety):

- cmd+q (quit)
- cmd+shift+q (logout)
- cmd+opt+esc (force quit)
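A pre-check against the blocked list can save a failed call. The sketch below parses the modifier+key format and compares it with the three blocked shortcuts listed above. It is an illustration only: `parse_hotkey` and `is_blocked` are hypothetical helpers, and treating alt as an alias of opt, ignoring case, and ignoring modifier order are all assumptions about how the tool normalizes input.

```python
BLOCKED = {
    frozenset({"cmd", "q"}),           # quit
    frozenset({"cmd", "shift", "q"}),  # logout
    frozenset({"cmd", "opt", "esc"}),  # force quit
}


def parse_hotkey(keys: str) -> frozenset:
    """Split 'cmd+shift+s' into a normalized set of key names.

    Note: a literal '+' key cannot be expressed with this simple split.
    """
    parts = [part.strip().lower() for part in keys.split("+")]
    if not all(parts):
        raise ValueError(f"malformed hotkey: {keys!r}")
    # Assumption: alt and opt name the same modifier (Option on macOS).
    return frozenset("opt" if part == "alt" else part for part in parts)


def is_blocked(keys: str) -> bool:
    """Return True if the hotkey matches one of the blocked shortcuts."""
    return parse_hotkey(keys) in BLOCKED


for combo in ("cmd+s", "cmd+q", "cmd+alt+esc"):
    print(combo, "blocked" if is_blocked(combo) else "allowed")
```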

Common Workflows

1. Debug UI Issue

User: "The submit button doesn't do anything when I click it"

1. vu_describe(prompt="Describe the form and submit button state")
2. Analyze: Is button disabled? Is there validation error?
3. If needed: vu_window(title="Chrome") for focused capture
4. Check console: vu_describe(prompt="Are there any errors in the developer console?")

2. Verify Build/Deploy

User: "Run the build and let me know when it's done"

1. Run: npm run build (via Bash)
2. Loop: vu_diff(threshold=0.05)
3. When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?")
4. Report result

3. Fill a Form

User: "Fill out the registration form with test data"

1. vu_describe(prompt="What form fields are visible?")
2. vu_click(x=field_x, y=field_y)  # Click first field
3. vu_type(text="testuser@example.com")
4. vu_hotkey(keys="tab")  # Move to next field
5. vu_type(text="Test User")
6. Continue for each field...
7. vu_click(x=submit_x, y=submit_y)  # Submit
8. vu_describe(prompt="Was the form submitted successfully?")
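The middle of this workflow (click, type, tab, type, submit) is mechanical enough to generate. The sketch below builds the step list as plain data so it can be reviewed before anything is executed. `plan_form_fill` is a hypothetical helper, the coordinates are placeholders that would come from a prior vu_describe call, and moving between fields with tab assumes the form's tab order matches the order of the values.

```python
def plan_form_fill(first_field: tuple, values: list, submit: tuple) -> list:
    """Return an ordered list of (tool_name, arguments) steps.

    first_field and submit are (x, y) coordinates; values are typed in
    order, with a tab keypress between consecutive fields.
    """
    steps = [("vu_click", {"x": first_field[0], "y": first_field[1]})]
    for index, value in enumerate(values):
        if index > 0:
            steps.append(("vu_hotkey", {"keys": "tab"}))
        steps.append(("vu_type", {"text": value}))
    steps.append(("vu_click", {"x": submit[0], "y": submit[1]}))
    return steps


# Placeholder coordinates for illustration only.
plan = plan_form_fill((400, 220), ["testuser@example.com", "Test User"], (400, 520))
for tool, args in plan:
    print(tool, args)
```

Keeping the plan as data makes it easy to insert a verification step (for example a vu_describe call) between actions.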

4. Navigate UI

User: "Open the settings panel in VS Code"

1. vu_list_windows(filter_app="Code")
2. vu_window(title="Code")  # Capture VS Code
3. vu_hotkey(keys="cmd+,")  # Open settings
4. vu_diff()  # Check that the screen changed after the shortcut
5. vu_describe(prompt="Are the VS Code settings now visible?")

5. Monitor for Changes

User: "Watch the dashboard and tell me when the status changes"

1. vu()  # Initial capture
2. Loop every 5 seconds:
   - vu_diff(threshold=0.02)
   - If changed: vu_describe(prompt="What changed on the dashboard?")
3. Report changes to user

Best Practices

1. Capture Before Acting

Always capture and analyze before clicking/typing:

❌ Bad: vu_click(x=500, y=300)  # Hope it's the right spot

✅ Good:
1. vu_describe(prompt="Where is the submit button?")
2. Parse coordinates from description
3. vu_click(x=parsed_x, y=parsed_y)
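Step 2 depends on how the vision model reports positions, which the source does not specify. The sketch below assumes the prompt asked for coordinates in the form "(x, y)" and extracts the first such pair; `parse_point` is a hypothetical helper, and a None result should be treated as "ask again" rather than guessed around.

```python
import re
from typing import Optional, Tuple

# Matches "(512, 640)" with optional whitespace around the numbers.
COORD_PATTERN = re.compile(r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")


def parse_point(description: str) -> Optional[Tuple[int, int]]:
    """Return the first (x, y) pair found in the text, or None."""
    match = COORD_PATTERN.search(description)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Made-up description text for illustration.
point = parse_point("The submit button is centered at (512, 640).")
print(point)
```

Asking for the format explicitly in the prompt, for example "Reply with the center of the submit button as (x, y)", makes the parse more reliable.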

2. Use Appropriate Capture Scope

Full screen → General context, multi-app workflows
Window → Single app focus, cleaner output
Region → Specific element, faster/smaller

3. Specific Vision Prompts

❌ Vague: "What do you see?"

✅ Specific:
- "Is there an error message visible? If so, what does it say?"
- "What is the status of the test runner? Passing, failing, or running?"
- "List all form fields visible and their current values"

4. Rate Limit Automation

Don't spam clicks. Wait between actions:

vu_click(x=100, y=100)
# Wait for UI response
vu_diff(threshold=0.01)
vu_click(x=200, y=200)

5. Verify After Actions

Always verify automation succeeded:

vu_type(text="hello@example.com")
vu_describe(prompt="What text is now in the email field?")

Troubleshooting

"Permission denied" or blank captures

→ Grant Screen Recording permission: System Settings > Privacy & Security > Screen Recording

Automation not working

→ Grant Accessibility permission: System Settings > Privacy & Security > Accessibility

Vision analysis failing

→ Check API keys in environment variables
→ Try a different provider: vu_describe(provider="openai")
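Switching providers by hand can be automated with a small fallback loop. The sketch below is an assumption-heavy illustration: `describe_with_fallback` is a hypothetical helper, the describe call is passed in as `describe_fn`, and it assumes a failed analysis surfaces as a raised exception, which the source does not confirm.

```python
from typing import Callable, Sequence, Tuple

PROVIDERS = ("claude", "openai", "gemini", "ollama")  # from the provider table


def describe_with_fallback(
    describe_fn: Callable[..., str],
    prompt: str,
    providers: Sequence[str] = PROVIDERS,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider, analysis) for the first success."""
    errors = {}
    for provider in providers:
        try:
            return provider, describe_fn(prompt=prompt, provider=provider)
        except Exception as exc:  # assumption: failures raise exceptions
            errors[provider] = str(exc)
    raise RuntimeError(f"all providers failed: {errors}")


# Fake describe function standing in for vu_describe (assumption):
# the first provider fails, the second succeeds.
def fake_describe(prompt: str, provider: str) -> str:
    if provider == "claude":
        raise ConnectionError("missing API key")
    return f"[{provider}] no errors visible"


print(describe_with_fallback(fake_describe, "Are there any error messages visible?"))
```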

Coordinates seem off

→ Retina display? Coordinates are in logical points, not pixels
→ Multi-monitor? Check vu_list_monitors() for display arrangement
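If a position was measured in image pixels (for example, read off a screenshot), it has to be divided by the display's scale factor before being used as a click coordinate. The sketch below shows that conversion. It assumes captured images are in physical pixels and that the scale factor comes from vu_list_monitors; `pixels_to_points` is a hypothetical helper.

```python
from typing import Tuple


def pixels_to_points(px: float, py: float, scale_factor: float) -> Tuple[int, int]:
    """Convert image pixel coordinates to logical points.

    On a 2x Retina display, pixel (1000, 600) corresponds to point (500, 300).
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return round(px / scale_factor), round(py / scale_factor)


print(pixels_to_points(1000, 600, 2.0))
```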

Hotkey blocked

→ Some dangerous hotkeys (e.g., cmd+q) are blocked for safety
→ Unblock in config if absolutely necessary

Configuration

Environment variables:

OMNI_VU_ANTHROPIC_API_KEY=sk-ant-...    # For Claude vision
OMNI_VU_OPENAI_API_KEY=sk-...           # For GPT-4 vision
OMNI_VU_DEFAULT_PROVIDER=claude          # Default AI provider
OMNI_VU_SAFETY_LEVEL=medium              # low/medium/high
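When vision analysis fails, a quick sanity check of these variables can narrow down the cause. The sketch below reads them and validates the values against the options named in this document. `load_config` is a hypothetical helper, and falling back to claude and medium when a variable is unset is an assumption that mirrors the example values, not documented behavior.

```python
import os
from typing import Mapping, Optional

VALID_PROVIDERS = {"claude", "openai", "gemini", "ollama"}
VALID_SAFETY_LEVELS = {"low", "medium", "high"}


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the omni.vu settings from an environment mapping and validate them."""
    env = os.environ if env is None else env
    provider = env.get("OMNI_VU_DEFAULT_PROVIDER", "claude")  # assumed default
    safety = env.get("OMNI_VU_SAFETY_LEVEL", "medium")        # assumed default
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if safety not in VALID_SAFETY_LEVELS:
        raise ValueError(f"unknown safety level: {safety!r}")
    return {
        "provider": provider,
        "safety_level": safety,
        # Report only whether keys are present, never their values.
        "has_anthropic_key": bool(env.get("OMNI_VU_ANTHROPIC_API_KEY")),
        "has_openai_key": bool(env.get("OMNI_VU_OPENAI_API_KEY")),
    }


# Demo with an explicit mapping instead of the real environment.
print(load_config({"OMNI_VU_DEFAULT_PROVIDER": "openai", "OMNI_VU_OPENAI_API_KEY": "sk-example"}))
```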

History stored at:

~/.omni.vu/captures/
