# watch-video
Watch a video, screen recording, or screenshot and produce structured, actionable notes that a coding agent can act on.
## What This Skill Does
Given a video (Loom, YouTube, local file) or screenshot, this skill:
- Downloads the media (if URL) or reads the local file
- Extracts key frames and audio from videos
- Uses Gemini Flash to analyze visual content (UI state, text, user actions)
- Transcribes audio narration (if present)
- Synthesizes everything into structured notes with:
  - What's shown and what's happening
  - Exact text visible on screen (errors, URLs, labels, code)
  - What was said (audio transcript)
  - Key observations and actionable takeaways
  - Confidence level and clarifying questions
The notes are designed to give a coding agent enough context to take any action — fix a bug, build a feature, create a skill, write documentation, or implement what was demonstrated.
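The frame-extraction step boils down to sampling the clip evenly up to the frame cap. A minimal sketch of that sampling logic (a hypothetical helper for illustration, not eyeroll's actual internals):

```python
def frame_timestamps(duration_s: float, max_frames: int = 20) -> list[float]:
    """Evenly spaced sample times (in seconds) across a clip.

    Caps at max_frames, and at roughly one frame per second for short
    clips, mirroring the default --max-frames 20 behavior described above.
    """
    n = min(max_frames, max(1, int(duration_s)))
    step = duration_s / n
    # Sample the midpoint of each interval to avoid black lead-in/out frames.
    return [round(i * step + step / 2, 2) for i in range(n)]
```

For a 10-second clip this yields 10 timestamps (0.5 s, 1.5 s, …); a 10-minute clip is capped at 20 evenly spaced samples.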
## Setup
Requires Python 3.11+ and ffmpeg. Choose a backend:

**Gemini (default, API-based):**

```bash
pip install eyeroll
eyeroll init  # or: export GEMINI_API_KEY=your-key
```

**Ollama + Qwen3-VL (local, no API key):**

```bash
pip install eyeroll
ollama serve  # start Ollama
```

For URL downloads (Loom, YouTube), install yt-dlp:

```bash
brew install yt-dlp
```
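An agent deciding which backend to pass on the command line can check the environment first. A minimal sketch (hypothetical selection logic, assuming only the two backends above):

```python
import os
import shutil

def pick_backend() -> str:
    """Prefer Gemini when an API key is configured; otherwise try local Ollama."""
    if os.environ.get("GEMINI_API_KEY"):
        return "gemini"
    if shutil.which("ollama"):  # Ollama CLI present on PATH
        return "ollama"
    raise RuntimeError("No backend available: set GEMINI_API_KEY or install Ollama")
```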
## Commands
### Analyze a video or screenshot
```bash
eyeroll watch <source> [--context "..."] [--backend gemini|ollama] [--model MODEL] [--max-frames 20] [--output report.md] [--verbose]
```

- `source`: URL (Loom, YouTube, any yt-dlp supported site) or local file path (.mp4, .webm, .mov, .png, .jpg, .gif)
- `--context`: Text context from the person who shared the video (Slack message, issue description, PR reference). Significantly improves analysis quality.
- `--max-frames`: Maximum frames to extract from the video (default: 20)
- `--output`: Write notes to a file instead of stdout
- `--verbose`: Show progress details
## When To Use This Skill
- User shares a video or screenshot and wants the agent to understand it
- User pastes a Loom or YouTube link
- User says "watch this", "look at this recording", "look at this video"
- User asks to fix a bug based on a video or screenshot
- User asks to build something based on a demo video
- User asks to create a skill/plugin/subagent from a tutorial video
- User shares a screen recording with context about what to do with it
- User asks to generate notes, a report, or documentation from a video
## Example Interactions
User: "Watch this video and tell me what's wrong: https://loom.com/share/abc123"
Action: Run `eyeroll watch https://loom.com/share/abc123`

User: "Here's a recording of the bug. Checkout is broken after billing migration."
Action: Run `eyeroll watch <video_path> --context "checkout broken after billing migration merge"`

User: "Watch this demo and build something similar"
Action: Run `eyeroll watch <url>`, review the notes, then implement the demonstrated feature.

User: "Watch this tutorial and create a skill from it"
Action: Run `eyeroll watch <url> --context "create a skill based on this tutorial"`, then use the notes to generate a SKILL.md and implementation.

User: "Look at this screenshot, what's going on?"
Action: Run `eyeroll watch screenshot.png`

User: "Watch this and raise a PR"
Action: Run `eyeroll watch <url>`, review the notes, search the codebase for relevant files, implement the change, and create a PR.
## Rules
- Always check that GEMINI_API_KEY is set before running analysis with the Gemini backend.
- If ffmpeg is not on PATH, the bundled imageio-ffmpeg fallback is used.
- For videos under 20MB and 2 minutes, the full video is sent to Gemini directly (more accurate). Longer videos use frame-by-frame analysis.
- When context is available (--context), always include it — it dramatically improves quality, especially for silent recordings.
- Do NOT hallucinate text, error messages, or code. Only report what is actually visible/audible.
- If the video is ambiguous or unclear, output clarifying questions rather than guessing.
- The notes are a starting point — when the user asks to take action (fix, build, create), use the notes plus codebase context to do so.
- Warn the user if yt-dlp is not installed when they provide a URL.
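The size/duration rule above amounts to a simple gate. A sketch of that check (hypothetical helper mirroring the documented thresholds):

```python
def use_direct_upload(size_bytes: int, duration_s: float) -> bool:
    """Send the whole video to Gemini only when it is under 20 MB and 2 minutes.

    Larger or longer videos fall back to frame-by-frame analysis.
    """
    return size_bytes < 20 * 1024 * 1024 and duration_s < 120
```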
## Supported Input Types
| Input | Supported | Notes |
|---|---|---|
| Local video (.mp4, .webm, .mov, .avi, .mkv) | Yes | Direct analysis |
| Local image (.png, .jpg, .gif, .webp, .bmp) | Yes | Single-frame analysis |
| YouTube URL | Yes | Requires yt-dlp |
| Loom URL | Yes | Requires yt-dlp |
| Any yt-dlp supported URL | Yes | 1000+ sites supported |
| Direct video URL (.mp4 link) | Yes | Requires yt-dlp |
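The dispatch implied by this table can be sketched as follows (hypothetical helper; extensions taken from the table above):

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}

def classify_source(source: str) -> str:
    """Route an input to URL download, video analysis, or single-frame analysis."""
    if source.startswith(("http://", "https://")):
        return "url"  # handed off to yt-dlp
    ext = Path(source).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported input type: {source}")
```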
## Cost
Gemini 2.0 Flash pricing (approximate):
- Short video (<2min, direct upload): ~$0.01-0.05 per analysis
- Long video (frame-by-frame, 20 frames): ~$0.02-0.10 per analysis
- Audio transcription: ~$0.01 per minute
- Synthesis: ~$0.01 per report
Total: typically under $0.15 per video analysis.
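Those figures can be combined into a rough per-analysis estimator (hypothetical arithmetic using the approximate prices above; actual billing depends on Gemini's current rates):

```python
def estimate_cost_usd(duration_s: float, frames: int = 20, has_audio: bool = True) -> float:
    """Upper-bound cost estimate per analysis, in USD."""
    if duration_s < 120:
        video = 0.05                        # short clip: direct upload, up to ~$0.05
    else:
        video = min(0.10, 0.005 * frames)   # frame-by-frame, up to ~$0.10
    audio = 0.01 * (duration_s / 60) if has_audio else 0.0  # ~$0.01 per minute
    synthesis = 0.01                        # ~$0.01 per report
    return round(video + audio + synthesis, 3)
```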