# watch-video
Watch a video, screen recording, or screenshot and produce structured, actionable notes that a coding agent can act on.
## What This Skill Does
Given a video (Loom, YouTube, local file) or screenshot, this skill:
- Downloads the media (if URL) or reads the local file
- Extracts key frames and audio from videos
- Uses Gemini Flash to analyze visual content (UI state, text, user actions)
- Transcribes audio narration (if present)
- Synthesizes everything into structured notes with:
  - What's shown and what's happening
  - Exact text visible on screen (errors, URLs, labels, code)
  - What was said (audio transcript)
  - Key observations and actionable takeaways
  - Confidence level and clarifying questions
The notes are designed to give a coding agent enough context to take any action — fix a bug, build a feature, create a skill, write documentation, or implement what was demonstrated.
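The frame-extraction step boils down to sampling the clip evenly up to the frame cap. A minimal sketch of that sampling logic (a hypothetical helper for illustration, not eyeroll's actual internals):

```python
def frame_timestamps(duration_s: float, max_frames: int = 20) -> list[float]:
    """Evenly spaced sample times (in seconds) across a clip.

    Caps at max_frames, and at roughly one frame per second for short
    clips, mirroring the default --max-frames 20 behavior described above.
    """
    n = min(max_frames, max(1, int(duration_s)))
    step = duration_s / n
    # Sample the midpoint of each interval to avoid black lead-in/out frames.
    return [round(i * step + step / 2, 2) for i in range(n)]
```

For a 10-second clip this yields 10 timestamps (0.5 s, 1.5 s, …); a 10-minute clip is capped at 20 evenly spaced samples.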
## Setup
Requires Python 3.11+ and ffmpeg. Choose a backend:

**Gemini (default, API-based):**

```bash
pip install eyeroll
eyeroll init  # or: export GEMINI_API_KEY=your-key
```

**Ollama + Qwen3-VL (local, no API key):**

```bash
pip install eyeroll
ollama serve  # start Ollama
```

For URL downloads (Loom, YouTube), install yt-dlp:

```bash
brew install yt-dlp
```
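An agent deciding which backend to pass on the command line can check the environment first. A minimal sketch (hypothetical selection logic, assuming only the two backends above):

```python
import os
import shutil

def pick_backend() -> str:
    """Prefer Gemini when an API key is configured; otherwise try local Ollama."""
    if os.environ.get("GEMINI_API_KEY"):
        return "gemini"
    if shutil.which("ollama"):  # Ollama CLI present on PATH
        return "ollama"
    raise RuntimeError("No backend available: set GEMINI_API_KEY or install Ollama")
```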
## Commands
### Analyze a video or screenshot
```bash
eyeroll watch <source> [--context "..."] [--backend gemini|ollama] [--model MODEL] [--max-frames 20] [--output report.md] [--verbose]
```

- `source`: URL (Loom, YouTube, any yt-dlp supported site) or local file path (.mp4, .webm, .mov, .png, .jpg, .gif)
- `--context`: Text context from the person who shared the video (Slack message, issue description, PR reference). Significantly improves analysis quality.
- `--max-frames`: Maximum frames to extract from the video (default: 20)
- `--output`: Write notes to a file instead of stdout
- `--verbose`: Show progress details
## When To Use This Skill
- User shares a video or screenshot and wants the agent to understand it
- User pastes a Loom or YouTube link
- User says "watch this", "look at this recording", "look at this video"
- User asks to fix a bug based on a video or screenshot
- User asks to build something based on a demo video
- User asks to create a skill/plugin/subagent from a tutorial video
- User shares a screen recording with context about what to do with it
- User asks to generate notes, a report, or documentation from a video
## Example Interactions
User: "Watch this video and tell me what's wrong: https://loom.com/share/abc123"
Action: Run `eyeroll watch https://loom.com/share/abc123`

User: "Here's a recording of the bug. Checkout is broken after billing migration."
Action: Run `eyeroll watch <video_path> --context "checkout broken after billing migration merge"`

User: "Watch this demo and build something similar"
Action: Run `eyeroll watch <url>`, review the notes, then implement the demonstrated feature.

User: "Watch this tutorial and create a skill from it"
Action: Run `eyeroll watch <url> --context "create a skill based on this tutorial"`, then use the notes to generate a SKILL.md and implementation.

User: "Look at this screenshot, what's going on?"
Action: Run `eyeroll watch screenshot.png`

User: "Watch this and raise a PR"
Action: Run `eyeroll watch <url>`, review the notes, search the codebase for relevant files, implement the change, and create a PR.
## Rules
- Always check that GEMINI_API_KEY is set before running analysis with the Gemini backend.
- If ffmpeg is not on PATH, the bundled imageio-ffmpeg fallback is used.
- For videos under 20MB and 2 minutes, the full video is sent to Gemini directly (more accurate). Longer videos use frame-by-frame analysis.
- When context is available (--context), always include it — it dramatically improves quality, especially for silent recordings.
- Do NOT hallucinate text, error messages, or code. Only report what is actually visible/audible.
- If the video is ambiguous or unclear, output clarifying questions rather than guessing.
- The notes are a starting point — when the user asks to take action (fix, build, create), use the notes plus codebase context to do so.
- Warn the user if yt-dlp is not installed when they provide a URL.
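The size/duration rule above amounts to a simple gate. A sketch of that check (hypothetical helper mirroring the documented thresholds):

```python
def use_direct_upload(size_bytes: int, duration_s: float) -> bool:
    """Send the whole video to Gemini only when it is under 20 MB and 2 minutes.

    Larger or longer videos fall back to frame-by-frame analysis.
    """
    return size_bytes < 20 * 1024 * 1024 and duration_s < 120
```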
## Supported Input Types
| Input | Supported | Notes |
|---|---|---|
| Local video (.mp4, .webm, .mov, .avi, .mkv) | Yes | Direct analysis |
| Local image (.png, .jpg, .gif, .webp, .bmp) | Yes | Single-frame analysis |
| YouTube URL | Yes | Requires yt-dlp |
| Loom URL | Yes | Requires yt-dlp |
| Any yt-dlp supported URL | Yes | 1000+ sites supported |
| Direct video URL (.mp4 link) | Yes | Requires yt-dlp |
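The dispatch implied by this table can be sketched as follows (hypothetical helper; extensions taken from the table above):

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}

def classify_source(source: str) -> str:
    """Route an input to URL download, video analysis, or single-frame analysis."""
    if source.startswith(("http://", "https://")):
        return "url"  # handed off to yt-dlp
    ext = Path(source).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Unsupported input type: {source}")
```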
## Cost
Gemini 2.0 Flash pricing (approximate):
- Short video (<2min, direct upload): ~$0.01-0.05 per analysis
- Long video (frame-by-frame, 20 frames): ~$0.02-0.10 per analysis
- Audio transcription: ~$0.01 per minute
- Synthesis: ~$0.01 per report
Total: typically under $0.15 per video analysis.
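Those figures can be combined into a rough per-analysis estimator (hypothetical arithmetic using the approximate prices above; actual billing depends on Gemini's current rates):

```python
def estimate_cost_usd(duration_s: float, frames: int = 20, has_audio: bool = True) -> float:
    """Upper-bound cost estimate per analysis, in USD."""
    if duration_s < 120:
        video = 0.05                        # short clip: direct upload, up to ~$0.05
    else:
        video = min(0.10, 0.005 * frames)   # frame-by-frame, up to ~$0.10
    audio = 0.01 * (duration_s / 60) if has_audio else 0.0  # ~$0.01 per minute
    synthesis = 0.01                        # ~$0.01 per report
    return round(video + audio + synthesis, 3)
```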