video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.


Prerequisites


`ffmpeg` + `ffprobe` (required): `brew install ffmpeg`

`openai-whisper` (optional, for transcription): `pip install openai-whisper`


Commands


```bash
# Scene detection + transcribe (default)
python3 skills/video-understand/scripts/understand_video.py video.mp4

# Keyframe extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

# Regular interval extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval

# Limit frames extracted
python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

# Use a larger Whisper model
python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

# Frames only, skip transcription
python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

# Quiet mode (JSON only, no progress)
python3 skills/video-understand/scripts/understand_video.py video.mp4 -q

# Output to file
python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json
```
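When the script is driven from another Python program rather than a shell, the invocations above can be wrapped in a small helper. The following is a minimal sketch, not part of the skill itself: `build_command` and `understand` are hypothetical names, and the sketch assumes that `-q` leaves only the result JSON on stdout, as the quiet-mode description suggests.

```python
import json
import subprocess

SCRIPT = "skills/video-understand/scripts/understand_video.py"


def build_command(video, mode=None, max_frames=None, whisper_model=None, transcribe=True):
    """Assemble an argv list using only the flags documented for the script."""
    cmd = ["python3", SCRIPT, video, "-q"]  # -q: assumed to keep stdout to JSON only
    if mode is not None:
        cmd += ["-m", mode]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if not transcribe:
        cmd.append("--no-transcribe")
    return cmd


def understand(video, **options):
    """Run the script and return its parsed JSON result (hypothetical helper)."""
    proc = subprocess.run(
        build_command(video, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return json.loads(proc.stdout)
```

For example, `understand("video.mp4", mode="interval", max_frames=10)` would run the interval mode with at most 10 frames and return the result as a dictionary.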

CLI Options


| Flag | Description |
|------|-------------|
| `video` | Input video file (positional, required) |
| `-m, --mode` | Extraction mode: `scene` (default), `keyframe`, `interval` |
| `--max-frames` | Maximum frames to keep (default: 20) |
| `--whisper-model` | Whisper model size: tiny, base, small, medium, large (default: base) |
| `--no-transcribe` | Skip audio transcription, extract frames only |
| `-o, --output` | Write result JSON to file instead of stdout |
| `-q, --quiet` | Suppress progress messages, output only JSON |
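To make the defaults in the table concrete, here is one way such an interface could be declared with `argparse`. This is a hypothetical reconstruction from the table alone; the actual script may define its options differently, and `build_parser` is an illustrative name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="understand_video.py")
    parser.add_argument("video", help="Input video file")
    parser.add_argument(
        "-m", "--mode",
        choices=["scene", "keyframe", "interval"],
        default="scene",
        help="Extraction mode",
    )
    parser.add_argument("--max-frames", type=int, default=20,
                        help="Maximum frames to keep")
    parser.add_argument(
        "--whisper-model",
        choices=["tiny", "base", "small", "medium", "large"],
        default="base",
        help="Whisper model size",
    )
    parser.add_argument("--no-transcribe", action="store_true",
                        help="Skip audio transcription")
    parser.add_argument("-o", "--output",
                        help="Write result JSON to this file instead of stdout")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress messages")
    return parser
```

Parsing `["video.mp4"]` with this sketch yields `mode="scene"`, `max_frames=20`, and `whisper_model="base"`, matching the defaults listed in the table.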


Extraction Modes


| Mode | How it works | Best for |
|------|--------------|----------|
| `scene` | Detects scene changes via ffmpeg `select='gt(scene,0.3)'` | Most videos, varied content |
| `keyframe` | Extracts I-frames (codec keyframes) | Encoded video with natural keyframe placement |
| `interval` | Evenly spaced frames based on duration and max-frames | Fixed sampling, predictable output |

If `scene` mode detects no scene changes, it automatically falls back to `interval` mode.
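The fallback and the interval sampling can be sketched as follows. The exact spacing formula and the way the script trims scene frames to `--max-frames` are not documented here, so this sketch assumes segment midpoints and simple truncation; all function names are hypothetical.

```python
def scene_filter(threshold=0.3):
    """Build the ffmpeg select expression quoted in the table above."""
    return f"select='gt(scene,{threshold})'"


def interval_timestamps(duration, max_frames):
    """Evenly spaced timestamps in seconds.

    Assumption: one frame at the midpoint of each of max_frames equal segments.
    """
    if duration <= 0 or max_frames <= 0:
        return []
    return [round(duration * (i + 0.5) / max_frames, 3) for i in range(max_frames)]


def pick_frames(scene_timestamps, duration, max_frames):
    """Return (mode_used, timestamps), falling back to interval sampling
    when scene detection found no scene changes."""
    if not scene_timestamps:
        return "interval", interval_timestamps(duration, max_frames)
    # Assumption: keep the first max_frames detections; the script may trim differently.
    return "scene", scene_timestamps[:max_frames]
```

With no detected scene changes in a 10-second clip and a limit of 5 frames, `pick_frames([], 10.0, 5)` returns the interval mode with timestamps at 1, 3, 5, 7, and 9 seconds.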


Output


The script outputs JSON to stdout (or to a file with `-o`). See `references/output-format.md` for the full schema.

```json
{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}
```

Use the Read tool on frame image paths to visually inspect extracted frames.
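Since frames and transcript segments both carry timestamps, a consumer can line them up to see what was being said when each frame was captured. The sketch below relies only on the fields shown in the sample output; the helper names are hypothetical, and treating a missing `transcript` key as an empty list (for example after `--no-transcribe`) is an assumption.

```python
def transcript_for_frame(frame, transcript):
    """Return the text of the segment whose [start, end) covers the frame, or None."""
    t = frame["timestamp"]
    for segment in transcript:
        if segment["start"] <= t < segment["end"]:
            return segment["text"]
    return None


def summarize(result):
    """Produce one line per frame: timestamp, image path, and any speech at that moment."""
    transcript = result.get("transcript", [])  # assumption: key may be absent
    lines = []
    for frame in result["frames"]:
        spoken = transcript_for_frame(frame, transcript) or "(no speech)"
        lines.append(f'{frame["timestamp_formatted"]} {frame["path"]} | {spoken}')
    return lines


# Sample trimmed from the output shown above.
sample = {
    "frames": [
        {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
    ],
    "transcript": [
        {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
    ],
}

for line in summarize(sample):
    print(line)
```

A result saved with `-o result.json` can be loaded with `json.load` and passed to `summarize` in the same way.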


References


`references/output-format.md` -- Full JSON output schema documentation
