video-analysis

Video Analysis

Analyze video files using either native model understanding or frame extraction + transcription.

How It Works

analyze_video(path, question)
      │
      ├─ file_size ≤ threshold (default 20MB)
      │     → Send video to a supports_video model (default Gemini 3.1 Flash Lite)
      │     → Model sees full video natively (best quality)
      │
      └─ file_size > threshold
            → ffmpeg extracts keyframes (scene detection for long videos)
            → Whisper transcribes audio track
            → Returns frame image paths + transcript text
            → Agent feeds these to the current chat model
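
The branch at the top of the diagram comes down to a single size comparison. A minimal sketch, assuming the threshold is measured in units of 1024 × 1024 bytes (the skill may use 10^6 instead) and using hypothetical helper names (`choose_mode`, `choose_mode_for_file`) that are not part of the skill's API:

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: the skill may define 1 MB as 10**6 bytes


def choose_mode(file_size_bytes: int, native_size_limit_mb: float = 20) -> str:
    """Return "native" when the file fits under the threshold, else "extraction"."""
    if file_size_bytes <= native_size_limit_mb * BYTES_PER_MB:
        return "native"
    return "extraction"


def choose_mode_for_file(path: str, native_size_limit_mb: float = 20) -> str:
    """Same decision, reading the file size from disk."""
    return choose_mode(os.path.getsize(path), native_size_limit_mb)
```

A file of exactly 20MB still goes to native mode because the comparison is `≤`, matching the diagram.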

Quick Start

⚠️ Invocation: do NOT use dotted imports. The directory name contains a hyphen (`video-analysis`), so `from skills.video-analysis.exports import ...` is a Python syntax error (`-` is parsed as minus). This is true for every hyphenated skill, not just this one. Use one of the two patterns below.

Pattern A — from workspace root (recommended for scripts):

```bash
cd /data/workspace/skills/video-analysis && \
  python3 -c "from exports import analyze_video; \
    import json; \
    print(json.dumps(analyze_video('output/videos/clip.mp4', \
      question='What happens in this video?'), ensure_ascii=False))"
```

Note: pass the video path workspace-relative (analyze.py resolves it against `WORKSPACE_DIR`), even though you cd into the skill dir.
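
A sketch of what that resolution might look like, assuming `WORKSPACE_DIR` is `/data/workspace` (inferred from the `cd` path in Pattern A) and using a hypothetical helper `resolve_workspace_path`; the actual logic in analyze.py may differ:

```python
from pathlib import Path

WORKSPACE_DIR = Path("/data/workspace")  # assumption, taken from the cd path in Pattern A


def resolve_workspace_path(relative_path: str, workspace: Path = WORKSPACE_DIR) -> Path:
    """Join a workspace-relative path onto the workspace root.

    The current working directory plays no part here, which is why cd-ing
    into the skill directory does not change how the path is interpreted.
    """
    candidate = Path(relative_path)
    if candidate.is_absolute():
        raise ValueError(f"expected a workspace-relative path, got {relative_path!r}")
    return workspace / candidate
```

Under these assumptions, `resolve_workspace_path("output/videos/clip.mp4")` points at `/data/workspace/output/videos/clip.mp4` regardless of where the script was started.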

Pattern B — inside a starchild-clawd script:

```python
from core.skill_tools import video_analysis
result = video_analysis.analyze_video("output/videos/clip.mp4",
                                      question="What happens in this video?")
```

❌ Do NOT `exec(open('skills/video-analysis/analyze.py').read())`: analyze.py uses `__file__` at import time, which is undefined under `exec`, so it crashes. Load it by file path with `importlib.util.spec_from_file_location` if you must avoid both patterns above.
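
Loading by file path can be sketched with the standard library alone. The helper name `load_module_from_file` and the module name passed to it are hypothetical. Whether analyze.py also needs its own directory on `sys.path` to import sibling modules is not confirmed here:

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_file(module_name: str, file_path: str):
    """Import a .py file by path, so hyphens in directory names do not matter."""
    path = Path(file_path).resolve()
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)  # sets __file__ on the module
    sys.modules[module_name] = module  # register before executing, as import does
    spec.loader.exec_module(module)    # runs the file with __file__ defined
    return module
```

For example, `load_module_from_file("video_analysis_analyze", "skills/video-analysis/analyze.py")` would return the module object, after which `analyze_video` could be looked up on it if analyze.py defines it.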

result keys (same for both patterns):

```python
# Analyze a video (auto-selects native or extraction mode)
result = analyze_video("output/videos/clip.mp4", question="What happens in this video?")

# result keys:
#   success: bool
#   mode: "native" | "extraction"
#
# If mode == "native":
#   analysis: str (model's text response)
#   model: str (which model was used)
#   tokens: {input, output, video, audio}
#
# If mode == "extraction":
#   frame_paths: list[str] (workspace-relative paths to keyframe JPEGs)
#   transcript: str | None (Whisper transcription text)
#   frame_count: int
#   duration_sec: float
```
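
Because the available keys depend on `mode`, a caller may want to check a result before indexing into it. A small sketch with a hypothetical helper `missing_result_keys`; the key sets are copied from the list above, and whether a failed call (`success` false) still reports a `mode` is not stated, so the helper tolerates its absence:

```python
COMMON_KEYS = {"success", "mode"}
MODE_KEYS = {
    "native": {"analysis", "model", "tokens"},
    "extraction": {"frame_paths", "transcript", "frame_count", "duration_sec"},
}


def missing_result_keys(result: dict) -> set:
    """Return the documented keys that are absent from a result dict."""
    present = set(result)
    missing = COMMON_KEYS - present
    mode = result.get("mode")
    if mode in MODE_KEYS:
        missing |= MODE_KEYS[mode] - present
    return missing
```

An empty set means every key documented for that mode is present; a `transcript` of `None` still counts as present.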

Using the Exports

```python
from core.skill_tools import video_analysis

# Full analysis (auto-selects mode)
result = video_analysis.analyze_video("output/videos/my_video.mp4", question="Describe this video")

# Check current config
config = video_analysis.get_config()

# Get video metadata without analyzing
info = video_analysis.get_video_info("output/videos/my_video.mp4")
# → {"duration": 45.2, "size": 12345678, "width": 1920, "height": 1080, "has_audio": true}
```

Native Mode (small videos)

For videos under the size threshold, the skill sends the full video to a model that supports native video input. The model sees every frame and hears the audio.

Default model: `google/gemini-3.1-flash-lite`, the best price/quality for video.

Model benchmark (6MB clip, vs `gemini-3.1-pro-preview` baseline):

| Model | Tier | Cost | Time | Accuracy | Notes |
|---|---|---|---|---|---|
| google/gemini-3.1-flash-lite | budget | ~$0.0014 | 8.1s | ~88% | ⭐ Default: cheapest + fastest |
| google/gemini-3.5-flash | standard | ~$0.0152 | 11.8s | ~85% | More detail, higher cost |
| qwen/qwen3.6-plus | budget | ~$0.0058 | 44.2s | ~95% | Accurate but slow |
| qwen/qwen3.6-flash | budget | ~$0.0027 | 16.6s | ~80% | Misreads subjects sometimes |
| google/gemini-3.1-pro-preview | standard | ~$0.0199 | 19.7s | 100% | Baseline (best, most expensive) |

flash-lite identifies the full scene, action sequence, and transitions correctly at ~14x lower cost than the Pro baseline. For maximum accuracy (exact character names, fine detail), switch `default_model` to `gemini-3.1-pro-preview` or `gemini-3.5-flash` in `config/video-analysis.yaml`.
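
The ~14x figure follows directly from the cost column. A short check, using only the per-run costs from the benchmark table (the function name `savings_vs_baseline` is illustrative):

```python
# Approximate cost per run for the 6MB clip, copied from the benchmark table.
COST_USD = {
    "google/gemini-3.1-flash-lite": 0.0014,
    "google/gemini-3.5-flash": 0.0152,
    "qwen/qwen3.6-plus": 0.0058,
    "qwen/qwen3.6-flash": 0.0027,
    "google/gemini-3.1-pro-preview": 0.0199,
}
BASELINE = "google/gemini-3.1-pro-preview"


def savings_vs_baseline(costs: dict, baseline: str = BASELINE) -> dict:
    """Map each model to how many times cheaper it is than the baseline."""
    base = costs[baseline]
    return {model: base / cost for model, cost in costs.items()}


ratios = savings_vs_baseline(COST_USD)
for model, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{model:32s} {ratio:5.1f}x cheaper than baseline")
```

For flash-lite this gives 0.0199 / 0.0014 ≈ 14.2. The costs themselves are approximate, so the ratios are rough too.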

Extraction Mode (large videos)

For videos over the size threshold, the skill extracts keyframes and transcribes audio:

- Short videos (≤60s): One frame every N seconds (default: 2s)
- Long videos (>60s): Scene-change detection picks visually distinct frames
- Audio: Extracted and sent to Whisper for transcription
- Max frames: Capped at 30 (configurable) to control cost

The agent receives frame image paths and transcript text, then feeds them to the current chat model as image attachments + context text.
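
The two sampling strategies can be sketched as follows. This illustrates the idea and is not the skill's actual code. Both function names are hypothetical, the even-subsampling cap is an assumed strategy, and the ffmpeg command is one common way to select scene-change frames (it is only built here, not run):

```python
import math


def short_video_timestamps(duration_sec, interval_sec=2.0, max_frames=30):
    """Timestamps (seconds) for fixed-interval sampling, capped at max_frames."""
    if duration_sec <= 0 or interval_sec <= 0 or max_frames <= 0:
        return []
    count = math.ceil(duration_sec / interval_sec)
    timestamps = [i * interval_sec for i in range(count)]
    if count <= max_frames:
        return timestamps
    # Too many candidates: keep an evenly spaced subset (assumed capping strategy).
    step = count / max_frames
    return [timestamps[int(i * step)] for i in range(max_frames)]


def scene_detect_command(video_path, out_pattern, scene_threshold=0.3, max_frames=30):
    """Build an ffmpeg argument list that keeps frames whose scene score exceeds the threshold."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{scene_threshold})'",
        "-vsync", "vfr",               # newer ffmpeg builds prefer -fps_mode vfr
        "-frames:v", str(max_frames),  # stops after the first max_frames matches
        out_pattern,
    ]
```

With the defaults, a 60-second clip sampled every 2 seconds yields exactly 30 frames, which may be why the two defaults are paired. Lowering `scene_threshold` makes the `select` filter accept smaller visual changes, so more frames pass.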

Configuration

Edit `config/video-analysis.yaml` (in the workspace) to customize. This file is created automatically on first use, only needs the keys you want to override, and survives skill updates.

Do NOT edit `skills/video-analysis/config.yaml`: that's the factory default and is overwritten on every skill auto-update. The user file overlays it.

Both the standalone skill and the chat "send a video" flow read this same config, so one edit changes the model everywhere. Available keys:

```yaml
# Model for native video understanding
default_model: google/gemini-3.1-flash-lite

# Size threshold: native (≤) vs extraction (>)
# Set to 0 → always extraction. Set to 100 → always native.
native_size_limit_mb: 20

# Frame extraction settings
extraction:
  max_frames: 30                # Max keyframes to extract
  short_video_interval_sec: 2   # Frame interval for ≤60s videos
  scene_threshold: 0.3          # Scene detection sensitivity (0.0-1.0)
  transcribe_audio: true        # Whether to Whisper-transcribe audio
```
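
Since the user file only needs the keys being overridden, the overlay presumably merges nested sections rather than replacing them wholesale. A sketch of that behaviour on plain dicts, assuming a key-by-key deep merge (the skill's real merge logic is not shown in this document, and `overlay` is a hypothetical name):

```python
import copy

# Factory defaults, mirroring the keys listed above.
FACTORY_DEFAULTS = {
    "default_model": "google/gemini-3.1-flash-lite",
    "native_size_limit_mb": 20,
    "extraction": {
        "max_frames": 30,
        "short_video_interval_sec": 2,
        "scene_threshold": 0.3,
        "transcribe_audio": True,
    },
}


def overlay(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, merging nested dicts key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A user file that overrides a single nested key.
user_config = {"extraction": {"scene_threshold": 0.15}}
effective = overlay(FACTORY_DEFAULTS, user_config)
```

Under a deep merge, `effective["extraction"]` keeps `max_frames: 30` while picking up the new `scene_threshold`. A shallow merge would drop the other extraction keys, so it is worth confirming with `get_config()` after editing.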

Available Video Models

| Model | Alias | Tier | Notes |
|---|---|---|---|
| google/gemini-3.1-flash-lite | flash31 | budget | ⭐ Default, best price/quality; cheapest option |
| google/gemini-3.5-flash | gemini35 | standard | More detail, higher cost |
| google/gemini-3.1-pro-preview | gemini | standard | Highest quality |
| qwen/qwen3.6-flash | qwenf | budget | Good alternative |
| qwen/qwen3.6-plus | qwen | budget | |
| minimax/minimax-m3 | mm3 | standard | |
| meta-llama/llama-4-maverick | maverick | standard | |
| meta-llama/llama-4-scout | scout | budget | |
| xiaomi/mimo-v2.5 | mimo | standard | |
| z-ai/glm-5v-turbo | glm5v | standard | |
| minimax/minimax-m2.7 | mm27 | budget | Audio-only, no image |

Agent Behavior

When the user provides a video file (via upload or file path) and the current chat model does NOT support video:

1. Call `analyze_video(path, question)`.
2. If result mode is `"native"` → return `result["analysis"]` directly.
3. If result mode is `"extraction"` → use `result["frame_paths"]` as image references and `result["transcript"]` as context, then ask the current model to analyze based on the frames + transcript.
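
The steps above can be sketched as a single dispatch function. The function name `build_agent_action` and the shape of the returned dict are hypothetical, since how the agent actually attaches images to a chat turn is not described here:

```python
def build_agent_action(result: dict, question: str) -> dict:
    """Turn an analyze_video result into the agent's next step.

    Returns {"action": "reply", ...} when the text can be returned directly,
    or {"action": "ask_model", ...} when the chat model must look at frames.
    """
    if not result.get("success"):
        return {"action": "reply", "text": "Video analysis failed."}

    mode = result.get("mode")
    if mode == "native":
        # The video model already answered; pass its text through.
        return {"action": "reply", "text": result["analysis"]}

    if mode == "extraction":
        # Transcript may be None (e.g. no audio track), so fall back to a note.
        transcript = result.get("transcript") or "(no transcript available)"
        prompt = (
            f"{question}\n\n"
            f"The attached images are {result['frame_count']} keyframes from a "
            f"{result['duration_sec']:.1f}s video.\n"
            f"Audio transcript:\n{transcript}"
        )
        return {
            "action": "ask_model",
            "images": list(result["frame_paths"]),
            "prompt": prompt,
        }

    raise ValueError(f"unexpected mode: {mode!r}")
```

Keeping the branch in one place means the rest of the agent only has to handle two action types, whichever mode the skill picked.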

When the current model DOES support video, the backend handles it natively via Phase 1 (base64 content block injection) — no need for this skill.

Troubleshooting

| Problem | Fix |
|---|---|
| "File not found" | Check path is workspace-relative (e.g. `output/videos/x.mp4`) |
| Native mode returns error | Check `default_model` in `config/video-analysis.yaml` is valid |
| No audio transcription | Video may have no audio track; check `has_audio` in the `get_video_info` output |
| Too few frames extracted | Lower `scene_threshold` in `config/video-analysis.yaml` (e.g. 0.15) |
| Too many frames / high cost | Reduce `max_frames` or raise `scene_threshold` |
