video-analysis

Video Analysis

Analyze video files using either native model understanding or frame extraction + transcription.

How It Works

analyze_video(path, question)
      ├─ file_size ≤ threshold (default 20MB)
      │     → Send video to a supports_video model (default Gemini 3.1 Flash Lite)
      │     → Model sees full video natively (best quality)
      └─ file_size > threshold
            → ffmpeg extracts keyframes (scene detection for long videos)
            → Whisper transcribes audio track
            → Returns frame image paths + transcript text
            → Agent feeds these to the current chat model
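
The size-based branch in the diagram can be sketched as follows. This is a minimal illustration, not the skill's actual code: `select_mode` and `select_mode_for_file` are hypothetical helpers, and treating 1 MB as 1024 × 1024 bytes is an assumption (the skill may count megabytes differently).

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def select_mode(file_size_bytes: int, native_size_limit_mb: float = 20) -> str:
    """Return "native" when the file is at or under the threshold, else "extraction"."""
    if file_size_bytes <= native_size_limit_mb * BYTES_PER_MB:
        return "native"
    return "extraction"


def select_mode_for_file(path: str, native_size_limit_mb: float = 20) -> str:
    """Same decision, reading the file size from disk."""
    return select_mode(os.path.getsize(path), native_size_limit_mb)
```

With the default 20MB limit, a 6MB clip would take the native branch and a 25MB clip the extraction branch.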

Quick Start

⚠️ Invocation: do NOT use dotted imports. The directory name contains a hyphen (`video-analysis`), so `from skills.video-analysis.exports import ...` is a Python syntax error (`-` is parsed as minus). This is true for every hyphenated skill, not just this one. Use one of the two patterns below.

Pattern A: from the shell, inside the skill directory (recommended for scripts):

```bash
cd /data/workspace/skills/video-analysis && \
  python3 -c "from exports import analyze_video; \
    import json; \
    print(json.dumps(analyze_video('output/videos/clip.mp4', \
      question='What happens in this video?'), ensure_ascii=False))"
```

Note: pass the video path workspace-relative (`analyze.py` resolves it against `WORKSPACE_DIR`), even though you `cd` into the skill dir.

Pattern B: inside a starchild-clawd script:

```python
from core.skill_tools import video_analysis
result = video_analysis.analyze_video("output/videos/clip.mp4",
                                      question="What happens in this video?")
```

Do NOT `exec(open('skills/video-analysis/analyze.py').read())`: `analyze.py` uses `__file__` at import time, which is undefined under `exec`, so it crashes. Load it by file path with `importlib.util.spec_from_file_location` if you must avoid both patterns above.
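
As a rough sketch of that fallback, the standard-library recipe for loading a module from a file path looks like this. `load_module_from_path` and the module name used in the usage note are illustrative names, not part of the skill, and whether `analyze.py` imports cleanly this way (for example if it relies on sibling modules being on `sys.path`) is not confirmed here.

```python
import importlib.util
import sys


def load_module_from_path(module_name: str, file_path: str):
    """Import a Python file by path; __file__ is set, unlike under exec()."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so imports of module_name during load resolve.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

A call might look like `analyze = load_module_from_path("video_analysis_analyze", "skills/video-analysis/analyze.py")`. The module name is arbitrary and only needs to avoid the hyphen.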

```python
# Analyze a video: auto-selects native or extraction mode
# (analyze_video imported as in Pattern A)
result = analyze_video("output/videos/clip.mp4", question="What happens in this video?")

# result keys (same for both patterns):
#   success: bool
#   mode: "native" | "extraction"
#
#   If mode == "native":
#     analysis: str (model's text response)
#     model: str (which model was used)
#     tokens: {input, output, video, audio}
#
#   If mode == "extraction":
#     frame_paths: list[str] (workspace-relative paths to keyframe JPEGs)
#     transcript: str | None (Whisper transcription text)
#     frame_count: int
#     duration_sec: float
```
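
For callers that want type hints, the documented result keys could be written down as `TypedDict` classes. This is a sketch only: the class names and the `missing_keys` helper are hypothetical, the integer type of the token counts is an assumption, and a failed call (`success` false) may not carry all of these keys.

```python
from typing import List, Literal, Optional, TypedDict


class TokenUsage(TypedDict):
    input: int   # assumption: token counts are integers
    output: int
    video: int
    audio: int


class NativeResult(TypedDict):
    success: bool
    mode: Literal["native"]
    analysis: str
    model: str
    tokens: TokenUsage


class ExtractionResult(TypedDict):
    success: bool
    mode: Literal["extraction"]
    frame_paths: List[str]
    transcript: Optional[str]
    frame_count: int
    duration_sec: float


SCHEMAS = {"native": NativeResult, "extraction": ExtractionResult}


def missing_keys(result: dict) -> List[str]:
    """List documented keys that are absent from a result of the given mode."""
    mode = result.get("mode")
    if mode not in SCHEMAS:
        raise ValueError(f"Unexpected mode: {mode!r}")
    return sorted(SCHEMAS[mode].__required_keys__ - set(result))
```

An empty list from `missing_keys(result)` means every documented key for that mode is present.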

Using the Exports

```python
from core.skill_tools import video_analysis

# Full analysis (auto-selects mode)
result = video_analysis.analyze_video("output/videos/my_video.mp4", question="Describe this video")

# Check current config
config = video_analysis.get_config()

# Get video metadata without analyzing
info = video_analysis.get_video_info("output/videos/my_video.mp4")
# → {"duration": 45.2, "size": 12345678, "width": 1920, "height": 1080, "has_audio": true}
```

Native Mode (small videos)

For videos at or under the size threshold, the skill sends the full video to a model that supports native video input. The model sees every frame and hears the audio.

Default model: `google/gemini-3.1-flash-lite`, the best price/quality for video.

Model benchmark (6MB clip, vs `gemini-3.1-pro-preview` baseline):

| Model | Tier | Cost | Time | Accuracy | Notes |
| --- | --- | --- | --- | --- | --- |
| google/gemini-3.1-flash-lite | budget | ~$0.0014 | 8.1s | ~88% | ⭐ Default: cheapest + fastest |
| google/gemini-3.5-flash | std | ~$0.0152 | 11.8s | ~85% | More detail, higher cost |
| qwen/qwen3.6-plus | budget | ~$0.0058 | 44.2s | ~95% | Accurate but slow |
| qwen/qwen3.6-flash | budget | ~$0.0027 | 16.6s | ~80% | Misreads subjects sometimes |
| google/gemini-3.1-pro-preview | std | ~$0.0199 | 19.7s | 100% | Baseline (best, most expensive) |

flash-lite identifies the full scene, action sequence, and transitions correctly at ~14x lower cost than the Pro baseline. For maximum accuracy (exact character names, fine detail), switch `default_model` to `gemini-3.1-pro-preview` or `gemini-3.5-flash` in `config/video-analysis.yaml`.
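
The "~14x" figure follows directly from the benchmark costs. The snippet below reproduces that arithmetic using the numbers from the benchmark; `cost_ratio_vs_baseline` is just an illustrative helper name.

```python
# Cost per run (USD) and wall-clock time (seconds), copied from the benchmark.
BENCHMARK = {
    "google/gemini-3.1-flash-lite": (0.0014, 8.1),
    "google/gemini-3.5-flash": (0.0152, 11.8),
    "qwen/qwen3.6-plus": (0.0058, 44.2),
    "qwen/qwen3.6-flash": (0.0027, 16.6),
    "google/gemini-3.1-pro-preview": (0.0199, 19.7),
}
BASELINE = "google/gemini-3.1-pro-preview"


def cost_ratio_vs_baseline(model: str) -> float:
    """How many times cheaper `model` is than the Pro baseline."""
    baseline_cost = BENCHMARK[BASELINE][0]
    return baseline_cost / BENCHMARK[model][0]


if __name__ == "__main__":
    for name in BENCHMARK:
        print(f"{name}: {cost_ratio_vs_baseline(name):.1f}x cheaper than baseline")
```

For flash-lite this gives 0.0199 / 0.0014 ≈ 14.2.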

Extraction Mode (large videos)

For videos over the size threshold, the skill extracts keyframes and transcribes audio:
  • Short videos (≤60s): One frame every N seconds (default: 2s)
  • Long videos (>60s): Scene-change detection picks visually distinct frames
  • Audio: Extracted and sent to Whisper for transcription
  • Max frames: Capped at 30 (configurable) to control cost
The agent receives frame image paths and transcript text, then feeds them to the current chat model as image attachments + context text.
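
The two sampling strategies above could look roughly like this. Both functions are hypothetical sketches, not the skill's implementation: how the skill applies the frame cap (truncating versus spreading frames evenly) and which ffmpeg flags it passes are assumptions. The `select='gt(scene,T)'` filter itself is standard ffmpeg.

```python
import math
from typing import List


def short_video_timestamps(duration_sec: float, interval_sec: float = 2.0,
                           max_frames: int = 30) -> List[float]:
    """Fixed-interval timestamps in seconds, truncated at max_frames (assumed policy)."""
    if duration_sec <= 0 or interval_sec <= 0 or max_frames <= 0:
        return []
    count = min(max_frames, math.ceil(duration_sec / interval_sec))
    return [round(i * interval_sec, 3) for i in range(count)]


def scene_detect_command(video_path: str, out_pattern: str,
                         scene_threshold: float = 0.3,
                         max_frames: int = 30) -> List[str]:
    """Build an ffmpeg argv that keeps frames whose scene score exceeds the threshold."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{scene_threshold})'",
        "-fps_mode", "vfr",  # ffmpeg >= 5.1; older builds use "-vsync vfr"
        "-frames:v", str(max_frames),
        out_pattern,
    ]
```

The argv list is meant for `subprocess.run(cmd, check=True)` without a shell, so the single quotes inside the filter string are read by ffmpeg's filter parser rather than by the shell.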

Configuration

Edit `config/video-analysis.yaml` (in the workspace) to customize. This file is created automatically on first use, only needs the keys you want to override, and survives skill updates.

Do NOT edit `skills/video-analysis/config.yaml`: that is the factory default and is overwritten on every skill auto-update. The user file overlays it.

Both the standalone skill and the chat "send a video" flow read this same config, so one edit changes the model everywhere. Available keys:

```yaml
# Model for native video understanding
default_model: google/gemini-3.1-flash-lite

# Size threshold: native (≤) vs extraction (>)
# Set to 0 → always extraction. Set to 100 → always native.
native_size_limit_mb: 20

# Frame extraction settings
extraction:
  max_frames: 30                # Max keyframes to extract
  short_video_interval_sec: 2   # Frame interval for ≤60s videos
  scene_threshold: 0.3          # Scene detection sensitivity (0.0-1.0)
  transcribe_audio: true        # Whether to Whisper-transcribe audio
```
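
One way to picture how a partial user file can overlay the factory defaults is a recursive dictionary merge. The sketch below works on plain dicts (as a YAML loader would produce) and is an assumption about the merge semantics, not the skill's actual loader. `overlay` is a hypothetical name.

```python
import copy
from typing import Any, Dict

FACTORY_DEFAULTS: Dict[str, Any] = {
    "default_model": "google/gemini-3.1-flash-lite",
    "native_size_limit_mb": 20,
    "extraction": {
        "max_frames": 30,
        "short_video_interval_sec": 2,
        "scene_threshold": 0.3,
        "transcribe_audio": True,
    },
}


def overlay(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults with overrides applied; nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A user file that only lowers the scene threshold keeps every other default.
user_config = {"extraction": {"scene_threshold": 0.15}}
effective = overlay(FACTORY_DEFAULTS, user_config)
```

Under this assumed behavior, `effective["extraction"]["max_frames"]` stays at 30 while `scene_threshold` becomes 0.15.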

Available Video Models

| Model | Alias | Tier | Notes |
| --- | --- | --- | --- |
| google/gemini-3.1-flash-lite | flash31 | budget | ⭐ Default, best price/quality, cheapest option |
| google/gemini-3.5-flash | gemini35 | standard | More detail, higher cost |
| google/gemini-3.1-pro-preview | gemini | standard | Highest quality |
| qwen/qwen3.6-flash | qwenf | budget | Good alternative |
| qwen/qwen3.6-plus | qwen | budget | |
| minimax/minimax-m3 | mm3 | standard | |
| meta-llama/llama-4-maverick | maverick | standard | |
| meta-llama/llama-4-scout | scout | budget | |
| xiaomi/mimo-v2.5 | mimo | standard | |
| z-ai/glm-5v-turbo | glm5v | standard | |
| minimax/minimax-m2.7 | mm27 | budget | Audio-only, no image |

Agent Behavior

When the user provides a video file (via upload or file path) and the current chat model does NOT support video:
  1. Call `analyze_video(path, question)`.
  2. If result mode is `"native"` → return `result["analysis"]` directly.
  3. If result mode is `"extraction"` → use `result["frame_paths"]` as image references and `result["transcript"]` as context, then ask the current model to analyze based on the frames + transcript.

When the current model DOES support video, the backend handles it natively via Phase 1 (base64 content block injection), so this skill is not needed.
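
The three steps above can be sketched as one dispatch function. Everything except the documented result keys is an assumption: `answer_video_question` is a hypothetical wrapper, and `ask_model` stands in for however the agent sends text plus image paths to the current chat model.

```python
from typing import Callable, List


def answer_video_question(
    path: str,
    question: str,
    analyze_video: Callable[..., dict],
    ask_model: Callable[[str, List[str]], str],
) -> str:
    """Route a video question through native or extraction mode."""
    result = analyze_video(path, question=question)
    if not result.get("success"):
        raise RuntimeError(f"analyze_video failed for {path!r}")

    if result["mode"] == "native":
        # The video model already answered the question.
        return result["analysis"]

    # Extraction mode: hand frames and transcript to the current chat model.
    transcript = result.get("transcript") or "(no transcript available)"
    prompt = f"{question}\n\nAudio transcript:\n{transcript}"
    return ask_model(prompt, result["frame_paths"])
```

Passing `analyze_video` and `ask_model` in as arguments keeps the sketch independent of how either one is imported.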

Troubleshooting

| Problem | Fix |
| --- | --- |
| "File not found" | Check path is workspace-relative (e.g. `output/videos/x.mp4`) |
| Native mode returns error | Check `default_model` in `config/video-analysis.yaml` is valid |
| No audio transcription | Video may have no audio track; check `has_audio` in the `get_video_info` result |
| Too few frames extracted | Lower `scene_threshold` in `config/video-analysis.yaml` (e.g. 0.15) |
| Too many frames / high cost | Reduce `max_frames` or raise `scene_threshold` |
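
For the "File not found" case, a small pre-flight check can confirm that a path is workspace-relative before calling the skill. This is a hypothetical helper: the default workspace root `/data/workspace` is inferred from the Quick Start command and may differ in other deployments.

```python
from pathlib import Path


def check_workspace_path(rel_path: str, workspace_dir: str = "/data/workspace") -> Path:
    """Resolve a workspace-relative video path, raising if it is unusable."""
    candidate = Path(rel_path)
    if candidate.is_absolute():
        raise ValueError(f"Expected a workspace-relative path, got {rel_path!r}")
    root = Path(workspace_dir).resolve()
    resolved = (root / candidate).resolve()
    # relative_to raises ValueError if the path escapes the workspace via "..".
    resolved.relative_to(root)
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file under the workspace: {resolved}")
    return resolved
```

Calling `check_workspace_path("output/videos/x.mp4")` before `analyze_video` separates path mistakes from model or config errors.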