video-analysis
Video Analysis
Analyze video files using either native model understanding or frame extraction + transcription.
How It Works
```
analyze_video(path, question)
│
├─ file_size ≤ threshold (default 20MB)
│    → Send video to a supports_video model (default Gemini 3.1 Flash Lite)
│    → Model sees full video natively (best quality)
│
└─ file_size > threshold
     → ffmpeg extracts keyframes (scene detection for long videos)
     → Whisper transcribes audio track
     → Returns frame image paths + transcript text
     → Agent feeds these to the current chat model
```
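The size check at the top of the flow can be sketched in a few lines. This is only an illustration of the documented branching, not the skill's actual code: `choose_mode` and `DEFAULT_LIMIT_MB` are hypothetical names, and the megabyte definition is an assumption.

```python
import os
import tempfile

DEFAULT_LIMIT_MB = 20  # mirrors the documented default threshold


def choose_mode(path: str, limit_mb: float = DEFAULT_LIMIT_MB) -> str:
    """Return "native" when the file fits under the limit, else "extraction"."""
    size_bytes = os.path.getsize(path)
    # Assumption: 1 MB = 1024 * 1024 bytes; the skill may use 10**6 instead.
    return "native" if size_bytes <= limit_mb * 1024 * 1024 else "extraction"


# Usage: a tiny placeholder file stands in for a real clip.
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
small_clip_mode = choose_mode(tmp.name)                # 1 KB is under 20 MB
forced_extraction = choose_mode(tmp.name, limit_mb=0)  # a limit of 0 forces extraction
os.unlink(tmp.name)
```

With a limit of 0 every non-empty file exceeds the threshold, which matches the documented "always extraction" behaviour for that setting.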
Quick Start
⚠️ Invocation: do NOT use dotted imports. The directory name contains a hyphen (`video-analysis`), so `from skills.video-analysis.exports import ...` is a Python syntax error (`-` is parsed as minus). This is true for every hyphenated skill, not just this one. Use one of the two patterns below.

Pattern A, from workspace root (recommended for scripts):

```bash
cd /data/workspace/skills/video-analysis && \
python3 -c "from exports import analyze_video; \
import json; \
print(json.dumps(analyze_video('output/videos/clip.mp4', \
question='What happens in this video?'), ensure_ascii=False))"
```

Note: pass the video path workspace-relative (analyze.py resolves it against `WORKSPACE_DIR`), even though you cd into the skill dir.

Pattern B, inside a starchild-clawd script:

```python
from core.skill_tools import video_analysis

result = video_analysis.analyze_video("output/videos/clip.mp4",
                                      question="What happens in this video?")
```

❌ Do NOT `exec(open('skills/video-analysis/analyze.py').read())`: analyze.py uses `__file__` at import time, which is undefined under `exec`, so it crashes. Load it by file path with `importlib.util.spec_from_file_location` if you must avoid both patterns above.
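For readers who do need the file-path route, here is a minimal sketch of loading a module with `importlib.util.spec_from_file_location`. It runs against a throwaway hyphenated directory rather than the real skill; `demo-skill`, `demo_analyze`, and both helper functions are hypothetical names. The same snippet also shows that the dotted import fails at compile time.

```python
import importlib.util
import os
import tempfile


def hyphenated_import_fails(statement: str) -> bool:
    """Return True if the statement does not even compile."""
    try:
        compile(statement, "<demo>", "exec")
    except SyntaxError:
        return True
    return False


def load_module_from_path(module_name: str, file_path: str):
    """Load a .py file by path so that __file__ is set, unlike exec()."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Throwaway hyphenated directory (hypothetical stand-in for the skill folder).
demo_root = tempfile.mkdtemp()
skill_dir = os.path.join(demo_root, "demo-skill")
os.makedirs(skill_dir)
module_path = os.path.join(skill_dir, "analyze.py")
with open(module_path, "w") as fh:
    # The demo module reads __file__ at import time, like analyze.py is said to.
    fh.write("import os\nHERE = os.path.dirname(os.path.abspath(__file__))\n")

dotted_fails = hyphenated_import_fails(
    "from skills.video-analysis.exports import analyze_video"
)
demo_module = load_module_from_path("demo_analyze", module_path)
```

Because the loader sets `__file__` before executing the module body, code that derives paths from it works, which is the part that breaks under a bare `exec`.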
```python
# Analyze a video (auto-selects native or extraction mode)
result = analyze_video("output/videos/clip.mp4", question="What happens in this video?")

# result keys (same for both patterns):
#   success: bool
#   mode: "native" | "extraction"
#
#   If mode == "native":
#     analysis: str  (model's text response)
#     model: str  (which model was used)
#     tokens: {input, output, video, audio}
#
#   If mode == "extraction":
#     frame_paths: list[str]  (workspace-relative paths to keyframe JPEGs)
#     transcript: str | None  (Whisper transcription text)
#     frame_count: int
#     duration_sec: float
```
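Since the result shape depends on `mode`, a caller may want to check that the documented keys are present before using them. The sketch below is one way to do that. `REQUIRED_KEYS` and `missing_keys` are hypothetical helpers, and the sample dict is hand-written for illustration, not real skill output.

```python
REQUIRED_KEYS = {
    "native": ("analysis", "model", "tokens"),
    "extraction": ("frame_paths", "transcript", "frame_count", "duration_sec"),
}


def missing_keys(result: dict) -> list:
    """List documented keys absent from a result; an empty list means it looks complete."""
    missing = [key for key in ("success", "mode") if key not in result]
    mode = result.get("mode")
    missing += [key for key in REQUIRED_KEYS.get(mode, ()) if key not in result]
    return missing


# Hand-written sample shaped like the documented extraction result.
sample = {
    "success": True,
    "mode": "extraction",
    "frame_paths": ["output/frames/frame_001.jpg"],  # hypothetical path
    "transcript": None,  # allowed: the video may have no audio track
    "frame_count": 1,
    "duration_sec": 95.0,
}
complete = missing_keys(sample)
incomplete = missing_keys({"success": True, "mode": "native", "analysis": "..."})
```

Note that `transcript` counts as present even when it is `None`, because the documented type is `str | None`.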
Using the Exports

```python
from core.skill_tools import video_analysis

# Full analysis (auto-selects mode)
result = video_analysis.analyze_video("output/videos/my_video.mp4", question="Describe this video")

# Check current config
config = video_analysis.get_config()

# Get video metadata without analyzing
info = video_analysis.get_video_info("output/videos/my_video.mp4")
# → {"duration": 45.2, "size": 12345678, "width": 1920, "height": 1080, "has_audio": true}
```

Native Mode (small videos)
For videos under the size threshold, the skill sends the full video to a model
that supports native video input. The model sees every frame and hears the audio.

Default model: `google/gemini-3.1-flash-lite`, best price/quality for video.

Model benchmark (6MB clip, vs `gemini-3.1-pro-preview` baseline):

| Model | Tier | Cost | Time | Accuracy | Notes |
|---|---|---|---|---|---|
| google/gemini-3.1-flash-lite | budget | ~$0.0014 | 8.1s | ~88% | ⭐ Default: cheapest + fastest |
| google/gemini-3.5-flash | standard | ~$0.0152 | 11.8s | ~85% | More detail, higher cost |
| qwen/qwen3.6-plus | budget | ~$0.0058 | 44.2s | ~95% | Accurate but slow |
| qwen/qwen3.6-flash | budget | ~$0.0027 | 16.6s | ~80% | Misreads subjects sometimes |
| google/gemini-3.1-pro-preview | standard | ~$0.0199 | 19.7s | 100% | Baseline (best, most expensive) |

`flash-lite` identifies the full scene, action sequence, and transitions
correctly at ~14x lower cost than the Pro baseline. For maximum accuracy
(exact character names, fine detail), switch `default_model` to
`gemini-3.1-pro-preview` or `gemini-3.5-flash` in `config/video-analysis.yaml`.
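The "~14x" figure follows directly from the benchmark costs. As a quick check, the snippet below recomputes each model's cost ratio against the Pro baseline. The numbers are copied from the table, and `savings_factor` is just an illustrative helper.

```python
# Costs copied from the benchmark table (USD per 6MB clip, approximate).
COST_USD = {
    "google/gemini-3.1-flash-lite": 0.0014,
    "google/gemini-3.5-flash": 0.0152,
    "qwen/qwen3.6-plus": 0.0058,
    "qwen/qwen3.6-flash": 0.0027,
    "google/gemini-3.1-pro-preview": 0.0199,
}
BASELINE = "google/gemini-3.1-pro-preview"


def savings_factor(model: str) -> float:
    """How many times cheaper a model is than the baseline."""
    return COST_USD[BASELINE] / COST_USD[model]


factors = {name: round(savings_factor(name), 1) for name in COST_USD}
```

For flash-lite this gives 0.0199 / 0.0014 ≈ 14.2, which is where the rounded "~14x" comes from.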
Extraction Mode (large videos)
For videos over the size threshold, the skill extracts keyframes and transcribes audio:
- Short videos (≤60s): One frame every N seconds (default: 2s)
- Long videos (>60s): Scene-change detection picks visually distinct frames
- Audio: Extracted and sent to Whisper for transcription
- Max frames: Capped at 30 (configurable) to control cost
The agent receives frame image paths and transcript text, then feeds them
to the current chat model as image attachments + context text.
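To make the two sampling strategies concrete, here is a rough sketch of how the ffmpeg arguments might differ between short and long videos. This is an assumption about the invocation, not the skill's actual command line: `build_ffmpeg_args` is a hypothetical helper, and capping with `-frames:v` simply keeps the first N frames, which may not be how the skill enforces its limit.

```python
def build_ffmpeg_args(video_path, out_pattern, duration_sec,
                      interval_sec=2, scene_threshold=0.3, max_frames=30):
    """Build an ffmpeg argument list for keyframe extraction (not executed here)."""
    if duration_sec <= 60:
        # Short video: fixed-rate sampling, one frame every interval_sec seconds.
        video_filter = f"fps=1/{interval_sec}"
        extra = []
    else:
        # Long video: keep only frames whose scene-change score exceeds the threshold.
        video_filter = f"select='gt(scene,{scene_threshold})'"
        extra = ["-vsync", "vfr"]  # newer ffmpeg builds prefer "-fps_mode vfr"
    return ["ffmpeg", "-i", video_path, "-vf", video_filter, *extra,
            "-frames:v", str(max_frames), out_pattern]


short_args = build_ffmpeg_args("clip.mp4", "frames/frame_%03d.jpg", duration_sec=45)
long_args = build_ffmpeg_args("talk.mp4", "frames/frame_%03d.jpg", duration_sec=600)
```

The list form is meant for `subprocess.run(args)` without a shell. The single quotes inside the `select` expression are ffmpeg's own filter quoting, needed because the expression contains a comma.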
Configuration
Edit `config/video-analysis.yaml` (in the workspace) to customize. This file
is created automatically on first use, only needs the keys you want to override,
and survives skill updates.

Do NOT edit `skills/video-analysis/config.yaml`: that's the factory default and is overwritten on every skill auto-update. The user file overlays it.
Both the standalone skill and the chat "send a video" flow read this same config,
so one edit changes the model everywhere. Available keys:
```yaml
# Model for native video understanding
default_model: google/gemini-3.1-flash-lite

# Size threshold: native (≤) vs extraction (>)
# Set to 0 → always extraction. Set to 100 → always native.
native_size_limit_mb: 20

# Frame extraction settings
extraction:
  max_frames: 30                # Max keyframes to extract
  short_video_interval_sec: 2   # Frame interval for ≤60s videos
  scene_threshold: 0.3          # Scene detection sensitivity (0.0-1.0)
  transcribe_audio: true        # Whether to Whisper-transcribe audio
```
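The overlay behaviour (the user file only lists the keys it overrides) can be pictured as a recursive dictionary merge. The sketch below assumes nested sections such as `extraction` are merged key by key. The skill may instead replace whole sections, so treat `overlay` as a hypothetical illustration rather than the actual loader.

```python
import copy

FACTORY_DEFAULTS = {
    "default_model": "google/gemini-3.1-flash-lite",
    "native_size_limit_mb": 20,
    "extraction": {
        "max_frames": 30,
        "short_video_interval_sec": 2,
        "scene_threshold": 0.3,
        "transcribe_audio": True,
    },
}


def overlay(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, merging nested dicts key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


# A user file that only sets two keys.
user_config = {"native_size_limit_mb": 0, "extraction": {"max_frames": 12}}
effective = overlay(FACTORY_DEFAULTS, user_config)
```

Under this assumption the user keeps the default `scene_threshold` while lowering `max_frames`, and the factory defaults themselves are left unmodified.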
Available Video Models
| Model | Alias | Tier | Notes |
|---|---|---|---|
| google/gemini-3.1-flash-lite | flash31 | budget | ⭐ Default, best price/quality, cheapest option |
| google/gemini-3.5-flash | gemini35 | standard | More detail, higher cost |
| google/gemini-3.1-pro-preview | gemini | standard | Highest quality |
| qwen/qwen3.6-flash | qwenf | budget | Good alternative |
| qwen/qwen3.6-plus | qwen | budget | — |
| minimax/minimax-m3 | mm3 | standard | — |
| meta-llama/llama-4-maverick | maverick | standard | — |
| meta-llama/llama-4-scout | scout | budget | — |
| xiaomi/mimo-v2.5 | mimo | standard | — |
| z-ai/glm-5v-turbo | glm5v | standard | — |
| minimax/minimax-m2.7 | mm27 | budget | Audio-only, no image |
Agent Behavior
When the user provides a video file (via upload or file path) and the current
chat model does NOT support video:
- Call `analyze_video(path, question)`.
- If result mode is `"native"` → return `result["analysis"]` directly.
- If result mode is `"extraction"` → use `result["frame_paths"]` as image references and `result["transcript"]` as context, then ask the current model to analyze based on the frames + transcript.
When the current model DOES support video, the backend handles it natively
via Phase 1 (base64 content block injection) — no need for this skill.
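The branching described in this section could look roughly like the following. The `action` payload is a made-up shape for illustration only: how the agent really attaches images and text to the chat model is not specified here, so `build_followup` and its return format are assumptions.

```python
def build_followup(result: dict, question: str) -> dict:
    """Turn an analyze_video result into what the agent does next."""
    if not result.get("success"):
        return {"action": "report_error"}
    if result["mode"] == "native":
        # The video model already answered; pass its text through.
        return {"action": "reply", "text": result["analysis"]}
    # Extraction mode: hand frames and transcript to the current chat model.
    transcript = result.get("transcript") or "(no transcript available)"
    prompt = f"{question}\n\nTranscript:\n{transcript}"
    return {"action": "ask_model", "images": list(result["frame_paths"]), "text": prompt}


native_step = build_followup(
    {"success": True, "mode": "native", "analysis": "A cat jumps onto a table."},
    "What happens in this video?",
)
extraction_step = build_followup(
    {"success": True, "mode": "extraction",
     "frame_paths": ["frames/frame_001.jpg"], "transcript": None},
    "What happens in this video?",
)
```

Handling a `None` transcript explicitly matters because videos without an audio track still produce usable frames.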
Troubleshooting
| Problem | Fix |
|---|---|
| "File not found" | Check path is workspace-relative (e.g. |
| Native mode returns error | Check |
| No audio transcription | Video may have no audio track; check |
| Too few frames extracted | Lower |
| Too many frames / high cost | Reduce |