Video understanding for any model — native passthrough for small files, frame extraction + audio transcription fallback for large files. Use when the user asks to analyze, describe, or understand a video file (e.g. "what's in this video", "summarize this clip", "transcribe this recording").
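The small-file versus large-file split described above can be sketched as a simple size check. This is a minimal illustration, not the skill's actual code: the function names are hypothetical, the 20 MB default is the documented threshold, and treating "MB" as 1024 × 1024 bytes is an assumption.

```python
import os

# Assumption: the threshold is measured in mebibytes (1024 * 1024 bytes).
MB = 1024 * 1024


def choose_mode(size_bytes: int, native_size_limit_mb: float = 20) -> str:
    """Return "native" for files at or under the limit, else "extraction"."""
    if size_bytes <= native_size_limit_mb * MB:
        return "native"
    return "extraction"


def choose_mode_for_file(path: str, native_size_limit_mb: float = 20) -> str:
    """Hypothetical helper: look up the file size on disk, then pick a mode."""
    return choose_mode(os.path.getsize(path), native_size_limit_mb)
```

With a limit of 0, any non-empty file falls through to extraction, which matches the documented "set to 0 → always extraction" behavior.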
npx skill4agent add starchild-ai-agent/official-skills video-analysis

analyze_video(path, question)
│
├─ file_size ≤ threshold (default 20MB)
│ → Send video to a supports_video model (default Gemini 3.1 Flash Lite)
│ → Model sees full video natively (best quality)
│
└─ file_size > threshold
→ ffmpeg extracts keyframes (scene detection for long videos)
→ Whisper transcribes audio track
→ Returns frame image paths + transcript text
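→ Agent feeds these to the current chat model

The skill directory is named `video-analysis`; because of the hyphen, `from skills.video-analysis.exports import ...` is not valid Python. Run from the skill directory instead:

cd /data/workspace/skills/video-analysis && \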
python3 -c "from exports import analyze_video; \
import json; \
print(json.dumps(analyze_video('output/videos/clip.mp4', \
question='What happens in this video?'), ensure_ascii=False))"

WORKSPACE_DIR

from core.skill_tools import video_analysis
result = video_analysis.analyze_video("output/videos/clip.mp4",
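question="What happens in this video?")

Avoid loading the script with `exec(open('skills/video-analysis/analyze.py').read())`: under `exec`, `__file__` does not point at `analyze.py`. To load a file by path, use `importlib.util.spec_from_file_location`.

# result keys (same for both patterns):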
# Analyze a video — auto-selects native or extraction mode
# result = analyze_video("output/videos/clip.mp4", question="What happens in this video?")
# result keys:
# success: bool
# mode: "native" | "extraction"
#
# If mode == "native":
# analysis: str (model's text response)
# model: str (which model was used)
# tokens: {input, output, video, audio}
#
# If mode == "extraction":
# frame_paths: list[str] (workspace-relative paths to keyframe JPEGs)
# transcript: str | None (Whisper transcription text)
# frame_count: int
# duration_sec: float

from core.skill_tools import video_analysis
# Full analysis (auto-selects mode)
result = video_analysis.analyze_video("output/videos/my_video.mp4", question="Describe this video")
# Check current config
config = video_analysis.get_config()
# Get video metadata without analyzing
info = video_analysis.get_video_info("output/videos/my_video.mp4")
# → {"duration": 45.2, "size": 12345678, "width": 1920, "height": 1080, "has_audio": true}

The default model is `google/gemini-3.1-flash-lite`; `gemini-3.1-pro-preview` serves as the accuracy baseline in the comparison below.

| Model | Tier | Cost | Time | Accuracy | Notes |
|---|---|---|---|---|---|
| google/gemini-3.1-flash-lite | budget | ~$0.0014 | 8.1s | ~88% | ⭐ Default — cheapest + fastest |
| google/gemini-3.5-flash | std | ~$0.0152 | 11.8s | ~85% | More detail, higher cost |
| qwen/qwen3.6-plus | budget | ~$0.0058 | 44.2s | ~95% | Accurate but slow |
| qwen/qwen3.6-flash | budget | ~$0.0027 | 16.6s | ~80% | Misreads subjects sometimes |
| google/gemini-3.1-pro-preview | std | ~$0.0199 | 19.7s | 100% | Baseline (best, most expensive) |
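
One way to read the benchmark table is as a constraint problem: pick the cheapest model that clears an accuracy floor and, optionally, a latency ceiling. The sketch below copies the approximate figures from the table; `Benchmark` and `pick_model` are hypothetical names used only for illustration and are not part of the skill.

```python
from typing import NamedTuple, Optional


class Benchmark(NamedTuple):
    model: str
    cost_usd: float   # approximate cost as listed in the table
    time_sec: float   # measured time as listed in the table
    accuracy: float   # fraction relative to the baseline model


# Figures copied from the benchmark table (approximate values).
BENCHMARKS = [
    Benchmark("google/gemini-3.1-flash-lite", 0.0014, 8.1, 0.88),
    Benchmark("google/gemini-3.5-flash", 0.0152, 11.8, 0.85),
    Benchmark("qwen/qwen3.6-plus", 0.0058, 44.2, 0.95),
    Benchmark("qwen/qwen3.6-flash", 0.0027, 16.6, 0.80),
    Benchmark("google/gemini-3.1-pro-preview", 0.0199, 19.7, 1.00),
]


def pick_model(min_accuracy: float,
               max_time_sec: Optional[float] = None) -> Optional[str]:
    """Return the cheapest model meeting the constraints, or None."""
    candidates = [
        b for b in BENCHMARKS
        if b.accuracy >= min_accuracy
        and (max_time_sec is None or b.time_sec <= max_time_sec)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.cost_usd).model
```

For example, `pick_model(0.90)` selects `qwen/qwen3.6-plus`, while adding a 20-second ceiling with `pick_model(0.90, 20)` leaves only `google/gemini-3.1-pro-preview`.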
To switch models, set `default_model` (for example to `gemini-3.1-pro-preview` or `gemini-3.5-flash`) in `config/video-analysis.yaml`. User overrides belong in `config/video-analysis.yaml`. Do NOT edit `skills/video-analysis/config.yaml`: that's the factory default and is overwritten on every skill auto-update. The user file overlays it.
# Model for native video understanding
default_model: google/gemini-3.1-flash-lite
# Size threshold: native (≤) vs extraction (>)
# Set to 0 → always extraction. Set to 100 → always native.
native_size_limit_mb: 20
# Frame extraction settings
extraction:
  max_frames: 30 # Max keyframes to extract
  short_video_interval_sec: 2 # Frame interval for ≤60s videos
  scene_threshold: 0.3 # Scene detection sensitivity (0.0-1.0)
  transcribe_audio: true # Whether to Whisper-transcribe audio

| Model | Alias | Tier | Notes |
|---|---|---|---|
| google/gemini-3.1-flash-lite | flash31 | budget | ⭐ Default, best price/quality, cheapest option |
| google/gemini-3.5-flash | gemini35 | standard | More detail, higher cost |
| google/gemini-3.1-pro-preview | gemini | standard | Highest quality |
| qwen/qwen3.6-flash | qwenf | budget | Good alternative |
| qwen/qwen3.6-plus | qwen | budget | — |
| minimax/minimax-m3 | mm3 | standard | — |
| meta-llama/llama-4-maverick | maverick | standard | — |
| meta-llama/llama-4-scout | scout | budget | — |
| xiaomi/mimo-v2.5 | mimo | standard | — |
| z-ai/glm-5v-turbo | glm5v | standard | — |
| minimax/minimax-m2.7 | mm27 | budget | Audio-only, no image |
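
To illustrate how a user file overriding `default_model` might overlay the factory defaults, the sketch below performs a recursive dictionary merge. This is an assumption about the overlay behavior, not the skill's confirmed implementation: `overlay` is a hypothetical helper, and plain dicts stand in for the parsed YAML files.

```python
import copy

# Stand-in for the parsed factory defaults (values from the documented config).
FACTORY_DEFAULTS = {
    "default_model": "google/gemini-3.1-flash-lite",
    "native_size_limit_mb": 20,
    "extraction": {
        "max_frames": 30,
        "short_video_interval_sec": 2,
        "scene_threshold": 0.3,
        "transcribe_audio": True,
    },
}


def overlay(base: dict, override: dict) -> dict:
    """Return a new dict where override values win; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical user file: change the model and cap the frame count.
user_config = {
    "default_model": "google/gemini-3.1-pro-preview",
    "extraction": {"max_frames": 12},
}
effective = overlay(FACTORY_DEFAULTS, user_config)
```

Under this assumption, keys the user file omits (such as `scene_threshold`) keep their factory values, and the factory dict itself is never mutated.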
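Call `analyze_video(path, question)` and branch on the returned mode: for `"native"`, use `result["analysis"]`; for `"extraction"`, pass `result["frame_paths"]` and `result["transcript"]` to the current chat model.

| Problem | Fix |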
|---|---|
| "File not found" | Check path is workspace-relative (e.g. |
| Native mode returns error | Check |
| No audio transcription | Video may have no audio track; check |
| Too few frames extracted | Lower |
| Too many frames / high cost | Reduce |
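
Tying the documented result keys together, a caller might normalize both modes into one shape before handing anything to the chat model. The sketch below is illustrative only: `build_chat_inputs` and its return format are hypothetical, and it works on plain dicts shaped like the documented result rather than calling the skill.

```python
def build_chat_inputs(result: dict) -> dict:
    """Turn an analyze_video-style result into text plus image paths."""
    if not result.get("success"):
        raise ValueError("video analysis failed")

    mode = result.get("mode")
    if mode == "native":
        # The model already saw the full video; its answer is the text.
        return {"text": result["analysis"], "images": []}
    if mode == "extraction":
        # transcript may be None when the video has no audio track.
        transcript = result.get("transcript") or "(no transcript)"
        return {
            "text": f"Transcript:\n{transcript}",
            "images": list(result["frame_paths"]),
        }
    raise ValueError(f"unexpected mode: {mode!r}")
```

Raising on an unknown mode keeps a future third mode from being silently treated as one of the two documented ones.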