Video Creator
Tool for combining images and audio to create videos.
⛔⛔⛔ Top Rule: Audio-Visual Synchronization (Violation Requires Full Redo!)
Every time you generate a video, each image's display duration must be calculated precisely from the timestamps in narration.json. Estimating durations by feel is strictly forbidden!
Mandatory Workflow (No Steps Skipped!)
1. Generate dubbing via TTS → Get narration.mp3 + narration.json (timestamps)
2. Read narration.json and analyze the semantic meaning of each sentence
3. Determine which sentences each image corresponds to (match by content semantics, not average distribution!)
4. Duration of each image = end timestamp of the last corresponding sentence - start timestamp of the first corresponding sentence
5. Verification: Sum of all durations ≈ total audio duration (error < 1s)
6. ⛔ Run verify_alignment.py for validation (must pass before synthesis!)
7. Only write video_config.yaml after passing validation; otherwise, stop!
Example (Ancient Poetry Teaching)
Timestamps in narration.json:
Sentence 0 [0.1-2.6] "Quiet Night Thoughts, Tang Dynasty, Li Bai."
Sentence 1 [2.6-5.7] "Before my bed a pool of light, I wonder if it's frost on the ground."
Sentence 2 [5.7-8.6] "I lift my eyes and look at the bright moon, I lower my head and think of my hometown."
Sentence 3 [8.6-16.4] "This poem was written by Li Bai, a great poet of the Tang Dynasty..."
Sentence 4 [16.4-20.9] "In just twenty characters..."
Sentence 5 [20.9-22.6] "Before my bed a pool of light."
...
Image Allocation:
01_title.png → Sentences 0-2 (Full poem recitation) → duration = 8.6 - 0.1 = 8.5s
02_poet.png → Sentences 3-4 (Poet introduction) → duration = 20.9 - 8.6 = 12.3s
03_moonlight.png → Sentences 5-8 (Interpretation of the first line) → duration = 38.6 - 20.9 = 17.7s
...
Validation: 8.5 + 12.3 + 17.7 + ... = 114.4s ≈ Audio duration 114.5s ✅
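The allocation above can be reproduced mechanically. The following is a minimal sketch (the sentence list is abridged to the first five sentences, and the `start`/`end` field names are assumptions about narration.json's layout):

```python
# Hedged sketch: derive per-image durations from sentence timestamps.
# The "start"/"end" field names are assumptions for illustration.
sentences = [
    {"start": 0.1, "end": 2.6},    # Sentence 0: title
    {"start": 2.6, "end": 5.7},    # Sentence 1
    {"start": 5.7, "end": 8.6},    # Sentence 2
    {"start": 8.6, "end": 16.4},   # Sentence 3
    {"start": 16.4, "end": 20.9},  # Sentence 4
]

# Image -> (first sentence index, last sentence index), matched by semantics
allocation = {"01_title.png": (0, 2), "02_poet.png": (3, 4)}

durations = {
    img: round(sentences[last]["end"] - sentences[first]["start"], 2)
    for img, (first, last) in allocation.items()
}
print(durations)  # {'01_title.png': 8.5, '02_poet.png': 12.3}
```

Matching images to sentence ranges first and only then deriving durations is what keeps audio and visuals locked together.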
Forbidden Actions
- ❌ Arbitrarily assign 10s, 15s, 20s to each image based on feeling
- ❌ Average distribution (Total duration / number of images)
- ❌ Write duration without reading narration.json
- ❌ Force synthesis when the total image duration differs from audio by more than 5 seconds
- ❌ Allow video_maker.py to auto-stretch by more than 1 second
- ❌ Use duration: auto (completely removed from code!)
Image Generation Rules (Violation Requires Redo)
0. Default Dimensions (Most Important!)
Default Aspect Ratio: 16:9 (1920×1080 Landscape)
Use 16:9 unless:
- User explicitly specifies another ratio
- Collaborating with other workflows that have clear requirements (e.g., 3:4 for Xiaohongshu, 9:16 for Douyin)
```bash
# ✅ Default: Do not specify -r parameter, or explicitly write -r 16:9
python text_to_image.py "prompt" -o output.png
python text_to_image.py "prompt" -r 16:9 -o output.png

# ❌ Forbidden: Randomly switch ratios, alternating between portrait and landscape
# All images in the same video project must use the same ratio!
```
1. Image Density Requirements
| Video Duration | Minimum Number of Images | Duration per Image |
|---|---|---|
| 30s | 8 | 3-4s |
| 60s | 15 | 3-5s |
| 90s | 23 | 3-5s |
| 120s | 30 | 3-5s |
Rule: Each image can be displayed for a maximum of 7 seconds. If exceeded, split it!
Calculation Formula:
number_of_images = ceil(audio_duration / 4)
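As a sketch, the formula maps directly onto `math.ceil` from the standard library (the function name and the 4-second target are taken from the formula above):

```python
import math

def min_image_count(audio_duration: float, target_seconds: float = 4.0) -> int:
    """Minimum number of images for a narration of the given length."""
    return math.ceil(audio_duration / target_seconds)

print(min_image_count(30))   # 8
print(min_image_count(60))   # 15
print(min_image_count(120))  # 30
```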
2. Prompt Language Requirements for Image Generation
All image generation prompts must be in Chinese. English prompts are strictly forbidden!
```bash
# ❌ Wrong: English prompt
python text_to_image.py "modern tech illustration, AI robot, blue gradient background"

# ❌ Wrong: Mixed English and Chinese prompt
python text_to_image.py "tech style, Chinese text 'Duel', blue theme"

# ✅ Correct: Chinese-only prompt
python text_to_image.py "Modern tech illustration style, cute AI robot sitting in front of a computer, blue-purple gradient background, neon lighting effects, multiple floating holographic UI panels, high information density, professional infographic style"
```
Rules:
- Prompts must be Chinese-only
- Any text in generated images must be Chinese
- No English is allowed
3. Information Density Requirements
Information Density = Rich text points + Abundant visual elements
Rich Text Content:
```
# ❌ Wrong: Too little text
Only "AI Comparison" in the image

# ✅ Correct: Rich text points
Image includes:
- Title: QoderWork vs OpenClaw
- Subtitle: Desktop AI Assistant Comparison
- Point 1: Out-of-the-box vs Customizable
- Point 2: $19/month vs Free & Open Source
- Point 3: General Users vs Tech Enthusiasts
```
Abundant Visual Elements:
```
# ❌ Wrong: Too empty
Only text, no icons, charts, or decorations

# ✅ Correct: Rich visualization
- Pair with icons (checkmarks, crosses, arrows, stars)
- Pair with charts (bar charts, pie charts, comparison bars)
- Pair with illustrations (robots, computers, user avatars)
- Pair with decorations (lighting effects, gradients, borders)
```
Information Density Principles:
- Each image must have a clear text title and key points
- Text content must correspond to the dubbing content
- Visualization is more important: icons, charts, illustrations, decorations
- Pure text images or pure decorative images are forbidden
4. Detailed and Specific Image Generation Descriptions
Each image must have rich, specific visual elements. Vague descriptions are forbidden!
```bash
# ❌ Wrong: Too vague
"A robot"
"Tech-style image"
"Comparison image"

# ✅ Correct: Detailed and specific
"Cute blue AI robot mascot, rounded metallic texture, sitting at a modern minimalist desk,
with a 27-inch curved monitor displaying code, a coffee cup and succulent plant beside it,
three holographic panels floating above the robot showing line charts, pie charts, and progress bars,
dark blue tech background, blue light strips on the floor, cyberpunk style, soft volumetric lighting"
```
6 Essential Elements for Prompts (None can be missing):
| Element | Description | Example |
|---|---|---|
| Subject | Who/what, specific | "Blue rounded metallic AI robot" instead of "robot" |
| Action | What it's doing, posture | "Hands on keyboard typing, slightly tilted head" |
| Environment | Where it is, background | "Modern minimalist office, city night view outside floor-to-ceiling windows" |
| Details | Surrounding items/elements | "Coffee cup, succulent plant, sticky notes on desk" |
| Style | Art style/lighting effects | "Cyberpunk style, neon lighting, volumetric lighting effect" |
| Color | Main color tone | "Blue-purple gradient main tone, orange accents" |
5. Visual Style Consistency
All images in the same video must maintain consistent style:
- Use the same style prefix
- Use the same color scheme
- Use the same aspect ratio (forbid mixing portrait and landscape!)
```bash
# Define a unified style prefix (in Chinese!)
STYLE="Modern tech illustration style, clean vector design, blue-purple gradient color scheme, professional infographic aesthetics, high information density, neon glow effect, dark background"

# All images use this prefix + the same aspect ratio
python text_to_image.py "$STYLE, [specific scene content]" -r 16:9 -o scene01.png
python text_to_image.py "$STYLE, [specific scene content]" -r 16:9 -o scene02.png
```
6. Aspect Ratio Guide
| Scenario | Aspect Ratio | Description |
|---|---|---|
| Default/General | 16:9 | Bilibili, YouTube, Official Account Videos, PPT Illustrations |
| Douyin/Video Account/Kuaishou | 9:16 | Vertical short video platforms, requires user specification or workflow requirements |
| Xiaohongshu | 3:4 | Xiaohongshu note illustrations, requires user specification or workflow requirements |
| Moments | 1:1 | Square, requires user specification |
Rule: When in doubt, use 16:9
Core Workflow (Rules)
Story Video Generation Workflow (Nested Workflow)
When users provide stories/plots/scripts, strictly follow this nested workflow:
```
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Story → Split into scenes → Parallel generation of scene main images (text-to-image) │
│                                                             │
│ Havoc in Heaven → Scene 1: Sun Wukong humiliated as Horse Keeper │
│                   Scene 2: Returns to Flower Fruit Mountain on Cloud Somersault │
│                   Scene 3: Jade Emperor sends troops        │
│                   ...                                       │
│ → Parallel call text_to_image.py to generate a main image for each scene │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Each scene main image → Generate close-up shots via image-to-image (maintain character consistency) │
│                                                             │
│ Scene 1 main image → Close-up 1: Sun Wukong doubts the official seal │
│                      Close-up 2: Sun Wukong kicks over the horse trough │
│ Scene 2 main image → Close-up 1: Soars on Cloud Somersault  │
│                      Close-up 2: Proclaims himself Great Sage Equal to Heaven at Flower Fruit Mountain │
│ → Parallel call image_to_image.py with the main image as reference │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Generate dubbing + subtitles + validation + video synthesis │
│                                                             │
│ 1. tts_generator.py generates dubbing + timestamps          │
│ 2. [Rule] Precisely calculate each image's duration from the timestamps │
│ 3. Generate SRT subtitles                                   │
│ 4. Generate video_config.yaml                               │
│ 5. ⛔ Run verify_alignment.py for validation (must pass!)   │
│ 6. video_maker.py synthesis:                                │
│    → Image synthesis (with transitions)                     │
│    → Audio merging                                          │
│    → Burn subtitles (ASS format, fixed at bottom center)    │
│    → Auto-append outro (QR code + "Follow for more content") │
│    → Add BGM                                                │
└─────────────────────────────────────────────────────────────┘
```
**Rule: All videos must auto-append outro!**
Directory Structure Specification
```
assets/generated/{project_name}/
├── scene1/
│   ├── main.png           # Scene 1 main image (text-to-image)
│   ├── shot_01.png        # Close-up 1 (image-to-image)
│   └── shot_02.png        # Close-up 2 (image-to-image)
├── scene2/
│   ├── main.png
│   ├── shot_01.png
│   └── shot_02.png
├── ...
├── narration.mp3          # Dubbing
├── narration.json         # Timestamps
├── subtitles.srt          # Subtitles
├── video_config.yaml      # Video configuration
└── {project_name}.mp4     # Final video
```
Example Execution Commands
```bash
# Layer 1: Parallel generation of scene main images (default 16:9)
python .opencode/skills/image-service/scripts/text_to_image.py "Style description, Scene 1 content" -r 16:9 -o scene1/main.png &
python .opencode/skills/image-service/scripts/text_to_image.py "Style description, Scene 2 content" -r 16:9 -o scene2/main.png &
wait

# Layer 2: Parallel image-to-image generation of close-up shots (same aspect ratio)
python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "Maintain character style, close-up description" -r 16:9 -o scene1/shot_01.png &
python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "Maintain character style, close-up description" -r 16:9 -o scene1/shot_02.png &
wait

# Layer 3: Generate dubbing + validation + video synthesis
python .opencode/skills/video-creator/scripts/tts_generator.py --text "Full narration" --output narration.mp3 --timestamps

# ⛔ Mandatory validation (must pass before synthesis!)
python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

# Synthesize only after validation passes
python .opencode/skills/video-creator/scripts/video_maker.py video_config.yaml --srt subtitles.srt --bgm epic
```
Video Configuration File Format
```yaml
# video_config.yaml
ratio: "16:9"        # Default landscape! Must be quoted to avoid YAML parsing errors
bgm_volume: 0.12
outro: true
scenes:
  - audio: narration.mp3
    images:
      # Arrange all close-up shots in scene order
      - file: scene1/shot_01.png
        duration: 4.34
      - file: scene1/shot_02.png
        duration: 4.88
      - file: scene2/shot_01.png
        duration: 2.15
      # ...
```
Note: the `ratio` value must be enclosed in quotes, e.g., `ratio: "16:9"`, otherwise YAML will parse it as a base-60 number rather than a string.
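The quoting requirement is easy to demonstrate: PyYAML follows YAML 1.1, in which an unquoted `16:9` matches the base-60 (sexagesimal) integer form:

```python
import yaml  # PyYAML, already a dependency of this skill

print(yaml.safe_load("ratio: 16:9"))    # {'ratio': 969}  (16*60 + 9, parsed as a base-60 integer)
print(yaml.safe_load('ratio: "16:9"'))  # {'ratio': '16:9'}
```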
Duration Allocation Specification (Rule!)
Before generating video_config.yaml, strictly calculate duration following this workflow:
Step 1: Read Timestamp File
```python
import json

with open("narration.json", "r") as f:
    timestamps = json.load(f)

audio_duration = timestamps[-1]["end"]
print(f"Total audio duration: {audio_duration:.1f}s")
```
Step 2: Divide Scenes by Content Semantics
Determine the time period corresponding to each image based on the narration content:
```python
# Example: division based on narration content
# Find the timestamp points where the theme changes
scenes = [
    ("cover.png", 0, 12.5),     # Opening to first theme change
    ("scene01.png", 12.5, 26),  # Second segment
    # ... divide precisely at sentence boundaries in narration.json
]
```
Step 3: Calculate Duration for Each Image
```python
for file, start, end in scenes:
    duration = end - start
    print(f"{file}: {duration:.1f}s")
```
Step 4: Validate Total Duration
```python
total_duration = sum(end - start for _, start, end in scenes)
assert abs(total_duration - audio_duration) < 1.0, \
    f"Duration mismatch! Total image duration {total_duration}s vs audio duration {audio_duration}s"
```
Rules
- Read the timestamps from narration.json first; never estimate by feel
- Divide at sentence semantic boundaries; never distribute evenly
- Validate before generating the configuration: total image duration must match the audio duration (error < 1s)
- Do not rely on the script to auto-stretch; a video with audio-visual desync is unqualified
- Never use duration=0; each image must be displayed for at least 1 second
- After generating the YAML, validate with verify_alignment.py before synthesis
Validation Script (Mandatory Execution! Must Pass Before Synthesis!)
```bash
# ⛔ Must run the validation script before video synthesis. Synthesis is forbidden if it fails!
python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

# Validation content:
# 1. Whether all image files exist
# 2. Whether each duration is a precise numerical value (non-numeric values are rejected outright!)
# 3. Total image duration vs total audio duration (error < 1s)
# 4. Whether each image duration is within the reasonable range (1-7s)
# 5. Semantic cross-check between image filename keywords and narration keywords
# 6. Output of a complete comparison table (Image + Duration + Semantic ✅/❌ + corresponding narration text)

# Exit code 0 = pass, 1 = fail
# If it fails, synthesis is forbidden: fix and re-validate!
```
Note: video_maker.py also has built-in hard validation — duration must be a precise positive float. Missing or non-numeric values will trigger immediate exit(1)! duration: auto has been completely removed!
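At its core, the total-duration check (item 3 above) reduces to a loop like this sketch. The config is shown as an already-parsed dict with the same field names as video_config.yaml; the thresholds mirror the rules above, and the sample numbers are illustrative:

```python
# Hedged sketch of the total-duration and per-image range checks.
config = {
    "scenes": [{
        "audio": "narration.mp3",
        "images": [
            {"file": "scene1/shot_01.png", "duration": 4.34},
            {"file": "scene1/shot_02.png", "duration": 4.88},
        ],
    }]
}
audio_duration = 9.20  # would normally come from narration.json

total = 0.0
for scene in config["scenes"]:
    for img in scene["images"]:
        d = img["duration"]
        # Reject non-numeric or out-of-range durations outright
        assert isinstance(d, (int, float)) and 1.0 <= d <= 7.0, f"bad duration: {img}"
        total += d

assert abs(total - audio_duration) < 1.0, f"mismatch: {total:.2f}s vs {audio_duration:.2f}s"
print(f"OK: images {total:.2f}s vs audio {audio_duration:.2f}s")
```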
Duration Allocation Table Template
Output the allocation table for confirmation before generating configuration:
```markdown
| Image | Corresponding Content | Start | End | Duration |
|-------------|-----------------------|-------|-----|----------|
| cover.png | Opening Introduction | 0s | 12.5s | 12.5s |
| scene01.png | AI Agent Era | 12.5s | 26s | 13.5s |
| ... | ... | ... | ... | ... |
| **Total** | | | | **{total}s** |

Total Audio Duration: {audio_duration}s
Difference: {diff}s ✅/❌
```
Subtitle Specification
Subtitles use ASS format, fixed at bottom center position:
- Position: Bottom center (Alignment=2)
- Font: PingFang SC
- Size: Screen height / 40
- Stroke: 2px black stroke + 1px shadow
- Bottom Margin: Screen height / 20
Forbid: Subtitles moving randomly, inconsistent sizes, or unfixed positions
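For a 1080p frame, those rules translate into an ASS style line roughly like the sketch below. Every field other than font, size, outline, shadow, alignment, and bottom margin is an illustrative default, not confirmed from video_maker.py:

```python
height = 1080
font_size = height // 40  # screen height / 40 -> 27
margin_v = height // 20   # bottom margin: screen height / 20 -> 54

# Illustrative ASS Style line; field order follows the ASS [V4+ Styles] format.
style = (
    "Style: Default,PingFang SC,"
    f"{font_size},&H00FFFFFF,&H00FFFFFF,&H00000000,&H80000000,"
    "0,0,0,0,100,100,0,0,1,2,1,"  # ..., BorderStyle=1, Outline=2, Shadow=1
    f"2,30,30,{margin_v},1"        # Alignment=2 (bottom center), margins, Encoding
)
print(style)
```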
Script Parameter Description
video_maker.py
```bash
python video_maker.py config.yaml [options]
```
| Parameter | Description | Default Value |
|---|---|---|
| | Do not add outro | Add outro |
| | Do not add BGM | Add BGM |
| | Transition duration (seconds) | 0.5 |
| | BGM volume | 0.08 |
| `--bgm` | Custom BGM (Optional: epic) | Default tech style |
| | Video aspect ratio | 16:9 (overridden by configuration file) |
| `--srt` | Subtitle file path | None |
verify_alignment.py
```bash
python verify_alignment.py video_config.yaml
```
| Validation Item | Description |
|---|---|
| Image Existence | All image files must exist |
| Duration Precision | Must be a positive float, forbid auto/empty/non-numeric values |
| Total Duration Match | Total image duration vs Total audio duration, error < 1s |
| Single Image Duration Range | Each image 1-7s, warning if exceeded |
| Semantic Cross-Check | Match between image filename keywords and voice content keywords |
tts_generator.py
```bash
python tts_generator.py --text "Text" --output audio.mp3 [options]
```
| Parameter | Description | Default Value |
|---|---|---|
| | Voice ID | zh-CN-YunxiNeural |
| | Speech rate | +0% |
| `--timestamps` | Output timestamp JSON | No |
Supported Video Aspect Ratios
Consistent with image-service, supports 10 aspect ratios:
| Aspect Ratio | Resolution | Applicable Scenario |
|---|---|---|
| 1:1 | 1024×1024 | Square, Moments |
| 2:3 | 832×1248 | Vertical poster |
| 3:2 | 1248×832 | Horizontal poster |
| 3:4 | 1080×1440 | Xiaohongshu, Moments |
| 4:3 | 1440×1080 | Traditional monitor |
| 4:5 | 864×1080 | Instagram |
| 5:4 | 1080×864 | Horizontal photo |
| 9:16 | 1080×1920 | Douyin, Video Account, Vertical screen |
| 16:9 | 1920×1080 | Bilibili, YouTube, Official Account Videos, Landscape |
| 21:9 | 1536×672 | Ultra-wide screen movie |
Outro Specification
Rule: All videos must auto-append the corresponding outro based on aspect ratio!
Outro Matching Order:
- Exact match: outro file named for the ratio (e.g., outro_3x4.mp4 for 3:4)
- Orientation match: Portrait → outro_9x16.mp4, Landscape → outro.mp4
- Fallback: outro.mp4
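That matching order can be sketched as a small helper. The function name and the ratio-to-file map are illustrative; the filenames are the outro assets listed in the directory structure below:

```python
def pick_outro(ratio: str) -> str:
    """Pick an outro clip: exact ratio match, then orientation, then fallback."""
    exact = {"16:9": "outro.mp4", "9:16": "outro_9x16.mp4", "3:4": "outro_3x4.mp4"}
    if ratio in exact:
        return exact[ratio]
    w, h = map(int, ratio.split(":"))
    return "outro_9x16.mp4" if h > w else "outro.mp4"  # portrait vs landscape/square

print(pick_outro("3:4"))   # outro_3x4.mp4
print(pick_outro("2:3"))   # outro_9x16.mp4 (portrait orientation)
print(pick_outro("21:9"))  # outro.mp4 (landscape fallback)
```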
BGM Resources
Classification by Style
| Style | File | Shortcut Parameter | Applicable Scenario |
|---|---|---|---|
| Ancient Chinese Style | | | Water Margin, Romance of the Three Kingdoms, Historical Stories |
| | | | Martial Arts, Action, Battle |
| | | | Zen, Meditation, Traditional Chinese Studies |
| | | | Pastoral, Leisure, Scenery |
| Healing/Relaxing | | | Vlog, Daily Life, Lifestyle |
| | | | Fantasy, Memories, Warmth |
| | | | Cheerful, Bright, Sunny |
| Passionate/Epic | | | Inspirational, Battle, High-energy |
| | | | Heroes, Victory, Glory |
| | | | War, Epic, Grand Scale |
| | | | Cinematic, Narrative, Emotional |
| Tech/Futuristic | | | AI, Products, Tutorials |
| | | | Digital, Internet, Online Services |
| | | | Sci-fi, Space, Mystery |
| Suspense/Tense | | | Mystery, Tense, Crisis |
| | | | Horror, Dark, Suspense |
| Cheerful/Lively | | | Comedy, Relaxing, Rhythmic |
| | | | Cute, Children, Animation |
| | | | Playful, Funny, Short Videos |
| Electronic/Rhythmic | | | Dynamic, Editing, Beat-matching |
| | | | Electronic, Modern, Trendy |
| | | | Games, 8-bit, Retro Electronic |
| | | | Rap, Street, Trendy |
Usage
```bash
# Method 1: Shortcut parameter (recommended)
python video_maker.py config.yaml --bgm epic
python video_maker.py config.yaml --bgm ancient
python video_maker.py config.yaml --bgm edm

# Method 2: Full filename
python video_maker.py config.yaml --bgm bgm_ancient_tale.mp3

# Method 3: Custom path
python video_maker.py config.yaml --bgm /path/to/custom.mp3
```
Common Voice IDs
| Voice ID | Style |
|---|---|
| zh-CN-YunyangNeural | Male, News Broadcast |
| zh-CN-YunxiNeural | Male, Sunny and Lively |
| zh-CN-XiaoxiaoNeural | Female, Warm and Natural |
| zh-CN-XiaoyiNeural | Female, Lively and Cute |
Directory Structure
```
video-creator/
├── SKILL.md
├── scripts/
│   ├── video_maker.py        # Main script: Images + Audio → Video (built-in duration hard check)
│   ├── verify_alignment.py   # Mandatory pre-synthesis validation (duration + semantic cross-check)
│   ├── tts_generator.py      # TTS voice generation
│   └── scene_splitter.py     # Scene splitter (optional)
├── assets/
│   ├── outro.mp4             # General outro (16:9)
│   ├── outro_9x16.mp4        # Portrait outro
│   ├── outro_3x4.mp4         # 3:4 outro
│   └── bgm_*.mp3             # 22 BGM tracks (see BGM resource table)
└── references/
    └── edge_tts_voices.md
```
Dependencies
```bash
# System dependencies
brew install ffmpeg  # Mac

# Python dependencies
pip install edge-tts pyyaml
```