Video Creator
Tool for combining images and audio to create videos.
⛔⛔⛔ Top Rule: Audio-Visual Synchronization (Violation Requires Full Redo!)
Every time you generate a video, each image's display duration must be calculated precisely from the timestamps in narration.json. Estimating durations by feel is strictly forbidden!
Mandatory Workflow (No Steps Skipped!)
1. Generate dubbing via TTS → Get narration.mp3 + narration.json (timestamps)
2. Read narration.json and analyze the semantic meaning of each sentence
3. Determine which sentences each image corresponds to (match by content semantics, not average distribution!)
4. Duration of each image = end timestamp of the last corresponding sentence - start timestamp of the first corresponding sentence
5. Verification: Sum of all durations ≈ total audio duration (error < 1s)
6. ⛔ Run verify_alignment.py for validation (must pass before synthesis!)
7. Only write video_config.yaml after passing validation; otherwise, stop!
Example (Ancient Poetry Teaching)
Timestamps in narration.json:
Sentence 0 [0.1-2.6] "Quiet Night Thoughts, Tang Dynasty, Li Bai."
Sentence 1 [2.6-5.7] "Before my bed a pool of light, I wonder if it's frost on the ground."
Sentence 2 [5.7-8.6] "I lift my eyes and look at the bright moon, I lower my head and think of my hometown."
Sentence 3 [8.6-16.4] "This poem was written by Li Bai, a great poet of the Tang Dynasty..."
Sentence 4 [16.4-20.9] "In just twenty characters..."
Sentence 5 [20.9-22.6] "Before my bed a pool of light."
...
Image Allocation:
01_title.png → Sentences 0-2 (Full poem recitation) → duration = 8.6 - 0.1 = 8.5s
02_poet.png → Sentences 3-4 (Poet introduction) → duration = 20.9 - 8.6 = 12.3s
03_moonlight.png → Sentences 5-8 (Interpretation of the first line) → duration = 38.6 - 20.9 = 17.7s
...
Validation: 8.5 + 12.3 + 17.7 + ... = 114.4s ≈ Audio duration 114.5s ✅
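The allocation above can be reproduced mechanically. The following is a minimal sketch (the sentence list is abridged to the first five sentences, and the `start`/`end` field names are assumptions about narration.json's layout):

```python
# Hedged sketch: derive per-image durations from sentence timestamps.
# The "start"/"end" field names are assumptions for illustration.
sentences = [
    {"start": 0.1, "end": 2.6},    # Sentence 0: title
    {"start": 2.6, "end": 5.7},    # Sentence 1
    {"start": 5.7, "end": 8.6},    # Sentence 2
    {"start": 8.6, "end": 16.4},   # Sentence 3
    {"start": 16.4, "end": 20.9},  # Sentence 4
]

# Image -> (first sentence index, last sentence index), matched by semantics
allocation = {"01_title.png": (0, 2), "02_poet.png": (3, 4)}

durations = {
    img: round(sentences[last]["end"] - sentences[first]["start"], 2)
    for img, (first, last) in allocation.items()
}
print(durations)  # {'01_title.png': 8.5, '02_poet.png': 12.3}
```

Matching images to sentence ranges first and only then deriving durations is what keeps audio and visuals locked together.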
Forbidden Actions
- ❌ Arbitrarily assign 10s, 15s, 20s to each image based on feeling
- ❌ Average distribution (Total duration / number of images)
- ❌ Write duration without reading narration.json
- ❌ Force synthesis when the total image duration differs from audio by more than 5 seconds
- ❌ Allow video_maker.py to auto-stretch by more than 1 second
- ❌ Use duration: auto (completely removed from code!)
Image Generation Rules (Violation Requires Redo)
0. Default Dimensions (Most Important!)
Default Aspect Ratio: 16:9 (1920×1080 Landscape)
Use 16:9 unless:
- User explicitly specifies another ratio
- Collaborating with other workflows that have clear requirements (e.g., 3:4 for Xiaohongshu, 9:16 for Douyin)
```bash
# ✅ Default: Do not specify -r parameter, or explicitly write -r 16:9
python text_to_image.py "prompt" -o output.png
python text_to_image.py "prompt" -r 16:9 -o output.png

# ❌ Forbidden: Randomly switch ratios, alternating between portrait and landscape
# All images in the same video project must use the same ratio!
```
1. Image Density Requirements
| Video Duration | Minimum Number of Images | Duration per Image |
|---|---|---|
| 30s | 8 | 3-4s |
| 60s | 15 | 3-5s |
| 90s | 23 | 3-5s |
| 120s | 30 | 3-5s |
Rule: Each image can be displayed for a maximum of 7 seconds. If exceeded, split it!
Calculation Formula:
number_of_images = ceil(audio_duration / 4)
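As a sketch, the formula maps directly onto `math.ceil` from the standard library (the function name and the 4-second target are taken from the formula above):

```python
import math

def min_image_count(audio_duration: float, target_seconds: float = 4.0) -> int:
    """Minimum number of images for a narration of the given length."""
    return math.ceil(audio_duration / target_seconds)

print(min_image_count(30))   # 8
print(min_image_count(60))   # 15
print(min_image_count(120))  # 30
```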
2. Prompt Language Requirements for Image Generation
All image generation prompts must be in Chinese. English prompts are strictly forbidden!
```bash
# ❌ Wrong: English prompt
python text_to_image.py "modern tech illustration, AI robot, blue gradient background"

# ❌ Wrong: Mixed English and Chinese prompt
python text_to_image.py "tech style, Chinese text 'Duel', blue theme"

# ✅ Correct: Chinese-only prompt
python text_to_image.py "Modern tech illustration style, cute AI robot sitting in front of a computer, blue-purple gradient background, neon lighting effects, multiple floating holographic UI panels, high information density, professional infographic style"
```
Rules:
- Prompts must be Chinese-only
- Any text in generated images must be Chinese
- No English is allowed
3. Information Density Requirements
Information Density = Rich text points + Abundant visual elements
Rich Text Content:
```
# ❌ Wrong: Too little text
Only "AI Comparison" in the image

# ✅ Correct: Rich text points
Image includes:
- Title: QoderWork vs OpenClaw
- Subtitle: Desktop AI Assistant Comparison
- Point 1: Out-of-the-box vs Customizable
- Point 2: $19/month vs Free & Open Source
- Point 3: General Users vs Tech Enthusiasts
```
Abundant Visual Elements:
```
# ❌ Wrong: Too empty
Only text, no icons, charts, or decorations

# ✅ Correct: Rich visualization
- Pair with icons (checkmarks, crosses, arrows, stars)
- Pair with charts (bar charts, pie charts, comparison bars)
- Pair with illustrations (robots, computers, user avatars)
- Pair with decorations (lighting effects, gradients, borders)
```
Information Density Principles:
- Each image must have a clear text title and key points
- Text content must correspond to the dubbing content
- Visualization is more important: icons, charts, illustrations, decorations
- Pure text images or pure decorative images are forbidden
4. Detailed and Specific Image Generation Descriptions
Each image must have rich, specific visual elements. Vague descriptions are forbidden!
```bash
# ❌ Wrong: Too vague
"A robot"
"Tech-style image"
"Comparison image"

# ✅ Correct: Detailed and specific
"Cute blue AI robot mascot, rounded metallic texture, sitting at a modern minimalist desk,
with a 27-inch curved monitor displaying code, a coffee cup and succulent plant beside it,
three holographic panels floating above the robot showing line charts, pie charts, and progress bars,
dark blue tech background, blue light strips on the floor, cyberpunk style, soft volumetric lighting"
```
6 Essential Elements for Prompts (None can be missing):
| Element | Description | Example |
|---|---|---|
| Subject | Who/what, specific | "Blue rounded metallic AI robot" instead of "robot" |
| Action | What it's doing, posture | "Hands on keyboard typing, slightly tilted head" |
| Environment | Where it is, background | "Modern minimalist office, city night view outside floor-to-ceiling windows" |
| Details | Surrounding items/elements | "Coffee cup, succulent plant, sticky notes on desk" |
| Style | Art style/lighting effects | "Cyberpunk style, neon lighting, volumetric lighting effect" |
| Color | Main color tone | "Blue-purple gradient main tone, orange accents" |
5. Visual Style Consistency
All images in the same video must maintain consistent style:
- Use the same style prefix
- Use the same color scheme
- Use the same aspect ratio (forbid mixing portrait and landscape!)
```bash
# Define a unified style prefix (in Chinese!)
STYLE="Modern tech illustration style, clean vector design, blue-purple gradient color scheme, professional infographic aesthetics, high information density, neon glow effect, dark background"

# All images use this prefix + the same aspect ratio
python text_to_image.py "$STYLE, [specific scene content]" -r 16:9 -o scene01.png
python text_to_image.py "$STYLE, [specific scene content]" -r 16:9 -o scene02.png
```
6. Aspect Ratio Guide
| Scenario | Aspect Ratio | Description |
|---|---|---|
| Default/General | 16:9 | Bilibili, YouTube, Official Account Videos, PPT Illustrations |
| Douyin/Video Account/Kuaishou | 9:16 | Vertical short video platforms, requires user specification or workflow requirements |
| Xiaohongshu | 3:4 | Xiaohongshu note illustrations, requires user specification or workflow requirements |
| Moments | 1:1 | Square, requires user specification |
Rule: When in doubt, use 16:9
Core Workflow (Rules)
Story Video Generation Workflow (Nested Workflow)
When users provide stories/plots/scripts, strictly follow this nested workflow:
```
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Story → Split into scenes → Parallel generation of scene main images (text-to-image) │
│                                                             │
│ Havoc in Heaven → Scene 1: Sun Wukong humiliated as Horse Keeper │
│                   Scene 2: Returns to Flower Fruit Mountain on Cloud Somersault │
│                   Scene 3: Jade Emperor sends troops        │
│                   ...                                       │
│ → Parallel call text_to_image.py to generate a main image for each scene │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Each scene main image → Generate close-up shots via image-to-image (maintain character consistency) │
│                                                             │
│ Scene 1 main image → Close-up 1: Sun Wukong doubts the official seal │
│                      Close-up 2: Sun Wukong kicks over the horse trough │
│ Scene 2 main image → Close-up 1: Soars on Cloud Somersault  │
│                      Close-up 2: Proclaims himself Great Sage Equal to Heaven at Flower Fruit Mountain │
│ → Parallel call image_to_image.py with the main image as reference │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Generate dubbing + subtitles + validation + video synthesis │
│                                                             │
│ 1. tts_generator.py generates dubbing + timestamps          │
│ 2. [Rule] Precisely calculate each image's duration from the timestamps │
│ 3. Generate SRT subtitles                                   │
│ 4. Generate video_config.yaml                               │
│ 5. ⛔ Run verify_alignment.py for validation (must pass!)   │
│ 6. video_maker.py synthesis:                                │
│    → Image synthesis (with transitions)                     │
│    → Audio merging                                          │
│    → Burn subtitles (ASS format, fixed at bottom center)    │
│    → Auto-append outro (QR code + "Follow for more content") │
│    → Add BGM                                                │
└─────────────────────────────────────────────────────────────┘
```
**Rule: All videos must auto-append outro!**
Directory Structure Specification
```
assets/generated/{project_name}/
├── scene1/
│   ├── main.png           # Scene 1 main image (text-to-image)
│   ├── shot_01.png        # Close-up 1 (image-to-image)
│   └── shot_02.png        # Close-up 2 (image-to-image)
├── scene2/
│   ├── main.png
│   ├── shot_01.png
│   └── shot_02.png
├── ...
├── narration.mp3          # Dubbing
├── narration.json         # Timestamps
├── subtitles.srt          # Subtitles
├── video_config.yaml      # Video configuration
└── {project_name}.mp4     # Final video
```
Example Execution Commands
```bash
# Layer 1: Parallel generation of scene main images (default 16:9)
python .opencode/skills/image-service/scripts/text_to_image.py "Style description, Scene 1 content" -r 16:9 -o scene1/main.png &
python .opencode/skills/image-service/scripts/text_to_image.py "Style description, Scene 2 content" -r 16:9 -o scene2/main.png &
wait

# Layer 2: Parallel image-to-image generation of close-up shots (same aspect ratio)
python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "Maintain character style, close-up description" -r 16:9 -o scene1/shot_01.png &
python .opencode/skills/image-service/scripts/image_to_image.py scene1/main.png "Maintain character style, close-up description" -r 16:9 -o scene1/shot_02.png &
wait

# Layer 3: Generate dubbing + validation + video synthesis
python .opencode/skills/video-creator/scripts/tts_generator.py --text "Full narration" --output narration.mp3 --timestamps

# ⛔ Mandatory validation (must pass before synthesis!)
python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

# Synthesize only after validation passes
python .opencode/skills/video-creator/scripts/video_maker.py video_config.yaml --srt subtitles.srt --bgm epic
```
Video Configuration File Format
```yaml
# video_config.yaml
ratio: "16:9"        # Default landscape! Must be quoted to avoid YAML parsing errors
bgm_volume: 0.12
outro: true
scenes:
  - audio: narration.mp3
    images:
      # Arrange all close-up shots in scene order
      - file: scene1/shot_01.png
        duration: 4.34
      - file: scene1/shot_02.png
        duration: 4.88
      - file: scene2/shot_01.png
        duration: 2.15
      # ...
```
Note: the `ratio` value must be enclosed in quotes, e.g., `ratio: "16:9"`, otherwise YAML will parse it as a base-60 number rather than a string.
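The quoting requirement is easy to demonstrate: PyYAML follows YAML 1.1, in which an unquoted `16:9` matches the base-60 (sexagesimal) integer form:

```python
import yaml  # PyYAML, already a dependency of this skill

print(yaml.safe_load("ratio: 16:9"))    # {'ratio': 969}  (16*60 + 9, parsed as a base-60 integer)
print(yaml.safe_load('ratio: "16:9"'))  # {'ratio': '16:9'}
```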
Duration Allocation Specification (Rule!)
Before generating video_config.yaml, strictly calculate duration following this workflow:
Step 1: Read Timestamp File
```python
import json

with open("narration.json", "r") as f:
    timestamps = json.load(f)

audio_duration = timestamps[-1]["end"]
print(f"Total audio duration: {audio_duration:.1f}s")
```
Step 2: Divide Scenes by Content Semantics
Determine the time period corresponding to each image based on the narration content:
```python
# Example: division based on narration content
# Find the timestamp points where the theme changes
scenes = [
    ("cover.png", 0, 12.5),     # Opening to first theme change
    ("scene01.png", 12.5, 26),  # Second segment
    # ... divide precisely at sentence boundaries in narration.json
]
```
Step 3: Calculate Duration for Each Image
```python
for file, start, end in scenes:
    duration = end - start
    print(f"{file}: {duration:.1f}s")
```
Step 4: Validate Total Duration
```python
total_duration = sum(end - start for _, start, end in scenes)
assert abs(total_duration - audio_duration) < 1.0, \
    f"Duration mismatch! Total image duration {total_duration}s vs audio duration {audio_duration}s"
```
Rules
- Read the timestamps from narration.json first; never estimate by feel
- Divide at sentence semantic boundaries; never distribute evenly
- Validate before generating the configuration: total image duration must match the audio duration (error < 1s)
- Do not rely on the script to auto-stretch; a video with audio-visual desync is unqualified
- Never use duration=0; each image must be displayed for at least 1 second
- After generating the YAML, validate with verify_alignment.py before synthesis
Validation Script (Mandatory Execution! Must Pass Before Synthesis!)
```bash
# ⛔ Must run the validation script before video synthesis. Synthesis is forbidden if it fails!
python .opencode/skills/video-creator/scripts/verify_alignment.py video_config.yaml

# Validation content:
# 1. Whether all image files exist
# 2. Whether each duration is a precise numerical value (non-numeric values are rejected outright!)
# 3. Total image duration vs total audio duration (error < 1s)
# 4. Whether each image duration is within the reasonable range (1-7s)
# 5. Semantic cross-check between image filename keywords and narration keywords
# 6. Output of a complete comparison table (Image + Duration + Semantic ✅/❌ + corresponding narration text)

# Exit code 0 = pass, 1 = fail
# If it fails, synthesis is forbidden: fix and re-validate!
```
Note: video_maker.py also has built-in hard validation — duration must be a precise positive float. Missing or non-numeric values will trigger immediate exit(1)! duration: auto has been completely removed!
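At its core, the total-duration check (item 3 above) reduces to a loop like this sketch. The config is shown as an already-parsed dict with the same field names as video_config.yaml; the thresholds mirror the rules above, and the sample numbers are illustrative:

```python
# Hedged sketch of the total-duration and per-image range checks.
config = {
    "scenes": [{
        "audio": "narration.mp3",
        "images": [
            {"file": "scene1/shot_01.png", "duration": 4.34},
            {"file": "scene1/shot_02.png", "duration": 4.88},
        ],
    }]
}
audio_duration = 9.20  # would normally come from narration.json

total = 0.0
for scene in config["scenes"]:
    for img in scene["images"]:
        d = img["duration"]
        # Reject non-numeric or out-of-range durations outright
        assert isinstance(d, (int, float)) and 1.0 <= d <= 7.0, f"bad duration: {img}"
        total += d

assert abs(total - audio_duration) < 1.0, f"mismatch: {total:.2f}s vs {audio_duration:.2f}s"
print(f"OK: images {total:.2f}s vs audio {audio_duration:.2f}s")
```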
Duration Allocation Table Template
Output the allocation table for confirmation before generating configuration:
```markdown
| Image | Corresponding Content | Start | End | Duration |
|-------------|-----------------------|-------|-----|----------|
| cover.png | Opening Introduction | 0s | 12.5s | 12.5s |
| scene01.png | AI Agent Era | 12.5s | 26s | 13.5s |
| ... | ... | ... | ... | ... |
| **Total** | | | | **{total}s** |

Total Audio Duration: {audio_duration}s
Difference: {diff}s ✅/❌
```
Subtitle Specification
Subtitles use ASS format, fixed at bottom center position:
- Position: Bottom center (Alignment=2)
- Font: PingFang SC
- Size: Screen height / 40
- Stroke: 2px black stroke + 1px shadow
- Bottom Margin: Screen height / 20
Forbid: Subtitles moving randomly, inconsistent sizes, or unfixed positions
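For a 1080p frame, those rules translate into an ASS style line roughly like the sketch below. Every field other than font, size, outline, shadow, alignment, and bottom margin is an illustrative default, not confirmed from video_maker.py:

```python
height = 1080
font_size = height // 40  # screen height / 40 -> 27
margin_v = height // 20   # bottom margin: screen height / 20 -> 54

# Illustrative ASS Style line; field order follows the ASS [V4+ Styles] format.
style = (
    "Style: Default,PingFang SC,"
    f"{font_size},&H00FFFFFF,&H00FFFFFF,&H00000000,&H80000000,"
    "0,0,0,0,100,100,0,0,1,2,1,"  # ..., BorderStyle=1, Outline=2, Shadow=1
    f"2,30,30,{margin_v},1"        # Alignment=2 (bottom center), margins, Encoding
)
print(style)
```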
Script Parameter Description
video_maker.py
```bash
python video_maker.py config.yaml [options]
```
| Parameter | Description | Default Value |
|---|---|---|
| | Do not add outro | Add outro |
| | Do not add BGM | Add BGM |
| | Transition duration (seconds) | 0.5 |
| | BGM volume | 0.08 |
| `--bgm` | Custom BGM (Optional: epic) | Default tech style |
| | Video aspect ratio | 16:9 (overridden by configuration file) |
| `--srt` | Subtitle file path | None |
verify_alignment.py
```bash
python verify_alignment.py video_config.yaml
```
| Validation Item | Description |
|---|---|
| Image Existence | All image files must exist |
| Duration Precision | Must be a positive float, forbid auto/empty/non-numeric values |
| Total Duration Match | Total image duration vs Total audio duration, error < 1s |
| Single Image Duration Range | Each image 1-7s, warning if exceeded |
| Semantic Cross-Check | Match between image filename keywords and voice content keywords |
tts_generator.py
```bash
python tts_generator.py --text "Text" --output audio.mp3 [options]
```
| Parameter | Description | Default Value |
|---|---|---|
| | Voice ID | zh-CN-YunxiNeural |
| | Speech rate | +0% |
| `--timestamps` | Output timestamp JSON | No |
Supported Video Aspect Ratios
Consistent with image-service, supports 10 aspect ratios:
| Aspect Ratio | Resolution | Applicable Scenario |
|---|---|---|
| 1:1 | 1024×1024 | Square, Moments |
| 2:3 | 832×1248 | Vertical poster |
| 3:2 | 1248×832 | Horizontal poster |
| 3:4 | 1080×1440 | Xiaohongshu, Moments |
| 4:3 | 1440×1080 | Traditional monitor |
| 4:5 | 864×1080 | Instagram |
| 5:4 | 1080×864 | Horizontal photo |
| 9:16 | 1080×1920 | Douyin, Video Account, Vertical screen |
| 16:9 | 1920×1080 | Bilibili, YouTube, Official Account Videos, Landscape |
| 21:9 | 1536×672 | Ultra-wide screen movie |
Outro Specification
Rule: All videos must auto-append the corresponding outro based on aspect ratio!
Outro Matching Order:
- Exact match: outro file named for the ratio (e.g., outro_3x4.mp4 for 3:4)
- Orientation match: Portrait → outro_9x16.mp4, Landscape → outro.mp4
- Fallback: outro.mp4
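That matching order can be sketched as a small helper. The function name and the ratio-to-file map are illustrative; the filenames are the outro assets listed in the directory structure below:

```python
def pick_outro(ratio: str) -> str:
    """Pick an outro clip: exact ratio match, then orientation, then fallback."""
    exact = {"16:9": "outro.mp4", "9:16": "outro_9x16.mp4", "3:4": "outro_3x4.mp4"}
    if ratio in exact:
        return exact[ratio]
    w, h = map(int, ratio.split(":"))
    return "outro_9x16.mp4" if h > w else "outro.mp4"  # portrait vs landscape/square

print(pick_outro("3:4"))   # outro_3x4.mp4
print(pick_outro("2:3"))   # outro_9x16.mp4 (portrait orientation)
print(pick_outro("21:9"))  # outro.mp4 (landscape fallback)
```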
BGM Resources
Classification by Style
| Style | File | Shortcut Parameter | Applicable Scenario |
|---|---|---|---|
| Ancient Chinese Style | | | Water Margin, Romance of the Three Kingdoms, Historical Stories |
| | | | Martial Arts, Action, Battle |
| | | | Zen, Meditation, Traditional Chinese Studies |
| | | | Pastoral, Leisure, Scenery |
| Healing/Relaxing | | | Vlog, Daily Life, Lifestyle |
| | | | Fantasy, Memories, Warmth |
| | | | Cheerful, Bright, Sunny |
| Passionate/Epic | | | Inspirational, Battle, High-energy |
| | | | Heroes, Victory, Glory |
| | | | War, Epic, Grand Scale |
| | | | Cinematic, Narrative, Emotional |
| Tech/Futuristic | | | AI, Products, Tutorials |
| | | | Digital, Internet, Online Services |
| | | | Sci-fi, Space, Mystery |
| Suspense/Tense | | | Mystery, Tense, Crisis |
| | | | Horror, Dark, Suspense |
| Cheerful/Lively | | | Comedy, Relaxing, Rhythmic |
| | | | Cute, Children, Animation |
| | | | Playful, Funny, Short Videos |
| Electronic/Rhythmic | | | Dynamic, Editing, Beat-matching |
| | | | Electronic, Modern, Trendy |
| | | | Games, 8-bit, Retro Electronic |
| | | | Rap, Street, Trendy |
Usage
```bash
# Method 1: Shortcut parameter (recommended)
python video_maker.py config.yaml --bgm epic
python video_maker.py config.yaml --bgm ancient
python video_maker.py config.yaml --bgm edm

# Method 2: Full filename
python video_maker.py config.yaml --bgm bgm_ancient_tale.mp3

# Method 3: Custom path
python video_maker.py config.yaml --bgm /path/to/custom.mp3
```
Common Voice IDs
| Voice ID | Style |
|---|---|
| zh-CN-YunyangNeural | Male, News Broadcast |
| zh-CN-YunxiNeural | Male, Sunny and Lively |
| zh-CN-XiaoxiaoNeural | Female, Warm and Natural |
| zh-CN-XiaoyiNeural | Female, Lively and Cute |
Directory Structure
```
video-creator/
├── SKILL.md
├── scripts/
│   ├── video_maker.py        # Main script: Images + Audio → Video (built-in duration hard check)
│   ├── verify_alignment.py   # Mandatory pre-synthesis validation (duration + semantic cross-check)
│   ├── tts_generator.py      # TTS voice generation
│   └── scene_splitter.py     # Scene splitter (optional)
├── assets/
│   ├── outro.mp4             # General outro (16:9)
│   ├── outro_9x16.mp4        # Portrait outro
│   ├── outro_3x4.mp4         # 3:4 outro
│   └── bgm_*.mp3             # 22 BGM tracks (see BGM resource table)
└── references/
    └── edge_tts_voices.md
```
Dependencies
```bash
# System dependencies
brew install ffmpeg  # Mac

# Python dependencies
pip install edge-tts pyyaml
```