Video Generation


Generate videos using `generate_media` with `mode="video"`. The system auto-selects the best backend based on available API keys.

Quick Start


```python
# Simple text-to-video (auto-selects backend)
generate_media(prompt="A robot walking through a city", mode="video")

# Specify backend and duration
generate_media(prompt="Ocean waves crashing on rocks", mode="video", backend_type="google", duration=8)

# With aspect ratio
generate_media(prompt="A timelapse of clouds", mode="video", backend_type="grok", aspect_ratio="16:9", duration=10)
```

Backend Comparison


| Backend | Default Model | Duration Range | Default Duration | Resolutions | API Key |
|---|---|---|---|---|---|
| Grok (priority 1) | `grok-imagine-video` | 1-15s | 5s | 480p, 720p | `XAI_API_KEY` |
| Google Veo (priority 2) | `veo-3.1-generate-preview` | 4-8s | 8s | 720p, 1080p, 4K (use `size`); default 16:9 | `GOOGLE_API_KEY` |
| OpenAI Sora (priority 3) | `sora-2` | 4, 8, or 12s (discrete) | 4s | Standard | `OPENAI_API_KEY` |
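The priority column implies key-based auto-selection along these lines. This is a sketch of plausible selection logic, not the tool's actual implementation; `select_backend` is an illustrative name, while the environment-variable names come from the table above:

```python
import os

# Priority order and required API key per backend, from the table above.
BACKEND_PRIORITY = [
    ("grok", "XAI_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]

def select_backend(env=os.environ):
    """Pick the highest-priority backend whose API key is set."""
    for backend, key in BACKEND_PRIORITY:
        if env.get(key):
            return backend
    raise RuntimeError("no video backend API key configured")
```

Passing `backend_type` explicitly bypasses this selection entirely.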

Key Parameters


| Parameter | Description | Example |
|---|---|---|
| `prompt` | Text description of the video | `"A drone flying over mountains"` |
| `backend_type` | Force a specific backend | `"grok"`, `"google"`, `"openai"` |
| `model` | Override default model | `"veo-3.1-generate-preview"` |
| `duration` | Video length in seconds | `8` (clamped to backend limits) |
| `aspect_ratio` | Video aspect ratio | `"16:9"`, `"9:16"`, `"1:1"` |
| `size` | Resolution (Grok: 480p/720p; Veo: 720p/1080p/4k) | `"720p"`, `"1080p"`, `"4k"` |
| `input_images` | Source image for image-to-video | `["starting_frame.jpg"]` |
| `video_reference_images` | Style/content guide images (Veo, up to 3) | `["ref1.png", "ref2.png"]` |
| `negative_prompt` | What to exclude (Veo) | `"blurry, low quality"` |

Duration Handling


Each backend has different duration constraints. `generate_media` automatically clamps the requested duration:
  • Grok: continuous range 1-15s (clamped to bounds)
  • Google Veo: continuous range 4-8s (clamped to bounds); defaults to 16:9 aspect ratio
  • OpenAI Sora: discrete values only (4, 8, or 12s); snaps to the nearest valid value
A warning is logged if the duration is adjusted.
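The clamping rules above can be sketched in Python. `clamp_duration` is an illustrative helper, not part of the `generate_media` API; the limits mirror the backend table:

```python
SORA_DURATIONS = (4, 8, 12)  # Sora accepts only these discrete lengths

def clamp_duration(backend: str, requested: float) -> float:
    """Clamp a requested duration to what the backend supports."""
    if backend == "grok":
        return min(max(requested, 1), 15)   # continuous 1-15s
    if backend == "google":
        return min(max(requested, 4), 8)    # continuous 4-8s
    if backend == "openai":
        # Snap to the nearest discrete Sora value.
        return min(SORA_DURATIONS, key=lambda d: abs(d - requested))
    raise ValueError(f"unknown backend: {backend}")
```

For example, requesting 30s from Grok yields 15s, and requesting 11s from Sora snaps to 12s.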

Image-to-Video


All three video backends support starting video from an existing image via `input_images`:

```python
generate_media(
    prompt="Animate this scene with gentle movement",
    mode="video",
    input_images=["scene.jpg"],
    duration=5
)
```

The first image in `input_images` is used; additional images are ignored.

Generation Time


Video generation is significantly slower than image generation. All backends use polling:
  • Grok: SDK handles polling internally (up to 10 min timeout)
  • Google Veo: custom polling every 20s (up to 10 min)
  • OpenAI Sora: custom polling every 2s
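A generic version of such a polling loop might look like the following. This is an illustrative sketch, not the backends' actual code; `poll_until_done` and `get_status` are hypothetical names:

```python
import time

def poll_until_done(get_status, interval_s, timeout_s=600):
    """Poll `get_status` until it reports completion or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "done":
            return True
        time.sleep(interval_s)  # e.g. 20s for Veo, 2s for Sora
    raise TimeoutError("video generation did not finish within the timeout")
```

Because a single clip can take minutes, prefer background mode and batch your generations rather than blocking on each clip serially.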

Veo 3.1: Native Audio


Veo 3.1 generates audio (dialogue, SFX, ambient) automatically from prompt content. No extra parameter is needed; just describe the sounds:
  • Dialogue: use quotation marks in the prompt (`"Hello," she said.`)
  • Sound effects: describe the sounds (`tires screeching, engine roaring`)
  • Ambient: describe the atmosphere (`eerie hum resonates through the hallway`)

Veo 3.1: Extension Constraints


When extending videos via `continue_from` with a `veo_vid_*` ID:
  • Resolution is forced to 720p (an API requirement for extensions)
  • Only 16:9 and 9:16 aspect ratios are supported
  • Each extension adds up to 7 seconds (API limit: 20 extensions, ~141s total)
  • Generated videos are retained for 2 days before expiry

Producing Longer Videos


Current APIs cap at 15 seconds per clip (Grok), with most backends at 4-8s. There is no way to generate a continuous 30+ second video in one call. The proven approach:
  1. Plan a shot list: break your video into 6-8s segments with specific camera language per shot
  2. Generate clips in parallel: launch all segments concurrently using `background=True`
  3. Composite in Remotion (see below): layer programmatic animation on top of generated footage
  4. Bridge with audio: a unified narration or music track smooths over visual cuts between clips
For visual continuity, use the same style anchor in every prompt (e.g., "BBC Earth documentary cinematography") and maintain consistent lighting/color descriptions.
Full production guide with examples, transition types, and duration strategy: see references/production.md
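The shot-list steps above can be sketched as follows. The shot descriptions are illustrative, and `generate_media` is the real tool at runtime; a stand-in stub is defined here only so the sketch is self-contained:

```python
def generate_media(**kwargs):
    # Stub standing in for the real tool, which submits a generation job.
    return {"prompt": kwargs["prompt"], "background": kwargs.get("background", False)}

STYLE = "BBC Earth documentary cinematography"  # same style anchor in every prompt

shots = [  # illustrative 3-shot plan, one 6-8s clip per shot
    "Aerial establishing shot of a misty valley at dawn, slow push-in",
    "Low tracking shot through tall grass, shallow depth of field",
    "Close-up of dew on a spider web, rack focus to the horizon",
]

# Launch every segment concurrently with background=True, then composite
# the finished clips in Remotion and bridge them with one audio track.
jobs = [
    generate_media(prompt=f"{shot}. {STYLE}", mode="video",
                   duration=8, background=True)
    for shot in shots
]
```

Appending the same style anchor to every prompt is what keeps the parallel clips visually consistent enough to cut together.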

Hybrid Workflow: AI Footage + Remotion Animation


The best results come from combining AI-generated footage with Remotion's programmatic animation, not choosing one or the other. AI video generation produces photorealistic, cinematic footage that pure programmatic rendering cannot match. Remotion produces precise typography, motion graphics, overlays, and transitions that AI generation cannot reliably control. Use both together.

The Rule: Generate First, Composite Second


  1. Generate AI clips for cinematic/photorealistic shots (environments, product demos, atmospheric footage)
  2. Use those clips as visual foundations in Remotion: import them as `<Video>` or `<OffthreadVideo>` background layers
  3. Composite programmatic elements on top: typography, motion graphics, logos, data overlays, transitions, captions
  4. Fill gaps with pure Remotion animation: title cards, intro sequences, motion-graphics-only segments where AI footage isn't needed

Do NOT Discard Generated Clips


Every AI-generated clip costs real money and time. Do not abandon generated footage and fall back to purely programmatic rendering. This is a common failure mode: agents generate clips, notice minor artifacts (e.g., repeated patterns, slight distortion), then pivot entirely to OpenCV/PIL/moviepy rendering, wasting all the generation budget.
Instead:

| Situation | Wrong Approach | Right Approach |
|---|---|---|
| Minor artifacts in generated clip | Discard clip, render from scratch with OpenCV | Use clip as background, mask artifacts with overlays/motion graphics |
| Generated clip doesn't match vision exactly | Regenerate or abandon | Composite typography/effects on top to guide the viewer's attention |
| Need precise text/logo placement | Skip AI generation, use pure programmatic | Generate atmospheric footage, overlay text in Remotion |
| Some shots need AI footage, others don't | Use one approach for everything | Mix: AI-backed shots + pure Remotion animation shots |

Cost Awareness


Each `generate_media(mode="video")` call is expensive. Plan before generating:
  • Decide which shots need AI footage before generating anything; not every shot needs it
  • Generate only what you'll use; don't speculatively generate 8 clips hoping some work out
  • Review and use what you generate: analyze each clip with `read_media`, then plan your Remotion composition around the actual footage
  • One good clip composited well beats five unused clips; invest in composition quality over generation quantity

Post-Production: Always Use Remotion


Remotion is the default post-production tool for any video that needs editing beyond simple concatenation. This includes captions, titles, transitions, overlays, and motion graphics: essentially any video intended to look professional. Do not use raw ffmpeg `drawtext` or manual filter chains for these tasks; the results look amateur compared to what Remotion produces.
When you have video clips to assemble, load the Remotion skill and use it. This is not optional for professional output.

Loading the Remotion Skill


Load the skill to get detailed rules and code examples:

What Remotion Gives You


| Capability | Remotion | Raw ffmpeg |
|---|---|---|
| Styled animated captions | CSS-styled, word-level highlighting, animations | `drawtext`: ugly, painful escaping |
| Title cards / lower thirds | React components, any font/layout | Manual positioning, limited fonts |
| Scene transitions | Timing curves, spring animations, custom effects | Basic xfade (fade, wipe) |
| Motion graphics | Full React/CSS/Three.js/Lottie ecosystem | Not possible |
| Light leak / overlay effects | Built-in `@remotion/light-leaks` | Complex filter chains |
| Text animations | Typography effects, per-character animation | Not feasible |
| AI footage + overlays | Import clips as `<Video>`, layer React components on top | Not feasible at quality |

When ffmpeg Alone Is Sufficient


Only use ffmpeg without Remotion for:
  • Concatenating clips with no captions, titles, or transitions (just hard cuts)
  • Audio mixing / ducking (ffmpeg or Pydub)
  • Color grading via LUT files (the `lut3d` filter)
  • Quick format conversion or rescaling

Workflow


  1. Generate AI clips with `generate_media` (parallel, background mode) for shots that need cinematic/photorealistic quality
  2. Review clips with `read_media`: assess what you have and plan the composition around the actual footage
  3. Generate audio (narration, music) with `generate_media(mode="audio")`
  4. Load the Remotion skill and set up a Remotion project
  5. Composite in Remotion: import AI clips as `<Video>` background layers, overlay typography/motion graphics/captions, and add pure-animation segments for title cards and transitions
  6. Render via Remotion's headless renderer

Key Remotion Rule Files to Load


When working on a specific task, load the relevant rule files from the Remotion skill:
  • Captions/subtitles: `rules/subtitles.md`, `rules/display-captions.md`, `rules/transcribe-captions.md`
  • Transitions: `rules/transitions.md`
  • Text animations: `rules/text-animations.md`
  • Light leaks: `rules/light-leaks.md`
  • Audio: `rules/audio.md`, `rules/audio-visualization.md`
  • Sequencing/timeline: `rules/sequencing.md`, `rules/trimming.md`
  • 3D motion graphics: `rules/3d.md`
  • Animations/timing: `rules/animations.md`, `rules/timing.md`

Need More Control?


  • Per-backend resolution, duration details, and quirks: see references/backends.md
  • Video continuation, remix, and image-to-video: see references/editing.md
  • Multi-shot production, transitions, and cinematic workflow: see references/production.md