Video Generation


Generate videos using `generate_media` with `mode="video"`. The system auto-selects the best backend based on available API keys.

Quick Start


```python
# Simple text-to-video (auto-selects backend)
generate_media(prompt="A robot walking through a city", mode="video")

# Specify backend and duration
generate_media(prompt="Ocean waves crashing on rocks", mode="video", backend_type="google", duration=8)

# With aspect ratio
generate_media(prompt="A timelapse of clouds", mode="video", backend_type="grok", aspect_ratio="16:9", duration=10)
```

Backend Comparison


| Backend | Default Model | Duration Range | Default Duration | Resolutions | API Key |
|---|---|---|---|---|---|
| Grok (priority 1) | `grok-imagine-video` | 1-15s | 5s | 480p, 720p | `XAI_API_KEY` |
| Google Veo (priority 2) | `veo-3.1-generate-preview` | 4-8s | 8s | 720p, 1080p, 4K (use `size`); default 16:9 | `GOOGLE_API_KEY` |
| OpenAI Sora (priority 3) | `sora-2` | 4, 8, or 12s (discrete) | 4s | Standard | `OPENAI_API_KEY` |
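The priority column implies key-based auto-selection along these lines. This is a sketch of plausible selection logic, not the tool's actual implementation; `select_backend` is an illustrative name, while the environment-variable names come from the table above:

```python
import os

# Priority order and required API key per backend, from the table above.
BACKEND_PRIORITY = [
    ("grok", "XAI_API_KEY"),
    ("google", "GOOGLE_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]

def select_backend(env=os.environ):
    """Pick the highest-priority backend whose API key is set."""
    for backend, key in BACKEND_PRIORITY:
        if env.get(key):
            return backend
    raise RuntimeError("no video backend API key configured")
```

Passing `backend_type` explicitly bypasses this selection entirely.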

Key Parameters


| Parameter | Description | Example |
|---|---|---|
| `prompt` | Text description of the video | `"A drone flying over mountains"` |
| `backend_type` | Force a specific backend | `"grok"`, `"google"`, `"openai"` |
| `model` | Override default model | `"veo-3.1-generate-preview"` |
| `duration` | Video length in seconds | `8` (clamped to backend limits) |
| `aspect_ratio` | Video aspect ratio | `"16:9"`, `"9:16"`, `"1:1"` |
| `size` | Resolution (Grok: 480p/720p; Veo: 720p/1080p/4k) | `"720p"`, `"1080p"`, `"4k"` |
| `input_images` | Source image for image-to-video | `["starting_frame.jpg"]` |
| `video_reference_images` | Style/content guide images (Veo, up to 3) | `["ref1.png", "ref2.png"]` |
| `negative_prompt` | What to exclude (Veo) | `"blurry, low quality"` |

Duration Handling


Each backend has different duration constraints. `generate_media` automatically clamps the requested duration:
  • Grok: continuous range 1-15s (clamped to bounds)
  • Google Veo: continuous range 4-8s (clamped to bounds); defaults to 16:9 aspect ratio
  • OpenAI Sora: discrete values only (4, 8, or 12s); snaps to the nearest valid value
A warning is logged if the duration is adjusted.
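The clamping rules above can be sketched in Python. `clamp_duration` is an illustrative helper, not part of the `generate_media` API; the limits mirror the backend table:

```python
SORA_DURATIONS = (4, 8, 12)  # Sora accepts only these discrete lengths

def clamp_duration(backend: str, requested: float) -> float:
    """Clamp a requested duration to what the backend supports."""
    if backend == "grok":
        return min(max(requested, 1), 15)   # continuous 1-15s
    if backend == "google":
        return min(max(requested, 4), 8)    # continuous 4-8s
    if backend == "openai":
        # Snap to the nearest discrete Sora value.
        return min(SORA_DURATIONS, key=lambda d: abs(d - requested))
    raise ValueError(f"unknown backend: {backend}")
```

For example, requesting 30s from Grok yields 15s, and requesting 11s from Sora snaps to 12s.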

Image-to-Video


All three video backends support starting video from an existing image via `input_images`:

```python
generate_media(
    prompt="Animate this scene with gentle movement",
    mode="video",
    input_images=["scene.jpg"],
    duration=5
)
```

The first image in `input_images` is used; additional images are ignored.

Generation Time


Video generation is significantly slower than image generation. All backends use polling:
  • Grok: SDK handles polling internally (up to 10 min timeout)
  • Google Veo: custom polling every 20s (up to 10 min)
  • OpenAI Sora: custom polling every 2s
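A generic version of such a polling loop might look like the following. This is an illustrative sketch, not the backends' actual code; `poll_until_done` and `get_status` are hypothetical names:

```python
import time

def poll_until_done(get_status, interval_s, timeout_s=600):
    """Poll `get_status` until it reports completion or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "done":
            return True
        time.sleep(interval_s)  # e.g. 20s for Veo, 2s for Sora
    raise TimeoutError("video generation did not finish within the timeout")
```

Because a single clip can take minutes, prefer background mode and batch your generations rather than blocking on each clip serially.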

Veo 3.1: Native Audio


Veo 3.1 generates audio (dialogue, SFX, ambient) automatically from prompt content. No extra parameter is needed; just describe the sounds:
  • Dialogue: use quotation marks in the prompt (`"Hello," she said.`)
  • Sound effects: describe the sounds (`tires screeching, engine roaring`)
  • Ambient: describe the atmosphere (`eerie hum resonates through the hallway`)

Veo 3.1: Extension Constraints


When extending videos via `continue_from` with a `veo_vid_*` ID:
  • Resolution is forced to 720p (an API requirement for extensions)
  • Only 16:9 and 9:16 aspect ratios are supported
  • Each extension adds up to 7 seconds (API limit: 20 extensions, ~141s total)
  • Generated videos are retained for 2 days before expiry

Producing Longer Videos


Current APIs cap at 15 seconds per clip (Grok), with most backends at 4-8s. There is no way to generate a continuous 30+ second video in one call. The proven approach:
  1. Plan a shot list: break your video into 6-8s segments with specific camera language per shot
  2. Generate clips in parallel: launch all segments concurrently using `background=True`
  3. Composite in Remotion (see below): layer programmatic animation on top of generated footage
  4. Bridge with audio: a unified narration or music track smooths over visual cuts between clips
For visual continuity, use the same style anchor in every prompt (e.g., "BBC Earth documentary cinematography") and maintain consistent lighting/color descriptions.
Full production guide with examples, transition types, and duration strategy: see references/production.md
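The shot-list steps above can be sketched as follows. The shot descriptions are illustrative, and `generate_media` is the real tool at runtime; a stand-in stub is defined here only so the sketch is self-contained:

```python
def generate_media(**kwargs):
    # Stub standing in for the real tool, which submits a generation job.
    return {"prompt": kwargs["prompt"], "background": kwargs.get("background", False)}

STYLE = "BBC Earth documentary cinematography"  # same style anchor in every prompt

shots = [  # illustrative 3-shot plan, one 6-8s clip per shot
    "Aerial establishing shot of a misty valley at dawn, slow push-in",
    "Low tracking shot through tall grass, shallow depth of field",
    "Close-up of dew on a spider web, rack focus to the horizon",
]

# Launch every segment concurrently with background=True, then composite
# the finished clips in Remotion and bridge them with one audio track.
jobs = [
    generate_media(prompt=f"{shot}. {STYLE}", mode="video",
                   duration=8, background=True)
    for shot in shots
]
```

Appending the same style anchor to every prompt is what keeps the parallel clips visually consistent enough to cut together.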

Hybrid Workflow: AI Footage + Remotion Animation


The best results come from combining AI-generated footage with Remotion's programmatic animation, not choosing one or the other. AI video generation produces photorealistic, cinematic footage that pure programmatic rendering cannot match. Remotion produces precise typography, motion graphics, overlays, and transitions that AI generation cannot reliably control. Use both together.

The Rule: Generate First, Composite Second


  1. Generate AI clips for cinematic/photorealistic shots (environments, product demos, atmospheric footage)
  2. Use those clips as visual foundations in Remotion: import them as `<Video>` or `<OffthreadVideo>` background layers
  3. Composite programmatic elements on top: typography, motion graphics, logos, data overlays, transitions, captions
  4. Fill gaps with pure Remotion animation: title cards, intro sequences, motion-graphics-only segments where AI footage isn't needed

Do NOT Discard Generated Clips


Every AI-generated clip costs real money and time. Do not abandon generated footage and fall back to purely programmatic rendering. This is a common failure mode: agents generate clips, notice minor artifacts (e.g., repeated patterns, slight distortion), then pivot entirely to OpenCV/PIL/moviepy rendering, wasting all the generation budget.
Instead:

| Situation | Wrong Approach | Right Approach |
|---|---|---|
| Minor artifacts in generated clip | Discard clip, render from scratch with OpenCV | Use clip as background, mask artifacts with overlays/motion graphics |
| Generated clip doesn't match vision exactly | Regenerate or abandon | Composite typography/effects on top to guide the viewer's attention |
| Need precise text/logo placement | Skip AI generation, use pure programmatic | Generate atmospheric footage, overlay text in Remotion |
| Some shots need AI footage, others don't | Use one approach for everything | Mix: AI-backed shots + pure Remotion animation shots |

Cost Awareness


Each `generate_media(mode="video")` call is expensive. Plan before generating:
  • Decide which shots need AI footage before generating anything; not every shot needs it
  • Generate only what you'll use; don't speculatively generate 8 clips hoping some work out
  • Review and use what you generate: analyze each clip with `read_media`, then plan your Remotion composition around the actual footage
  • One good clip composited well beats five unused clips; invest in composition quality over generation quantity

Post-Production: Always Use Remotion


Remotion is the default post-production tool for any video that needs editing beyond simple concatenation. This includes captions, titles, transitions, overlays, and motion graphics: essentially any video intended to look professional. Do not use raw ffmpeg `drawtext` or manual filter chains for these tasks; the results look amateur compared to what Remotion produces.
When you have video clips to assemble, load the Remotion skill and use it. This is not optional for professional output.

Loading the Remotion Skill


Load the skill to get detailed rules and code examples:

What Remotion Gives You


| Capability | Remotion | Raw ffmpeg |
|---|---|---|
| Styled animated captions | CSS-styled, word-level highlighting, animations | `drawtext`: ugly, painful escaping |
| Title cards / lower thirds | React components, any font/layout | Manual positioning, limited fonts |
| Scene transitions | Timing curves, spring animations, custom effects | Basic xfade (fade, wipe) |
| Motion graphics | Full React/CSS/Three.js/Lottie ecosystem | Not possible |
| Light leak / overlay effects | Built-in `@remotion/light-leaks` | Complex filter chains |
| Text animations | Typography effects, per-character animation | Not feasible |
| AI footage + overlays | Import clips as `<Video>`, layer React components on top | Not feasible at quality |

When ffmpeg Alone Is Sufficient


Only use ffmpeg without Remotion for:
  • Concatenating clips with no captions, titles, or transitions (just hard cuts)
  • Audio mixing / ducking (ffmpeg or Pydub)
  • Color grading via LUT files (the `lut3d` filter)
  • Quick format conversion or rescaling

Workflow


  1. Generate AI clips with `generate_media` (parallel, background mode) for shots that need cinematic/photorealistic quality
  2. Review clips with `read_media`: assess what you have and plan the composition around the actual footage
  3. Generate audio (narration, music) with `generate_media(mode="audio")`
  4. Load the Remotion skill and set up a Remotion project
  5. Composite in Remotion: import AI clips as `<Video>` background layers, overlay typography/motion graphics/captions, and add pure-animation segments for title cards and transitions
  6. Render via Remotion's headless renderer

Key Remotion Rule Files to Load


When working on a specific task, load the relevant rule files from the Remotion skill:
  • Captions/subtitles: `rules/subtitles.md`, `rules/display-captions.md`, `rules/transcribe-captions.md`
  • Transitions: `rules/transitions.md`
  • Text animations: `rules/text-animations.md`
  • Light leaks: `rules/light-leaks.md`
  • Audio: `rules/audio.md`, `rules/audio-visualization.md`
  • Sequencing/timeline: `rules/sequencing.md`, `rules/trimming.md`
  • 3D motion graphics: `rules/3d.md`
  • Animations/timing: `rules/animations.md`, `rules/timing.md`

Need More Control?


  • Per-backend resolution, duration details, and quirks: see references/backends.md
  • Video continuation, remix, and image-to-video: see references/editing.md
  • Multi-shot production, transitions, and cinematic workflow: see references/production.md