narrate-video

Video Narration

Add professional voiceover to a video. Analyze the video, write or refine a timed script, generate speech via Azure TTS or Gemini 3.1 Flash TTS, and merge — producing a narrated video where audio and visuals stay in sync.
Input: $ARGUMENTS

Additional resources

  • Voice table and timing estimates: references/voices.md
  • Gemini TTS API and AI Studio request shape: references/gemini-tts.md
  • Python script template: scripts/narration_script_template.py (copy it into the video's directory as narration_script.py and fill in the placeholders)

Phase 0: Setup

Provider

Default to azure unless the user explicitly asks for Gemini or already has GEMINI_API_KEY configured. When using Gemini, use the official Gemini TTS request pattern documented in references/gemini-tts.md.
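
The provider rule can be written down as a small decision function. This is only a sketch: the function name and the substring test for "explicitly asks for Gemini" are assumptions, and it looks at variable names only, never at key values.

```python
def choose_provider(user_request: str, configured_names: set[str]) -> str:
    """Return "gemini" or "azure" following the default-to-Azure rule.

    configured_names holds the NAMES of configured variables only.
    """
    # Hypothetical heuristic: treat any mention of Gemini as an explicit request.
    asked_for_gemini = "gemini" in user_request.lower()
    has_gemini_key = "GEMINI_API_KEY" in configured_names
    return "gemini" if asked_for_gemini or has_gemini_key else "azure"
```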

Language

Ask the user which language they want. Default to English. Look up the voice and speech rate in references/voices.md.

Environment

bash
# 1. Check provider credentials exist (NEVER read or display their values)
scripts/check_env.py azure
# or
scripts/check_env.py gemini
# 2. Check tool dependencies
command -v ffmpeg && command -v ffprobe && command -v python3
# 3. Check Python dependencies
python3 -c "import dotenv" 2>&1
# 4. Azure only
python3 -c "import azure.cognitiveservices.speech" 2>&1

If Azure is selected and AZURE_SPEECH_KEY or AZURE_SPEECH_REGION is missing, ask the user to add them to ~/.narrate_video.env:
AZURE_SPEECH_KEY=your-key-here
AZURE_SPEECH_REGION=your-region-here

If Gemini is selected and GEMINI_API_KEY is missing, ask the user to add it to ~/.narrate_video.env:
GEMINI_API_KEY=your-key-here

Optional override:
GEMINI_TTS_MODEL=gemini-3.1-flash-tts-preview

Then stop. The key is sensitive: only check whether it exists, and never read or display its value.
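
For illustration, a presence-only check could look like the sketch below. It is not the implementation of scripts/check_env.py, which is not shown here; the function name, the REQUIRED table, and the simple NAME=value parsing are all assumptions. The point is that only variable names are reported, and values are discarded as soon as they are known to be non-empty.

```python
# Hypothetical mapping of provider to the variables named in this section.
REQUIRED = {
    "azure": ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"],
    "gemini": ["GEMINI_API_KEY"],
}


def missing_keys(provider: str, env_text: str) -> list[str]:
    """Return required variable names that have no non-empty assignment."""
    present = set()
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if value.strip():
            present.add(name.strip())  # keep the name; the value is dropped
    return [key for key in REQUIRED[provider] if key not in present]
```

A caller would pass the text of ~/.narrate_video.env and print only the returned names.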

Phase 1: Video Analysis

1.1 Metadata

bash
ffprobe -v quiet -print_format json -show_format -show_streams <video>
Record total duration, resolution, frame rate, and whether an audio track exists.
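
The fields to record can be pulled out of the ffprobe JSON with a few lines of Python. This is a sketch: the function name and the shape of the returned dictionary are assumptions, and it relies on the standard ffprobe JSON keys (format.duration, codec_type, width, height, r_frame_rate).

```python
import json
from fractions import Fraction


def summarize_probe(probe_json: str) -> dict:
    """Extract duration, resolution, frame rate, and audio presence."""
    data = json.loads(probe_json)
    streams = data.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    summary = {
        "duration": float(data["format"]["duration"]),
        "has_audio": any(s.get("codec_type") == "audio" for s in streams),
    }
    if video:
        summary["resolution"] = (video["width"], video["height"])
        # r_frame_rate is a rational string such as "30000/1001"
        summary["fps"] = float(Fraction(video["r_frame_rate"]))
    return summary
```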

1.2 Scene extraction

Extract frames at 3–4 second intervals to identify scene transitions:
bash
mkdir -p /tmp/narration-frames
for t in $(seq 0 3 <duration>); do
    ffmpeg -y -ss $t -i <video> -frames:v 1 -q:v 2 /tmp/narration-frames/frame_${t}s.jpg 2>/dev/null
done
Review the frames (use the Read tool to view the images). For each scene transition, note the precise timestamp. Where timing is ambiguous, extract additional frames at 1–2 second intervals to pinpoint the exact moment.
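
The coarse pass and the finer follow-up pass differ only in range and step, so both can come from one helper. The sketch below builds the same ffmpeg invocations as the loop above as argument lists; the function name and its parameters are assumptions, and integer-second timestamps are assumed.

```python
def frame_commands(video: str, start: int, end: int, step: int = 3,
                   out_dir: str = "/tmp/narration-frames") -> list[list[str]]:
    """Build one ffmpeg command per sampled timestamp in [start, end]."""
    commands = []
    for t in range(start, end + 1, step):
        commands.append([
            "ffmpeg", "-y", "-ss", str(t), "-i", video,
            "-frames:v", "1", "-q:v", "2", f"{out_dir}/frame_{t}s.jpg",
        ])
    return commands
```

For example, frame_commands("demo.mp4", 0, 60) would cover a 60-second video at 3-second steps, and frame_commands("demo.mp4", 6, 10, step=1) would zoom in on an ambiguous transition near 8s. Each list can be handed to subprocess.run.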

1.3 Transition map

Build a scene transition table mapping timestamps to visual content:
0s   - Opening screen
3s   - User starts typing
8s   - System begins processing
34s  - Response appears
Narration describing something on screen should start after that content is already visible. Viewers notice when audio arrives before the visuals — it feels disorienting. Narrating slightly after the visual appears feels natural, like a presenter walking you through what you're seeing.

Phase 2: Script Writing

Format

Each narration segment is a (start_seconds, text) tuple:
python
SEGMENTS = [
    (0, "Opening narration here."),
    (8, "Next segment narration..."),
]

Writing guidance

Timing: Leave at least 1 second of silence between segments — this breathing room makes narration feel conversational rather than rushed. Use the timing estimates from references/voices.md to estimate whether text fits: for English, multiply the window (in seconds) by 2.5 words/sec, then take 80% as the safe word count.
Flow: Each segment should connect logically to the next. Transition words ("And", "Now", "So") help, but vary them — three consecutive "And now" transitions sound robotic.
Adapting to input: If the user provided a draft, calibrate its timestamps against the scene analysis, trim text that overflows its time window, and polish the language — but preserve their intent and key points. Without a draft, write narration for each scene based on what's visible.
Gemini prompt hygiene: If using Gemini, keep style instructions separate from the spoken transcript. The script template already wraps text in a safe TRANSCRIPT: preamble because Gemini 3.1 Flash TTS can occasionally read metadata aloud or reject vague prompts.

Pre-flight check

Before generating audio, verify each segment fits:
window = next_segment_start - this_segment_start
max_words = window * words_per_second * 0.8
If a segment is too long, shorten the text now — trimming words is much cheaper than regenerating audio.

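
As a worked case, an 8-second window at 2.5 words/sec gives 8 × 2.5 × 0.8 = 16 words. The sketch below applies the same formula to a whole SEGMENTS list. The function name is an assumption, as is the choice to bound the last segment by the video duration. Counting words with split() suits English; other languages would need the rates from references/voices.md and possibly a different unit.

```python
import math

WORDS_PER_SECOND = 2.5  # English estimate quoted in the writing guidance
SAFETY = 0.8            # keep 80% of the theoretical capacity


def preflight(segments, video_duration, wps=WORDS_PER_SECOND):
    """Return (start, word_count, max_words) for each segment that is too long."""
    problems = []
    for i, (start, text) in enumerate(segments):
        # Assumption: the last segment's window ends at the video duration.
        end = segments[i + 1][0] if i + 1 < len(segments) else video_duration
        max_words = math.floor((end - start) * wps * SAFETY + 1e-9)
        words = len(text.split())
        if words > max_words:
            problems.append((start, words, max_words))
    return problems
```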

Phase 3: Generate the Script

Copy scripts/narration_script_template.py into the video's directory as narration_script.py. Fill in:
  • TTS_PROVIDER as azure or gemini
  • VOICE_NAME from the provider-specific table
  • INPUT_VIDEO and OUTPUT_VIDEO (relative paths only)
  • SEGMENTS from Phase 2

Design notes

These choices come from debugging real production issues:
  • normalize=0 on amix: ffmpeg's amix divides volume by input count by default. With 20 segments, output would be 1/20th volume, which is essentially silent.
  • Discarding original audio: Even mixing original audio at 5% volume produces audible double-voice artifacts.
  • Aborting on overlap: If any segment's audio extends past the next segment's start time, the script stops and reports the problem. Overlapping audio sounds broken.
  • Skipping existing audio files: The script only generates audio for segments without an existing cached file. Azure uses .mp3; Gemini uses .wav. If you change a segment's text, delete the matching seg_XXX.* file before re-running.
  • Gemini retry logic: Gemini 3.1 Flash TTS can occasionally return transient 500 errors. The template retries a few times automatically before failing.
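
To make the amix point concrete, here is a sketch of how a filter graph with normalize=0 might be assembled. It is not taken from the template: the function name and the intermediate labels are assumptions, and it presumes input 0 is the video and inputs 1..N are the segment audio files. Only the [final] output label and normalize=0 come from this document.

```python
def build_filter(start_times: list[float]) -> str:
    """Delay each narration input to its start time, then mix without scaling."""
    parts, labels = [], []
    for i, start in enumerate(start_times, start=1):
        delay_ms = int(round(start * 1000))
        # all=1 applies the same delay to every channel of this input
        parts.append(f"[{i}:a]adelay={delay_ms}:all=1[a{i}]")
        labels.append(f"[a{i}]")
    # normalize=0 stops amix from scaling inputs down by the input count
    parts.append(f"{''.join(labels)}amix=inputs={len(labels)}:normalize=0[final]")
    return ";".join(parts)
```

The string would go to -filter_complex, with video mapped from input 0 and audio mapped from [final] only, so the original audio track never reaches the output.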

Phase 4: Run & Iterate

bash
python3 narration_script.py
If the timing report shows overlaps (gap < 0), decide whether to shorten the text or push the next segment's start time later. If you change text, delete the corresponding cached audio file in narration_segments/ first. If you only change start times, re-run directly.
Keep iterating until all gaps are non-negative.
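
The gap in the timing report is the next segment's start minus the current segment's end. A minimal sketch of that calculation follows; the function name is an assumption, and the audio durations are presumed to have been measured already (for instance with ffprobe on each cached file).

```python
def timing_gaps(starts: list[float], durations: list[float]) -> list[float]:
    """Return the silence in seconds between each segment's end and the next start.

    A negative value means the segment overlaps the one after it.
    """
    gaps = []
    for i in range(len(starts) - 1):
        end = starts[i] + durations[i]
        gaps.append(round(starts[i + 1] - end, 3))
    return gaps
```

With starts [0, 8, 20] and durations [6.5, 13.2, 4.0], the second segment ends at 21.2s and overlaps the third by 1.2s, so either its text is shortened or the third start moves to 22s or later.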

Phase 5: Verification

Run all three checks after every successful build:

Volume

bash
ffmpeg -i <output> -ss 0 -t 30 -af "volumedetect" -vn -f null - 2>&1 | grep -E "mean_volume|max_volume"
Expect mean_volume between -25 and -15 dB and max_volume between -10 and 0 dB. If mean_volume is below -40 dB, the normalize=0 fix isn't applied; check the filter string.
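
The thresholds above can be checked mechanically against the volumedetect log lines. This is a sketch with an assumed function name and return shape. It expects both mean_volume and max_volume lines to be present with numeric values and will raise KeyError otherwise.

```python
import re


def check_volume(log: str) -> dict:
    """Parse volumedetect output and compare it with the expected ranges."""
    levels = {
        name: float(value)
        for name, value in re.findall(r"(mean|max)_volume:\s*(-?[\d.]+) dB", log)
    }
    return {
        "mean": levels["mean"],
        "max": levels["max"],
        "mean_ok": -25 <= levels["mean"] <= -15,
        "max_ok": -10 <= levels["max"] <= 0,
        # Very low mean volume suggests amix is still normalizing
        "normalize_suspect": levels["mean"] < -40,
    }
```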

Silence gaps

bash
ffmpeg -i <output> -af "silencedetect=noise=-30dB:d=0.3" -vn -f null - 2>&1 | grep -E "silence_(start|end)" | head -20
Confirm clean silence between segment transitions. Silence boundaries should match expected segment end/start times.
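
Matching silence boundaries to segment starts by eye gets tedious with many segments, so a small parser can help. In the sketch below both function names and the 0.5-second tolerance are assumptions. It reads the silence_start and silence_end values that silencedetect prints and lists the segment starts that have no silence ending nearby.

```python
import re


def silence_intervals(log: str) -> list[tuple[float, float]]:
    """Pair up silence_start and silence_end values from silencedetect output."""
    starts = [float(x) for x in re.findall(r"silence_start:\s*(-?[\d.]+)", log)]
    ends = [float(x) for x in re.findall(r"silence_end:\s*(-?[\d.]+)", log)]
    return list(zip(starts, ends))


def starts_without_silence(intervals, segment_starts, tolerance=0.5):
    """Return segment starts (after time 0) with no silence ending close by."""
    ends = [end for _, end in intervals]
    return [
        s for s in segment_starts
        if s > 0 and not any(abs(s - e) <= tolerance for e in ends)
    ]
```

An empty result from starts_without_silence suggests every later segment begins right after a detected pause. Any timestamp it returns is worth listening to.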

Audio-video sync

Extract frames at 5–8 key segment start times and view them:
bash
for t in <timestamps>; do
    ffmpeg -y -ss $t -i <output> -frames:v 1 -q:v 2 /tmp/verify_${t}s.jpg 2>/dev/null
done
The on-screen content should already be visible when the narration for that scene begins.


Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Two voices playing | Original audio was mixed in | Only map the [final] audio track, never 0:a |
| Audio nearly silent | amix divided volume by input count | Add :normalize=0 to amix parameters |
| Narration out of sync | Imprecise scene timestamps | Re-extract frames at 1–2s intervals around the problem area |
| Overlap at segment boundary | Previous segment runs too long | Shorten that segment's text or delay the next segment |
| Text changed but audio didn't | Old cached audio file still exists | Delete narration_segments/seg_XXX.* and re-run |
| Audio cut off at video end | Last segment overflows video duration | Shorten to finish 3–4s before video ends |
| Gemini returns 500 | Preview model emitted text tokens instead of audio | Re-run; the template already retries transient failures |
| Gemini reads prompt labels aloud | Prompt classifier failed or prompt was too vague | Keep the transcript explicit and use the template's TRANSCRIPT: wrapper |