narrate-video
Video Narration
Add professional voiceover to a video. Analyze the video, write or refine a timed script, generate speech via Azure TTS or Gemini 3.1 Flash TTS, and merge — producing a narrated video where audio and visuals stay in sync.
Input: $ARGUMENTS
Additional resources
- Voice table and timing estimates: references/voices.md
- Gemini TTS API and AI Studio request shape: references/gemini-tts.md
- Python script template: scripts/narration_script_template.py (copy it into the video's directory as `narration_script.py` and fill in the placeholders)
Phase 0: Setup
Provider
Default to `azure` unless the user explicitly asks for Gemini or already has `GEMINI_API_KEY` configured. When using Gemini, use the official Gemini TTS request pattern documented in references/gemini-tts.md.
Language
Ask the user which language they want. Default to English. Look up the voice and speech rate in references/voices.md.
Environment
```bash
# 1. Check provider credentials exist (NEVER read or display their values)
scripts/check_env.py azure
# or
scripts/check_env.py gemini

# 2. Check tool dependencies
command -v ffmpeg && command -v ffprobe && command -v python3

# 3. Check Python dependencies
python3 -c "import dotenv" 2>&1

# 4. Azure only
python3 -c "import azure.cognitiveservices.speech" 2>&1
```
If Azure is selected and `AZURE_SPEECH_KEY` or `AZURE_SPEECH_REGION` is missing, ask the user to add them to `~/.narrate_video.env`:
AZURE_SPEECH_KEY=your-key-here
AZURE_SPEECH_REGION=your-region-here
If Gemini is selected and `GEMINI_API_KEY` is missing, ask the user to add it to `~/.narrate_video.env`:
GEMINI_API_KEY=your-key-here
Optional override:
GEMINI_TTS_MODEL=gemini-3.1-flash-tts-preview
Then stop. The key is sensitive: only check whether it exists, never read or display its value.
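The presence check described above can be sketched as follows. This is not the actual scripts/check_env.py, whose contents are not shown here; `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the sketch assumes a simple `NAME=value` file format. It reports key names only and never prints a value.

```python
from pathlib import Path

# Key names come from the setup text above; the helper itself is hypothetical.
REQUIRED_KEYS = {
    "azure": ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"],
    "gemini": ["GEMINI_API_KEY"],
}


def missing_keys(env_text: str, provider: str) -> list[str]:
    """Return required key names that are absent or empty in dotenv-style text."""
    present = set()
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value.strip():
            present.add(name.strip())  # keep the name only, discard the value
    return [key for key in REQUIRED_KEYS[provider] if key not in present]


if __name__ == "__main__":
    env_path = Path.home() / ".narrate_video.env"
    text = env_path.read_text() if env_path.exists() else ""
    for provider in REQUIRED_KEYS:
        missing = missing_keys(text, provider)
        print(provider, "ok" if not missing else "missing: " + ", ".join(missing))
```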
---
Phase 1: Video Analysis
1.1 Metadata
```bash
ffprobe -v quiet -print_format json -show_format -show_streams <video>
```
Record total duration, resolution, frame rate, and whether an audio track exists.
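As a rough illustration, the four values to record can be pulled out of the ffprobe JSON like this. `summarize_probe` is a hypothetical helper, and the sample dictionary is made-up data in the shape ffprobe emits with `-show_format -show_streams`; in practice the input would be `json.loads` of the command's stdout.

```python
from fractions import Fraction


def summarize_probe(probe: dict) -> dict:
    """Extract duration, resolution, frame rate, and audio presence."""
    streams = probe.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    if video is None:
        raise ValueError("no video stream found")
    return {
        "duration_s": float(probe["format"]["duration"]),
        "resolution": (video["width"], video["height"]),
        # ffprobe reports frame rates as a fraction string such as "30000/1001"
        "fps": float(Fraction(video["avg_frame_rate"])),
        "has_audio": any(s.get("codec_type") == "audio" for s in streams),
    }


# Illustrative input only, not output from a real file
sample = {
    "format": {"duration": "42.500000"},
    "streams": [
        {"codec_type": "video", "width": 1920, "height": 1080, "avg_frame_rate": "30/1"},
    ],
}
print(summarize_probe(sample))
```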
1.2 Scene extraction
Extract frames at 3–4 second intervals to identify scene transitions:
```bash
mkdir -p /tmp/narration-frames
for t in $(seq 0 3 <duration>); do
  ffmpeg -y -ss $t -i <video> -frames:v 1 -q:v 2 /tmp/narration-frames/frame_${t}s.jpg 2>/dev/null
done
```
Review the frames (use the Read tool to view images). For each scene transition, note the precise timestamp. Where timing is ambiguous, extract additional frames at 1–2 second intervals to pinpoint the exact moment.
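The coarse-then-fine sampling above might be scripted along these lines. The helper names and the default refinement radius of 3 seconds are assumptions for illustration; the ffmpeg arguments mirror the shell loop.

```python
def coarse_times(duration: float, step: int = 3) -> list[int]:
    """Timestamps for the first pass, equivalent to `seq 0 step duration`."""
    return list(range(0, int(duration) + 1, step))


def refine_times(center: float, radius: int = 3, step: int = 1) -> list[int]:
    """Denser timestamps around an ambiguous transition (radius is an assumption)."""
    start = max(0, int(center) - radius)
    return list(range(start, int(center) + radius + 1, step))


def frame_command(video: str, t: int, out_dir: str = "/tmp/narration-frames") -> list[str]:
    """Argument list for extracting one frame at t seconds."""
    return [
        "ffmpeg", "-y", "-ss", str(t), "-i", video,
        "-frames:v", "1", "-q:v", "2", f"{out_dir}/frame_{t}s.jpg",
    ]


for t in coarse_times(10):
    print(" ".join(frame_command("demo.mp4", t)))
```

Each argument list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with file names that contain spaces.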
1.3 Transition map
Build a scene transition table mapping timestamps to visual content:
0s - Opening screen
3s - User starts typing
8s - System begins processing
34s - Response appears
Narration describing something on screen should start after that content is already visible. Viewers notice when audio arrives before the visuals, and it feels disorienting. Narrating slightly after the visual appears feels natural, like a presenter walking you through what you're seeing.
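One way to apply this rule is to keep the transition map as data and look up which scene is on screen at a proposed narration start. The sketch below uses the example table above; `scene_at` is a hypothetical helper, not part of the script template.

```python
from bisect import bisect_right

# (start_seconds, visual content), taken from the example table above
TRANSITIONS = [
    (0, "Opening screen"),
    (3, "User starts typing"),
    (8, "System begins processing"),
    (34, "Response appears"),
]


def scene_at(transitions: list[tuple[float, str]], t: float) -> str:
    """Return the scene visible at time t (transitions sorted by start time)."""
    times = [start for start, _ in transitions]
    index = bisect_right(times, t) - 1
    if index < 0:
        raise ValueError("t is before the first transition")
    return transitions[index][1]


# Narration about processing that starts at 7.9s would arrive before the visual
print(scene_at(TRANSITIONS, 7.9))
print(scene_at(TRANSITIONS, 8))
```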
Phase 2: Script Writing
Format
Each narration segment is a `(start_seconds, text)` tuple:
```python
SEGMENTS = [
    (0, "Opening narration here."),
    (8, "Next segment narration..."),
]
```
Writing guidance
Timing: Leave at least 1 second of silence between segments — this breathing room makes narration feel conversational rather than rushed. Use the timing estimates from references/voices.md to estimate whether text fits: for English, multiply the window (in seconds) by 2.5 words/sec, then take 80% as the safe word count.
Flow: Each segment should connect logically to the next. Transition words ("And", "Now", "So") help, but vary them — three consecutive "And now" transitions sound robotic.
Adapting to input: If the user provided a draft, calibrate its timestamps against the scene analysis, trim text that overflows its time window, and polish the language — but preserve their intent and key points. Without a draft, write narration for each scene based on what's visible.
Gemini prompt hygiene: If using Gemini, keep style instructions separate from the spoken transcript. The script template already wraps text in a safe `TRANSCRIPT:` preamble because Gemini 3.1 Flash TTS can occasionally read metadata aloud or reject vague prompts.
Pre-flight check
Before generating audio, verify each segment fits:
window = next_segment_start - this_segment_start
max_words = window * words_per_second * 0.8
If a segment is too long, shorten the text now. Trimming words is much cheaper than regenerating audio.
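The pre-flight check can be sketched as a small function. It assumes the English rate of 2.5 words per second quoted earlier, counts words by whitespace (which only suits space-delimited languages), and treats the video's end as the close of the last segment's window; `preflight` is a hypothetical name.

```python
def preflight(segments, video_duration, words_per_second=2.5, safety=0.8):
    """Return (index, word_count, max_words) for every segment that is too long."""
    problems = []
    for i, (start, text) in enumerate(segments):
        # The last segment's window runs to the end of the video
        next_start = segments[i + 1][0] if i + 1 < len(segments) else video_duration
        window = next_start - start
        max_words = int(window * words_per_second * safety)
        words = len(text.split())
        if words > max_words:
            problems.append((i, words, max_words))
    return problems


segments = [
    (0, "one two three four five"),  # 2s window allows 4 words
    (2, "short"),                    # 8s window allows 16 words
]
print(preflight(segments, video_duration=10))
```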
Phase 3: Generate the Script
Copy scripts/narration_script_template.py into the video's directory as `narration_script.py`. Fill in:
- `TTS_PROVIDER` as `azure` or `gemini`
- `VOICE_NAME` from the provider-specific table
- `INPUT_VIDEO` and `OUTPUT_VIDEO` (relative paths only)
- `SEGMENTS` from Phase 2
Design notes
These choices come from debugging real production issues:
- `normalize=0` on amix: ffmpeg's `amix` divides volume by input count by default. With 20 segments, output would be 1/20th volume, which is essentially silent.
- Discarding original audio: Even mixing original audio at 5% volume produces audible double-voice artifacts.
- Aborting on overlap: If any segment's audio extends past the next segment's start time, the script stops and reports the problem. Overlapping audio sounds broken.
- Skipping existing audio files: The script only generates audio for segments without an existing cached file. Azure uses `.mp3`; Gemini uses `.wav`. If you change a segment's text, delete the matching `seg_XXX.*` file before re-running.
- Gemini retry logic: Gemini 3.1 Flash TTS can occasionally return transient `500` errors. The template retries a few times automatically before failing.
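To make the `normalize=0` note concrete, here is a minimal sketch of a filter graph that delays each segment to its start time and mixes them without volume scaling. It is an assumption about how such a graph could be built, not the template's actual code: it supposes the video is ffmpeg input 0, each segment's audio is inputs 1..N, and the ffmpeg build is recent enough to support `adelay`'s `all` option and `amix`'s `normalize` option. `build_filter` and the `[aout]` label are hypothetical.

```python
def build_filter(starts_s):
    """Build a -filter_complex string: delay each audio input, then mix."""
    parts, labels = [], []
    for i, start in enumerate(starts_s):
        delay_ms = int(round(start * 1000))  # adelay takes milliseconds
        label = f"a{i}"
        # Input 0 is assumed to be the video, so audio inputs start at 1
        parts.append(f"[{i + 1}:a]adelay={delay_ms}:all=1[{label}]")
        labels.append(f"[{label}]")
    # normalize=0 keeps each input at full volume instead of dividing by N
    parts.append(f"{''.join(labels)}amix=inputs={len(starts_s)}:normalize=0[aout]")
    return ";".join(parts)


print(build_filter([0, 8]))
```

Because the original audio stream is never referenced in the graph, it is dropped from the output as long as only the video stream and the mixed label are mapped.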
Phase 4: Run & Iterate
```bash
python3 narration_script.py
```
If the timing report shows overlaps (gap < 0), decide whether to shorten the text or push the next segment's start time later. If you change text, delete the corresponding cached audio file in `narration_segments/` first. If you only change start times, re-run directly.
Keep iterating until all gaps are non-negative.
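The gap in the timing report is the next segment's start minus the current segment's end. A minimal sketch, assuming the generated audio durations are already known (for example from ffprobe); `timing_report` is a hypothetical name and the durations below are made-up values.

```python
def timing_report(segments, durations):
    """Return (index, gap_seconds) for each segment that has a successor."""
    report = []
    for i in range(len(segments) - 1):
        end = segments[i][0] + durations[i]
        gap = segments[i + 1][0] - end  # negative means overlap
        report.append((i, round(gap, 3)))
    return report


segments = [(0, "first"), (8, "second"), (15, "third")]
durations = [6.5, 7.4, 3.0]  # illustrative audio lengths in seconds
for index, gap in timing_report(segments, durations):
    status = "OVERLAP" if gap < 0 else "ok"
    print(f"segment {index}: gap {gap:+.3f}s {status}")
```

In this example segment 1 overlaps segment 2 by 0.4 s, so either its text is shortened or segment 2 moves to 15.4 s or later.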
Phase 5: Verification
Run all three checks after every successful build:
Volume
```bash
ffmpeg -i <output> -ss 0 -t 30 -af "volumedetect" -vn -f null - 2>&1 | grep -E "mean_volume|max_volume"
```
Expect mean_volume between -25 and -15 dB, and max_volume between -10 and 0 dB. If mean is below -40 dB, the `normalize=0` fix isn't applied; check the filter string.
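If the check is automated, the two values can be parsed from ffmpeg's stderr and compared against the ranges above. The helper names are hypothetical, and the sample text is illustrative rather than captured from a real run.

```python
import re


def parse_volumes(stderr_text: str) -> dict:
    """Pull mean_volume and max_volume (in dB) out of volumedetect output."""
    values = {}
    for name in ("mean_volume", "max_volume"):
        match = re.search(rf"{name}:\s*(-?\d+(?:\.\d+)?) dB", stderr_text)
        if match:
            values[name] = float(match.group(1))
    return values


def volume_ok(values: dict) -> bool:
    """Apply the expected ranges from the text above."""
    return -25 <= values["mean_volume"] <= -15 and -10 <= values["max_volume"] <= 0


# Illustrative log lines in the volumedetect format
sample = (
    "[Parsed_volumedetect_0 @ 0x55] mean_volume: -19.8 dB\n"
    "[Parsed_volumedetect_0 @ 0x55] max_volume: -2.1 dB\n"
)
print(parse_volumes(sample), volume_ok(parse_volumes(sample)))
```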
Silence gaps
```bash
ffmpeg -i <output> -af "silencedetect=noise=-30dB:d=0.3" -vn -f null - 2>&1 | grep -E "silence_(start|end)" | head -20
```
Confirm clean silence between segment transitions. Silence boundaries should match expected segment end/start times.
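For comparison against the expected segment boundaries, the silencedetect lines can be turned into (start, end) pairs. This is a sketch with a hypothetical helper name and illustrative log text; it assumes every `silence_start` has a matching `silence_end`.

```python
import re

NUMBER = r"(-?\d+(?:\.\d+)?)"


def parse_silences(stderr_text: str) -> list[tuple[float, float]]:
    """Pair up silence_start and silence_end values in order of appearance."""
    starts = [float(x) for x in re.findall(rf"silence_start:\s*{NUMBER}", stderr_text)]
    ends = [float(x) for x in re.findall(rf"silence_end:\s*{NUMBER}", stderr_text)]
    return list(zip(starts, ends))


# Illustrative log lines in the silencedetect format
sample = (
    "[silencedetect @ 0x1] silence_start: 6.52\n"
    "[silencedetect @ 0x1] silence_end: 8.01 | silence_duration: 1.49\n"
)
print(parse_silences(sample))
```

With a segment that starts at 8 s, a silence ending near 8.0 s is what a clean transition looks like.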
Audio-video sync
Extract frames at 5–8 key segment start times and view them:
```bash
for t in <timestamps>; do
  ffmpeg -y -ss $t -i <output> -frames:v 1 -q:v 2 /tmp/verify_${t}s.jpg 2>/dev/null
done
```
The on-screen content should already be visible when the narration for that scene begins.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Two voices playing | Original audio was mixed in | Only map the video stream and the narration mix, not the original audio |
| Audio nearly silent | amix divided volume by input count | Add `normalize=0` to the amix filter |
| Narration out of sync | Imprecise scene timestamps | Re-extract frames at 1–2s intervals around the problem area |
| Overlap at segment boundary | Previous segment runs too long | Shorten that segment's text or delay the next segment |
| Text changed but audio didn't | Old cached audio file still exists | Delete the matching `seg_XXX.*` file |
| Audio cut off at video end | Last segment overflows video duration | Shorten the last segment so it finishes 3–4s before the video ends |
| Gemini returns a transient error | Preview model emitted text tokens instead of audio | Re-run; the template already retries transient failures |
| Gemini reads prompt labels aloud | Prompt classifier failed or prompt was too vague | Keep the transcript explicit and use the template's `TRANSCRIPT:` preamble |