# Document to Narration

Convert written documents into narrated video scripts with precise word-level timing.

## Core Principle

The agent interprets; the document guides. Rather than rigid template-based splits, this skill uses agent judgment to find where the content naturally breathes, argues, and transitions. The document's argument flow determines scene breaks, not a predetermined structure.

## When to Use This Skill

Use this skill when:
- Converting a blog post or essay to video narration
- Preparing content for TTS audio generation
- Breaking long-form content into digestible scenes
- Creating word-level synchronized captions for video

Do NOT use this skill when:
- The content is already in scene/script format
- You need real-time voice synthesis (this is batch processing)
- Working with dialogue or multi-speaker content (single voice only)

## Prerequisites

- Deno installed (https://deno.land/)
- Python 3.12 with venv support
- ffmpeg for audio conversion
- whisper-cpp (installed via @remotion/install-whisper-cpp)
- TTS model at `tts/model/` (not in git due to size; see Model Setup below)

## Complete Pipeline

There are two approaches: per-scene (legacy) and full narration (recommended).

### Full Narration Pipeline (Recommended)

Generates a single audio file for consistent volume and pacing:

```
Document (.md)
    ↓ [agent interprets scene breaks]
Scene .txt files (01-scene-name.txt, 02-scene-name.txt, ...)
    ↓ [TTS via narrate-full.py - SINGLE PASS]
full-narration.wav (one consistent audio file)
    ↓ [Whisper via transcribe-full.py]
full-narration.json + full-narration.vtt (word-level timing)
    ↓ [extract-scene-boundaries.py]
Scene timing boundaries for video composition
```

### Per-Scene Pipeline (Legacy)

Generates separate audio per scene, which can cause volume inconsistencies:

```
Scene .txt files
    ↓ [TTS via narrate-scenes.py - MULTIPLE PASSES]
Scene .wav files (volume may vary between scenes)
    ↓ [concatenate]
Combined audio (may have clipping at boundaries)
```

Warning: Per-scene TTS generates audio with different volume levels and pacing. When concatenated, this causes audible jumps and clipping. Use the full narration pipeline instead.

## Quick Start

### Full Narration Pipeline (Recommended)

```bash
cd .claude/skills/document-to-narration
source tts/.venv/bin/activate

# 1. Split document into scenes (manual or scripted)
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/

# 2. Generate single audio file
python scripts/narrate-full.py ./output/scenes/

# 3. Transcribe with word-level timestamps
python scripts/transcribe-full.py ./output/full-narration.wav

# 4. Extract scene boundaries for video timing
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

### Legacy Per-Scene Pipeline

```bash
# 1. Split document into scenes
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/

# 2. Generate audio per scene (may have volume inconsistencies)
source tts/.venv/bin/activate
python scripts/narrate-scenes.py ./output/scenes/

# 3. Transcribe (DEPRECATED: transcribe-scenes.ts requires whisper-cpp)
# Use transcribe-full.py instead after concatenating audio
```

## Instructions

### Step 1: Setup (First Time Only)

#### Create Python Virtual Environment

```bash
cd .claude/skills/document-to-narration/tts
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

#### TTS Model Setup

The fine-tuned voice model (~7.8GB) is not included in git due to size. Place your Qwen3-TTS model files in `tts/model/`:

```
tts/model/
├── config.json
├── generation_config.json
├── model.safetensors      # Main model weights
├── tokenizer_config.json
├── vocab.json
├── merges.txt
└── speech_tokenizer/
    └── ...
```

#### Install Whisper (if not already installed)

The @remotion/install-whisper-cpp package handles this:

```typescript
import { installWhisperCpp, downloadWhisperModel } from '@remotion/install-whisper-cpp';

await installWhisperCpp({ to: './whisper-cpp', version: '1.5.5' });
await downloadWhisperModel({ model: 'medium', folder: './whisper-cpp' });
```

### Step 2: Prepare Your Document

The skill works best with:
- Markdown documents with a clear heading structure (H1, H2)
- Well-structured arguments with distinct sections
- Content that reads naturally aloud

### Step 3: Run the Pipeline

```bash
deno run -A scripts/full-pipeline.ts /path/to/essay.md --output ./output/essay-name/
```

### Step 4: Review Output

```
output/essay-name/
├── scenes/
│   ├── 01-opening-hook.txt      # Scene script
│   ├── 01-opening-hook.wav      # Generated audio
│   ├── 01-opening-hook.vtt      # Word-level captions
│   ├── 02-core-argument.txt
│   ├── 02-core-argument.wav
│   ├── 02-core-argument.vtt
│   └── ...
└── manifest.json                # Complete timing data
```

## Scene Boundary Heuristics

The agent identifies scene breaks using these heuristics:

### Strong Boundaries (Almost Always Break)

- H2 heading changes
- "Here's the thing" / "The point is" pivot statements
- Major metaphor introduction
- Explicit enumeration ("First...", "Second...")
- Significant perspective shifts

### Moderate Boundaries (Consider Breaking)

- A long paragraph after short ones (or vice versa)
- Example-to-principle transitions
- "But" / "However" / "Meanwhile" at paragraph start
- Question-then-answer patterns

### Weak Boundaries (Usually Keep Together)

- Paragraph-to-paragraph within the same example
- Sequential evidence for the same point
- Build-up to a punchline/reveal
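To make these cues concrete, here is a small illustrative scorer for the pivot-phrase and enumeration markers listed above. This is a hypothetical sketch for intuition only, not the skill's actual boundary logic:

```python
# Illustrative only: classify a paragraph's boundary strength using a few
# of the textual cues above. Not the skill's real implementation.
import re

STRONG = [r"^##\s", r"\bhere's the thing\b", r"\bthe point is\b",
          r"^(first|second|third)\b"]
MODERATE = [r"^(but|however|meanwhile)\b"]

def boundary_strength(paragraph: str) -> str:
    """Return 'strong', 'moderate', or 'weak' for a candidate break point."""
    text = paragraph.strip().lower()
    if any(re.search(p, text) for p in STRONG):
        return "strong"
    if any(re.search(p, text) for p in MODERATE):
        return "moderate"
    return "weak"
```

Real scene-splitting relies on agent judgment about argument flow; a keyword scan like this only catches the surface markers.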

### Scene Length Guidance

- Target: 100-300 words per scene (30-90 seconds of audio)
- Minimum: 50 words (avoid micro-scenes)
- Maximum: 500 words (avoid cognitive overload)
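These bounds are easy to check mechanically before running TTS. The helper below is a hypothetical sketch (not one of the shipped scripts) that flags scene scripts falling outside the recommended range:

```python
# Hypothetical helper: flag scene scripts whose word counts fall outside
# the recommended 50-500 word range.

def check_scene_lengths(scenes: dict[str, str],
                        minimum: int = 50,
                        maximum: int = 500) -> list[str]:
    """Return a warning per scene that is too short or too long.
    scenes maps scene name -> scene script text."""
    warnings = []
    for name, text in scenes.items():
        count = len(text.split())
        if count < minimum:
            warnings.append(f"{name}: only {count} words (micro-scene)")
        elif count > maximum:
            warnings.append(f"{name}: {count} words (cognitive overload)")
    return warnings
```

Running it over the `scenes/` directory before narration catches micro-scenes early, when re-splitting is still cheap.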

## Anti-Patterns

### The Paragraph Slicer

**Pattern:** Breaking at every paragraph or heading mechanically.
**Problem:** Ignores argument flow; scenes feel choppy and disconnected.
**Fix:** Look for rhetorical units, not structural units. Multiple paragraphs often form one scene.

### The Wall of Text

**Pattern:** Keeping entire sections as single scenes.
**Problem:** Creates TTS audio that is too long and loses natural breathing room.
**Fix:** Target 100-300 words. Find the natural pause point within sections.

### The Verbatim Transcriber

**Pattern:** Copying written text exactly, without spoken adaptation.
**Problem:** Written conventions don't work when spoken. Parentheticals, complex punctuation, and nested clauses confuse TTS and listeners.
**Fix:** Apply the adaptation rules. Read it aloud mentally.

### The Over-Adapter

**Pattern:** Rewriting content so heavily it loses the author's voice.
**Problem:** The result doesn't sound like the original author.
**Fix:** Preserve voice, adjust mechanics. If the author uses rhetorical questions, keep them.

## Available Scripts

### scripts/split-to-scenes.ts

Parse a markdown document and output scene text files.

```bash
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/ --adapt
deno run --allow-read scripts/split-to-scenes.ts input.md --dry-run
```

**Options:**
- `--output` - Directory for scene files (created if it doesn't exist)
- `--adapt` - Apply spoken adaptation rules
- `--dry-run` - Preview scene breaks without writing files

**Output:** Numbered `.txt` files and an initial `manifest.json`

### scripts/narrate-full.py (Recommended)

Generate a single TTS audio file from all scene files. Produces consistent volume and pacing.

```bash
python scripts/narrate-full.py ./output/scenes/
python scripts/narrate-full.py ./output/scenes/ --force
python scripts/narrate-full.py ./output/scenes/ --speaker jwynia
python scripts/narrate-full.py ./output/scenes/ --output ./custom/path/audio.wav
```

**Options:**
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)
- `--output` - Custom output path (default: `../full-narration.wav`)

**Output:** A single `full-narration.wav` in the parent directory of the scenes folder

### scripts/narrate-scenes.py (Legacy)

Generate TTS audio for each scene file separately. Not recommended: can cause volume inconsistencies when concatenated.

```bash
python scripts/narrate-scenes.py ./output/scenes/
python scripts/narrate-scenes.py ./output/scenes/ --force
python scripts/narrate-scenes.py ./output/scenes/ --speaker jwynia
```

**Options:**
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)

**Output:** `.wav` files alongside each `.txt` file

### scripts/transcribe-full.py (Recommended)

Transcribe audio with word-level timestamps using Python's openai-whisper.

```bash
python scripts/transcribe-full.py ./output/full-narration.wav
python scripts/transcribe-full.py ./output/full-narration.wav --model large-v3
python scripts/transcribe-full.py ./output/full-narration.wav --output-dir ./captions/
```

**Options:**
- `--model` - Whisper model: tiny, base, small, medium, large, large-v2, large-v3 (default: medium)
- `--output-dir` - Output directory (default: same as audio file)

**Output:**
- `.vtt` file with word-level timestamps
- `.json` file with a captions array for Remotion

**Dependencies:** Requires `openai-whisper` in the Python environment:

```bash
pip install openai-whisper
```

### scripts/extract-scene-boundaries.py

Extract scene timing boundaries from the transcript by matching scene opening phrases.

```bash
# Human-readable table
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json

# JSON output
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --json

# TypeScript for Video.tsx
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

**Options:**
- `--json` - Output as a JSON array
- `--typescript` - Output as TypeScript code for the Video.tsx scenes array

**Output:** Scene numbers, slugs, start times, and durations

### scripts/transcribe-scenes.ts (Deprecated)

Deprecated: requires the whisper-cpp binary, which may not be installed. Use `transcribe-full.py` instead.

Transcribe per-scene audio files using whisper-cpp.

```bash
deno run --allow-read --allow-write --allow-run scripts/transcribe-scenes.ts ./output/scenes/
```

**Output:** `.vtt` files with word-level timestamps

### scripts/full-pipeline.ts

Orchestrate the complete pipeline.

```bash
deno run -A scripts/full-pipeline.ts input.md --output ./output/project-name/
```

**Options:**
- `--output` - Output directory (required)
- `--adapt` - Apply spoken adaptation
- `--skip-tts` - Skip audio generation (text only)
- `--skip-transcribe` - Skip Whisper transcription

## Output Format

### manifest.json

```json
{
  "source": "appliance-vs-trade-tool-draft.md",
  "created_at": "2024-01-15T10:30:00Z",
  "total_scenes": 9,
  "total_duration_seconds": 420,
  "scenes": [
    {
      "number": 1,
      "slug": "popcorn-opening",
      "word_count": 185,
      "audio_duration_seconds": 55.2,
      "files": {
        "text": "scenes/01-popcorn-opening.txt",
        "audio": "scenes/01-popcorn-opening.wav",
        "captions": "scenes/01-popcorn-opening.vtt"
      },
      "captions": [
        { "text": "Two", "startMs": 0, "endMs": 180, "confidence": 0.98 },
        { "text": "people", "startMs": 180, "endMs": 450, "confidence": 0.97 }
      ]
    }
  ]
}
```
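For video timing, the per-scene durations in this manifest convert directly into cumulative start frames. A minimal sketch, assuming the schema shown above (the `fps` value is an assumption, not something the manifest stores):

```python
# Sketch: turn manifest scene durations into cumulative start frames for
# sequencing. Assumes the manifest.json schema shown above.

def scene_start_frames(manifest: dict, fps: int = 30) -> list[dict]:
    """Return [{'slug', 'from_frame', 'duration_frames'}] in scene order."""
    out = []
    cursor = 0
    for scene in manifest["scenes"]:
        duration = round(scene["audio_duration_seconds"] * fps)
        out.append({"slug": scene["slug"],
                    "from_frame": cursor,
                    "duration_frames": duration})
        cursor += duration  # next scene starts where this one ends
    return out
```

Rounding per scene (rather than rounding the running total) keeps each scene's frame count stable even if earlier durations change slightly after re-narration.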

### VTT Format

```vtt
WEBVTT

00:00.000 --> 00:00.180
Two

00:00.180 --> 00:00.450
people

00:00.450 --> 00:00.720
walk

00:00.720 --> 00:01.100
into
```
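Cues in this format parse back into the caption objects used in manifest.json. A minimal sketch, assuming the one-word-per-cue layout and MM:SS.mmm timestamps shown above (full-spec VTT allows richer cues this does not handle):

```python
# Sketch: parse word-level WEBVTT cues (as in the sample above) into
# {text, startMs, endMs} dicts. Assumes one word per cue and MM:SS.mmm
# timestamps; not a general-purpose VTT parser.
import re

CUE = re.compile(r"(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\n(.+)")

def parse_vtt_words(vtt: str) -> list[dict]:
    words = []
    for m in CUE.finditer(vtt):
        start = (int(m[1]) * 60 + float(m[2])) * 1000
        end = (int(m[3]) * 60 + float(m[4])) * 1000
        words.append({"text": m[5].strip(),
                      "startMs": round(start),
                      "endMs": round(end)})
    return words
```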

## Spoken Adaptation

When `--adapt` is enabled, the skill transforms written conventions into spoken equivalents:

| Written | Spoken |
| --- | --- |
| Parenthetical asides | Em-dash or separate sentence |
| "e.g." | "for example" |
| "i.e." | "that is" |
| Long nested clauses | Split into multiple sentences |
| Semicolons | Periods |
| *emphasis* | Context-appropriate stress |

Preserve:
- Author's voice and tone
- Rhetorical questions
- Deliberate repetition
- Key phrases and memorable formulations
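A few of these substitutions are purely mechanical and can be sketched with regular expressions. This illustrates the idea only; it is not the shipped `--adapt` implementation:

```python
# Sketch of the mechanical substitutions above: abbreviation expansion
# and semicolon-to-sentence splitting. Illustrative, not the real --adapt.
import re

RULES = [
    (re.compile(r"\be\.g\.,?", re.IGNORECASE), "for example,"),
    (re.compile(r"\bi\.e\.,?", re.IGNORECASE), "that is,"),
    # Semicolons become sentence breaks; capitalize the next word.
    (re.compile(r";\s+(\w)"), lambda m: ". " + m[1].upper()),
]

def adapt_for_speech(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Parenthetical asides and nested clauses need structural rewriting that regexes cannot do safely, which is why the "preserve voice" judgment stays with the agent.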

## Integration

### With remotion-designer

- Pass the manifest scene list to remotion-designer
- Each scene becomes a visual design unit
- Word-level timing drives text animation

### With Remotion Compositions

```tsx
import { Audio, Sequence, staticFile } from 'remotion';
import manifest from './output/manifest.json';

// Use scene durations for Sequence timing. accumulatedFrames is the
// running total of the previous scenes' durations in frames; fps comes
// from the composition config.
{manifest.scenes.map((scene, i) => (
  <Sequence
    key={i}
    from={accumulatedFrames}
    durationInFrames={scene.audio_duration_seconds * fps}
  >
    <Audio src={staticFile(scene.files.audio)} />
    <CaptionRenderer captions={scene.captions} />
  </Sequence>
))}
```

## Technical Notes

### WAV Format Conversion

Whisper requires 16kHz mono WAV. The pipeline handles conversion automatically:

```bash
ffmpeg -i input.wav -ar 16000 -ac 1 output_16khz.wav
```

### TTS Model

The fine-tuned voice model (~7.8GB) lives at `tts/model/` (not tracked in git; see Model Setup above). It uses Qwen3-TTS with a custom speaker embedding.

### Performance

- TTS: ~5-30 seconds per sentence (Apple Silicon MPS or NVIDIA CUDA)
- Whisper: ~0.5-2x realtime, depending on model size
- Full essay (~2000 words): ~10-20 minutes total processing

## What This Skill Does NOT Do

- Generate video visuals (use remotion-designer)
- Real-time voice synthesis
- Multi-speaker dialogue
- Edit or improve the content's argument
- Make editorial changes beyond mechanical spoken adaptation