# Document to Narration
Convert written documents into narrated video scripts with precise word-level timing.
## Core Principle

The agent interprets; the document guides. Rather than rigid template-based splits, this skill uses agent judgment to find where the content naturally breathes, argues, and transitions. The document's argument flow determines scene breaks, not a predetermined structure.
## When to Use This Skill

Use this skill when:

- Converting a blog post or essay to video narration
- Preparing content for TTS audio generation
- Breaking long-form content into digestible scenes
- Creating word-level synchronized captions for video

Do NOT use this skill when:

- The content is already in scene/script format
- You need real-time voice synthesis (this is batch processing)
- Working with dialogue or multi-speaker content (single voice only)
## Prerequisites

- Deno installed (https://deno.land/)
- Python 3.12 with venv support
- ffmpeg for audio conversion
- whisper-cpp (installed via @remotion/install-whisper-cpp)
- TTS model at `tts/model/` (not in git due to size - see Model Setup below)
## Complete Pipeline

There are two approaches: per-scene (legacy) and full narration (recommended).

### Full Narration Pipeline (Recommended)

Generates a single audio file for consistent volume and pacing:

```
Document (.md)
  ↓ [agent interprets scene breaks]
Scene .txt files (01-scene-name.txt, 02-scene-name.txt, ...)
  ↓ [TTS via narrate-full.py - SINGLE PASS]
full-narration.wav (one consistent audio file)
  ↓ [Whisper via transcribe-full.py]
full-narration.json + full-narration.vtt (word-level timing)
  ↓ [extract-scene-boundaries.py]
Scene timing boundaries for video composition
```

### Per-Scene Pipeline (Legacy)

Generates separate audio per scene - can cause volume inconsistencies:

```
Scene .txt files
  ↓ [TTS via narrate-scenes.py - MULTIPLE PASSES]
Scene .wav files (volume may vary between scenes)
  ↓ [concatenate]
Combined audio (may have clipping at boundaries)
```

Warning: Per-scene TTS generates audio with different volume levels and pacing. When concatenated, this causes audible jumps and clipping. Use the full narration pipeline instead.
## Quick Start

### Full Narration Pipeline (Recommended)

```bash
cd .claude/skills/document-to-narration
source tts/.venv/bin/activate
```

1. Split document into scenes (manual or scripted):

```bash
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
```

2. Generate a single audio file:

```bash
python scripts/narrate-full.py ./output/scenes/
```

3. Transcribe with word-level timestamps:

```bash
python scripts/transcribe-full.py ./output/full-narration.wav
```

4. Extract scene boundaries for video timing:

```bash
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

### Legacy Per-Scene Pipeline

1. Split document into scenes:

```bash
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
```

2. Generate audio per scene (may have volume inconsistencies):

```bash
source tts/.venv/bin/activate
python scripts/narrate-scenes.py ./output/scenes/
```

3. Transcribe (DEPRECATED: transcribe-scenes.ts requires whisper-cpp). Use transcribe-full.py instead after concatenating audio.

## Instructions
### Step 1: Setup (First Time Only)

#### Create Python Virtual Environment

```bash
cd .claude/skills/document-to-narration/tts
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

#### TTS Model Setup

The fine-tuned voice model (~7.8GB) is not included in git due to size.
Place your Qwen3-TTS model files in `tts/model/`:

```
tts/model/
├── config.json
├── generation_config.json
├── model.safetensors        # Main model weights
├── tokenizer_config.json
├── vocab.json
├── merges.txt
└── speech_tokenizer/
    └── ...
```

#### Install Whisper (if not already installed)

The @remotion/install-whisper-cpp package handles this:

```typescript
import { installWhisperCpp, downloadWhisperModel } from '@remotion/install-whisper-cpp';

await installWhisperCpp({ to: './whisper-cpp', version: '1.5.5' });
await downloadWhisperModel({ model: 'medium', folder: './whisper-cpp' });
```

### Step 2: Prepare Your Document
The skill works best with:

- Markdown documents with clear heading structure (H1, H2)
- Well-structured arguments with distinct sections
- Content that reads naturally aloud

### Step 3: Run the Pipeline

```bash
deno run -A scripts/full-pipeline.ts /path/to/essay.md --output ./output/essay-name/
```

### Step 4: Review Output
```
output/essay-name/
├── scenes/
│   ├── 01-opening-hook.txt    # Scene script
│   ├── 01-opening-hook.wav    # Generated audio
│   ├── 01-opening-hook.vtt    # Word-level captions
│   ├── 02-core-argument.txt
│   ├── 02-core-argument.wav
│   ├── 02-core-argument.vtt
│   └── ...
└── manifest.json              # Complete timing data
```

## Scene Boundary Heuristics

The agent identifies scene breaks using these heuristics:
### Strong Boundaries (Almost Always Break)

- H2 heading changes
- "Here's the thing" / "The point is" pivot statements
- Major metaphor introduction
- Explicit enumeration ("First...", "Second...")
- Significant perspective shifts
### Moderate Boundaries (Consider Breaking)

- Long paragraph after short ones (or vice versa)
- Example-to-principle transitions
- "But" / "However" / "Meanwhile" at paragraph start
- Question-then-answer patterns
### Weak Boundaries (Usually Keep Together)

- Paragraph-to-paragraph within the same example
- Sequential evidence for the same point
- Build-up to a punchline/reveal
### Scene Length Guidance

- Target: 100-300 words per scene (30-90 seconds of audio)
- Minimum: 50 words (avoid micro-scenes)
- Maximum: 500 words (avoid cognitive overload)
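These limits can be checked mechanically before spending minutes on TTS. A minimal sketch (the `check_scene_lengths` helper is hypothetical, not one of this skill's scripts; the thresholds mirror the guidance above):

```python
from pathlib import Path

# Thresholds from the scene length guidance above.
MIN_WORDS, TARGET_LOW, TARGET_HIGH, MAX_WORDS = 50, 100, 300, 500

def check_scene_lengths(scenes_dir: str) -> list[str]:
    """Return one warning per scene .txt file that falls outside the limits."""
    warnings = []
    for path in sorted(Path(scenes_dir).glob("*.txt")):
        count = len(path.read_text(encoding="utf-8").split())
        if count < MIN_WORDS:
            warnings.append(f"{path.name}: {count} words - micro-scene, merge it")
        elif count > MAX_WORDS:
            warnings.append(f"{path.name}: {count} words - too long, split it")
        elif not (TARGET_LOW <= count <= TARGET_HIGH):
            warnings.append(f"{path.name}: {count} words - outside 100-300 target")
    return warnings
```

Run it on the scenes directory after splitting and before narrating; an empty list means every scene is inside the target band.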
## Anti-Patterns

### The Paragraph Slicer

Pattern: Breaking at every paragraph or heading mechanically.
Problem: Ignores argument flow. Scenes feel choppy and disconnected.
Fix: Look for rhetorical units, not structural units. Multiple paragraphs often form one scene.
### The Wall of Text

Pattern: Keeping entire sections as single scenes.
Problem: Creates TTS audio that's too long. Loses natural breathing room.
Fix: Target 100-300 words. Find the natural pause point within sections.
### The Verbatim Transcriber

Pattern: Copying written text exactly without spoken adaptation.
Problem: Written conventions don't work when spoken. Parentheticals, complex punctuation, and nested clauses confuse TTS and listeners.
Fix: Apply the adaptation rules. Read it aloud mentally.
### The Over-Adapter

Pattern: Rewriting content so heavily it loses the author's voice.
Problem: The result doesn't sound like the original author.
Fix: Preserve voice, adjust mechanics. If the author uses rhetorical questions, keep them.
## Available Scripts

### scripts/split-to-scenes.ts

Parse a markdown document and output scene text files.

```bash
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/ --adapt
deno run --allow-read scripts/split-to-scenes.ts input.md --dry-run
```

Options:
- `--output` - Directory for scene files (created if it doesn't exist)
- `--adapt` - Apply spoken adaptation rules
- `--dry-run` - Preview scene breaks without writing files

Output: Numbered `.txt` files and an initial `manifest.json`

### scripts/narrate-full.py (Recommended)
Generate a single TTS audio file from all scene files. Produces consistent volume and pacing.

```bash
python scripts/narrate-full.py ./output/scenes/
python scripts/narrate-full.py ./output/scenes/ --force
python scripts/narrate-full.py ./output/scenes/ --speaker jwynia
python scripts/narrate-full.py ./output/scenes/ --output ./custom/path/audio.wav
```

Options:
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)
- `--output` - Custom output path (default: `../full-narration.wav`)

Output: A single `full-narration.wav` in the parent directory of the scenes folder

### scripts/narrate-scenes.py (Legacy)
Generate TTS audio for each scene file separately. Not recommended - can cause volume inconsistencies when concatenated.

```bash
python scripts/narrate-scenes.py ./output/scenes/
python scripts/narrate-scenes.py ./output/scenes/ --force
python scripts/narrate-scenes.py ./output/scenes/ --speaker jwynia
```

Options:
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)

Output: `.wav` files alongside each `.txt` file

### scripts/transcribe-full.py (Recommended)
Transcribe audio with word-level timestamps using Python's openai-whisper.

```bash
python scripts/transcribe-full.py ./output/full-narration.wav
python scripts/transcribe-full.py ./output/full-narration.wav --model large-v3
python scripts/transcribe-full.py ./output/full-narration.wav --output-dir ./captions/
```

Options:
- `--model` - Whisper model: tiny, base, small, medium, large, large-v2, large-v3 (default: medium)
- `--output-dir` - Output directory (default: same as audio file)

Output:
- `.vtt` file with word-level timestamps
- `.json` file with a captions array for Remotion

Dependencies: Requires `openai-whisper` in the Python environment:

```bash
pip install openai-whisper
```

### scripts/extract-scene-boundaries.py
Extract scene timing boundaries from the transcript by matching scene opening phrases.

```bash
# Human-readable table
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json

# JSON output
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --json

# TypeScript for Video.tsx
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

**Options:**
- `--json` - Output as JSON array
- `--typescript` - Output as TypeScript code for the Video.tsx scenes array

**Output:** Scene numbers, slugs, start times, and durations
### scripts/transcribe-scenes.ts (Deprecated)

Deprecated: Requires the whisper-cpp binary, which may not be installed. Use transcribe-full.py instead.

Transcribe per-scene audio files using whisper-cpp.

```bash
deno run --allow-read --allow-write --allow-run scripts/transcribe-scenes.ts ./output/scenes/
```

Output: `.vtt` files with word-level timestamps

### scripts/full-pipeline.ts
Orchestrate the complete pipeline.

```bash
deno run -A scripts/full-pipeline.ts input.md --output ./output/project-name/
```

Options:
- `--output` - Output directory (required)
- `--adapt` - Apply spoken adaptation
- `--skip-tts` - Skip audio generation (text only)
- `--skip-transcribe` - Skip Whisper transcription
## Output Format

### manifest.json

```json
{
  "source": "appliance-vs-trade-tool-draft.md",
  "created_at": "2024-01-15T10:30:00Z",
  "total_scenes": 9,
  "total_duration_seconds": 420,
  "scenes": [
    {
      "number": 1,
      "slug": "popcorn-opening",
      "word_count": 185,
      "audio_duration_seconds": 55.2,
      "files": {
        "text": "scenes/01-popcorn-opening.txt",
        "audio": "scenes/01-popcorn-opening.wav",
        "captions": "scenes/01-popcorn-opening.vtt"
      },
      "captions": [
        { "text": "Two", "startMs": 0, "endMs": 180, "confidence": 0.98 },
        { "text": "people", "startMs": 180, "endMs": 450, "confidence": 0.97 }
      ]
    }
  ]
}
```
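Video code usually needs cumulative start frames rather than the per-scene durations stored in the manifest. A small sketch of that conversion (assumes the manifest schema shown above; the `scene_frame_offsets` helper and the 30 fps default are hypothetical choices, not part of the skill):

```python
import json

def scene_frame_offsets(manifest_path: str, fps: int = 30) -> list[dict]:
    """Turn per-scene audio durations into cumulative start frames,
    the shape a Remotion <Sequence from=...> prop expects."""
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)
    offsets, start = [], 0
    for scene in manifest["scenes"]:
        frames = round(scene["audio_duration_seconds"] * fps)
        offsets.append({"slug": scene["slug"], "from": start, "durationInFrames": frames})
        start += frames
    return offsets
```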
### VTT Format

```vtt
WEBVTT

00:00.000 --> 00:00.180
Two

00:00.180 --> 00:00.450
people

00:00.450 --> 00:00.720
walk

00:00.720 --> 00:01.100
into
```
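A word-per-cue VTT file in this shape can be read back into the captions-array form used in manifest.json. A minimal parser sketch (handles only this one-word-per-cue layout, not general WebVTT; `parse_word_vtt` is a hypothetical helper, not one of the skill's scripts):

```python
import re

# Matches "MM:SS.mmm --> MM:SS.mmm" cue timing lines.
_TS = re.compile(r"(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)")

def parse_word_vtt(vtt_text: str) -> list[dict]:
    """Parse word-per-cue VTT into [{"text", "startMs", "endMs"}, ...]."""
    captions = []
    lines = [l.strip() for l in vtt_text.splitlines()]
    for i, line in enumerate(lines):
        m = _TS.match(line)
        if m and i + 1 < len(lines) and lines[i + 1]:
            mm1, ss1, mm2, ss2 = m.groups()
            captions.append({
                "text": lines[i + 1],
                "startMs": round((int(mm1) * 60 + float(ss1)) * 1000),
                "endMs": round((int(mm2) * 60 + float(ss2)) * 1000),
            })
    return captions
```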
## Spoken Adaptation

When `--adapt` is enabled, the skill transforms written conventions into spoken equivalents:

| Written | Spoken |
|---|---|
| Parenthetical asides | Em-dash or separate sentence |
| "e.g." | "for example" |
| "i.e." | "that is" |
| Long nested clauses | Split into multiple sentences |
| Semicolons | Periods |
| Emphasis markers | Context-appropriate stress |

Preserve:

- Author's voice and tone
- Rhetorical questions
- Deliberate repetition
- Key phrases and memorable formulations
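The simpler rows of the adaptation table are plain textual substitutions. A sketch of that mechanical subset (regex-based and illustrative only; the clause-splitting and parenthetical rules need real sentence parsing and are not shown):

```python
import re

# Mechanical written-to-spoken substitutions, per the table above.
_RULES = [
    (re.compile(r"\be\.g\.,?", re.IGNORECASE), "for example,"),
    (re.compile(r"\bi\.e\.,?", re.IGNORECASE), "that is,"),
    (re.compile(r";\s+"), ". "),  # semicolons become sentence breaks
]

def adapt_for_speech(text: str) -> str:
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

Note that this sketch does not recapitalize after a converted semicolon; a real implementation would also fix casing at new sentence boundaries.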
## Integration

### With remotion-designer

- Pass the manifest scene list to remotion-designer
- Each scene becomes a visual design unit
- Word-level timing drives text animation

### With Remotion Compositions

```tsx
import { Audio, Sequence, staticFile } from 'remotion';
import manifest from './output/manifest.json';

// Use scene durations for Sequence timing
{manifest.scenes.map((scene, i) => (
  <Sequence
    from={accumulatedFrames}
    durationInFrames={scene.audio_duration_seconds * fps}
  >
    <Audio src={staticFile(scene.files.audio)} />
    <CaptionRenderer captions={scene.captions} />
  </Sequence>
))}
```

## Technical Notes
### WAV Format Conversion

Whisper requires 16kHz mono WAV. The pipeline handles conversion automatically:

```bash
ffmpeg -i input.wav -ar 16000 -ac 1 output_16khz.wav
```

### TTS Model
The fine-tuned voice model (~7.8GB) is bundled at `tts/model/`. Uses Qwen3-TTS with a custom speaker embedding.

### Performance
- TTS: ~5-30 seconds per sentence (Apple Silicon MPS or NVIDIA CUDA)
- Whisper: ~0.5-2x realtime depending on model size
- Full essay (~2000 words): ~10-20 minutes total processing
## What This Skill Does NOT Do

- Generate video visuals (use remotion-designer)
- Real-time voice synthesis
- Multi-speaker dialogue
- Edit or improve the content's argument
- Make editorial changes beyond mechanical spoken adaptation