# Document to Narration

Convert written documents into narrated video scripts with precise word-level timing.

## Core Principle

The agent interprets; the document guides. Rather than rigid template-based splits, this skill uses agent judgment to find where the content naturally breathes, argues, and transitions. The document's argument flow determines scene breaks, not a predetermined structure.

## When to Use This Skill

Use this skill when:
- Converting a blog post or essay to video narration
- Preparing content for TTS audio generation
- Breaking long-form content into digestible scenes
- Creating word-level synchronized captions for video

Do NOT use this skill when:
- The content is already in scene/script format
- You need real-time voice synthesis (this is batch processing)
- Working with dialogue or multi-speaker content (single voice only)

## Prerequisites

- Deno installed (https://deno.land/)
- Python 3.12 with venv support
- ffmpeg for audio conversion
- whisper-cpp (installed via @remotion/install-whisper-cpp)
- TTS model at `tts/model/` (not in git due to size; see Model Setup below)

## Complete Pipeline

There are two approaches: per-scene (legacy) and full narration (recommended).

### Full Narration Pipeline (Recommended)

Generates a single audio file for consistent volume and pacing:

```
Document (.md)
    ↓ [agent interprets scene breaks]
Scene .txt files (01-scene-name.txt, 02-scene-name.txt, ...)
    ↓ [TTS via narrate-full.py - SINGLE PASS]
full-narration.wav (one consistent audio file)
    ↓ [Whisper via transcribe-full.py]
full-narration.json + full-narration.vtt (word-level timing)
    ↓ [extract-scene-boundaries.py]
Scene timing boundaries for video composition
```

### Per-Scene Pipeline (Legacy)

Generates separate audio per scene, which can cause volume inconsistencies:

```
Scene .txt files
    ↓ [TTS via narrate-scenes.py - MULTIPLE PASSES]
Scene .wav files (volume may vary between scenes)
    ↓ [concatenate]
Combined audio (may have clipping at boundaries)
```

Warning: Per-scene TTS generates audio with different volume levels and pacing. When concatenated, this causes audible jumps and clipping. Use the full narration pipeline instead.

## Quick Start

### Full Narration Pipeline (Recommended)

```bash
cd .claude/skills/document-to-narration
source tts/.venv/bin/activate

# 1. Split document into scenes (manual or scripted)
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/

# 2. Generate single audio file
python scripts/narrate-full.py ./output/scenes/

# 3. Transcribe with word-level timestamps
python scripts/transcribe-full.py ./output/full-narration.wav

# 4. Extract scene boundaries for video timing
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

### Legacy Per-Scene Pipeline

```bash
# 1. Split document into scenes
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/

# 2. Generate audio per scene (may have volume inconsistencies)
source tts/.venv/bin/activate
python scripts/narrate-scenes.py ./output/scenes/

# 3. Transcribe (DEPRECATED: transcribe-scenes.ts requires whisper-cpp)
# Use transcribe-full.py instead after concatenating audio
```

## Instructions

### Step 1: Setup (First Time Only)

#### Create Python Virtual Environment

```bash
cd .claude/skills/document-to-narration/tts
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

#### TTS Model Setup

The fine-tuned voice model (~7.8GB) is not included in git due to size. Place your Qwen3-TTS model files in `tts/model/`:

```
tts/model/
├── config.json
├── generation_config.json
├── model.safetensors      # Main model weights
├── tokenizer_config.json
├── vocab.json
├── merges.txt
└── speech_tokenizer/
    └── ...
```

#### Install Whisper (if not already installed)

The @remotion/install-whisper-cpp package handles this:

```typescript
import { installWhisperCpp, downloadWhisperModel } from '@remotion/install-whisper-cpp';

await installWhisperCpp({ to: './whisper-cpp', version: '1.5.5' });
await downloadWhisperModel({ model: 'medium', folder: './whisper-cpp' });
```

### Step 2: Prepare Your Document

The skill works best with:
- Markdown documents with a clear heading structure (H1, H2)
- Well-structured arguments with distinct sections
- Content that reads naturally aloud

### Step 3: Run the Pipeline

```bash
deno run -A scripts/full-pipeline.ts /path/to/essay.md --output ./output/essay-name/
```

### Step 4: Review Output

```
output/essay-name/
├── scenes/
│   ├── 01-opening-hook.txt      # Scene script
│   ├── 01-opening-hook.wav      # Generated audio
│   ├── 01-opening-hook.vtt      # Word-level captions
│   ├── 02-core-argument.txt
│   ├── 02-core-argument.wav
│   ├── 02-core-argument.vtt
│   └── ...
└── manifest.json                # Complete timing data
```

## Scene Boundary Heuristics

The agent identifies scene breaks using these heuristics:

### Strong Boundaries (Almost Always Break)

- H2 heading changes
- "Here's the thing" / "The point is" pivot statements
- Major metaphor introduction
- Explicit enumeration ("First...", "Second...")
- Significant perspective shifts

### Moderate Boundaries (Consider Breaking)

- A long paragraph after short ones (or vice versa)
- Example-to-principle transitions
- "But" / "However" / "Meanwhile" at paragraph start
- Question-then-answer patterns

### Weak Boundaries (Usually Keep Together)

- Paragraph-to-paragraph within the same example
- Sequential evidence for the same point
- Build-up to a punchline/reveal
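To make these cues concrete, here is a small illustrative scorer for the pivot-phrase and enumeration markers listed above. This is a hypothetical sketch for intuition only, not the skill's actual boundary logic:

```python
# Illustrative only: classify a paragraph's boundary strength using a few
# of the textual cues above. Not the skill's real implementation.
import re

STRONG = [r"^##\s", r"\bhere's the thing\b", r"\bthe point is\b",
          r"^(first|second|third)\b"]
MODERATE = [r"^(but|however|meanwhile)\b"]

def boundary_strength(paragraph: str) -> str:
    """Return 'strong', 'moderate', or 'weak' for a candidate break point."""
    text = paragraph.strip().lower()
    if any(re.search(p, text) for p in STRONG):
        return "strong"
    if any(re.search(p, text) for p in MODERATE):
        return "moderate"
    return "weak"
```

Real scene-splitting relies on agent judgment about argument flow; a keyword scan like this only catches the surface markers.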

### Scene Length Guidance

- Target: 100-300 words per scene (30-90 seconds of audio)
- Minimum: 50 words (avoid micro-scenes)
- Maximum: 500 words (avoid cognitive overload)
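These bounds are easy to check mechanically before running TTS. The helper below is a hypothetical sketch (not one of the shipped scripts) that flags scene scripts falling outside the recommended range:

```python
# Hypothetical helper: flag scene scripts whose word counts fall outside
# the recommended 50-500 word range.

def check_scene_lengths(scenes: dict[str, str],
                        minimum: int = 50,
                        maximum: int = 500) -> list[str]:
    """Return a warning per scene that is too short or too long.
    scenes maps scene name -> scene script text."""
    warnings = []
    for name, text in scenes.items():
        count = len(text.split())
        if count < minimum:
            warnings.append(f"{name}: only {count} words (micro-scene)")
        elif count > maximum:
            warnings.append(f"{name}: {count} words (cognitive overload)")
    return warnings
```

Running it over the `scenes/` directory before narration catches micro-scenes early, when re-splitting is still cheap.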

## Anti-Patterns

### The Paragraph Slicer

**Pattern:** Breaking at every paragraph or heading mechanically.
**Problem:** Ignores argument flow; scenes feel choppy and disconnected.
**Fix:** Look for rhetorical units, not structural units. Multiple paragraphs often form one scene.

### The Wall of Text

**Pattern:** Keeping entire sections as single scenes.
**Problem:** Creates TTS audio that is too long and loses natural breathing room.
**Fix:** Target 100-300 words. Find the natural pause point within sections.

### The Verbatim Transcriber

**Pattern:** Copying written text exactly, without spoken adaptation.
**Problem:** Written conventions don't work when spoken. Parentheticals, complex punctuation, and nested clauses confuse TTS and listeners.
**Fix:** Apply the adaptation rules. Read it aloud mentally.

### The Over-Adapter

**Pattern:** Rewriting content so heavily it loses the author's voice.
**Problem:** The result doesn't sound like the original author.
**Fix:** Preserve voice, adjust mechanics. If the author uses rhetorical questions, keep them.

## Available Scripts

### scripts/split-to-scenes.ts

Parse a markdown document and output scene text files.

```bash
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/ --adapt
deno run --allow-read scripts/split-to-scenes.ts input.md --dry-run
```

**Options:**
- `--output` - Directory for scene files (created if it doesn't exist)
- `--adapt` - Apply spoken adaptation rules
- `--dry-run` - Preview scene breaks without writing files

**Output:** Numbered `.txt` files and an initial `manifest.json`

### scripts/narrate-full.py (Recommended)

Generate a single TTS audio file from all scene files. Produces consistent volume and pacing.

```bash
python scripts/narrate-full.py ./output/scenes/
python scripts/narrate-full.py ./output/scenes/ --force
python scripts/narrate-full.py ./output/scenes/ --speaker jwynia
python scripts/narrate-full.py ./output/scenes/ --output ./custom/path/audio.wav
```

**Options:**
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)
- `--output` - Custom output path (default: `../full-narration.wav`)

**Output:** A single `full-narration.wav` in the parent directory of the scenes folder

### scripts/narrate-scenes.py (Legacy)

Generate TTS audio for each scene file separately. Not recommended: can cause volume inconsistencies when concatenated.

```bash
python scripts/narrate-scenes.py ./output/scenes/
python scripts/narrate-scenes.py ./output/scenes/ --force
python scripts/narrate-scenes.py ./output/scenes/ --speaker jwynia
```

**Options:**
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)

**Output:** `.wav` files alongside each `.txt` file

### scripts/transcribe-full.py (Recommended)

Transcribe audio with word-level timestamps using Python's openai-whisper.

```bash
python scripts/transcribe-full.py ./output/full-narration.wav
python scripts/transcribe-full.py ./output/full-narration.wav --model large-v3
python scripts/transcribe-full.py ./output/full-narration.wav --output-dir ./captions/
```

**Options:**
- `--model` - Whisper model: tiny, base, small, medium, large, large-v2, large-v3 (default: medium)
- `--output-dir` - Output directory (default: same as audio file)

**Output:**
- `.vtt` file with word-level timestamps
- `.json` file with a captions array for Remotion

**Dependencies:** Requires `openai-whisper` in the Python environment:

```bash
pip install openai-whisper
```

### scripts/extract-scene-boundaries.py

Extract scene timing boundaries from the transcript by matching scene opening phrases.

```bash
# Human-readable table
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json

# JSON output
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --json

# TypeScript for Video.tsx
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

**Options:**
- `--json` - Output as a JSON array
- `--typescript` - Output as TypeScript code for the Video.tsx scenes array

**Output:** Scene numbers, slugs, start times, and durations

### scripts/transcribe-scenes.ts (Deprecated)

Deprecated: requires the whisper-cpp binary, which may not be installed. Use `transcribe-full.py` instead.

Transcribe per-scene audio files using whisper-cpp.

```bash
deno run --allow-read --allow-write --allow-run scripts/transcribe-scenes.ts ./output/scenes/
```

**Output:** `.vtt` files with word-level timestamps

### scripts/full-pipeline.ts

Orchestrate the complete pipeline.

```bash
deno run -A scripts/full-pipeline.ts input.md --output ./output/project-name/
```

**Options:**
- `--output` - Output directory (required)
- `--adapt` - Apply spoken adaptation
- `--skip-tts` - Skip audio generation (text only)
- `--skip-transcribe` - Skip Whisper transcription

## Output Format

### manifest.json

```json
{
  "source": "appliance-vs-trade-tool-draft.md",
  "created_at": "2024-01-15T10:30:00Z",
  "total_scenes": 9,
  "total_duration_seconds": 420,
  "scenes": [
    {
      "number": 1,
      "slug": "popcorn-opening",
      "word_count": 185,
      "audio_duration_seconds": 55.2,
      "files": {
        "text": "scenes/01-popcorn-opening.txt",
        "audio": "scenes/01-popcorn-opening.wav",
        "captions": "scenes/01-popcorn-opening.vtt"
      },
      "captions": [
        { "text": "Two", "startMs": 0, "endMs": 180, "confidence": 0.98 },
        { "text": "people", "startMs": 180, "endMs": 450, "confidence": 0.97 }
      ]
    }
  ]
}
```
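For video timing, the per-scene durations in this manifest convert directly into cumulative start frames. A minimal sketch, assuming the schema shown above (the `fps` value is an assumption, not something the manifest stores):

```python
# Sketch: turn manifest scene durations into cumulative start frames for
# sequencing. Assumes the manifest.json schema shown above.

def scene_start_frames(manifest: dict, fps: int = 30) -> list[dict]:
    """Return [{'slug', 'from_frame', 'duration_frames'}] in scene order."""
    out = []
    cursor = 0
    for scene in manifest["scenes"]:
        duration = round(scene["audio_duration_seconds"] * fps)
        out.append({"slug": scene["slug"],
                    "from_frame": cursor,
                    "duration_frames": duration})
        cursor += duration  # next scene starts where this one ends
    return out
```

Rounding per scene (rather than rounding the running total) keeps each scene's frame count stable even if earlier durations change slightly after re-narration.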

### VTT Format

```vtt
WEBVTT

00:00.000 --> 00:00.180
Two

00:00.180 --> 00:00.450
people

00:00.450 --> 00:00.720
walk

00:00.720 --> 00:01.100
into
```
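Cues in this format parse back into the caption objects used in manifest.json. A minimal sketch, assuming the one-word-per-cue layout and MM:SS.mmm timestamps shown above (full-spec VTT allows richer cues this does not handle):

```python
# Sketch: parse word-level WEBVTT cues (as in the sample above) into
# {text, startMs, endMs} dicts. Assumes one word per cue and MM:SS.mmm
# timestamps; not a general-purpose VTT parser.
import re

CUE = re.compile(r"(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\n(.+)")

def parse_vtt_words(vtt: str) -> list[dict]:
    words = []
    for m in CUE.finditer(vtt):
        start = (int(m[1]) * 60 + float(m[2])) * 1000
        end = (int(m[3]) * 60 + float(m[4])) * 1000
        words.append({"text": m[5].strip(),
                      "startMs": round(start),
                      "endMs": round(end)})
    return words
```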

## Spoken Adaptation

When `--adapt` is enabled, the skill transforms written conventions into spoken equivalents:

| Written | Spoken |
| --- | --- |
| Parenthetical asides | Em-dash or separate sentence |
| "e.g." | "for example" |
| "i.e." | "that is" |
| Long nested clauses | Split into multiple sentences |
| Semicolons | Periods |
| *emphasis* | Context-appropriate stress |

Preserve:
- Author's voice and tone
- Rhetorical questions
- Deliberate repetition
- Key phrases and memorable formulations
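A few of these substitutions are purely mechanical and can be sketched with regular expressions. This illustrates the idea only; it is not the shipped `--adapt` implementation:

```python
# Sketch of the mechanical substitutions above: abbreviation expansion
# and semicolon-to-sentence splitting. Illustrative, not the real --adapt.
import re

RULES = [
    (re.compile(r"\be\.g\.,?", re.IGNORECASE), "for example,"),
    (re.compile(r"\bi\.e\.,?", re.IGNORECASE), "that is,"),
    # Semicolons become sentence breaks; capitalize the next word.
    (re.compile(r";\s+(\w)"), lambda m: ". " + m[1].upper()),
]

def adapt_for_speech(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Parenthetical asides and nested clauses need structural rewriting that regexes cannot do safely, which is why the "preserve voice" judgment stays with the agent.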

## Integration

### With remotion-designer

- Pass the manifest scene list to remotion-designer
- Each scene becomes a visual design unit
- Word-level timing drives text animation

### With Remotion Compositions

```tsx
import { Audio, Sequence, staticFile } from 'remotion';
import manifest from './output/manifest.json';

// Use scene durations for Sequence timing. accumulatedFrames is the
// running total of the previous scenes' durations in frames; fps comes
// from the composition config.
{manifest.scenes.map((scene, i) => (
  <Sequence
    key={i}
    from={accumulatedFrames}
    durationInFrames={scene.audio_duration_seconds * fps}
  >
    <Audio src={staticFile(scene.files.audio)} />
    <CaptionRenderer captions={scene.captions} />
  </Sequence>
))}
```

## Technical Notes

### WAV Format Conversion

Whisper requires 16kHz mono WAV. The pipeline handles conversion automatically:

```bash
ffmpeg -i input.wav -ar 16000 -ac 1 output_16khz.wav
```

### TTS Model

The fine-tuned voice model (~7.8GB) lives at `tts/model/` (not tracked in git; see Model Setup above). It uses Qwen3-TTS with a custom speaker embedding.

### Performance

- TTS: ~5-30 seconds per sentence (Apple Silicon MPS or NVIDIA CUDA)
- Whisper: ~0.5-2x realtime, depending on model size
- Full essay (~2000 words): ~10-20 minutes total processing

## What This Skill Does NOT Do

- Generate video visuals (use remotion-designer)
- Real-time voice synthesis
- Multi-speaker dialogue
- Edit or improve the content's argument
- Make editorial changes beyond mechanical spoken adaptation