tts

Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.

Triggers

text to speech / tts / read aloud / generate audio
voice clone / subtitle dubbing / srt to audio
epub to audio / markdown to audio / kokoro

Simple Mode — text to audio

Kokoro (auto-detected when installed)

bash skills/tts/scripts/tts.sh speak -t "Hello world" -v af_sarah -o hello.wav
bash skills/tts/scripts/tts.sh speak -f article.txt -v zf_xiaoni --lang cmn -o out.mp3 --format mp3

Noiz (auto-detected when NOIZ_API_KEY is set, or force with --backend noiz)

If --voice-id is omitted, the script prints 5 available built-in voices and exits.

Pick one from the output and re-run with --voice-id <id>.

bash skills/tts/scripts/tts.sh speak -f input.txt --voice-id voice_abc --auto-emotion --emo '{"Joy":0.5}' -o out.wav

Voice cloning (Noiz only — no voice-id needed, uses ref audio)

bash skills/tts/scripts/tts.sh speak -t "Hello" --ref-audio ./ref.wav -o clone.wav

Timeline Mode — SRT to time-aligned audio

Use timeline mode when each segment needs precise timing (dubbing, subtitles, video narration).

Step 1: Get or create an SRT

If the user doesn't have one, generate it from text:

bash

bash skills/tts/scripts/tts.sh to-srt -i article.txt -o article.srt
bash skills/tts/scripts/tts.sh to-srt -i article.txt -o article.srt --cps 15 --gap 500

`--cps` = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write SRT manually.
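
When writing the SRT by hand, the file follows the usual SubRip layout: a cue index, a `start --> end` timing line in `HH:MM:SS,mmm` form, the cue text, and a blank line between cues. A minimal sketch, where the file name `manual.srt`, the cue text, and the timings are placeholders chosen for illustration:

```shell
# Write a two-cue SRT by hand (file name, text, and timings are placeholders).
cat > manual.srt <<'EOF'
1
00:00:00,000 --> 00:00:02,000
Hello and welcome.

2
00:00:02,500 --> 00:00:05,000
This is the second segment.
EOF

# Sanity check: count the cues by counting timing lines.
grep -c -- '-->' manual.srt
```

The resulting file can be passed to `render --srt` in Step 3 like any generated SRT.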

Step 2: Create a voice map

A JSON file that controls the default voice settings and per-segment overrides. `segments` keys accept a single index (`"3"`) or a range (`"5-8"`).

Kokoro voice map:

json

{
  "default": { "voice": "zf_xiaoni", "lang": "cmn" },
  "segments": {
    "1": { "voice": "zm_yunxi" },
    "5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }
  }
}

Noiz voice map (adds `emo` and `reference_audio` support):

json

{
  "default": { "voice_id": "voice_123", "target_lang": "zh" },
  "segments": {
    "1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },
    "2-4": { "reference_audio": "./refs/guest.wav" }
  }
}

See `examples/` for full samples.
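
To make the key syntax concrete, the helper below expands a `segments` key into the SRT indices it would cover. This is an illustrative sketch, not the script's own parser: `expand_key` is a hypothetical name, and it assumes a range is inclusive at both ends, which the description above does not state explicitly.

```shell
# Hypothetical helper: expand a voice-map key ("3" or "5-8") into indices.
# Assumption: a range "a-b" covers a through b inclusive.
expand_key() {
  case "$1" in
    *-*) seq "${1%-*}" "${1#*-}" ;;  # range: count from first to last index
    *)   echo "$1" ;;                # single index: print it unchanged
  esac
}

expand_key "3"
expand_key "5-8"
```

Under that assumption, the second call prints 5 through 8, one index per line, so the `"5-8"` entry in the Kokoro map would apply to four segments.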

Step 3: Render

bash

bash skills/tts/scripts/tts.sh render --srt input.srt --voice-map vm.json -o output.wav
bash skills/tts/scripts/tts.sh render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav

When to Choose Which

Need	Recommended
Just read text aloud, no fuss	Kokoro (default)
EPUB/PDF audiobook with chapters	Kokoro (native support)
Voice blending (`"v1:60,v2:40"`)	Kokoro
Voice cloning from reference audio	Noiz
Emotion control (`emo` param)	Noiz
Exact server-side duration per segment	Noiz

When the user needs emotion control, voice cloning, and precise duration together, Noiz is the only backend that supports all three.

Requirements

`ffmpeg` in PATH (timeline mode)

Noiz: get your API key at developers.noiz.ai, then run

bash skills/tts/scripts/tts.sh config --set-api-key YOUR_KEY

Kokoro: if already installed, pass `--backend kokoro` to use the local backend

For backend details and full argument reference, see reference.md.
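
A small preflight check can catch a missing `ffmpeg` before a timeline render is attempted. This sketch relies only on the POSIX `command -v` builtin; the helper name `have` and the messages are illustrative, not part of `tts.sh`.

```shell
# Return 0 if the named command is on PATH, non-zero otherwise.
have() { command -v "$1" >/dev/null 2>&1; }

if have ffmpeg; then
  echo "ffmpeg found: timeline mode can run"
else
  echo "ffmpeg missing: install it before using render" >&2
fi
```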
