tts


Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.

Triggers

  • text to speech / tts / read aloud / generate audio
  • voice clone / subtitle dubbing / srt to audio
  • epub to audio / markdown to audio / kokoro

Simple Mode — text to audio


Kokoro (auto-detected when installed)

```bash
bash skills/tts/scripts/tts.sh speak -t "Hello world" -v af_sarah -o hello.wav
bash skills/tts/scripts/tts.sh speak -f article.txt -v zf_xiaoni --lang cmn -o out.mp3 --format mp3
```

Noiz (auto-detected when NOIZ_API_KEY is set, or force with --backend noiz)

If --voice-id is omitted, the script prints 5 available built-in voices and exits.

Pick one from the output and re-run with --voice-id <id>.

```bash
bash skills/tts/scripts/tts.sh speak -f input.txt --voice-id voice_abc --auto-emotion --emo '{"Joy":0.5}' -o out.wav
```

Voice cloning (Noiz only — no voice-id needed, uses ref audio)

```bash
bash skills/tts/scripts/tts.sh speak -t "Hello" --ref-audio ./ref.wav -o clone.wav
```
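
The auto-detection rules above (Kokoro when installed, Noiz when NOIZ_API_KEY is set, --backend to force) can be pictured as a small selection function. This is only a sketch: choose_backend is a hypothetical helper, not part of tts.sh. The precedence when both backends are available (Kokoro first here) is an assumption, and so is passing the Kokoro install state in as a yes/no argument.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the real detection logic inside tts.sh is not shown in this document.

# choose_backend FORCED KOKORO_INSTALLED
#   FORCED:           value given to --backend, or "" when the flag is absent
#   KOKORO_INSTALLED: "yes" or "no" (how tts.sh checks this is not documented here)
choose_backend() {
  local forced="$1" kokoro_installed="$2"
  if [ -n "$forced" ]; then
    echo "$forced"              # an explicit --backend always wins
  elif [ "$kokoro_installed" = yes ]; then
    echo kokoro                 # assumed precedence: local backend first
  elif [ -n "${NOIZ_API_KEY:-}" ]; then
    echo noiz                   # cloud backend when an API key is present
  else
    echo "no backend available" >&2
    return 1
  fi
}

# Example: no --backend flag, Kokoro installed.
backend=$(choose_backend "" yes) && echo "using $backend"
```

If the real script resolves the tie differently, pass --backend explicitly to remove any ambiguity.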

Timeline Mode — SRT to time-aligned audio

Use timeline mode when each segment must start and end at a precise time (dubbing, subtitles, video narration).

Step 1: Get or create an SRT

If the user doesn't have one, generate it from text:

```bash
bash skills/tts/scripts/tts.sh to-srt -i article.txt -o article.srt
bash skills/tts/scripts/tts.sh to-srt -i article.txt -o article.srt --cps 15 --gap 500
```

--cps = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write the SRT manually.
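
To get a feel for what a --cps value implies, the arithmetic is simply characters divided by characters per second. The sketch below is illustrative only: estimate_ms and srt_time are hypothetical helpers, not part of tts.sh, and how to-srt actually splits text into cues is not described here. The timestamp layout HH:MM:SS,mmm is the standard SRT format.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for hand-writing an SRT cue; not part of tts.sh.

# Estimated speaking time in milliseconds for TEXT at CPS characters per second.
# ${#text} counts characters in the current locale (bytes in some shells),
# so this is only exact for ASCII text.
estimate_ms() {
  local text="$1" cps="$2"
  echo $(( ${#text} * 1000 / cps ))
}

# Format a millisecond count as an SRT timestamp (HH:MM:SS,mmm).
srt_time() {
  local ms="$1"
  printf '%02d:%02d:%02d,%03d\n' \
    $(( ms / 3600000 )) $(( ms / 60000 % 60 )) $(( ms / 1000 % 60 )) $(( ms % 1000 ))
}

# Print a single cue for an English sentence at 15 cps.
text="Hello world, this is a test."
end_ms=$(estimate_ms "$text" 15)
printf '1\n%s --> %s\n%s\n' "$(srt_time 0)" "$(srt_time "$end_ms")" "$text"
```

At the default of 4 cps the same sentence would be given almost four times as long, which is why English text generally wants a higher value.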

Step 2: Create a voice map

A JSON file controlling default and per-segment voice settings. Keys under segments support a single index ("3") or a range ("5-8").

Kokoro voice map:

```json
{
  "default": { "voice": "zf_xiaoni", "lang": "cmn" },
  "segments": {
    "1": { "voice": "zm_yunxi" },
    "5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }
  }
}
```

Noiz voice map (adds emo and reference_audio support):

```json
{
  "default": { "voice_id": "voice_123", "target_lang": "zh" },
  "segments": {
    "1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },
    "2-4": { "reference_audio": "./refs/guest.wav" }
  }
}
```

See examples/ for full samples.
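
The segment keys in a voice map are either one index or a range. A minimal sketch of how such a key could be matched against a segment number follows. key_matches is a hypothetical helper, not part of tts.sh, and it assumes ranges are inclusive at both ends, which the examples suggest but do not state.

```bash
#!/usr/bin/env bash
# Hypothetical helper; assumes "5-8" means segments 5, 6, 7 and 8 (inclusive).

# Succeeds (exit status 0) when segment index IDX is covered by voice-map key KEY.
key_matches() {
  local key="$1" idx="$2" lo hi
  case "$key" in
    *-*) lo="${key%-*}"; hi="${key#*-}" ;;   # range key such as "5-8"
    *)   lo="$key";      hi="$key" ;;        # single-index key such as "3"
  esac
  [ "$idx" -ge "$lo" ] && [ "$idx" -le "$hi" ]
}

# Show which segments would pick up the "5-8" override.
for i in 4 5 8 9; do
  if key_matches "5-8" "$i"; then
    echo "segment $i: uses the 5-8 override"
  else
    echo "segment $i: uses the default voice"
  fi
done
```

Segments that match no key fall back to the "default" entry of the voice map.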

Step 3: Render

```bash
bash skills/tts/scripts/tts.sh render --srt input.srt --voice-map vm.json -o output.wav
bash skills/tts/scripts/tts.sh render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav
```

When to Choose Which

| Need | Recommended |
| --- | --- |
| Just read text aloud, no fuss | Kokoro (default) |
| EPUB/PDF audiobook with chapters | Kokoro (native support) |
| Voice blending ("v1:60,v2:40") | Kokoro |
| Voice cloning from reference audio | Noiz |
| Emotion control (emo param) | Noiz |
| Exact server-side duration per segment | Noiz |

When the user needs emotion control + voice cloning + precise duration together, Noiz is the only backend that supports all three.
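
The recommendations above can be encoded as a small lookup, which may help when an agent has to pick a backend from a list of user needs. This is a sketch only: recommend_backend and the need labels (read-aloud, audiobook, blend, clone, emotion, duration) are hypothetical names invented for illustration, not options of tts.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper that mirrors the recommendation table; not part of tts.sh.

# recommend_backend NEED...
# Prints "kokoro" or "noiz"; fails when the table points to different backends.
recommend_backend() {
  local need want_kokoro=no want_noiz=no
  for need in "$@"; do
    case "$need" in
      read-aloud)             ;;                  # table recommends Kokoro, the default
      audiobook|blend)        want_kokoro=yes ;;  # rows recommending Kokoro
      clone|emotion|duration) want_noiz=yes ;;    # rows recommending Noiz
      *) echo "unknown need: $need" >&2; return 2 ;;
    esac
  done
  if [ "$want_kokoro" = yes ] && [ "$want_noiz" = yes ]; then
    echo "conflict: the table recommends different backends for these needs" >&2
    return 1
  elif [ "$want_noiz" = yes ]; then
    echo noiz
  else
    echo kokoro
  fi
}

# Example: emotion control plus cloning plus exact duration.
recommend_backend emotion clone duration
```

The result can then be passed to --backend when calling speak or render.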

Requirements

  • ffmpeg in PATH (timeline mode)
  • Noiz: get your API key at developers.noiz.ai, then run
    bash skills/tts/scripts/tts.sh config --set-api-key YOUR_KEY
  • Kokoro: if already installed, pass --backend kokoro to use the local backend

For backend details and full argument reference, see reference.md.
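
Before running timeline mode it can be worth checking the ffmpeg requirement up front. The sketch below uses the standard command -v lookup; have_cmd is a hypothetical helper name, and whether tts.sh performs a similar check itself is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check; not part of tts.sh.

# Succeeds when the named command can be found in PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd ffmpeg; then
  echo "ffmpeg found: timeline mode requirement met"
else
  echo "ffmpeg not found in PATH: install it before using timeline mode" >&2
fi
```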