tts

When to Use

  • User wants to convert text to spoken audio
  • User asks for "read aloud", "TTS", "text to speech", "voice narration"
  • User says "朗读", "配音", "语音合成"
  • User wants multi-speaker scripted audio or dialogue

When NOT to Use

  • User wants a podcast-style discussion with topic exploration (use /podcast)
  • User wants an explainer video with visuals (use /explainer)
  • User wants to generate an image (use /image-gen)

Purpose

Convert text into natural-sounding speech audio. Two paths:
  1. Quick mode (/v1/tts): Single voice, low-latency, synchronous MP3 stream. For casual chat, reading snippets, instant audio.
  2. Script mode (/v1/speech): Multi-speaker, per-segment voice assignment. For dialogue, audiobooks, scripted content.

Hard Constraints

  • No shell scripts. Construct curl commands from the API reference files listed under API Reference
  • Always read shared/authentication.md for API key and headers
  • Follow shared/common-patterns.md for errors and interaction patterns
  • Never hardcode speaker IDs; always fetch them from the speakers API
  • Always read config following shared/config-pattern.md before any interaction
  • Always follow shared/speaker-selection.md for speaker selection (text table + free-text input)
  • Never save files to ~/Downloads/ or /tmp/ as primary output; use .listenhub/tts/ instead
<HARD-GATE> Use the AskUserQuestion tool for every multiple-choice step — do NOT print options as plain text. Ask one question at a time. Wait for the user's answer before proceeding to the next step. After all parameters are collected, summarize the choices and ask the user to confirm. Do NOT call any generation API until the user has explicitly confirmed. </HARD-GATE>

Mode Detection

Determine the mode from the user's input automatically before asking any questions:

| Signal | Mode |
| --- | --- |
| "多角色", "脚本", "对话", "script", "dialogue", "multi-speaker" | Script |
| Multiple characters mentioned by name or role | Script |
| Input contains structured segments (A: ..., B: ...) | Script |
| Single paragraph of text, no character markers | Quick |
| "读一下", "read this", "TTS", "朗读" with plain text | Quick |
| Ambiguous | Quick (default) |
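
The keyword and structured-segment signals can be sketched as a small shell function, for illustration only. The function name `detect_mode`, the two-line threshold for structured segments, and the regular expression are assumptions rather than part of the skill. The "multiple characters mentioned by name or role" signal needs judgment and is not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill): classify input as "script" or "quick".
detect_mode() {
  local input="$1"

  # Chinese keywords from the signal table, matched as substrings.
  case "$input" in
    *多角色*|*脚本*|*对话*) echo "script"; return 0 ;;
  esac

  # English keywords, matched case-insensitively as whole words,
  # so that "description" does not count as "script".
  if printf '%s\n' "$input" | grep -qiwE 'script|dialogue|multi-speaker'; then
    echo "script"; return 0
  fi

  # Structured segments: at least two lines shaped like "Name: text".
  local segments
  segments=$(printf '%s\n' "$input" | grep -cE '^[[:space:]]*[^:[:space:]][^:]*:[[:space:]]*[^[:space:]]')
  if [ "$segments" -ge 2 ]; then
    echo "script"
  else
    echo "quick"   # single paragraphs and ambiguous input default to Quick
  fi
}

detect_mode "TTS this: The server will be down for maintenance at midnight."
detect_mode "$(printf 'A: Welcome everyone\nB: Thanks for having me')"
```

The fall-through to "quick" mirrors the table's default for ambiguous input.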

Interaction Flow

Step -1: API Key Check

Follow shared/config-pattern.md § API Key Check. If the key is missing, stop immediately.

Step 0: Config Setup

Follow shared/config-pattern.md Step 0.

**If file doesn't exist**: ask location, then create immediately:
bash
mkdir -p ".listenhub/tts"
echo '{"outputDir":".listenhub","outputMode":"inline","language":null,"defaultSpeakers":{}}' > ".listenhub/tts/config.json"
CONFIG_PATH=".listenhub/tts/config.json"
# (or $HOME/.listenhub/tts/config.json for global)

Then run **Setup Flow** below.

**If file exists**: read config, display summary, and confirm:
当前配置 (tts):
  输出方式:{inline / download / both}
  语言偏好:{zh / en / 未设置}
  默认主播:{speakerName / 未设置}
Ask: "使用已保存的配置?" → **确认,直接继续** / **重新配置**

Setup Flow (first run or reconfigure)

Ask these questions in order, then save all answers to config at once:
  1. outputMode: Follow shared/output-mode.md § Setup Flow Question.
  2. Language (optional): "默认语言?"
    • "中文 (zh)"
    • "English (en)"
    • "每次手动选择" → keep null
After collecting answers, save immediately:
bash
# Save outputMode; only update language if user picked one
# Follow shared/output-mode.md § Save to Config
NEW_CONFIG=$(echo "$CONFIG" | jq --arg m "$OUTPUT_MODE" '. + {"outputMode": $m}')

# If language was chosen (not "每次手动选择"):
NEW_CONFIG=$(echo "$NEW_CONFIG" | jq --arg lang "zh" '. + {"language": $lang}')
echo "$NEW_CONFIG" > "$CONFIG_PATH"
CONFIG=$(cat "$CONFIG_PATH")

Note: `defaultSpeakers` are saved after speaker selection in Step 3, not here.

Quick Mode: POST /v1/tts

Step 1: Extract text
Get the text to convert. If the user hasn't provided it, ask:
"What text would you like me to read aloud?"
Step 2: Determine voice
  • If config.defaultSpeakers.{language}[0] is set → use it silently (skip to Step 4)
  • Otherwise: GET /speakers/list?language={detected-language}, then follow shared/speaker-selection.md (text table + free-text input)
Step 3: Save preference
Question: "Save {voice name} as your default voice for {language}?"
Options:
  - "Yes" — update .listenhub/tts/config.json
  - "No" — use for this session only
Step 4: Confirm
Ready to generate:

  Text: "{first 80 chars}..."
  Voice: {voice name}

Proceed?
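
The confirmation preview shows only the start of the text. A minimal sketch of that truncation follows, assuming bash substring expansion. The function name `preview_text` is hypothetical, and it appends the ellipsis only when the text was actually cut, which is a small liberty relative to the template above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the quoted Text value for the confirmation summary.
preview_text() {
  local text="$1" limit=80
  if [ "${#text}" -gt "$limit" ]; then
    # Keep the first 80 characters and mark the cut.
    printf '"%s..."\n' "${text:0:$limit}"
  else
    printf '"%s"\n' "$text"
  fi
}

preview_text "The server will be down for maintenance at midnight."
```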
Step 5: Generate
bash
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "...", "voice": "..."}' \
  --output /tmp/tts-output.mp3
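
A hand-written JSON body breaks as soon as the text contains quotes or newlines. One way to avoid that, assuming jq is available (the Setup Flow commands already rely on it), is to let jq do the escaping. The variable names `TEXT`, `VOICE`, and `BODY` are illustrative, and the voice value is a placeholder rather than a real speaker ID.

```bash
#!/usr/bin/env bash
# Illustrative values; in practice VOICE comes from the speakers API.
TEXT='She said "hello" and left.'
VOICE='placeholder-speaker-id'

# jq -n builds JSON from scratch; --arg passes shell strings in with proper escaping.
BODY=$(jq -cn --arg input "$TEXT" --arg voice "$VOICE" '{input: $input, voice: $voice}')
echo "$BODY"

# The body can then be passed to the curl command above with: -d "$BODY"
```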
Step 6: Present result
Read OUTPUT_MODE from config. Follow shared/output-mode.md for behavior.
Use a timestamped jobId: $(date +%s)
inline or both (TTS quick returns a sync audio stream, so there is no audioUrl):
bash
JOB_ID=$(date +%s)
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "...", "voice": "..."}' \
  --output "/tmp/tts-${JOB_ID}.mp3"
Then use the Read tool on /tmp/tts-{jobId}.mp3.
Present:
Audio generated!
download or both:
bash
JOB_ID=$(date +%s)
DATE=$(date +%Y-%m-%d)
JOB_DIR=".listenhub/tts/${DATE}-${JOB_ID}"
mkdir -p "$JOB_DIR"
curl -sS -X POST "https://api.marswave.ai/openapi/v1/tts" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "...", "voice": "..."}' \
  --output "${JOB_DIR}/${JOB_ID}.mp3"
Present:
Audio generated!

已下载到 .listenhub/tts/{YYYY-MM-DD}-{jobId}/:
  {jobId}.mp3

Script Mode: POST /v1/speech

Step 1: Get scripts
Determine whether the user already has a scripts array:
  • Already provided (JSON or clear segments): parse and display for confirmation
  • Not yet provided: help the user structure segments. Ask:
    "Please provide the script with speaker assignments. Format: each line as SpeakerName: text content. I'll convert it."
    Once the user provides the script, parse it into the scripts JSON format.
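
Parsing "SpeakerName: text content" lines can be sketched with jq's raw-input mode. This assumes jq 1.5 or later and an ASCII colon as the separator. The function name `parse_script` and the intermediate `character` field are hypothetical; mapping characters to `speakerId` values happens afterwards, during voice assignment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn "Name: text" lines on stdin into a JSON array
# of {character, content} objects. Lines without a colon are skipped.
parse_script() {
  jq -R -s -c '
    split("\n")
    | map(capture("^(?<character>[^:]+):\\s*(?<content>.+)$"))
  '
}

printf 'A: Welcome everyone\nB: Thanks for having me\n' | parse_script
```

Displaying this intermediate array back to the user is one way to do the "parse and display for confirmation" step before any voice is assigned.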
Step 2: Assign voices per character
For each unique character in the script:
  • If config.defaultSpeakers.{language} has saved voices → auto-assign silently (one per character in order)
  • Otherwise: fetch GET /speakers/list?language={detected-language} and follow shared/speaker-selection.md for each character
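
The "one per character in order" rule can be illustrated with standard tools: collect the unique character names in order of first appearance, then pair them line by line with the saved voices. The file names, the tab-separated output, and the voice values below are placeholders. Real IDs come from config.defaultSpeakers.{language} or the speakers API.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: pair characters with saved voices, in order.
workdir=$(mktemp -d)

# Script lines in "Name: text" form.
printf 'A: Welcome everyone\nB: Thanks for having me\nA: Let us begin\n' > "$workdir/script.txt"

# Saved voices, one per line (placeholders, not real speaker IDs).
printf 'voice-1\nvoice-2\n' > "$workdir/voices.txt"

# Unique character names in order of first appearance.
awk -F: '!seen[$1]++ { print $1 }' "$workdir/script.txt" > "$workdir/characters.txt"

# paste joins line N of each file with a tab between them.
ASSIGNMENTS=$(paste "$workdir/characters.txt" "$workdir/voices.txt")
echo "$ASSIGNMENTS"

rm -r "$workdir"
```

If the script has more characters than saved voices, the extra rows come out with an empty voice column. The text above does not say how to handle that case; falling back to the selection flow for those characters seems the natural reading.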
Step 3: Save preferences
After all voices are assigned (if any were new):
Question: "Save these voice assignments for future sessions?"
Options:
  - "Yes" — update defaultSpeakers in .listenhub/tts/config.json
  - "No" — use for this session only
Step 4: Confirm
Ready to generate:

  Characters:
    {name}: {voice}
    {name}: {voice}
  Segments: {count}
  Title: (auto-generated)

Proceed?
Step 5: Generate
Write the request body to a temp file, then submit:
bash
# Write request to temp file
cat > /tmp/lh-speech-request.json << 'ENDJSON'
{
  "scripts": [
    {"content": "...", "speakerId": "..."},
    {"content": "...", "speakerId": "..."}
  ]
}
ENDJSON

# Submit
curl -sS -X POST "https://api.marswave.ai/openapi/v1/speech" \
  -H "Authorization: Bearer $LISTENHUB_API_KEY" \
  -H "Content-Type: application/json" \
  -d @/tmp/lh-speech-request.json
rm /tmp/lh-speech-request.json

**Step 6: Present result**

Read `OUTPUT_MODE` from config. Follow `shared/output-mode.md` for behavior.

**`inline` or `both`**: Display the `audioUrl` and `subtitlesUrl` as clickable links.

Present:
Audio generated!
在线收听:{audioUrl}
字幕:{subtitlesUrl}
时长:{audioDuration / 1000}s
消耗积分:{credits}

**`download` or `both`**: Also download the file.
```bash
DATE=$(date +%Y-%m-%d)
JOB_DIR=".listenhub/tts/${DATE}-{jobId}"
mkdir -p "$JOB_DIR"
curl -sS -o "${JOB_DIR}/{jobId}.mp3" "{audioUrl}"
```
Present the download path in addition to the above summary.
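
The summary divides `audioDuration` by 1000, which suggests the API reports the duration in milliseconds. Under that assumption, the conversion can be sketched with awk. The function name `format_duration` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render a millisecond duration as seconds for the summary.
format_duration() {
  # %g drops trailing zeros, so whole seconds print without a decimal part.
  awk -v ms="$1" 'BEGIN { printf "%gs\n", ms / 1000 }'
}

format_duration 12500
format_duration 60000
```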

Updating Config

When saving preferences, merge into .listenhub/tts/config.json; do not overwrite unchanged keys. Follow the merge pattern in shared/config-pattern.md.
  • Quick voice: set defaultSpeakers.{language}[0] to the selected speakerId
  • Script voices: set defaultSpeakers.{language} to the full array assigned this session
  • Language: set language if the user explicitly specifies it
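
A merge that leaves unchanged keys alone can be sketched with jq path assignment, in the same style as the Setup Flow commands. The temp-file handling, the variable names, and the speaker ID value are illustrative assumptions. The authoritative pattern is the one in shared/config-pattern.md.

```bash
#!/usr/bin/env bash
# Work on a throwaway copy so the sketch does not touch a real config.
CONFIG_PATH=$(mktemp)
echo '{"outputDir":".listenhub","outputMode":"inline","language":null,"defaultSpeakers":{}}' > "$CONFIG_PATH"

LANGUAGE="zh"
SPEAKER_ID="placeholder-speaker-id"   # would come from the speakers API

# Assigning to a path only touches that path; all other keys are kept as they are.
NEW_CONFIG=$(jq --arg lang "$LANGUAGE" --arg id "$SPEAKER_ID" \
  '.defaultSpeakers[$lang][0] = $id' "$CONFIG_PATH")

# Write back only if jq produced output.
[ -n "$NEW_CONFIG" ] && echo "$NEW_CONFIG" > "$CONFIG_PATH"

cat "$CONFIG_PATH"
```

For Script voices, the same approach would assign a whole array to `.defaultSpeakers[$lang]` instead of a single index.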

API Reference

  • TTS & Speech endpoints: shared/api-tts.md
  • Speaker list: shared/api-speakers.md
  • Speaker selection guide: shared/speaker-selection.md
  • Error handling: shared/common-patterns.md § Error Handling
  • Long text input: shared/common-patterns.md § Long Text Input

Composability

  • Invokes: speakers API (for speaker selection)
  • Invoked by: explainer (for voiceover)

Examples

Quick mode:
"TTS this: The server will be down for maintenance at midnight."
  1. Detect: Quick mode (plain text, "TTS this")
  2. Read config: defaultSpeakers.{language}[0] is not set
  3. Fetch speakers, user picks "Yuanye"
  4. Ask to save → yes → update config
  5. POST /v1/tts with input + voice
  6. Present: /tmp/tts-output.mp3
Script mode:
"帮我做一段双人对话配音,A说:欢迎大家,B说:谢谢邀请"
  1. Detect: Script mode ("双人对话")
  2. Parse segments: A → "欢迎大家", B → "谢谢邀请"
  3. Read config: defaultSpeakers.zh is empty
  4. Fetch zh speakers, assign A and B voices
  5. Ask to save → yes → update config
  6. POST /v1/speech with scripts array
  7. Present: audioUrl, subtitlesUrl, duration