stepfun-tts
StepFun stepaudio-2.5-tts
Generate Chinese / Japanese speech with `stepaudio-2.5-tts` (released 2026-04, verified 2026-04-23). This is contextual TTS: emotion and prosody go through natural-language description, not fixed labels.

Companion: for transcription with `stepaudio-2.5-asr` (the sibling model), use the `stepfun-asr` skill. The two share an API key but live on different endpoints with different body shapes.

Why this skill exists: StepAudio 2.5 has two non-obvious pitfalls that cost hours if you don't know them:
- `stepaudio-2.5-tts` rejects `voice_label` (the step-tts-2 way). Emotion/prosody now goes through `instruction` (natural-language description, ≤200 chars) and inline `()` parentheses inside the text itself.
- Censorship is stricter: anything containing 死 ("die") / 消失 ("disappear") / sensitive political terms returns `censorship_block`. Your rewrite options are in `references/migration_from_v2.md`.
Config and auth
The API key lives in `$STEPFUN_API_KEY` (preferred) or `${CLAUDE_PLUGIN_DATA}/config.json` (fallback for cross-session persistence). All bundled scripts try the environment variable first, then the config file.

First-time setup (one-liner):

```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste key here>"}
EOF
```

If the user hasn't set a key, ask them to paste it (don't guess, and don't use a placeholder). StepFun API keys are available at https://platform.stepfun.com/ → API Keys. Use a Normal key, not a Plan key (Plan keys are restricted to text models and silently fail on audio endpoints).
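The lookup order described above (environment variable first, then the config file) can be sketched as follows. This is a minimal illustration, not the bundled scripts' actual code: the function name `load_api_key` is hypothetical, and treating a value that still starts with `<` as "not set" is an assumption added so the template placeholder is never sent.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_api_key() -> Optional[str]:
    """Return the StepFun API key, or None when nothing usable is configured."""
    # 1. The environment variable takes precedence.
    key = os.environ.get("STEPFUN_API_KEY", "").strip()
    if key:
        return key

    # 2. Fall back to ${CLAUDE_PLUGIN_DATA}/config.json.
    data_dir = os.environ.get("CLAUDE_PLUGIN_DATA")
    if not data_dir:
        return None
    try:
        config = json.loads(Path(data_dir, "config.json").read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    raw = config.get("api_key") if isinstance(config, dict) else None
    if not isinstance(raw, str):
        return None
    key = raw.strip()
    # Assumption: an unedited "<paste key here>" template counts as missing.
    if not key or key.startswith("<"):
        return None
    return key
```

A caller would treat `None` as the signal to ask the user for a key rather than proceeding with a guess.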
Common tasks — decision tree
| User wants... | Script | Key detail |
|---|---|---|
| Synthesize 1–500 char Chinese with emotion | | Use |
| Synthesize long text (500–1000 char) | | 1000 char is the hard cap; split at semantic boundaries above that |
| Batch-generate game/app voice lines | | Handle |
| A/B compare two TTS models | | Compares duration/size across two directories |
| Migrate from `step-tts-2` | see `references/migration_from_v2.md` | |
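The table says to split at semantic boundaries once text exceeds the 1000-character cap. One way to do that is sketched below, under stated assumptions: the function name `split_text` is hypothetical, the boundary punctuation sets are a guess at what counts as "semantic", and characters are counted as Python code points, which may not match how the API counts them.

```python
import re
from typing import Iterable, List

HARD_CAP = 1000  # per-request character cap stated in this document

# Split *after* sentence-ending punctuation so each piece keeps its own mark.
_SENTENCE_END = re.compile(r"(?<=[。!?!?;;\n])")
# Secondary boundaries, used only when one sentence alone exceeds the cap.
_CLAUSE_END = re.compile(r"(?<=[,,、])")


def _pack(pieces: Iterable[str], limit: int) -> List[str]:
    """Greedily join consecutive pieces into chunks of at most `limit` chars."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= limit:
            current += piece
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_text(text: str, limit: int = HARD_CAP) -> List[str]:
    """Split text into chunks no longer than `limit`, preferring sentence ends."""
    if limit < 1:
        raise ValueError("limit must be positive")
    pieces: List[str] = []
    for sentence in filter(None, _SENTENCE_END.split(text)):
        if len(sentence) <= limit:
            pieces.append(sentence)
            continue
        for clause in filter(None, _CLAUSE_END.split(sentence)):
            # Last resort: slice an over-long clause at the limit.
            pieces.extend(clause[i:i + limit] for i in range(0, len(clause), limit))
    return _pack(pieces, limit)
```

Joining the returned chunks reproduces the original text exactly, so nothing is dropped; only the request boundaries move.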
Starting points
- Synthesize a single line: run `python3 scripts/tts_generate.py --text "你好" --out /tmp/hello.mp3 --instruction "温暖的希望感"`. For fine-grained control read the "Contextual TTS" section below.
- A full migration from `step-tts-2` → `stepaudio-2.5-tts`: read `references/migration_from_v2.md` end-to-end before touching code. It has the `INSTRUCTION_MAP`, the SKIP_CENSORED list pattern, and the output-directory strategy for non-destructive A/B.
Contextual TTS — beyond emotion labels
The headline feature of `stepaudio-2.5-tts` is that you stop mapping emotions to fixed tags and start describing what you want in natural language. Two layers:

Global context (`instruction` parameter): sets the overall tone for the entire utterance. ≤200 chars. Think of it like giving stage direction to a voice actor.

`instruction: "克制的悲伤,语气低沉柔弱,像快要消失一样"`

Inline context (`()` parentheses inside `input`): in-sentence directives. Parenthesised content is consumed as directions and is NOT read aloud. Use it for precise control of pauses, breath, emphasis, or mid-sentence emotion shifts.

`input: "(试探着问)你好吗?(开心地)太好了!(突然沉下来)不过...我快要消失了。"`

Examples that worked in practice (from 2026-04-23 verification):
- `instruction: "活泼俏皮,像是在撒娇,带点嘴硬"`: visibly speeds up delivery vs neutral
- `instruction: "耳语声,气声很重,几乎听不清"`: produces audible whisper/breath
- `input: "你好(停顿一下)我是蕾格(轻声)今天(加重)的天气真不错。"`: inline directives all respected

What `stepaudio-2.5-tts` will NOT accept: the `voice_label` parameter. Error: `voice_label is not supported for v2 models`. This is the #1 migration gotcha from step-tts-2.
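The two layers and the `voice_label` restriction can be checked on the client before a request is sent. The sketch below is an illustration only: `build_speech_payload` is a hypothetical helper, the `model` field name is an assumption (the exact body shape is in `references/api_reference.md`), and counting inline directives toward the 1000-character cap is likewise assumed.

```python
from typing import Dict

MODEL = "stepaudio-2.5-tts"
MAX_INSTRUCTION_CHARS = 200  # limit stated in this document
MAX_INPUT_CHARS = 1000       # hard cap stated in this document


def build_speech_payload(text: str, instruction: str = "", **extra: object) -> Dict[str, object]:
    """Validate inputs and return a request body as a plain dict."""
    if "voice_label" in extra:
        # The model rejects this parameter; fail before spending a request.
        raise ValueError("voice_label is not accepted by stepaudio-2.5-tts; use instruction")
    if not text:
        raise ValueError("input text is empty")
    if len(text) > MAX_INPUT_CHARS:
        # Assumption: inline (...) directives count toward the cap.
        raise ValueError(f"input is {len(text)} chars; split it below {MAX_INPUT_CHARS}")
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError(f"instruction is {len(instruction)} chars; limit is {MAX_INSTRUCTION_CHARS}")

    payload: Dict[str, object] = {"model": MODEL, "input": text}  # "model" key is assumed
    if instruction:
        payload["instruction"] = instruction  # global context layer
    payload.update(extra)  # any other fields the API reference documents
    return payload
```

For example, `build_speech_payload("(轻声)你好", "温暖的希望感")` yields a dict carrying both layers: the global `instruction` and the inline directive embedded in `input`.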
Common error patterns (real errors, real fixes)
| Error response | Actual cause | Fix |
|---|---|---|
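| `voice_label is not supported for v2 models` | Sent `voice_label` | Remove `voice_label` |
| `censorship_block` | Sensitive word (死 / 消失 / etc.) | Rewrite the phrase OR fall back to `step-tts-2` |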
| Silent audio truncation (input > 1000 chars) | Hard cap exceeded | Split at semantic boundaries; don't truncate mid-sentence |
More in `references/known_issues.md`.
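The error table can be expressed as a small triage helper. This is a rough sketch: `suggest_fix` is a hypothetical name, and matching on substrings of the error text is an assumption, since the exact error response format is documented in `references/api_reference.md` rather than here.

```python
from typing import Optional

HARD_CAP = 1000  # input character cap stated in this document


def suggest_fix(error_text: Optional[str], input_chars: int) -> str:
    """Map an error message (or a silent failure) to the fix from the table."""
    text = (error_text or "").lower()
    if "voice_label" in text:
        return "remove voice_label; use instruction and inline () instead"
    if "censorship_block" in text:
        return "rewrite the phrase, or fall back to step-tts-2 for this line"
    if input_chars > HARD_CAP:
        # No error is returned in this case; the audio is silently truncated.
        return "split the input at semantic boundaries"
    return "unrecognised; see references/known_issues.md"
```

The length check comes last because truncation produces no error message at all, so it can only be detected from the input size.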
When to read references
- `references/api_reference.md`: exact request/response JSON for `/v1/audio/speech`, all fields, error responses. Read when writing raw HTTP calls instead of using the bundled scripts.
- `references/migration_from_v2.md`: complete playbook for moving a step-tts-2 project to stepaudio-2.5-tts. Has the emotion→instruction rewrite table, the A/B directory strategy, decision checkpoints, and the 2026-04 speed/quality trade-off data (`stepaudio-2.5-tts` is ~20% slower than step-tts-2; audible prosody improvement). Read before any migration work.
- `references/known_issues.md`: censorship patterns, TTS duration inflation, v2-family parameter naming gotcha, 1000-char hard cap. Read when debugging anomalous output or evaluating whether to adopt.
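For readers who do write raw HTTP calls, the sketch below shows how a request to `/v1/audio/speech` might be assembled with the standard library. Everything beyond the endpoint path is an assumption to check against `references/api_reference.md`: the helper name `build_speech_request` is hypothetical, `base_url` must be supplied by the caller, and bearer-token authentication is assumed rather than confirmed by this document.

```python
import json
import urllib.request
from typing import Mapping


def build_speech_request(
    base_url: str, api_key: str, payload: Mapping[str, object]
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the speech endpoint."""
    # Keep Chinese text readable in the body instead of \u escapes.
    body = json.dumps(dict(payload), ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Building the request separately keeps it easy to inspect or log (minus the key) before any quota is spent.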
Design invariants (don't break these)
- Non-destructive A/B output: when regenerating a corpus with a new model, write to a parallel directory (`voice/zh_v25/`), never overwrite the production corpus. The migration playbook shows why.
- Per-line censorship handling: if 2/29 lines get `censorship_block`, don't fail the batch. Log the skipped IDs, continue. Mixed-model fallback (step-tts-2 for the skipped 2) is normal.
- Don't duplicate voice_label logic in new code: any new TTS code targeting stepaudio-2.5-tts should only use `instruction` + inline `()`. Do not write a branch that conditionally emits `voice_label`.
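The first two invariants combine naturally in a batch loop. The following is a minimal sketch, not the bundled batch script: `run_batch` and the `CensorshipBlocked` exception are hypothetical names, and the `synthesize` callable is supplied by the caller, who is assumed to raise `CensorshipBlocked` when the API returns `censorship_block`.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


class CensorshipBlocked(Exception):
    """Hypothetical: raised by synthesize() when the API returns censorship_block."""


def run_batch(
    lines: Dict[str, str],
    synthesize: Callable[[str], bytes],
    out_dir: Path,
) -> Tuple[List[str], List[str]]:
    """Synthesize each line into out_dir; return (written_ids, skipped_ids)."""
    # out_dir should be a parallel directory such as voice/zh_v25/,
    # never the production corpus itself.
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[str] = []
    skipped: List[str] = []
    for line_id, text in lines.items():
        try:
            audio = synthesize(text)
        except CensorshipBlocked:
            # One blocked line must not fail the whole batch.
            skipped.append(line_id)
            continue
        (out_dir / f"{line_id}.mp3").write_bytes(audio)
        written.append(line_id)
    return written, skipped
```

The returned `skipped` list is what a caller would log and then hand to a step-tts-2 fallback pass or a manual rewrite.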
Pricing (verified 2026-04-23, volatile)
- `stepaudio-2.5-tts` contextual synthesis: ~5.8 CNY per 10,000 characters
- Zero-shot voice cloning: ~9.9 CNY per voice
Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
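For a quick budget check, the synthesis rate above translates into simple arithmetic. This is a back-of-envelope sketch: `estimate_cost_cny` is a hypothetical helper, the rate is the approximate figure quoted here, and how the platform counts billable characters (for example, whether inline directives are billed) is not stated in this document.

```python
CNY_PER_10K_CHARS = 5.8  # approximate rate quoted in this document; re-verify before use


def estimate_cost_cny(total_chars: int, rate_per_10k: float = CNY_PER_10K_CHARS) -> float:
    """Estimate synthesis cost in CNY for a given number of characters."""
    if total_chars < 0:
        raise ValueError("total_chars must be non-negative")
    return total_chars / 10_000 * rate_per_10k
```

Passing the rate as a parameter makes it easy to rerun the estimate after checking the pricing page.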