Speech Generation Skill
Generate spoken audio for the current project (narration, product demo voiceover, IVR prompts, accessibility reads). Defaults to `gpt-4o-mini-tts-2025-12-15` and built-in voices, and prefers the bundled CLI for deterministic, reproducible runs.

When to use
- Generate a single spoken clip from text
- Generate a batch of prompts (many lines, many files)
Decision tree (single vs batch)
- If the user provides multiple lines/prompts or wants many outputs -> batch
- Else -> single
Workflow
- Decide intent: single vs batch (see decision tree above).
- Collect inputs up front: exact text (verbatim), desired voice, delivery style, format, and any constraints.
- If batch: write a temporary JSONL under tmp/ (one job per line), run once, then delete the JSONL.
- Augment instructions into a short labeled spec without rewriting the input text.
- Run the bundled CLI (`scripts/text_to_speech.py`) with sensible defaults (see `references/cli.md`).
- For important clips, validate: intelligibility, pacing, pronunciation, and adherence to constraints.
- Iterate with a single targeted change (voice, speed, or instructions), then re-check.
- Save/return final outputs and note the final text + instructions + flags used.
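The single-clip path above can be sketched with the OpenAI Python SDK, which this skill mandates for API calls. This is a minimal sketch, not the bundled CLI's actual code: `build_speech_request` and `synthesize_clip` are illustrative names. The live call is kept in a separate function so nothing touches the network until `OPENAI_API_KEY` is set.

```python
import os

MAX_INPUT_CHARS = 4096  # per-request input limit stated in this skill


def build_speech_request(text, voice="cedar", instructions=None,
                         model="gpt-4o-mini-tts-2025-12-15",
                         response_format="mp3"):
    """Assemble kwargs for client.audio.speech.create(); reject oversized input."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(
            f"input is {len(text)} chars; split into <= {MAX_INPUT_CHARS}-char chunks")
    kwargs = {"model": model, "voice": voice, "input": text,
              "response_format": response_format}
    if instructions:  # supported by GPT-4o mini TTS models, not tts-1 / tts-1-hd
        kwargs["instructions"] = instructions
    return kwargs


def synthesize_clip(text, out_path, **options):
    """Live API call; requires OPENAI_API_KEY and network access."""
    from openai import OpenAI  # uv pip install openai
    client = OpenAI()
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    req = build_speech_request(text, **options)
    with client.audio.speech.with_streaming_response.create(**req) as response:
        response.stream_to_file(out_path)
```

With a key set, `synthesize_clip("Welcome to the demo.", "output/speech/welcome.mp3", instructions="Tone: Friendly and confident.")` would write an MP3 under the output conventions below.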
Temp and output conventions
- Use `tmp/speech/` for intermediate files (for example JSONL batches); delete when done.
- Write final artifacts under `output/speech/` when working in this repo.
- Use `--out` or `--out-dir` to control output paths; keep filenames stable and descriptive.
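Staging a batch under these conventions can be sketched as follows. The job keys mirror the batch example later in this document (`input`, `voice`, `instructions`, `response_format`, `out`); `write_batch_jsonl` is an illustrative helper, not part of the bundled CLI.

```python
import json
import os


def write_batch_jsonl(jobs, path="tmp/speech/batch.jsonl"):
    """Write one JSON object per line; return the path to hand to the CLI run."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for job in jobs:
            f.write(json.dumps(job, ensure_ascii=False) + "\n")
    return path


jobs = [
    {"input": "Thank you for calling. Please hold.", "voice": "cedar",
     "response_format": "mp3", "out": "hold.mp3"},
    {"input": "For sales, press 1. For support, press 2.", "voice": "marin",
     "instructions": "Tone: Clear and neutral. Pacing: Slow.",
     "response_format": "wav"},
]
path = write_batch_jsonl(jobs)
# ... run the bundled CLI once against `path`, then clean up:
os.remove(path)
```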
Dependencies (install if missing)
Prefer `uv` for dependency management.

Python packages: `uv pip install openai`

If `uv` is unavailable: `python3 -m pip install openai`

Environment
- `OPENAI_API_KEY` must be set for live API calls.

If the key is missing, give the user these steps:
- Create an API key in the OpenAI platform UI: https://platform.openai.com/api-keys
- Set `OPENAI_API_KEY` as an environment variable in their system.
- Offer to guide them through setting the environment variable for their OS/shell if needed.
- Never ask the user to paste the full key in chat. Ask them to set it locally and confirm when ready.
If installation isn't possible in this environment, tell the user which dependency is missing and how to install it locally.
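If guiding the user, a typical flow looks like the sketch below (bash/zsh assumed; the key value shown is a placeholder, never the real key).

```shell
# bash/zsh: set for the current session (replace the placeholder with the real key)
export OPENAI_API_KEY="sk-placeholder"

# persist it for future sessions (zsh shown; use ~/.bashrc for bash)
# echo 'export OPENAI_API_KEY="sk-..."' >> ~/.zshrc

# Windows PowerShell equivalent:
# setx OPENAI_API_KEY "sk-..."

# confirm it is set without printing the value
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY is set"
```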
Defaults & rules
- Use `gpt-4o-mini-tts-2025-12-15` unless the user requests another model.
- Default voice: `cedar`. If the user wants a brighter tone, prefer `marin`.
- Built-in voices only. Custom voices are out of scope for this skill.
- `instructions` are supported for GPT-4o mini TTS models, but not for `tts-1` or `tts-1-hd`.
- Input length must be <= 4096 characters per request. Split longer text into chunks.
- Enforce 50 requests/minute. The CLI caps `--rpm` at 50.
- Require `OPENAI_API_KEY` before any live API call.
- Provide a clear disclosure to end users that the voice is AI-generated.
- Use the OpenAI Python SDK (`openai` package) for all API calls; do not use raw HTTP.
- Prefer the bundled CLI (`scripts/text_to_speech.py`) over writing new one-off scripts.
- Never modify `scripts/text_to_speech.py`. If something is missing, ask the user before doing anything else.
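The 4096-character rule above implies chunking for long scripts. A minimal sketch that prefers sentence boundaries and hard-splits only when a single sentence exceeds the limit (`split_for_tts` is an illustrative helper, not a CLI feature):

```python
import re

MAX_CHARS = 4096  # per-request input limit


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks <= max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself over the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be submitted as its own request, subject to the 50 requests/minute cap.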
Instruction augmentation
Reformat user direction into a short, labeled spec. Only make implicit details explicit; do not invent new requirements.
Quick clarification (augmentation vs invention):
- If the user says "narration for a demo", you may add implied delivery constraints (clear, steady pacing, friendly tone).
- Do not introduce a new persona, accent, or emotional style the user did not request.
Template (include only relevant lines):
Voice Affect: <overall character and texture of the voice>
Tone: <attitude, formality, warmth>
Pacing: <slow, steady, brisk>
Emotion: <key emotions to convey>
Pronunciation: <words to enunciate or emphasize>
Pauses: <where to add intentional pauses>
Emphasis: <key words or phrases to stress>
Delivery: <cadence or rhythm notes>

Augmentation rules:
- Keep it short; add only details the user already implied or provided elsewhere.
- Do not rewrite the input text.
- If any critical detail is missing and blocks success, ask a question; otherwise proceed.
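The template above can be assembled mechanically so only user-provided details appear, which keeps the spec short and avoids inventing requirements. A sketch (`build_instruction_spec` is an illustrative helper; the field names mirror the template labels):

```python
# Order matches the template in this skill; omitted fields are skipped.
FIELD_ORDER = [
    ("voice_affect", "Voice Affect"),
    ("tone", "Tone"),
    ("pacing", "Pacing"),
    ("emotion", "Emotion"),
    ("pronunciation", "Pronunciation"),
    ("pauses", "Pauses"),
    ("emphasis", "Emphasis"),
    ("delivery", "Delivery"),
]


def build_instruction_spec(**fields):
    """Render only the provided fields as 'Label: value' lines, in template order."""
    lines = [f"{label}: {fields[key]}" for key, label in FIELD_ORDER
             if fields.get(key)]
    return "\n".join(lines)


spec = build_instruction_spec(tone="Friendly and confident.",
                              pacing="Steady and moderate.")
```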
Examples
Single example (narration)
Input text: "Welcome to the demo. Today we'll show how it works."
Instructions:
Voice Affect: Warm and composed.
Tone: Friendly and confident.
Pacing: Steady and moderate.
Emphasis: Stress "demo" and "show".

Batch example (IVR prompts)
{"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}
{"input":"For sales, press 1. For support, press 2.","voice":"marin","instructions":"Tone: Clear and neutral. Pacing: Slow.","response_format":"wav"}

Instruction best practices (short list)
- Structure directions as: affect -> tone -> pacing -> emotion -> pronunciation/pauses -> emphasis.
- Keep 4 to 8 short lines; avoid conflicting guidance.
- For names/acronyms, add pronunciation hints (e.g., "enunciate A-I") or supply a phonetic spelling in the text.
- For edits/iterations, repeat invariants (e.g., "keep pacing steady") to reduce drift.
- Iterate with single-change follow-ups.
More principles: `references/prompting.md`. Copy/paste specs: `references/sample-prompts.md`.

Guidance by use case
Use these modules when the request is for a specific delivery style. They provide targeted defaults and templates.
- Narration / explainer: `references/narration.md`
- Product demo / voiceover: `references/voiceover.md`
- IVR / phone prompts: `references/ivr.md`
- Accessibility reads: `references/accessibility.md`
CLI + environment notes
- CLI commands + examples: `references/cli.md`
- API parameter quick reference: `references/audio-api.md`
- Instruction patterns + examples: `references/voice-directions.md`
- If network approvals / sandbox settings are getting in the way: `references/codex-network.md`
Reference map
- `references/cli.md`: how to run speech generation/batches via `scripts/text_to_speech.py` (commands, flags, recipes).
- `references/audio-api.md`: API parameters, limits, voice list.
- `references/voice-directions.md`: instruction patterns and examples.
- `references/prompting.md`: instruction best practices (structure, constraints, iteration patterns).
- `references/sample-prompts.md`: copy/paste instruction recipes (examples only; no extra theory).
- `references/narration.md`: templates + defaults for narration and explainers.
- `references/voiceover.md`: templates + defaults for product demo voiceovers.
- `references/ivr.md`: templates + defaults for IVR/phone prompts.
- `references/accessibility.md`: templates + defaults for accessibility reads.
- `references/codex-network.md`: environment/sandbox/network-approval troubleshooting.