voice-generation

Voice Generation Skill

Generate realistic speech using AI (Google Gemini TTS, ElevenLabs, OpenAI TTS).

Prerequisites

At least one API key is required:

- `GOOGLE_API_KEY` - For Google Gemini TTS (same key as video/image/music) ✅
- `ELEVENLABS_API_KEY` - For ElevenLabs high-quality voice synthesis
- `OPENAI_API_KEY` - For OpenAI TTS voices
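
The check for an available key could look roughly like the sketch below. It assumes the keys are supplied as environment variables under the names listed above; `PROVIDER_KEYS` and `available_providers` are illustrative names, not part of the skill's scripts.

```python
import os

# Environment variable for each provider, as listed above.
# The provider labels on the left are hypothetical identifiers.
PROVIDER_KEYS = {
    "gemini": "GOOGLE_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = available_providers()
    if found:
        print("Usable providers:", ", ".join(found))
    else:
        print("No API key found; set at least one of:", ", ".join(PROVIDER_KEYS.values()))
```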

Available APIs

Google Gemini TTS (Recommended - Same API Key)

Best for: Podcasts, dialogues, audiobooks with style control
Voices: 30 voices with natural language style control
Multi-speaker: Up to 2 speakers for dialogues ✅
Languages: 24 languages (auto-detected)
Features: Control style, accent, pace via prompts
Output: 24kHz WAV
API Key: Same `GOOGLE_API_KEY` as video/image/music ✅

ElevenLabs (Best Quality)

Best for: Natural-sounding voices, voice cloning, long-form content
Voices: 100+ pre-made voices + custom voice cloning
Languages: 29+ languages
Models: Eleven Multilingual v2, Eleven Turbo v2

OpenAI TTS (Simplest)

Best for: Quick, reliable text-to-speech with consistent quality
Voices: alloy, echo, fable, onyx, nova, shimmer
Models: tts-1 (fast), tts-1-hd (high quality)
Output: MP3, Opus, AAC, FLAC

Workflow

Step 1: Understand the Request

Parse the user's voice request for:

Text content: What should be spoken?
Voice type: Male, female, specific character?
Tone: Professional, casual, dramatic, cheerful?
Use case: Narration, voiceover, audiobook, notification?
Language: English, Spanish, other?
Speed: Normal, slow, fast?

Step 2: Select Voice and API

Choose based on requirements:

| Use Case | Recommended API | Reason |
|---|---|---|
| Default / Same key as video | Gemini TTS | Same `GOOGLE_API_KEY` ✅ |
| Multi-speaker dialogue | Gemini TTS | Up to 2 speakers built-in |
| Style/accent control | Gemini TTS | Natural language prompts |
| Voice cloning | ElevenLabs | Only API with cloning |
| 100+ voice options | ElevenLabs | Widest selection |
| Audiobook/podcast | ElevenLabs or Gemini | Both excellent for long content |
| Quick narration | OpenAI TTS | Fast, reliable |
| Budget-conscious | OpenAI TTS | Lower cost |
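
The table can be read as a small decision function. The sketch below is one possible encoding; the parameter names and the order in which the rules are checked are assumptions, since the table does not say which requirement wins when several apply. The rule for more than two speakers comes from the Error Handling section.

```python
def choose_api(needs_cloning=False, speakers=1, needs_style_control=False,
               quick=False, budget=False):
    """Map request requirements to a recommended API, following the table above."""
    if needs_cloning:
        return "ElevenLabs"   # only option with voice cloning
    if speakers > 2:
        return "ElevenLabs"   # Gemini TTS handles at most 2 speakers
    if speakers == 2 or needs_style_control:
        return "Gemini TTS"   # built-in dialogue and prompt-based style control
    if quick or budget:
        return "OpenAI TTS"   # fast and lower cost
    return "Gemini TTS"       # default: shares GOOGLE_API_KEY


print(choose_api(speakers=2))    # Gemini TTS
print(choose_api(budget=True))   # OpenAI TTS
```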

Step 3: Prepare the Text

Optimize text for speech:

Add pauses: Use commas, periods for natural rhythm
Spell out numbers: "1,234" → "one thousand two hundred thirty-four" (if needed)
Handle acronyms: "NASA" vs "N.A.S.A." depending on pronunciation
Mark emphasis: Some APIs support emphasis markers

Example transformation:

Original: "The Q4 2024 results show a 15% YoY increase."
Optimized: "The Q4 2024 results show a fifteen percent year-over-year increase."
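
A minimal sketch of this kind of preprocessing, reproducing the transformation above. The `ABBREVIATIONS` map and both function names are hypothetical, and the number speller only handles integers below one million; real text would need a fuller normalizer.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Illustrative abbreviation map; extend it for your own content.
ABBREVIATIONS = {"YoY": "year-over-year"}


def number_to_words(n):
    """Spell out an integer in the range 0..999,999 in English."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def optimize_for_speech(text):
    """Expand percentages and known abbreviations; other digits are left alone."""
    text = re.sub(r"(\d+)%", lambda m: number_to_words(int(m.group(1))) + " percent", text)
    for abbr, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", spoken, text)
    return text


print(optimize_for_speech("The Q4 2024 results show a 15% YoY increase."))
print(number_to_words(1234))
```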

Step 4: Generate the Audio

Execute the appropriate script from `${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/`.

For Google Gemini TTS (single speaker):

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Welcome to our podcast!" \
  --voice "Charon"
```

Gemini TTS with style direction:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Have a wonderful day!" \
  --voice "Puck" \
  --style "Say cheerfully with a British accent:"
```

Gemini TTS multi-speaker (dialogue):

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --multi \
  --speaker "Host:Charon" \
  --speaker "Guest:Aoede" \
  --text "Host: Welcome to the show!
Guest: Thanks for having me!"
```

For ElevenLabs:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/elevenlabs.py \
  --text "Your text here" \
  --voice "Rachel" \
  --model "eleven_multilingual_v2"
```

For OpenAI TTS:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/openai_tts.py \
  --text "Your text here" \
  --voice "nova" \
  --model "tts-1-hd"
```

List Gemini voices:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py --list-voices
```
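
When the script is invoked from Python rather than a shell, the same Gemini commands can be assembled as an argument list. The sketch below uses only the flags shown above (`--text`, `--voice`, `--style`, `--multi`, `--speaker`); `build_gemini_command` and `run` are hypothetical helper names, and it is an assumption that the script signals failure with a non-zero exit code.

```python
import os
import subprocess


def build_gemini_command(text, voice=None, style=None, speakers=None, plugin_root=None):
    """Build the argv list for gemini_tts.py using the flags shown above."""
    root = plugin_root or os.environ.get("CLAUDE_PLUGIN_ROOT", "")
    script = f"{root}/skills/voice-generation/scripts/gemini_tts.py"
    cmd = ["python3", script]
    if speakers:
        # Multi-speaker mode: one --speaker "Name:Voice" pair per speaker.
        cmd.append("--multi")
        for name, speaker_voice in speakers:
            cmd += ["--speaker", f"{name}:{speaker_voice}"]
    elif voice:
        cmd += ["--voice", voice]
    if style:
        cmd += ["--style", style]
    cmd += ["--text", text]
    return cmd


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


print(build_gemini_command("Welcome to our podcast!", voice="Charon", plugin_root="/plugins"))
```

Passing the arguments as a list avoids shell quoting problems, which matters for multi-line dialogue text.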

Step 5: Deliver the Result

Provide the generated audio file path
Mention the voice and settings used
Offer to:
- Try a different voice
- Adjust speed or tone
- Use a different API
- Generate in a different format

Error Handling

Missing API key: Inform the user which key is needed:

- Gemini TTS: Same `GOOGLE_API_KEY` as video/image - https://aistudio.google.com/apikey
- ElevenLabs: https://elevenlabs.io
- OpenAI: https://platform.openai.com/api-keys

Gemini TTS requires the `google-genai` package: `pip install google-genai`

Text too long: Split the text into chunks, generate audio for each chunk, and concatenate the results, or suggest shorter text.
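
One way to split long text is at sentence boundaries, so that no chunk exceeds the chosen limit (for example the character limits in the API Comparison table). This is a sketch only: `split_into_chunks` is a hypothetical helper, the sentence regex is naive, and joining the resulting audio files is not shown.

```python
import re


def split_into_chunks(text, max_chars):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", 9))
```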

Rate limit: Suggest waiting or trying a different API.
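
Waiting and then falling back to another API could be automated along these lines. Everything here is hypothetical: `RateLimitError` stands in for whatever error a provider wrapper raises, and each provider is represented by a callable that takes the text and returns an output path.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised by a provider wrapper when rate limited."""


def generate_with_fallback(providers, text, retries=2, base_delay=1.0, sleep=time.sleep):
    """Try each provider in order, retrying with backoff before moving on.

    providers is a list of (name, callable) pairs; each callable takes the
    text and returns the path of the generated audio file.
    """
    for name, generate in providers:
        for attempt in range(retries + 1):
            try:
                return name, generate(text)
            except RateLimitError:
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("All providers are rate limited")
```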

Unsupported language: Suggest an alternative API that supports the language.

Multi-speaker limit: Gemini TTS supports at most 2 speakers. For more, use ElevenLabs with multiple calls, one per speaker.
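
Whether a script fits within the two-speaker limit can be checked before calling any API. The sketch below assumes dialogue written as `Speaker: line`, as in the multi-speaker example; `parse_dialogue` and `fits_gemini_multi_speaker` are illustrative names, and any line containing a colon is treated as a new turn, which will misread some prose.

```python
def parse_dialogue(script):
    """Parse 'Speaker: line' text into (speaker, line) pairs."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, spoken = line.partition(":")
        if sep and speaker.strip() and spoken.strip():
            turns.append((speaker.strip(), spoken.strip()))
        elif turns:
            # Continuation line: append it to the previous speaker's turn.
            prev_speaker, prev_text = turns[-1]
            turns[-1] = (prev_speaker, f"{prev_text} {line}")
    return turns


def fits_gemini_multi_speaker(script, max_speakers=2):
    """True if the script has between 1 and max_speakers distinct speakers."""
    speakers = {speaker for speaker, _ in parse_dialogue(script)}
    return 0 < len(speakers) <= max_speakers


script = "Host: Welcome to the show!\nGuest: Thanks for having me!"
print(parse_dialogue(script))
print(fits_gemini_multi_speaker(script))
```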

Voice Selection Guide

Google Gemini TTS Voices (30 voices)

| Style | Voices | Best For |
|---|---|---|
| Bright/Upbeat | Zephyr, Puck, Aoede, Laomedeia | Marketing, cheerful content |
| Firm/Informative | Charon, Kore, Orus, Rasalgethi | News, tutorials, professional |
| Soft/Warm | Achernar, Sulafat, Vindemiatrix | Meditation, gentle narration |
| Smooth | Algieba, Despina, Callirrhoe | Audiobooks, storytelling |
| Clear | Erinome, Iapetus, Pulcherrima | Instructions, clarity |
| Character | Fenrir (excitable), Enceladus (breathy), Algenib (gravelly), Gacrux (mature) | Character voices, drama |
| Friendly | Achird, Zubenelgenubi (casual) | Casual, conversational |

Gemini TTS Style Tips:

- Use natural language: `--style "Say angrily:"` or `--style "Whisper mysteriously:"`
- Specify accents: `--style "Speak with a British accent from London:"`
- Control pace: `--style "Speak slowly and deliberately:"`
- Combine: `--style "Say excitedly with a Southern US accent:"`
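
Style prompts of this shape can be composed from separate pieces. The helper below is a hypothetical convenience, not part of the skill; it always uses the article "a", so accents starting with a vowel would need adjusting by hand.

```python
def build_style(manner=None, accent=None, verb="Say"):
    """Compose a --style prompt such as 'Say excitedly with a Southern US accent:'."""
    if not manner and not accent:
        raise ValueError("Provide a manner, an accent, or both")
    parts = [verb]
    if manner:
        parts.append(manner)
    if accent:
        parts.append(f"with a {accent} accent")
    return " ".join(parts) + ":"


print(build_style("excitedly", "Southern US"))
print(build_style("slowly and deliberately", verb="Speak"))
```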

OpenAI TTS Voices

| Voice | Description | Best For |
|---|---|---|
| alloy | Neutral, balanced | General purpose |
| echo | Warm, conversational | Podcasts, casual |
| fable | Expressive, British | Storytelling |
| onyx | Deep, authoritative | Narration, professional |
| nova | Friendly, upbeat | Marketing, tutorials |
| shimmer | Soft, gentle | Meditation, ASMR |

ElevenLabs Popular Voices

| Voice | Description | Best For |
|---|---|---|
| Rachel | Young female, American | Narration, audiobooks |
| Domi | Young female, energetic | Marketing, ads |
| Bella | Young female, soft | Storytelling |
| Antoni | Young male, well-rounded | Narration |
| Josh | Young male, deep | Audiobooks |
| Arnold | Mature male, authoritative | Documentary |
| Adam | Middle-aged male, deep | Narration |
| Sam | Young male, raspy | Character voices |

Best Practices

For Narration

Use a consistent voice throughout
Add natural pauses between paragraphs
Consider pacing for the content type

For Dialogue

Use different voices for different characters
Match voice characteristics to character descriptions
Adjust speed for emotional scenes

For Accessibility

Use clear, well-paced speech
Avoid overly stylized voices
Test with screen readers if applicable

API Comparison

| Feature | Gemini TTS | ElevenLabs | OpenAI TTS |
|---|---|---|---|
| API Key | `GOOGLE_API_KEY` ✅ | `ELEVENLABS_API_KEY` | `OPENAI_API_KEY` |
| Voice quality | Excellent | Excellent | Very good |
| Voice variety | 30 voices | 100+ voices | 6 voices |
| Multi-speaker | ✅ Up to 2 | ❌ No | ❌ No |
| Style control | ✅ Natural language | Limited | ❌ No |
| Voice cloning | ❌ No | ✅ Yes | ❌ No |
| Languages | 24 | 29+ | 50+ |
| Speed control | Via prompts | Yes | Yes (0.25-4x) |
| Max length | 32k tokens | 5,000 chars | 4,096 chars |
| Output format | WAV (24kHz) | MP3, WAV | MP3, Opus, AAC, FLAC |
| Same key as video/image | ✅ Yes | ❌ No | ❌ No |
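
If the comparison needs to be queried in code, a few of its rows can be held as data. The sketch below copies the values from the table; the `ApiCapabilities` class and `apis_with` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCapabilities:
    name: str
    voices: str
    multi_speaker: bool
    style_control: str
    voice_cloning: bool
    max_length: str


# Values copied from the comparison table above.
APIS = [
    ApiCapabilities("Gemini TTS", "30", True, "natural language", False, "32k tokens"),
    ApiCapabilities("ElevenLabs", "100+", False, "limited", True, "5,000 chars"),
    ApiCapabilities("OpenAI TTS", "6", False, "none", False, "4,096 chars"),
]


def apis_with(**required):
    """Return the names of APIs whose fields equal every required value."""
    return [api.name for api in APIS
            if all(getattr(api, field) == value for field, value in required.items())]


print(apis_with(voice_cloning=True))
print(apis_with(multi_speaker=True))
```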
