voice-generation

Voice Generation Skill

Generate realistic speech using AI (Google Gemini TTS, ElevenLabs, OpenAI TTS).

Prerequisites

At least one API key is required:
  • GOOGLE_API_KEY: For Google Gemini TTS (same key as video/image/music) ✅
  • ELEVENLABS_API_KEY: For ElevenLabs high-quality voice synthesis
  • OPENAI_API_KEY: For OpenAI TTS voices
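
The prerequisite check can be sketched in Python as follows. This is a minimal illustration, not part of the skill's scripts: the provider labels (`gemini`, `elevenlabs`, `openai`) and the helper name `available_providers` are assumptions, while the environment variable names come from the list above.

```python
import os

# Environment variable names are taken from the prerequisites list;
# the short provider labels are hypothetical.
PROVIDER_KEYS = {
    "gemini": "GOOGLE_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


if __name__ == "__main__":
    found = available_providers()
    if found:
        print("Usable providers:", ", ".join(found))
    else:
        print("No API key set. Set at least one of:", ", ".join(PROVIDER_KEYS.values()))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.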

Available APIs

Google Gemini TTS (Recommended - Same API Key)

  • Best for: Podcasts, dialogues, audiobooks with style control
  • Voices: 30 voices with natural language style control
  • Multi-speaker: Up to 2 speakers for dialogues ✅
  • Languages: 24 languages (auto-detected)
  • Features: Control style, accent, pace via prompts
  • Output: 24kHz WAV
  • API Key: Same GOOGLE_API_KEY as video/image/music ✅

ElevenLabs (Best Quality)

  • Best for: Natural-sounding voices, voice cloning, long-form content
  • Voices: 100+ pre-made voices + custom voice cloning
  • Languages: 29+ languages
  • Models: Eleven Multilingual v2, Eleven Turbo v2

OpenAI TTS (Simplest)

  • Best for: Quick, reliable text-to-speech with consistent quality
  • Voices: alloy, echo, fable, onyx, nova, shimmer
  • Models: tts-1 (fast), tts-1-hd (high quality)
  • Output: MP3, Opus, AAC, FLAC

Workflow

Step 1: Understand the Request

Parse the user's voice request for:
  • Text content: What should be spoken?
  • Voice type: Male, female, specific character?
  • Tone: Professional, casual, dramatic, cheerful?
  • Use case: Narration, voiceover, audiobook, notification?
  • Language: English, Spanish, other?
  • Speed: Normal, slow, fast?
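
One way to hold the parsed fields is a small data class. The sketch below is hypothetical: the class name `VoiceRequest`, the field names, and the defaults for language and speed are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceRequest:
    """Hypothetical container for the fields parsed in Step 1."""

    text: str                         # what should be spoken
    voice_type: Optional[str] = None  # e.g. "male", "female", a character
    tone: Optional[str] = None        # e.g. "professional", "cheerful"
    use_case: Optional[str] = None    # e.g. "narration", "audiobook"
    language: str = "English"         # assumed default
    speed: str = "normal"             # assumed default

    def missing_fields(self) -> list[str]:
        """List the optional fields the user has not specified yet."""
        optional = ("voice_type", "tone", "use_case")
        return [name for name in optional if getattr(self, name) is None]
```

Fields reported by `missing_fields()` are candidates for a follow-up question or a sensible default.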

Step 2: Select Voice and API

Choose based on requirements:

| Use Case | Recommended API | Reason |
|---|---|---|
| Default / Same key as video | Gemini TTS | Same GOOGLE_API_KEY |
| Multi-speaker dialogue | Gemini TTS | Up to 2 speakers built-in |
| Style/accent control | Gemini TTS | Natural language prompts |
| Voice cloning | ElevenLabs | Only API with cloning |
| 100+ voice options | ElevenLabs | Widest selection |
| Audiobook/podcast | ElevenLabs or Gemini | Both excellent for long content |
| Quick narration | OpenAI TTS | Fast, reliable |
| Budget-conscious | OpenAI TTS | Lower cost |
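
The selection rules above could be encoded roughly as follows. This is a sketch under stated assumptions: the function name `recommend_api`, its parameters, and the priority order among the rules are illustrative choices, not something the skill defines.

```python
def recommend_api(*, voice_cloning=False, speakers=1, style_control=False,
                  quick=False, budget=False):
    """Return a provider label based on the selection rules for this step.

    The rule order is an assumption: hard requirements (cloning, speaker
    count) are checked before preferences (style, speed, cost).
    """
    if voice_cloning:
        return "elevenlabs"   # only API with cloning
    if speakers > 2:
        return "elevenlabs"   # Gemini handles at most 2 speakers
    if speakers == 2 or style_control:
        return "gemini"       # built-in dialogue and prompt-based style
    if quick or budget:
        return "openai"       # fast, lower cost
    return "gemini"           # default: same GOOGLE_API_KEY as video
```

For example, `recommend_api(speakers=2)` returns `"gemini"`, while `recommend_api(budget=True)` returns `"openai"`.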

Step 3: Prepare the Text

Optimize text for speech:
  1. Add pauses: Use commas, periods for natural rhythm
  2. Spell out numbers: "1,234" → "one thousand two hundred thirty-four" (if needed)
  3. Handle acronyms: "NASA" vs "N.A.S.A." depending on pronunciation
  4. Mark emphasis: Some APIs support emphasis markers
Example transformation:
  • Original: "The Q4 2024 results show a 15% YoY increase."
  • Optimized: "The Q4 2024 results show a fifteen percent year-over-year increase."
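
Part of this preparation can be automated. The sketch below reproduces the example transformation above; it only handles whole-number percentages below 100 and a small, hypothetical abbreviation table, so treat it as a starting point rather than a complete normalizer. The names `prepare_for_speech` and `ABBREVIATIONS` are assumptions.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical abbreviation table; extend it for your own content.
ABBREVIATIONS = {"YoY": "year-over-year"}


def small_number_to_words(n):
    """Spell out an integer in the range 0-99."""
    if not 0 <= n <= 99:
        raise ValueError("only 0-99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def prepare_for_speech(text):
    """Expand simple percentages and known abbreviations."""
    # "15%" -> "fifteen percent"; decimals and larger numbers are left alone.
    text = re.sub(
        r"(?<![\d.,])(\d{1,2})%",
        lambda m: f"{small_number_to_words(int(m.group(1)))} percent",
        text,
    )
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbreviation)}\b", expansion, text)
    return text
```

Calling `prepare_for_speech("The Q4 2024 results show a 15% YoY increase.")` yields the optimized sentence from the example.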

Step 4: Generate the Audio

Execute the appropriate script from ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/:

For Google Gemini TTS (single speaker):
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Welcome to our podcast!" \
  --voice "Charon"
```

Gemini TTS with style direction:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Have a wonderful day!" \
  --voice "Puck" \
  --style "Say cheerfully with a British accent:"
```

Gemini TTS multi-speaker (dialogue):
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --multi \
  --speaker "Host:Charon" \
  --speaker "Guest:Aoede" \
  --text "Host: Welcome to the show!
Guest: Thanks for having me!"
```

For ElevenLabs:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/elevenlabs.py \
  --text "Your text here" \
  --voice "Rachel" \
  --model "eleven_multilingual_v2"
```

For OpenAI TTS:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/openai_tts.py \
  --text "Your text here" \
  --voice "nova" \
  --model "tts-1-hd"
```

List Gemini voices:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py --list-voices
```
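
When these commands are assembled programmatically, building an argument list avoids shell-quoting problems with multi-line text. The sketch below only uses the script names and flags shown in the commands above; the helper name `build_tts_command`, its parameters, and the `plugin_root` override are assumptions, and it does not check whether a given script accepts every flag.

```python
import os
from pathlib import Path

# Script file names as they appear in the commands for this step.
SCRIPTS = {
    "gemini": "gemini_tts.py",
    "elevenlabs": "elevenlabs.py",
    "openai": "openai_tts.py",
}


def build_tts_command(provider, text, voice=None, *, model=None, style=None,
                      speakers=None, plugin_root=None):
    """Build an argv list for one of the voice-generation scripts.

    `speakers` is a hypothetical {name: voice} mapping for Gemini dialogues.
    """
    root = Path(plugin_root or os.environ["CLAUDE_PLUGIN_ROOT"])
    script = root / "skills" / "voice-generation" / "scripts" / SCRIPTS[provider]
    cmd = ["python3", str(script)]
    if speakers:
        if provider != "gemini" or len(speakers) > 2:
            raise ValueError("multi-speaker needs Gemini TTS and at most 2 speakers")
        cmd.append("--multi")
        for name, speaker_voice in speakers.items():
            cmd += ["--speaker", f"{name}:{speaker_voice}"]
    cmd += ["--text", text]
    if voice:
        cmd += ["--voice", voice]
    if model:
        cmd += ["--model", model]
    if style:
        cmd += ["--style", style]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the script exits with a non-zero status.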

Step 5: Deliver the Result

  1. Provide the generated audio file path
  2. Mention the voice and settings used
  3. Offer to:
    • Try a different voice
    • Adjust speed or tone
    • Use a different API
    • Generate in a different format

Error Handling

Missing API key: Inform the user which key is needed (GOOGLE_API_KEY, ELEVENLABS_API_KEY, or OPENAI_API_KEY).
Missing package: Gemini TTS requires the google-genai package. Install it with: pip install google-genai
Text too long: Split into chunks and concatenate, or suggest shorter text.
Rate limit: Suggest waiting or trying a different API.
Unsupported language: Suggest an alternative API that supports the language.
Multi-speaker limit: Gemini TTS supports at most 2 speakers. For more speakers, use ElevenLabs and make a separate call for each speaker.
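
For the "text too long" case, splitting at sentence boundaries keeps each chunk natural to listen to. The sketch below is one possible approach, not the skill's own implementation: the function name `split_text` is an assumption, and the limit should be chosen per provider (for example 4,096 characters for OpenAI TTS or 5,000 for ElevenLabs, per the API comparison in this document).

```python
import re


def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate request, and the resulting audio files concatenated in order.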

Voice Selection Guide

Google Gemini TTS Voices (30 voices)

| Style | Voices | Best For |
|---|---|---|
| Bright/Upbeat | Zephyr, Puck, Aoede, Laomedeia | Marketing, cheerful content |
| Firm/Informative | Charon, Kore, Orus, Rasalgethi | News, tutorials, professional |
| Soft/Warm | Achernar, Sulafat, Vindemiatrix | Meditation, gentle narration |
| Smooth | Algieba, Despina, Callirrhoe | Audiobooks, storytelling |
| Clear | Erinome, Iapetus, Pulcherrima | Instructions, clarity |
| Character | Fenrir (excitable), Enceladus (breathy), Algenib (gravelly), Gacrux (mature) | Character voices, drama |
| Friendly | Achird, Zubenelgenubi (casual) | Casual, conversational |
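
The voice groups above can be kept in a simple lookup table. The sketch below copies only the voices listed in this guide (a subset of the 30 available); the short style keys and the helper names `voices_for` and `style_of` are assumptions.

```python
# Voices grouped by style, as listed in the guide above.
GEMINI_VOICES_BY_STYLE = {
    "bright": ["Zephyr", "Puck", "Aoede", "Laomedeia"],
    "firm": ["Charon", "Kore", "Orus", "Rasalgethi"],
    "soft": ["Achernar", "Sulafat", "Vindemiatrix"],
    "smooth": ["Algieba", "Despina", "Callirrhoe"],
    "clear": ["Erinome", "Iapetus", "Pulcherrima"],
    "character": ["Fenrir", "Enceladus", "Algenib", "Gacrux"],
    "friendly": ["Achird", "Zubenelgenubi"],
}


def voices_for(style):
    """Return the voices listed for a style key, or an empty list."""
    return GEMINI_VOICES_BY_STYLE.get(style.lower(), [])


def style_of(voice):
    """Return the style key a voice is listed under, or None."""
    for style, voices in GEMINI_VOICES_BY_STYLE.items():
        if voice in voices:
            return style
    return None
```

For the complete list of voices, the `--list-voices` flag of `gemini_tts.py` remains the authoritative source.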
Gemini TTS Style Tips:
  • Use natural language: --style "Say angrily:" or --style "Whisper mysteriously:"
  • Specify accents: --style "Speak with a British accent from London:"
  • Control pace: --style "Speak slowly and deliberately:"
  • Combine: --style "Say excitedly with a Southern US accent:"
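
Style directions like these follow a simple pattern and can be composed from parts. The helper below is a hypothetical sketch: `build_style` and its parameters are not part of the skill, and the output is simply a string to pass to `--style`.

```python
def build_style(manner=None, accent=None, verb="Say"):
    """Compose a style direction such as 'Say cheerfully with a British accent:'."""
    parts = [verb]
    if manner:
        parts.append(manner)
    if accent:
        parts.append(f"with a {accent} accent")
    if len(parts) == 1:
        raise ValueError("give at least a manner or an accent")
    return " ".join(parts) + ":"
```

For instance, `build_style("slowly and deliberately", verb="Speak")` produces `Speak slowly and deliberately:`.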

OpenAI TTS Voices

| Voice | Description | Best For |
|---|---|---|
| alloy | Neutral, balanced | General purpose |
| echo | Warm, conversational | Podcasts, casual |
| fable | Expressive, British | Storytelling |
| onyx | Deep, authoritative | Narration, professional |
| nova | Friendly, upbeat | Marketing, tutorials |
| shimmer | Soft, gentle | Meditation, ASMR |

ElevenLabs Popular Voices

| Voice | Description | Best For |
|---|---|---|
| Rachel | Young female, American | Narration, audiobooks |
| Domi | Young female, energetic | Marketing, ads |
| Bella | Young female, soft | Storytelling |
| Antoni | Young male, well-rounded | Narration |
| Josh | Young male, deep | Audiobooks |
| Arnold | Mature male, authoritative | Documentary |
| Adam | Middle-aged male, deep | Narration |
| Sam | Young male, raspy | Character voices |

Best Practices

For Narration

  • Use a consistent voice throughout
  • Add natural pauses between paragraphs
  • Consider pacing for the content type

For Dialogue

  • Use different voices for different characters
  • Match voice characteristics to character descriptions
  • Adjust speed for emotional scenes

For Accessibility

  • Use clear, well-paced speech
  • Avoid overly stylized voices
  • Test with screen readers if applicable

API Comparison

| Feature | Gemini TTS | ElevenLabs | OpenAI TTS |
|---|---|---|---|
| API Key | GOOGLE_API_KEY | ELEVENLABS_API_KEY | OPENAI_API_KEY |
| Voice quality | Excellent | Excellent | Very good |
| Voice variety | 30 voices | 100+ voices | 6 voices |
| Multi-speaker | ✅ Up to 2 | ❌ No | ❌ No |
| Style control | ✅ Natural language | Limited | ❌ No |
| Voice cloning | ❌ No | ✅ Yes | ❌ No |
| Languages | 24 | 29+ | 50+ |
| Speed control | Via prompts | Yes | Yes (0.25-4x) |
| Max length | 32k tokens | 5,000 chars | 4,096 chars |
| Output format | WAV (24kHz) | MP3, WAV | MP3, Opus, AAC, FLAC |
| Same key as video/image | ✅ Yes | ❌ No | ❌ No |
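
When a user asks for a specific audio format, the output-format row above narrows the choice of API. The sketch below encodes just that row; the provider labels and the helper name `providers_supporting` are assumptions.

```python
# Output formats per provider, copied from the comparison above.
OUTPUT_FORMATS = {
    "gemini": {"wav"},
    "elevenlabs": {"mp3", "wav"},
    "openai": {"mp3", "opus", "aac", "flac"},
}


def providers_supporting(fmt):
    """Return the providers listed as producing the given audio format."""
    fmt = fmt.lower().lstrip(".")
    return [name for name, formats in OUTPUT_FORMATS.items() if fmt in formats]
```

For example, `providers_supporting("wav")` returns the Gemini and ElevenLabs labels, and an unlisted format returns an empty list.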