gemini-tts
Gemini Text-to-Speech
Generate natural-sounding speech from text using Gemini's TTS models through executable scripts with support for multiple voices and multi-speaker conversations.
When to Use This Skill
Use this skill when you need to:
- Convert text to natural speech
- Create audio for podcasts, audiobooks, or videos
- Generate multi-speaker conversations
- Stream audio for long content
- Choose from multiple voice options
- Create accessible audio content
- Generate voiceovers for presentations
- Batch convert text to audio files
Available Scripts
scripts/tts.py
Purpose: Convert text to speech using Gemini TTS models
When to use:
- Any text-to-speech conversion
- Multi-speaker conversation generation
- Streaming audio for long texts
- Voiceovers for content creation
- Accessible audio generation
Key parameters:
| Parameter | Description | Example |
|---|---|---|
| (positional) | Text to convert (required) | |
| `--voice` | Voice name | |
| `--output` | Base name for output file | |
| `--output-dir` | Output directory for audio | |
| `--no-timestamp` | Disable auto timestamp | Flag |
| | TTS model | |
| `--stream` | Enable streaming | Flag |
| `--speakers` | Multi-speaker mapping | |
Output: WAV audio file path
Workflows
Workflow 1: Basic Text-to-Speech
```bash
python scripts/tts.py "Hello, world! Have a wonderful day."
```
- Best for: Quick audio generation, simple messages
- Voice: `Kore` (default, clear and professional)
- Output: `audio/tts_output_YYYYMMDD_HHMMSS.wav` (auto timestamp)
Workflow 2: Choose Different Voice
```bash
python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome
```
- Best for: Friendly, conversational content
- Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Output: `audio/welcome_YYYYMMDD_HHMMSS.wav`
Workflow 3: Multi-Speaker Conversation
```bash
python scripts/tts.py "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
```
- Best for: Dialogues, interviews, role-playing content
- Format: Marked conversation with speaker names
- Script automatically routes text to appropriate voices
- Output: `audio/conversation_YYYYMMDD_HHMMSS.wav`
Workflow 4: Long Content with Streaming
```bash
python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form
```
- Best for: Podcasts, audiobooks, long articles
- Streaming: Processes audio in chunks for long texts
- Output: `audio/long-form_YYYYMMDD_HHMMSS.wav`
Workflow 5: Professional Voiceover
```bash
python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
```
- Best for: Corporate content, presentations, formal announcements
- Voice: `Charon` (deep, authoritative)
- Use when: Professional, serious tone required
Workflow 6: Custom Output Directory
```bash
python scripts/tts.py "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1
```
- Best for: Organized project structures
- Directory created automatically if it doesn't exist
- Output: `./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav`
Workflow 7: Content Creation Pipeline (Text → Audio)
```bash
# 1. Generate script (gemini-text skill)
python skills/gemini-text/scripts/generate.py "Write a 2-minute podcast intro about sustainable energy"

# 2. Generate audio (this skill)
python scripts/tts.py "[Paste generated script]" --voice Fenrir --output podcast-intro

# 3. Use in video or podcast
```
- Best for: Podcasts, audiobooks, video narration
- Combines with: gemini-text for script generation
Workflow 8: Accessible Content
```bash
python scripts/tts.py "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility
```
- Best for: Web accessibility, screen reader alternatives
- Voice: `Aoede` (melodic, pleasant)
- Use when: Making content accessible to visually impaired users
Workflow 9: Educational Content
```bash
python scripts/tts.py "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1
```
- Best for: Educational materials, tutorials, e-learning
- Voice: `Zephyr` (light, airy)
- Combines well with: gemini-text for content generation
Workflow 10: Disable Timestamp
```bash
python scripts/tts.py "Fixed filename." --output my-audio --no-timestamp
```
- Best for: When you want complete control over filename
- Output: `audio/my-audio.wav` (no timestamp)
- Use when: Generating files for specific naming schemes
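The naming pattern seen across the workflows (base name, optional `_YYYYMMDD_HHMMSS` suffix, `.wav` extension, output directory) can be sketched in a few lines. This is only an illustration of the pattern: `build_output_path` and its arguments are hypothetical names, not the internals of `scripts/tts.py`, and the defaults are taken from the outputs shown in Workflow 1.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(base_name="tts_output", output_dir="audio",
                      timestamp=True, now=None):
    """Return the WAV path for a base name, optionally with a timestamp suffix.

    Hypothetical helper mirroring the documented naming pattern.
    """
    now = now or datetime.now()
    # strftime-style format spec yields e.g. 20250131_090500
    stem = f"{base_name}_{now:%Y%m%d_%H%M%S}" if timestamp else base_name
    return Path(output_dir) / f"{stem}.wav"


fixed = datetime(2025, 1, 31, 9, 5, 0)
print(build_output_path("welcome", now=fixed).as_posix())
# audio/welcome_20250131_090500.wav
print(build_output_path("my-audio", timestamp=False).as_posix())
# audio/my-audio.wav
```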
Parameters Reference
Model Selection
| Model | Quality | Speed | Best For |
|---|---|---|---|
| | Good | Fast | General use, high volume |
| | Higher | Slower | Premium content, voiceovers |
Voice Selection
| Voice | Characteristics | Best For |
|---|---|---|
| Kore | Clear, professional | Announcements, general purpose (default) |
| Puck | Friendly, conversational | Casual content, interviews |
| Charon | Deep, authoritative | Corporate, serious content |
| Fenrir | Warm, expressive | Storytelling, narratives |
| Aoede | Melodic, pleasant | Educational, accessibility |
| Zephyr | Light, airy | Gentle content, tutorials |
| Sulafat | Neutral, balanced | Documentaries, factual content |
Audio Format
| Specification | Value |
|---|---|
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
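To make the format table concrete, the sketch below wraps raw 16-bit mono PCM at 24000 Hz in a WAV container using Python's standard `wave` module. It assumes the audio arrives as raw little-endian PCM bytes; the helper names (`write_wav`, `duration_seconds`, `sine_pcm`) are illustrative and not taken from `scripts/tts.py`. A generated sine tone stands in for real speech data.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # Hz, from the format table
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)


def write_wav(path, pcm_bytes):
    """Write raw 16-bit mono PCM bytes to a WAV file."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm_bytes)


def duration_seconds(pcm_bytes):
    """Playback length implied by the byte count and the format above."""
    return len(pcm_bytes) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


def sine_pcm(freq_hz=440.0, seconds=0.5):
    """Generate a quiet test tone as little-endian 16-bit PCM (placeholder audio)."""
    count = int(SAMPLE_RATE * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(count)
    )
    return b"".join(struct.pack("<h", sample) for sample in samples)
```

At this format one second of audio occupies 48000 bytes, so `duration_seconds` is a quick way to sanity-check a generated file before listening to it.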
Token Limits
| Limit | Type | Description |
|---|---|---|
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |
Output Interpretation
Audio File
- Format: WAV (compatible with most players)
- Mono channel (single audio track)
- Sample rate: 24000 Hz (well suited to speech; broadcast audio typically uses 44.1 or 48 kHz)
- Can be converted to MP3/AAC if needed
Multi-Speaker Files
- Single WAV file with multiple voices
- Voices separated by timing within file
- Use the `--speakers` parameter to map speakers to voices
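The `Speaker:Voice` mapping passed to `--speakers` is a comma-separated list of colon-separated pairs. A minimal sketch of how such a string could be parsed and validated follows; `parse_speakers` is a hypothetical helper for illustration, and the real script may handle the value differently.

```python
def parse_speakers(mapping):
    """Parse 'Joe:Kore,Jane:Puck' into {'Joe': 'Kore', 'Jane': 'Puck'}.

    Hypothetical helper; raises ValueError on a malformed pair.
    """
    speakers = {}
    for pair in mapping.split(","):
        # partition splits on the first colon only
        name, sep, voice = pair.partition(":")
        name, voice = name.strip(), voice.strip()
        if not sep or not name or not voice:
            raise ValueError(f"Expected 'Speaker:Voice', got {pair!r}")
        speakers[name] = voice
    return speakers


print(parse_speakers("Joe:Kore,Jane:Puck"))
# {'Joe': 'Kore', 'Jane': 'Puck'}
```

Validating the string up front gives a clear error for a missing colon or a trailing comma before any API call is made.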
Streaming Output
- Audio processed in chunks during generation
- Script shows "Streaming audio..." message
- Useful for very long texts or real-time applications
Common Issues
"google-genai not installed"
```bash
pip install google-genai
```
"Voice name not found"
- Check voice name spelling
- Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Voice names are case-sensitive
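Because voice names are case-sensitive, a pre-flight check can catch typos such as `puck` before a request is sent. The sketch below uses the voice list from this section and the standard library's `difflib` to suggest the closest valid name; `check_voice` is an illustrative helper, not part of the script.

```python
import difflib

# Voice names as listed in this document (case-sensitive)
VOICES = ("Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat")


def check_voice(name):
    """Return the name if it is a known voice, else raise with a suggestion."""
    if name in VOICES:
        return name
    # Compare case-insensitively to find the most likely intended voice
    by_lower = {voice.lower(): voice for voice in VOICES}
    close = difflib.get_close_matches(name.lower(), list(by_lower), n=1)
    hint = f" Did you mean {by_lower[close[0]]!r}?" if close else ""
    raise ValueError(f"Voice name not found: {name!r}.{hint}")
```

For example, `check_voice("puck")` raises a `ValueError` whose message suggests `'Puck'`, while `check_voice("Puck")` returns the name unchanged.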
"No audio generated"
- Check text is not empty
- Verify text doesn't exceed token limit (8,192)
- Try shorter text segments
- Check API quota limits
"Multi-speaker format error"
- Format: `SpeakerName:VoiceName,Speaker2:Voice2`
- Separate speakers with commas
- Use colon between speaker and voice
- Example: `"Joe:Kore,Jane:Puck,Host:Charon"`
"Output file already exists"
- Script will overwrite existing files
- Change the `--output` filename to avoid conflicts
- Use unique names for batch generation
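Since existing files are overwritten, one way to protect earlier results is to pick a free filename before generating. The following is a sketch of that idea only; `unique_path` is a hypothetical helper, and the `_1`, `_2` suffix scheme is an assumption rather than something the script does.

```python
from pathlib import Path


def unique_path(path):
    """Return path unchanged if it is free, else append _1, _2, ... to the stem."""
    path = Path(path)
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

The stem of the returned path (for example `my-audio_1`) could then be passed as the `--output` base name together with `--no-timestamp`.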
Audio quality issues
- Check input text for unusual characters
- Try different voice for better pronunciation
- Consider splitting long text into smaller segments
- Verify audio playback software compatibility
Best Practices
Voice Selection
- Kore: General purpose, clear articulation
- Puck: Conversational, engaging tone
- Charon: Professional, authoritative
- Fenrir: Emotional, storytelling
- Aoede: Soft, gentle for accessibility
- Zephyr: Educational, clear explanations
Text Preparation
- Use natural language and punctuation
- Include pauses with commas and periods
- Spell out difficult words if needed
- Break very long text into logical segments
- Add speaker labels for multi-speaker content
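Breaking very long text into logical segments can be approximated by splitting at sentence boundaries and packing sentences into segments under a size budget. The sketch below uses a character budget as a rough stand-in for the token limit; `split_text` and the `max_chars` default are assumptions for illustration, and a single sentence longer than the budget is kept whole rather than cut mid-sentence.

```python
import re


def split_text(text, max_chars=2000):
    """Split text into segments of whole sentences, each at most max_chars
    where possible. Character count is only a rough proxy for tokens."""
    # Split after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Budget exceeded: close the current segment, start a new one
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two. Three.", max_chars=9))
# ['One. Two.', 'Three.']
```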
Performance Optimization
- Use streaming for very long texts
- Generate shorter segments for better control
- Use flash model for faster generation
- Batch process multiple files for efficiency
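Batch processing can be sketched as building one command per text segment, each with a unique `--output` base name so that files do not collide. The helper below only constructs argument lists from flags shown in this document; `build_batch_commands` and the `segment-NNN` naming are hypothetical choices, and nothing is executed until the commands are passed to a runner.

```python
import sys


def build_batch_commands(items, voice="Kore", output_dir="audio",
                         script="scripts/tts.py"):
    """Build one tts.py command (as an argv list) per text item."""
    commands = []
    for index, text in enumerate(items, start=1):
        commands.append([
            sys.executable, script, text,
            "--voice", voice,
            "--output", f"segment-{index:03d}",  # unique base name per item
            "--output-dir", output_dir,
        ])
    return commands


for command in build_batch_commands(["First part.", "Second part."]):
    print(command[3:])
```

Each list can be run with `subprocess.run(command, check=True)`; passing arguments as a list avoids shell quoting problems with text that contains quotes or apostrophes.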
Quality Tips
- Test different voices for your content type
- Use appropriate pacing with punctuation
- Consider context when selecting voice
- Listen to output before final use
- Multi-speaker requires clear speaker labeling
Use Cases by Voice
| Voice | Ideal Use Cases |
|---|---|
| Kore | Announcements, navigation, general info |
| Puck | Podcasts, interviews, casual content |
| Charon | Corporate, news, formal presentations |
| Fenrir | Audiobooks, stories, emotional content |
| Aoede | Accessibility, educational, gentle content |
| Zephyr | Tutorials, explanations, guides |
| Sulafat | Documentaries, factual presentations |
Related Skills
- gemini-text: Generate scripts and text for TTS
- gemini-image: Create visuals to accompany audio
- gemini-batch: Process multiple TTS requests efficiently
- gemini-files: Upload audio files for processing
Quick Reference
```bash
# Basic
python scripts/tts.py "Your text here"

# Custom voice
python scripts/tts.py "Your text" --voice Puck --output audio

# Multi-speaker
python scripts/tts.py "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"

# Streaming
python scripts/tts.py "Long text..." --stream --output long

# Professional
python scripts/tts.py "Corporate announcement" --voice Charon
```
Reference
- See `references/voices.md` for complete voice documentation
- Get API key: https://aistudio.google.com/apikey
- Documentation: https://ai.google.dev/gemini-api/docs/text-to-speech
- Sample rate: 24000 Hz standard for most applications