gemini-tts

Gemini Text-to-Speech

Generate natural-sounding speech from text with Gemini's TTS models. The skill provides an executable script that supports multiple voices and multi-speaker conversations.

When to Use This Skill

Use this skill when you need to:

- Convert text to natural speech
- Create audio for podcasts, audiobooks, or videos
- Generate multi-speaker conversations
- Stream audio for long content
- Choose from multiple voice options
- Create accessible audio content
- Generate voiceovers for presentations
- Batch convert text to audio files

Available Scripts

scripts/tts.py

Purpose: Convert text to speech using Gemini TTS models

When to use:

- Any text-to-speech conversion
- Multi-speaker conversation generation
- Streaming audio for long texts
- Voiceovers for content creation
- Accessible audio generation

Key parameters:

| Parameter | Description | Example |
| --- | --- | --- |
| `text` | Text to convert (required) | `"Hello, world!"` |
| `--voice`, `-v` | Voice name | `Kore` |
| `--output`, `-o` | Base name for output file | `welcome` |
| `--output-dir` | Output directory for audio | `audio/` |
| `--no-timestamp` | Disable auto timestamp | Flag |
| `--model`, `-m` | TTS model | `gemini-2.5-flash-preview-tts` |
| `--stream`, `-s` | Enable streaming | Flag |
| `--speakers` | Multi-speaker mapping | `"Joe:Kore,Jane:Puck"` |

Output: WAV audio file path

Workflows

Workflow 1: Basic Text-to-Speech

```bash
python scripts/tts.py "Hello, world! Have a wonderful day."
```

- Best for: Quick audio generation, simple messages
- Voice: `Kore` (default, clear and professional)
- Output: `audio/tts_output_YYYYMMDD_HHMMSS.wav` (auto timestamp)

Workflow 2: Choose Different Voice

```bash
python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome
```

- Best for: Friendly, conversational content
- Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Output: `audio/welcome_YYYYMMDD_HHMMSS.wav`

Workflow 3: Multi-Speaker Conversation

```bash
python scripts/tts.py "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
```

- Best for: Dialogues, interviews, role-playing content
- Format: Marked conversation with speaker names
- Script automatically routes text to appropriate voices
- Output: `audio/conversation_YYYYMMDD_HHMMSS.wav`

Workflow 4: Long Content with Streaming

```bash
python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form
```

- Best for: Podcasts, audiobooks, long articles
- Streaming: Processes audio in chunks for long texts
- Output: `audio/long-form_YYYYMMDD_HHMMSS.wav`

Workflow 5: Professional Voiceover

```bash
python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
```

- Best for: Corporate content, presentations, formal announcements
- Voice: `Charon` (deep, authoritative)
- Use when: Professional, serious tone required

Workflow 6: Custom Output Directory

```bash
python scripts/tts.py "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1
```

- Best for: Organized project structures
- Directory created automatically if it doesn't exist
- Output: `./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav`

Workflow 7: Content Creation Pipeline (Text → Audio)

```bash
# 1. Generate script (gemini-text skill)
python skills/gemini-text/scripts/generate.py "Write a 2-minute podcast intro about sustainable energy"

# 2. Generate audio (this skill)
python scripts/tts.py "[Paste generated script]" --voice Fenrir --output podcast-intro

# 3. Use in video or podcast
```

- Best for: Podcasts, audiobooks, video narration
- Combines with: gemini-text for script generation

Workflow 8: Accessible Content

```bash
python scripts/tts.py "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility
```

- Best for: Web accessibility, screen reader alternatives
- Voice: `Aoede` (melodic, pleasant)
- Use when: Making content accessible to visually impaired users

Workflow 9: Educational Content

```bash
python scripts/tts.py "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1
```

- Best for: Educational materials, tutorials, e-learning
- Voice: `Zephyr` (light, airy)
- Combines well with: gemini-text for content generation

Workflow 10: Disable Timestamp

```bash
python scripts/tts.py "Fixed filename." --output my-audio --no-timestamp
```

- Best for: When you want complete control over filename
- Output: `audio/my-audio.wav` (no timestamp)
- Use when: Generating files for specific naming schemes
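
The naming behaviour seen across the workflows (a base name, an optional timestamp, an output directory) can be sketched as follows. This is an illustration only: `build_output_path` is a hypothetical helper, not a function confirmed to exist in `scripts/tts.py`, and its defaults are inferred from the example output paths.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(base="tts_output", output_dir="audio", timestamp=True, now=None):
    """Return the WAV path for a base name (hypothetical helper, not the script's code)."""
    name = base
    if timestamp:
        # Layout assumed from the YYYYMMDD_HHMMSS pattern in the example outputs.
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        name = f"{base}_{stamp}"
    return Path(output_dir) / f"{name}.wav"


fixed = datetime(2025, 1, 2, 3, 4, 5)
print(build_output_path("welcome", now=fixed).as_posix())         # audio/welcome_20250102_030405.wav
print(build_output_path("my-audio", timestamp=False).as_posix())  # audio/my-audio.wav
```

Passing `now` explicitly keeps the example deterministic; a real run would use the current time.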

Parameters Reference

Model Selection

| Model | Quality | Speed | Best For |
| --- | --- | --- | --- |
| `gemini-2.5-flash-preview-tts` | Good | Fast | General use, high volume |
| `gemini-2.5-pro-preview-tts` | Higher | Slower | Premium content, voiceovers |

Voice Selection

| Voice | Characteristics | Best For |
| --- | --- | --- |
| Kore | Clear, professional | Announcements, general purpose (default) |
| Puck | Friendly, conversational | Casual content, interviews |
| Charon | Deep, authoritative | Corporate, serious content |
| Fenrir | Warm, expressive | Storytelling, narratives |
| Aoede | Melodic, pleasant | Educational, accessibility |
| Zephyr | Light, airy | Gentle content, tutorials |
| Sulafat | Neutral, balanced | Documentaries, factual content |

Audio Format

| Specification | Value |
| --- | --- |
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
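
To make these values concrete, here is a minimal sketch that wraps raw PCM bytes in a WAV container using Python's standard `wave` module. The helper names `write_wav` and `duration_seconds` are illustrative and are not taken from `scripts/tts.py`; only the three format constants come from the table.

```python
import tempfile
import wave
from pathlib import Path

SAMPLE_RATE = 24000  # Hz
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)


def write_wav(path, pcm_bytes):
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm_bytes)


def duration_seconds(pcm_bytes):
    """Playback length implied by the byte count at this format."""
    return len(pcm_bytes) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


silence = b"\x00\x00" * SAMPLE_RATE  # one second of 16-bit mono silence
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "silence.wav"
    write_wav(path, silence)
    with wave.open(str(path), "rb") as wf:
        print(wf.getframerate(), wf.getnchannels(), wf.getnframes())  # 24000 1 24000
print(duration_seconds(silence))  # 1.0
```

At this format one second of audio occupies 48,000 bytes, which gives a quick way to estimate file size from duration.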

Token Limits

| Limit | Type | Description |
| --- | --- | --- |
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |

Output Interpretation

Audio File

- Format: WAV (compatible with most players)
- Mono channel (single audio track)
- Sample rate: 24000 Hz (well suited to speech)
- Can be converted to MP3/AAC if needed
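
One common route for the conversion is the ffmpeg command-line tool. The sketch below only builds the argument list; ffmpeg is not part of this skill and must be installed separately, and `wav_to_mp3_command` is a hypothetical helper name.

```python
import subprocess
from pathlib import Path


def wav_to_mp3_command(wav_path, bitrate="128k"):
    """Build an ffmpeg argument list that encodes a WAV file as MP3 next to it."""
    wav = Path(wav_path)
    mp3 = wav.with_suffix(".mp3")
    return ["ffmpeg", "-y", "-i", str(wav), "-codec:a", "libmp3lame", "-b:a", bitrate, str(mp3)]


def convert(wav_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(wav_to_mp3_command(wav_path), check=True)


cmd = wav_to_mp3_command("audio/welcome.wav")
print(" ".join(cmd))
```

Keeping command construction separate from execution makes the command easy to inspect before anything is run.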

Multi-Speaker Files

- Single WAV file with multiple voices
- Voices separated by timing within file
- Use the `--speakers` parameter to map speakers to voices

Streaming Output

- Audio processed in chunks during generation
- Script shows "Streaming audio..." message
- Useful for very long texts or real-time applications

Common Issues

"google-genai not installed"

```bash
pip install google-genai
```

"Voice name not found"

- Check voice name spelling
- Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Voice names are case-sensitive
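
A pre-flight check along these lines can catch a misspelled or wrongly cased voice before any request is made. This is a sketch under the assumption that the seven names listed here are the accepted set; `check_voice` is a hypothetical helper and `references/voices.md` remains the authoritative list.

```python
import difflib

# Voice names as listed in this document.
VOICES = ["Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat"]


def check_voice(name):
    """Return the name if it matches exactly, else raise ValueError with a suggestion."""
    if name in VOICES:
        return name
    by_lower = {v.lower(): v for v in VOICES}
    # Compare case-insensitively so that "kore" suggests "Kore".
    close = difflib.get_close_matches(name.lower(), list(by_lower), n=1)
    hint = f" Did you mean {by_lower[close[0]]!r}?" if close else ""
    raise ValueError(f"Unknown voice {name!r}.{hint}")


for candidate in ["Puck", "kore"]:
    try:
        print(check_voice(candidate))
    except ValueError as err:
        print(err)
```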

"No audio generated"

- Check text is not empty
- Verify text doesn't exceed token limit (8,192)
- Try shorter text segments
- Check API quota limits

"Multi-speaker format error"

- Format: `SpeakerName:VoiceName,Speaker2:Voice2`
- Separate speakers with commas
- Use colon between speaker and voice
- Example: `"Joe:Kore,Jane:Puck,Host:Charon"`
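
The mapping format can be illustrated with a small parser. This is a sketch of the format rules stated here, not the script's actual parsing code; `parse_speakers` is a hypothetical name.

```python
def parse_speakers(spec):
    """Parse 'Joe:Kore,Jane:Puck' into {'Joe': 'Kore', 'Jane': 'Puck'}."""
    mapping = {}
    for pair in spec.split(","):
        speaker, sep, voice = pair.partition(":")
        speaker, voice = speaker.strip(), voice.strip()
        # Each entry needs a colon with a non-empty name on both sides.
        if not sep or not speaker or not voice:
            raise ValueError(f"Expected 'Speaker:Voice', got {pair!r}")
        mapping[speaker] = voice
    return mapping


print(parse_speakers("Joe:Kore,Jane:Puck"))  # {'Joe': 'Kore', 'Jane': 'Puck'}
```

Rejecting malformed entries early gives a clearer message than a failure later in audio generation.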

"Output file already exists"

- Script will overwrite existing files
- Change the `--output` filename to avoid conflicts
- Use unique names for batch generation
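
When timestamps are disabled, one way to avoid overwriting is to pick a free filename before generating. The sketch below is a generic approach, not behaviour of `scripts/tts.py`; `unique_path` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def unique_path(path):
    """Return path if it is free, otherwise add _1, _2, ... before the suffix."""
    original = Path(path)
    candidate = original
    counter = 1
    while candidate.exists():
        candidate = original.with_name(f"{original.stem}_{counter}{original.suffix}")
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my-audio.wav"
    target.touch()                   # simulate a file left by an earlier run
    print(unique_path(target).name)  # my-audio_1.wav
```

The stem of the returned path (for example `my-audio_1`) could then be passed as the `--output` base name.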

Audio quality issues

- Check input text for unusual characters
- Try different voice for better pronunciation
- Consider splitting long text into smaller segments
- Verify audio playback software compatibility

Best Practices

Voice Selection

- Kore: General purpose, clear articulation
- Puck: Conversational, engaging tone
- Charon: Professional, authoritative
- Fenrir: Emotional, storytelling
- Aoede: Soft, gentle for accessibility
- Zephyr: Educational, clear explanations

Text Preparation

- Use natural language and punctuation
- Include pauses with commas and periods
- Spell out difficult words if needed
- Break very long text into logical segments
- Add speaker labels for multi-speaker content
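
Breaking long text into segments can be sketched as grouping whole sentences up to a size budget. The `max_chars` budget is an assumption used as a rough stand-in for the token limit, since characters are not tokens; `split_text` is a hypothetical helper.

```python
import re


def split_text(text, max_chars=2000):
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own segment.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Splitting on sentence boundaries keeps pauses natural when the resulting audio files are played back to back.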

Performance Optimization

- Use streaming for very long texts
- Generate shorter segments for better control
- Use flash model for faster generation
- Batch process multiple files for efficiency
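
Batch processing can be as simple as looping over the inputs and invoking the script once per item. The sketch below only builds the commands, using flags from the parameter table; the helper names are hypothetical, and the `subprocess.run` call is left commented out because it needs the script and an API key.

```python
import subprocess
import sys


def build_tts_command(text, output, voice="Kore"):
    """Argument list for one scripts/tts.py call, using only documented flags."""
    return [sys.executable, "scripts/tts.py", text, "--voice", voice, "--output", output]


def batch_commands(items, voice="Kore"):
    """items maps an output base name to the text to speak."""
    return [build_tts_command(text, name, voice) for name, text in items.items()]


chapters = {"chapter1": "Chapter 1: Introduction.", "chapter2": "Chapter 2: Basics."}
for cmd in batch_commands(chapters, voice="Zephyr"):
    print(cmd[1:])
    # subprocess.run(cmd, check=True)  # uncomment to generate the audio
```

Using distinct base names per item also avoids the overwrite problem when timestamps are disabled.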

Quality Tips

- Test different voices for your content type
- Use appropriate pacing with punctuation
- Consider context when selecting voice
- Listen to output before final use
- Multi-speaker requires clear speaker labeling

Use Cases by Voice

| Voice | Ideal Use Cases |
| --- | --- |
| Kore | Announcements, navigation, general info |
| Puck | Podcasts, interviews, casual content |
| Charon | Corporate, news, formal presentations |
| Fenrir | Audiobooks, stories, emotional content |
| Aoede | Accessibility, educational, gentle content |
| Zephyr | Tutorials, explanations, guides |
| Sulafat | Documentaries, factual presentations |

Quick Reference

```bash
# Basic
python scripts/tts.py "Your text here"

# Custom voice
python scripts/tts.py "Your text" --voice Puck --output audio.wav

# Multi-speaker
python scripts/tts.py "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"

# Streaming
python scripts/tts.py "Long text..." --stream --output long.wav

# Professional
python scripts/tts.py "Corporate announcement" --voice Charon
```

Reference

- See `references/voices.md` for complete voice documentation
- Get API key: https://aistudio.google.com/apikey
- Documentation: https://ai.google.dev/gemini-api/docs/text-to-speech
- Sample rate: 24000 Hz, suitable for most speech applications
