audio-generation

Audio Generation

Generate audio using `generate_media` with `mode="audio"`. Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.

Quick Start

```python
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio", audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio", audio_type="sound_effect", duration=5)
```

Audio Types

| Type | Backends | Description |
| --- | --- | --- |
| `"speech"` (default) | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| `"music"` | ElevenLabs only | Music generation from text prompt |
| `"sound_effect"` | ElevenLabs only | Sound effect generation |
| `"voice_conversion"` | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| `"audio_isolation"` | ElevenLabs only | Remove background noise, isolate vocals |
| `"voice_design"` | ElevenLabs only | Create a new synthetic voice from text description |
| `"voice_clone"` | ElevenLabs only | Clone a voice from audio samples |
| `"dubbing"` | ElevenLabs only | Translate and dub audio to another language |

Backend Comparison

| Backend | Default Model | Supports | API Key |
| --- | --- | --- | --- |
| ElevenLabs (priority 1) | `eleven_multilingual_v2` | Speech, music, SFX | `ELEVENLABS_API_KEY` |
| OpenAI (priority 2) | `gpt-4o-mini-tts` | Speech only | `OPENAI_API_KEY` |

If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.
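The priority-and-fallback behavior described above can be sketched roughly as follows. This is an illustrative model only, not the actual implementation behind `generate_media`: the names `BACKEND_SUPPORT`, `candidate_backends`, and `generate_with_fallback` are hypothetical, the capability sets are copied from the comparison table, and the use of `RuntimeError` as the failure signal is an assumption.

```python
from typing import Callable, Dict, List, Set

# Capabilities copied from the Backend Comparison table (hypothetical structure).
BACKEND_SUPPORT: Dict[str, Set[str]] = {
    "elevenlabs": {"speech", "music", "sound_effect"},
    "openai": {"speech"},
}
PRIORITY: List[str] = ["elevenlabs", "openai"]  # priority 1, then priority 2


def candidate_backends(audio_type: str, available: Set[str]) -> List[str]:
    """Return the backends to try, in priority order, for this audio type."""
    return [
        b for b in PRIORITY
        if b in available and audio_type in BACKEND_SUPPORT[b]
    ]


def generate_with_fallback(
    audio_type: str,
    available: Set[str],
    call: Callable[[str], bytes],
) -> bytes:
    """Try each candidate backend in turn and return the first success."""
    errors = []
    for backend in candidate_backends(audio_type, available):
        try:
            return call(backend)
        except RuntimeError as exc:  # assumed failure type for this sketch
            errors.append(f"{backend}: {exc}")
    raise RuntimeError("no backend succeeded: " + "; ".join(errors))
```

Under this model, a speech request with both keys configured tries ElevenLabs first and only reaches OpenAI on failure, while a music request with only an OpenAI key has no candidate backend at all.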

Key Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| `prompt` | Text to speak (speech) or description (music/SFX) | `"Hello world!"` |
| `voice` | Voice name or ID | `"Rachel"`, `"nova"`, `"alloy"` |
| `audio_type` | Type of audio | `"speech"`, `"music"`, `"sound_effect"` |
| `duration` | Length in seconds (music/SFX only) | `30` |
| `instructions` | Speaking style (OpenAI `gpt-4o-mini-tts` only) | `"warm, reflective tone"` |
| `audio_format` | Output format | `"mp3"`, `"wav"`, `"opus"` |
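Several of these parameters only apply in certain combinations. A small pre-flight check, sketched below, can catch mismatches before a call is made. The helper `check_audio_args` is hypothetical and not part of the tool; it encodes only the constraints stated in the tables above, and whether the real tool rejects or silently ignores such combinations is not specified here.

```python
from typing import List, Optional


def check_audio_args(
    audio_type: str = "speech",
    duration: Optional[int] = None,
    instructions: Optional[str] = None,
    backend_type: Optional[str] = None,
) -> List[str]:
    """Return human-readable warnings for parameter combinations
    that the tables describe as unsupported (hypothetical helper)."""
    problems: List[str] = []
    # duration is documented as music/SFX only.
    if duration is not None and audio_type not in ("music", "sound_effect"):
        problems.append("duration applies to music and sound_effect only")
    # instructions is documented as OpenAI gpt-4o-mini-tts only.
    if instructions is not None and backend_type == "elevenlabs":
        problems.append("instructions is supported only by OpenAI gpt-4o-mini-tts")
    # OpenAI is documented as speech only.
    if backend_type == "openai" and audio_type != "speech":
        problems.append("OpenAI backend supports speech only")
    return problems
```

An empty list means the combination is consistent with the documented constraints, for example `check_audio_args(audio_type="music", duration=30)`.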

Voice Quick Reference

ElevenLabs (top voices):

| Voice | Character |
| --- | --- |
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |

OpenAI voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`, `coral`, `sage`
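Because the two backends use disjoint voice names, the quick-reference lists can double as a lookup for guessing which backend a given `voice` value belongs to. The sketch below is illustrative: `backend_for_voice` is a hypothetical helper, and it covers only the voices listed in this section, not the full ElevenLabs catalog or raw voice IDs.

```python
from typing import Optional

# Names taken from the quick-reference lists in this section only.
ELEVENLABS_VOICES = {"Rachel", "Sarah", "Josh", "Adam", "Emily"}
OPENAI_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer", "coral", "sage"}


def backend_for_voice(voice: str) -> Optional[str]:
    """Guess the backend for a voice name; None if it is not in the lists."""
    if voice in ELEVENLABS_VOICES:
        return "elevenlabs"
    if voice.lower() in OPENAI_VOICES:
        return "openai"
    return None  # unknown name or a raw voice ID
```

A `None` result does not mean the voice is invalid, only that it is outside this short list.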

Important: prompt vs instructions

For speech, `prompt` is the literal text to speak. Style guidance goes in `instructions`:

```python
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai",
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!
```


`instructions` only works with OpenAI `gpt-4o-mini-tts`. ElevenLabs uses voice selection for tone.
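One way to keep the two fields from being mixed up is to assemble the call arguments through a small wrapper that takes the spoken text and the style as separate inputs. The sketch below is an assumption-laden illustration: `speech_request` is a hypothetical helper, and pinning `backend_type="openai"` whenever a style is given is a design choice made here because `instructions` is documented as OpenAI-only.

```python
from typing import Any, Dict, Optional


def speech_request(
    text: str,
    style: Optional[str] = None,
    voice: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a speech call, keeping the literal
    text (prompt) separate from style guidance (instructions)."""
    kwargs: Dict[str, Any] = {"prompt": text, "mode": "audio"}
    if voice is not None:
        kwargs["voice"] = voice
    if style is not None:
        # instructions is documented as OpenAI gpt-4o-mini-tts only,
        # so this sketch pins the backend when a style is requested.
        kwargs["instructions"] = style
        kwargs["backend_type"] = "openai"
    return kwargs
```

The result would then be passed on as `generate_media(**speech_request("Welcome to the annual report presentation.", style="warm, reflective tone", voice="alloy"))`.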

Audio Understanding

Use `read_media` (not `generate_media`) to analyze existing audio:

```python
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")
```

Need More Control?

Full ElevenLabs voice catalog (28+ voices): See references/voices.md
Music and sound effects details: See references/music_and_sfx.md
Advanced audio capabilities (voice conversion, cloning, isolation, dubbing): See references/advanced.md
