Audio Generation

Generate audio using `generate_media` with `mode="audio"`. Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.

Quick Start

```python
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio", audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio", audio_type="sound_effect", duration=5)
```

Audio Types

| Type | Backends | Description |
|------|----------|-------------|
| `"speech"` (default) | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| `"music"` | ElevenLabs only | Music generation from text prompt |
| `"sound_effect"` | ElevenLabs only | Sound effect generation |
| `"voice_conversion"` | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| `"audio_isolation"` | ElevenLabs only | Remove background noise, isolate vocals |
| `"voice_design"` | ElevenLabs only | Create a new synthetic voice from text description |
| `"voice_clone"` | ElevenLabs only | Clone a voice from audio samples |
| `"dubbing"` | ElevenLabs only | Translate and dub audio to another language |
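
The audio types listed above can be mirrored as a small lookup for checking a request before it is sent. This is only a sketch: `AUDIO_TYPE_BACKENDS` and `backends_for` are hypothetical helper names, not part of `generate_media`, and the backend identifier `"elevenlabs"` is an assumption (only `"openai"` appears in this document as a `backend_type` value).

```python
# Hypothetical lookup mirroring the Audio Types table; not part of generate_media.
# Backend names are listed in priority order. "elevenlabs" is an assumed identifier.
AUDIO_TYPE_BACKENDS = {
    "speech": ("elevenlabs", "openai"),
    "music": ("elevenlabs",),
    "sound_effect": ("elevenlabs",),
    "voice_conversion": ("elevenlabs",),
    "audio_isolation": ("elevenlabs",),
    "voice_design": ("elevenlabs",),
    "voice_clone": ("elevenlabs",),
    "dubbing": ("elevenlabs",),
}


def backends_for(audio_type="speech"):
    """Return the backends that support an audio type, highest priority first."""
    try:
        return AUDIO_TYPE_BACKENDS[audio_type]
    except KeyError:
        raise ValueError(f"Unknown audio_type: {audio_type!r}") from None
```

A caller could use `backends_for("music")` to see that a music request cannot succeed without an ElevenLabs key.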

Backend Comparison

| Backend | Default Model | Supports | API Key |
|---------|---------------|----------|---------|
| ElevenLabs (priority 1) | `eleven_multilingual_v2` | Speech, music, SFX | `ELEVENLABS_API_KEY` |
| OpenAI (priority 2) | `gpt-4o-mini-tts` | Speech only | `OPENAI_API_KEY` |

If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.
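
The priority order and fallback behaviour described above might look roughly like the following. The tool's actual selection logic is not shown in this document, so `choose_backend` and `speak_with_fallback` are hypothetical names, and the two TTS callables stand in for the real backends.

```python
import os


def choose_backend(audio_type="speech", env=None):
    """Pick a backend by priority: ElevenLabs first, then OpenAI (speech only)."""
    env = os.environ if env is None else env
    if env.get("ELEVENLABS_API_KEY"):
        return "elevenlabs"  # assumed identifier for the ElevenLabs backend
    if audio_type == "speech" and env.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError(f"No configured backend supports audio_type={audio_type!r}")


def speak_with_fallback(text, primary, fallback):
    """Call the primary TTS function; if it raises, use the fallback instead."""
    try:
        return primary(text)
    except Exception:
        return fallback(text)
```

Passing `env` explicitly keeps the selection testable without touching real environment variables.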

Key Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `prompt` | Text to speak (speech) or description (music/SFX) | `"Hello world!"` |
| `voice` | Voice name or ID | `"Rachel"`, `"nova"`, `"alloy"` |
| `audio_type` | Type of audio | `"speech"`, `"music"`, `"sound_effect"` |
| `duration` | Length in seconds (music/SFX only) | `30` |
| `instructions` | Speaking style (OpenAI `gpt-4o-mini-tts` only) | `"warm, reflective tone"` |
| `audio_format` | Output format | `"mp3"`, `"wav"`, `"opus"` |
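
The parameter restrictions above can be checked before a call is made. The sketch below uses a hypothetical helper, `check_audio_args`, which is not part of the tool. It encodes only two rules from this document: `duration` is for music and sound effects, and `instructions` describes a speaking style. How `generate_media` itself reacts to a mismatched parameter is not documented here.

```python
def check_audio_args(audio_type="speech", duration=None, instructions=None):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if duration is not None and audio_type not in ("music", "sound_effect"):
        problems.append("duration applies only to music and sound_effect")
    if instructions is not None and audio_type != "speech":
        problems.append("instructions sets a speaking style, so it applies only to speech")
    return problems


# Example: a speech request that mistakenly passes a duration.
for problem in check_audio_args(audio_type="speech", duration=5):
    print("warning:", problem)
```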

Voice Quick Reference

ElevenLabs (top voices):

| Voice | Character |
|-------|-----------|
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |

OpenAI voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`, `coral`, `sage`

Important: prompt vs instructions

For speech, `prompt` is the literal text to speak. Style guidance goes in `instructions`:

```python
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai",
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!
```

`instructions` only works with OpenAI `gpt-4o-mini-tts`. ElevenLabs uses voice selection for tone.
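
One way to keep the two fields apart in calling code is to build the arguments through a small wrapper. The helper below, `speech_request`, is a hypothetical name and not part of the tool. It pins `backend_type="openai"` whenever a style is supplied, which is a design choice based on the note that `instructions` only works with OpenAI `gpt-4o-mini-tts`.

```python
def speech_request(text, style=None, voice=None):
    """Build keyword arguments for a speech call.

    `text` is spoken verbatim (prompt); `style` describes delivery (instructions).
    """
    args = {"prompt": text, "mode": "audio"}
    if voice is not None:
        args["voice"] = voice
    if style is not None:
        # instructions is OpenAI-only per the note above, so pin the backend.
        args["instructions"] = style
        args["backend_type"] = "openai"
    return args


request = speech_request(
    "Welcome to the annual report presentation.",
    style="warm, reflective tone with measured pacing",
    voice="alloy",
)
# The result would then be passed on as: generate_media(**request)
```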

Audio Understanding

Use `read_media` (not `generate_media`) to analyze existing audio:

```python
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")
```

Need More Control?

  • Full ElevenLabs voice catalog (28+ voices): See references/voices.md
  • Music and sound effects details: See references/music_and_sfx.md
  • Advanced audio capabilities (voice conversion, cloning, isolation, dubbing): See references/advanced.md