audio-generation

Original🇺🇸 English
Translated

Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends.

3installs
Added on

NPX Install

npx skill4agent add massgen/massgen audio-generation

Tags

Translated version includes tags in frontmatter

Audio Generation

Generate audio using
generate_media
with
mode="audio"
. Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.

Quick Start

python
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)

Audio Types

TypeBackendsDescription
"speech"
(default)
ElevenLabs, OpenAIText-to-speech with voice selection
"music"
ElevenLabs onlyMusic generation from text prompt
"sound_effect"
ElevenLabs onlySound effect generation
"voice_conversion"
ElevenLabs onlyChange voice of existing audio (speech-to-speech)
"audio_isolation"
ElevenLabs onlyRemove background noise, isolate vocals
"voice_design"
ElevenLabs onlyCreate a new synthetic voice from text description
"voice_clone"
ElevenLabs onlyClone a voice from audio samples
"dubbing"
ElevenLabs onlyTranslate and dub audio to another language

Backend Comparison

BackendDefault ModelSupportsAPI Key
ElevenLabs (priority 1)
eleven_multilingual_v2
Speech, music, SFX
ELEVENLABS_API_KEY
OpenAI (priority 2)
gpt-4o-mini-tts
Speech only
OPENAI_API_KEY
If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.

Key Parameters

ParameterDescriptionExample
prompt
Text to speak (speech) or description (music/SFX)
"Hello world!"
voice
Voice name or ID
"Rachel"
,
"nova"
,
"alloy"
audio_type
Type of audio
"speech"
,
"music"
,
"sound_effect"
duration
Length in seconds (music/SFX only)
30
instructions
Speaking style (OpenAI
gpt-4o-mini-tts
only)
"warm, reflective tone"
audio_format
Output format
"mp3"
,
"wav"
,
"opus"

Voice Quick Reference

ElevenLabs (top voices):
VoiceCharacter
RachelWarm, conversational female
SarahClear, professional female
JoshFriendly male
AdamDeep, authoritative male
EmilyBright, energetic female
OpenAI voices:
alloy
,
echo
,
fable
,
onyx
,
nova
,
shimmer
,
coral
,
sage

Important: prompt vs instructions

For speech,
prompt
is the literal text to speak. Style guidance goes in
instructions
:
python
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai"
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!
instructions
only works with OpenAI
gpt-4o-mini-tts
. ElevenLabs uses voice selection for tone.

Audio Understanding

Use
read_media
(not
generate_media
) to analyze existing audio:
python
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")

Need More Control?

  • Full ElevenLabs voice catalog (28+ voices): See references/voices.md
  • Music and sound effects details: See references/music_and_sfx.md
  • Advanced audio capabilities (voice conversion, cloning, isolation, dubbing): See references/advanced.md