audio-generation
Guide to audio generation and understanding in MassGen. Covers text-to-speech, music, sound effects, and audio understanding across ElevenLabs and OpenAI backends.
Source: massgen/massgen
NPX Install
npx skill4agent add massgen/massgen audio-generation
Audio Generation
Generate audio using `generate_media` with `mode="audio"`. Supports speech (TTS), music, and sound effects. ElevenLabs is preferred when available, with OpenAI as fallback.
Quick Start
```python
# Text-to-speech (auto-selects ElevenLabs if key available)
generate_media(prompt="Hello, welcome to our presentation!", mode="audio")

# With specific voice
generate_media(prompt="Hello!", mode="audio", voice="Rachel")

# Music generation (ElevenLabs only)
generate_media(prompt="Upbeat jazz piano with soft drums", mode="audio",
               audio_type="music", duration=30)

# Sound effects (ElevenLabs only)
generate_media(prompt="Thunder rolling across a mountain valley", mode="audio",
               audio_type="sound_effect", duration=5)
```
Audio Types
| Type | Backends | Description |
|---|---|---|
| | ElevenLabs, OpenAI | Text-to-speech with voice selection |
| `music` | ElevenLabs only | Music generation from text prompt |
| `sound_effect` | ElevenLabs only | Sound effect generation |
| | ElevenLabs only | Change voice of existing audio (speech-to-speech) |
| | ElevenLabs only | Remove background noise, isolate vocals |
| | ElevenLabs only | Create a new synthetic voice from text description |
| | ElevenLabs only | Clone a voice from audio samples |
| | ElevenLabs only | Translate and dub audio to another language |
Backend Comparison
| Backend | Default Model | Supports | API Key |
|---|---|---|---|
| ElevenLabs (priority 1) | | Speech, music, SFX | |
| OpenAI (priority 2) | | Speech only | |
If ElevenLabs TTS fails, the system automatically falls back to OpenAI TTS.
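The priority and fallback rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MassGen's actual implementation: `choose_backends`, the backend labels, and the `speech` type label are hypothetical names chosen for the example; only `music` and `sound_effect` appear as `audio_type` values in this guide.

```python
# Hypothetical sketch of the backend priority and fallback rules in this guide.
# Not MassGen's real code: the function name, the backend labels, and the
# "speech" label are assumptions made for illustration.

PRIORITY = ["elevenlabs", "openai"]  # ElevenLabs is priority 1, OpenAI priority 2

# Per the Audio Types table: only speech is available on both backends.
SUPPORTED = {
    "speech": {"elevenlabs", "openai"},
    "music": {"elevenlabs"},
    "sound_effect": {"elevenlabs"},
}


def choose_backends(audio_type, available):
    """Return the backends to try, in order, for the given audio type.

    `available` is the set of backends that have an API key configured.
    """
    if audio_type not in SUPPORTED:
        raise ValueError(f"unknown audio type: {audio_type!r}")
    candidates = [
        backend for backend in PRIORITY
        if backend in available and backend in SUPPORTED[audio_type]
    ]
    if not candidates:
        raise RuntimeError(f"no configured backend supports {audio_type!r}")
    return candidates


# Speech with both keys set: try ElevenLabs first, then fall back to OpenAI.
print(choose_backends("speech", {"elevenlabs", "openai"}))
# Music never falls back, because OpenAI handles speech only.
print(choose_backends("music", {"elevenlabs", "openai"}))
```

Returning an ordered list rather than a single backend keeps the fallback explicit: a caller would try each entry in turn and stop at the first success.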
Key Parameters
| Parameter | Description | Example |
|---|---|---|
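| `prompt` | Text to speak (speech) or description (music/SFX) | `"Hello, welcome to our presentation!"` |
| `voice` | Voice name or ID | `"Rachel"` |
| `audio_type` | Type of audio | `"music"` |
| `duration` | Length in seconds (music/SFX only) | `30` |
| `instructions` | Speaking style (OpenAI) | `"warm, reflective tone with measured pacing"` |
| | Output format | |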
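One way to keep these parameters consistent is to assemble them in a small helper before calling `generate_media`. The sketch below is hypothetical: `build_audio_kwargs` is not part of MassGen, and its checks only restate the table (for example, that `duration` applies to music and sound effects). It uses only parameter names that appear in this guide's examples.

```python
# Hypothetical helper: assembles keyword arguments for generate_media(mode="audio").
# The helper and its validation rules are illustrative assumptions, not MassGen code.

TIMED_TYPES = {"music", "sound_effect"}  # audio_type values that accept `duration`


def build_audio_kwargs(prompt, audio_type=None, voice=None,
                       duration=None, instructions=None):
    """Return a kwargs dict, omitting any optional parameter left as None."""
    if not prompt:
        raise ValueError("prompt must be non-empty")
    if duration is not None and audio_type not in TIMED_TYPES:
        raise ValueError("duration applies to music and sound effects only")

    kwargs = {"prompt": prompt, "mode": "audio"}
    optional = {
        "audio_type": audio_type,
        "voice": voice,
        "duration": duration,
        "instructions": instructions,
    }
    kwargs.update({name: value for name, value in optional.items()
                   if value is not None})
    return kwargs


# Intended use (generate_media is provided by MassGen, so it is not called here):
#   generate_media(**build_audio_kwargs("Thunder rolling", audio_type="sound_effect", duration=5))
print(build_audio_kwargs("Upbeat jazz piano with soft drums",
                         audio_type="music", duration=30))
```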
Voice Quick Reference
ElevenLabs (top voices):
| Voice | Character |
|---|---|
| Rachel | Warm, conversational female |
| Sarah | Clear, professional female |
| Josh | Friendly male |
| Adam | Deep, authoritative male |
| Emily | Bright, energetic female |
OpenAI voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`, `coral`, `sage`
Important: prompt vs instructions
For speech, `prompt` is the literal text to speak. Style guidance goes in `instructions`:
```python
# CORRECT: prompt = text to speak, instructions = how to speak it
generate_media(
    prompt="Welcome to the annual report presentation.",
    mode="audio",
    voice="alloy",
    instructions="warm, reflective tone with measured pacing",
    backend_type="openai"
)

# WRONG: Don't put style instructions in prompt
generate_media(prompt="Say this warmly: Welcome...", mode="audio")  # Bad!
```
`instructions` is supported by OpenAI's `gpt-4o-mini-tts` model.
Audio Understanding
Use `read_media` (not `generate_media`) to analyze existing audio:
```python
read_media(path="recording.mp3", prompt="Transcribe and summarize this audio")
```
Need More Control?
- Full ElevenLabs voice catalog (28+ voices): See references/voices.md
- Music and sound effects details: See references/music_and_sfx.md
- Advanced audio capabilities (voice conversion, cloning, isolation, dubbing): See references/advanced.md