videoagent-audio-studio

🎙️ VideoAgent Audio Studio

Use when: The user asks to generate speech, narrate text, create a voice-over, compose music, or produce a sound effect.
VideoAgent Audio Studio is a smart audio dispatcher. It analyzes your request, routes it to the best available model (ElevenLabs for speech, sound effects, and voice cloning; fal.ai for music), and returns a ready-to-use audio URL.

Quick Reference

| Request Type | Best Model | Latency |
| --- | --- | --- |
| Narrate text / Voice-over | elevenlabs-tts-v3 | ~3s |
| Low-latency TTS (real-time) | elevenlabs-tts-turbo | <1s |
| Background music | cassetteai-music | ~15s |
| Sound effect | elevenlabs-sfx | ~5s |
| Clone a voice from audio | elevenlabs-voice-clone | ~10s |

How to Use

1. Start the AudioMind server (once per session)

bash
bash {baseDir}/tools/start_server.sh
This starts the ElevenLabs MCP server on port 8124. The skill uses it for all audio generation.
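
Before routing any request, it can help to confirm that something is actually listening on port 8124. The following is a minimal sketch using only the Python standard library; the function name `is_server_listening` and the `127.0.0.1` host are assumptions, not part of the skill.

```python
import socket


def is_server_listening(host: str = "127.0.0.1", port: int = 8124,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 8124 is the port named above; the host is an assumption.
    print("MCP server reachable:", is_server_listening())
```

A successful TCP connection only shows that the port is open; it does not prove the process behind it is the MCP server.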

2. Route the request

Analyze the user's request and call the appropriate tool via the MCP server:

Text-to-Speech (TTS)
When user asks to "narrate", "read aloud", "say", or "create a voice-over":
Use MCP tool: text_to_speech
  text: "<the text to narrate>"
  voice_id: "JBFqnCBsd6RMkjVDRZzb"   # Default: "George" (professional, neutral)
  model_id: "eleven_multilingual_v2"   # Use "eleven_turbo_v2_5" for low latency

Music Generation
When user asks to "compose", "create background music", or "make a soundtrack":
Use MCP tool: text_to_sound_effects  (via cassetteai-music on fal.ai)
  prompt: "<music description, e.g. 'upbeat lo-fi hip hop, 90 seconds'>"
  duration_seconds: <duration>

Sound Effect (SFX)
When user asks for a specific sound (e.g., "a door creaking", "rain on a window"):
Use MCP tool: text_to_sound_effects
  text: "<sound description>"
  duration_seconds: <1-22>

Voice Cloning
When user provides an audio sample and wants to clone the voice:
Use MCP tool: voice_add
  name: "<voice name>"
  files: ["<audio_file_url>"]
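
The routing rules above can be approximated with a keyword lookup. This is a rough sketch, not the skill's actual dispatcher: the whole-word matching strategy, the "first match wins" ordering, and the SFX triggers ("sound effect", "sfx") are assumptions, and the music target name follows the "Route to: cassetteai-music" form used in the example conversations.

```python
import re
from typing import Optional

# (target, trigger phrases). TTS and music triggers come from the rules above;
# the SFX and cloning triggers are illustrative assumptions.
ROUTES = [
    ("voice_add", ("clone",)),
    ("text_to_speech", ("narrate", "read aloud", "say", "voice-over")),
    ("cassetteai-music", ("compose", "background music", "soundtrack")),
    ("text_to_sound_effects", ("sound effect", "sfx")),
]


def route_request(request: str) -> Optional[str]:
    """Return the first target whose trigger appears as a whole word or phrase."""
    text = request.lower()
    for target, triggers in ROUTES:
        for trigger in triggers:
            # Word boundaries keep "say" from matching inside "essay".
            if re.search(r"\b" + re.escape(trigger) + r"\b", text):
                return target
    return None  # no rule matched; the agent has to decide some other way
```

Keyword matching is deliberately crude. A request such as "Voice this text for me" contains none of the listed triggers and returns `None`, so in practice the agent's own language understanding does the real classification.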

Example Conversations

User: "Voice this text for me: Welcome to our product launch"
→ Route to: text_to_speech
  text: "Welcome to our product launch"
  voice_id: "JBFqnCBsd6RMkjVDRZzb"
  model_id: "eleven_multilingual_v2"
🎙️ Voiceover done! Listen here

User: "Generate 60 seconds of relaxing background music for a podcast"
→ Route to: cassetteai-music (fal.ai)
  prompt: "relaxing lo-fi background music for a podcast, gentle piano and soft beats, 60 seconds"
  duration_seconds: 60
🎵 Background music ready! Listen here

User: "Generate a sci-fi style door opening sound effect"
→ Route to: text_to_sound_effects
  text: "a futuristic sci-fi door sliding open with a hydraulic hiss"
  duration_seconds: 3
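
The sound-effect example uses a 3-second duration, which sits inside the 1-22 second range given in the routing rules. A small helper can enforce that range before the tool is called. This is a sketch only: `build_sfx_args` is a hypothetical name, and clamping (rather than rejecting) out-of-range values is an assumption.

```python
def build_sfx_args(description: str, duration_seconds: float) -> dict:
    """Build arguments for text_to_sound_effects, clamping duration to 1-22 seconds."""
    cleaned = description.strip()
    if not cleaned:
        raise ValueError("description must not be empty")
    # The 1-22 range comes from the routing rules; clamping is an assumption.
    clamped = max(1, min(22, duration_seconds))
    return {"text": cleaned, "duration_seconds": clamped}


if __name__ == "__main__":
    print(build_sfx_args(
        "a futuristic sci-fi door sliding open with a hydraulic hiss", 3))
```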

Setup

Required

Set ELEVENLABS_API_KEY in ~/.openclaw/openclaw.json:
json
{
  "skills": {
    "entries": {
      "videoagent-audio-studio": {
        "enabled": true,
        "env": {
          "ELEVENLABS_API_KEY": "your_elevenlabs_key_here"
        }
      }
    }
  }
}
Get your key at elevenlabs.io/app/settings/api-keys.

Optional (for fal.ai music & SFX models)

json
"FAL_KEY": "your_fal_key_here"
Get your key at fal.ai/dashboard/keys.
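
The FAL_KEY line above is a fragment. The sketch below assumes it belongs in the same env object that holds ELEVENLABS_API_KEY, which the source does not state explicitly. The helper `with_env_keys` is hypothetical and works on an in-memory dict, so nothing is written to disk.

```python
import json


def with_env_keys(config: dict, skill: str, env: dict) -> dict:
    """Return a copy of config with env merged into skills.entries.<skill>.env."""
    merged = json.loads(json.dumps(config))  # deep copy via a JSON round-trip
    entry = (merged.setdefault("skills", {})
                   .setdefault("entries", {})
                   .setdefault(skill, {}))
    entry.setdefault("enabled", True)
    entry.setdefault("env", {}).update(env)
    return merged


if __name__ == "__main__":
    base = {"skills": {"entries": {"videoagent-audio-studio": {
        "enabled": True,
        "env": {"ELEVENLABS_API_KEY": "your_elevenlabs_key_here"}}}}}
    updated = with_env_keys(base, "videoagent-audio-studio",
                            {"FAL_KEY": "your_fal_key_here"})
    print(json.dumps(updated, indent=2))
```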

Self-Hosting the Proxy

By default, cli.js connects to a hosted proxy. If you want full control, or need to serve users in regions where vercel.app is blocked, you can deploy your own instance from the proxy/ directory.

Quick Deploy (Vercel)

bash
cd proxy
npm install
vercel --prod

Environment Variables

Set these in your Vercel project (Dashboard → Settings → Environment Variables):

| Variable | Required For | Where to Get |
| --- | --- | --- |
| ELEVENLABS_API_KEY | TTS, SFX, Voice Clone | elevenlabs.io/app/settings/api-keys |
| FAL_KEY | Music generation | fal.ai/dashboard/keys |
| VALID_PRO_KEYS | (Optional) Restrict access | Comma-separated list of allowed client keys |
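
VALID_PRO_KEYS is described as a comma-separated list of allowed client keys. How the proxy actually checks it is not shown in this document, so the following is only a sketch of one plausible approach. Both function names are hypothetical, and treating an unset variable as "no restriction" is an assumption based on the "(Optional)" label.

```python
import hmac
from typing import Optional, Set


def parse_allowed_keys(raw: Optional[str]) -> Set[str]:
    """Split a comma-separated string into a set of non-empty, trimmed keys."""
    if not raw:
        return set()
    return {part.strip() for part in raw.split(",") if part.strip()}


def is_client_allowed(client_key: str, raw_allowed: Optional[str]) -> bool:
    """Check a client key against the allow-list; an empty list allows everyone."""
    allowed = parse_allowed_keys(raw_allowed)
    if not allowed:
        return True  # assumption: unset VALID_PRO_KEYS means unrestricted access
    candidate = client_key.encode("utf-8")
    # compare_digest gives a constant-time comparison for each candidate key
    return any(hmac.compare_digest(candidate, key.encode("utf-8"))
               for key in allowed)
```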

Point cli.js to Your Proxy

bash
export AUDIOMIND_PROXY_URL="https://your-domain.com/api/audio"
Or set it in ~/.openclaw/openclaw.json:
json
{
  "skills": {
    "entries": {
      "videoagent-audio-studio": {
        "env": {
          "AUDIOMIND_PROXY_URL": "https://your-domain.com/api/audio"
        }
      }
    }
  }
}
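
With two places to set AUDIOMIND_PROXY_URL, a client needs a lookup order. The document does not say which one wins, so the sketch below assumes the shell environment variable takes precedence over openclaw.json. `DEFAULT_PROXY_URL` is a hypothetical placeholder, not the real hosted proxy address.

```python
import os
from typing import Mapping, Optional

# Hypothetical placeholder; the real hosted proxy URL is not given in this document.
DEFAULT_PROXY_URL = "https://hosted-proxy.example/api/audio"
SKILL_NAME = "videoagent-audio-studio"


def resolve_proxy_url(config: dict,
                      environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the proxy URL: environment first, then config file, then default."""
    environ = os.environ if environ is None else environ
    from_env = environ.get("AUDIOMIND_PROXY_URL")
    if from_env:
        return from_env
    skill_env = (config.get("skills", {})
                       .get("entries", {})
                       .get(SKILL_NAME, {})
                       .get("env", {}))
    return skill_env.get("AUDIOMIND_PROXY_URL") or DEFAULT_PROXY_URL
```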

Custom Domain (Recommended)

If your users are in mainland China, bind a custom domain in Vercel Dashboard → Settings → Domains to avoid DNS issues with vercel.app.

Model Reference

| Model ID | Type | Provider | Notes |
| --- | --- | --- | --- |
| eleven_multilingual_v2 | TTS | ElevenLabs | Best quality, supports 29 languages |
| eleven_turbo_v2_5 | TTS | ElevenLabs | Ultra-low latency, ideal for real-time |
| eleven_monolingual_v1 | TTS | ElevenLabs | English only, fastest |
| cassetteai-music | Music | fal.ai | Reliable, fast music generation |
| elevenlabs-sfx | SFX | ElevenLabs | High-quality sound effects (up to 22s) |
| elevenlabs-voice-clone | Clone | ElevenLabs | Clone any voice from a short audio sample |
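
The model reference can also be held as a small lookup table, for example to work out which API key a given model needs. This is an illustrative sketch: the provider-to-key mapping follows the environment variable list in this document, while the function names `models_for` and `required_key` are hypothetical.

```python
from typing import List

# model id -> (type, provider), copied from the model reference above
MODELS = {
    "eleven_multilingual_v2": ("TTS", "ElevenLabs"),
    "eleven_turbo_v2_5": ("TTS", "ElevenLabs"),
    "eleven_monolingual_v1": ("TTS", "ElevenLabs"),
    "cassetteai-music": ("Music", "fal.ai"),
    "elevenlabs-sfx": ("SFX", "ElevenLabs"),
    "elevenlabs-voice-clone": ("Clone", "ElevenLabs"),
}

# provider -> environment variable holding its API key
PROVIDER_KEYS = {"ElevenLabs": "ELEVENLABS_API_KEY", "fal.ai": "FAL_KEY"}


def models_for(kind: str) -> List[str]:
    """List model ids of the given type, in table order."""
    return [model for model, (mtype, _) in MODELS.items() if mtype == kind]


def required_key(model_id: str) -> str:
    """Return the environment variable a model depends on (KeyError if unknown)."""
    _, provider = MODELS[model_id]
    return PROVIDER_KEYS[provider]
```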

Changelog

v3.0.0

  • Simplified routing table: Removed unstable/offline models from the main reference. The skill now only surfaces models that reliably work.
  • Clearer use-case triggers: Added "Use when" section so the agent activates this skill at the right moment.
  • Unified setup: Single ELEVENLABS_API_KEY is all you need to get started. FAL_KEY is now optional.
  • Removed polling complexity: Music generation now uses cassetteai-music by default, which completes synchronously.

v2.1.0

  • Added async workflow for long-running music generation tasks.
  • Added cassetteai-music as a stable alternative for music generation.

v2.0.0

  • Migrated to ElevenLabs MCP server architecture.
  • Added voice cloning support.

v1.0.0

  • Initial release with TTS, music, and SFX routing.