videoagent-audio-studio

🎙️ VideoAgent Audio Studio

Use when: User asks to generate speech, narrate text, create a voice-over, compose music, or produce a sound effect.

VideoAgent Audio Studio is a smart audio dispatcher. It analyzes your request, routes it to the best available model (ElevenLabs for speech, sound effects, and voice cloning; `cassetteai-music` on fal.ai for music), and returns a ready-to-use audio URL.

Quick Reference

Request Type	Best Model	Latency
Narrate text / Voice-over	`elevenlabs-tts-v3`	~3s
Low-latency TTS (real-time)	`elevenlabs-tts-turbo`	<1s
Background music	`cassetteai-music`	~15s
Sound effect	`elevenlabs-sfx`	~5s
Clone a voice from audio	`elevenlabs-voice-clone`	~10s
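
The table above can be held as plain data so that a caller can look up a model and its rough latency. The following is a minimal sketch, not the skill's actual code: the dictionary keys (`narration`, `realtime_tts`, and so on) and the `pick_route` helper are hypothetical names introduced for illustration, and the latency figures are simply the approximate values from the table.

```python
from typing import NamedTuple


class Route(NamedTuple):
    model: str                # model name as listed in the Quick Reference table
    typical_latency_s: float  # approximate latency from the table, in seconds


# Values copied from the Quick Reference table; the keys are hypothetical labels.
QUICK_REFERENCE = {
    "narration": Route("elevenlabs-tts-v3", 3.0),
    "realtime_tts": Route("elevenlabs-tts-turbo", 1.0),  # the table says "<1s"
    "music": Route("cassetteai-music", 15.0),
    "sfx": Route("elevenlabs-sfx", 5.0),
    "voice_clone": Route("elevenlabs-voice-clone", 10.0),
}


def pick_route(request_type: str, realtime: bool = False) -> Route:
    """Return the table entry for a request type.

    Narration requests flagged as real-time are sent to the turbo TTS entry.
    """
    if request_type == "narration" and realtime:
        return QUICK_REFERENCE["realtime_tts"]
    try:
        return QUICK_REFERENCE[request_type]
    except KeyError:
        raise ValueError(f"unknown request type: {request_type!r}") from None
```

For example, `pick_route("narration", realtime=True).model` yields `elevenlabs-tts-turbo`, while an unrecognized request type raises `ValueError` rather than silently falling back to a default.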

How to Use

1. Start the AudioMind server (once per session)

bash

bash {baseDir}/tools/start_server.sh

This starts the ElevenLabs MCP server on port 8124. The skill uses it for all audio generation.
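
Because the server only needs to be started once per session, it can be useful to check whether it is already up before running the script again. The sketch below assumes the server accepts plain TCP connections on `127.0.0.1:8124`; the host and the `server_is_listening` helper are assumptions for illustration, and only the port number comes from the text above.

```python
import socket


def server_is_listening(host: str = "127.0.0.1", port: int = 8124,
                        timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "already running" if server_is_listening() else "not running"
    print(f"MCP server on port 8124 is {state}")
```

A successful connection only shows that some process holds the port; it does not prove that the process is the expected MCP server.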

2. Route the request

Analyze the user's request and call the appropriate tool via the MCP server:

Text-to-Speech (TTS)

When user asks to "narrate", "read aloud", "say", or "create a voice-over":

Use MCP tool: text_to_speech
  text: "<the text to narrate>"
  voice_id: "JBFqnCBsd6RMkjVDRZzb"   # Default: "George" (professional, neutral)
  model_id: "eleven_multilingual_v2"   # Use "eleven_turbo_v2_5" for low latency

Music Generation

When user asks to "compose", "create background music", or "make a soundtrack":

Use MCP tool: text_to_sound_effects  (via cassetteai-music on fal.ai)
  prompt: "<music description, e.g. 'upbeat lo-fi hip hop, 90 seconds'>"
  duration_seconds: <duration>

Sound Effect (SFX)

When user asks for a specific sound (e.g., "a door creaking", "rain on a window"):

Use MCP tool: text_to_sound_effects
  text: "<sound description>"
  duration_seconds: <1-22>

Voice Cloning

When user provides an audio sample and wants to clone the voice:

Use MCP tool: voice_add
  name: "<voice name>"
  files: ["<audio_file_url>"]
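
The four call shapes above can be sketched as small builder functions. This is an illustration only: the tool names, parameter names, default voice, and model IDs are taken from the snippets above, but the `{"tool": ..., "arguments": ...}` wrapper and the function names are hypothetical, and clamping the SFX duration to the 1-22 second range is one possible way to handle out-of-range values.

```python
DEFAULT_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # "George", the default voice named above


def tts_call(text: str, low_latency: bool = False) -> dict:
    """Build a text_to_speech call; low_latency switches to the turbo model."""
    model_id = "eleven_turbo_v2_5" if low_latency else "eleven_multilingual_v2"
    return {
        "tool": "text_to_speech",
        "arguments": {"text": text, "voice_id": DEFAULT_VOICE_ID, "model_id": model_id},
    }


def music_call(prompt: str, duration_seconds: int) -> dict:
    """Build a music request using the parameters shown under Music Generation."""
    return {
        "tool": "text_to_sound_effects",
        "arguments": {"prompt": prompt, "duration_seconds": duration_seconds},
    }


def sfx_call(text: str, duration_seconds: int) -> dict:
    """Build a sound-effect request, clamping the duration to 1-22 seconds."""
    clamped = max(1, min(22, duration_seconds))
    return {
        "tool": "text_to_sound_effects",
        "arguments": {"text": text, "duration_seconds": clamped},
    }


def clone_call(name: str, files: list) -> dict:
    """Build a voice_add request from one or more audio sample URLs."""
    if not files:
        raise ValueError("voice cloning needs at least one audio sample")
    return {"tool": "voice_add", "arguments": {"name": name, "files": list(files)}}
```

With these helpers, `sfx_call("rain on a window", 60)` produces a request whose `duration_seconds` is 22, so an over-long request is shortened instead of being rejected.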

Example Conversations

User: "Voice this text for me: Welcome to our product launch"

→ Route to: text_to_speech
  text: "Welcome to our product launch"
  voice_id: "JBFqnCBsd6RMkjVDRZzb"
  model_id: "eleven_multilingual_v2"

🎙️ Voiceover done! Listen here

User: "Generate 60 seconds of relaxing background music for a podcast"

→ Route to: cassetteai-music (fal.ai)
  prompt: "relaxing lo-fi background music for a podcast, gentle piano and soft beats, 60 seconds"
  duration_seconds: 60

🎵 Background music ready! Listen here

User: "Generate a sci-fi style door opening sound effect"

→ Route to: text_to_sound_effects
  text: "a futuristic sci-fi door sliding open with a hydraulic hiss"
  duration_seconds: 3
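
The routing decisions in these conversations can be approximated with a keyword matcher. The sketch below is a deliberately naive stand-in for whatever analysis the agent actually performs: the regular expressions extend the trigger phrases from "Route the request" with a few extra words (such as "voice") and are assumptions, as is the rule order.

```python
import re
from typing import Optional

# Rules are checked in order. Voice cloning comes first so that a request like
# "clone this voice" is not mistaken for a text-to-speech request.
RULES = [
    ("voice_add", r"\bclon(e|ing)\b"),
    ("text_to_sound_effects", r"\bsound effects?\b|\bsfx\b"),
    ("cassetteai-music", r"\bmusic\b|\bsoundtrack\b|\bcompose\b"),
    ("text_to_speech", r"\bnarrat(e|ion)\b|\bread aloud\b|\bsay\b|\bvoice\b|\bvoice-?over\b"),
]


def classify(request: str) -> Optional[str]:
    """Return the routing target for a request, or None if no rule matches."""
    lowered = request.lower()
    for target, pattern in RULES:
        if re.search(pattern, lowered):
            return target
    return None


if __name__ == "__main__":
    for example in (
        "Voice this text for me: Welcome to our product launch",
        "Generate 60 seconds of relaxing background music for a podcast",
        "Generate a sci-fi style door opening sound effect",
    ):
        print(f"{example!r} -> {classify(example)}")
```

Keyword matching misreads requests that mention several audio types at once, so a real dispatcher would need something more robust; the point here is only to make the three example routes concrete.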

Setup

Required

Set `ELEVENLABS_API_KEY` in `~/.openclaw/openclaw.json`:

json

{
  "skills": {
    "entries": {
      "videoagent-audio-studio": {
        "enabled": true,
        "env": {
          "ELEVENLABS_API_KEY": "your_elevenlabs_key_here"
        }
      }
    }
  }
}

Get your key at elevenlabs.io/app/settings/api-keys.
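
To confirm that the key is in place, the configuration file can be read back and inspected. This is a minimal sketch that assumes the file follows the JSON layout shown above; `read_skill_env` and `has_elevenlabs_key` are hypothetical helper names, not part of the skill.

```python
import json
from pathlib import Path

SKILL_NAME = "videoagent-audio-studio"
PLACEHOLDER = "your_elevenlabs_key_here"  # the placeholder value from the example above


def read_skill_env(config_path: Path) -> dict:
    """Return the skill's env mapping from an openclaw.json file, or {} if absent."""
    if not config_path.exists():
        return {}
    config = json.loads(config_path.read_text(encoding="utf-8"))
    entry = config.get("skills", {}).get("entries", {}).get(SKILL_NAME, {})
    return entry.get("env", {})


def has_elevenlabs_key(env: dict) -> bool:
    """True if a non-empty key is set and it is not the documentation placeholder."""
    key = env.get("ELEVENLABS_API_KEY", "")
    return bool(key) and key != PLACEHOLDER
```

A typical call would be `has_elevenlabs_key(read_skill_env(Path.home() / ".openclaw" / "openclaw.json"))`. The check only verifies that a value is present; it does not validate the key against ElevenLabs.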

Optional (for fal.ai music & SFX models)

json

"FAL_KEY": "your_fal_key_here"

Get your key at fal.ai/dashboard/keys.
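
The fragment above does not say where `FAL_KEY` goes. A reasonable reading, stated here as an assumption, is that it sits next to `ELEVENLABS_API_KEY` inside the skill's `env` object. The sketch below merges a variable into that object without touching the other settings; `with_env_var` is a hypothetical helper.

```python
import copy
import json

SKILL_NAME = "videoagent-audio-studio"


def with_env_var(config: dict, name: str, value: str) -> dict:
    """Return a copy of config with name=value added to the skill's env block."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    entry = (
        updated.setdefault("skills", {})
        .setdefault("entries", {})
        .setdefault(SKILL_NAME, {})
    )
    entry.setdefault("env", {})[name] = value
    return updated


config = {
    "skills": {
        "entries": {
            SKILL_NAME: {
                "enabled": True,
                "env": {"ELEVENLABS_API_KEY": "your_elevenlabs_key_here"},
            }
        }
    }
}
merged = with_env_var(config, "FAL_KEY", "your_fal_key_here")
print(json.dumps(merged, indent=2))
```

The printed JSON contains both keys under `env`, which is the shape this sketch assumes `openclaw.json` should have once the optional key is added.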

Self-Hosting the Proxy

`cli.js` connects to a hosted proxy by default. If you want full control, or need to serve users in regions where `vercel.app` is blocked, you can deploy your own instance from the `proxy/` directory.

Quick Deploy (Vercel)

bash

cd proxy
npm install
vercel --prod

Environment Variables

Set these in your Vercel project (Dashboard → Settings → Environment Variables):

Variable	Required For	Where to Get
`ELEVENLABS_API_KEY`	TTS, SFX, Voice Clone	elevenlabs.io/app/settings/api-keys
`FAL_KEY`	Music generation	fal.ai/dashboard/keys
`VALID_PRO_KEYS`	(Optional) Restrict access	Comma-separated list of allowed client keys
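
The table maps each variable to the features it unlocks, which can be expressed as a small check. This is a sketch under assumptions rather than the proxy's real logic: the feature labels and both function names are hypothetical, and treating an unset `VALID_PRO_KEYS` as "no restriction" is a guess based on the variable being optional.

```python
from typing import Mapping, Optional, Set


def available_features(env: Mapping[str, str]) -> Set[str]:
    """Return the features usable with the variables present in env."""
    features: Set[str] = set()
    if env.get("ELEVENLABS_API_KEY"):
        features |= {"tts", "sfx", "voice_clone"}
    if env.get("FAL_KEY"):
        features.add("music")
    return features


def parse_valid_pro_keys(raw: Optional[str]) -> Set[str]:
    """Split the comma-separated VALID_PRO_KEYS value into a set of client keys.

    An empty or missing value yields an empty set, read here as "no restriction".
    """
    if not raw:
        return set()
    return {key.strip() for key in raw.split(",") if key.strip()}
```

In a deployment you would pass `os.environ` to `available_features`; with only `ELEVENLABS_API_KEY` set, the result omits `music`, which matches the "Required For" column.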

Point cli.js to Your Proxy

bash

export AUDIOMIND_PROXY_URL="https://your-domain.com/api/audio"

Or set it in `~/.openclaw/openclaw.json`:

json

{
  "skills": {
    "entries": {
      "videoagent-audio-studio": {
        "env": {
          "AUDIOMIND_PROXY_URL": "https://your-domain.com/api/audio"
        }
      }
    }
  }
}
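
Since the URL can come from either the shell or the config file, a client has to decide which one wins. The source does not state the precedence, so the sketch below simply assumes the exported environment variable overrides the config file; `resolve_proxy_url` is a hypothetical helper, and a return value of `None` stands for "use the hosted default".

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def resolve_proxy_url(skill_env: Mapping[str, str],
                      environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Pick AUDIOMIND_PROXY_URL from the process environment, then the skill config.

    Returns None when neither source sets it. Raises ValueError for values
    that are not http(s) URLs with a host.
    """
    environ = os.environ if environ is None else environ
    url = environ.get("AUDIOMIND_PROXY_URL") or skill_env.get("AUDIOMIND_PROXY_URL")
    if not url:
        return None
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid proxy URL: {url!r}")
    return url
```

Passing `environ` explicitly keeps the function easy to test; in normal use you would call `resolve_proxy_url(skill_env)` and let it read `os.environ`.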

Custom Domain (Recommended)

If your users are in mainland China, bind a custom domain in Vercel Dashboard → Settings → Domains to avoid DNS issues with `vercel.app`.

Model Reference

Model ID	Type	Provider	Notes
`eleven_multilingual_v2`	TTS	ElevenLabs	Best quality, supports 29 languages
`eleven_turbo_v2_5`	TTS	ElevenLabs	Ultra-low latency, ideal for real-time
`eleven_monolingual_v1`	TTS	ElevenLabs	English only, fastest
`cassetteai-music`	Music	fal.ai	Reliable, fast music generation
`elevenlabs-sfx`	SFX	ElevenLabs	High-quality sound effects (up to 22s)
`elevenlabs-voice-clone`	Clone	ElevenLabs	Clone any voice from a short audio sample

Changelog

v3.0.0

Simplified routing table: Removed unstable/offline models from the main reference. The skill now only surfaces models that reliably work.
Clearer use-case triggers: Added "Use when" section so the agent activates this skill at the right moment.
Unified setup: A single `ELEVENLABS_API_KEY` is all you need to get started. `FAL_KEY` is now optional.
Removed polling complexity: Music generation now uses `cassetteai-music` by default, which completes synchronously.

v2.1.0

Added async workflow for long-running music generation tasks.
Added `cassetteai-music` as a stable alternative for music generation.

v2.0.0

Migrated to ElevenLabs MCP server architecture.
Added voice cloning support.

v1.0.0

Initial release with TTS, music, and SFX routing.
