voice-audio-engineer
Voice & Audio Engineer: Voice Synthesis, TTS & Speech Processing
Expert in voice synthesis, speech processing, and vocal production using ElevenLabs and professional audio techniques. Specializes in TTS, voice cloning, podcast production, and voice UI design.
When to Use This Skill
✅ Use for:
- Text-to-speech (TTS) generation
- Voice cloning and voice design
- Speech-to-speech voice transformation
- Podcast production and editing
- Audiobook production
- Voice UI/conversational AI audio
- Dialogue mixing and processing
- Loudness normalization (LUFS)
- Voice quality enhancement (de-essing, compression)
- Transcription and speech-to-text
❌ Do NOT use for:
- Spatial audio (HRTF, Ambisonics) → sound-engineer
- Sound effects generation → sound-engineer (ElevenLabs SFX)
- Game audio middleware (Wwise, FMOD) → sound-engineer
- Music composition/production → DAW tools
- Live concert/event audio → specialized domain
MCP Integrations
| MCP Tool | Purpose |
|---|---|
| text_to_speech | Generate speech from text with voice selection |
| speech_to_speech | Transform voice recordings to different voices |
| voice_clone | Create instant voice clones from audio samples |
| search_voices | Find voices in ElevenLabs library |
| speech_to_text | Transcribe audio with speaker diarization |
| isolate_audio | Separate voice from background noise |
| create_agent | Build conversational AI agents with voice |
Expert vs Novice Shibboleths
| Topic | Novice | Expert |
|---|---|---|
| TTS quality | "Any voice works" | Matches voice to brand; considers emotion, pace, style |
| Voice cloning | "Upload any audio" | Knows 30s-3min of clean, varied speech needed; single speaker |
| Loudness | "Make it loud" | Targets -16 to -19 LUFS for podcasts; -14 for streaming |
| De-essing | "Doesn't matter" | Knows sibilance lives at 5-8kHz; frequency-selective compression |
| Compression | "Squash it" | Uses 3:1-4:1 for dialogue; slow attack (10-20ms) to preserve transients |
| High-pass | "Never use it" | Always HPF at 80-100Hz for voice; removes rumble, plosives |
| True peak | "Peak is peak" | Knows intersample peaks exceed 0dBFS; targets -1 dBTP |
| ElevenLabs models | "Use default" | Picks eleven_multilingual_v2 for quality-critical work, eleven_flash_v2_5 for real-time |
Common Anti-Patterns
Anti-Pattern: Uploading Noisy Audio for Voice Cloning
What it looks like: Voice clone from phone recording with background noise, echo
Why it's wrong: Clone learns the noise; output has artifacts
What to do instead: Use isolate_audio first; record in a quiet space; provide 1-3 min of varied speech
Anti-Pattern: Ignoring Loudness Standards
What it looks like: Podcast at -6 LUFS, then normalized by platform → crushed dynamics
Why it's wrong: Each platform normalizes differently; too loud = distortion, too quiet = inaudible
What to do instead: Master to -16 LUFS for podcasts; -14 LUFS for streaming; always check true peak < -1 dBTP
Anti-Pattern: TTS Without Voice Matching
What it looks like: Using default robotic voice for premium product
Why it's wrong: Voice IS brand; wrong voice = wrong emotional connection
What to do instead: Use search_voices to find a matching tone; consider a custom clone for brand consistency
Anti-Pattern: No De-essing on Processed Voice
What it looks like: "SSSSibilant" speech after compression and EQ boost
Why it's wrong: Compression brings up sibilance; EQ boost at 3-5kHz makes it worse
What to do instead: De-ess at 5-8kHz before compression; use frequency-selective compression
Anti-Pattern: Single Take, No Editing
What it looks like: Podcast with 20 "ums", breath sounds, long pauses
Why it's wrong: Listeners fatigue; unprofessional; reduces engagement
What to do instead: Edit out filler words; gate or manually cut breaths; tighten pacing
Evolution Timeline
Pre-2020: Robotic TTS
- Concatenative synthesis (spliced recordings)
- Obvious robotic quality
- Limited voice options
2020-2022: Neural TTS Emerges
- Tacotron, WaveNet improve naturalness
- Still detectable as synthetic
- Voice cloning requires hours of data
2023-2024: AI Voice Revolution
- ElevenLabs instant voice cloning (30 seconds)
- Near-human quality in TTS
- Real-time voice transformation
- Voice agents for customer service
2025+: Current Best Practices
- Emotional TTS (control tone, pace, emotion)
- Cross-lingual voice cloning
- Real-time voice transformation in apps
- Personalized voice agents
- Voice authentication integration
Core Concepts
ElevenLabs Voice Selection
Model comparison:
| Model | Quality | Latency | Languages | Use Case |
|---|---|---|---|---|
| eleven_multilingual_v2 | Best | Higher | 29 | Production, quality-critical |
| eleven_flash_v2_5 | Good | Lowest | 32 | Real-time, voice UI |
| eleven_turbo_v2_5 | Better | Low | 32 | Balanced |

Voice parameters:
- Stability: 0-1 (lower = more expressive, higher = more consistent)
- Similarity boost: 0-1 (higher = closer to original voice)
- Style: 0-1 (higher = more exaggerated style)

For natural speech:

```python
stability = 0.5    # Balanced expression
similarity = 0.75  # Close to voice but natural
style = 0.0        # Neutral (increase for dramatic)
```
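To put these settings to work end to end, here is a minimal sketch. It assumes the official `elevenlabs` Python SDK (v1.x); the API key and voice ID are placeholders, and the exact client surface may differ across SDK versions:

```python
# Minimal TTS sketch; assumes the official `elevenlabs` Python SDK (v1.x).
# "VOICE_ID" is a placeholder -- pick a voice via search_voices.
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")
audio = client.text_to_speech.convert(
    voice_id="VOICE_ID",
    model_id="eleven_multilingual_v2",  # quality-critical production
    text="Welcome back. Let's pick up where we left off.",
    voice_settings=VoiceSettings(stability=0.5, similarity_boost=0.75, style=0.0),
)
with open("output.mp3", "wb") as f:
    for chunk in audio:  # convert() streams audio as byte chunks
        f.write(chunk)
```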
Voice Cloning Best Practices
Audio requirements:
- Duration: 1-3 minutes (more = better, diminishing returns after 3min)
- Quality: Clean, no background noise, no reverb
- Content: Varied speech (questions, statements, emotions)
- Format: WAV/MP3, 44.1kHz or higher
Cloning workflow:
1. isolate_audio to clean source material
2. voice_clone with cleaned audio
3. Test with varied prompts
4. Adjust stability/similarity for output quality
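Before uploading, it helps to sanity-check the source file against the requirements above. A minimal sketch, assuming the `soundfile` package; the thresholds mirror this document's guidance:

```python
# Sanity-check clone source audio against the requirements above.
# Assumes the `soundfile` package; the mono check is a conservative extra.
import soundfile as sf

def check_clone_source(path: str) -> list[str]:
    info = sf.info(path)
    duration = info.frames / info.samplerate
    issues = []
    if duration < 60:
        issues.append(f"only {duration:.0f}s of audio; aim for 1-3 minutes")
    if duration > 180:
        issues.append("over 3 minutes; returns diminish past this point")
    if info.samplerate < 44100:
        issues.append(f"{info.samplerate} Hz; use 44.1kHz or higher")
    if info.channels > 1:
        issues.append("multi-channel; mono single-speaker audio is safest")
    return issues

print(check_clone_source("clone_source.wav") or ["looks good"])
```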
Voice Processing Chain
Standard voice chain (order matters!):
[Raw Recording]
↓
[High-Pass Filter @ 80Hz] ← Remove rumble, plosives
↓
[De-esser @ 5-8kHz] ← Before compression!
↓
[Compressor 3:1, 10ms/100ms] ← Smooth dynamics
↓
[EQ: +2dB @ 3kHz presence] ← Clarity boost
↓
[Limiter -1 dBTP] ← Prevent clipping
↓
[Loudness Norm -16 LUFS] ← Target loudness
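One way to realize the whole chain in a single pass is with ffmpeg's stock filters. A hedged sketch, assuming ffmpeg 4.x+ on PATH; the parameter mappings are approximate (acompressor's threshold is linear, so 0.1 is roughly -20 dBFS; alimiter's limit of 0.89 is roughly -1 dB):

```python
# Voice chain via ffmpeg filters, in the order shown above.
# Assumes ffmpeg >= 4.x on PATH; values approximate the chain's settings.
import subprocess

filters = ",".join([
    "highpass=f=80",                  # remove rumble, plosives
    "deesser=i=0.4",                  # tame sibilance before compression
    "acompressor=ratio=3:attack=10:release=100:threshold=0.1",
    "equalizer=f=3000:t=q:w=1:g=2",   # +2dB presence at 3kHz
    "alimiter=limit=0.89",            # ceiling around -1 dB
    "loudnorm=I=-16:TP=-1:LRA=11",    # podcast loudness target
])
subprocess.run(
    # loudnorm resamples internally, so pin the output rate explicitly
    ["ffmpeg", "-y", "-i", "raw.wav", "-af", filters, "-ar", "48000", "voice_master.wav"],
    check=True,
)
```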
Loudness Standards
| Platform/Format | Target LUFS | True Peak |
|---|---|---|
| Podcast | -16 to -19 | -1 dBTP |
| Audiobook (ACX) | -18 to -23 RMS | -3 dBFS |
| YouTube | -14 | -1 dBTP |
| Spotify/Apple Music | -14 | -1 dBTP |
| Broadcast (EBU R128) | -23 ±1 | -1 dBTP |
Measurement:
- LUFS = Loudness Units Full Scale (integrated)
- True Peak = Maximum level including intersample peaks
- Always measure with K-weighting (ITU-R BS.1770)
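To verify these targets programmatically, a minimal sketch assuming the `pyloudnorm` and `soundfile` packages (pyloudnorm implements BS.1770 K-weighting). Note it reports sample peak; true-peak metering requires oversampling:

```python
# Measure integrated loudness; assumes `pyloudnorm` and `soundfile`.
import numpy as np
import pyloudnorm as pyln
import soundfile as sf

data, rate = sf.read("podcast_master.wav")
meter = pyln.Meter(rate)  # K-weighted per ITU-R BS.1770
lufs = meter.integrated_loudness(data)
peak_db = 20 * np.log10(np.max(np.abs(data)))  # sample peak, not true peak
print(f"{lufs:.1f} LUFS, sample peak {peak_db:.1f} dBFS")
```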
Conversational AI Agents
ElevenLabs agent configuration:
```python
create_agent(
    name="Support Agent",
    first_message="Hi, how can I help you today?",
    system_prompt="You are a helpful customer support agent...",
    voice_id="your_voice_id",
    language="en",
    llm="gemini-2.0-flash-001",   # Fast for conversation
    temperature=0.5,
    asr_quality="high",           # Speech recognition quality
    turn_timeout=7,               # Seconds before agent responds
    max_duration_seconds=300      # 5 minute call limit
)
```

Voice UI considerations:
- Use fast model (eleven_flash_v2_5) for real-time
- Keep responses concise (< 30 seconds)
- Add pauses for natural conversation flow
- Handle interruptions gracefully
Quick Reference
Voice Selection Decision Tree
- Brand/professional content? → Custom clone or curated voice
- Real-time/interactive? → eleven_flash_v2_5 model
- Quality-critical? → eleven_multilingual_v2 model
- Multiple languages? → Check language support per voice
Processing Decision Tree
- Voice sounds muddy? → HPF at 80Hz, boost 3kHz
- Sibilance harsh? → De-ess at 5-8kHz
- Inconsistent volume? → Compress 3:1, then limit
- Too quiet? → Normalize to target LUFS
- Background noise? → Use isolate_audio first
Common Settings
De-esser: 5-8kHz, -6dB reduction, Q=2
Compressor: 3:1 ratio, -20dB threshold, 10ms attack, 100ms release
EQ presence: +2-3dB shelf at 3kHz
HPF: 80-100Hz, 12dB/oct
Limiter: -1 dBTP ceiling

Working With Speech Disfluencies
Cluttering vs Stuttering
| Type | Characteristics | ASR Impact |
|---|---|---|
| Stuttering | Repetitions ("I-I-I"), prolongations ("wwwant"), blocks (silent pauses) | Word boundaries confused; repetitions misrecognized |
| Cluttering | Irregular rate, collapsed syllables, filler overload, tangential speech | Words merged; rate changes confuse timing |
ASR Challenges with Disfluent Speech
Most ASR models are trained on fluent speech. Disfluencies cause:
- Word boundary detection errors
- Repetitions transcribed literally ("I I I want" vs "I want")
- Collapsed syllables missed entirely
- Timing models confused by irregular pace
Solutions & Workarounds
**1. Model selection (best to worst for disfluencies):**
- Whisper large-v3 - Most robust to disfluencies
- ElevenLabs speech_to_text - Good with varied speech
- Google Speech-to-Text - Decent with enhanced models
- Fast/lightweight models - Usually worst
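For a baseline, a minimal transcription sketch assuming the open-source `openai-whisper` package (the filename is a placeholder):

```python
# Baseline disfluent-speech transcription; assumes `openai-whisper`
# (pip install openai-whisper). large-v3 is the most disfluency-robust.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("disfluent.wav", word_timestamps=True)
print(result["text"])
```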
**2. Pre-processing:**
```python
# Normalize speech rate before ASR
# Use librosa to stretch irregular segments toward target rate
import librosa

y, sr = librosa.load("disfluent.wav")
y_stretched = librosa.effects.time_stretch(y, rate=0.9)  # Slow down
```
**3. Post-processing:**
- Remove duplicate words: "I I I want" → "I want"
- Filter common fillers: "um", "uh", "like", "you know"
- Use LLM to clean transcripts while preserving meaning
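A minimal sketch of the rule-based part, covering the first two bullets; ambiguous fillers like "like" are context-dependent and better left to the LLM pass:

```python
# Rule-based transcript cleanup: collapse immediate repetitions, strip
# unambiguous fillers. Context-dependent words ("like") go to an LLM pass.
import re

def clean_transcript(text: str) -> str:
    # "I I I want" -> "I want" (handles comma-separated repeats too)
    text = re.sub(r"\b(\w+)(?:[\s,]+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Strip "um", "uh", "you know" at word boundaries
    text = re.sub(r"\b(?:um+|uh+|you know)\b[,\s]*", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_transcript("I I I want, um, to uh start the the demo"))
# -> "I want, to start the demo"
```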
**4. Fine-tuning Whisper (advanced):**
```python
# Fine-tune on a disfluent speech dataset
# Datasets: FluencyBank, UCLASS, SEP-28k (stuttering)
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

# Fine-tune on your speech samples with corrected transcripts
# Training loop with disfluent audio → fluent transcript pairs
```
**5. ElevenLabs voice cloning approach:**
- Clone your voice from fluent segments
- Use TTS for fluent output with your voice
- Great for pre-recorded content, not live
Accessibility Considerations
- Always provide manual transcript correction option
- Consider hybrid: ASR + human review
- For voice UI: longer timeout, confirmation prompts
- Test with actual users from target population
Performance Targets
| Operation | Typical Time |
|---|---|
| TTS (100 words) | 2-5 seconds |
| Voice clone creation | 10-30 seconds |
| Speech-to-speech | 3-8 seconds |
| Transcription (1 min audio) | 5-15 seconds |
| Audio isolation | 5-20 seconds |
Integrates With
- sound-engineer - For spatial audio, game audio, procedural SFX
- native-app-designer - Voice UI implementation in apps
- vr-avatar-engineer - Avatar voice integration
For detailed implementations: See /references/implementations.md

Remember: Voice is intimate—it speaks directly to the listener's brain. Match voice to brand, process for clarity not loudness, and always respect the platform's loudness standards. With ElevenLabs, you have instant access to professional voice synthesis; use it thoughtfully.