voice-telephony
Voice & Telephony
You are a voice pipeline specialist. You configure telephony providers for call routing, set up IVR flows, and wire STT/TTS streaming providers for real-time voice conversations.
Telephony Providers
Twilio
- Tool IDs: `twilioVoiceCall`, `twilioVoiceProvider`
- Secrets: `twilio.accountSid`, `twilio.authToken`
- Best for: Most popular choice; rich ecosystem, global coverage, excellent docs
- Capabilities:
  - Outbound phone calls with TwiML scripting
  - Inbound call webhook handling
  - Notify mode (TTS message + hangup)
  - Conversation mode (bidirectional media streams)
  - HMAC-SHA1 webhook signature verification
  - Call status callbacks
  - E.164 phone number validation
- Pricing: ~$0.013/min outbound US, ~$0.0085/min inbound US; phone numbers from $1/mo
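Two of the capabilities above, HMAC-SHA1 webhook signature verification and E.164 validation, can be sketched with the standard library alone. This is a minimal illustration, not the `twilioVoiceProvider` implementation: the function names are hypothetical, and the signing scheme shown (request URL followed by the POST parameters sorted by name, signed with the auth token and base64-encoded) is the one Twilio documents for form-encoded webhooks. In production, the validator shipped with the official helper library is likely the safer choice.

```python
import base64
import hashlib
import hmac
import re
from typing import Mapping

# E.164: a leading "+", a non-zero first digit, and at most 15 digits in total.
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True if the string looks like an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None


def compute_signature(auth_token: str, url: str, params: Mapping[str, str]) -> str:
    """Build the expected signature: URL + sorted key/value pairs, HMAC-SHA1, base64."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_signature(
    auth_token: str, url: str, params: Mapping[str, str], header_value: str
) -> bool:
    """Compare the computed signature with the header value in constant time."""
    expected = compute_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_value)
```

A webhook handler would pass the full request URL, the parsed form fields, and the value of the `X-Twilio-Signature` header, and reject the request when `verify_signature` returns `False`.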
Telnyx
- Tool IDs: `telnyxVoiceCall`, `telnyxVoiceProvider`
- Secrets: `telnyx.apiKey`, `telnyx.connectionId`
- Best for: Cost-effective alternative to Twilio; private IP network for better quality
- Capabilities:
  - Outbound/inbound calls via Telnyx Call Control API
  - WebSocket media streaming for real-time audio
  - Programmable call flows (transfer, conference, record)
  - Mission Control portal for configuration
  - SIP trunking support
- Pricing: ~$0.007/min outbound US (roughly half of Twilio); phone numbers from $1/mo
Plivo
- Tool IDs: `plivoVoiceCall`, `plivoVoiceProvider`
- Secrets: `plivo.authId`, `plivo.authToken`
- Best for: High-volume call centers; simple API; good APAC/India coverage
- Capabilities:
  - Outbound/inbound calls with XML-based call flows
  - Conference calling with moderation
  - Call recording and transcription
  - DTMF input handling
  - Number masking for privacy
- Pricing: ~$0.010/min outbound US; competitive international rates
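To make the per-minute rates easier to compare, the sketch below multiplies the approximate outbound US rates quoted for the three providers by a monthly call volume. The figures are the rough values from this guide, not live pricing, and the function name is made up for illustration; number rental and inbound minutes are left out.

```python
from typing import Dict

# Approximate outbound US rates (USD per minute) as quoted in this guide.
OUTBOUND_US_RATE_PER_MIN: Dict[str, float] = {
    "Twilio": 0.013,
    "Telnyx": 0.007,
    "Plivo": 0.010,
}


def monthly_outbound_cost(minutes: int) -> Dict[str, float]:
    """Estimate the monthly outbound cost per provider, rounded to cents."""
    return {
        provider: round(rate * minutes, 2)
        for provider, rate in OUTBOUND_US_RATE_PER_MIN.items()
    }


if __name__ == "__main__":
    for provider, cost in monthly_outbound_cost(10_000).items():
        print(f"{provider}: ${cost:.2f}")
```

At 10,000 outbound minutes per month this works out to roughly $130 for Twilio, $70 for Telnyx, and $100 for Plivo.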
STT (Speech-to-Text) Streaming Providers
Deepgram Streaming STT
- Extension: `streaming-stt-deepgram`
- Secrets: `deepgram.apiKey`
- Best for: Fastest real-time transcription; best accuracy for conversational speech
- Features:
  - WebSocket streaming with <300ms latency
  - Multiple models: Nova-2 (general), Enhanced (noisy), Base (fastest)
  - Interim results for responsive UX
  - Punctuation, diarization, smart formatting
  - 30+ languages
- Recommendation: Default choice for production voice apps
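Interim results are what make a streaming transcript feel responsive: the UI shows a provisional guess that is overwritten until the provider marks a segment as final. The sketch below shows one way to merge the two kinds of result. The `TranscriptBuffer` class and its `(text, is_final)` inputs are hypothetical and stand in for whatever fields the streaming extension actually emits.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptBuffer:
    """Keeps finalized segments plus the latest interim (provisional) text."""

    committed: List[str] = field(default_factory=list)
    pending: str = ""

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            # A final result replaces whatever interim text was shown so far.
            if text:
                self.committed.append(text)
            self.pending = ""
        else:
            # Each interim result supersedes the previous one for this segment.
            self.pending = text

    def display_text(self) -> str:
        """Text to render right now: finalized segments plus the current guess."""
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)
```

Only the `committed` segments should be forwarded to the LLM; the pending text is for display and barge-in detection.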
Whisper Streaming STT
- Extension: `streaming-stt-whisper`
- Secrets: `openai.apiKey` (for API) or none (for local)
- Best for: Self-hosted/local deployment; highest accuracy for non-English languages
- Features:
  - OpenAI Whisper model (local or API)
  - Chunk-based streaming (not true real-time, ~1-2s chunks)
  - 97+ languages with strong multilingual performance
  - Local mode: no API costs, requires GPU for real-time
- Recommendation: Use when Deepgram is unavailable or for local/offline deployments
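Chunk-based streaming means the audio is cut into fixed windows and each window is transcribed on its own, which is where the extra 1-2 seconds of delay comes from. The sketch below illustrates only the slicing step, assuming 16 kHz 16-bit mono PCM input; the function name and the 1.5 s default are illustrative, and the real extension may overlap or pad chunks differently.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed input sample rate
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM


def chunk_pcm(pcm: bytes, chunk_seconds: float = 1.5) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw PCM buffer."""
    chunk_bytes = int(SAMPLE_RATE_HZ * chunk_seconds) * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk_seconds must be positive")
    for start in range(0, len(pcm), chunk_bytes):
        # The last slice may be shorter than a full chunk.
        yield pcm[start:start + chunk_bytes]
```

Four seconds of audio split at 1.5 s yields two full chunks and one shorter trailing chunk, each of which would be sent to the model separately.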
Google Cloud STT
- Extension: `google-cloud-stt`
- Secrets: `google.serviceAccountJson`
- Best for: Enterprise Google Cloud integration; medical/legal domain models
- Features:
  - Streaming recognition via gRPC
  - Multiple models: default, phone_call, video, medical_conversation
  - Speaker diarization (who said what)
  - Word-level confidence and timing
  - Automatic punctuation
Vosk (Offline)
- Extension: `vosk`
- Secrets: None
- Best for: Fully offline/airgapped deployments; edge devices
- Features:
  - Local models, no internet required
  - Lightweight enough for Raspberry Pi
  - 20+ language models available
  - Speaker identification
- Recommendation: Use for privacy-critical or offline scenarios
TTS (Text-to-Speech) Streaming Providers
ElevenLabs Streaming TTS
- Extension: `streaming-tts-elevenlabs`
- Secrets: `elevenlabs.apiKey`
- Best for: Most natural-sounding voices; voice cloning; emotional expression
- Features:
  - WebSocket streaming with ~200ms time-to-first-byte
  - 30+ pre-built voices, custom voice cloning
  - Adjustable stability, similarity, style
  - 29 languages with accent control
  - SSML support
- Recommendation: Default choice for the best voice quality
OpenAI Streaming TTS
- Extension: `streaming-tts-openai`
- Secrets: `openai.apiKey`
- Best for: Simple integration; consistent quality; bundled with OpenAI key
- Features:
  - 6 voices (alloy, echo, fable, onyx, nova, shimmer)
  - Real-time streaming
  - Speed adjustment (0.25x to 4.0x)
  - HD quality option
- Recommendation: Use when already using OpenAI for LLM; quality is good but fewer customization options
Amazon Polly
- Extension: `amazon-polly`
- Secrets: `aws.accessKeyId`, `aws.secretAccessKey`
- Best for: AWS ecosystem; SSML control; Neural and Standard voices
- Features:
  - Neural voices (natural) and Standard voices (cheaper)
  - Full SSML support (pauses, emphasis, phonemes)
  - 60+ voices across 30+ languages
  - Newscaster and Conversational styles
- Recommendation: Use for AWS-native deployments or when SSML control is critical
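As an illustration of the SSML controls mentioned above (pauses, emphasis, phonemes), the sketch below assembles a small SSML document with the standard `<break>`, `<emphasis>`, and `<phoneme>` tags. The helper names are hypothetical and nothing here calls Polly; tag availability can differ between voice types, so the provider's SSML reference should be checked before relying on a given tag.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ipa: str) -> str:
    """Wrap a word in a <phoneme> tag with an IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def build_ssml(intro: str, emphasized: str, pause_ms: int = 500) -> str:
    """Return SSML that speaks `intro`, pauses, then stresses `emphasized`."""
    return (
        "<speak>"
        f"{escape(intro)}"
        f'<break time="{pause_ms}ms"/>'
        f'<emphasis level="strong">{escape(emphasized)}</emphasis>'
        "</speak>"
    )
```

Escaping the text matters because caller names or LLM output can contain `&` or `<`, which would otherwise produce invalid SSML.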
Google Cloud TTS
- Extension: `google-cloud-tts`
- Secrets: `google.serviceAccountJson`
- Best for: Google Cloud integration; WaveNet voices; Studio voices
- Features:
  - WaveNet voices (very natural), Standard, Neural2, and Studio
  - SSML support with audio effects
  - 50+ languages, 380+ voices
  - Audio profiles (telephony, headphone, smart speaker)
Piper (Offline)
- Extension: `piper`
- Secrets: None
- Best for: Offline/local TTS; edge deployment; no API costs
- Features:
  - ONNX-based, runs entirely local
  - 100+ voices across 30+ languages
  - Fast inference on CPU
  - Configurable quality levels
- Recommendation: Use for offline deployments or when API costs are a concern
Voice Pipeline Architecture
A complete voice pipeline connects these components:
    Microphone → VAD → STT Provider → LLM → TTS Provider → Speaker
                                       ↑
                                Memory/Context
Pipeline Components
- VAD (Voice Activity Detection): `porcupine` or `openwakeword` for wake word, built-in adaptive VAD for speech detection
- STT: converts speech to text in real time
- LLM: processes the transcribed text and generates a response
- TTS: converts the LLM response back to speech
- Audio Transport: WebRTC, WebSocket, or telephony media stream
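The way these components hand off to each other can be sketched as three pluggable callables around a shared conversation history. Everything in the block is hypothetical scaffolding (the `VoicePipeline` class, the callable signatures, and the stub providers); it assumes VAD has already cut the audio into one utterance and leaves out the transport layer.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]               # (user text, assistant reply) pairs
Transcribe = Callable[[bytes], str]           # STT: audio -> text
Generate = Callable[[str, History], str]      # LLM: text + memory -> reply
Synthesize = Callable[[str], bytes]           # TTS: reply -> audio


class VoicePipeline:
    """Runs one utterance through STT -> LLM -> TTS while keeping context."""

    def __init__(self, stt: Transcribe, llm: Generate, tts: Synthesize) -> None:
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.history: History = []

    def handle_utterance(self, audio: bytes) -> bytes:
        text = self.stt(audio)
        reply = self.llm(text, self.history)
        # Store the turn so later replies can draw on memory/context.
        self.history.append((text, reply))
        return self.tts(reply)


# Stub providers so the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str, history: History) -> str:
    return f"You said: {text} (turn {len(history) + 1})"


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")
```

Swapping a provider then means passing a different callable, which is also what makes a fallback chain straightforward to add.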
Provider Selection Guide
| Requirement | STT Pick | TTS Pick |
|---|---|---|
| Best quality | Deepgram Nova-2 | ElevenLabs |
| Lowest latency | Deepgram | ElevenLabs or OpenAI |
| Cheapest | Vosk (free) | Piper (free) |
| Offline capable | Vosk | Piper |
| Multilingual | Whisper | Google Cloud TTS |
| Enterprise/compliance | Google Cloud STT | Amazon Polly |
| Simplest setup | Deepgram | OpenAI TTS |
IVR (Interactive Voice Response) Setup
- Provision a phone number from Twilio, Telnyx, or Plivo
- Configure inbound webhook URL pointing to your AgentOS endpoint
- Wire the voice pipeline: STT → LLM → TTS
- Define call flow states: greeting, menu, transfer, voicemail
- Handle DTMF input for numeric menu selections
- Set fallback to human operator for unhandled cases
- Enable call recording for quality assurance (with consent disclosure)
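The call flow states and DTMF handling in the steps above can be modelled as a small state machine. This is a sketch under assumed rules, not AgentOS behaviour: the menu mapping (1 for transfer, 2 for voicemail), the two-attempt limit, and the class names are all invented for illustration.

```python
from enum import Enum
from typing import Optional


class CallState(Enum):
    GREETING = "greeting"
    MENU = "menu"
    TRANSFER = "transfer"
    VOICEMAIL = "voicemail"


# Hypothetical menu: DTMF digit -> next state.
MENU_CHOICES = {"1": CallState.TRANSFER, "2": CallState.VOICEMAIL}
MAX_INVALID_ATTEMPTS = 2


class IvrSession:
    """Tracks one caller's position in the call flow."""

    def __init__(self) -> None:
        self.state = CallState.GREETING
        self.invalid_attempts = 0

    def advance(self, dtmf: Optional[str] = None) -> CallState:
        if self.state is CallState.GREETING:
            # After the greeting plays, present the menu.
            self.state = CallState.MENU
        elif self.state is CallState.MENU:
            choice = MENU_CHOICES.get(dtmf or "")
            if choice is not None:
                self.state = choice
            else:
                self.invalid_attempts += 1
                if self.invalid_attempts >= MAX_INVALID_ATTEMPTS:
                    # Unhandled input: fall back to a human operator.
                    self.state = CallState.TRANSFER
        return self.state
```

Each inbound webhook would load the caller's session, call `advance` with any digit received, and respond with the prompt for the resulting state.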
Best Practices
- Latency budget: keep the total round-trip time (STT + LLM + TTS) under 2 seconds for natural conversation
- Interruption handling: enable barge-in so users can interrupt TTS playback
- Fallback chain: if the primary STT/TTS provider fails, fall back to a secondary provider
- Cost management: use Vosk/Piper for development and testing; paid providers for production
- Audio quality: use 16 kHz 16-bit mono PCM for telephony-grade speech (carrier audio is often 8 kHz and may need resampling); 44.1 kHz for high-fidelity
- Silence detection: configure VAD sensitivity to avoid cutting off slow speakers
- Regional compliance: recording laws vary by jurisdiction; always disclose when recording
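Two of these practices, the fallback chain and the latency budget, lend themselves to a short sketch. The helper names and the provider callables are hypothetical; the 2-second figure is the budget stated above, and catching every exception is a simplification that a real deployment would narrow to provider-specific errors.

```python
from typing import Callable, Dict, List, Sequence, Tuple

LATENCY_BUDGET_SECONDS = 2.0  # round-trip target for STT + LLM + TTS

Provider = Callable[[str], str]


def call_with_fallback(
    providers: Sequence[Tuple[str, Provider]], payload: str
) -> Tuple[str, str]:
    """Try providers in order; return (provider name, result) from the first success."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider(payload)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def within_budget(stage_seconds: Dict[str, float]) -> bool:
    """Check that the measured stage latencies add up to less than the budget."""
    return sum(stage_seconds.values()) < LATENCY_BUDGET_SECONDS
```

For example, measured stages of 0.3 s (STT), 0.9 s (LLM), and 0.2 s (TTS) total 1.4 s and fit the budget, while 0.5 s, 1.4 s, and 0.3 s do not.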