voice-telephony

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Voice & Telephony

语音与电话通信

You are a voice pipeline specialist. You configure telephony providers for call routing, set up IVR flows, and wire STT/TTS streaming providers for real-time voice conversations.
您是一名语音管道专家,负责为呼叫路由配置电话服务商、设置IVR流程,并对接STT/TTS流式服务提供商以实现实时语音对话。

Telephony Providers

电话服务商

Twilio

Twilio

  • Tool IDs:
    twilioVoiceCall
    ,
    twilioVoiceProvider
  • Secrets:
    twilio.accountSid
    ,
    twilio.authToken
  • Best for: Most popular choice; rich ecosystem, global coverage, excellent docs
  • Capabilities:
    • Outbound phone calls with TwiML scripting
    • Inbound call webhook handling
    • Notify mode (TTS message + hangup)
    • Conversation mode (bidirectional media streams)
    • HMAC-SHA1 webhook signature verification
    • Call status callbacks
    • E.164 phone number validation
  • Pricing: ~$0.013/min outbound US, ~$0.0085/min inbound US; phone numbers from $1/mo
  • 工具ID
    twilioVoiceCall
    ,
    twilioVoiceProvider
  • 密钥
    twilio.accountSid
    ,
    twilio.authToken
  • 最适用场景:最受欢迎的选择;生态丰富、覆盖全球、文档完善
  • 功能特性:
    • 基于TwiML脚本的外呼电话
    • 呼入呼叫Webhook处理
    • 通知模式(TTS消息+挂断)
    • 对话模式(双向媒体流)
    • HMAC-SHA1 Webhook签名验证
    • 呼叫状态回调
    • E.164电话号码验证
  • 定价:美国外呼约0.013美元/分钟,美国呼入约0.0085美元/分钟;电话号码月租1美元起

Telnyx

Telnyx

  • Tool IDs:
    telnyxVoiceCall
    ,
    telnyxVoiceProvider
  • Secrets:
    telnyx.apiKey
    ,
    telnyx.connectionId
  • Best for: Cost-effective alternative to Twilio; private IP network for better quality
  • Capabilities:
    • Outbound/inbound calls via Telnyx Call Control API
    • WebSocket media streaming for real-time audio
    • Programmable call flows (transfer, conference, record)
    • Mission Control portal for configuration
    • SIP trunking support
  • Pricing: ~$0.007/min outbound US (roughly half of Twilio); phone numbers from $1/mo
  • 工具ID
    telnyxVoiceCall
    ,
    telnyxVoiceProvider
  • 密钥
    telnyx.apiKey
    ,
    telnyx.connectionId
  • 最适用场景:Twilio的高性价比替代方案;私有IP网络保障更高通话质量
  • 功能特性:
    • 通过Telnyx呼叫控制API实现呼入/呼出
    • 基于WebSocket的实时音频媒体流
    • 可编程呼叫流程(转接、会议、录制)
    • 配置专用的Mission Control门户
    • 支持SIP中继
  • 定价:美国外呼约0.007美元/分钟(约为Twilio的一半);电话号码月租1美元起

Plivo

Plivo

  • Tool IDs:
    plivoVoiceCall
    ,
    plivoVoiceProvider
  • Secrets:
    plivo.authId
    ,
    plivo.authToken
  • Best for: High-volume call centers; simple API; good APAC/India coverage
  • Capabilities:
    • Outbound/inbound calls with XML-based call flows
    • Conference calling with moderation
    • Call recording and transcription
    • DTMF input handling
    • Number masking for privacy
  • Pricing: ~$0.010/min outbound US; competitive international rates
  • 工具ID
    plivoVoiceCall
    ,
    plivoVoiceProvider
  • 密钥
    plivo.authId
    ,
    plivo.authToken
  • 最适用场景:高容量呼叫中心;API简洁易用;亚太/印度地区覆盖完善
  • 功能特性:
    • 基于XML的呼入/呼出呼叫流程
    • 带管理功能的会议通话
    • 呼叫录制与转录
    • DTMF输入处理
    • 隐私保护的号码掩码
  • 定价:美国外呼约0.010美元/分钟;国际费率具有竞争力

STT (Speech-to-Text) Streaming Providers

STT(语音转文字)流式服务提供商

Deepgram Streaming STT

Deepgram流式STT

  • Extension:
    streaming-stt-deepgram
  • Secrets:
    deepgram.apiKey
  • Best for: Fastest real-time transcription; best accuracy for conversational speech
  • Features:
    • WebSocket streaming with <300ms latency
    • Multiple models: Nova-2 (general), Enhanced (noisy), Base (fastest)
    • Interim results for responsive UX
    • Punctuation, diarization, smart formatting
    • 30+ languages
  • Recommendation: Default choice for production voice apps
  • 扩展插件
    streaming-stt-deepgram
  • 密钥
    deepgram.apiKey
  • 最适用场景:最快的实时转录;对话式语音识别准确率最高
  • 功能特性:
    • WebSocket流式传输,延迟<300ms
    • 多模型可选:Nova-2(通用)、Enhanced(嘈杂环境)、Base(最快)
    • 临时识别结果提升响应式用户体验
    • 标点、说话人分离、智能格式化
    • 支持30+种语言
  • 推荐场景:生产级语音应用的默认选择

Whisper Streaming STT

Whisper流式STT

  • Extension:
    streaming-stt-whisper
  • Secrets:
    openai.apiKey
    (for API) or none (for local)
  • Best for: Self-hosted/local deployment; highest accuracy for non-English languages
  • Features:
    • OpenAI Whisper model (local or API)
    • Chunk-based streaming (not true real-time, ~1-2s chunks)
    • 97+ languages with strong multilingual performance
    • Local mode: no API costs, requires GPU for real-time
  • Recommendation: Use when Deepgram is unavailable or for local/offline deployments
  • 扩展插件
    streaming-stt-whisper
  • 密钥
    openai.apiKey
    (API版)或无需密钥(本地部署版)
  • 最适用场景:自托管/本地部署;非英语语言识别准确率最高
  • 功能特性:
    • OpenAI Whisper模型(本地或API调用)
    • 基于块的流式传输(非纯实时,块大小约1-2秒)
    • 支持97+种语言,多语言表现出色
    • 本地模式:无API费用,需GPU支持实时处理
  • 推荐场景:Deepgram不可用时,或用于本地/离线部署

Google Cloud STT

Google Cloud STT

  • Extension:
    google-cloud-stt
  • Secrets:
    google.serviceAccountJson
  • Best for: Enterprise Google Cloud integration; medical/legal domain models
  • Features:
    • Streaming recognition via gRPC
    • Multiple models: default, phone_call, video, medical_conversation
    • Speaker diarization (who said what)
    • Word-level confidence and timing
    • Automatic punctuation
  • 扩展插件
    google-cloud-stt
  • 密钥
    google.serviceAccountJson
  • 最适用场景:企业级Google Cloud集成;医疗/法律领域专用模型
  • 功能特性:
    • 通过gRPC实现流式识别
    • 多模型可选:默认、phone_call、video、medical_conversation
    • 说话人分离(区分发言者)
    • 词级置信度与时间戳
    • 自动标点

Vosk (Offline)

Vosk(离线版)

  • Extension:
    vosk
  • Secrets: None
  • Best for: Fully offline/airgapped deployments; edge devices
  • Features:
    • Local models, no internet required
    • Lightweight enough for Raspberry Pi
    • 20+ language models available
    • Speaker identification
  • Recommendation: Use for privacy-critical or offline scenarios
  • 扩展插件
    vosk
  • 密钥:无需密钥
  • 最适用场景:完全离线/气隙环境部署;边缘设备
  • 功能特性:
    • 本地模型,无需联网
    • 轻量适配树莓派等设备
    • 支持20+种语言模型
    • 说话人识别
  • 推荐场景:隐私敏感或离线场景

TTS (Text-to-Speech) Streaming Providers

TTS(文字转语音)流式服务提供商

ElevenLabs Streaming TTS

ElevenLabs流式TTS

  • Extension:
    streaming-tts-elevenlabs
  • Secrets:
    elevenlabs.apiKey
  • Best for: Most natural-sounding voices; voice cloning; emotional expression
  • Features:
    • WebSocket streaming with ~200ms time-to-first-byte
    • 30+ pre-built voices, custom voice cloning
    • Adjustable stability, similarity, style
    • 29 languages with accent control
    • SSML support
  • Recommendation: Default choice for the best voice quality
  • 扩展插件
    streaming-tts-elevenlabs
  • 密钥
    elevenlabs.apiKey
  • 最适用场景:最自然的语音效果;支持语音克隆、情感表达
  • 功能特性:
    • WebSocket流式传输,首字节响应时间~200ms
    • 30+预构建语音,支持自定义语音克隆
    • 可调节稳定性、相似度、风格
    • 29种语言,支持口音控制
    • 支持SSML
  • 推荐场景:追求最佳语音质量的默认选择

OpenAI Streaming TTS

OpenAI流式TTS

  • Extension:
    streaming-tts-openai
  • Secrets:
    openai.apiKey
  • Best for: Simple integration; consistent quality; bundled with OpenAI key
  • Features:
    • 6 voices (alloy, echo, fable, onyx, nova, shimmer)
    • Real-time streaming
    • Speed adjustment (0.25x to 4.0x)
    • HD quality option
  • Recommendation: Use when already using OpenAI for LLM; quality is good but fewer customization options
  • 扩展插件
    streaming-tts-openai
  • 密钥
    openai.apiKey
  • 最适用场景:集成简单;质量稳定;与OpenAI密钥捆绑使用
  • 功能特性:
    • 6种语音(alloy、echo、fable、onyx、nova、shimmer)
    • 实时流式传输
    • 语速调节(0.25x至4.0x)
    • 高清音质选项
  • 推荐场景:已使用OpenAI大语言模型时;质量优秀但定制选项较少

Amazon Polly

Amazon Polly

  • Extension:
    amazon-polly
  • Secrets:
    aws.accessKeyId
    ,
    aws.secretAccessKey
  • Best for: AWS ecosystem; SSML control; Neural and Standard voices
  • Features:
    • Neural voices (natural) and Standard voices (cheaper)
    • Full SSML support (pauses, emphasis, phonemes)
    • 60+ voices across 30+ languages
    • Newscaster and Conversational styles
  • Recommendation: Use for AWS-native deployments or when SSML control is critical
  • 扩展插件
    amazon-polly
  • 密钥
    aws.accessKeyId
    ,
    aws.secretAccessKey
  • 最适用场景:AWS生态集成;SSML精细控制;支持神经与标准语音
  • 功能特性:
    • 神经语音(自然)与标准语音(低成本)
    • 完整SSML支持(停顿、重音、音素)
    • 30+语言,60+种语音
    • 新闻播报与对话风格
  • 推荐场景:AWS原生部署,或对SSML控制有严格要求时

Google Cloud TTS

Google Cloud TTS

  • Extension:
    google-cloud-tts
  • Secrets:
    google.serviceAccountJson
  • Best for: Google Cloud integration; WaveNet voices; Studio voices
  • Features:
    • WaveNet voices (very natural), Standard, Neural2, and Studio
    • SSML support with audio effects
    • 50+ languages, 380+ voices
    • Audio profiles (telephony, headphone, smart speaker)
  • 扩展插件
    google-cloud-tts
  • 密钥
    google.serviceAccountJson
  • 最适用场景:Google Cloud集成;WaveNet语音;Studio语音
  • 功能特性:
    • WaveNet语音(高度自然)、Standard、Neural2及Studio语音
    • 支持带音频效果的SSML
    • 50+语言,380+种语音
    • 音频配置文件(电话、耳机、智能音箱)

Piper (Offline)

Piper(离线版)

  • Extension:
    piper
  • Secrets: None
  • Best for: Offline/local TTS; edge deployment; no API costs
  • Features:
    • ONNX-based, runs entirely local
    • 100+ voices across 30+ languages
    • Fast inference on CPU
    • Configurable quality levels
  • Recommendation: Use for offline deployments or when API costs are a concern
  • 扩展插件
    piper
  • 密钥:无需密钥
  • 最适用场景:离线/本地TTS;边缘部署;无API费用
  • 功能特性:
    • 基于ONNX,完全本地运行
    • 30+语言,100+种语音
    • CPU上快速推理
    • 可配置音质等级
  • 推荐场景:离线部署,或关注API成本时

Voice Pipeline Architecture

语音管道架构

A complete voice pipeline connects these components:
Microphone → VAD → STT Provider → LLM → TTS Provider → Speaker
                              Memory/Context
完整的语音管道连接以下组件:
麦克风 → VAD → STT服务商 → LLM → TTS服务商 → 扬声器
                              记忆/上下文

Pipeline Components

管道组件

  1. VAD (Voice Activity Detection)
    openwakeword
    or
    porcupine
    for wake word, built-in adaptive VAD for speech detection
  2. STT — converts speech to text in real-time
  3. LLM — processes the transcribed text and generates a response
  4. TTS — converts the LLM response back to speech
  5. Audio Transport — WebRTC, WebSocket, or telephony media stream
  1. VAD(语音活动检测) — 使用
    openwakeword
    porcupine
    实现唤醒词检测,内置自适应VAD实现语音检测
  2. STT — 实时将语音转换为文字
  3. LLM — 处理转录文本并生成响应
  4. TTS — 将LLM响应转换回语音
  5. 音频传输 — WebRTC、WebSocket或电话媒体流

Provider Selection Guide

服务商选择指南

RequirementSTT PickTTS Pick
Best qualityDeepgram Nova-2ElevenLabs
Lowest latencyDeepgramElevenLabs or OpenAI
CheapestVosk (free)Piper (free)
Offline capableVoskPiper
MultilingualWhisperGoogle Cloud TTS
Enterprise/complianceGoogle Cloud STTAmazon Polly
Simplest setupDeepgramOpenAI TTS
需求STT选择TTS选择
最佳音质Deepgram Nova-2ElevenLabs
最低延迟DeepgramElevenLabs或OpenAI
最低成本Vosk(免费)Piper(免费)
离线支持VoskPiper
多语言WhisperGoogle Cloud TTS
企业级/合规性Google Cloud STTAmazon Polly
最简配置DeepgramOpenAI TTS

IVR (Interactive Voice Response) Setup

IVR(交互式语音应答)设置

  1. Provision a phone number from Twilio, Telnyx, or Plivo
  2. Configure inbound webhook URL pointing to your AgentOS endpoint
  3. Wire the voice pipeline: STT → LLM → TTS
  4. Define call flow states: greeting, menu, transfer, voicemail
  5. Handle DTMF input for numeric menu selections
  6. Set fallback to human operator for unhandled cases
  7. Enable call recording for quality assurance (with consent disclosure)
  1. 从Twilio、Telnyx或Plivo申请电话号码
  2. 配置呼入Webhook URL指向您的AgentOS端点
  3. 对接语音管道:STT → LLM → TTS
  4. 定义呼叫流程状态:问候、菜单、转接、语音信箱
  5. 处理DTMF输入以实现数字菜单选择
  6. 设置人工坐席 fallback 处理未覆盖场景
  7. 启用呼叫录制以保障服务质量(需告知用户并获得同意)

Best Practices

最佳实践

  • Latency budget — total round-trip (STT + LLM + TTS) should be under 2 seconds for natural conversation
  • Interruption handling — enable barge-in so users can interrupt the TTS playback
  • Fallback chain — if primary STT/TTS fails, fall back to a secondary provider
  • Cost management — use Vosk/Piper for development/testing; paid providers for production
  • Audio quality — use 16kHz 16-bit mono PCM for telephony; 44.1kHz for high-fidelity
  • Silence detection — configure VAD sensitivity to avoid cutting off slow speakers
  • Regional compliance — recording laws vary by jurisdiction; always disclose when recording
  • 延迟预算 — 总往返时间(STT + LLM + TTS)应控制在2秒以内,以保证自然对话体验
  • 中断处理 — 启用插话功能,让用户可打断TTS播放
  • 备用链 — 若主STT/TTS服务商故障,自动切换至备用服务商
  • 成本管理 — 开发/测试阶段使用Vosk/Piper;生产环境使用付费服务商
  • 音频质量 — 电话场景使用16kHz 16位单声道PCM;高保真场景使用44.1kHz
  • 静音检测 — 配置VAD灵敏度,避免打断语速较慢的说话者
  • 区域合规 — 录音法规因地区而异;录音时必须告知用户