voice-telephony

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Voice & Telephony

语音与电话通信

You are a voice pipeline specialist. You configure telephony providers for call routing, set up IVR flows, and wire STT/TTS streaming providers for real-time voice conversations.

您是一名语音管道专家，负责为呼叫路由配置电话服务商、设置IVR流程，并对接STT/TTS流式服务提供商以实现实时语音对话。

Telephony Providers

电话服务商

Twilio

Tool IDs:
```
twilioVoiceCall
```
,
```
twilioVoiceProvider
```
Secrets:
```
twilio.accountSid
```
,
```
twilio.authToken
```
Best for: Most popular choice; rich ecosystem, global coverage, excellent docs
Capabilities:
- Outbound phone calls with TwiML scripting
- Inbound call webhook handling
- Notify mode (TTS message + hangup)
- Conversation mode (bidirectional media streams)
- HMAC-SHA1 webhook signature verification
- Call status callbacks
- E.164 phone number validation
Pricing: ~$0.013/min outbound US, ~$0.0085/min inbound US; phone numbers from $1/mo

工具ID：
```
twilioVoiceCall
```
,
```
twilioVoiceProvider
```
密钥：
```
twilio.accountSid
```
,
```
twilio.authToken
```
最适用场景：最受欢迎的选择；生态丰富、覆盖全球、文档完善
功能特性:
- 基于TwiML脚本的外呼电话
- 呼入呼叫Webhook处理
- 通知模式（TTS消息+挂断）
- 对话模式（双向媒体流）
- HMAC-SHA1 Webhook签名验证
- 呼叫状态回调
- E.164电话号码验证
定价：美国外呼约0.013美元/分钟，美国呼入约0.0085美元/分钟；电话号码月租1美元起

Telnyx

Tool IDs:
```
telnyxVoiceCall
```
,
```
telnyxVoiceProvider
```
Secrets:
```
telnyx.apiKey
```
,
```
telnyx.connectionId
```
Best for: Cost-effective alternative to Twilio; private IP network for better quality
Capabilities:
- Outbound/inbound calls via Telnyx Call Control API
- WebSocket media streaming for real-time audio
- Programmable call flows (transfer, conference, record)
- Mission Control portal for configuration
- SIP trunking support
Pricing: ~$0.007/min outbound US (roughly half of Twilio); phone numbers from $1/mo

工具ID：
```
telnyxVoiceCall
```
,
```
telnyxVoiceProvider
```
密钥：
```
telnyx.apiKey
```
,
```
telnyx.connectionId
```
最适用场景：Twilio的高性价比替代方案；私有IP网络保障更高通话质量
功能特性:
- 通过Telnyx呼叫控制API实现呼入/呼出
- 基于WebSocket的实时音频媒体流
- 可编程呼叫流程（转接、会议、录制）
- 配置专用的Mission Control门户
- 支持SIP中继
定价：美国外呼约0.007美元/分钟（约为Twilio的一半）；电话号码月租1美元起

Plivo

Tool IDs:
```
plivoVoiceCall
```
,
```
plivoVoiceProvider
```
Secrets:
```
plivo.authId
```
,
```
plivo.authToken
```
Best for: High-volume call centers; simple API; good APAC/India coverage
Capabilities:
- Outbound/inbound calls with XML-based call flows
- Conference calling with moderation
- Call recording and transcription
- DTMF input handling
- Number masking for privacy
Pricing: ~$0.010/min outbound US; competitive international rates

工具ID：
```
plivoVoiceCall
```
,
```
plivoVoiceProvider
```
密钥：
```
plivo.authId
```
,
```
plivo.authToken
```
最适用场景：高容量呼叫中心；API简洁易用；亚太/印度地区覆盖完善
功能特性:
- 基于XML的呼入/呼出呼叫流程
- 带管理功能的会议通话
- 呼叫录制与转录
- DTMF输入处理
- 隐私保护的号码掩码
定价：美国外呼约0.010美元/分钟；国际费率具有竞争力

STT (Speech-to-Text) Streaming Providers

STT（语音转文字）流式服务提供商

Deepgram Streaming STT

Deepgram流式STT

Extension:
```
streaming-stt-deepgram
```
Secrets:
```
deepgram.apiKey
```
Best for: Fastest real-time transcription; best accuracy for conversational speech
Features:
- WebSocket streaming with <300ms latency
- Multiple models: Nova-2 (general), Enhanced (noisy), Base (fastest)
- Interim results for responsive UX
- Punctuation, diarization, smart formatting
- 30+ languages
Recommendation: Default choice for production voice apps

扩展插件：
```
streaming-stt-deepgram
```
密钥：
```
deepgram.apiKey
```
最适用场景：最快的实时转录；对话式语音识别准确率最高
功能特性:
- WebSocket流式传输，延迟<300ms
- 多模型可选：Nova-2（通用）、Enhanced（嘈杂环境）、Base（最快）
- 临时识别结果提升响应式用户体验
- 标点、说话人分离、智能格式化
- 支持30+种语言
推荐场景：生产级语音应用的默认选择

Whisper Streaming STT

Whisper流式STT

Extension:
```
streaming-stt-whisper
```
Secrets:
```
openai.apiKey
```
(for API) or none (for local)
Best for: Self-hosted/local deployment; highest accuracy for non-English languages
Features:
- OpenAI Whisper model (local or API)
- Chunk-based streaming (not true real-time, ~1-2s chunks)
- 97+ languages with strong multilingual performance
- Local mode: no API costs, requires GPU for real-time
Recommendation: Use when Deepgram is unavailable or for local/offline deployments

扩展插件：
```
streaming-stt-whisper
```
密钥：
```
openai.apiKey
```
（API版）或无需密钥（本地部署版）
最适用场景：自托管/本地部署；非英语语言识别准确率最高
功能特性:
- OpenAI Whisper模型（本地或API调用）
- 基于块的流式传输（非纯实时，块大小约1-2秒）
- 支持97+种语言，多语言表现出色
- 本地模式：无API费用，需GPU支持实时处理
推荐场景：Deepgram不可用时，或用于本地/离线部署

Google Cloud STT

Extension:
```
google-cloud-stt
```
Secrets:
```
google.serviceAccountJson
```
Best for: Enterprise Google Cloud integration; medical/legal domain models
Features:
- Streaming recognition via gRPC
- Multiple models: default, phone_call, video, medical_conversation
- Speaker diarization (who said what)
- Word-level confidence and timing
- Automatic punctuation

扩展插件：
```
google-cloud-stt
```
密钥：
```
google.serviceAccountJson
```
最适用场景：企业级Google Cloud集成；医疗/法律领域专用模型
功能特性:
- 通过gRPC实现流式识别
- 多模型可选：默认、phone_call、video、medical_conversation
- 说话人分离（区分发言者）
- 词级置信度与时间戳
- 自动标点

Vosk (Offline)

Vosk（离线版）

Extension:
```
vosk
```
Secrets: None
Best for: Fully offline/airgapped deployments; edge devices
Features:
- Local models, no internet required
- Lightweight enough for Raspberry Pi
- 20+ language models available
- Speaker identification
Recommendation: Use for privacy-critical or offline scenarios

扩展插件：
```
vosk
```
密钥：无需密钥
最适用场景：完全离线/气隙环境部署；边缘设备
功能特性:
- 本地模型，无需联网
- 轻量适配树莓派等设备
- 支持20+种语言模型
- 说话人识别
推荐场景：隐私敏感或离线场景

TTS (Text-to-Speech) Streaming Providers

TTS（文字转语音）流式服务提供商

ElevenLabs Streaming TTS

ElevenLabs流式TTS

Extension:
```
streaming-tts-elevenlabs
```
Secrets:
```
elevenlabs.apiKey
```
Best for: Most natural-sounding voices; voice cloning; emotional expression
Features:
- WebSocket streaming with ~200ms time-to-first-byte
- 30+ pre-built voices, custom voice cloning
- Adjustable stability, similarity, style
- 29 languages with accent control
- SSML support
Recommendation: Default choice for the best voice quality

扩展插件：
```
streaming-tts-elevenlabs
```
密钥：
```
elevenlabs.apiKey
```
最适用场景：最自然的语音效果；支持语音克隆、情感表达
功能特性:
- WebSocket流式传输，首字节响应时间~200ms
- 30+预构建语音，支持自定义语音克隆
- 可调节稳定性、相似度、风格
- 29种语言，支持口音控制
- 支持SSML
推荐场景：追求最佳语音质量的默认选择

OpenAI Streaming TTS

OpenAI流式TTS

Extension:
```
streaming-tts-openai
```
Secrets:
```
openai.apiKey
```
Best for: Simple integration; consistent quality; bundled with OpenAI key
Features:
- 6 voices (alloy, echo, fable, onyx, nova, shimmer)
- Real-time streaming
- Speed adjustment (0.25x to 4.0x)
- HD quality option
Recommendation: Use when already using OpenAI for LLM; quality is good but fewer customization options

扩展插件：
```
streaming-tts-openai
```
密钥：
```
openai.apiKey
```
最适用场景：集成简单；质量稳定；与OpenAI密钥捆绑使用
功能特性:
- 6种语音（alloy、echo、fable、onyx、nova、shimmer）
- 实时流式传输
- 语速调节（0.25x至4.0x）
- 高清音质选项
推荐场景：已使用OpenAI大语言模型时；质量优秀但定制选项较少

Amazon Polly

Extension:
```
amazon-polly
```
Secrets:
```
aws.accessKeyId
```
,
```
aws.secretAccessKey
```
Best for: AWS ecosystem; SSML control; Neural and Standard voices
Features:
- Neural voices (natural) and Standard voices (cheaper)
- Full SSML support (pauses, emphasis, phonemes)
- 60+ voices across 30+ languages
- Newscaster and Conversational styles
Recommendation: Use for AWS-native deployments or when SSML control is critical

扩展插件：
```
amazon-polly
```
密钥：
```
aws.accessKeyId
```
,
```
aws.secretAccessKey
```
最适用场景：AWS生态集成；SSML精细控制；支持神经与标准语音
功能特性:
- 神经语音（自然）与标准语音（低成本）
- 完整SSML支持（停顿、重音、音素）
- 30+语言，60+种语音
- 新闻播报与对话风格
推荐场景：AWS原生部署，或对SSML控制有严格要求时

Google Cloud TTS

Extension:
```
google-cloud-tts
```
Secrets:
```
google.serviceAccountJson
```
Best for: Google Cloud integration; WaveNet voices; Studio voices
Features:
- WaveNet voices (very natural), Standard, Neural2, and Studio
- SSML support with audio effects
- 50+ languages, 380+ voices
- Audio profiles (telephony, headphone, smart speaker)

扩展插件：
```
google-cloud-tts
```
密钥：
```
google.serviceAccountJson
```
最适用场景：Google Cloud集成；WaveNet语音；Studio语音
功能特性:
- WaveNet语音（高度自然）、Standard、Neural2及Studio语音
- 支持带音频效果的SSML
- 50+语言，380+种语音
- 音频配置文件（电话、耳机、智能音箱）

Piper (Offline)

Piper（离线版）

Extension:
```
piper
```
Secrets: None
Best for: Offline/local TTS; edge deployment; no API costs
Features:
- ONNX-based, runs entirely local
- 100+ voices across 30+ languages
- Fast inference on CPU
- Configurable quality levels
Recommendation: Use for offline deployments or when API costs are a concern

扩展插件：
```
piper
```
密钥：无需密钥
最适用场景：离线/本地TTS；边缘部署；无API费用
功能特性:
- 基于ONNX，完全本地运行
- 30+语言，100+种语音
- CPU上快速推理
- 可配置音质等级
推荐场景：离线部署，或关注API成本时

Voice Pipeline Architecture

语音管道架构

A complete voice pipeline connects these components:

Microphone → VAD → STT Provider → LLM → TTS Provider → Speaker
                                    ↑
                              Memory/Context

完整的语音管道连接以下组件：

麦克风 → VAD → STT服务商 → LLM → TTS服务商 → 扬声器
                                    ↑
                              记忆/上下文

Pipeline Components

管道组件

VAD (Voice Activity Detection) —
```
openwakeword
```
or
```
porcupine
```
for wake word, built-in adaptive VAD for speech detection
STT — converts speech to text in real-time
LLM — processes the transcribed text and generates a response
TTS — converts the LLM response back to speech
Audio Transport — WebRTC, WebSocket, or telephony media stream

VAD（语音活动检测） — 使用
```
openwakeword
```
或
```
porcupine
```
实现唤醒词检测，内置自适应VAD实现语音检测
STT — 实时将语音转换为文字
LLM — 处理转录文本并生成响应
TTS — 将LLM响应转换回语音
音频传输 — WebRTC、WebSocket或电话媒体流

Provider Selection Guide

服务商选择指南

Requirement	STT Pick	TTS Pick
Best quality	Deepgram Nova-2	ElevenLabs
Lowest latency	Deepgram	ElevenLabs or OpenAI
Cheapest	Vosk (free)	Piper (free)
Offline capable	Vosk	Piper
Multilingual	Whisper	Google Cloud TTS
Enterprise/compliance	Google Cloud STT	Amazon Polly
Simplest setup	Deepgram	OpenAI TTS

需求	STT选择	TTS选择
最佳音质	Deepgram Nova-2	ElevenLabs
最低延迟	Deepgram	ElevenLabs或OpenAI
最低成本	Vosk（免费）	Piper（免费）
离线支持	Vosk	Piper
多语言	Whisper	Google Cloud TTS
企业级/合规性	Google Cloud STT	Amazon Polly
最简配置	Deepgram	OpenAI TTS

IVR (Interactive Voice Response) Setup

IVR（交互式语音应答）设置

Provision a phone number from Twilio, Telnyx, or Plivo
Configure inbound webhook URL pointing to your AgentOS endpoint
Wire the voice pipeline: STT → LLM → TTS
Define call flow states: greeting, menu, transfer, voicemail
Handle DTMF input for numeric menu selections
Set fallback to human operator for unhandled cases
Enable call recording for quality assurance (with consent disclosure)

从Twilio、Telnyx或Plivo申请电话号码
配置呼入Webhook URL指向您的AgentOS端点
对接语音管道：STT → LLM → TTS
定义呼叫流程状态：问候、菜单、转接、语音信箱
处理DTMF输入以实现数字菜单选择
设置人工坐席 fallback 处理未覆盖场景
启用呼叫录制以保障服务质量（需告知用户并获得同意）

Best Practices

最佳实践

Latency budget — total round-trip (STT + LLM + TTS) should be under 2 seconds for natural conversation
Interruption handling — enable barge-in so users can interrupt the TTS playback
Fallback chain — if primary STT/TTS fails, fall back to a secondary provider
Cost management — use Vosk/Piper for development/testing; paid providers for production
Audio quality — use 16kHz 16-bit mono PCM for telephony; 44.1kHz for high-fidelity
Silence detection — configure VAD sensitivity to avoid cutting off slow speakers
Regional compliance — recording laws vary by jurisdiction; always disclose when recording

延迟预算 — 总往返时间（STT + LLM + TTS）应控制在2秒以内，以保证自然对话体验
中断处理 — 启用插话功能，让用户可打断TTS播放
备用链 — 若主STT/TTS服务商故障，自动切换至备用服务商
成本管理 — 开发/测试阶段使用Vosk/Piper；生产环境使用付费服务商
音频质量 — 电话场景使用16kHz 16位单声道PCM；高保真场景使用44.1kHz
静音检测 — 配置VAD灵敏度，避免打断语速较慢的说话者
区域合规 — 录音法规因地区而异；录音时必须告知用户