voice-ai
<objective>
Build production voice AI agents with sub-500ms latency:
- STT - Deepgram Nova-3 streaming transcription (~150ms)
- LLM - Groq llama-3.1-8b-instant for fastest inference (~220ms)
- TTS - Cartesia Sonic for ultra-realistic voice (~90ms)
- Telephony - Twilio Media Streams for real-time bidirectional audio
CRITICAL: NO OPENAI - Never use `from openai import OpenAI`
Key deliverables:
- Streaming STT with voice activity detection
- Low-latency LLM responses optimized for voice
- Expressive TTS with emotion controls
- Twilio Media Streams WebSocket handler </objective>
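The stage latencies above add up to a ~460ms budget. A small helper can verify the sub-500ms target per call (an illustrative sketch; the class and stage names are not from the original):

```python
import time

class LatencyBudget:
    """Accumulates per-stage latencies (ms) and checks them against a total budget."""

    def __init__(self, total_ms: float = 500.0):
        self.total_ms = total_ms
        self.stages: dict[str, float] = {}

    def record(self, name: str, started: float) -> None:
        # `started` is a time.perf_counter() value captured before the stage ran
        self.stages[name] = (time.perf_counter() - started) * 1000.0

    def total(self) -> float:
        return sum(self.stages.values())

    def within_budget(self) -> bool:
        return self.total() <= self.total_ms
```

Capture `time.perf_counter()` before each stage (STT, LLM, TTS), `record` it after, and log `total()` per call to catch latency regressions early.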
<quick_start>
Minimal Voice Pipeline (~50 lines, <500ms):
python
import os
import asyncio
from groq import AsyncGroq
from deepgram import AsyncDeepgramClient
from cartesia import AsyncCartesia
# NEVER: from openai import OpenAI
async def voice_pipeline(user_audio: bytes) -> bytes:
    """Process audio input, return audio response."""
    # 1. STT: Deepgram Nova-3 (~150ms)
    dg = AsyncDeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))
    result = await dg.listen.rest.v1.transcribe(
        {"buffer": user_audio, "mimetype": "audio/wav"},
        {"model": "nova-3", "language": "en-US"}
    )
    user_text = result.results.channels[0].alternatives[0].transcript
    # 2. LLM: Groq (~220ms) - NOT OpenAI
    groq = AsyncGroq(api_key=os.getenv("GROQ_API_KEY"))
    response = await groq.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system", "content": "Keep responses under 2 sentences."},
            {"role": "user", "content": user_text}
        ],
        max_tokens=150
    )
    response_text = response.choices[0].message.content
    # 3. TTS: Cartesia Sonic-2 (~90ms)
    cartesia = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))
    audio_chunks = []
    async for chunk in cartesia.tts.sse(
        model_id="sonic-2",
        transcript=response_text,
        voice={"id": "f9836c6e-a0bd-460e-9d3c-f7299fa60f94"},
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 8000}
    ):
        if chunk.audio:
            audio_chunks.append(chunk.audio)
    return b"".join(audio_chunks)  # Total: ~460ms
</quick_start>
<success_criteria>
A voice AI agent is successful when:
- Total latency is under 500ms (STT + LLM + TTS)
- STT correctly transcribes with utterance end detection
- TTS sounds natural and conversational
- Barge-in (interruption) works smoothly (Enterprise tier)
- Bilingual support handles language switching
</success_criteria>
<optimal_stack>
VozLux-Tested Stack
| Component | Provider | Model | Latency | Notes |
|---|---|---|---|---|
| STT | Deepgram | Nova-3 | ~150ms | Streaming, VAD, utterance detection |
| LLM | Groq | llama-3.1-8b-instant | ~220ms | LPU hardware, fastest inference |
| TTS | Cartesia | Sonic-2 | ~90ms | Streaming, emotions, bilingual |
| TOTAL | - | - | ~460ms | Sub-500ms target achieved |
LLM Priority (Never OpenAI)
python
LLM_PRIORITY = [
    ("groq", "GROQ_API_KEY", "~220ms"),        # Primary
    ("cerebras", "CEREBRAS_API_KEY", "~200ms"),    # Fallback
    ("anthropic", "ANTHROPIC_API_KEY", "~500ms"),  # Quality fallback
]
# NEVER: from openai import OpenAI
Tier Architecture
| Tier | Latency | STT | LLM | TTS | Features |
|---|---|---|---|---|---|
| Free | 3000ms | TwiML Gather | Groq | Polly | Basic IVR |
| Pro | 600ms | Deepgram Nova | Groq | Cartesia | Media Streams |
| Enterprise | 400ms | Deepgram + VAD | Groq | Cartesia | Barge-in |
</optimal_stack>
<deepgram_stt>
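The LLM_PRIORITY list above can be wired into a runtime selector that picks the first configured provider; only the priority order comes from the list, the selection logic is an assumed sketch:

```python
import os

LLM_PRIORITY = [
    ("groq", "GROQ_API_KEY", "~220ms"),        # Primary
    ("cerebras", "CEREBRAS_API_KEY", "~200ms"),    # Fallback
    ("anthropic", "ANTHROPIC_API_KEY", "~500ms"),  # Quality fallback
]

def pick_llm_provider(env=None) -> str:
    """Return the first provider in priority order whose API key is set."""
    env = os.environ if env is None else env
    for name, key_var, _latency in LLM_PRIORITY:
        if env.get(key_var):
            return name
    raise RuntimeError("No LLM provider configured (and OpenAI is not an option)")
```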
Deepgram STT (v5 SDK)
Streaming WebSocket Pattern
python
import os
from deepgram import AsyncDeepgramClient
from deepgram.core.events import EventType
from deepgram.extensions.types.sockets import (
    ListenV1SocketClientResponse,
    ListenV1MediaMessage,
    ListenV1ControlMessage
)
async def streaming_stt():
    client = AsyncDeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))
    async with client.listen.v1.connect(model="nova-3") as connection:
        def on_message(message: ListenV1SocketClientResponse):
            msg_type = getattr(message, "type", None)
            if msg_type == "Results":
                channel = getattr(message, "channel", None)
                if channel and channel.alternatives:
                    text = channel.alternatives[0].transcript
                    is_final = getattr(message, "is_final", False)
                    if text:
                        print(f"{'[FINAL]' if is_final else '[INTERIM]'} {text}")
            elif msg_type == "UtteranceEnd":
                print("[USER FINISHED SPEAKING]")
            elif msg_type == "SpeechStarted":
                print("[USER STARTED SPEAKING - barge-in trigger]")
        connection.on(EventType.MESSAGE, on_message)
        await connection.start_listening()
        # Send audio chunks (audio_bytes: raw chunk from your audio source)
        await connection.send_media(ListenV1MediaMessage(data=audio_bytes))
        # Keep alive for long sessions
        await connection.send_control(ListenV1ControlMessage(type="KeepAlive"))
Connection Options
python
options = {
    "model": "nova-3",
    "language": "en-US",
    "encoding": "mulaw",         # Twilio format
    "sample_rate": 8000,         # Telephony standard
    "interim_results": True,     # Get partial transcripts
    "utterance_end_ms": 1000,    # Silence to end utterance
    "vad_events": True,          # Voice activity detection
}
See reference/deepgram-setup.md for full streaming setup.
</deepgram_stt>
<groq_llm>
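The Results and UtteranceEnd events from the streaming pattern above are usually folded into a turn buffer so the LLM only sees complete user turns; a minimal sketch (the class name and API are illustrative, not from the original):

```python
class TurnAggregator:
    """Collects final transcript fragments until UtteranceEnd closes the turn."""

    def __init__(self):
        self.parts: list[str] = []

    def on_final(self, transcript: str) -> None:
        # Called for Results messages with is_final=True
        if transcript:
            self.parts.append(transcript)

    def on_utterance_end(self):
        # Called for UtteranceEnd; returns the full user turn, or None if empty
        turn = " ".join(self.parts).strip()
        self.parts = []
        return turn or None
```

Interim results stay out of the buffer; only finals accumulate, and the joined turn is what gets sent to the LLM.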
Groq LLM (Fastest Inference)
Voice-Optimized Pattern
python
from groq import AsyncGroq
class GroqVoiceLLM:
    def __init__(self, model: str = "llama-3.1-8b-instant"):
        self.client = AsyncGroq()
        self.model = model
        self.system_prompt = (
            "You are a helpful voice assistant. "
            "Keep responses to 2-3 sentences max. "
            "Speak naturally as if on a phone call."
        )
    async def generate_stream(self, user_input: str):
        """Streaming for lowest TTFB."""
        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_input}
            ],
            max_tokens=150,
            temperature=0.7,
            stream=True,
        )
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield content  # Pipe to TTS immediately
Model Selection
| Model | Speed | Quality | Use Case |
|---|---|---|---|
| llama-3.1-8b-instant | ~220ms | Good | Primary voice |
| llama-3.3-70b-versatile | ~500ms | Best | Complex queries |
| mixtral-8x7b-32768 | ~300ms | Good | Long context |
See reference/groq-voice-llm.md for context management.
</groq_llm>
<cartesia_tts>
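To keep time-to-first-audio low, tokens streamed from generate_stream above are typically cut at sentence boundaries and dispatched to TTS one sentence at a time. A minimal chunker (the punctuation heuristic is an assumption, not part of the original):

```python
import re

def sentence_chunks(tokens):
    """Regroup streamed LLM tokens into sentences for immediate TTS dispatch."""
    buf = ""
    for tok in tokens:
        buf += tok
        while True:
            # A sentence ends at . ! or ? followed by whitespace or end-of-buffer
            m = re.search(r"[.!?](?:\s+|$)", buf)
            if not m:
                break
            sentence = buf[: m.end()].strip()
            if sentence:
                yield sentence
            buf = buf[m.end():]
    tail = buf.strip()
    if tail:
        yield tail  # flush anything left without terminal punctuation
```

Each yielded sentence can be handed straight to `CartesiaTTS.synthesize_stream`, so the first sentence starts playing while the LLM is still generating the rest.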
Cartesia TTS (Sonic-2)
Streaming Pattern
python
from cartesia import AsyncCartesia
class CartesiaTTS:
    VOICES = {
        "en": "f9836c6e-a0bd-460e-9d3c-f7299fa60f94",  # Warm female
        "es": "5c5ad5e7-1020-476b-8b91-fdcbe9cc313c",  # Mexican Spanish
    }
    EMOTIONS = {
        "greeting": "excited",
        "confirmation": "grateful",
        "info": "calm",
        "complaint": "sympathetic",
        "apology": "apologetic",
    }
    def __init__(self, api_key: str):
        self.client = AsyncCartesia(api_key=api_key)
    async def synthesize_stream(
        self,
        text: str,
        language: str = "en",
        emotion: str = "neutral"
    ):
        voice_id = self.VOICES.get(language, self.VOICES["en"])
        response = self.client.tts.sse(
            model_id="sonic-2",
            transcript=text,
            voice={
                "id": voice_id,
                "experimental_controls": {
                    "speed": "normal",
                    "emotion": [emotion] if emotion != "neutral" else []
                }
            },
            language=language,
            output_format={
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 8000,  # Telephony
            },
        )
        async for chunk in response:
            if chunk.audio:
                yield chunk.audio
With Timestamps
python
response = client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, world!",
    voice={"id": voice_id},
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    add_timestamps=True,
)
for chunk in response:
    if chunk.word_timestamps:
        for word, start, end in zip(
            chunk.word_timestamps.words,
            chunk.word_timestamps.start,
            chunk.word_timestamps.end
        ):
            print(f"'{word}': {start:.2f}s - {end:.2f}s")
See reference/cartesia-tts.md for all 57 emotions.
</cartesia_tts>
<twilio_media_streams>
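Telephony audio is 8 kHz G.711 mu-law. The Twilio handler in this section leans on the stdlib audioop module, which was removed in Python 3.13; a dependency-free encoder can stand in (a sketch of the standard G.711 mu-law algorithm):

```python
import struct

def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit above bit 7
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored complemented
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Convert little-endian 16-bit PCM (e.g. Cartesia pcm_s16le) to mu-law bytes."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return bytes(linear_to_ulaw(s) for s in samples)
```

On Python 3.12 and earlier, `audioop.lin2ulaw(pcm, 2)` does the same conversion; on 3.13+ the `audioop-lts` package restores it.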
Twilio Media Streams
WebSocket Handler (FastAPI)
python
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import Response
from twilio.twiml.voice_response import VoiceResponse, Connect, Stream
import json, base64, audioop  # audioop: stdlib through 3.12; use audioop-lts on 3.13+
app = FastAPI()
@app.post("/voice/incoming")
async def incoming_call(request: Request):
    """Route to Media Streams WebSocket."""
    form = await request.form()
    caller = form.get("From", "")
    lang = "es" if caller.startswith("+52") else "en"
    response = VoiceResponse()
    connect = Connect()
    connect.append(Stream(url=f"wss://your-app.com/voice/stream?lang={lang}"))
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")
@app.websocket("/voice/stream")
async def media_stream(websocket: WebSocket, lang: str = "en"):
    await websocket.accept()
    stream_sid = None
    while True:
        message = await websocket.receive_text()
        data = json.loads(message)
        if data["event"] == "start":
            stream_sid = data["start"]["streamSid"]
            # Initialize STT, send greeting
        elif data["event"] == "media":
            audio = base64.b64decode(data["media"]["payload"])
            # Send to Deepgram STT
        elif data["event"] == "stop":
            break
async def send_audio(websocket, stream_sid: str, pcm_audio: bytes):
    """Convert PCM to mu-law and send to Twilio."""
    mulaw = audioop.lin2ulaw(pcm_audio, 2)
    await websocket.send_text(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw).decode()}
    }))
See reference/twilio-webhooks.md for complete handler.
</twilio_media_streams>
<bilingual_support>
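For Enterprise-tier barge-in, Deepgram's SpeechStarted event is typically paired with Twilio's clear message, which discards audio Twilio has already buffered for playback. A sketch (the task-cancellation wiring is illustrative, not from the original):

```python
import json

def make_clear_message(stream_sid: str) -> str:
    """Twilio Media Streams 'clear' event: flush queued outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})

async def handle_barge_in(websocket, stream_sid: str, tts_task=None):
    """Stop both synthesis and playback when the caller starts speaking."""
    if tts_task is not None and not tts_task.done():
        tts_task.cancel()  # stop generating more speech
    await websocket.send_text(make_clear_message(stream_sid))  # stop playback
```

Without the clear message, Twilio keeps playing whatever audio was already queued, so the agent appears to talk over the caller even after TTS is cancelled.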
Bilingual Support (EN/ES)
Auto-Detection
python
def detect_language(caller_number: str) -> str:
    if caller_number.startswith("+52"):
        return "es"  # Mexico
    elif caller_number.startswith("+1"):
        return "en"  # US/Canada
    return "es"  # Default Spanish
Voice Prompts
python
GREETINGS = {
    "en": "Hello! How can I help you today?",
    "es": "Hola! En que puedo ayudarle hoy?",  # Use "usted" for respect
}
</bilingual_support>
<voice_prompts>
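detect_language and GREETINGS above combine directly into a greeting picker (both are repeated here so the sketch is self-contained):

```python
GREETINGS = {
    "en": "Hello! How can I help you today?",
    "es": "Hola! En que puedo ayudarle hoy?",
}

def detect_language(caller_number: str) -> str:
    if caller_number.startswith("+52"):
        return "es"  # Mexico
    elif caller_number.startswith("+1"):
        return "en"  # US/Canada
    return "es"  # Default Spanish

def pick_greeting(caller_number: str) -> str:
    """Choose the opening line to synthesize for an incoming call."""
    return GREETINGS[detect_language(caller_number)]
```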
Voice Prompt Engineering
python
VOICE_PROMPT = """
Role
You are a bilingual voice assistant for {business_name}.
Tone
- 2-3 sentences max for phone clarity
- NEVER use bullet points, lists, or markdown
- Spell out emails: "john at company dot com"
- Phone numbers with pauses: "five one two... eight seven seven..."
- Spanish: Use "usted" for formal respect
Guardrails
- Never make up information
- Transfer to human after 3 failed attempts
- Match caller's language
Error Recovery
English: "I want to make sure I got that right. Did you say [repeat]?"
Spanish: "Quiero asegurarme de entender bien. Dijo [repetir]?"
"""
> See `reference/voice-prompts.md` for full template.
</voice_prompts>
<file_locations>
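The spell-out rules in the prompt above can also be enforced in code before text reaches TTS, rather than trusting the LLM alone. These helpers are hypothetical; the grouping convention follows the "five one two... eight seven seven..." example:

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_email(email: str) -> str:
    """'john@company.com' -> 'john at company dot com'."""
    return email.replace("@", " at ").replace(".", " dot ")

def speak_phone(number: str, group: int = 3) -> str:
    """Read digits aloud in groups, with pauses between groups."""
    words = [DIGIT_WORDS[d] for d in number if d.isdigit()]
    groups = [" ".join(words[i:i + group]) for i in range(0, len(words), group)]
    return "... ".join(groups)
```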
Reference Files
- reference/deepgram-setup.md - Full streaming STT setup
- reference/groq-voice-llm.md - Groq patterns for voice
- reference/cartesia-tts.md - All 57 emotions, voice cloning
- reference/twilio-webhooks.md - Complete Media Streams handler
- reference/latency-optimization.md - Sub-500ms techniques
- reference/voice-prompts.md - Voice-optimized prompts
</file_locations>
<routing>
Request Routing
User wants voice agent:
→ Provide full stack (Deepgram + Groq + Cartesia + Twilio)
→ Start with quick_start pipeline
User wants STT only:
→ Provide Deepgram streaming pattern
→ Reference: reference/deepgram-setup.md
User wants TTS only:
→ Provide Cartesia pattern with emotions
→ Reference: reference/cartesia-tts.md
User wants latency optimization:
→ Audit current stack, identify bottlenecks
→ Reference: reference/latency-optimization.md
User mentions OpenAI:
→ REDIRECT to Groq immediately
→ Explain: "NO OPENAI - Use Groq for lowest latency"
</routing>
<env_setup>
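The routing rules above can be sketched as a keyword dispatcher; the keywords and return values here are illustrative, not part of the original spec:

```python
ROUTES = [
    (("stt", "transcri", "speech to text"), "reference/deepgram-setup.md"),
    (("tts", "text to speech", "emotion"), "reference/cartesia-tts.md"),
    (("latency", "slow", "bottleneck"), "reference/latency-optimization.md"),
]

def route_request(text: str) -> str:
    """Map a user request to a reference file, redirecting any OpenAI mention to Groq."""
    t = text.lower()
    if "openai" in t:
        return "NO OPENAI - Use Groq for lowest latency"
    for keywords, ref in ROUTES:
        if any(k in t for k in keywords):
            return ref
    return "full stack: start with the quick_start pipeline"
```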
Environment Variables
bash
# Required (NEVER OpenAI)
DEEPGRAM_API_KEY=your_key
GROQ_API_KEY=gsk_xxxx
CARTESIA_API_KEY=your_key
# Twilio
TWILIO_ACCOUNT_SID=ACxxxx
TWILIO_AUTH_TOKEN=your_token
TWILIO_PHONE_NUMBER=+15551234567
# Fallbacks
ANTHROPIC_API_KEY=sk-ant-xxxx
CEREBRAS_API_KEY=csk_xxxx
bash
pip install deepgram-sdk groq cartesia twilio fastapi
</env_setup>
<quick_reference>
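The required variables above can be validated at process start so a misconfigured (or OpenAI-contaminated) environment fails fast; a sketch with assumed function and list names:

```python
import os

REQUIRED_VARS = ["DEEPGRAM_API_KEY", "GROQ_API_KEY", "CARTESIA_API_KEY"]
FORBIDDEN_VARS = ["OPENAI_API_KEY"]  # NO OPENAI

def check_env(env=None):
    """Return (missing_required, present_forbidden) for the given environment."""
    env = os.environ if env is None else env
    missing = [k for k in REQUIRED_VARS if not env.get(k)]
    forbidden = [k for k in FORBIDDEN_VARS if env.get(k)]
    return missing, forbidden
```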
Quick Reference Card
STACK:
STT: Deepgram Nova-3 (~150ms)
LLM: Groq llama-3.1-8b-instant (~220ms) - NOT OPENAI
TTS: Cartesia Sonic-2 (~90ms)
LATENCY TARGETS:
Pro: 600ms (Media Streams)
Enterprise: 400ms (Full streaming + barge-in)
BILINGUAL:
+52 -> Spanish (es)
+1 -> English (en)
Default -> Spanish
EMOTIONS (Cartesia):
greeting -> excited
confirmation -> grateful
complaint -> sympathetic
</quick_reference>