streaming-stt-deepgram
Deepgram Streaming STT
Use this skill when the user needs real-time speech-to-text transcription with the lowest possible latency. Deepgram's WebSocket API provides sub-300 ms interim transcripts using the Nova-2 model.
Prefer this provider over file-based Whisper when the agent needs live voice input during a conversation, or when speaker identification (diarization) is required without a separate processing step.
Setup
Set `DEEPGRAM_API_KEY` in the environment or agent secrets store before starting a voice session.
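
A minimal sketch of failing fast when the key is missing, so a voice session does not start and then die on authentication. The helper name `require_deepgram_key` is hypothetical and not part of this skill; only the variable name `DEEPGRAM_API_KEY` comes from the text above.

```python
import os
from typing import Mapping


def require_deepgram_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Deepgram API key from `env`, or raise if it is absent or blank.

    Hypothetical helper for illustration; the skill may load secrets differently.
    """
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DEEPGRAM_API_KEY is not set; export it or add it to the agent secrets store"
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
demo_key = require_deepgram_key({"DEEPGRAM_API_KEY": " example-key "})
```

Passing the mapping explicitly keeps the check testable without touching the real environment.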
Configuration
```json
{
  "voice": {
    "stt": "deepgram"
  }
}
```
To enable diarization and keyword boosting:
```json
{
  "voice": {
    "stt": "deepgram",
    "providerOptions": {
      "model": "nova-2",
      "diarize": true,
      "keywords": ["AgentOS:2"],
      "endpointing": 300
    }
  }
}
```
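
For orientation, the sketch below shows how options like these could end up as query parameters on Deepgram's streaming endpoint (`wss://api.deepgram.com/v1/listen`). The one-to-one mapping from `providerOptions` keys to query parameters is an assumption about how this provider behaves, and `build_listen_url` is a hypothetical helper, not part of the skill.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(provider_options: dict) -> str:
    """Serialize provider options into a Deepgram streaming URL.

    Assumption: each option key maps directly to a query parameter of the same name.
    """
    pairs = []
    for key, value in provider_options.items():
        if isinstance(value, bool):
            # Query strings carry booleans as lowercase text.
            pairs.append((key, "true" if value else "false"))
        elif isinstance(value, (list, tuple)):
            # List options such as keywords are sent as repeated parameters.
            pairs.extend((key, str(item)) for item in value)
        else:
            pairs.append((key, str(value)))
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(pairs)}"


options = {
    "model": "nova-2",
    "diarize": True,
    "keywords": ["AgentOS:2"],
    "endpointing": 300,
}
url = build_listen_url(options)
print(url)
```

The `:2` suffix in `AgentOS:2` is the boost weight for that keyword; `urlencode` percent-encodes the colon as `%3A`.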
Provider Rules
- Use `nova-2` as the default model; it has the highest accuracy on Deepgram's current tier.
- Enable `diarize: true` when the conversation involves multiple speakers; word-level `speaker` labels are included in the transcript events.
- Tune `endpointing` (ms of silence before finalization) to balance responsiveness vs. over-splitting. Default 300 ms is suitable for most conversations.
- The provider auto-reconnects on WebSocket drops using exponential back-off (100 ms → 5 s cap).
- Use `providerOptions.keywords` to boost domain-specific terms (e.g. product names, abbreviations).
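
To make the reconnect rule concrete, here is a rough sketch of a capped exponential back-off schedule. The 100 ms start and 5 s cap come from the rule above; the doubling factor and the number of attempts are assumptions, and `backoff_delays_ms` is a hypothetical name rather than the provider's actual code.

```python
def backoff_delays_ms(initial_ms: int = 100, cap_ms: int = 5000,
                      factor: int = 2, attempts: int = 8) -> list[int]:
    """Return the wait before each reconnect attempt, in milliseconds.

    Assumption: the delay doubles after every failed attempt until it hits the cap.
    """
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(min(delay, cap_ms))
        delay = min(delay * factor, cap_ms)
    return delays


schedule = backoff_delays_ms()
print(schedule)  # [100, 200, 400, 800, 1600, 3200, 5000, 5000]
```

Under these assumptions the cap is reached on the seventh attempt, after which every retry waits the full 5 s.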
Events
| Event | Description |
|---|---|
| | Every hypothesis (interim + final) |
| | Non-final hypothesis |
| | Stable, final hypothesis |
| | First non-empty word in an utterance |
| | Deepgram |
| | Unrecoverable provider error |
| | Session fully terminated |
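
When diarization is on, transcript events carry word-level `speaker` labels. The sketch below groups consecutive words into speaker turns. It assumes each word is a dict with `word` and `speaker` fields, modeled on Deepgram's raw results messages; the payload shape this skill actually emits may differ, and `speaker_turns` is a hypothetical helper.

```python
from itertools import groupby


def speaker_turns(words: list[dict]) -> list[tuple[int, str]]:
    """Collapse consecutive words from the same speaker into (speaker, text) turns.

    Assumption: every word dict has a "word" key and, with diarization on, a "speaker" key.
    """
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.get("speaker", 0)):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


sample_words = [
    {"word": "hello", "speaker": 0},
    {"word": "there", "speaker": 0},
    {"word": "hi", "speaker": 1},
    {"word": "again", "speaker": 0},
]
turns = speaker_turns(sample_words)
print(turns)  # [(0, 'hello there'), (1, 'hi'), (0, 'again')]
```

Grouping only consecutive words keeps the turn order intact, so a speaker who talks twice produces two separate turns.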
Examples
- "Start a live voice session using Deepgram for transcription."
- "Enable speaker diarization for this multi-person meeting transcription."
- "Use Deepgram with keyword boosting for AgentOS and Wunderland terms."
Constraints
- Requires `DEEPGRAM_API_KEY`. Free tier available at console.deepgram.com.
- Audio must be streamed as PCM/WebSocket-compatible frames.
- Diarization adds slight latency; disable if single-speaker performance is the priority.
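
As an illustration of the PCM framing constraint, this sketch splits a raw PCM buffer into fixed-duration frames that could each be sent as one binary WebSocket message. The 16 kHz, 16-bit mono format and the 20 ms frame size are assumptions chosen for the example, not requirements stated by this skill.

```python
from typing import Iterator


def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20,
               bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive frames of `frame_ms` milliseconds from mono PCM audio.

    Assumption: 16-bit mono samples; the last frame may be shorter than the rest.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


# One second of silence at 16 kHz, 16-bit mono is 32000 bytes.
frames = list(pcm_frames(bytes(32000)))
print(len(frames), len(frames[0]))  # 50 640
```

Smaller frames lower the delay before audio reaches the recognizer at the cost of more messages per second.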