streaming-stt-deepgram

Deepgram Streaming STT

Use this skill when the user needs real-time speech-to-text transcription with the lowest possible latency. Deepgram's WebSocket API provides sub-300 ms interim transcripts using the Nova-2 model.
Prefer this provider over file-based Whisper when the agent needs live voice input during a conversation, or when speaker identification (diarization) is required without a separate processing step.

Setup

Set `DEEPGRAM_API_KEY` in the environment or agent secrets store before starting a voice session.
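
A minimal sketch of a startup check for the key, assuming it is read from the process environment. The helper name `load_deepgram_api_key` is hypothetical and not part of the skill; an agent secrets store would need its own lookup.

```python
import os
from typing import Mapping


def load_deepgram_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Deepgram API key, failing early if it is missing.

    Hypothetical helper for illustration only.
    """
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DEEPGRAM_API_KEY is not set; export it before starting a voice session."
        )
    return key
```

Failing before the session opens avoids a WebSocket handshake that would be rejected for lack of credentials.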
在启动语音会话前,在环境变量或Agent密钥存储中设置
DEEPGRAM_API_KEY

Configuration

```json
{
  "voice": {
    "stt": "deepgram"
  }
}
```

To enable diarization and keyword boosting:

```json
{
  "voice": {
    "stt": "deepgram",
    "providerOptions": {
      "model": "nova-2",
      "diarize": true,
      "keywords": ["AgentOS:2"],
      "endpointing": 300
    }
  }
}
```
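
As an illustration of how these options might reach Deepgram, the sketch below turns a `providerOptions` object into query parameters on Deepgram's live-transcription endpoint (`wss://api.deepgram.com/v1/listen`). That the provider forwards the options one-to-one is an assumption, and `build_listen_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(provider_options: dict) -> str:
    """Encode provider options as query parameters (assumed 1:1 mapping)."""
    params = []
    for key, value in provider_options.items():
        # List values (e.g. keywords) become repeated query parameters.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"  # lowercase booleans in the URL
            params.append((key, str(item)))
    query = urlencode(params)
    return f"{DEEPGRAM_LISTEN_URL}?{query}" if query else DEEPGRAM_LISTEN_URL


url = build_listen_url(
    {"model": "nova-2", "diarize": True, "keywords": ["AgentOS:2"], "endpointing": 300}
)
# url == "wss://api.deepgram.com/v1/listen?model=nova-2&diarize=true&keywords=AgentOS%3A2&endpointing=300"
```

In `"AgentOS:2"` the part after the colon is the boost intensity applied to the keyword.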

Provider Rules

  • Use `nova-2` as the default model; it has the highest accuracy on Deepgram's current tier.
  • Enable `diarize: true` when the conversation involves multiple speakers; word-level `speaker` labels are included in the transcript events.
  • Tune `endpointing` (ms of silence before finalization) to balance responsiveness vs. over-splitting. The default of 300 ms is suitable for most conversations.
  • The provider auto-reconnects on WebSocket drops using exponential back-off (100 ms → 5 s cap).
  • Use `providerOptions.keywords` to boost domain-specific terms (e.g. product names, abbreviations).
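
The reconnect schedule in the rules above can be sketched as a delay generator. The doubling factor is an assumption (the source gives only the 100 ms start and the 5 s cap), and `backoff_delays_ms` is a hypothetical name, not the provider's actual implementation.

```python
from itertools import islice
from typing import Iterator


def backoff_delays_ms(
    initial_ms: int = 100, cap_ms: int = 5000, factor: int = 2
) -> Iterator[int]:
    """Yield reconnect delays in milliseconds, growing by `factor` up to `cap_ms`."""
    delay = initial_ms
    while True:
        yield min(delay, cap_ms)
        delay = min(delay * factor, cap_ms)


first_eight = list(islice(backoff_delays_ms(), 8))
# first_eight == [100, 200, 400, 800, 1600, 3200, 5000, 5000]
```

A production reconnect loop would typically add random jitter to each delay so that many clients do not retry in lockstep.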

Events

| Event | Description |
| --- | --- |
| `transcript` | Every hypothesis (interim + final) |
| `interim_transcript` | Non-final hypothesis |
| `final_transcript` | Stable, final hypothesis |
| `speech_start` | First non-empty word in an utterance |
| `speech_end` | Deepgram `speech_final` flag raised |
| `error` | Unrecoverable provider error |
| `close` | Session fully terminated |
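
To make the relationship between the transcript events concrete, the sketch below classifies a single Deepgram `Results` message into the event names listed above. The fields `is_final`, `speech_final`, and `channel.alternatives[0].transcript` come from Deepgram's streaming responses; the mapping itself is an assumption about how the provider might derive its events, and `events_for_message` is a hypothetical function. `speech_start`, `error`, and `close` depend on session state and are left out.

```python
from typing import List


def events_for_message(message: dict) -> List[str]:
    """Return the event names one Deepgram message would plausibly trigger."""
    if message.get("type") != "Results":
        return []
    alternatives = message.get("channel", {}).get("alternatives", [])
    text = alternatives[0].get("transcript", "") if alternatives else ""
    if not text:
        return []  # assumed: empty hypotheses are not emitted
    events = ["transcript"]  # fires for every hypothesis
    events.append("final_transcript" if message.get("is_final") else "interim_transcript")
    if message.get("speech_final"):
        events.append("speech_end")  # endpointing detected the end of the utterance
    return events


interim = {
    "type": "Results",
    "is_final": False,
    "speech_final": False,
    "channel": {"alternatives": [{"transcript": "hello"}]},
}
# events_for_message(interim) == ["transcript", "interim_transcript"]
```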

Examples

  • "Start a live voice session using Deepgram for transcription."
  • "Enable speaker diarization for this multi-person meeting transcription."
  • "Use Deepgram with keyword boosting for AgentOS and Wunderland terms."

Constraints

  • Requires `DEEPGRAM_API_KEY`. Free tier available at console.deepgram.com.
  • Audio must be streamed as PCM/WebSocket-compatible frames.
  • Diarization adds slight latency; disable if single-speaker performance is the priority.
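
As a rough illustration of the PCM framing constraint, the sketch below slices a raw PCM buffer into fixed-duration frames that could each be sent as one binary WebSocket message. The 16 kHz, 16-bit mono format and the 20 ms frame size are assumptions, not requirements stated by the skill, and `pcm_frames` is a hypothetical helper.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,
    frame_ms: int = 20,
    bytes_per_sample: int = 2,
) -> Iterator[bytes]:
    """Split raw mono PCM into frames of `frame_ms` milliseconds each."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), frame_bytes):
        # The last frame may be shorter than frame_bytes.
        yield pcm[start:start + frame_bytes]


frame_sizes = [len(frame) for frame in pcm_frames(bytes(1600))]
# frame_sizes == [640, 640, 320]
```

Smaller frames lower the delay before audio reaches the recognizer, at the cost of more WebSocket messages per second.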