api

Deepgram API

Build with Deepgram's speech-to-text, text-to-speech, voice agent, and audio intelligence APIs.

Getting Started

All API requests require authentication via API key or JWT:
  • API Key:
    Authorization: Token <API_KEY>
  • JWT:
    Authorization: Bearer <JWT>
Base servers:
  • REST & STT/TTS WebSocket:
    https://api.deepgram.com
  • Voice Agent WebSocket:
    https://agent.deepgram.com
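
The two header forms above can be sketched as a small helper. This is a minimal illustration, not part of any Deepgram SDK: the function name `auth_header` and the `kind` values are hypothetical, and only the `Token` and `Bearer` schemes come from the text.

```python
def auth_header(credential: str, kind: str = "api_key") -> dict[str, str]:
    """Return the Authorization header for an API key or a JWT.

    `kind` is a hypothetical switch: "api_key" maps to the Token scheme,
    "jwt" maps to the Bearer scheme, as listed above.
    """
    if not credential:
        raise ValueError("credential must be a non-empty string")
    schemes = {"api_key": "Token", "jwt": "Bearer"}
    if kind not in schemes:
        raise ValueError(f"unknown credential kind: {kind!r}")
    return {"Authorization": f"{schemes[kind]} {credential}"}


if __name__ == "__main__":
    print(auth_header("YOUR_API_KEY"))
    print(auth_header("YOUR_JWT", kind="jwt"))
```

The returned dictionary can be passed as the headers of any HTTP or WebSocket client.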

How Deepgram's APIs Fit Together

                   ┌──────────────────────────────┐
                   │       api.deepgram.com        │
                   └──────────────────────────────┘
  ┌──────────────┬──────────────┼──────────────┬──────────────┐
  ▼              ▼              ▼              ▼              ▼
/v1/listen   /v2/listen     /v1/speak      /v1/read    /v1/projects/*
 Nova — ASR   Flux — conv.   TTS            Text AI     Management
REST or WSS   WSS only       REST or WSS    REST only   REST only

                   ┌──────────────────────────────┐
                   │      agent.deepgram.com       │
                   └──────────────────────────────┘
                   /v1/agent/converse
                   WebSocket only
                   audio ──▶ STT ──▶ LLM ──▶ TTS ──▶ audio
                   (Deepgram orchestrates the full pipeline)

Which API Should I Use?

Audio → text (transcription)?
├─ General-purpose transcription (captions, batch, call logs, live streams with custom turn logic)
│  └─ Nova models via /v1/listen
│     ├─ Pre-recorded file    →  REST  POST https://api.deepgram.com/v1/listen?model=nova-3
│     └─ Live stream          →  WSS   wss://api.deepgram.com/v1/listen?model=nova-3
└─ Conversational audio / voice-agent-style turn detection
   └─ Flux models via /v2/listen
      └─ Live stream          →  WSS   wss://api.deepgram.com/v2/listen?model=flux-general-en

Text → audio?
├─ One-shot                   →  REST POST /v1/speak
└─ Low-latency stream         →  WSS  wss://api.deepgram.com/v1/speak

Full conversational voice agent (audio in, audio out)?
└─ WSS wss://agent.deepgram.com/v1/agent/converse
   Deepgram handles STT + your configured LLM + TTS internally

Analyze text for insights?
└─ REST POST /v1/read
   (summaries, sentiment, topics, intents)
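
The decision tree above can be expressed as a lookup function. This is only a sketch: the function name and the `task` labels are made up for illustration, while the transports and URLs are the ones listed in the tree.

```python
def pick_endpoint(task: str, live: bool = False) -> tuple[str, str]:
    """Map a task to (transport, URL), following the decision tree above.

    task: hypothetical labels "transcribe", "conversational_stt",
          "speak", "agent", or "analyze".
    live: True for a streaming session, False for a one-shot request.
    """
    if task == "transcribe":  # Nova models
        if live:
            return ("WSS", "wss://api.deepgram.com/v1/listen?model=nova-3")
        return ("REST POST", "https://api.deepgram.com/v1/listen?model=nova-3")
    if task == "conversational_stt":  # Flux models, WebSocket only
        return ("WSS", "wss://api.deepgram.com/v2/listen?model=flux-general-en")
    if task == "speak":
        if live:
            return ("WSS", "wss://api.deepgram.com/v1/speak")
        return ("REST POST", "https://api.deepgram.com/v1/speak")
    if task == "agent":  # full pipeline, WebSocket only
        return ("WSS", "wss://agent.deepgram.com/v1/agent/converse")
    if task == "analyze":  # text intelligence, REST only
        return ("REST POST", "https://api.deepgram.com/v1/read")
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    print(pick_endpoint("transcribe", live=True))
```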

Speech-to-Text: Nova (/v1/listen) vs Flux (/v2/listen)

Both model families are actively maintained and industry-leading. They solve different problems — pick the one that matches your use case.

| | Nova (/v1/listen) | Flux (/v2/listen) |
|---|---|---|
| Endpoint | /v1/listen | /v2/listen |
| Available models | nova-3, nova-2, nova, enhanced, base | flux-general-en |
| Best for | General transcription: captions, subtitles, call logs, batch | Conversational audio: voice agents, interactive assistants, turn-taking UIs |
| Output | Continuous transcript stream | Structured turn events + transcripts (built-in turn state machine) |
| Turn detection | Manual (utterance_end_ms, VAD events) | Built-in (EOT, eager-EOT, turn_index) |
| Transports | REST + WebSocket | WebSocket only |
| Intelligence overlays | Yes: summarize, sentiment, topics, intents, diarize, redact, etc. | No: smaller, focused param set; no smart_format / diarize / punctuate |
| Mid-session reconfig | No (reconnect to change) | Yes (Configure message updates EOT thresholds + keyterms live) |

Pick Nova (/v1/listen, model=nova-3) when:
  • Generating captions, subtitles, or transcripts for recorded media
  • Running batch transcription over files (REST)
  • You need analytics overlays (summarize, sentiment, topics, intents, diarize, redact)
  • You want WebSocket streaming with your own turn-detection logic

Pick Flux (/v2/listen, model=flux-general-en) when:
  • Building an interactive voice agent or assistant
  • You want end-of-turn detection handled for you
  • You need low-latency turn signals and barge-in support
  • You want to update EOT thresholds or keyterms mid-session without reconnecting

Migrating from Nova 3 to Flux? See the official Nova 3 → Flux migration guide.

API Domains

| Domain | REST | WebSocket | Reference |
|---|---|---|---|
| Listen v1: STT, Nova models | POST /v1/listen | wss://api.deepgram.com/v1/listen | listen.md |
| Listen v2: STT, Flux (conversational) | - | wss://api.deepgram.com/v2/listen | listen.md |
| Speak (TTS) | POST /v1/speak | wss://api.deepgram.com/v1/speak | speak.md |
| Voice Agent | GET /v1/agent/settings/think/models | wss://agent.deepgram.com/v1/agent/converse | agent.md |
| Read (Intelligence) | POST /v1/read | - | read.md |
| Models | GET /v1/models | - | models.md |
| Projects | /v1/projects/* | - | projects.md |
| Auth | POST /v1/auth/grant | - | auth.md |
| Self-Hosted | /v1/projects/*/selfhosted/* | - | self-hosted.md |

Common Mistakes to Avoid

All APIs

  1. Feature flags are query params, except for Voice Agent and Flux mid-session updates. For /v1/listen, /v2/listen, and /v1/speak, initial options go on the URL. The body carries only the payload itself: audio data (REST) or audio frames (WebSocket) for the listen endpoints, and the text to synthesize for /v1/speak. Two exceptions: /v1/agent/converse has no URL query params at all (all config goes in the Settings message); and /v2/listen supports a Configure message after connection to update EOT thresholds and keyterms mid-session. Also note that /v2/listen has a much smaller param set than /v1/listen: flags like smart_format, diarize, and punctuate are not available.
  2. Rate limits are concurrent connections, not total requests. A 429 means too many simultaneous open connections, not too high a request volume. Diarization and other compute-heavy features reduce your concurrency allowance further.
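
To illustrate the first point, here is a minimal sketch that puts options on the URL and rejects the flags the text says /v2/listen does not accept. The helper name `build_url` is hypothetical; the paths and flag names are taken from the text, and any other option you pass is forwarded unchecked.

```python
from urllib.parse import urlencode

API_BASE = "https://api.deepgram.com"

# Flags the text above lists as unavailable on /v2/listen.
FLUX_UNSUPPORTED = {"smart_format", "diarize", "punctuate"}


def build_url(path: str, **options: str) -> str:
    """Build a request URL with feature flags as query parameters."""
    if path == "/v2/listen":
        rejected = FLUX_UNSUPPORTED & options.keys()
        if rejected:
            raise ValueError(f"not available on /v2/listen: {sorted(rejected)}")
    query = urlencode(options)  # keeps keyword order, escapes values
    return f"{API_BASE}{path}?{query}" if query else f"{API_BASE}{path}"


if __name__ == "__main__":
    print(build_url("/v1/listen", model="nova-3", smart_format="true"))
```

For WebSocket connections the same query string applies; only the scheme changes to wss.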

STT WebSocket (/v1/listen)

  1. Send KeepAlive as a text frame, not binary. The connection closes after 10 seconds of no audio. Send {"type":"KeepAlive"} as a text (JSON) frame every 3–5 seconds during silence. Sending it as a binary frame is not a harmless no-op: it stalls the audio pipeline and delays transcription.
  2. Never send empty byte payloads. A zero-length binary frame sent to /v1/listen is treated as a close and terminates the connection. Always check that your audio packet is non-empty before sending.
  3. encoding must match the actual audio format. If encoding=linear16 but you're sending opus, you'll get a DATA-0000 error or garbled output. Omit encoding entirely when sending containerized formats (mp3, wav, ogg); Deepgram detects them automatically.
  4. Timestamps reset on reconnect. Each new WebSocket connection restarts timestamps at 00:00:00. For real-time apps, maintain a timestamp offset across reconnections or you'll silently corrupt your transcript timeline.
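
The points above can be sketched without any network code. The names below (`KEEPALIVE_TEXT_FRAME`, `is_sendable`, `TimelineOffset`) are hypothetical, and the offset tracker assumes raw linear16 audio with a known sample rate, where every byte sent on a connection was processed before it closed. Treat it as one possible approach rather than a prescribed one.

```python
import json

# A str (not bytes), so a WebSocket client sends it as a text frame.
KEEPALIVE_TEXT_FRAME = json.dumps({"type": "KeepAlive"})


def is_sendable(chunk: bytes) -> bool:
    """Guard against zero-length binary frames, which close the connection."""
    return len(chunk) > 0


class TimelineOffset:
    """Track how much audio earlier connections consumed.

    Assumption: raw linear16 (2 bytes per sample), so duration can be
    derived from the byte count.
    """

    def __init__(self, sample_rate: int = 16000, channels: int = 1) -> None:
        self.bytes_per_second = sample_rate * channels * 2
        self.offset_seconds = 0.0
        self._bytes_this_connection = 0

    def record_sent(self, chunk: bytes) -> None:
        self._bytes_this_connection += len(chunk)

    def on_reconnect(self) -> None:
        # Fold the finished connection's audio duration into the offset.
        self.offset_seconds += self._bytes_this_connection / self.bytes_per_second
        self._bytes_this_connection = 0

    def absolute(self, relative_seconds: float) -> float:
        """Convert a per-connection timestamp to a session-wide one."""
        return self.offset_seconds + relative_seconds
```

Call `record_sent` for every audio frame, `on_reconnect` just before opening a replacement socket, and pass each timestamp from the new connection through `absolute`.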

TTS WebSocket (/v1/speak)

  1. Don't send empty text. A Speak message with an empty text field returns a 400 error. Always validate input before sending.
  2. Character rate limiting (DATA-0001) means slow down, not retry. If you hit this, reduce how fast you're submitting text chunks; retrying immediately compounds the problem.
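
A minimal sketch of both rules follows. It assumes the Speak message is JSON with "type" and "text" fields (the text above names the message and the field but does not show the full shape), treats whitespace-only text as empty as a conservative choice, and uses made-up pacing numbers. The function names are hypothetical.

```python
import json


def speak_message(text: str) -> str:
    """Serialize a Speak message, refusing empty input before it is sent."""
    if not text or not text.strip():
        raise ValueError("refusing to send a Speak message with empty text")
    # Assumed message shape: {"type": "Speak", "text": ...}
    return json.dumps({"type": "Speak", "text": text})


def next_pause(current_pause: float, rate_limited: bool) -> float:
    """Return the pause (seconds) to insert between text chunks.

    After a DATA-0001, lengthen the pause instead of retrying at once.
    The 0.1 s floor, doubling, and 5 s cap are illustrative values only.
    """
    if rate_limited:
        return min(max(current_pause, 0.1) * 2, 5.0)
    return current_pause
```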

Voice Agent (/v1/agent/converse)

  1. Send the Settings message before any audio. The agent ignores everything until it receives and acknowledges the Settings configuration. Message ordering is strictly required.
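
One way to enforce this ordering on the client is to hold audio back until the acknowledgement arrives. The sketch below is transport-agnostic: `send` is any callable that writes a frame. The class name, the acknowledgement type `SettingsApplied`, and the minimal `{"type": "Settings"}` payload are assumptions for illustration; check agent.md for the real message shapes. The buffer is unbounded, which a production client would want to cap.

```python
import json
from typing import Callable, Union

Frame = Union[str, bytes]

# Assumed name of the server's acknowledgement message.
ACK_TYPE = "SettingsApplied"


class AgentSession:
    """Queue audio until the Settings message has been acknowledged."""

    def __init__(self, send: Callable[[Frame], None], settings: dict) -> None:
        self._send = send
        self._settings = settings
        self._ready = False
        self._pending: list[bytes] = []

    def start(self) -> None:
        # Settings goes out first, as a text frame.
        self._send(json.dumps(self._settings))

    def on_server_message(self, raw: str) -> None:
        if json.loads(raw).get("type") == ACK_TYPE and not self._ready:
            self._ready = True
            for chunk in self._pending:  # flush audio held back so far
                self._send(chunk)
            self._pending.clear()

    def send_audio(self, chunk: bytes) -> None:
        if not chunk:
            return  # skip empty payloads
        if self._ready:
            self._send(chunk)
        else:
            self._pending.append(chunk)
```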

Flux model

  1. Use /v2/listen and model=flux-general-en. /v1/listen does not support Flux. model=flux alone is not a valid value. Do not include language or encoding params for containerized audio.
  2. Use Configure to update EOT thresholds and keyterms mid-session. Unlike /v1/listen, Flux supports live reconfiguration after connection, so there is no need to reconnect to change turn detection sensitivity or boost new keyterms:

        { "type": "Configure", "thresholds": { "eot_threshold": "0.8", "eot_timeout_ms": "3000" }, "keyterms": ["Deepgram"] }

    The server responds with ConfigureSuccess (echoing back applied values) or ConfigureFailure. Omitted threshold fields keep their current values.
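
The Configure exchange can be sketched as two small helpers. The function names are hypothetical. Threshold values are serialized as strings to mirror the example message above, and the responses are assumed to carry their name in a "type" field like the request does.

```python
import json
from typing import Optional, Sequence


def configure_message(
    eot_threshold: Optional[float] = None,
    eot_timeout_ms: Optional[int] = None,
    keyterms: Optional[Sequence[str]] = None,
) -> str:
    """Build a Configure message, leaving out fields that should not change."""
    thresholds: dict[str, str] = {}
    if eot_threshold is not None:
        thresholds["eot_threshold"] = str(eot_threshold)
    if eot_timeout_ms is not None:
        thresholds["eot_timeout_ms"] = str(eot_timeout_ms)
    message: dict[str, object] = {"type": "Configure"}
    if thresholds:
        message["thresholds"] = thresholds
    if keyterms is not None:
        message["keyterms"] = list(keyterms)
    return json.dumps(message)


def configure_applied(raw: str) -> bool:
    """Return True for ConfigureSuccess, False for ConfigureFailure."""
    kind = json.loads(raw).get("type")  # assumed response field
    if kind == "ConfigureSuccess":
        return True
    if kind == "ConfigureFailure":
        return False
    raise ValueError(f"not a Configure response: {kind!r}")
```

Because omitted threshold fields keep their current values, passing only `eot_threshold` leaves the timeout untouched.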

Authentication

  1. JWT TTL applies only to the initial handshake. Tokens default to a 30-second TTL. Once the WebSocket connection is established, token expiry does not close it; the token is only needed for the upgrade request.
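
In practice this means the only freshness check that matters happens just before opening a connection. A minimal sketch, assuming you record when each token was issued; the function name and the 5-second safety margin are hypothetical, and only the 30-second default comes from the text.

```python
import time
from typing import Optional

DEFAULT_TTL_SECONDS = 30.0  # default token lifetime stated above


def usable_for_handshake(
    issued_at: float,
    ttl: float = DEFAULT_TTL_SECONDS,
    now: Optional[float] = None,
    margin: float = 5.0,
) -> bool:
    """Return True if the token should still be valid during the upgrade.

    `margin` (hypothetical) leaves room for connection setup latency.
    """
    current = time.time() if now is None else now
    return current < issued_at + ttl - margin
```

If the check fails, request a new token before connecting. An already-open WebSocket does not need to be torn down when its token expires.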

SDK-Specific Skills

This api skill covers the product contracts (endpoints, query params, message shapes) that are identical across SDKs. For language-idiomatic code (imports, async patterns, builder APIs, common errors), install the SDK-specific skills. Each Deepgram SDK publishes 7 product skills named deepgram-{lang}-{product} (e.g. deepgram-python-speech-to-text, deepgram-js-voice-agent) plus a maintainer skill deepgram-{lang}-maintaining-sdk. The deepgram-{lang}- prefix avoids collisions when you install skills from multiple SDKs.

Install all skills from a specific SDK

    npx skills add deepgram/deepgram-python-sdk    # Python
    npx skills add deepgram/deepgram-js-sdk        # JavaScript / TypeScript
    npx skills add deepgram/deepgram-java-sdk      # Java
    npx skills add deepgram/deepgram-go-sdk        # Go
    npx skills add deepgram/deepgram-rust-sdk      # Rust
    npx skills add deepgram/deepgram-swift-sdk     # Swift
    npx skills add deepgram/deepgram-kotlin-sdk    # Kotlin
    npx skills add deepgram/deepgram-dotnet-sdk    # C# / .NET
    npx skills add deepgram/deepgram-browser-sdk   # Browser TypeScript

Or install a specific product skill from one SDK (note the deepgram-{lang}- prefix)

    npx skills add deepgram/deepgram-python-sdk --skill deepgram-python-speech-to-text
    npx skills add deepgram/deepgram-js-sdk --skill deepgram-js-voice-agent

Related Deepgram skills

| Skill | Purpose |
|---|---|
| recipes | Minimal runnable snippets per feature per language |
| examples | Full integration examples with third-party platforms (Twilio, LiveKit, etc.) |
| starters | Runnable starter apps (framework × feature matrix) |
| docs | Navigate Deepgram documentation |
| setup-mcp | Install the Deepgram MCP server |

Documentation
