Podcast Generation with GPT Realtime Mini


Generate real audio narratives from text content using Azure OpenAI's Realtime API.

Quick Start


  1. Configure environment variables for Realtime API
  2. Connect via WebSocket to Azure OpenAI Realtime endpoint
  3. Send text prompt, collect PCM audio chunks + transcript
  4. Convert PCM to WAV format
  5. Return base64-encoded audio to frontend for playback
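Step 5 can be sketched as a small helper that base64-encodes the WAV bytes for a JSON response. This is a sketch, not the skill's actual implementation; the `audio_data` field name matches the frontend snippet later in this document, and the exact response framing is an assumption:

```python
import base64

def build_audio_response(wav_audio: bytes, transcript: str) -> dict:
    """Package WAV bytes and transcript for a JSON API response."""
    return {
        # base64-encode so the binary audio can travel inside JSON
        "audio_data": base64.b64encode(wav_audio).decode("ascii"),
        "transcript": transcript,
    }
```

The frontend then decodes `audio_data` back into a playable blob, as shown in the Frontend Audio Playback section.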

Environment Configuration


```env
AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.cognitiveservices.azure.com
AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini
```

Note: the endpoint should NOT include `/openai/v1/`; use just the base URL.
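A minimal sketch of reading these variables in Python (variable names as above; the trailing-slash strip is a defensive assumption so the WebSocket URL built later stays clean):

```python
import os

def load_audio_config(env=os.environ) -> dict:
    """Read Realtime API settings; strip any trailing slash from the endpoint."""
    return {
        "api_key": env["AZURE_OPENAI_AUDIO_API_KEY"],
        "endpoint": env["AZURE_OPENAI_AUDIO_ENDPOINT"].rstrip("/"),
        "deployment": env.get("AZURE_OPENAI_AUDIO_DEPLOYMENT", "gpt-realtime-mini"),
    }
```

A missing key or endpoint raises `KeyError` immediately, which is usually preferable to failing later at connection time.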

Core Workflow


Backend Audio Generation


```python
from openai import AsyncOpenAI
import base64

# Convert the HTTPS endpoint to a WebSocket URL
ws_url = endpoint.replace("https://", "wss://") + "/openai/v1"
client = AsyncOpenAI(websocket_base_url=ws_url, api_key=api_key)

audio_chunks = []
transcript_parts = []

async with client.realtime.connect(model="gpt-realtime-mini") as conn:
    # Configure for audio-only output
    await conn.session.update(session={
        "output_modalities": ["audio"],
        "instructions": "You are a narrator. Speak naturally."
    })

    # Send text to narrate
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": prompt}]
    })

    await conn.response.create()

    # Collect streaming events
    async for event in conn:
        if event.type == "response.output_audio.delta":
            audio_chunks.append(base64.b64decode(event.delta))
        elif event.type == "response.output_audio_transcript.delta":
            transcript_parts.append(event.delta)
        elif event.type == "response.done":
            break

# Convert PCM to WAV (see scripts/pcm_to_wav.py)
pcm_audio = b''.join(audio_chunks)
wav_audio = pcm_to_wav(pcm_audio, sample_rate=24000)
```
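The `pcm_to_wav` helper lives in `scripts/pcm_to_wav.py`; a minimal sketch of what such a conversion can look like with Python's standard `wave` module, assuming 16-bit mono PCM at 24 kHz as described in the Audio Format section:

```python
import io
import wave

def pcm_to_wav(pcm_data: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV container and return the WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)        # mono
        wav.setsampwidth(sample_width)    # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)     # 24 kHz
        wav.writeframes(pcm_data)
    return buf.getvalue()
```

The resulting bytes start with the standard `RIFF`/`WAVE` header, so browsers can play them directly once base64-decoded.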

Frontend Audio Playback


```javascript
// Convert base64 WAV to a playable blob
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64);
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
new Audio(audioUrl).play();
```

Voice Options


| Voice   | Character  |
|---------|------------|
| alloy   | Neutral    |
| echo    | Warm       |
| fable   | Expressive |
| onyx    | Deep       |
| nova    | Friendly   |
| shimmer | Clear      |
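To pick one of these voices, include a `voice` field in the session configuration sent via `session.update` (a sketch; the field follows the Realtime API session schema, and `"nova"` here is an arbitrary example choice):

```python
# Session configuration selecting a voice, merged with the narrator
# instructions used in the backend example above
session_config = {
    "output_modalities": ["audio"],
    "voice": "nova",  # any of: alloy, echo, fable, onyx, nova, shimmer
    "instructions": "You are a narrator. Speak naturally.",
}
```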

Realtime API Events


  • `response.output_audio.delta` - Base64 audio chunk
  • `response.output_audio_transcript.delta` - Transcript text
  • `response.done` - Generation complete
  • `error` - Handle with `event.error.message`

Audio Format


  • Input: Text prompt
  • Output: PCM audio (24kHz, 16-bit, mono)
  • Storage: Base64-encoded WAV
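At 24 kHz, 16-bit, mono, each second of audio is 24000 × 2 × 1 = 48000 bytes of raw PCM, so the clip duration can be derived directly from the byte count (a small sketch):

```python
SAMPLE_RATE = 24000  # Hz
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono

def pcm_duration_seconds(pcm: bytes) -> float:
    """Length of a raw PCM buffer in seconds."""
    # each frame is SAMPLE_WIDTH * CHANNELS bytes
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

This is handy for showing progress or validating that the collected chunks add up to a plausible clip length.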

References


  • Full architecture: See references/architecture.md for complete stack design
  • Code examples: See references/code-examples.md for production patterns
  • PCM conversion: Use scripts/pcm_to_wav.py for audio format conversion

When to Use


Use this skill whenever you need to turn text content into spoken audio, such as podcast narration, using the Realtime API workflow described above.