# Podcast Generation with GPT Realtime Mini
Generate real audio narratives from text content using Azure OpenAI's Realtime API.
## Quick Start
- Configure environment variables for Realtime API
- Connect via WebSocket to Azure OpenAI Realtime endpoint
- Send text prompt, collect PCM audio chunks + transcript
- Convert PCM to WAV format
- Return base64-encoded audio to frontend for playback
## Environment Configuration
```env
AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.cognitiveservices.azure.com
AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini
```

Note: the endpoint should NOT include `/openai/v1/` - just the base URL.

## Core Workflow
### Backend Audio Generation
```python
import base64
import os

from openai import AsyncOpenAI

api_key = os.environ["AZURE_OPENAI_AUDIO_API_KEY"]
endpoint = os.environ["AZURE_OPENAI_AUDIO_ENDPOINT"]

# Convert HTTPS endpoint to WebSocket URL
ws_url = endpoint.replace("https://", "wss://") + "/openai/v1"

client = AsyncOpenAI(
    websocket_base_url=ws_url,
    api_key=api_key,
)

audio_chunks = []
transcript_parts = []

async with client.realtime.connect(model="gpt-realtime-mini") as conn:
    # Configure for audio-only output
    await conn.session.update(session={
        "output_modalities": ["audio"],
        "instructions": "You are a narrator. Speak naturally.",
    })

    # Send text to narrate
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": prompt}],
    })
    await conn.response.create()

    # Collect streaming events
    async for event in conn:
        if event.type == "response.output_audio.delta":
            audio_chunks.append(base64.b64decode(event.delta))
        elif event.type == "response.output_audio_transcript.delta":
            transcript_parts.append(event.delta)
        elif event.type == "response.done":
            break

# Convert PCM to WAV (see scripts/pcm_to_wav.py)
pcm_audio = b''.join(audio_chunks)
wav_audio = pcm_to_wav(pcm_audio, sample_rate=24000)
```
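The `pcm_to_wav` helper lives in `scripts/pcm_to_wav.py`. A minimal sketch of what such a helper can look like, using only the standard-library `wave` module and assuming the 24 kHz, 16-bit, mono format described under "Audio Format" below:

```python
import io
import wave

def pcm_to_wav(pcm_data: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # 24 kHz Realtime API output
        wav.writeframes(pcm_data)
    return buf.getvalue()
```

The result is plain WAV bytes (44-byte header plus the PCM payload), ready to be base64-encoded for the frontend.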
### Frontend Audio Playback
```javascript
// Convert base64 WAV to playable blob
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64);
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
new Audio(audioUrl).play();
```
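On the backend side, the payload the frontend consumes (`response.audio_data`) is just the WAV bytes base64-encoded into JSON. A sketch of that step (`encode_wav_response` is a hypothetical helper name, not part of the project):

```python
import base64

def encode_wav_response(wav_audio: bytes, transcript: str) -> dict:
    """Base64-encode WAV bytes for a JSON response to the frontend."""
    return {
        "audio_data": base64.b64encode(wav_audio).decode("ascii"),
        "transcript": transcript,
    }
```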
## Voice Options
| Voice | Character |
|---|---|
| alloy | Neutral |
| echo | Warm |
| fable | Expressive |
| onyx | Deep |
| nova | Friendly |
| shimmer | Clear |
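The voice is chosen when the session is configured. A hedged sketch of building that payload - the top-level `voice` field follows the beta Realtime API shape, and some newer API versions nest it under the session's audio output settings, so check your SDK version:

```python
def narrator_session(voice: str = "alloy") -> dict:
    """Audio-only narration session payload (voice field placement is an
    assumption; some API versions nest it under audio output settings)."""
    return {
        "output_modalities": ["audio"],
        "voice": voice,
        "instructions": "You are a narrator. Speak naturally.",
    }

session = narrator_session("nova")
```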
## Realtime API Events

- `response.output_audio.delta` - Base64 audio chunk
- `response.output_audio_transcript.delta` - Transcript text
- `response.done` - Generation complete
- `error` - Handle with `event.error.message`
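The event handling from the backend snippet can be factored into a plain function, which also shows where the `error` event fits in (the function name and the stub event below are illustrative):

```python
import base64
from types import SimpleNamespace

def handle_event(event, audio_chunks, transcript_parts):
    """Route one Realtime API event; return True once generation is done."""
    if event.type == "response.output_audio.delta":
        audio_chunks.append(base64.b64decode(event.delta))
    elif event.type == "response.output_audio_transcript.delta":
        transcript_parts.append(event.delta)
    elif event.type == "error":
        raise RuntimeError(f"Realtime API error: {event.error.message}")
    return event.type == "response.done"

# Stub event, standing in for one yielded by the connection
chunks, parts = [], []
done = handle_event(
    SimpleNamespace(type="response.output_audio_transcript.delta", delta="Hi"),
    chunks, parts,
)
```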
## Audio Format
- Input: Text prompt
- Output: PCM audio (24kHz, 16-bit, mono)
- Storage: Base64-encoded WAV
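Because the output format is fixed, the duration of the raw PCM stream follows directly from its byte length:

```python
SAMPLE_RATE = 24000   # Hz
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono

def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of raw PCM audio in the Realtime API output format."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

One second of audio is 48,000 bytes (24,000 samples × 2 bytes), a useful sanity check when validating collected chunks.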
## References
- Full architecture: see `references/architecture.md` for the complete stack design
- Code examples: see `references/code-examples.md` for production patterns
- PCM conversion: use `scripts/pcm_to_wav.py` for audio format conversion