Podcast Generation with GPT Realtime Mini
Generate real audio narratives from text content using Azure OpenAI's Realtime API.
Quick Start
- Configure environment variables for the Realtime API
- Connect via WebSocket to the Azure OpenAI Realtime endpoint
- Send a text prompt; collect PCM audio chunks and the transcript
- Convert the PCM audio to WAV format
- Return base64-encoded audio to the frontend for playback
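The final step can be sketched as a small payload builder. `audio_data` is the field the frontend playback code reads; the other keys here are illustrative assumptions, not a fixed contract:

```python
import base64

def build_response(wav_audio: bytes, transcript: str) -> dict:
    """Payload for the frontend: base64 WAV plus the collected transcript.

    `audio_data` matches the field the frontend reads; `transcript` and
    `mime_type` are illustrative, not a fixed contract.
    """
    return {
        "audio_data": base64.b64encode(wav_audio).decode("ascii"),
        "transcript": transcript,
        "mime_type": "audio/wav",
    }
```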
Environment Configuration
```env
AZURE_OPENAI_AUDIO_API_KEY=your_realtime_api_key
AZURE_OPENAI_AUDIO_ENDPOINT=https://your-resource.cognitiveservices.azure.com
AZURE_OPENAI_AUDIO_DEPLOYMENT=gpt-realtime-mini
```

Note: the endpoint should NOT include `/openai/v1`; use just the base URL.
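A small helper (a sketch, not part of the skill's scripts) can load these variables and fail fast, including a guard against the base-URL mistake noted above:

```python
import os

REQUIRED_VARS = [
    "AZURE_OPENAI_AUDIO_API_KEY",
    "AZURE_OPENAI_AUDIO_ENDPOINT",
    "AZURE_OPENAI_AUDIO_DEPLOYMENT",
]

def load_realtime_config() -> dict:
    """Read the Realtime API settings, raising if any are missing or malformed."""
    config = {name: os.environ.get(name, "") for name in REQUIRED_VARS}
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # The endpoint must be the base URL; /openai/v1 is appended in code later
    if config["AZURE_OPENAI_AUDIO_ENDPOINT"].rstrip("/").endswith("/openai/v1"):
        raise RuntimeError("AZURE_OPENAI_AUDIO_ENDPOINT must not include /openai/v1")
    return config
```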
Core Workflow
Backend Audio Generation
```python
from openai import AsyncOpenAI
import base64

# Convert HTTPS endpoint to WebSocket URL
ws_url = endpoint.replace("https://", "wss://") + "/openai/v1"
client = AsyncOpenAI(
    websocket_base_url=ws_url,
    api_key=api_key,
)

audio_chunks = []
transcript_parts = []

async with client.realtime.connect(model="gpt-realtime-mini") as conn:
    # Configure for audio-only output
    await conn.session.update(session={
        "output_modalities": ["audio"],
        "instructions": "You are a narrator. Speak naturally.",
    })
    # Send the text to narrate
    await conn.conversation.item.create(item={
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": prompt}],
    })
    await conn.response.create()
    # Collect streaming events
    async for event in conn:
        if event.type == "response.output_audio.delta":
            audio_chunks.append(base64.b64decode(event.delta))
        elif event.type == "response.output_audio_transcript.delta":
            transcript_parts.append(event.delta)
        elif event.type == "response.done":
            break

# Convert PCM to WAV (see scripts/pcm_to_wav.py)
pcm_audio = b''.join(audio_chunks)
wav_audio = pcm_to_wav(pcm_audio, sample_rate=24000)
```
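The `pcm_to_wav` helper lives in `scripts/pcm_to_wav.py` and is not shown here; a minimal equivalent using only Python's standard `wave` module might look like this:

```python
import io
import wave

def pcm_to_wav(pcm_data: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw 16-bit PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_data)
    return buf.getvalue()
```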
Frontend Audio Playback
```javascript
// Convert base64 WAV to a playable blob
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64);
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
new Audio(audioUrl).play();
```

Voice Options
| Voice | Character |
|---|---|
| alloy | Neutral |
| echo | Warm |
| fable | Expressive |
| onyx | Deep |
| nova | Friendly |
| shimmer | Clear |
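The voice is selected in the session configuration. A sketch, with the caveat that the exact field layout (the nested `audio.output.voice` used here) varies across Realtime API versions and should be checked against your SDK:

```python
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def make_session_config(voice: str = "alloy") -> dict:
    """Build a session payload selecting an output voice from the table above.

    The nested audio.output.voice placement is an assumption based on newer
    Realtime API versions; older versions used a top-level `voice` field.
    """
    if voice not in VOICES:
        raise ValueError(f"Unknown voice: {voice!r}")
    return {
        "output_modalities": ["audio"],
        "audio": {"output": {"voice": voice}},
        "instructions": "You are a narrator. Speak naturally.",
    }
```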
Realtime API Events
- `response.output_audio.delta` - Base64 audio chunk
- `response.output_audio_transcript.delta` - Transcript text
- `response.done` - Generation complete
- `error` - Handle with `event.error.message`
Audio Format
- Input: Text prompt
- Output: PCM audio (24kHz, 16-bit, mono)
- Storage: Base64-encoded WAV
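Given this format, clip length follows directly from the byte count (24,000 samples/s × 2 bytes/sample × 1 channel = 48,000 bytes per second):

```python
def pcm_duration_seconds(pcm_bytes: bytes,
                         sample_rate: int = 24000,
                         sample_width: int = 2,
                         channels: int = 1) -> float:
    """Duration of raw PCM audio: bytes / (rate * width * channels)."""
    return len(pcm_bytes) / (sample_rate * sample_width * channels)
```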
References
- Full architecture: see `references/architecture.md` for the complete stack design
- Code examples: see `references/code-examples.md` for production patterns
- PCM conversion: use `scripts/pcm_to_wav.py` for audio format conversion
When to Use
Apply this skill whenever you need to execute the workflow described above: generating narrated audio from text via the Realtime API.