azure-ai-voicelive-dotnet


Azure.AI.VoiceLive (.NET)

Real-time voice AI SDK for building bidirectional voice assistants with Azure AI.

Installation

```bash
dotnet add package Azure.AI.VoiceLive
dotnet add package Azure.Identity
dotnet add package NAudio                    # For audio capture/playback
```

Current versions: Stable v1.0.0, Preview v1.1.0-beta.1

Environment Variables

```bash
AZURE_VOICELIVE_ENDPOINT=https://<resource>.services.ai.azure.com/
AZURE_VOICELIVE_MODEL=gpt-4o-realtime-preview
AZURE_VOICELIVE_VOICE=en-US-AvaNeural
```

Optional: API key if not using Entra ID

```bash
AZURE_VOICELIVE_API_KEY=<your-api-key>
```
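Whether to use Entra ID or the API key can be decided from the presence of this variable. A minimal stdlib-only sketch; the helper name `AuthMode` is illustrative, not part of the SDK:

```csharp
using System;

// Illustrative helper (not an SDK API): pick the auth path based on
// whether AZURE_VOICELIVE_API_KEY is set in the environment.
string AuthMode() =>
    string.IsNullOrEmpty(Environment.GetEnvironmentVariable("AZURE_VOICELIVE_API_KEY"))
        ? "entra"    // fall back to DefaultAzureCredential
        : "apikey";  // use AzureKeyCredential

Console.WriteLine($"Auth mode: {AuthMode()}");
```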

Authentication

Microsoft Entra ID (Recommended)

```csharp
using Azure.Identity;
using Azure.AI.VoiceLive;

Uri endpoint = new Uri("https://your-resource.cognitiveservices.azure.com");
DefaultAzureCredential credential = new DefaultAzureCredential();
VoiceLiveClient client = new VoiceLiveClient(endpoint, credential);
```

Required role: `Cognitive Services User` (assign in Azure Portal → Access control)

API Key

```csharp
using Azure;            // AzureKeyCredential
using Azure.AI.VoiceLive;

Uri endpoint = new Uri("https://your-resource.cognitiveservices.azure.com");
AzureKeyCredential credential = new AzureKeyCredential("your-api-key");
VoiceLiveClient client = new VoiceLiveClient(endpoint, credential);
```

Client Hierarchy

```
VoiceLiveClient
└── VoiceLiveSession (WebSocket connection)
    ├── ConfigureSessionAsync()
    ├── GetUpdatesAsync() → SessionUpdate events
    ├── AddItemAsync() → UserMessageItem, FunctionCallOutputItem
    ├── SendAudioAsync()
    └── StartResponseAsync()
```

Core Workflow

1. Start Session and Configure

```csharp
using Azure.Identity;
using Azure.AI.VoiceLive;

var endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_VOICELIVE_ENDPOINT"));
var client = new VoiceLiveClient(endpoint, new DefaultAzureCredential());

var model = "gpt-4o-mini-realtime-preview";

// Start session
using VoiceLiveSession session = await client.StartSessionAsync(model);

// Configure session
VoiceLiveSessionOptions sessionOptions = new()
{
    Model = model,
    Instructions = "You are a helpful AI assistant. Respond naturally.",
    Voice = new AzureStandardVoice("en-US-AvaNeural"),
    TurnDetection = new AzureSemanticVadTurnDetection()
    {
        Threshold = 0.5f,
        PrefixPadding = TimeSpan.FromMilliseconds(300),
        SilenceDuration = TimeSpan.FromMilliseconds(500)
    },
    InputAudioFormat = InputAudioFormat.Pcm16,
    OutputAudioFormat = OutputAudioFormat.Pcm16
};

// Set modalities (both text and audio for voice assistants)
sessionOptions.Modalities.Clear();
sessionOptions.Modalities.Add(InteractionModality.Text);
sessionOptions.Modalities.Add(InteractionModality.Audio);

await session.ConfigureSessionAsync(sessionOptions);
```

2. Process Events

```csharp
await foreach (SessionUpdate serverEvent in session.GetUpdatesAsync())
{
    switch (serverEvent)
    {
        case SessionUpdateResponseAudioDelta audioDelta:
            byte[] audioData = audioDelta.Delta.ToArray();
            // Play audio via NAudio or another audio library
            break;

        case SessionUpdateResponseTextDelta textDelta:
            Console.Write(textDelta.Delta);
            break;

        case SessionUpdateResponseFunctionCallArgumentsDone functionCall:
            // Handle function call (see Function Calling section)
            break;

        case SessionUpdateError error:
            Console.WriteLine($"Error: {error.Error.Message}");
            break;

        case SessionUpdateResponseDone:
            Console.WriteLine("\n--- Response complete ---");
            break;
    }
}
```

3. Send User Message

```csharp
await session.AddItemAsync(new UserMessageItem("Hello, can you help me?"));
await session.StartResponseAsync();
```

4. Function Calling

```csharp
using System.Text.Json;

// Define function
var weatherFunction = new VoiceLiveFunctionDefinition("get_current_weather")
{
    Description = "Get the current weather for a given location",
    Parameters = BinaryData.FromString("""
        {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state or country"
                }
            },
            "required": ["location"]
        }
        """)
};

// Add to session options
sessionOptions.Tools.Add(weatherFunction);

// Handle function call in event loop
if (serverEvent is SessionUpdateResponseFunctionCallArgumentsDone functionCall)
{
    if (functionCall.Name == "get_current_weather")
    {
        var parameters = JsonSerializer.Deserialize<Dictionary<string, string>>(functionCall.Arguments);
        string location = parameters?["location"] ?? "";

        // Call external service
        string weatherInfo = $"The weather in {location} is sunny, 75°F.";

        // Send result back to the model, then ask it to respond
        await session.AddItemAsync(new FunctionCallOutputItem(functionCall.CallId, weatherInfo));
        await session.StartResponseAsync();
    }
}
```
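For argument payloads with more structure than a flat string dictionary, parsing with `JsonDocument` avoids committing to a fixed type. A stdlib-only sketch; the helper name `GetStringArg` is illustrative, not part of the SDK:

```csharp
using System;
using System.Text.Json;

// Illustrative helper: extract a required string argument from the JSON
// arguments payload that arrives with a function-call event. Returns ""
// when the property is absent.
string GetStringArg(string argumentsJson, string name)
{
    using JsonDocument doc = JsonDocument.Parse(argumentsJson);
    return doc.RootElement.TryGetProperty(name, out JsonElement value)
        ? value.GetString() ?? ""
        : "";
}

Console.WriteLine(GetStringArg("{\"location\":\"Seattle, WA\"}", "location"));
```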

Voice Options

| Voice Type | Class | Example |
|---|---|---|
| Azure Standard | `AzureStandardVoice` | `"en-US-AvaNeural"` |
| Azure HD | `AzureStandardVoice` | `"en-US-Ava:DragonHDLatestNeural"` |
| Azure Custom | `AzureCustomVoice` | Custom voice with endpoint ID |

Supported Models

| Model | Description |
|---|---|
| `gpt-4o-realtime-preview` | GPT-4o with real-time audio |
| `gpt-4o-mini-realtime-preview` | Lightweight, fast interactions |
| `phi4-mm-realtime` | Cost-effective multimodal |

Key Types Reference

| Type | Purpose |
|---|---|
| `VoiceLiveClient` | Main client for creating sessions |
| `VoiceLiveSession` | Active WebSocket session |
| `VoiceLiveSessionOptions` | Session configuration |
| `AzureStandardVoice` | Standard Azure voice provider |
| `AzureSemanticVadTurnDetection` | Voice activity detection |
| `VoiceLiveFunctionDefinition` | Function tool definition |
| `UserMessageItem` | User text message |
| `FunctionCallOutputItem` | Function call response |
| `SessionUpdateResponseAudioDelta` | Audio chunk event |
| `SessionUpdateResponseTextDelta` | Text chunk event |

Best Practices

1. Always set both modalities — include `Text` and `Audio` for voice assistants
2. Use `AzureSemanticVadTurnDetection` — provides natural conversation flow
3. Configure an appropriate silence duration — 500 ms is typical and avoids premature cutoffs
4. Use a `using` statement — ensures the session is disposed properly
5. Handle all event types — check for errors, audio, text, and function calls
6. Use `DefaultAzureCredential` — never hardcode API keys

Error Handling

```csharp
if (serverEvent is SessionUpdateError error)
{
    if (error.Error.Message.Contains("Cancellation failed: no active response"))
    {
        // Benign error, can ignore
    }
    else
    {
        Console.WriteLine($"Error: {error.Error.Message}");
    }
}
```
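The check above can be factored into a predicate so the event loop stays readable. A stdlib-only sketch; the helper name `IsBenignError` is illustrative, and only the one message text shown in this document is treated as benign:

```csharp
using System;

// Illustrative helper: classify known-benign VoiceLive error messages so
// the event loop can skip logging them.
bool IsBenignError(string message) =>
    message.Contains("Cancellation failed: no active response");

Console.WriteLine(IsBenignError("Cancellation failed: no active response")); // True
```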

Audio Configuration

- Input format: `InputAudioFormat.Pcm16` (16-bit PCM)
- Output format: `OutputAudioFormat.Pcm16`
- Sample rate: 24 kHz recommended
- Channels: mono
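If an audio library delivers float samples, they must be packed into 16-bit little-endian PCM before sending audio to the session. A stdlib-only sketch, assuming mono float input in the range -1..1; the helper name `ToPcm16` is illustrative:

```csharp
using System;

// Illustrative helper: pack mono float samples (-1..1) into 16-bit
// little-endian PCM, the layout 16-bit PCM audio uses on the wire.
byte[] ToPcm16(float[] samples)
{
    byte[] buffer = new byte[samples.Length * 2];
    for (int i = 0; i < samples.Length; i++)
    {
        short s = (short)Math.Clamp(samples[i] * short.MaxValue, short.MinValue, short.MaxValue);
        buffer[i * 2] = (byte)(s & 0xFF);            // low byte first (little-endian)
        buffer[i * 2 + 1] = (byte)((s >> 8) & 0xFF); // then high byte
    }
    return buffer;
}

// At 24 kHz, mono, 16-bit, one second of audio is 24000 * 2 * 1 bytes.
Console.WriteLine(ToPcm16(new float[24000]).Length); // 48000
```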

Related SDKs

| SDK | Purpose | Install |
|---|---|---|
| `Azure.AI.VoiceLive` | Real-time voice (this SDK) | `dotnet add package Azure.AI.VoiceLive` |
| `Microsoft.CognitiveServices.Speech` | Speech-to-text, text-to-speech | `dotnet add package Microsoft.CognitiveServices.Speech` |
| `NAudio` | Audio capture/playback | `dotnet add package NAudio` |

Reference Links
