# azure-ai-evaluation-py
Azure AI Evaluation SDK for Python
Assess generative AI application performance with built-in quality, safety, agent evaluators, Azure OpenAI graders, and custom evaluators.
## Installation
```bash
pip install azure-ai-evaluation
```

With red team support:

```bash
pip install azure-ai-evaluation[redteam]
```
## Environment Variables

For AI-assisted evaluators:

```bash
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
```

For Foundry project integration:

```bash
AIPROJECT_CONNECTION_STRING=<your-connection-string>
```
## Built-in Evaluators

### Quality Evaluators (AI-Assisted)

```python
import os

from azure.ai.evaluation import (
    GroundednessEvaluator,
    GroundednessProEvaluator,  # Service-based groundedness
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator
)

# Initialize with an Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}
groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)

# For reasoning models (o1/o3), use the is_reasoning_model parameter
groundedness_reasoning = GroundednessEvaluator(model_config, is_reasoning_model=True)
```
### Quality Evaluators (NLP-based)

```python
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)  # rouge_type selects the ROUGE variant
bleu = BleuScoreEvaluator()
```
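The NLP evaluators compare a response against a ground-truth answer using token overlap. As a rough illustration of what an overlap-based F1 measures (a simplified sketch over whitespace tokens, not the SDK's actual tokenization or implementation):

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Whitespace-token F1: harmonic mean of precision and recall over shared tokens."""
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

score = token_f1("Azure AI provides AI services",
                 "Azure AI provides services and tools")  # ≈ 0.727
```

The SDK's `F1ScoreEvaluator` similarly takes `response` and `ground_truth` and returns a score in [0, 1].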
### Safety Evaluators
```python
import os

from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator,
    CodeVulnerabilityEvaluator,
    UngroundedAttributesEvaluator
)

# Project scope for safety evaluators
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT_NAME"],
}
violence = ViolenceEvaluator(azure_ai_project=azure_ai_project)
sexual = SexualEvaluator(azure_ai_project=azure_ai_project)
code_vuln = CodeVulnerabilityEvaluator(azure_ai_project=azure_ai_project)

# Control whether queries are evaluated (default: False, only the response is evaluated)
violence_with_query = ViolenceEvaluator(azure_ai_project=azure_ai_project, evaluate_query=True)
```
### Agent Evaluators

```python
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    ResponseCompletenessEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator
)

intent = IntentResolutionEvaluator(model_config)
completeness = ResponseCompletenessEvaluator(model_config)
task_adherence = TaskAdherenceEvaluator(model_config)
tool_accuracy = ToolCallAccuracyEvaluator(model_config)
```

## Single Row Evaluation
```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)
result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools."
)
print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
```

## Batch Evaluation with evaluate()
```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }
        }
    },
    # Optional: add tags for experiment tracking
    tags={"experiment": "v1", "model": "gpt-4o"}
)
print(result["metrics"])
```
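`evaluate()` reads JSONL input: one JSON object per line, with keys matching the `${data.*}` bindings in `column_mapping`. A minimal sketch of producing a compatible `test_data.jsonl` (the rows are invented examples):

```python
import json

rows = [
    {
        "query": "What is Azure AI?",
        "context": "Azure AI is Microsoft's AI platform...",
        "response": "Azure AI provides AI services and tools."
    },
    {
        "query": "What does the evaluation SDK do?",
        "context": "The SDK assesses generative AI output quality and safety.",
        "response": "It scores responses for quality and safety."
    },
]

with open("test_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")  # one JSON object per line
```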
## Composite Evaluators
```python
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=azure_ai_project)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator
    }
)
```

## Azure OpenAI Graders
Use grader classes for structured evaluation via Azure OpenAI's grading API:

```python
from azure.ai.evaluation import (
    AzureOpenAILabelGrader,
    AzureOpenAIStringCheckGrader,
    AzureOpenAITextSimilarityGrader,
    AzureOpenAIScoreModelGrader,
    AzureOpenAIPythonGrader
)

# Label grader for classification
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    labels=["positive", "negative", "neutral"],
    passing_labels=["positive"]
)

# Score model grader with a custom threshold
score_grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    pass_threshold=0.7
)

# Use graders as evaluators in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={
        "sentiment": label_grader,
        "quality": score_grader
    }
)
```

## Evaluate Application Target
```python
from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes a query and returns a response
    evaluators={
        "groundedness": groundedness
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
```

## Custom Evaluators

### Code-Based
```python
from azure.ai.evaluation import evaluator

@evaluator
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator}
)
```
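Since code-based evaluators are plain callables, they can be unit-tested before a batch run. A minimal sketch (the decorator is omitted here so the snippet runs without the SDK installed):

```python
# Same logic as the decorated word_count_evaluator above, re-declared standalone
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

def test_word_count_evaluator():
    assert word_count_evaluator("Azure AI Evaluation SDK")["word_count"] == 4
    assert word_count_evaluator("")["word_count"] == 0

test_word_count_evaluator()
```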
### Class-Based with Initialization
```python
class DomainSpecificEvaluator:
    def __init__(self, domain_terms: list[str], threshold: float = 0.5):
        self.domain_terms = [t.lower() for t in domain_terms]
        self.threshold = threshold

    def __call__(self, response: str) -> dict:
        response_lower = response.lower()
        matches = sum(1 for term in self.domain_terms if term in response_lower)
        score = matches / len(self.domain_terms) if self.domain_terms else 0
        return {
            "domain_relevance": score,
            "passes_threshold": score >= self.threshold
        }

# Usage
domain_eval = DomainSpecificEvaluator(domain_terms=["azure", "cloud", "api"])
```
### Prompt-Based with Azure OpenAI
```python
import json

from openai import AzureOpenAI

class PromptBasedEvaluator:
    def __init__(self, model_config: dict):
        self.client = AzureOpenAI(
            azure_endpoint=model_config["azure_endpoint"],
            api_key=model_config.get("api_key"),
            api_version="2024-06-01"
        )
        self.deployment = model_config["azure_deployment"]

    def __call__(self, query: str, response: str) -> dict:
        prompt = (
            f"Rate this response 1-5 for helpfulness. Query: {query}, "
            f'Response: {response}. Return JSON: {{"score": <int>}}'
        )
        completion = self.client.chat.completions.create(
            model=self.deployment,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"}
        )
        result = json.loads(completion.choices[0].message.content)
        return {"helpfulness": result["score"]}
```
## Log to Foundry Project
```python
import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential()
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope,  # Logs results to Foundry
    tags={"version": "1.0", "experiment": "baseline"}
)
print(f"View results: {result['studio_url']}")
```

## Red Team Adversarial Testing
```python
from azure.ai.evaluation.red_team import RedTeam, AttackStrategy
from azure.identity import DefaultAzureCredential

red_team = RedTeam(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential()
)

# Run an adversarial scan against your application
result = await red_team.scan(
    target=my_chat_app,  # Your application callable
    risk_categories=["violence", "hate_unfairness", "sexual", "self_harm"],
    attack_strategies=[
        AttackStrategy.DIRECT,
        AttackStrategy.MultiTurn,
        AttackStrategy.Crescendo
    ],
    attack_success_thresholds={"violence": 3, "hate_unfairness": 3}
)
print(f"Attack success rate: {result.attack_success_rate}")
```

## Multimodal Evaluation
```python
from azure.ai.evaluation import ContentSafetyEvaluator

safety = ContentSafetyEvaluator(azure_ai_project=azure_ai_project)

# Evaluate conversations with images
conversation = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "The image shows..."}
        ]}
    ]
}
result = safety(conversation=conversation)
```

## Evaluator Reference
| Evaluator | Type | Metrics |
|---|---|---|
| GroundednessEvaluator | AI | groundedness (1-5) |
| GroundednessProEvaluator | Service | groundedness (1-5) |
| RelevanceEvaluator | AI | relevance (1-5) |
| CoherenceEvaluator | AI | coherence (1-5) |
| FluencyEvaluator | AI | fluency (1-5) |
| SimilarityEvaluator | AI | similarity (1-5) |
| RetrievalEvaluator | AI | retrieval (1-5) |
| F1ScoreEvaluator | NLP | f1_score (0-1) |
| RougeScoreEvaluator | NLP | rouge scores |
| BleuScoreEvaluator | NLP | bleu_score (0-1) |
| IntentResolutionEvaluator | Agent | intent_resolution (1-5) |
| ResponseCompletenessEvaluator | Agent | response_completeness (1-5) |
| TaskAdherenceEvaluator | Agent | task_adherence (1-5) |
| ToolCallAccuracyEvaluator | Agent | tool_call_accuracy (1-5) |
| ViolenceEvaluator | Safety | violence (0-7) |
| SexualEvaluator | Safety | sexual (0-7) |
| SelfHarmEvaluator | Safety | self_harm (0-7) |
| HateUnfairnessEvaluator | Safety | hate_unfairness (0-7) |
| CodeVulnerabilityEvaluator | Safety | code vulnerabilities |
| UngroundedAttributesEvaluator | Safety | ungrounded attributes |
| QAEvaluator | Composite | All quality metrics |
| ContentSafetyEvaluator | Composite | All safety metrics |
## Best Practices

- Use composite evaluators for comprehensive assessment
- Map columns correctly: mismatched columns cause silent failures
- Log to Foundry with `tags` for tracking and comparison across runs
- Create custom evaluators for domain-specific metrics
- Use NLP evaluators when you have ground-truth answers
- Safety evaluators require an Azure AI project scope
- Batch evaluation is more efficient than single-row loops
- Use graders for structured evaluation with Azure OpenAI's grading API
- Use agent evaluators for AI agents that make tool calls
- Run RedTeam scans for adversarial safety testing before deployment
- Set `is_reasoning_model=True` when evaluating with o1/o3 models
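Because mismatched column mappings fail silently, a pre-flight check that every `${data.*}` binding exists in the dataset can save a wasted run. A hypothetical helper (`check_column_mapping` is not an SDK function):

```python
import json
import re

def check_column_mapping(data_path: str, column_mapping: dict) -> list[str]:
    """Return the ${data.*} bindings that are missing from the first JSONL row."""
    with open(data_path, encoding="utf-8") as f:
        first_row = json.loads(f.readline())
    missing = []
    for target, binding in column_mapping.items():
        m = re.fullmatch(r"\$\{data\.(\w+)\}", binding)
        # ${outputs.*} bindings are filled by the target at run time; skip them
        if m and m.group(1) not in first_row:
            missing.append(binding)
    return missing
```

For example, `check_column_mapping("test_data.jsonl", {"query": "${data.query}"})` returns `[]` when each row carries a `query` key, and flags the binding otherwise.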
## Reference Files
| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, Safety, and Agent evaluators with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators, testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, agent, and custom evaluators |