azure-ai-evaluation-py

Azure AI Evaluation SDK for Python

Assess generative AI application performance with built-in quality, safety, and agent evaluators, Azure OpenAI graders, and custom evaluators.

Installation


bash
pip install azure-ai-evaluation

# With red team support
pip install azure-ai-evaluation[redteam]

Environment Variables


bash
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>

Built-in Evaluators

Quality Evaluators (AI-Assisted)

python
import os

from azure.ai.evaluation import (
    GroundednessEvaluator,
    GroundednessProEvaluator,  # Service-based groundedness
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator
)

# Initialize with an Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)

# For reasoning models (o1/o3), set is_reasoning_model
groundedness_reasoning = GroundednessEvaluator(model_config, is_reasoning_model=True)

Quality Evaluators (NLP-based)


python
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)  # rouge_type selects the ROUGE variant
bleu = BleuScoreEvaluator()
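
NLP evaluators score a response against a ground-truth reference, so they are called with response and ground_truth rather than a model config. A minimal single-call sketch (the strings are illustrative):

python
# Token-overlap F1 between the response and the reference answer
result = f1(
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform of AI services and tools."
)
print(result["f1_score"])  # 0-1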

Safety Evaluators


python
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator,
    CodeVulnerabilityEvaluator,
    UngroundedAttributesEvaluator
)

# Project scope for safety evaluators
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT_NAME"],
}

violence = ViolenceEvaluator(azure_ai_project=azure_ai_project)
sexual = SexualEvaluator(azure_ai_project=azure_ai_project)
code_vuln = CodeVulnerabilityEvaluator(azure_ai_project=azure_ai_project)

# Control whether queries are evaluated (default: False, only the response is evaluated)
violence_with_query = ViolenceEvaluator(azure_ai_project=azure_ai_project, evaluate_query=True)
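
Safety evaluators are invoked per row like the quality evaluators. A minimal sketch, assuming the severity-plus-reason output shape of the built-in safety evaluators:

python
result = violence(
    query="What is Azure AI?",
    response="Azure AI provides AI services and tools."
)
print(result)  # severity fields such as violence / violence_score / violence_reason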

Agent Evaluators


python
from azure.ai.evaluation import (
    IntentResolutionEvaluator,
    ResponseCompletenessEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator
)

intent = IntentResolutionEvaluator(model_config)
completeness = ResponseCompletenessEvaluator(model_config)
task_adherence = TaskAdherenceEvaluator(model_config)
tool_accuracy = ToolCallAccuracyEvaluator(model_config)
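
Agent evaluators are likewise called per row; a minimal intent-resolution sketch (illustrative strings):

python
result = intent(
    query="Book a table for two at 7pm tonight.",
    response="Done. I've booked a table for two at 7pm tonight."
)
print(result["intent_resolution"])  # 1-5, per the reference table below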

Single Row Evaluation


python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools."
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")

Batch Evaluation with evaluate()


python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }
        }
    },
    # Optional: Add tags for experiment tracking
    tags={"experiment": "v1", "model": "gpt-4o"}
)

print(result["metrics"])
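
Each line of test_data.jsonl is a JSON object whose fields match the ${data.*} references in column_mapping. A sketch that writes a one-row input file (content illustrative):

python
import json

rows = [
    {
        "query": "What is Azure AI?",
        "context": "Azure AI is Microsoft's AI platform...",
        "response": "Azure AI provides AI services and tools."
    }
]
# One JSON object per line, matching ${data.query}, ${data.context}, ${data.response}
with open("test_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")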

Composite Evaluators


python
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=azure_ai_project)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator
    }
)

Azure OpenAI Graders


Use grader classes for structured evaluation via Azure OpenAI's grading API:
python
from azure.ai.evaluation import (
    AzureOpenAILabelGrader,
    AzureOpenAIStringCheckGrader,
    AzureOpenAITextSimilarityGrader,
    AzureOpenAIScoreModelGrader,
    AzureOpenAIPythonGrader
)

# Label grader for classification
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    labels=["positive", "negative", "neutral"],
    passing_labels=["positive"]
)

# Score model grader with a custom threshold
score_grader = AzureOpenAIScoreModelGrader(
    model_config=model_config,
    pass_threshold=0.7
)

# Use graders as evaluators in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={
        "sentiment": label_grader,
        "quality": score_grader
    }
)

Evaluate Application Target


python
from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes query, returns response
    evaluators={
        "groundedness": groundedness
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
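
The target callable is invoked with the mapped data fields, and the keys it returns become the ${outputs.*} columns. A minimal stand-in for chat_app under that assumption (the body is a placeholder for your real retrieval and generation logic):

python
def chat_app(query: str) -> dict:
    # Placeholder retrieval and generation; replace with your application logic
    context = "Azure AI is Microsoft's AI platform..."
    response = f"Answer about: {query}"
    # These keys surface as ${outputs.context} and ${outputs.response}
    return {"context": context, "response": response}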

Custom Evaluators

Code-Based

python
from azure.ai.evaluation import evaluator

@evaluator
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator}
)

Class-Based with Initialization


python
class DomainSpecificEvaluator:
    def __init__(self, domain_terms: list[str], threshold: float = 0.5):
        self.domain_terms = [t.lower() for t in domain_terms]
        self.threshold = threshold
    
    def __call__(self, response: str) -> dict:
        response_lower = response.lower()
        matches = sum(1 for term in self.domain_terms if term in response_lower)
        score = matches / len(self.domain_terms) if self.domain_terms else 0
        return {
            "domain_relevance": score,
            "passes_threshold": score >= self.threshold
        }
# Usage
domain_eval = DomainSpecificEvaluator(domain_terms=["azure", "cloud", "api"])
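
Calling the instance returns the metric dict:

python
print(domain_eval(response="Azure exposes cloud APIs for most services."))
# {'domain_relevance': 1.0, 'passes_threshold': True}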

Prompt-Based with Azure OpenAI


python
from openai import AzureOpenAI
import json

class PromptBasedEvaluator:
    def __init__(self, model_config: dict):
        self.client = AzureOpenAI(
            azure_endpoint=model_config["azure_endpoint"],
            api_key=model_config.get("api_key"),
            api_version="2024-06-01"
        )
        self.deployment = model_config["azure_deployment"]
    
    def __call__(self, query: str, response: str) -> dict:
        prompt = f"Rate this response 1-5 for helpfulness. Query: {query}, Response: {response}. Return JSON: {{\"score\": <int>}}"
        completion = self.client.chat.completions.create(
            model=self.deployment,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"}
        )
        result = json.loads(completion.choices[0].message.content)
        return {"helpfulness": result["score"]}
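
Usage mirrors the built-in evaluators, assuming the model_config defined earlier and a reachable deployment:

python
helpfulness = PromptBasedEvaluator(model_config)
result = helpfulness(
    query="What is Azure AI?",
    response="Azure AI provides AI services and tools."
)
print(result["helpfulness"])  # 1-5, parsed from the model's JSON reply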

Log to Foundry Project


python
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential()
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope,  # Logs results to Foundry
    tags={"version": "1.0", "experiment": "baseline"}
)

print(f"View results: {result['studio_url']}")

Red Team Adversarial Testing


python
from azure.ai.evaluation.red_team import RedTeam, AttackStrategy
from azure.identity import DefaultAzureCredential

red_team = RedTeam(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential()
)
# Run an adversarial scan against your application
result = await red_team.scan(
    target=my_chat_app,  # Your application callable
    risk_categories=["violence", "hate_unfairness", "sexual", "self_harm"],
    attack_strategies=[
        AttackStrategy.DIRECT,
        AttackStrategy.MultiTurn,
        AttackStrategy.Crescendo
    ],
    attack_success_thresholds={"violence": 3, "hate_unfairness": 3}
)
print(f"Attack success rate: {result.attack_success_rate}")
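
scan() is a coroutine, so the await above assumes an async context such as a notebook. In a plain script, drive it with asyncio:

python
import asyncio

async def main():
    result = await red_team.scan(
        target=my_chat_app,
        risk_categories=["violence", "hate_unfairness", "sexual", "self_harm"],
        attack_strategies=[AttackStrategy.DIRECT]
    )
    print(f"Attack success rate: {result.attack_success_rate}")

asyncio.run(main())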

Multimodal Evaluation


python
from azure.ai.evaluation import ContentSafetyEvaluator

safety = ContentSafetyEvaluator(azure_ai_project=azure_ai_project)
# Evaluate conversations with images
conversation = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "The image shows..."}
        ]}
    ]
}

result = safety(conversation=conversation)

Evaluator Reference


| Evaluator | Type | Metrics |
| --- | --- | --- |
| GroundednessEvaluator | AI | groundedness (1-5) |
| GroundednessProEvaluator | Service | groundedness (1-5) |
| RelevanceEvaluator | AI | relevance (1-5) |
| CoherenceEvaluator | AI | coherence (1-5) |
| FluencyEvaluator | AI | fluency (1-5) |
| SimilarityEvaluator | AI | similarity (1-5) |
| RetrievalEvaluator | AI | retrieval (1-5) |
| F1ScoreEvaluator | NLP | f1_score (0-1) |
| RougeScoreEvaluator | NLP | rouge scores |
| BleuScoreEvaluator | NLP | bleu_score (0-1) |
| IntentResolutionEvaluator | Agent | intent_resolution (1-5) |
| ResponseCompletenessEvaluator | Agent | response_completeness (1-5) |
| TaskAdherenceEvaluator | Agent | task_adherence (1-5) |
| ToolCallAccuracyEvaluator | Agent | tool_call_accuracy (1-5) |
| ViolenceEvaluator | Safety | violence (0-7) |
| SexualEvaluator | Safety | sexual (0-7) |
| SelfHarmEvaluator | Safety | self_harm (0-7) |
| HateUnfairnessEvaluator | Safety | hate_unfairness (0-7) |
| CodeVulnerabilityEvaluator | Safety | code vulnerabilities |
| UngroundedAttributesEvaluator | Safety | ungrounded attributes |
| QAEvaluator | Composite | All quality metrics |
| ContentSafetyEvaluator | Composite | All safety metrics |

Best Practices


  1. Use composite evaluators for comprehensive assessment
  2. Map columns correctly; mismatched columns cause silent failures
  3. Log to Foundry with tags for tracking and comparison across runs
  4. Create custom evaluators for domain-specific metrics
  5. Use NLP evaluators when you have ground-truth answers
  6. Safety evaluators require an Azure AI project scope
  7. Batch evaluation is more efficient than single-row loops
  8. Use graders for structured evaluation with Azure OpenAI's grading API
  9. Use agent evaluators for AI agents with tool calls
  10. Run RedTeam scans for adversarial safety testing before deployment
  11. Use is_reasoning_model=True when evaluating with o1/o3 models

Reference Files


| File | Contents |
| --- | --- |
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, safety, and agent evaluators, with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators; testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, agent, and custom evaluators |