# Agent Evaluation with MLflow

Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.

## ⛔ CRITICAL: Must Use MLflow APIs

DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

- Datasets: Use `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- Scorers: Use `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- Evaluation: Use `mlflow.genai.evaluate()` - NOT custom evaluation loops
- Scripts: Use the provided `scripts/` directory templates - NOT custom `evaluation/` directories

Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.

If you're tempted to create `evaluation/eval_dataset.py` or similar custom files, STOP. Use `scripts/create_dataset_template.py` instead.
## Quick Start

⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.

Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing.

Evaluation workflow in 4 steps (each uses MLflow APIs):

1. Understand: Run the agent, inspect traces, understand its purpose
2. Scorers: Select and register scorers for quality criteria
3. Dataset: ALWAYS discover existing datasets first; only create new ones if needed
4. Evaluate: Run the agent on the dataset, apply scorers, analyze results

## Command Conventions

Always use `uv run` for MLflow and Python commands:

```bash
uv run mlflow --version          # MLflow CLI commands
uv run python scripts/xxx.py     # Python script execution
uv run python -c "..."           # Python one-liners
```

This ensures commands run in the correct environment with proper dependencies.

CRITICAL: Separate stderr from stdout when capturing CLI output.

When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:

```bash
# Save both streams separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```
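The same stream separation applies when driving a CLI from Python. The snippet below is illustrative only: the child command is a stand-in for any CLI that mixes logs with JSON output.

```python
import json
import subprocess
import sys

# Illustrative stand-in for a CLI that writes JSON to stdout and logs to stderr.
child = (
    'import sys, json; '
    'print(json.dumps({"ok": True})); '
    'print("log line", file=sys.stderr)'
)
proc = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True,  # keeps the two streams separate, like `> f 2> g`
    text=True,
    check=True,
)
data = json.loads(proc.stdout)  # structured output parses cleanly
print(data["ok"])               # True
print(proc.stderr.strip())      # log line
```

If the two streams were merged (`2>&1`), the `json.loads` call would fail on the first log line - the same failure mode the redirection convention above prevents.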

## Documentation Access Protocol

All MLflow documentation must be accessed through llms.txt:

1. Start at: https://mlflow.org/docs/latest/llms.txt
2. Query llms.txt for your topic with a specific prompt
3. If llms.txt references another doc, use WebFetch with that URL
4. Do not use WebSearch - use WebFetch with llms.txt first

This applies to all steps, especially:

- Dataset creation (read the GenAI dataset docs from llms.txt)
- Scorer registration (check the MLflow docs for scorer APIs)
- Evaluation execution (understand the mlflow.genai.evaluate API)

## Discovering Agent Structure

Each project has a unique structure. Use dynamic exploration instead of assumptions:

### Find Agent Entry Points

```bash
# Search for main agent functions
grep -r "def .*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def .*route" . --include="*.py"
```

### Understand Project Structure

```bash
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -E -A 5 "scripts|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```
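The same discovery can be scripted when shell tools aren't available. This is a heuristic sketch: the regex and the throwaway demo file are illustrative, not part of MLflow.

```python
import pathlib
import re
import tempfile

# Heuristic: match function definitions that look like agent entry points,
# mirroring the grep patterns above.
ENTRY_RE = re.compile(r"^\s*def\s+(run|stream|handle|process|\w*agent\w*)\s*\(", re.M)

def find_entry_points(root):
    """Return (path, function name) pairs for likely entry points under root."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits.extend((str(path), name) for name in ENTRY_RE.findall(text))
    return hits

# Demo on a throwaway file rather than a real project:
with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "agent.py").write_text("def run_agent(query):\n    pass\n")
    found = find_entry_points(tmp)
print([name for _, name in found])  # ['run_agent']
```

Treat matches as candidates to inspect, not definitive answers - naming conventions vary by project.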

## Setup Overview

Before evaluation, complete these three setup steps:

1. Install MLflow (version >=3.8.0)
2. Configure environment (tracking URI and experiment)
   - Guide: Follow `references/setup-guide.md` Steps 1-2
3. Integrate tracing (autolog and `@mlflow.trace` decorators)
   - ⚠️ MANDATORY: Follow `references/tracing-integration.md` - the authoritative tracing guide
   - VERIFY: Run `scripts/validate_agent_tracing.py` after implementing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.

Checkpoint - verify before proceeding:

- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and `@mlflow.trace` decorators added
- A test run creates a trace (verify the trace ID is not None)

Validation scripts:

```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```
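A minimal pre-flight for the two environment variables in the checklist can look like the sketch below. It is a convenience check only and does not replace `scripts/validate_environment.py`, which also verifies the install and connectivity.

```python
import os

# Variable names come from the checkpoint above.
def missing_env_vars(required=("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

print(missing_env_vars())  # lists any variables still unset
```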

## Evaluation Workflow

### Step 1: Understand Agent Purpose

1. Invoke the agent with a sample input
2. Inspect the MLflow trace (especially LLM prompts describing the agent's purpose)
3. Print your understanding and ask the user for verification
4. Wait for confirmation before proceeding

### Step 2: Define Quality Scorers

1. Check registered scorers in your experiment:

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
   ```

   IMPORTANT: If there are registered scorers in the experiment, they must be used for evaluation.

2. Select additional built-in scorers that apply to the agent.

   See `references/scorers.md` for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.

3. Create additional custom scorers as needed.

   If needed, create additional scorers using the `make_judge()` API. See `references/scorers.md` for how to create custom scorers and `references/scorers-constraints.md` for best practices.

4. REQUIRED: Register new scorers before evaluation using the Python API:

   ```python
   from mlflow.genai.judges import make_judge
   from mlflow.genai.scorers import BuiltinScorerName
   import os

   scorer = make_judge(...)  # Or, scorer = BuiltinScorerName()
   scorer.register()
   ```

   IMPORTANT: See `references/scorers.md` → "Model Selection for Scorers" to configure the `model` parameter of scorers before registration.

   ⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in `mlflow scorers list` and won't be reusable.

5. Verify registration:

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID  # Should show your scorers
   ```

### Step 3: Prepare Evaluation Dataset

ALWAYS discover existing datasets first to prevent duplicate work:

1. Run dataset discovery (mandatory):

   ```bash
   uv run python scripts/list_datasets.py  # Lists, compares, recommends datasets
   uv run python scripts/list_datasets.py --format json  # Machine-readable output
   uv run python scripts/list_datasets.py --help  # All options
   ```

2. Present findings to the user:
   - Show all discovered datasets with their characteristics (size, topics covered)
   - If datasets are found, highlight the most relevant options based on the agent type

3. Ask the user about existing datasets:
   - "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
   - If yes: Ask which dataset to use and record the dataset name
   - If no: Proceed to step 4

4. Create a new dataset only if the user declined the existing ones:

   ```bash
   # Generates a dataset creation script from a test cases file
   uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
   uv run python scripts/create_dataset_template.py --help  # See all options
   ```

   The generated code uses `mlflow.genai.datasets` APIs - review and execute the script.

IMPORTANT: Do not skip dataset discovery. Always run `list_datasets.py` first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.

For the complete dataset guide, see `references/dataset-preparation.md`.

Checkpoint - verify before proceeding:

- Scorers have been registered
- Dataset has been created

### Step 4: Run Evaluation

1. Generate and run the evaluation script:

   ```bash
   # Generate evaluation script (specify module and entry point)
   uv run python scripts/run_evaluation_template.py \
     --module mlflow_agent.agent \
     --entry-point run_agent

   # Review the generated script, then execute it
   uv run python run_agent_evaluation.py
   ```

   The generated script creates a wrapper function that:
   - Accepts keyword arguments matching the dataset's input keys
   - Provides any additional arguments the agent needs (like `llm_provider`)
   - Runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
   - Saves results to `evaluation_results.csv`

   ⚠️ CRITICAL: The wrapper signature must match the dataset input keys.

   MLflow calls `predict_fn(**inputs)` - it unpacks the inputs dict as keyword arguments.

   | Dataset Record | MLflow Calls | predict_fn Must Be |
   |---|---|---|
   | `{"inputs": {"query": "..."}}` | `predict_fn(query="...")` | `def wrapper(query):` |
   | `{"inputs": {"question": "...", "context": "..."}}` | `predict_fn(question="...", context="...")` | `def wrapper(question, context):` |

   Common Mistake (WRONG):

   ```python
   def wrapper(inputs):  # ❌ WRONG - inputs is NOT a dict
       return agent(inputs["query"])
   ```

2. Analyze the results:

   ```bash
   # Pattern detection, failure analysis, recommendations
   uv run python scripts/analyze_results.py evaluation_results.csv
   ```

   Generates `evaluation_report.md` with pass rates and improvement suggestions.
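The wrapper-signature rule from step 1 can be sketched with plain Python: the `agent` function below is a hypothetical stand-in for your real entry point, and only the keyword-unpacking behavior matters.

```python
# Hypothetical stand-in for a real agent entry point.
def agent(query: str) -> str:
    return f"answer to: {query}"

# Correct: parameter names mirror the dataset record's input keys.
def wrapper(query):
    return agent(query)

record = {"inputs": {"query": "What does this agent do?"}}
# The evaluation harness invokes predict_fn(**inputs):
result = wrapper(**record["inputs"])
print(result)  # answer to: What does this agent do?

# The mistaken form (a single `inputs` parameter) fails immediately,
# because the dict is unpacked into keyword arguments it doesn't accept:
try:
    (lambda inputs: agent(inputs["query"]))(**record["inputs"])
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```

The `TypeError` in the second call is the same error you will see from `mlflow.genai.evaluate()` if the wrapper's parameters don't match the dataset's input keys.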

## References

Detailed guides in `references/` (load as needed):

- setup-guide.md - Environment setup (MLflow install, tracking URI configuration)
- tracing-integration.md - Authoritative tracing guide (autolog, decorators, session tracking, verification)
- dataset-preparation.md - Dataset schema, APIs, creation, Unity Catalog
- scorers.md - Built-in vs. custom scorers, registration, testing
- scorers-constraints.md - CLI requirements for custom scorers (yes/no format, templates)
- troubleshooting.md - Common errors by phase with solutions

Scripts are self-documenting - run them with `--help` for usage details.