agent-evaluation
# Agent Evaluation with MLflow
Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.
## ⛔ CRITICAL: Must Use MLflow APIs
DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

- Datasets: Use `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- Scorers: Use `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- Evaluation: Use `mlflow.genai.evaluate()` - NOT custom evaluation loops
- Scripts: Use the provided `scripts/` directory templates - NOT custom `evaluation/` directories
Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.
If you're tempted to create `evaluation/eval_dataset.py` or similar custom files, STOP. Use `scripts/create_dataset_template.py` instead.
## Table of Contents

## Quick Start
⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.
Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing
Evaluation workflow in 4 steps (each uses MLflow APIs):
- Understand: Run agent, inspect traces, understand purpose
- Scorers: Select and register scorers for quality criteria
- Dataset: ALWAYS discover existing datasets first, only create new if needed
- Evaluate: Run agent on dataset, apply scorers, analyze results
## Command Conventions
Always use `uv run` for MLflow and Python commands:

```bash
uv run mlflow --version       # MLflow CLI commands
uv run python scripts/xxx.py  # Python script execution
uv run python -c "..."        # Python one-liners
```

This ensures commands run in the correct environment with proper dependencies.
CRITICAL: Separate stderr from stdout when capturing CLI output:
When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:
```bash
# Save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```
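The same separation can be sketched in Python. The snippet below simulates a CLI that writes JSON to stdout and a log line to stderr, then captures the two streams independently; the simulated command is illustrative only, not a real MLflow CLI call:

```python
import json
import subprocess
import sys

# Simulate a CLI that prints structured JSON to stdout and a log line to stderr.
code = 'import sys; print(\'{"passed": 4}\'); print("evaluating...", file=sys.stderr)'
proc = subprocess.run(
    [sys.executable, "-c", code],
    capture_output=True,  # captures stdout and stderr as separate streams
    text=True,
)

results = json.loads(proc.stdout)  # stdout parses cleanly as JSON
logs = proc.stderr                 # stderr kept aside for debugging
```

If the two streams were merged (`2>&1`), the log line would corrupt the JSON and `json.loads` would fail - which is exactly why the redirection above keeps them apart.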
## Documentation Access Protocol
All MLflow documentation must be accessed through llms.txt:

1. Start at: https://mlflow.org/docs/latest/llms.txt
2. Query llms.txt for your topic with a specific prompt
3. If llms.txt references another doc, use WebFetch with that URL
4. Do not use WebSearch - use WebFetch with llms.txt first

This applies to all steps, especially:

- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand the mlflow.genai.evaluate API)
## Discovering Agent Structure
Each project has a unique structure. Use dynamic exploration instead of assumptions:
### Find Agent Entry Points
```bash
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
```
### Understand Project Structure
```bash
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 -E "scripts|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```
## Setup Overview
Before evaluation, complete these three setup steps:
1. Install MLflow (version >=3.8.0)
   - Guide: Follow `references/setup-guide.md` Steps 1-2
2. Configure environment (tracking URI and experiment)
   - Guide: Follow `references/setup-guide.md`
3. Integrate tracing (autolog and @mlflow.trace decorators)
   - ⚠️ MANDATORY: Follow `references/tracing-integration.md` - the authoritative tracing guide
   - ✓ VERIFY: Run `scripts/validate_agent_tracing.py` after implementing
⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.
Checkpoint - verify before proceeding:
- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and @mlflow.trace decorators added
- Test run creates a trace (verify trace ID is not None)
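The environment-variable part of this checklist can be sketched in plain Python. This is a minimal stand-in for illustration; the real `scripts/validate_environment.py` presumably checks more (install version, connectivity):

```python
import os

REQUIRED_VARS = ["MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID"]

def missing_env_vars(env):
    """Return the names of required MLflow variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example: only the tracking URI is set, so the experiment ID is reported missing.
missing = missing_env_vars({"MLFLOW_TRACKING_URI": "http://localhost:5000"})
```

Running `missing_env_vars(os.environ)` before any expensive step gives a fast, actionable failure instead of a cryptic error deep inside an evaluation run.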
Validation scripts:
```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```
## Evaluation Workflow
### Step 1: Understand Agent Purpose
1. Invoke the agent with sample input
2. Inspect the MLflow trace (especially LLM prompts describing agent purpose)
3. Print your understanding and ask the user for verification
4. Wait for confirmation before proceeding
### Step 2: Define Quality Scorers
1. Check registered scorers in your experiment:

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
   ```

   IMPORTANT: If there are registered scorers in the experiment, they must be used for evaluation.

2. Select additional built-in scorers that apply to the agent

   See `references/scorers.md` for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.

3. Create additional custom scorers as needed

   If needed, create additional scorers using the `make_judge()` API. See `references/scorers.md` for how to create custom scorers and `references/scorers-constraints.md` for best practices.

4. REQUIRED: Register new scorers before evaluation using the Python API:

   ```python
   from mlflow.genai.judges import make_judge
   from mlflow.genai.scorers import BuiltinScorerName
   import os

   scorer = make_judge(...)  # or: scorer = BuiltinScorerName()
   scorer.register()
   ```

   IMPORTANT: See `references/scorers.md` → "Model Selection for Scorers" to configure the `model` parameter of scorers before registration.

   ⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in `mlflow scorers list` and won't be reusable.

5. Verify registration:

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID  # Should show your scorers
   ```
### Step 3: Prepare Evaluation Dataset
ALWAYS discover existing datasets first to prevent duplicate work:

1. Run dataset discovery (mandatory):

   ```bash
   uv run python scripts/list_datasets.py                # Lists, compares, recommends datasets
   uv run python scripts/list_datasets.py --format json  # Machine-readable output
   uv run python scripts/list_datasets.py --help         # All options
   ```

2. Present findings to the user:

   - Show all discovered datasets with their characteristics (size, topics covered)
   - If datasets are found, highlight the most relevant options based on agent type

3. Ask the user about existing datasets:

   - "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
   - If yes: Ask which dataset to use and record the dataset name
   - If no: Proceed to step 4

4. Create a new dataset only if the user declined the existing ones:

   ```bash
   # Generates a dataset creation script from a test cases file
   uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
   uv run python scripts/create_dataset_template.py --help  # See all options
   ```

   Generated code uses `mlflow.genai.datasets` APIs - review and execute the script.

IMPORTANT: Do not skip dataset discovery. Always run `list_datasets.py` first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.

For the complete dataset guide, see `references/dataset-preparation.md`.

Checkpoint - verify before proceeding:

- Scorers have been registered
- Dataset has been created
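To make the discovery step concrete, here is a hedged sketch of consuming the `--format json` output. The field names (`name`, `num_records`) are assumptions about the script's schema, not its documented output - check `list_datasets.py --help` for the real shape:

```python
import json

# Assumed output shape of `uv run python scripts/list_datasets.py --format json`.
raw = '''[
  {"name": "support_agent_eval", "num_records": 40},
  {"name": "smoke_test_cases", "num_records": 5}
]'''

datasets = json.loads(raw)

# Surface the largest dataset first so the user can pick before any new one is created.
ranked = sorted(datasets, key=lambda d: d["num_records"], reverse=True)
top_name = ranked[0]["name"] if ranked else None
```

In practice `raw` would come from `results.json` written with the stderr/stdout separation described under Command Conventions.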
### Step 4: Run Evaluation
1. Generate and run the evaluation script:

   ```bash
   # Generate evaluation script (specify module and entry point)
   uv run python scripts/run_evaluation_template.py \
       --module mlflow_agent.agent \
       --entry-point run_agent

   # Review the generated script, then execute it
   uv run python run_agent_evaluation.py
   ```

   The generated script creates a wrapper function that:

   - Accepts keyword arguments matching the dataset's input keys
   - Provides any additional arguments the agent needs (like `llm_provider`)
   - Runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
   - Saves results to `evaluation_results.csv`

   ⚠️ CRITICAL: The wrapper signature must match the dataset input keys

   MLflow calls `predict_fn(**inputs)` - it unpacks the inputs dict as keyword arguments.

   | Dataset Record | MLflow Calls | predict_fn Must Be |
   |---|---|---|
   | `{"inputs": {"query": "..."}}` | `predict_fn(query="...")` | `def wrapper(query): ...` |
   | `{"inputs": {"question": "...", "context": "..."}}` | `predict_fn(question="...", context="...")` | `def wrapper(question, context): ...` |

   Common Mistake (WRONG):

   ```python
   def wrapper(inputs):  # ❌ WRONG - inputs is NOT a dict
       return agent(inputs["query"])
   ```

2. Analyze results:

   ```bash
   # Pattern detection, failure analysis, recommendations
   uv run python scripts/analyze_results.py evaluation_results.csv
   ```

   Generates `evaluation_report.md` with pass rates and improvement suggestions.
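The predict_fn contract can be demonstrated without MLflow at all. The snippet below uses a hypothetical stand-in agent to show why the wrapper's parameter names must match the dataset's input keys:

```python
# Hypothetical stand-in for the real agent entry point.
def agent(query: str) -> str:
    return f"answer to: {query}"

# Correct wrapper: its parameter name matches the dataset input key "query".
def wrapper(query: str) -> str:
    return agent(query)

# mlflow.genai.evaluate effectively does predict_fn(**inputs) for each record:
record = {"inputs": {"query": "What is MLflow?"}}
output = wrapper(**record["inputs"])  # same as wrapper(query="What is MLflow?")
```

A wrapper declared as `def wrapper(inputs):` would instead fail here with a `TypeError` about an unexpected keyword argument `query`, which is the common mistake called out above.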
## References
Detailed guides in `references/` (load as needed):

- `setup-guide.md` - Environment setup (MLflow install, tracking URI configuration)
- `tracing-integration.md` - Authoritative tracing guide (autolog, decorators, session tracking, verification)
- `dataset-preparation.md` - Dataset schema, APIs, creation, Unity Catalog
- `scorers.md` - Built-in vs custom scorers, registration, testing
- `scorers-constraints.md` - CLI requirements for custom scorers (yes/no format, templates)
- `troubleshooting.md` - Common errors by phase with solutions

Scripts are self-documenting - run with `--help` for usage details.