# Agent Evaluation with MLflow

Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.

## ⛔ CRITICAL: Must Use MLflow APIs

DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

- Datasets: Use `mlflow.genai.datasets.create_dataset()` - NOT custom test case files
- Scorers: Use `mlflow.genai.scorers` and `mlflow.genai.judges.make_judge()` - NOT custom scorer functions
- Evaluation: Use `mlflow.genai.evaluate()` - NOT custom evaluation loops
- Scripts: Use the provided `scripts/` directory templates - NOT custom `evaluation/` directories

Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.

If you're tempted to create `evaluation/eval_dataset.py` or similar custom files, STOP. Use `scripts/create_dataset_template.py` instead.
## Quick Start

⚠️ REMINDER: Use MLflow APIs from this skill. Do not create custom evaluation frameworks.

Setup (prerequisite): Install MLflow 3.8+, configure environment, integrate tracing.

Evaluation workflow in 4 steps (each uses MLflow APIs):

1. Understand: Run the agent, inspect traces, understand its purpose
2. Scorers: Select and register scorers for quality criteria
3. Dataset: ALWAYS discover existing datasets first; only create new ones if needed
4. Evaluate: Run the agent on the dataset, apply scorers, analyze results

## Command Conventions

Always use `uv run` for MLflow and Python commands:

```bash
uv run mlflow --version          # MLflow CLI commands
uv run python scripts/xxx.py     # Python script execution
uv run python -c "..."           # Python one-liners
```

This ensures commands run in the correct environment with proper dependencies.

CRITICAL: Separate stderr from stdout when capturing CLI output.

When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:

```bash
# Save both streams separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```
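The same stream separation applies when driving a CLI from Python. The snippet below is illustrative only: the child command is a stand-in for any CLI that mixes logs with JSON output.

```python
import json
import subprocess
import sys

# Illustrative stand-in for a CLI that writes JSON to stdout and logs to stderr.
child = (
    'import sys, json; '
    'print(json.dumps({"ok": True})); '
    'print("log line", file=sys.stderr)'
)
proc = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True,  # keeps the two streams separate, like `> f 2> g`
    text=True,
    check=True,
)
data = json.loads(proc.stdout)  # structured output parses cleanly
print(data["ok"])               # True
print(proc.stderr.strip())      # log line
```

If the two streams were merged (`2>&1`), the `json.loads` call would fail on the first log line - the same failure mode the redirection convention above prevents.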

## Documentation Access Protocol

All MLflow documentation must be accessed through llms.txt:

1. Start at: https://mlflow.org/docs/latest/llms.txt
2. Query llms.txt for your topic with a specific prompt
3. If llms.txt references another doc, use WebFetch with that URL
4. Do not use WebSearch - use WebFetch with llms.txt first

This applies to all steps, especially:

- Dataset creation (read the GenAI dataset docs from llms.txt)
- Scorer registration (check the MLflow docs for scorer APIs)
- Evaluation execution (understand the mlflow.genai.evaluate API)

## Discovering Agent Structure

Each project has a unique structure. Use dynamic exploration instead of assumptions:

### Find Agent Entry Points

```bash
# Search for main agent functions
grep -r "def .*agent" . --include="*.py"
grep -rE "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -rE "@app\.(get|post)" . --include="*.py"  # FastAPI/Flask
grep -r "def .*route" . --include="*.py"
```

### Understand Project Structure

```bash
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -E -A 5 "scripts|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```
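The same discovery can be scripted when shell tools aren't available. This is a heuristic sketch: the regex and the throwaway demo file are illustrative, not part of MLflow.

```python
import pathlib
import re
import tempfile

# Heuristic: match function definitions that look like agent entry points,
# mirroring the grep patterns above.
ENTRY_RE = re.compile(r"^\s*def\s+(run|stream|handle|process|\w*agent\w*)\s*\(", re.M)

def find_entry_points(root):
    """Return (path, function name) pairs for likely entry points under root."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits.extend((str(path), name) for name in ENTRY_RE.findall(text))
    return hits

# Demo on a throwaway file rather than a real project:
with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "agent.py").write_text("def run_agent(query):\n    pass\n")
    found = find_entry_points(tmp)
print([name for _, name in found])  # ['run_agent']
```

Treat matches as candidates to inspect, not definitive answers - naming conventions vary by project.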

## Setup Overview

Before evaluation, complete these three setup steps:

1. Install MLflow (version >=3.8.0)
2. Configure environment (tracking URI and experiment)
   - Guide: Follow `references/setup-guide.md` Steps 1-2
3. Integrate tracing (autolog and `@mlflow.trace` decorators)
   - ⚠️ MANDATORY: Follow `references/tracing-integration.md` - the authoritative tracing guide
   - VERIFY: Run `scripts/validate_agent_tracing.py` after implementing

⚠️ Tracing must work before evaluation. If tracing fails, stop and troubleshoot.

Checkpoint - verify before proceeding:

- MLflow >=3.8.0 installed
- MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- Autolog enabled and `@mlflow.trace` decorators added
- A test run creates a trace (verify the trace ID is not None)

Validation scripts:

```bash
uv run python scripts/validate_environment.py  # Check MLflow install, env vars, connectivity
uv run python scripts/validate_auth.py         # Test authentication before expensive operations
```
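A minimal pre-flight for the two environment variables in the checklist can look like the sketch below. It is a convenience check only and does not replace `scripts/validate_environment.py`, which also verifies the install and connectivity.

```python
import os

# Variable names come from the checkpoint above.
def missing_env_vars(required=("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_ID")):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

print(missing_env_vars())  # lists any variables still unset
```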

## Evaluation Workflow

### Step 1: Understand Agent Purpose

1. Invoke the agent with a sample input
2. Inspect the MLflow trace (especially LLM prompts describing the agent's purpose)
3. Print your understanding and ask the user for verification
4. Wait for confirmation before proceeding

### Step 2: Define Quality Scorers

1. Check registered scorers in your experiment:

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
   ```

   IMPORTANT: If there are registered scorers in the experiment, they must be used for evaluation.

2. Select additional built-in scorers that apply to the agent.

   See `references/scorers.md` for the built-in scorers. Select any that are useful for assessing the agent's quality and that are not already registered.

3. Create additional custom scorers as needed.

   If needed, create additional scorers using the `make_judge()` API. See `references/scorers.md` for how to create custom scorers and `references/scorers-constraints.md` for best practices.

4. REQUIRED: Register new scorers before evaluation using the Python API:

   ```python
   from mlflow.genai.judges import make_judge
   from mlflow.genai.scorers import BuiltinScorerName
   import os

   scorer = make_judge(...)  # Or, scorer = BuiltinScorerName()
   scorer.register()
   ```

   IMPORTANT: See `references/scorers.md` → "Model Selection for Scorers" to configure the `model` parameter of scorers before registration.

   ⚠️ Scorers MUST be registered before evaluation. Inline scorers that aren't registered won't appear in `mlflow scorers list` and won't be reusable.

5. Verify registration:

   ```bash
   uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID  # Should show your scorers
   ```

### Step 3: Prepare Evaluation Dataset

ALWAYS discover existing datasets first to prevent duplicate work:

1. Run dataset discovery (mandatory):

   ```bash
   uv run python scripts/list_datasets.py  # Lists, compares, recommends datasets
   uv run python scripts/list_datasets.py --format json  # Machine-readable output
   uv run python scripts/list_datasets.py --help  # All options
   ```

2. Present findings to the user:
   - Show all discovered datasets with their characteristics (size, topics covered)
   - If datasets are found, highlight the most relevant options based on the agent type

3. Ask the user about existing datasets:
   - "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
   - If yes: Ask which dataset to use and record the dataset name
   - If no: Proceed to step 4

4. Create a new dataset only if the user declined the existing ones:

   ```bash
   # Generates a dataset creation script from a test cases file
   uv run python scripts/create_dataset_template.py --test-cases-file test_cases.txt
   uv run python scripts/create_dataset_template.py --help  # See all options
   ```

   The generated code uses `mlflow.genai.datasets` APIs - review and execute the script.

IMPORTANT: Do not skip dataset discovery. Always run `list_datasets.py` first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.

For the complete dataset guide, see `references/dataset-preparation.md`.

Checkpoint - verify before proceeding:

- Scorers have been registered
- Dataset has been created

### Step 4: Run Evaluation

1. Generate and run the evaluation script:

   ```bash
   # Generate evaluation script (specify module and entry point)
   uv run python scripts/run_evaluation_template.py \
     --module mlflow_agent.agent \
     --entry-point run_agent

   # Review the generated script, then execute it
   uv run python run_agent_evaluation.py
   ```

   The generated script creates a wrapper function that:
   - Accepts keyword arguments matching the dataset's input keys
   - Provides any additional arguments the agent needs (like `llm_provider`)
   - Runs `mlflow.genai.evaluate(data=df, predict_fn=wrapper, scorers=registered_scorers)`
   - Saves results to `evaluation_results.csv`

   ⚠️ CRITICAL: The wrapper signature must match the dataset input keys.

   MLflow calls `predict_fn(**inputs)` - it unpacks the inputs dict as keyword arguments.

   | Dataset Record | MLflow Calls | predict_fn Must Be |
   |---|---|---|
   | `{"inputs": {"query": "..."}}` | `predict_fn(query="...")` | `def wrapper(query):` |
   | `{"inputs": {"question": "...", "context": "..."}}` | `predict_fn(question="...", context="...")` | `def wrapper(question, context):` |

   Common Mistake (WRONG):

   ```python
   def wrapper(inputs):  # ❌ WRONG - inputs is NOT a dict
       return agent(inputs["query"])
   ```

2. Analyze the results:

   ```bash
   # Pattern detection, failure analysis, recommendations
   uv run python scripts/analyze_results.py evaluation_results.csv
   ```

   Generates `evaluation_report.md` with pass rates and improvement suggestions.
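The wrapper-signature rule from step 1 can be sketched with plain Python: the `agent` function below is a hypothetical stand-in for your real entry point, and only the keyword-unpacking behavior matters.

```python
# Hypothetical stand-in for a real agent entry point.
def agent(query: str) -> str:
    return f"answer to: {query}"

# Correct: parameter names mirror the dataset record's input keys.
def wrapper(query):
    return agent(query)

record = {"inputs": {"query": "What does this agent do?"}}
# The evaluation harness invokes predict_fn(**inputs):
result = wrapper(**record["inputs"])
print(result)  # answer to: What does this agent do?

# The mistaken form (a single `inputs` parameter) fails immediately,
# because the dict is unpacked into keyword arguments it doesn't accept:
try:
    (lambda inputs: agent(inputs["query"]))(**record["inputs"])
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```

The `TypeError` in the second call is the same error you will see from `mlflow.genai.evaluate()` if the wrapper's parameters don't match the dataset's input keys.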

## References

Detailed guides in `references/` (load as needed):

- setup-guide.md - Environment setup (MLflow install, tracking URI configuration)
- tracing-integration.md - Authoritative tracing guide (autolog, decorators, session tracking, verification)
- dataset-preparation.md - Dataset schema, APIs, creation, Unity Catalog
- scorers.md - Built-in vs. custom scorers, registration, testing
- scorers-constraints.md - CLI requirements for custom scorers (yes/no format, templates)
- troubleshooting.md - Common errors by phase with solutions

Scripts are self-documenting - run them with `--help` for usage details.