mcp-builder


MCP Server Evaluation Guide


Overview


This document provides guidance on creating comprehensive evaluations for MCP servers. Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions using only the tools provided.


Quick Reference


Evaluation Requirements


  • Create 10 human-readable questions
  • Questions must be READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE
  • Each question requires multiple tool calls (potentially dozens)
  • Answers must be single, verifiable values
  • Answers must be STABLE (won't change over time)

Output Format


```xml
<evaluation>
   <qa_pair>
      <question>Your question here</question>
      <answer>Single verifiable answer</answer>
   </qa_pair>
</evaluation>
```

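This structure can be checked mechanically before handing the file to the harness. A minimal sketch in Python, assuming the `<evaluation>`/`<qa_pair>` layout above and the 10-question requirement:

```python
# Validate an evaluation file against the schema sketched above.
# Assumes exactly 10 <qa_pair> elements, each with a non-empty
# <question> and <answer>.
import xml.etree.ElementTree as ET

def validate_evaluation(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks valid."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "evaluation":
        problems.append(f"root element is <{root.tag}>, expected <evaluation>")
    pairs = root.findall("qa_pair")
    if len(pairs) != 10:
        problems.append(f"expected 10 qa_pairs, found {len(pairs)}")
    for i, pair in enumerate(pairs):
        for field in ("question", "answer"):
            text = pair.findtext(field) or ""
            if not text.strip():
                problems.append(f"qa_pair {i} has an empty <{field}>")
    return problems
```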

Purpose of Evaluations


The measure of quality of an MCP server is NOT how well or comprehensively the server implements tools, but how well these implementations (input/output schemas, docstrings/descriptions, functionality) enable LLMs with no other context and access ONLY to the MCP servers to answer realistic and difficult questions.

Evaluation Overview


Create 10 human-readable questions requiring ONLY READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE, and IDEMPOTENT operations to answer. Each question should be:
  • Realistic
  • Clear and concise
  • Unambiguous
  • Complex, requiring potentially dozens of tool calls or steps
  • Answerable with a single, verifiable value that you identify in advance

Question Guidelines


Core Requirements


  1. Questions MUST be independent
    • Each question should NOT depend on the answer to any other question
    • Should not assume prior write operations from processing another question
  2. Questions MUST require ONLY NON-DESTRUCTIVE AND IDEMPOTENT tool use
    • Should not instruct or require modifying state to arrive at the correct answer
  3. Questions must be REALISTIC, CLEAR, CONCISE, and COMPLEX
    • Must require another LLM to use multiple (potentially dozens of) tools or steps to answer

Complexity and Depth


  1. Questions must require deep exploration
    • Consider multi-hop questions requiring multiple sub-questions and sequential tool calls
    • Each step should benefit from information found in previous steps
  2. Questions may require extensive paging
    • May need paging through multiple pages of results
    • May require querying old data (1-2 years out-of-date) to find niche information
    • The questions must be DIFFICULT
  3. Questions must require deep understanding
    • Rather than surface-level knowledge
    • May pose complex ideas as True/False questions requiring evidence
    • May use multiple-choice format where LLM must search different hypotheses
  4. Questions must not be solvable with straightforward keyword search
    • Do not include specific keywords from the target content
    • Use synonyms, related concepts, or paraphrases
    • Require multiple searches, analyzing multiple related items, extracting context, then deriving the answer

Tool Testing


  1. Questions should stress-test tool return values
    • May elicit tools returning large JSON objects or lists, overwhelming the LLM
    • Should require understanding multiple modalities of data:
      • IDs and names
      • Timestamps and datetimes (months, days, years, seconds)
      • File IDs, names, extensions, and mimetypes
      • URLs, GIDs, etc.
    • Should probe the tool's ability to return all useful forms of data
  2. Questions should MOSTLY reflect real human use cases
    • The kinds of information retrieval tasks that HUMANS assisted by an LLM would care about
  3. Questions may require dozens of tool calls
    • This challenges LLMs with limited context
    • Encourages MCP server tools to reduce information returned
  4. Include ambiguous questions
    • May be ambiguous OR require difficult decisions on which tools to call
    • Force the LLM to potentially make mistakes or misinterpret
    • Ensure that despite AMBIGUITY, there is STILL A SINGLE VERIFIABLE ANSWER

Stability


  1. Questions must be designed so the answer DOES NOT CHANGE
    • Do not ask questions that rely on "current state" which is dynamic
    • For example, do not count:
      • Number of reactions to a post
      • Number of replies to a thread
      • Number of members in a channel
  2. DO NOT let the MCP server RESTRICT the kinds of questions you create
    • Create challenging and complex questions
    • Some may not be solvable with the available MCP server tools
    • Questions may require specific output formats (datetime vs. epoch time, JSON vs. MARKDOWN)
    • Questions may require dozens of tool calls to complete

Answer Guidelines


Verification


  1. Answers must be VERIFIABLE via direct string comparison
    • If the answer can be re-written in many formats, clearly specify the output format in the QUESTION
    • Examples: "Use YYYY/MM/DD.", "Respond True or False.", "Answer A, B, C, or D and nothing else."
    • Answer should be a single VERIFIABLE value such as:
      • User ID, user name, display name, first name, last name
      • Channel ID, channel name
      • Message ID, string
      • URL, title
      • Numerical quantity
      • Timestamp, datetime
      • Boolean (for True/False questions)
      • Email address, phone number
      • File ID, file name, file extension
      • Multiple choice answer
    • Answers must not require special formatting or complex, structured output
    • Answer will be verified using DIRECT STRING COMPARISON
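Because grading is a direct string comparison, the check itself stays trivial. A minimal sketch (trimming whitespace and case-folding here are assumptions; a strict harness may compare the raw strings):

```python
# Grade one answer by direct string comparison.
# Normalization (strip + casefold) is an assumption, not a harness guarantee.
def answer_matches(expected: str, actual: str) -> bool:
    return expected.strip().casefold() == actual.strip().casefold()
```

This is also why the question must pin down the output format: `answer_matches("2024/03/15", "March 15, 2024")` fails even though both describe the same date.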

Readability


  1. Answers should generally prefer HUMAN-READABLE formats
    • Examples: names, first name, last name, datetime, file name, message string, URL, yes/no, true/false, a/b/c/d
    • Rather than opaque IDs (though IDs are acceptable)
    • The VAST MAJORITY of answers should be human-readable

Stability


  1. Answers must be STABLE/STATIONARY
    • Look at old content (e.g., conversations that have ended, projects that have launched, questions answered)
    • Create QUESTIONS based on "closed" concepts that will always return the same answer
    • Questions may ask to consider a fixed time window to insulate from non-stationary answers
    • Rely on context UNLIKELY to change
    • Example: if finding a paper name, be SPECIFIC enough so answer is not confused with papers published later
  2. Answers must be CLEAR and UNAMBIGUOUS
    • Questions must be designed so there is a single, clear answer
    • Answer can be derived from using the MCP server tools

Diversity


  1. Answers must be DIVERSE
    • Answer should be a single VERIFIABLE value in diverse modalities and formats
    • User concept: user ID, user name, display name, first name, last name, email address, phone number
    • Channel concept: channel ID, channel name, channel topic
    • Message concept: message ID, message string, timestamp, month, day, year
  2. Answers must NOT be complex structures
    • Not a list of values
    • Not a complex object
    • Not a list of IDs or strings
    • Not natural language text
    • UNLESS the answer can be straightforwardly verified using DIRECT STRING COMPARISON
    • And can be realistically reproduced
    • It must be unlikely that an LLM would return the same list in a different order or format

Evaluation Process


Step 1: Documentation Inspection


Read the documentation of the target API to understand:
  • Available endpoints and functionality
  • If ambiguity exists, fetch additional information from the web
  • Parallelize this step AS MUCH AS POSSIBLE
  • Ensure each subagent is ONLY examining documentation from the file system or on the web

Step 2: Tool Inspection


List the tools available in the MCP server:
  • Inspect the MCP server directly
  • Understand input/output schemas, docstrings, and descriptions
  • WITHOUT calling the tools themselves at this stage

Step 3: Developing Understanding


Repeat steps 1 & 2 until you have a good understanding:
  • Iterate multiple times
  • Consider the kinds of tasks to create
  • Refine your understanding
  • At NO stage should you READ the code of the MCP server implementation itself
  • Use your intuition and understanding to create reasonable, realistic, but VERY challenging tasks

Step 4: Read-Only Content Inspection


After understanding the API and tools, USE the MCP server tools:
  • Inspect content using READ-ONLY and NON-DESTRUCTIVE operations ONLY
  • Goal: identify specific content (e.g., users, channels, messages, projects, tasks) for creating realistic questions
  • Should NOT call any tools that modify state
  • Will NOT read the code of the MCP server implementation itself
  • Parallelize this step with individual sub-agents pursuing independent explorations
  • Ensure each subagent is only performing READ-ONLY, NON-DESTRUCTIVE, and IDEMPOTENT operations
  • BE CAREFUL: SOME TOOLS may return LOTS OF DATA which would cause you to run out of CONTEXT
  • Make INCREMENTAL, SMALL, AND TARGETED tool calls for exploration
  • In all tool call requests, use the `limit` parameter to limit results (<10)
  • Use pagination
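The incremental-exploration advice above can be sketched as a capped pagination loop. The tool name (`list_messages`), the `limit`/`cursor` parameters, and the response shape are assumptions about a generic listing tool, not any specific server:

```python
# Explore a listing tool in small, targeted pages so large responses
# cannot exhaust the context window.
def explore(call_tool, max_pages: int = 3, limit: int = 5):
    items, cursor = [], None
    for _ in range(max_pages):  # hard cap: never page indefinitely
        resp = call_tool("list_messages", {"limit": limit, "cursor": cursor})
        items.extend(resp["items"])
        cursor = resp.get("next_cursor")
        if cursor is None:  # server reports no further pages
            break
    return items
```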

Step 5: Task Generation


After inspecting the content, create 10 human-readable questions:
  • An LLM should be able to answer these with the MCP server
  • Follow all question and answer guidelines above

Output Format


Each QA pair consists of a question and an answer. The output should be an XML file with this structure:
```xml
<evaluation>
   <qa_pair>
      <question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
      <answer>Website Redesign</answer>
   </qa_pair>
   <qa_pair>
      <question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
      <answer>sarah_dev</answer>
   </qa_pair>
   <qa_pair>
      <question>Look for pull requests that modified files in the /api directory and were merged between January 1 and January 31, 2024. How many different contributors worked on these PRs?</question>
      <answer>7</answer>
   </qa_pair>
   <qa_pair>
      <question>Find the repository with the most stars that was created before 2023. What is the repository name?</question>
      <answer>data-pipeline</answer>
   </qa_pair>
</evaluation>
```

Evaluation Examples


Good Questions


Example 1: Multi-hop question requiring deep exploration (GitHub MCP)
```xml
<qa_pair>
   <question>Find the repository that was archived in Q3 2023 and had previously been the most forked project in the organization. What was the primary programming language used in that repository?</question>
   <answer>Python</answer>
</qa_pair>
```
This question is good because:
  • Requires multiple searches to find archived repositories
  • Needs to identify which had the most forks before archival
  • Requires examining repository details for the language
  • Answer is a simple, verifiable value
  • Based on historical (closed) data that won't change
Example 2: Requires understanding context without keyword matching (Project Management MCP)
```xml
<qa_pair>
   <question>Locate the initiative focused on improving customer onboarding that was completed in late 2023. The project lead created a retrospective document after completion. What was the lead's role title at that time?</question>
   <answer>Product Manager</answer>
</qa_pair>
```
This question is good because:
  • Doesn't use specific project name ("initiative focused on improving customer onboarding")
  • Requires finding completed projects from specific timeframe
  • Needs to identify the project lead and their role
  • Requires understanding context from retrospective documents
  • Answer is human-readable and stable
  • Based on completed work (won't change)
Example 3: Complex aggregation requiring multiple steps (Issue Tracker MCP)
```xml
<qa_pair>
   <question>Among all bugs reported in January 2024 that were marked as critical priority, which assignee resolved the highest percentage of their assigned bugs within 48 hours? Provide the assignee's username.</question>
   <answer>alex_eng</answer>
</qa_pair>
```
This question is good because:
  • Requires filtering bugs by date, priority, and status
  • Needs to group by assignee and calculate resolution rates
  • Requires understanding timestamps to determine 48-hour windows
  • Tests pagination (potentially many bugs to process)
  • Answer is a single username
  • Based on historical data from specific time period
Example 4: Requires synthesis across multiple data types (CRM MCP)
```xml
<qa_pair>
   <question>Find the account that upgraded from the Starter to Enterprise plan in Q4 2023 and had the highest annual contract value. What industry does this account operate in?</question>
   <answer>Healthcare</answer>
</qa_pair>
```
This question is good because:
  • Requires understanding subscription tier changes
  • Needs to identify upgrade events in specific timeframe
  • Requires comparing contract values
  • Must access account industry information
  • Answer is simple and verifiable
  • Based on completed historical transactions

Poor Questions


Example 1: Answer changes over time
```xml
<qa_pair>
   <question>How many open issues are currently assigned to the engineering team?</question>
   <answer>47</answer>
</qa_pair>
```
This question is poor because:
  • The answer will change as issues are created, closed, or reassigned
  • Not based on stable/stationary data
  • Relies on "current state" which is dynamic
Example 2: Too easy with keyword search
```xml
<qa_pair>
   <question>Find the pull request with title "Add authentication feature" and tell me who created it.</question>
   <answer>developer123</answer>
</qa_pair>
```
This question is poor because:
  • Can be solved with a straightforward keyword search for exact title
  • Doesn't require deep exploration or understanding
  • No synthesis or analysis needed
Example 3: Ambiguous answer format
```xml
<qa_pair>
   <question>List all the repositories that have Python as their primary language.</question>
   <answer>repo1, repo2, repo3, data-pipeline, ml-tools</answer>
</qa_pair>
```
This question is poor because:
  • Answer is a list that could be returned in any order
  • Difficult to verify with direct string comparison
  • LLM might format differently (JSON array, comma-separated, newline-separated)
  • Better to ask for a specific aggregate (count) or superlative (most stars)

Verification Process

验证流程

After creating evaluations:
  1. Examine the XML file to understand the schema
  2. Load each task instruction and, in parallel, attempt to solve each task directly using the MCP server tools to identify the correct answer
  3. Flag any tasks that require WRITE or DESTRUCTIVE operations
  4. Accumulate all CORRECT answers and replace any incorrect answers in the document
  5. Remove any `<qa_pair>` that requires WRITE or DESTRUCTIVE operations
Remember to parallelize solving tasks to avoid running out of context, then accumulate all answers and update the file at the end.

Tips for Creating Quality Evaluations


  1. Think Hard and Plan Ahead before generating tasks
  2. Parallelize Where Opportunity Arises to speed up the process and manage context
  3. Focus on Realistic Use Cases that humans would actually want to accomplish
  4. Create Challenging Questions that test the limits of the MCP server's capabilities
  5. Ensure Stability by using historical data and closed concepts
  6. Verify Answers by solving the questions using the MCP server tools
  7. Iterate and Refine based on what you learn during the process


Running Evaluations


After creating the evaluation file, use the provided evaluation harness to test the MCP server.

Setup


  1. Install Dependencies
    ```bash
    pip install -r scripts/requirements.txt
    ```
    Or install manually:
    ```bash
    pip install anthropic mcp
    ```
  2. Set API Key
    ```bash
    export ANTHROPIC_API_KEY=your_api_key_here
    ```

Evaluation File Format


Evaluation files use XML format with `<qa_pair>` elements:

```xml
<evaluation>
   <qa_pair>
      <question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
      <answer>Website Redesign</answer>
   </qa_pair>
   <qa_pair>
      <question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
      <answer>sarah_dev</answer>
   </qa_pair>
</evaluation>
```
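Reading the pairs back out takes only the standard library; a sketch (the harness's own loader may differ):

```python
# Parse <qa_pair> elements from an evaluation file into (question, answer) tuples.
import xml.etree.ElementTree as ET

def load_qa_pairs(xml_text: str) -> list[tuple[str, str]]:
    root = ET.fromstring(xml_text)
    return [
        (pair.findtext("question", "").strip(), pair.findtext("answer", "").strip())
        for pair in root.findall("qa_pair")
    ]
```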

Running Evaluations


The evaluation script (`scripts/evaluation.py`) supports three transport types:
Important:
  • stdio transport: The evaluation script automatically launches and manages the MCP server process for you. Do not run the server manually.
  • sse/http transports: You must start the MCP server separately before running the evaluation. The script connects to the already-running server at the specified URL.

1. Local STDIO Server


For locally-run MCP servers (script launches the server automatically):

```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a my_mcp_server.py \
  evaluation.xml
```

With environment variables:

```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a my_mcp_server.py \
  -e API_KEY=abc123 \
  -e DEBUG=true \
  evaluation.xml
```

2. Server-Sent Events (SSE)


For SSE-based MCP servers (start the server first):

```bash
python scripts/evaluation.py \
  -t sse \
  -u https://example.com/mcp \
  -H "Authorization: Bearer token123" \
  -H "X-Custom-Header: value" \
  evaluation.xml
```

3. HTTP (Streamable HTTP)


For HTTP-based MCP servers (start the server first):

```bash
python scripts/evaluation.py \
  -t http \
  -u https://example.com/mcp \
  -H "Authorization: Bearer token123" \
  evaluation.xml
```

Command-Line Options


```text
usage: evaluation.py [-h] [-t {stdio,sse,http}] [-m MODEL] [-c COMMAND]
                     [-a ARGS [ARGS ...]] [-e ENV [ENV ...]] [-u URL]
                     [-H HEADERS [HEADERS ...]] [-o OUTPUT]
                     eval_file

positional arguments:
  eval_file             Path to evaluation XML file

optional arguments:
  -h, --help            Show help message
  -t, --transport       Transport type: stdio, sse, or http (default: stdio)
  -m, --model           Claude model to use (default: claude-3-7-sonnet-20250219)
  -o, --output          Output file for report (default: print to stdout)

stdio options:
  -c, --command         Command to run MCP server (e.g., python, node)
  -a, --args            Arguments for the command (e.g., server.py)
  -e, --env             Environment variables in KEY=VALUE format

sse/http options:
  -u, --url             MCP server URL
  -H, --header          HTTP headers in 'Key: Value' format
```

Output


The evaluation script generates a detailed report including:
  • Summary Statistics:
    • Accuracy (correct/total)
    • Average task duration
    • Average tool calls per task
    • Total tool calls
  • Per-Task Results:
    • Prompt and expected response
    • Actual response from the agent
    • Whether the answer was correct (✅/❌)
    • Duration and tool call details
    • Agent's summary of its approach
    • Agent's feedback on the tools
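For reference, the summary statistics reduce to simple aggregates over the per-task results. The result keys below (`correct`, `duration`, `tool_calls`) are assumed names for illustration, not the script's actual schema:

```python
# Aggregate per-task results into the summary statistics listed above.
def summarize(results: list[dict]) -> dict:
    n = len(results)
    total_calls = sum(r["tool_calls"] for r in results)
    return {
        "accuracy": sum(1 for r in results if r["correct"]) / n,
        "avg_duration": sum(r["duration"] for r in results) / n,
        "avg_tool_calls": total_calls / n,
        "total_tool_calls": total_calls,
    }
```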

Save Report to File


```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a my_server.py \
  -o evaluation_report.md \
  evaluation.xml
```

Complete Example Workflow


Here's a complete example of creating and running an evaluation:
  1. Create your evaluation file (`my_evaluation.xml`):

```xml
<evaluation>
   <qa_pair>
      <question>Find the user who created the most issues in January 2024. What is their username?</question>
      <answer>alice_developer</answer>
   </qa_pair>
   <qa_pair>
      <question>Among all pull requests merged in Q1 2024, which repository had the highest number? Provide the repository name.</question>
      <answer>backend-api</answer>
   </qa_pair>
   <qa_pair>
      <question>Find the project that was completed in December 2023 and had the longest duration from start to finish. How many days did it take?</question>
      <answer>127</answer>
   </qa_pair>
</evaluation>
```
  2. Install dependencies:

```bash
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=your_api_key
```
  3. Run the evaluation:

```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a github_mcp_server.py \
  -e GITHUB_TOKEN=ghp_xxx \
  -o github_eval_report.md \
  my_evaluation.xml
```
  4. Review the report in `github_eval_report.md` to:
    • See which questions passed/failed
    • Read the agent's feedback on your tools
    • Identify areas for improvement
    • Iterate on your MCP server design

Local MCP Server Conventions (mcp-* Repos)


Use these patterns when building or evaluating your MCP servers:
  • Provide an `mcp` subcommand that runs the stdio server (e.g., `mcp-vector-search mcp`, `mcp-ticketer mcp --path <repo>`, `mcp-browser mcp`).
  • Favor compact, consolidated tool surfaces to reduce token footprint; use pagination and compact modes when listing data.
  • Include `setup`, `install`, or `doctor` commands to validate runtime dependencies and integrate with clients.
  • Prefer `.mcp.json` entries with `type: stdio`, an explicit `command`, and minimal `env` overrides.
Example `.mcp.json` entry:

```json
{
  "mcpServers": {
    "mcp-vector-search": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "mcp-vector-search", "mcp"],
      "env": {
        "MCP_ENABLE_FILE_WATCHING": "true"
      }
    }
  }
}
```
Operational notes:
  • Use explicit env vars for adapters and database paths (e.g., `MCP_TICKETER_ADAPTER`, `KUZU_MEMORY_DB`).
  • Expose a health endpoint when running HTTP/SSE variants.
  • Allow dynamic port allocation for browser integrations (mcp-browser uses a port range).

Troubleshooting


Connection Errors


When connection errors occur:
  • STDIO: Verify the command and arguments are correct
  • SSE/HTTP: Check the URL is accessible and headers are correct
  • Ensure any required API keys are set in environment variables or headers

Low Accuracy


If many evaluations fail:
  • Review the agent's feedback for each task
  • Check if tool descriptions are clear and comprehensive
  • Verify input parameters are well-documented
  • Consider whether tools return too much or too little data
  • Ensure error messages are actionable

Timeout Issues


If tasks are timing out:
  • Use a more capable model (e.g., `claude-3-7-sonnet-20250219`)
  • Check if tools are returning too much data
  • Verify pagination is working correctly
  • Consider simplifying complex questions