eval-mcp


Evaluate MCP Tools


Tool descriptions are prompt engineering — they land directly in Claude's context window and determine whether Claude picks the right tool with the right arguments. This skill makes tool quality measurable and improvable instead of guesswork.
Three levels of testing, each building on the last:
  1. Static Analysis — deterministic schema quality checks (no Claude calls)
  2. Selection Testing — does Claude pick the right tool for each intent?
  3. Description Optimization — iterative improvement based on confusion patterns

When to Apply


  • User wants to check if their MCP tool schemas are well-designed
  • User wants to test whether Claude selects the right tools for user intents
  • User is debugging tool confusion (Claude picks the wrong tool)
  • User wants to optimize tool descriptions for better selection accuracy
  • User has finished scaffolding with `build-mcp-server` and wants to validate quality

Workflow Overview


Phase 1: Connect → Phase 2: Static Analysis → Phase 3: Selection Testing → Phase 4: Optimize
                                                            ↑__________________________|
Phase 4 loops back: apply rewrites → refetch schemas → retest → compare accuracy.

Prerequisites


  • Node.js >= 18 — required for the MCP Inspector CLI (`npx`)
  • jq — required for schema analysis scripts
  • A running MCP server — the server must respond to `tools/list`. Use `build-mcp-server/scripts/test-server.sh` to verify connectivity first.


Phase 1 — Connect & Inventory


Connect to the user's MCP server and fetch the tool schemas.

1a: Get connection details


Ask the user how to reach their server:
  • HTTP/SSE: URL (e.g., `http://localhost:3000/mcp`)
  • stdio: spawn command (e.g., `node dist/server.js`)

1b: Fetch tool schemas


```bash
bash scripts/fetch-tools.sh <url-or-command> <transport> <workspace>/tools.json
```

This calls `tools/list` via the MCP Inspector CLI and saves the schemas.

1c: Display inventory


Show a summary table:

```markdown
| # | Tool | Description (preview) | Params | Annotations |
|---|------|-----------------------|--------|-------------|
| 1 | search_issues | Search issues by keyword... | 3 | readOnlyHint |
| 2 | create_issue | Create a new issue... | 4 | |
```

Flag the tool count: 1-15 optimal, 15-30 warning, 30+ excessive (consider a search+execute pattern).

1d: Create workspace


Create a workspace at `{server-name}-eval/`, adjacent to the skill directory or in the user's project:
{server-name}-eval/
├── tools.json
├── evals/
│   └── evals.json
└── iteration-N/


Phase 2 — Static Analysis


Run deterministic quality checks — no Claude calls needed. This gives immediate feedback during development.

2a: Run analysis


```bash
bash scripts/analyze-schemas.sh <workspace>/tools.json <workspace>/iteration-N/static-analysis.json
```
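The kind of check the script runs can be sketched with jq alone. This is an illustrative sketch, not the script's actual implementation; the sample schema and both queries below are invented for the example:

```shell
# Hypothetical sketch of two static checks; analyze-schemas.sh implements the
# real rules. The sample schema below is invented for illustration.
cat > /tmp/sample-tools.json <<'EOF'
{
  "tools": [
    {
      "name": "search_issues",
      "description": "Search issues by keyword across title and body.",
      "annotations": { "readOnlyHint": true }
    },
    {
      "name": "create_issue",
      "description": ""
    }
  ]
}
EOF

# Tools whose description is missing or empty
missing_desc=$(jq -r '.tools[] | select((.description // "") == "") | .name' /tmp/sample-tools.json)

# Tools with no annotations block at all
missing_ann=$(jq -r '.tools[] | select(.annotations == null) | .name' /tmp/sample-tools.json)

echo "missing description: $missing_desc"
echo "missing annotations: $missing_ann"
```

Both checks flag `create_issue` here, which is the kind of signal that feeds the per-tool scores in 2b.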

2b: Display results


Show per-tool quality scores. Read `references/quality-checklist.md` for the criteria being checked.

```markdown
| Tool | Desc | Params | Schema | Annotations | Overall | Issues |
|------|------|--------|--------|-------------|---------|--------|
| search_issues | 3/3 | 3/3 | 2/3 | 2/3 | 2.5 | No negation |
| create_issue | 1/3 | 1/3 | 0/3 | 0/3 | 0.5 | 4 issues |
```

2c: Flag sibling pairs


If the analysis found tools with high description overlap, highlight them as confusion risks:

```markdown
Sibling Pairs (confusion risk)

| Tool A | Tool B | Overlap | Risk |
|--------|--------|---------|------|
| search_issues | list_issues | 52% | HIGH |
```
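How the overlap percentage is computed is up to analyze-schemas.sh; one plausible metric (an assumption, not necessarily what the script uses) is Jaccard similarity over the two descriptions' word sets:

```shell
# Assumed overlap metric: Jaccard similarity of word sets, as a percentage.
# The two descriptions are illustrative.
desc_a="Search issues by keyword across title and body"
desc_b="List issues filtered by status assignee or date"

# Lowercase, one word per line, deduplicated and sorted
tokens() { tr 'A-Z ' 'a-z\n' | grep -v '^$' | sort -u; }

common=$(comm -12 <(echo "$desc_a" | tokens) <(echo "$desc_b" | tokens) | wc -l)
union=$(cat <(echo "$desc_a" | tokens) <(echo "$desc_b" | tokens) | sort -u | wc -l)
overlap=$((100 * common / union))
echo "overlap: ${overlap}%"
```

Only "issues" and "by" are shared here, so the pair scores low; descriptions that share most of their vocabulary are the ones worth flagging.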

2d: Decision point


If critical issues exist (missing descriptions, zero annotations), recommend fixing them before Phase 3. Static issues create noise in selection testing — fix the obvious problems first, then measure the subtle ones.
If all tools score well, proceed to Phase 3.


Phase 3 — Selection Testing


Test whether Claude picks the right tool for each user intent. This is the core eval.

3a: Generate test intents


Read `references/eval-patterns.md` for intent generation patterns.
For each tool, generate:
  • 3 should-trigger intents — direct, implicit, and casual phrasings
  • 2 should-not-trigger intents — near-miss and keyword overlap
For each sibling pair flagged in Phase 2:
  • 1 disambiguation intent per tool — tests whether Claude picks the RIGHT sibling
Present all intents to the user for review. Ask if any should be added, removed, or modified.

3b: Save intents


Save to `{workspace}/evals/evals.json`:

```json
{
  "server_name": "my-server",
  "generated_from": "tools.json",
  "intents": [
    {
      "id": 1,
      "intent": "Are there any open bugs related to checkout?",
      "expected_tool": "search_issues",
      "type": "should_trigger",
      "target_tool": "search_issues",
      "notes": "Implicit intent — doesn't name the action"
    }
  ]
}
```

3c: Run selection tests


For each intent, spawn a subagent that receives:
  1. The full tool schemas from tools.json (formatted as they'd appear in Claude's context)
  2. The user intent text
  3. Instructions to select exactly one tool and provide arguments, or decline if no tool fits
The subagent prompt:

```
You have access to the following MCP tools:

{tool schemas as JSON}

A user sends this message:
"{intent text}"

Which tool would you call? Respond with JSON:
{
  "selected_tool": "tool_name" or null,
  "arguments": { ... } or {},
  "reasoning": "One sentence explaining your choice"
}

If no tool fits the user's request, set selected_tool to null.
Select exactly ONE tool. Do not suggest calling multiple tools.
```
Save each result to `{workspace}/iteration-N/selection/intent-{ID}/result.json`.
Launch all selection tests in parallel for efficiency.

3d: Grade results


```bash
bash scripts/grade-selection.sh \
  <workspace>/iteration-N/selection \
  <workspace>/evals/evals.json \
  <workspace>/iteration-N/benchmark.json
```
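The grading rule itself is small enough to sketch. This is an assumed mirror of what grade-selection.sh does per intent, not its actual code:

```shell
# Assumed per-intent grading rule: classify one result as correct, wrong_tool,
# false_accept (a tool chosen when none should be), or false_reject (declined
# when a tool was expected). grade-selection.sh implements the real version.
grade() {
  expected="$1"; selected="$2"; intent_type="$3"
  if [ "$intent_type" = "should_not_trigger" ]; then
    if [ "$selected" = "null" ]; then echo "correct"; else echo "false_accept"; fi
  elif [ "$selected" = "null" ]; then
    echo "false_reject"
  elif [ "$selected" = "$expected" ]; then
    echo "correct"
  else
    echo "wrong_tool"
  fi
}

g1=$(grade search_issues search_issues should_trigger)
g2=$(grade search_issues list_issues should_trigger)
g3=$(grade null create_issue should_not_trigger)
g4=$(grade search_issues null should_trigger)
echo "$g1 $g2 $g3 $g4"
```

These four categories are exactly the counts reported in the 3e summary table.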

3e: Display results



Selection Results — Iteration N


Accuracy: 82% (41/50 correct)

| Metric | Count |
|--------|-------|
| Correct | 41 |
| Wrong tool | 5 |
| False accept | 2 |
| False reject | 2 |

Per-Tool Accuracy


| Tool | Precision | Recall |
|------|-----------|--------|
| search_issues | 0.90 | 0.85 |
| create_issue | 1.00 | 1.00 |
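Precision and recall follow the standard definitions: precision is correct selections divided by the times Claude selected the tool; recall is correct selections divided by the times the eval set expected the tool. A quick sketch with hypothetical counts:

```shell
# Hypothetical counts for one tool: selected 19 times, 17 of those correct,
# and expected by the eval set 20 times.
tp=17; times_selected=19; times_expected=20

precision=$(awk -v tp="$tp" -v sel="$times_selected" 'BEGIN { printf "%.2f", tp / sel }')
recall=$(awk -v tp="$tp" -v expd="$times_expected" 'BEGIN { printf "%.2f", tp / expd }')
echo "precision=$precision recall=$recall"
```

Low precision means the tool attracts intents meant for others; low recall means its own intents leak to siblings.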

Worst Confusions


| Expected | Selected Instead | Times |
|----------|------------------|-------|
| list_issues | search_issues | 3 |
| get_user | find_user_by_email | 2 |

---

Phase 4 — Optimize & Iterate


Analyze confusion patterns and suggest description improvements. Read `references/optimization.md` for rewrite patterns.

4a: Analyze confusions


For each confused pair (from worst_confusions):
  1. Read both tools' current descriptions
  2. Identify why they're confusing (missing negation, overlapping scope, no cross-reference)
  3. Draft a specific rewrite following the disambiguation patterns in optimization.md

4b: Present suggestions



Suggested Improvements


search_issues ↔ list_issues (confused 3 times)


search_issues — Before:
Search issues by keyword.

search_issues — After:
Search issues by keyword across title and body. Returns up to `limit` results ranked by relevance. Does NOT filter by status, assignee, or date — use list_issues for structured filtering.

Reason: Adds a scope boundary and a cross-reference to disambiguate from list_issues.

Save to `{workspace}/iteration-N/suggestions.json` (format defined in optimization.md).

4c: Apply and retest


After the user applies the rewrites to their server code:
  1. Restart the server
  2. Re-run Phase 1 to refetch tools.json (descriptions may have changed)
  3. Re-run Phase 2 for updated static analysis
  4. Re-run Phase 3 into `iteration-N+1`, using the same evals.json
  5. Compare accuracy:

Iteration Comparison


| Metric | Iteration 1 | Iteration 2 | Delta |
|--------|-------------|-------------|-------|
| Accuracy | 82% | 94% | +12% |
| search↔list confusion | 3 | 0 | -3 |
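The delta row can be computed directly from the two benchmark files. A hedged sketch, assuming a top-level `accuracy` field in grade-selection.sh's output (the real format may differ; check the script):

```shell
# Assumed benchmark.json shape: { "accuracy": 0.82 }. The file paths below are
# illustrative stand-ins for iteration-1/benchmark.json and iteration-2/benchmark.json.
echo '{ "accuracy": 0.82 }' > /tmp/iter1-benchmark.json
echo '{ "accuracy": 0.94 }' > /tmp/iter2-benchmark.json

a=$(jq .accuracy /tmp/iter1-benchmark.json)
b=$(jq .accuracy /tmp/iter2-benchmark.json)
delta=$(jq -n --argjson a "$a" --argjson b "$b" '($b - $a) * 100 | round')
echo "accuracy delta: ${delta} points"
```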

4d: Iteration guidance


  • Change one sibling pair per iteration so you can attribute improvements
  • If accuracy plateaus, the remaining confusions may need architectural changes (merging tools, renaming, or restructuring the tool surface)
  • Stop when accuracy exceeds 90% or when remaining confusions are in ambiguous edge cases that humans would also struggle with


Reference Files


Read these when you reach the relevant phase — not upfront:
  • `references/quality-checklist.md` — Testable quality criteria for tool schemas (Phase 2)
  • `references/eval-patterns.md` — How to write tool-selection test intents (Phase 3)
  • `references/optimization.md` — How to improve descriptions from eval results (Phase 4)

Related Skills


  • `build-mcp-server` — Design and scaffold MCP servers (run it first, then eval-mcp to validate)
  • `build-mcp-app` — MCP servers with interactive UI widgets