eval-mcp


Evaluate MCP Tools


Tool descriptions are prompt engineering — they land directly in Claude's context window and determine whether Claude picks the right tool with the right arguments. This skill makes tool quality measurable and improvable instead of guesswork.
Three levels of testing, each building on the last:
  1. Static Analysis — deterministic schema quality checks (no Claude calls)
  2. Selection Testing — does Claude pick the right tool for each intent?
  3. Description Optimization — iterative improvement based on confusion patterns

When to Apply


  • User wants to check if their MCP tool schemas are well-designed
  • User wants to test whether Claude selects the right tools for user intents
  • User is debugging tool confusion (Claude picks the wrong tool)
  • User wants to optimize tool descriptions for better selection accuracy
  • User has finished scaffolding with `build-mcp-server` and wants to validate quality

Workflow Overview


Phase 1: Connect → Phase 2: Static Analysis → Phase 3: Selection Testing → Phase 4: Optimize
                                                            ↑__________________________|
Phase 4 loops back: apply rewrites → refetch schemas → retest → compare accuracy.

Prerequisites


  • Node.js >= 18 — required for the MCP Inspector CLI (`npx`)
  • jq — required for schema analysis scripts
  • A running MCP server — the server must respond to `tools/list`. Use `build-mcp-server/scripts/test-server.sh` to verify connectivity first.


Phase 1 — Connect & Inventory


Connect to the user's MCP server and fetch the tool schemas.

1a: Get connection details


Ask the user how to reach their server:
  • HTTP/SSE: URL (e.g., `http://localhost:3000/mcp`)
  • stdio: spawn command (e.g., `node dist/server.js`)

1b: Fetch tool schemas


```bash
bash scripts/fetch-tools.sh <url-or-command> <transport> <workspace>/tools.json
```

This calls `tools/list` via the MCP Inspector CLI and saves the schemas.

1c: Display inventory


Show a summary table:

```markdown
| # | Tool | Description (preview) | Params | Annotations |
|---|------|-----------------------|--------|-------------|
| 1 | search_issues | Search issues by keyword... | 3 | readOnlyHint |
| 2 | create_issue | Create a new issue... | 4 | |
```

Flag the tool count: 1-15 optimal, 15-30 warning, 30+ excessive (consider a search+execute pattern).

1d: Create workspace


Create a workspace at `{server-name}-eval/`, adjacent to the skill directory or in the user's project:
{server-name}-eval/
├── tools.json
├── evals/
│   └── evals.json
└── iteration-N/


Phase 2 — Static Analysis


Run deterministic quality checks — no Claude calls needed. This gives immediate feedback during development.

2a: Run analysis


```bash
bash scripts/analyze-schemas.sh <workspace>/tools.json <workspace>/iteration-N/static-analysis.json
```
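The kind of check the script runs can be sketched with jq alone. This is an illustrative sketch, not the script's actual implementation; the sample schema and both queries below are invented for the example:

```shell
# Hypothetical sketch of two static checks; analyze-schemas.sh implements the
# real rules. The sample schema below is invented for illustration.
cat > /tmp/sample-tools.json <<'EOF'
{
  "tools": [
    {
      "name": "search_issues",
      "description": "Search issues by keyword across title and body.",
      "annotations": { "readOnlyHint": true }
    },
    {
      "name": "create_issue",
      "description": ""
    }
  ]
}
EOF

# Tools whose description is missing or empty
missing_desc=$(jq -r '.tools[] | select((.description // "") == "") | .name' /tmp/sample-tools.json)

# Tools with no annotations block at all
missing_ann=$(jq -r '.tools[] | select(.annotations == null) | .name' /tmp/sample-tools.json)

echo "missing description: $missing_desc"
echo "missing annotations: $missing_ann"
```

Both checks flag `create_issue` here, which is the kind of signal that feeds the per-tool scores in 2b.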

2b: Display results


Show per-tool quality scores. Read `references/quality-checklist.md` for the criteria being checked.

```markdown
| Tool | Desc | Params | Schema | Annotations | Overall | Issues |
|------|------|--------|--------|-------------|---------|--------|
| search_issues | 3/3 | 3/3 | 2/3 | 2/3 | 2.5 | No negation |
| create_issue | 1/3 | 1/3 | 0/3 | 0/3 | 0.5 | 4 issues |
```

2c: Flag sibling pairs


If the analysis found tools with high description overlap, highlight them as confusion risks:

```markdown
Sibling Pairs (confusion risk)

| Tool A | Tool B | Overlap | Risk |
|--------|--------|---------|------|
| search_issues | list_issues | 52% | HIGH |
```
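How the overlap percentage is computed is up to analyze-schemas.sh; one plausible metric (an assumption, not necessarily what the script uses) is Jaccard similarity over the two descriptions' word sets:

```shell
# Assumed overlap metric: Jaccard similarity of word sets, as a percentage.
# The two descriptions are illustrative.
desc_a="Search issues by keyword across title and body"
desc_b="List issues filtered by status assignee or date"

# Lowercase, one word per line, deduplicated and sorted
tokens() { tr 'A-Z ' 'a-z\n' | grep -v '^$' | sort -u; }

common=$(comm -12 <(echo "$desc_a" | tokens) <(echo "$desc_b" | tokens) | wc -l)
union=$(cat <(echo "$desc_a" | tokens) <(echo "$desc_b" | tokens) | sort -u | wc -l)
overlap=$((100 * common / union))
echo "overlap: ${overlap}%"
```

Only "issues" and "by" are shared here, so the pair scores low; descriptions that share most of their vocabulary are the ones worth flagging.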

2d: Decision point


If critical issues exist (missing descriptions, zero annotations), recommend fixing them before Phase 3. Static issues create noise in selection testing — fix the obvious problems first, then measure the subtle ones.
If all tools score well, proceed to Phase 3.


Phase 3 — Selection Testing


Test whether Claude picks the right tool for each user intent. This is the core eval.

3a: Generate test intents


Read `references/eval-patterns.md` for intent generation patterns.
For each tool, generate:
  • 3 should-trigger intents — direct, implicit, and casual phrasings
  • 2 should-not-trigger intents — near-miss and keyword overlap
For each sibling pair flagged in Phase 2:
  • 1 disambiguation intent per tool — tests whether Claude picks the RIGHT sibling
Present all intents to the user for review. Ask if any should be added, removed, or modified.

3b: Save intents


Save to `{workspace}/evals/evals.json`:

```json
{
  "server_name": "my-server",
  "generated_from": "tools.json",
  "intents": [
    {
      "id": 1,
      "intent": "Are there any open bugs related to checkout?",
      "expected_tool": "search_issues",
      "type": "should_trigger",
      "target_tool": "search_issues",
      "notes": "Implicit intent — doesn't name the action"
    }
  ]
}
```

3c: Run selection tests


For each intent, spawn a subagent that receives:
  1. The full tool schemas from tools.json (formatted as they'd appear in Claude's context)
  2. The user intent text
  3. Instructions to select exactly one tool and provide arguments, or decline if no tool fits
The subagent prompt:

```
You have access to the following MCP tools:

{tool schemas as JSON}

A user sends this message:
"{intent text}"

Which tool would you call? Respond with JSON:
{
  "selected_tool": "tool_name" or null,
  "arguments": { ... } or {},
  "reasoning": "One sentence explaining your choice"
}

If no tool fits the user's request, set selected_tool to null.
Select exactly ONE tool. Do not suggest calling multiple tools.
```
Save each result to `{workspace}/iteration-N/selection/intent-{ID}/result.json`.
Launch all selection tests in parallel for efficiency.

3d: Grade results


```bash
bash scripts/grade-selection.sh \
  <workspace>/iteration-N/selection \
  <workspace>/evals/evals.json \
  <workspace>/iteration-N/benchmark.json
```
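The grading rule itself is small enough to sketch. This is an assumed mirror of what grade-selection.sh does per intent, not its actual code:

```shell
# Assumed per-intent grading rule: classify one result as correct, wrong_tool,
# false_accept (a tool chosen when none should be), or false_reject (declined
# when a tool was expected). grade-selection.sh implements the real version.
grade() {
  expected="$1"; selected="$2"; intent_type="$3"
  if [ "$intent_type" = "should_not_trigger" ]; then
    if [ "$selected" = "null" ]; then echo "correct"; else echo "false_accept"; fi
  elif [ "$selected" = "null" ]; then
    echo "false_reject"
  elif [ "$selected" = "$expected" ]; then
    echo "correct"
  else
    echo "wrong_tool"
  fi
}

g1=$(grade search_issues search_issues should_trigger)
g2=$(grade search_issues list_issues should_trigger)
g3=$(grade null create_issue should_not_trigger)
g4=$(grade search_issues null should_trigger)
echo "$g1 $g2 $g3 $g4"
```

These four categories are exactly the counts reported in the 3e summary table.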

3e: Display results



Selection Results — Iteration N


Accuracy: 82% (41/50 correct)

| Metric | Count |
|--------|-------|
| Correct | 41 |
| Wrong tool | 5 |
| False accept | 2 |
| False reject | 2 |

Per-Tool Accuracy


| Tool | Precision | Recall |
|------|-----------|--------|
| search_issues | 0.90 | 0.85 |
| create_issue | 1.00 | 1.00 |
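Precision and recall follow the standard definitions: precision is correct selections divided by the times Claude selected the tool; recall is correct selections divided by the times the eval set expected the tool. A quick sketch with hypothetical counts:

```shell
# Hypothetical counts for one tool: selected 19 times, 17 of those correct,
# and expected by the eval set 20 times.
tp=17; times_selected=19; times_expected=20

precision=$(awk -v tp="$tp" -v sel="$times_selected" 'BEGIN { printf "%.2f", tp / sel }')
recall=$(awk -v tp="$tp" -v expd="$times_expected" 'BEGIN { printf "%.2f", tp / expd }')
echo "precision=$precision recall=$recall"
```

Low precision means the tool attracts intents meant for others; low recall means its own intents leak to siblings.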

Worst Confusions


| Expected | Selected Instead | Times |
|----------|------------------|-------|
| list_issues | search_issues | 3 |
| get_user | find_user_by_email | 2 |

---

Phase 4 — Optimize & Iterate


Analyze confusion patterns and suggest description improvements. Read `references/optimization.md` for rewrite patterns.

4a: Analyze confusions


For each confused pair (from worst_confusions):
  1. Read both tools' current descriptions
  2. Identify why they're confusing (missing negation, overlapping scope, no cross-reference)
  3. Draft a specific rewrite following the disambiguation patterns in optimization.md

4b: Present suggestions



Suggested Improvements


search_issues ↔ list_issues (confused 3 times)


search_issues — Before:
Search issues by keyword.

search_issues — After:
Search issues by keyword across title and body. Returns up to `limit` results ranked by relevance. Does NOT filter by status, assignee, or date — use list_issues for structured filtering.

Reason: Adds a scope boundary and a cross-reference to disambiguate from list_issues.

Save to `{workspace}/iteration-N/suggestions.json` (format defined in optimization.md).

4c: Apply and retest


After the user applies the rewrites to their server code:
  1. Restart the server
  2. Re-run Phase 1 to refetch tools.json (descriptions may have changed)
  3. Re-run Phase 2 for updated static analysis
  4. Re-run Phase 3 into `iteration-N+1`, using the same evals.json
  5. Compare accuracy:

Iteration Comparison


| Metric | Iteration 1 | Iteration 2 | Delta |
|--------|-------------|-------------|-------|
| Accuracy | 82% | 94% | +12% |
| search↔list confusion | 3 | 0 | -3 |
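The delta row can be computed directly from the two benchmark files. A hedged sketch, assuming a top-level `accuracy` field in grade-selection.sh's output (the real format may differ; check the script):

```shell
# Assumed benchmark.json shape: { "accuracy": 0.82 }. The file paths below are
# illustrative stand-ins for iteration-1/benchmark.json and iteration-2/benchmark.json.
echo '{ "accuracy": 0.82 }' > /tmp/iter1-benchmark.json
echo '{ "accuracy": 0.94 }' > /tmp/iter2-benchmark.json

a=$(jq .accuracy /tmp/iter1-benchmark.json)
b=$(jq .accuracy /tmp/iter2-benchmark.json)
delta=$(jq -n --argjson a "$a" --argjson b "$b" '($b - $a) * 100 | round')
echo "accuracy delta: ${delta} points"
```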

4d: Iteration guidance


  • Change one sibling pair per iteration so you can attribute improvements
  • If accuracy plateaus, the remaining confusions may need architectural changes (merging tools, renaming, or restructuring the tool surface)
  • Stop when accuracy exceeds 90% or when remaining confusions are in ambiguous edge cases that humans would also struggle with


Reference Files


Read these when you reach the relevant phase — not upfront:
  • `references/quality-checklist.md` — Testable quality criteria for tool schemas (Phase 2)
  • `references/eval-patterns.md` — How to write tool-selection test intents (Phase 3)
  • `references/optimization.md` — How to improve descriptions from eval results (Phase 4)

Related Skills


  • `build-mcp-server` — Design and scaffold MCP servers (run it first, then eval-mcp to validate)
  • `build-mcp-app` — MCP servers with interactive UI widgets