# eval-mcp — Evaluate MCP Tools
Tool descriptions are prompt engineering — they land directly in Claude's context window and determine whether Claude picks the right tool with the right arguments. This skill makes tool quality measurable and improvable instead of guesswork.
Three levels of testing, each building on the last:
- Static Analysis — deterministic schema quality checks (no Claude calls)
- Selection Testing — does Claude pick the right tool for each intent?
- Description Optimization — iterative improvement based on confusion patterns
## When to Apply
- User wants to check if their MCP tool schemas are well-designed
- User wants to test whether Claude selects the right tools for user intents
- User is debugging tool confusion (Claude picks the wrong tool)
- User wants to optimize tool descriptions for better selection accuracy
- User has finished scaffolding with `build-mcp-server` and wants to validate quality
## Workflow Overview

```
Phase 1: Connect → Phase 2: Static Analysis → Phase 3: Selection Testing → Phase 4: Optimize
    ↑__________________________|
```

Phase 4 loops back: apply rewrites → refetch schemas → retest → compare accuracy.
## Prerequisites

- Node.js >= 18 — required for the MCP Inspector CLI (`npx`)
- `jq` — required for schema analysis scripts
- A running MCP server — the server must respond to `tools/list`. Use `build-mcp-server/scripts/test-server.sh` to verify connectivity first.
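A quick preflight along these lines can confirm the tooling is present before starting Phase 1 (a sketch; it only checks that the commands exist, not their versions):

```shell
# Report whether each required command is on PATH.
check_prereqs() {
  for cmd in node npx jq; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd: found"
    else
      echo "$cmd: MISSING"
    fi
  done
}
check_prereqs
```

If `node` is found, also confirm `node --version` reports v18 or later.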
## Phase 1 — Connect & Inventory
Connect to the user's MCP server and fetch the tool schemas.
### 1a: Get connection details
Ask the user how to reach their server:
- HTTP/SSE: URL (e.g., `http://localhost:3000/mcp`)
- stdio: spawn command (e.g., `node dist/server.js`)
### 1b: Fetch tool schemas
```bash
bash scripts/fetch-tools.sh <url-or-command> <transport> <workspace>/tools.json
```

This calls `tools/list` via the MCP Inspector CLI and saves the schemas.

### 1c: Display inventory
Show a summary table:
```markdown
| # | Tool | Description (preview) | Params | Annotations |
|---|------|-----------------------|--------|-------------|
| 1 | search_issues | Search issues by keyword... | 3 | readOnlyHint |
| 2 | create_issue | Create a new issue... | 4 | — |
```

Flag the tool count: 1-15 optimal, 15-30 warning, 30+ excessive (consider a search+execute pattern).
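The tool-count flag can be sketched in plain bash. Here a `grep` count of `"name"` keys stands in for proper `jq` parsing, and the sample `tools.json` is hypothetical:

```shell
# Hypothetical tools/list output with two tools (for illustration only).
cat > tools.json <<'EOF'
{"tools":[{"name":"search_issues"},{"name":"create_issue"}]}
EOF

# Crude tool count: occurrences of "name" keys (the real script uses jq).
count=$(grep -o '"name"' tools.json | wc -l | tr -d ' ')

if   [ "$count" -le 15 ]; then echo "$count tools: optimal"
elif [ "$count" -le 30 ]; then echo "$count tools: warning"
else                           echo "$count tools: excessive — consider a search+execute pattern"
fi
```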
### 1d: Create workspace
Create a workspace at `{server-name}-eval/`, adjacent to the skill directory or in the user's project:

```
{server-name}-eval/
├── tools.json
├── evals/
│   └── evals.json
└── iteration-N/
```

## Phase 2 — Static Analysis

Run deterministic quality checks — no Claude calls needed. This gives immediate feedback during development.
### 2a: Run analysis

```bash
bash scripts/analyze-schemas.sh <workspace>/tools.json <workspace>/iteration-N/static-analysis.json
```

### 2b: Display results
Show per-tool quality scores. Read `references/quality-checklist.md` for the criteria being checked.

```markdown
| Tool | Desc | Params | Schema | Annotations | Overall | Issues |
|------|------|--------|--------|-------------|---------|--------|
| search_issues | 3/3 | 3/3 | 2/3 | 2/3 | 2.5 | No negation |
| create_issue | 1/3 | 1/3 | 0/3 | 0/3 | 0.5 | 4 issues |
```

### 2c: Flag sibling pairs
If the analysis found tools with high description overlap, highlight them as confusion risks:
```markdown
## Sibling Pairs (confusion risk)

| Tool A | Tool B | Overlap | Risk |
|---|---|---|---|
| search_issues | list_issues | 52% | HIGH |
```
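The overlap percentage can be approximated as word-level Jaccard similarity between two descriptions — a rough pure-bash sketch (the actual heuristic in `analyze-schemas.sh` may differ):

```shell
# Unique lowercase words of a description, one per line.
words() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | sort -u; }

# Integer Jaccard percentage: |A ∩ B| / |A ∪ B| * 100.
overlap() {
  common=$(comm -12 <(words "$1") <(words "$2") | grep -c .)
  union=$(sort -u <(words "$1") <(words "$2") | grep -c .)
  echo $(( 100 * common / union ))
}

overlap "Search issues by keyword" "List issues matching a filter"
```

Anything above roughly 50% is worth flagging as a sibling pair.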
### 2d: Decision point
If critical issues exist (missing descriptions, zero annotations), recommend fixing them before Phase 3. Static issues create noise in selection testing — fix the obvious problems first, then measure the subtle ones.
If all tools score well, proceed to Phase 3.
## Phase 3 — Selection Testing
Test whether Claude picks the right tool for each user intent. This is the core eval.
### 3a: Generate test intents

Read `references/eval-patterns.md` for intent generation patterns.

For each tool, generate:
- 3 should-trigger intents — direct, implicit, and casual phrasings
- 2 should-not-trigger intents — near-miss and keyword overlap
For each sibling pair flagged in Phase 2:
- 1 disambiguation intent per tool — tests whether Claude picks the RIGHT sibling
Present all intents to the user for review. Ask if any should be added, removed, or modified.
### 3b: Save intents
Save to `{workspace}/evals/evals.json`:

```json
{
  "server_name": "my-server",
  "generated_from": "tools.json",
  "intents": [
    {
      "id": 1,
      "intent": "Are there any open bugs related to checkout?",
      "expected_tool": "search_issues",
      "type": "should_trigger",
      "target_tool": "search_issues",
      "notes": "Implicit intent — doesn't name the action"
    }
  ]
}
```
### 3c: Run selection tests
For each intent, spawn a subagent that receives:
- The full tool schemas from tools.json (formatted as they'd appear in Claude's context)
- The user intent text
- Instructions to select exactly one tool and provide arguments, or decline if no tool fits
The subagent prompt:
```
You have access to the following MCP tools:
{tool schemas as JSON}
A user sends this message:
"{intent text}"
Which tool would you call? Respond with JSON:
{
  "selected_tool": "tool_name" or null,
  "arguments": { ... } or {},
  "reasoning": "One sentence explaining your choice"
}
If no tool fits the user's request, set selected_tool to null.
Select exactly ONE tool. Do not suggest calling multiple tools.
```

Save each result to `{workspace}/iteration-N/selection/intent-{ID}/result.json`. Launch all selection tests in parallel for efficiency.
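The fan-out can use `xargs -P`. In this sketch, `run_intent` is a placeholder that writes a stub `result.json`; in the real flow it would send the tool schemas plus intent text to a subagent and capture its JSON reply:

```shell
# Placeholder for the actual subagent call (hypothetical helper):
# the real version would prompt Claude and save its JSON response.
run_intent() {
  id=$1
  mkdir -p "iteration-1/selection/intent-$id"
  printf '{"selected_tool": null, "arguments": {}, "reasoning": "stub"}\n' \
    > "iteration-1/selection/intent-$id/result.json"
}
export -f run_intent

# Run intents 1..5 with up to 4 in flight at once.
printf '%s\n' 1 2 3 4 5 | xargs -P 4 -I{} bash -c 'run_intent "$@"' _ {}
```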
### 3d: Grade results
```bash
bash scripts/grade-selection.sh \
  <workspace>/iteration-N/selection \
  <workspace>/evals/evals.json \
  <workspace>/iteration-N/benchmark.json
```

### 3e: Display results
```markdown
## Selection Results — Iteration N

Accuracy: 82% (41/50 correct)

| Metric | Count |
|---|---|
| Correct | 41 |
| Wrong tool | 5 |
| False accept | 2 |
| False reject | 2 |

### Per-Tool Accuracy

| Tool | Precision | Recall |
|---|---|---|
| search_issues | 0.90 | 0.85 |
| create_issue | 1.00 | 1.00 |

### Worst Confusions

| Expected | Selected Instead | Times |
|---|---|---|
| list_issues | search_issues | 3 |
| get_user | find_user_by_email | 2 |
```

---
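For reference, the per-tool numbers derive from true/false positives and false negatives in the graded results. With hypothetical counts for one tool:

```shell
# Hypothetical counts: tp = intents correctly routed to the tool,
# fp = intents wrongly routed to it, fn = its intents routed elsewhere.
tp=17; fp=2; fn=3
awk -v tp="$tp" -v fp="$fp" -v fn="$fn" 'BEGIN {
  printf "precision=%.2f recall=%.2f\n", tp / (tp + fp), tp / (tp + fn)
}'
```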
## Phase 4 — Optimize & Iterate

Analyze confusion patterns and suggest description improvements. Read `references/optimization.md` for rewrite patterns.

### 4a: Analyze confusions
For each confused pair (from worst_confusions):
- Read both tools' current descriptions
- Identify why they're confusing (missing negation, overlapping scope, no cross-reference)
- Draft a specific rewrite following the disambiguation patterns in optimization.md
### 4b: Present suggestions
```markdown
## Suggested Improvements

### search_issues ↔ list_issues (confused 3 times)

**search_issues — Before:**
Search issues by keyword.

**search_issues — After:**
Search issues by keyword across title and body. Returns up to `limit` results ranked by relevance. Does NOT filter by status, assignee, or date — use list_issues for structured filtering.

**Reason:** Adding a scope boundary and a cross-reference to disambiguate from list_issues.
```

Save to `{workspace}/iteration-N/suggestions.json` (format defined in optimization.md).

### 4c: Apply and retest
After the user applies the rewrites to their server code:
- Restart the server
- Re-run Phase 1 to refetch tools.json (descriptions may have changed)
- Re-run Phase 2 for updated static analysis
- Re-run Phase 3 into `iteration-N+1` using the same `evals.json`
- Compare accuracy:

```markdown
## Iteration Comparison

| Metric | Iteration 1 | Iteration 2 | Delta |
|---|---|---|---|
| Accuracy | 82% | 94% | +12% |
| search↔list confusion | 3 | 0 | -3 |
```
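Assuming `benchmark.json` exposes a top-level `accuracy` field (an assumption about `grade-selection.sh` output, not confirmed by this document), the delta row can be computed like so:

```shell
# Two hypothetical benchmark files; the real ones come from grade-selection.sh.
echo '{"accuracy": 0.82}' > iter1-benchmark.json
echo '{"accuracy": 0.94}' > iter2-benchmark.json

# Extract the (assumed) accuracy field without jq, then compute the delta.
acc() { grep -Eo '"accuracy": *[0-9.]+' "$1" | grep -Eo '[0-9.]+$'; }
awk -v a="$(acc iter1-benchmark.json)" -v b="$(acc iter2-benchmark.json)" \
  'BEGIN { printf "iteration1=%.0f%% iteration2=%.0f%% delta=%+.0f%%\n", a*100, b*100, (b-a)*100 }'
```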
### 4d: Iteration guidance
- Change one sibling pair per iteration so you can attribute improvements
- If accuracy plateaus, the remaining confusions may need architectural changes (merging tools, renaming, or restructuring the tool surface)
- Stop when accuracy exceeds 90% or when remaining confusions are in ambiguous edge cases that humans would also struggle with
## Reference Files
Read these when you reach the relevant phase — not upfront:
- `references/quality-checklist.md` — Testable quality criteria for tool schemas (Phase 2)
- `references/eval-patterns.md` — How to write tool selection test intents (Phase 3)
- `references/optimization.md` — How to improve descriptions from eval results (Phase 4)
## Related Skills
- `build-mcp-server` — Design and scaffold MCP servers (run this first, then eval-mcp to validate)
- `build-mcp-app` — MCP servers with interactive UI widgets