# Benchmark E2E
Single-command pipeline that creates projects, exercises skill injection via `claude --print`, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

## Quick Start
```bash
# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts

# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick
```
Options:

| Flag | Description | Default |
|------|-------------|---------|
| `--quick` | Run only first 3 projects | `false` |
| `--base <path>` | Override base directory | `~/dev/vercel-plugin-testing` |
| `--timeout <ms>` | Per-project timeout (forwarded to runner) | `900000` (15 min) |

## Pipeline Stages
The orchestrator chains four stages sequentially, aborting on failure:
- **runner** — Creates test dirs, installs plugin, runs `claude --print` with `VERCEL_PLUGIN_LOG_LEVEL=trace`
- **verify** — Detects package manager, launches dev server, polls for 200 with non-empty HTML
- **analyze** — Matches JSONL sessions to projects via `run-manifest.json`, extracts metrics
- **report** — Generates `report.md` and `report.json` with scorecards and recommendations
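The abort-on-failure chaining can be sketched as a small loop. This is an illustrative sketch, not the orchestrator's actual code: `runPipeline` and its callback are assumed helper names, and the real script would spawn each stage's script as a subprocess.

```typescript
// Sketch of the orchestrator's abort-on-failure chain. runPipeline and its
// callback are illustrative helpers, not the real orchestrator API; the
// callback would spawn the stage's script and return its exit code.
const STAGES = ["runner", "verify", "analyze", "report"];

// Returns the first failed stage's name, or null when all stages exit 0.
function runPipeline(run: (stage: string) => number): string | null {
  for (const stage of STAGES) {
    if (run(stage) !== 0) return stage; // abort on first non-zero exit
  }
  return null;
}
```

In the real script each callback invocation would shell out to the corresponding stage (e.g. via `child_process` or `Bun.spawn`) and forward its exit code.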
## Contracts
### `run-manifest.json`

Written by the runner at `<base>/results/run-manifest.json`. Links all downstream stages to the same run.

```typescript
interface BenchmarkRunManifest {
  runId: string;        // UUID for this pipeline run
  timestamp: string;    // ISO 8601
  baseDir: string;      // Absolute path to base directory
  projects: Array<{
    slug: string;       // e.g. "01-recipe-platform"
    cwd: string;        // Absolute path to project dir
    promptHash: string; // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}
```

The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
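Correlating a session to a project then reduces to a lookup on the manifest's absolute `cwd` paths. A minimal sketch under that assumption — `projectForSession` and its prefix-match fallback are illustrative, not the analyzer's actual code:

```typescript
// Shape of one manifest entry, per the BenchmarkRunManifest interface above.
interface ManifestProject {
  slug: string;
  cwd: string;
  promptHash: string;
  expectedSkills: string[];
}

// Hypothetical helper: map a session's working directory back to a project
// slug. Exact cwd match first; prefix match covers sub-directory sessions.
function projectForSession(
  projects: ManifestProject[],
  sessionCwd: string,
): string | null {
  const exact = projects.find((p) => p.cwd === sessionCwd);
  if (exact) return exact.slug;
  const prefix = projects.find((p) => sessionCwd.startsWith(p.cwd + "/"));
  return prefix?.slug ?? null;
}
```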
### `events.jsonl`

The orchestrator writes NDJSON events to `<base>/results/events.jsonl` tracking pipeline lifecycle:

```jsonc
// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner", "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner", "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify", "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
```
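Because every line is standalone JSON, downstream tooling can fold over the file line by line. A sketch that totals `durationMs` per stage — the `stageDurations` helper is hypothetical, but the event and field names follow the examples above:

```typescript
// Minimal shape of one NDJSON event line, per the examples above.
interface BenchmarkEvent {
  stage: string;
  event: string;
  timestamp: string;
  data?: { durationMs?: number };
}

// Hypothetical helper: sum durations per stage from the raw NDJSON text.
// Only "complete" and "error" events carry durationMs in the examples.
function stageDurations(ndjson: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const e = JSON.parse(line) as BenchmarkEvent;
    if ((e.event === "complete" || e.event === "error") && e.data?.durationMs != null) {
      totals.set(e.stage, (totals.get(e.stage) ?? 0) + e.data.durationMs);
    }
  }
  return totals;
}
```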
### `report.json`

Machine-readable report at `<base>/results/report.json` for programmatic consumption:

```typescript
interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string; // Skill that was expected but not injected
    glob: string;  // Suggested pathPattern glob
    tool: string;  // Tool name that should trigger injection
  }>;
}
```
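A consumer script might gate CI on the verdict and print per-project gaps. A minimal sketch using only the fields above — the `summarize` helper is illustrative, not part of the pipeline:

```typescript
// Just the fields of ReportJson that this sketch consumes.
interface Report {
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{ slug: string; expected: string[]; actual: string[]; missing: string[] }>;
}

// Hypothetical consumer: ok only on a "pass" verdict, plus one human-readable
// line per project that is still missing expected skills.
function summarize(report: Report): { ok: boolean; lines: string[] } {
  const lines = report.gaps
    .filter((g) => g.missing.length > 0)
    .map((g) => `${g.slug}: missing ${g.missing.join(", ")}`);
  return { ok: report.verdict === "pass", lines };
}
```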
## Overnight Automation Loop
Run the pipeline repeatedly with a cooldown between iterations:
```bash
while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done
```

Each run produces timestamped `report.json` and `report.md` files. Compare across runs to track improvement.

## Self-Improvement Cycle
The pipeline enables a closed feedback loop:
- **Run** — `bun run scripts/benchmark-e2e.ts` exercises the plugin against realistic projects
- **Read gaps** — `report.json` lists which skills were expected but never injected, with exact slugs
- **Apply fixes** — Use `suggestedPatterns` entries (copy-pasteable YAML) to add missing frontmatter patterns; use `recommendations` to fix hook logic
- **Re-run** — Execute the pipeline again to verify the gaps are closed
- **Compare** — Diff `report.json` across runs: `verdict` should trend from `"fail"` → `"partial"` → `"pass"`

For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.
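The comparison step can be automated by diffing the `missing` sets of two runs' `gaps` arrays. A sketch under that assumption — the `closedGaps` helper is hypothetical, not a pipeline script:

```typescript
// Hypothetical helper: given the gaps arrays from an earlier and a later
// report.json, list the "slug:skill" pairs that were missing before but are
// no longer missing (i.e. gaps that closed between runs).
function closedGaps(
  before: Array<{ slug: string; missing: string[] }>,
  after: Array<{ slug: string; missing: string[] }>,
): string[] {
  const still = new Map(after.map((g) => [g.slug, new Set(g.missing)]));
  const closed: string[] = [];
  for (const g of before) {
    for (const skill of g.missing) {
      if (!still.get(g.slug)?.has(skill)) closed.push(`${g.slug}:${skill}`);
    }
  }
  return closed;
}
```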
## Prompt Table
Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.
| # | Slug | Expected Skills |
|---|---|---|
| 01 | recipe-platform | auth, vercel-storage, nextjs |
| 02 | trivia-game | vercel-storage, nextjs |
| 03 | code-review-bot | ai-sdk, nextjs |
| 04 | conference-tickets | payments, email, auth |
| 05 | content-aggregator | cron-jobs, ai-sdk |
| 06 | finance-tracker | cron-jobs, email |
| 07 | multi-tenant-blog | routing-middleware, cms, auth |
| 08 | status-page | cron-jobs, vercel-storage, observability |
| 09 | dog-walking-saas | payments, auth, vercel-storage, env-vars |
## Cleanup

```bash
rm -rf ~/dev/vercel-plugin-testing
```