benchmark-e2e

Benchmark E2E

Single-command pipeline that creates projects, exercises skill injection via `claude --print`, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

Quick Start

bash
# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts

# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick

Options:

| Flag | Description | Default |
|------|-------------|---------|
| `--quick` | Run only first 3 projects | `false` |
| `--base <path>` | Override base directory | `~/dev/vercel-plugin-testing` |
| `--timeout <ms>` | Per-project timeout (forwarded to runner) | `900000` (15 min) |
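
As an illustration of how the flags in the table might be read, here is a minimal sketch. It is not the script's actual parser: `BenchmarkOptions`, `DEFAULTS`, and `parseArgs` are hypothetical names, and only the flag names and default values are taken from the table.

```typescript
// Hypothetical option shape; only flag names and defaults come from the table.
interface BenchmarkOptions {
  quick: boolean;
  base: string;
  timeoutMs: number;
}

const DEFAULTS: BenchmarkOptions = {
  quick: false,
  base: "~/dev/vercel-plugin-testing",
  timeoutMs: 900_000, // 15 minutes
};

// Parses an argv-style array that excludes the runtime and script entries.
function parseArgs(argv: string[]): BenchmarkOptions {
  const opts: BenchmarkOptions = { ...DEFAULTS };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--quick") {
      opts.quick = true;
    } else if (arg === "--base" || arg === "--timeout") {
      const value = argv[++i];
      if (value === undefined) throw new Error(`${arg} requires a value`);
      if (arg === "--base") {
        opts.base = value;
      } else {
        const ms = Number(value);
        if (!Number.isFinite(ms) || ms <= 0) {
          throw new Error("--timeout must be a positive number of milliseconds");
        }
        opts.timeoutMs = ms;
      }
    } else {
      throw new Error(`Unknown flag: ${arg}`);
    }
  }
  return opts;
}
```

Calling `parseArgs(["--quick", "--timeout", "60000"])` would yield quick mode with a one-minute per-project timeout and the default base directory.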

Pipeline Stages

The orchestrator chains four stages sequentially, aborting on failure:
  1. runner: creates test dirs, installs the plugin, and runs `claude --print` with `VERCEL_PLUGIN_LOG_LEVEL=trace`
  2. verify: detects the package manager, launches the dev server, and polls for a 200 response with non-empty HTML
  3. analyze: matches JSONL sessions to projects via `run-manifest.json` and extracts metrics
  4. report: generates `report.md` and `report.json` with scorecards and recommendations
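
The "chain sequentially, abort on failure" behaviour described in the list can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `Stage`, `PipelineResult`, and `runPipeline` are hypothetical names, and each stage is modelled as a function returning a process-style exit code. The real orchestrator presumably spawns each stage as a subprocess.

```typescript
// Hypothetical stage shape: each stage reports a process-style exit code.
interface Stage {
  name: "runner" | "verify" | "analyze" | "report";
  run: () => number;
}

interface PipelineResult {
  completed: string[];        // Stages that exited with code 0, in order
  failedStage: string | null; // First stage with a non-zero exit code
  exitCode: number;
}

// Runs stages in order and stops at the first non-zero exit code.
function runPipeline(stages: Stage[]): PipelineResult {
  const completed: string[] = [];
  for (const stage of stages) {
    const exitCode = stage.run();
    if (exitCode !== 0) {
      return { completed, failedStage: stage.name, exitCode };
    }
    completed.push(stage.name);
  }
  return { completed, failedStage: null, exitCode: 0 };
}
```

With this shape, a failing `verify` stage means `analyze` and `report` are never invoked, which matches the abort semantics described in the list.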

Contracts

run-manifest.json

Written by the runner at `<base>/results/run-manifest.json`. Links all downstream stages to the same run.
typescript
interface BenchmarkRunManifest {
  runId: string;           // UUID for this pipeline run
  timestamp: string;       // ISO 8601
  baseDir: string;         // Absolute path to base directory
  projects: Array<{
    slug: string;          // e.g. "01-recipe-platform"
    cwd: string;           // Absolute path to project dir
    promptHash: string;    // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}
The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
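
A downstream stage might consume the manifest roughly as shown here. This is a sketch, not the analyzer's actual code: `parseManifest` and `indexByCwd` are hypothetical helpers, and matching a session to a project by its working directory is an assumption, since the source does not say which field is used for correlation.

```typescript
interface ManifestProject {
  slug: string;
  cwd: string;
  promptHash: string;
  expectedSkills: string[];
}

interface BenchmarkRunManifest {
  runId: string;
  timestamp: string;
  baseDir: string;
  projects: ManifestProject[];
}

// Parses manifest text and checks the fields this sketch relies on.
function parseManifest(text: string): BenchmarkRunManifest {
  const data = JSON.parse(text) as Partial<BenchmarkRunManifest>;
  if (typeof data.runId !== "string" || !Array.isArray(data.projects)) {
    throw new Error("Invalid run manifest: missing runId or projects");
  }
  return data as BenchmarkRunManifest;
}

// Assumption: a session is matched to a project by its working directory.
function indexByCwd(manifest: BenchmarkRunManifest): Map<string, ManifestProject> {
  const index = new Map<string, ManifestProject>();
  for (const project of manifest.projects) {
    index.set(project.cwd, project);
  }
  return index;
}
```

A lookup such as `indexByCwd(manifest).get(sessionCwd)` then returns either the single matching project or `undefined`, with no directory scanning involved.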

events.jsonl

The orchestrator writes NDJSON events to `<base>/results/events.jsonl`, tracking the pipeline lifecycle:
jsonc
// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner",   "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner",   "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify",   "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
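
Because each line of the event log is an independent JSON object, it can be read line by line. The sketch here assumes the event shape shown in the sample; `PipelineEvent`, `parseEvents`, and `findAbort` are hypothetical names. Note that the `//` comments in the sample are annotations only, and a real NDJSON file would not contain them.

```typescript
// Event shape inferred from the sample lines; `data` varies per event.
interface PipelineEvent {
  stage: string;
  event: string;
  timestamp: string;
  data: Record<string, unknown>;
}

// Parses NDJSON text: one JSON object per non-empty line.
function parseEvents(ndjson: string): PipelineEvent[] {
  return ndjson
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as PipelineEvent);
}

// Returns the pipeline abort event, or undefined if the run was not aborted.
function findAbort(events: PipelineEvent[]): PipelineEvent | undefined {
  return events.find((e) => e.stage === "pipeline" && e.event === "abort");
}
```

For an aborted run, `findAbort(events)?.data.failedStage` would name the stage that stopped the pipeline.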

report.json

Machine-readable report at `<base>/results/report.json` for programmatic consumption:
typescript
interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string;   // Skill that was expected but not injected
    glob: string;    // Suggested pathPattern glob
    tool: string;    // Tool name that should trigger injection
  }>;
}
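
To show how the `gaps` array might be consumed, here is a small sketch. It assumes that `missing` is the set of `expected` skills absent from `actual`, which the field names suggest but the source does not state. `missingSkills` and `countMissingBySkill` are hypothetical helpers.

```typescript
interface Gap {
  slug: string;
  expected: string[];
  actual: string[];
  missing: string[];
}

// Skills that were expected but not injected, in the order they were expected.
function missingSkills(expected: string[], actual: string[]): string[] {
  const injected = new Set(actual);
  return expected.filter((skill) => !injected.has(skill));
}

// Counts how many projects each skill is missing from.
function countMissingBySkill(gaps: Gap[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const gap of gaps) {
    for (const skill of gap.missing) {
      counts.set(skill, (counts.get(skill) ?? 0) + 1);
    }
  }
  return counts;
}
```

Sorting the resulting counts would highlight the skills whose injection patterns most need attention.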

Overnight Automation Loop

Run the pipeline repeatedly with a cooldown between iterations:
bash
while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done
Each run produces timestamped `report.json` and `report.md` files. Compare across runs to track improvement.

Self-Improvement Cycle

The pipeline enables a closed feedback loop:
  1. Run: `bun run scripts/benchmark-e2e.ts` exercises the plugin against realistic projects
  2. Read gaps: `report.json` lists which skills were expected but never injected, with exact slugs
  3. Apply fixes: use `suggestedPatterns` entries (copy-pasteable YAML) to add missing frontmatter patterns; use `recommendations` to fix hook logic
  4. Re-run: execute the pipeline again to verify the gaps are closed
  5. Compare: diff `report.json` across runs; `verdict` should trend from `"fail"` to `"partial"` to `"pass"`
For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.
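
The "Compare" step could be automated along these lines. This is a sketch under stated assumptions: it uses only the `verdict` and `gaps` fields of `report.json`, ranks verdicts as fail < partial < pass, and identifies a gap by its slug plus the missing skill. `ReportSummary`, `compareReports`, and the returned field names are hypothetical.

```typescript
type Verdict = "pass" | "partial" | "fail";

interface ReportSummary {
  verdict: Verdict;
  gaps: Array<{ slug: string; missing: string[] }>;
}

// Assumed ordering: a higher rank is a better verdict.
const VERDICT_RANK: Record<Verdict, number> = { fail: 0, partial: 1, pass: 2 };

// Flattens gaps into "slug:skill" keys so two runs can be compared as sets.
function gapKeys(report: ReportSummary): Set<string> {
  const keys = new Set<string>();
  for (const gap of report.gaps) {
    for (const skill of gap.missing) {
      keys.add(`${gap.slug}:${skill}`);
    }
  }
  return keys;
}

function compareReports(previous: ReportSummary, current: ReportSummary) {
  const before = gapKeys(previous);
  const after = gapKeys(current);
  return {
    verdictImproved: VERDICT_RANK[current.verdict] > VERDICT_RANK[previous.verdict],
    closed: Array.from(before).filter((key) => !after.has(key)),    // Fixed since last run
    remaining: Array.from(after).filter((key) => before.has(key)),  // Still missing
    regressed: Array.from(after).filter((key) => !before.has(key)), // Newly missing
  };
}
```

Tracking `regressed` separately is a deliberate choice in this sketch: a pattern fix for one skill can break injection for another, and a verdict alone would hide that.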

Prompt Table

Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.

| # | Slug | Expected Skills |
|---|------|-----------------|
| 01 | recipe-platform | auth, vercel-storage, nextjs |
| 02 | trivia-game | vercel-storage, nextjs |
| 03 | code-review-bot | ai-sdk, nextjs |
| 04 | conference-tickets | payments, email, auth |
| 05 | content-aggregator | cron-jobs, ai-sdk |
| 06 | finance-tracker | cron-jobs, email |
| 07 | multi-tenant-blog | routing-middleware, cms, auth |
| 08 | status-page | cron-jobs, vercel-storage, observability |
| 09 | dog-walking-saas | payments, auth, vercel-storage, env-vars |

Cleanup

bash
rm -rf ~/dev/vercel-plugin-testing