benchmark-e2e

Benchmark E2E

Single-command pipeline that creates projects, exercises skill injection via `claude --print`, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

Quick Start

bash

# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts

# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick


Options:

| Flag | Description | Default |
|------|-------------|---------|
| `--quick` | Run only first 3 projects | `false` |
| `--base <path>` | Override base directory | `~/dev/vercel-plugin-testing` |
| `--timeout <ms>` | Per-project timeout (forwarded to runner) | `900000` (15 min) |
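
As an illustration of how the flags in the table could map onto a typed options object, here is a minimal sketch using `parseArgs` from `node:util`. The helper name `parseBenchmarkArgs` and the `BenchmarkOptions` shape are hypothetical; the actual script may parse its arguments differently.

```typescript
import { parseArgs } from "node:util";
import { homedir } from "node:os";
import { join } from "node:path";

// Hypothetical shape; field names are assumptions, not taken from the script.
interface BenchmarkOptions {
  quick: boolean;
  base: string;
  timeoutMs: number;
}

// Hypothetical helper: applies the documented defaults when a flag is absent.
function parseBenchmarkArgs(argv: string[]): BenchmarkOptions {
  const { values } = parseArgs({
    args: argv,
    options: {
      quick: { type: "boolean" },
      base: { type: "string" },
      timeout: { type: "string" },
    },
    strict: true, // reject flags that are not listed above
  });

  // Default per-project timeout from the table: 900000 ms (15 min).
  const timeoutMs = values.timeout === undefined ? 900_000 : Number(values.timeout);
  if (!Number.isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new Error(`--timeout must be a positive number of milliseconds, got "${values.timeout}"`);
  }

  return {
    quick: values.quick ?? false,
    base: values.base ?? join(homedir(), "dev", "vercel-plugin-testing"),
    timeoutMs,
  };
}
```

Calling `parseBenchmarkArgs(process.argv.slice(2))` would then yield the effective options for a run.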


Pipeline Stages

The orchestrator chains four stages sequentially, aborting on failure:

1. runner: Creates test directories, installs the plugin, and runs `claude --print` with `VERCEL_PLUGIN_LOG_LEVEL=trace`
2. verify: Detects the package manager, launches the dev server, and polls for a 200 response with non-empty HTML
3. analyze: Matches JSONL sessions to projects via `run-manifest.json` and extracts metrics
4. report: Generates `report.md` and `report.json` with scorecards and recommendations
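
The sequential, abort-on-failure chaining described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the orchestrator's actual code: `Stage`, `PipelineOutcome`, and `runPipeline` are hypothetical names, and each stage is modeled as a synchronous function returning an exit code, whereas a real stage would more likely spawn a child process.

```typescript
type StageName = "runner" | "verify" | "analyze" | "report";

// Hypothetical stage abstraction: a name plus a function returning an exit code.
interface Stage {
  name: StageName;
  run: () => { exitCode: number };
}

interface PipelineOutcome {
  completed: StageName[];         // stages that finished with exit code 0
  failedStage: StageName | null;  // first stage with a non-zero exit code, if any
  exitCode: number;
}

// Runs stages in order and stops at the first non-zero exit code,
// so later stages never run after a failure.
function runPipeline(stages: Stage[]): PipelineOutcome {
  const completed: StageName[] = [];
  for (const stage of stages) {
    const { exitCode } = stage.run();
    if (exitCode !== 0) {
      return { completed, failedStage: stage.name, exitCode };
    }
    completed.push(stage.name);
  }
  return { completed, failedStage: null, exitCode: 0 };
}
```

Returning the failed stage name instead of throwing keeps enough context for the caller to record which stage aborted the run.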

Contracts

run-manifest.json

Written by the runner at `<base>/results/run-manifest.json`. Links all downstream stages to the same run.

typescript

interface BenchmarkRunManifest {
  runId: string;           // UUID for this pipeline run
  timestamp: string;       // ISO 8601
  baseDir: string;         // Absolute path to base directory
  projects: Array<{
    slug: string;          // e.g. "01-recipe-platform"
    cwd: string;           // Absolute path to project dir
    promptHash: string;    // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}

The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
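
To make the correlation idea concrete, here is a small sketch of looking up a project by a session's working directory. It assumes each session log records the directory it ran in, which the text above does not confirm; `findProjectForSession` is a hypothetical helper.

```typescript
import { resolve } from "node:path";

interface BenchmarkRunManifest {
  runId: string;
  timestamp: string;
  baseDir: string;
  projects: Array<{
    slug: string;
    cwd: string;
    promptHash: string;
    expectedSkills: string[];
  }>;
}

// Hypothetical helper: returns the manifest entry whose cwd matches the
// session's working directory, or null when the session is not part of this run.
// Both paths are normalized so a trailing slash does not break the match.
function findProjectForSession(
  manifest: BenchmarkRunManifest,
  sessionCwd: string,
): BenchmarkRunManifest["projects"][number] | null {
  const target = resolve(sessionCwd);
  return manifest.projects.find((project) => resolve(project.cwd) === target) ?? null;
}
```

An exact match on `cwd` is what allows sessions from unrelated directories to be ignored rather than misattributed.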

events.jsonl

The orchestrator writes NDJSON events to `<base>/results/events.jsonl`, tracking the pipeline lifecycle:

jsonc

// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner",   "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner",   "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify",   "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
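
A consumer of this file could parse it line by line and look for the abort event, roughly as sketched below. The `PipelineEvent` type mirrors the fields shown in the example; `parseEvents` and `findAbort` are hypothetical helpers, and the sketch assumes the real file contains only JSON lines (the `//` comments above being annotations for this document).

```typescript
interface PipelineEvent {
  stage: string;
  event: string;
  timestamp: string;
  data: Record<string, unknown>;
}

// Parses NDJSON text: one JSON object per non-empty line.
function parseEvents(ndjson: string): PipelineEvent[] {
  return ndjson
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as PipelineEvent);
}

// Returns the pipeline-level abort event, if the run was aborted.
function findAbort(events: PipelineEvent[]): PipelineEvent | undefined {
  return events.find((e) => e.stage === "pipeline" && e.event === "abort");
}
```

With the file contents loaded into a string, `findAbort(parseEvents(text))?.data.failedStage` would name the stage that stopped the run.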

report.json

Machine-readable report at `<base>/results/report.json` for programmatic consumption:

typescript

interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string;   // Skill that was expected but not injected
    glob: string;    // Suggested pathPattern glob
    tool: string;    // Tool name that should trigger injection
  }>;
}
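
One plausible reading of a `gaps` entry is that `missing` holds the expected skills that do not appear in `actual`. The sketch below computes an entry under that assumption; `computeGap` is a hypothetical helper, and the report generator may derive these fields differently.

```typescript
interface Gap {
  slug: string;
  expected: string[];
  actual: string[];
  missing: string[];
}

// Assumption: missing = expected skills that were never injected (not in actual).
// The order of `expected` is preserved in `missing`.
function computeGap(slug: string, expected: string[], actual: string[]): Gap {
  const injected = new Set(actual);
  return {
    slug,
    expected,
    actual,
    missing: expected.filter((skill) => !injected.has(skill)),
  };
}
```

For example, a project expecting `payments`, `email`, and `auth` where only `auth` was injected would report `payments` and `email` as missing.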

Overnight Automation Loop

Run the pipeline repeatedly with a cooldown between iterations:

bash

while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done

Each run produces timestamped `report.json` and `report.md` files. Compare across runs to track improvement.

Self-Improvement Cycle

The pipeline enables a closed feedback loop:

1. Run: `bun run scripts/benchmark-e2e.ts` exercises the plugin against realistic projects
2. Read gaps: `report.json` lists which skills were expected but never injected, with exact slugs
3. Apply fixes: Use `suggestedPatterns` entries (copy-pasteable YAML) to add missing frontmatter patterns; use `recommendations` to fix hook logic
4. Re-run: Execute the pipeline again to verify the gaps are closed
5. Compare: Diff `report.json` across runs; `verdict` should trend from `"fail"` → `"partial"` → `"pass"`
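
The Compare step could be automated along these lines. This is a sketch only: `compareReports`, `ReportSummary`, and `ReportDiff` are hypothetical names, and ranking verdicts as fail < partial < pass is an assumption based on the trend described in the last step.

```typescript
type Verdict = "pass" | "partial" | "fail";

// Only the report fields this comparison needs.
interface ReportSummary {
  verdict: Verdict;
  gaps: Array<{ slug: string; missing: string[] }>;
}

interface ReportDiff {
  verdictChange: "improved" | "unchanged" | "regressed";
  closed: string[]; // "slug/skill" pairs missing before but not now
  opened: string[]; // "slug/skill" pairs missing now but not before
}

// Assumed ordering of verdicts, worst to best.
const VERDICT_RANK: Record<Verdict, number> = { fail: 0, partial: 1, pass: 2 };

// Flattens gaps into "slug/skill" keys so two runs can be compared as sets.
function missingKeys(report: ReportSummary): string[] {
  const keys: string[] = [];
  for (const gap of report.gaps) {
    for (const skill of gap.missing) {
      keys.push(`${gap.slug}/${skill}`);
    }
  }
  return keys;
}

function compareReports(previous: ReportSummary, current: ReportSummary): ReportDiff {
  const before = missingKeys(previous);
  const after = missingKeys(current);
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  const delta = VERDICT_RANK[current.verdict] - VERDICT_RANK[previous.verdict];
  return {
    verdictChange: delta > 0 ? "improved" : delta < 0 ? "regressed" : "unchanged",
    closed: before.filter((key) => !afterSet.has(key)),
    opened: after.filter((key) => !beforeSet.has(key)),
  };
}
```

A non-empty `opened` list is worth checking even when the verdict improves, since a fix for one project can regress injection in another.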

For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.

Prompt Table

Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.

| # | Slug | Expected Skills |
|---|------|-----------------|
| 01 | recipe-platform | auth, vercel-storage, nextjs |
| 02 | trivia-game | vercel-storage, nextjs |
| 03 | code-review-bot | ai-sdk, nextjs |
| 04 | conference-tickets | payments, email, auth |
| 05 | content-aggregator | cron-jobs, ai-sdk |
| 06 | finance-tracker | cron-jobs, email |
| 07 | multi-tenant-blog | routing-middleware, cms, auth |
| 08 | status-page | cron-jobs, vercel-storage, observability |
| 09 | dog-walking-saas | payments, auth, vercel-storage, env-vars |

Cleanup

bash

rm -rf ~/dev/vercel-plugin-testing
