benchmark-sandbox

Benchmark Sandbox — Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs running node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, has the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:

Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`
Phase 2 (VERIFY): A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout)
Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.

Skills are tracked across all 3 phases; each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.

Proven Working Script

Use `run-eval.ts`, the proven eval runner:

bash

# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8

# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy

# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

CLI Flags

Flag	Default	Description
`--concurrency N`	5	Max parallel sandboxes (max 10)
`--timeout MS`	1800000 (30 min)	Per-phase timeout in ms
`--keep-alive`	off	Keep sandboxes running after eval
`--keep-hours N`	8	Hours to keep alive (with `--keep-alive` )
`--skip-verify`	off	Skip the agent-browser verification phase
`--skip-deploy`	off	Skip the Vercel deploy phase
`--scenarios a,b,c`	all	Only run specific scenarios by slug
`--scenarios-file path`	—	Load scenarios from a JSON file instead of built-in defaults
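
The flag table can be read as a small options object with defaults. The sketch below is a hypothetical parser written for illustration only: `EvalOptions` and `parseEvalArgs` are assumed names, and the way `run-eval.ts` actually parses its arguments is not shown in this document. Only the flag names, defaults, and the concurrency ceiling of 10 come from the table.

```typescript
// Defaults mirror the flag table above; the parser itself is a hypothetical
// sketch and is not taken from run-eval.ts.
interface EvalOptions {
  concurrency: number;
  timeoutMs: number;
  keepAlive: boolean;
  keepHours: number;
  skipVerify: boolean;
  skipDeploy: boolean;
  scenarios: string[] | null; // null means "all"
  scenariosFile: string | null;
}

function parseEvalArgs(argv: string[]): EvalOptions {
  const opts: EvalOptions = {
    concurrency: 5,
    timeoutMs: 1_800_000,
    keepAlive: false,
    keepHours: 8,
    skipVerify: false,
    skipDeploy: false,
    scenarios: null,
    scenariosFile: null,
  };
  let i = 0;
  // Reads the value that follows a flag, failing loudly if it is missing.
  const takeValue = (flag: string): string => {
    const value = argv[i++];
    if (value === undefined) throw new Error(`${flag} requires a value`);
    return value;
  };
  while (i < argv.length) {
    const flag = String(argv[i++]);
    switch (flag) {
      case "--concurrency":
        // Clamp to the documented ceiling of 10 parallel sandboxes.
        opts.concurrency = Math.min(10, Math.max(1, Number(takeValue(flag))));
        break;
      case "--timeout":
        opts.timeoutMs = Number(takeValue(flag));
        break;
      case "--keep-alive":
        opts.keepAlive = true;
        break;
      case "--keep-hours":
        opts.keepHours = Number(takeValue(flag));
        break;
      case "--skip-verify":
        opts.skipVerify = true;
        break;
      case "--skip-deploy":
        opts.skipDeploy = true;
        break;
      case "--scenarios":
        opts.scenarios = takeValue(flag).split(",").filter(Boolean);
        break;
      case "--scenarios-file":
        opts.scenariosFile = takeValue(flag);
        break;
      default:
        throw new Error(`unknown flag: ${flag}`);
    }
  }
  return opts;
}

const parsed = parseEvalArgs([
  "--concurrency", "25",
  "--scenarios", "splitwise-clone,calendly-clone",
  "--keep-alive",
]);
console.log(parsed.concurrency, parsed.scenarios, parsed.keepAlive);
```

Rejecting unknown flags is a design choice made in this sketch so that a typo such as `--skip-verfy` fails immediately instead of silently running the full pipeline.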

Dynamic Scenarios (Recommended Approach)

Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.

Scenario JSON Format

json

[
  {
    "slug": "pet-adoption-board",
    "prompt": "Build me a pet adoption listing board where shelters can post animals...",
    "expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
    "userStories": [
      "As a visitor, I can see a grid of pet listings with photos and names",
      "As a visitor, I can click a pet card to see a detail page",
      "As a visitor, I can filter pets by type"
    ]
  }
]

Each scenario needs: `slug` (string), `prompt` (string), `expectedSkills` (string[]), `userStories` (tuple of exactly 3 strings).
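
A scenarios file can be checked against this shape before any sandbox is created. The following is a minimal sketch, assuming only the four fields listed above; the `Scenario` type and the `validateScenarios` function are hypothetical names for illustration and are not taken from `run-eval.ts`.

```typescript
// Hypothetical types for illustration; run-eval.ts may define these differently.
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Returns a list of human-readable problems; an empty list means the input is valid.
function validateScenarios(input: unknown): string[] {
  if (!Array.isArray(input)) return ["top level must be an array of scenarios"];
  const problems: string[] = [];
  input.forEach((item, i) => {
    const s = (item ?? {}) as Record<string, unknown>;
    if (typeof s.slug !== "string" || s.slug.length === 0) {
      problems.push(`[${i}] slug must be a non-empty string`);
    }
    if (typeof s.prompt !== "string") {
      problems.push(`[${i}] prompt must be a string`);
    }
    if (!isStringArray(s.expectedSkills)) {
      problems.push(`[${i}] expectedSkills must be string[]`);
    }
    if (!isStringArray(s.userStories) || s.userStories.length !== 3) {
      problems.push(`[${i}] userStories must contain exactly 3 strings`);
    }
  });
  return problems;
}

const example: Scenario = {
  slug: "pet-adoption-board",
  prompt: "Build me a pet adoption listing board where shelters can post animals...",
  expectedSkills: ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
  userStories: [
    "As a visitor, I can see a grid of pet listings with photos and names",
    "As a visitor, I can click a pet card to see a detail page",
    "As a visitor, I can filter pets by type",
  ],
};
console.log(validateScenarios([example])); // prints [] when every scenario is valid
```

Collecting all problems instead of throwing on the first one makes it easier to fix a generated file in a single pass.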

Prompt Design Guidelines

Focus on what the user wants, not what tech to use
Describe real-world apps that solve real problems with friendly, stylish UX
Include AI features naturally (recommendations, analysis, generation)

Always end with: "Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with `npx next dev --port 3000`."

Include storage needs (photos, uploads) to trigger vercel-storage
Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
Include auth/middleware to trigger routing-middleware

Structured Scoring (Haiku)

Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass (no tools, no hooks) that just reads the phase output and returns structured data.

Build Score Schema

json

{
  "completeness": "complete|partial|minimal|empty",
  "hasApiRoutes": true,
  "hasUIComponents": true,
  "hasAIFeature": true,
  "devServerRunning": true,
  "missingFeatures": ["feature1"],
  "summary": "Brief assessment"
}

Verify Score Schema (per user story)

json

{
  "stories": [
    { "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
  ]
}

Deploy Score Schema

json

{
  "deployed": true,
  "url": "https://xxx.vercel.app",
  "buildSucceeded": true,
  "errors": [],
  "summary": "Brief assessment"
}

Important: The `claude -p --output-format json` response wraps results; the actual schema data is in `parsed.structured_output`, not the top-level object.
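
Unwrapping the score can be sketched as below. The helper name `extractStructuredOutput` and the `DeployScore` type are hypothetical; the only detail taken from the text is that the schema data lives under `structured_output` in the parsed response. The sample wrapper is made up for illustration, and real responses carry additional fields that this sketch ignores.

```typescript
// Hypothetical helper; the only detail taken from the text is that schema data
// lives under `structured_output` in the parsed `--output-format json` response.
function extractStructuredOutput<T>(rawStdout: string): T {
  const parsed = JSON.parse(rawStdout) as { structured_output?: T };
  if (parsed.structured_output === undefined) {
    throw new Error("response has no structured_output field");
  }
  return parsed.structured_output;
}

// Mirrors the Deploy Score Schema shown above.
interface DeployScore {
  deployed: boolean;
  url: string;
  buildSucceeded: boolean;
  errors: string[];
  summary: string;
}

// Made-up wrapper for illustration only.
const sampleStdout = JSON.stringify({
  structured_output: {
    deployed: true,
    url: "https://xxx.vercel.app",
    buildSucceeded: true,
    errors: [],
    summary: "Brief assessment",
  },
});

const deployScore = extractStructuredOutput<DeployScore>(sampleStdout);
console.log(deployScore.deployed, deployScore.url);
```

Throwing when the field is absent keeps a malformed scoring pass from being mistaken for an empty but valid score.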

Critical Sandbox Environment Facts

Property	Value
Home directory	`/home/vercel-sandbox` (NOT `/home/user/` or `/root/` )
User	`vercel-sandbox` (NOT `root` )
Claude binary	`/home/vercel-sandbox/.global/npm/bin/claude`
PATH (via sh -c)	Includes `~/.global/npm/bin` — claude findable by name
Port exposure	`sandbox.domain(3000)` → `https://subdomain.vercel.run`
Snapshot persistence	Files AND npm globals survive snapshot restore — use `sandbox.snapshot()` → `Sandbox.create({ source: { type: "snapshot", snapshotId } })`
SDK version	`@vercel/sandbox@1.8.0` (v2 beta's named sandbox endpoint returns 404 for this team)
Team tier	Enterprise (vercel-labs) — no known sandbox time cap

Key Discoveries (Hard-Won)

Snapshots work: `sandbox.snapshot()` preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox, so create a new one from the snapshot to continue.
Plugin install: Use `npx add-plugin <path> -s project -y --target claude-code`. This works because claude is in PATH after `npm install -g`. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized `~/.claude/` dir.
File uploads: Use `sandbox.writeFiles([{ path, content: Buffer }])`, NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
Claude flags: Always use `--dangerously-skip-permissions --debug`. The `--debug` flag writes to `~/.claude/debug/`.
Auth: API key from macOS Keychain (`ANTHROPIC_AUTH_TOKEN`, a `vck_*` Vercel Claude Key for AI Gateway), Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
OIDC for sandbox SDK: Run `npx vercel link --scope vercel-labs -y` followed by `npx vercel env pull` once before first use.
Port exposure: Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0; the URL is assigned at creation time, before anything listens.
extendTimeout: Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working: it extends by the requested duration. Use this for overnight keep-alive.
Background commands: `runCommand` with backgrounded processes (`&` or `nohup`) may throw ZodError on v1. Write a script file first, then execute it.
Session cleanup race: The `session-end-cleanup.mjs` hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
agent-browser works in sandboxes: Install via `npm install -g agent-browser`. Claude Code can use it for browser-based verification inside the sandbox.
No hobby tier cap: Early 301s timeouts were from lower default timeout values in earlier script iterations, not a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap; sandboxes ran 10+ minutes successfully.
claude -p works inside sandboxes: `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. No nesting issue when running inside a sandbox (only fails when running Claude inside Claude on the same machine).
Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects, and we generate many per day. Format: `<slug>-<YYYYMMDDHHMM>`.
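
The `<slug>-<YYYYMMDDHHMM>` naming rule can be sketched as a small helper. `timestampedSlug` is a hypothetical name, and the use of UTC is an assumption made here so the result does not depend on the machine's timezone; the source does not say which clock the runner uses.

```typescript
// Builds `<slug>-<YYYYMMDDHHMM>`. UTC is an assumption of this sketch, chosen so
// the output is independent of the local timezone.
function timestampedSlug(slug: string, now: Date = new Date()): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp =
    String(now.getUTCFullYear()) +
    pad(now.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(now.getUTCDate()) +
    pad(now.getUTCHours()) +
    pad(now.getUTCMinutes());
  return `${slug}-${stamp}`;
}

console.log(timestampedSlug("pet-adoption-board", new Date(Date.UTC(2026, 2, 10, 18, 53))));
// → pet-adoption-board-202603101853
```

Minute precision still collides if the same slug is deployed twice within one minute, so a caller that batches runs may want to check for that case.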

When to Use This vs benchmark-agents

	benchmark-agents (WezTerm)	benchmark-sandbox
Environment	Local macOS terminal panes	Remote Vercel Sandboxes (Amazon Linux)
Parallelism	Limited by local resources	Up to 10 (Hobby) or 2,000 (Pro) concurrent
Session type	Interactive TTY via `/bin/zsh -ic`	Direct `sh -c` invocation (PTY not required)
Artifact access	Direct filesystem ( `~/.claude/debug/` )	`sandbox.readFile()` / poll via `runCommand`
Port exposure	`localhost:3000`	Public `https://sb-XXX.vercel.run` URLs
Verification	Manual browser check	Automated agent-browser in Phase 2
Deploy	Manual	Automated Phase 3 → permanent `*.vercel.app` URLs
Scoring	Manual review	Haiku structured JSON scoring per phase
Best for	Manual eval + iteration loop	Automated parallel coverage + verification + deploy runs

How It Works

Create fresh sandbox: `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })` (no snapshot)
Install tools: `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
Auth Vercel CLI: Write token to `~/.local/share/com.vercel.cli/auth.json`
Upload plugin: `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
Phase 1 (BUILD): Claude Code builds the app (30 min timeout)
Score build: Haiku evaluates completeness, API routes, UI, AI features
Start dev server: If not already running, start `npx next dev --port 3000`
Extend timeout: `sandbox.extendTimeout()` for verify + deploy + keep-alive
Phase 2 (VERIFY): Claude Code uses `agent-browser` to test user stories (20 min timeout). Prompt tells Claude to start dev server itself if not running.
Score verify: Haiku evaluates each user story as pass/fail with reasons
Re-extract skills: Skills re-collected after verify phase (agent-browser + code fixes trigger more)
Phase 3 (DEPLOY): Claude Code runs `vercel link` + `vercel deploy`, fixes build errors (30 min timeout)
Score deploy: Haiku evaluates deploy success, URL extraction, errors
Re-extract skills: Skills re-collected after deploy phase
Write incremental results: Each scenario writes its own `result.json` immediately on completion (survives crashes)
Extract source archive: `source.tar.gz` of project files saved locally
Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs

Sandbox Session Flow (Per Scenario)

Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
  │
  ├─ npm install -g @anthropic-ai/claude-code vercel agent-browser   (~20s)
  ├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
  ├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
  ├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/  (80 files, ~945KB)
  ├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
  │
  ├─ Phase 1: BUILD
  │   ├─ sandbox.writeFiles() → /tmp/prompt.txt
  │   ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
  │   │   (with AbortSignal.timeout(TIMEOUT_MS))
  │   ├─ Poll every 20s:
  │   │   ├─ ls /tmp/vercel-plugin-*-seen-skills.d/     (claimed skills)
  │   │   ├─ cat /tmp/vercel-plugin-*-seen-skills.txt    (seen skills snapshot)
  │   │   ├─ find ~/.claude/debug -type f                (debug log count)
  │   │   ├─ find <project> -newer /tmp/prompt.txt       (new project files)
  │   │   └─ curl localhost:3000                         (port status)
  │   ├─ Extract build artifacts
  │   └─ Haiku build score (structured JSON)
  │
  ├─ Start dev server (if not already running)
  ├─ sandbox.extendTimeout(...)
  │
  ├─ Phase 2: VERIFY (if >1 project file exists)
  │   ├─ sandbox.writeFiles() → /tmp/verify.txt  (agent-browser verification prompt)
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
  │   │   (with AbortSignal.timeout(1_200_000) — 20 min)
  │   ├─ Re-extract skills (verify phase triggers more)
  │   └─ Haiku verify score (per-story pass/fail JSON)
  │
  ├─ Phase 3: DEPLOY (if >3 project files)
  │   ├─ sandbox.writeFiles() → /tmp/deploy.txt
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
  │   │   (links to vercel-labs, deploys, fixes build errors up to 3x)
  │   ├─ Extract deploy URL from output (*.vercel.app)
  │   ├─ Re-extract skills (deploy phase triggers more)
  │   └─ Haiku deploy score (structured JSON)
  │
  ├─ Write <slug>/result.json immediately (crash-safe)
  ├─ Update aggregate results.json (complete: false until all done)
  ├─ Extract source.tar.gz
  └─ sandbox.stop()  (skipped if --keep-alive)
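
The gating in the flow above (verify needs more than 1 project file, deploy needs more than 3, and either phase can be skipped by flag) can be sketched as a pure function. `PhaseFlags` and `laterPhases` are hypothetical names for illustration; how `run-eval.ts` structures this decision is not shown here.

```typescript
interface PhaseFlags {
  skipVerify: boolean;
  skipDeploy: boolean;
}

// Decides which phases follow BUILD. Thresholds come from the flow above:
// verify runs if >1 project file exists, deploy runs if >3 project files exist.
function laterPhases(projectFileCount: number, flags: PhaseFlags): string[] {
  const phases: string[] = [];
  if (!flags.skipVerify && projectFileCount > 1) phases.push("verify");
  if (!flags.skipDeploy && projectFileCount > 3) phases.push("deploy");
  return phases;
}

const noSkips: PhaseFlags = { skipVerify: false, skipDeploy: false };
console.log(laterPhases(1, noSkips)); // → []
console.log(laterPhases(2, noSkips)); // → [ 'verify' ]
console.log(laterPhases(4, noSkips)); // → [ 'verify', 'deploy' ]
```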

Verification Phase Details

The verify phase is the "closer" — its job is to make the app work and prove it. Key behaviors:

Always runs if >1 project file exists (no longer gated on port 3000 being up)
Starts dev server itself if not already running: the prompt tells Claude to check `localhost:3000` and run `npx next dev --port 3000` if needed
20 minute timeout: enough for agent-browser to open pages, screenshot, interact, fix broken code, restart server, and re-verify
Triggers skill injection: the verify session creates/edits files, triggering PreToolUse and PostToolUse hooks
Uses agent-browser workflow: `open` → `wait --load networkidle` → `screenshot --annotate` → `snapshot -i` → interact → fix → re-verify
Results scored by haiku: no more parsing `STORY_1: PASS` from free text

Deploy Phase Details

The deploy phase uses a full Claude Code session (for skill tracking) to:

Run `vercel link --yes --scope vercel-labs --project <slug>-<YYYYMMDDHHMM>`
Run `vercel deploy --yes`
If build fails, fix code and retry (up to 3 attempts)

Important: unsets the `VERCEL_TOKEN` env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`
Deployment protection is enabled by default on the vercel-labs team

Deploy URL is extracted by regex from Claude's output, with haiku as fallback URL extractor.
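
The regex step can be sketched as follows. The exact pattern used by `run-eval.ts` is not shown in this document, so `VERCEL_APP_URL` and `extractDeployUrl` are assumptions for illustration. This version only matches a single-label host such as `https://name.vercel.app`; a `null` result is where the haiku fallback would take over.

```typescript
// Matches the first https://<name>.vercel.app URL in free text. This pattern is
// an assumption of the sketch, not the one used by run-eval.ts.
const VERCEL_APP_URL = /https:\/\/[a-z0-9][a-z0-9-]*\.vercel\.app\b/i;

function extractDeployUrl(output: string): string | null {
  const match = output.match(VERCEL_APP_URL);
  return match ? match[0] : null;
}

console.log(extractDeployUrl("Deployed to https://pet-adoption-board-202603101853.vercel.app (ready)"));
// → https://pet-adoption-board-202603101853.vercel.app
console.log(extractDeployUrl("Preview at https://sb-XXX.vercel.run")); // → null
```

Sandbox URLs on `vercel.run` are deliberately not matched, since they only live as long as the sandbox.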

DO NOT (Hard Rules)

Same rules as `benchmark-agents`, plus sandbox-specific:

DO NOT use `claude --print` or the `-p` flag for BUILD/VERIFY/DEPLOY phases: hooks don't fire without tool-calling sessions (use `-p` only for haiku scoring passes)
DO NOT let sandboxes run without extracting artifacts: the ephemeral filesystem is lost on stop
DO NOT pass API keys via `writeFiles()`; use `Sandbox.create({ env: { ... } })`
DO NOT skip snapshotting after build: it's your safety net if verify/deploy kills the sandbox
DO NOT use v2 beta SDK: the named sandbox endpoint returns 404 for this team; use v1.8.0
DO NOT use `runCommand` heredocs to write file content; use `sandbox.writeFiles()` instead
DO NOT assume `/home/user/` exists: the home dir is `/home/vercel-sandbox/`
DO NOT use simple project names without timestamps: always append `-YYYYMMDDHHMM` to avoid collisions across runs

Prerequisites

bash

# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local

# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh

Commands

Run eval with dynamic scenarios (recommended)

bash

# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8

# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy

# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

Monitoring While Running

The orchestrator prints live status. For manual checks on a running sandbox:

typescript

// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
  "ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);

// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
  "find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);

// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
  "curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);

// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);

Artifact Export Layout

Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:

<run-id>/
  results.json             # Aggregate results (complete: false until all done, then true)
  report.md                # Markdown report with scores, coverage, URLs
  <slug>/
    result.json            # Per-scenario result (written immediately on completion)
    source.tar.gz          # Project source archive

Each scenario result includes:

`slug`, `sandboxId`, `success`, `durationMs`
`claimedSkills[]`, `expectedSkills[]`, `projectFiles[]`
`appUrl`: public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
`deployUrl`: permanent `https://xxx.vercel.app` URL (if deploy succeeded)
`pollHistory[]`: timestamped skill/file/port snapshots
`verification`: `{ ran, exitCode, stories: [{ index, status }], output }`
`buildScore`: haiku structured completeness assessment
`deployScore`: haiku structured deploy assessment

The markdown report (`report.md`, `.reports/<timestamp>.md`) includes:

Summary table: slug, build status, skills, files, verify results, deploy URL, duration
Per-scenario details: build score, deploy score, verification per-story pass/fail
Skill coverage: expected vs actual per scenario, missing/bonus breakdown
Total unique skills across all scenarios
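
The expected-vs-actual breakdown can be sketched from the `expectedSkills[]` and `claimedSkills[]` fields. `SkillCoverage` and `skillCoverage` are hypothetical names, and treating an empty expectation list as 100% coverage is a choice made in this sketch rather than documented behavior of the report generator.

```typescript
interface SkillCoverage {
  matched: string[];   // expected skills that were claimed
  missing: string[];   // expected skills that were never claimed
  bonus: string[];     // claimed skills that were not expected
  coveragePct: number; // matched / expected, rounded to a whole percent
}

function skillCoverage(expected: string[], claimed: string[]): SkillCoverage {
  const claimedSet = new Set(claimed);
  const expectedSet = new Set(expected);
  const matched = expected.filter((s) => claimedSet.has(s));
  const missing = expected.filter((s) => !claimedSet.has(s));
  const bonus = claimed.filter((s) => !expectedSet.has(s));
  // Assumption: with nothing expected, report full coverage instead of dividing by zero.
  const coveragePct =
    expected.length === 0 ? 100 : Math.round((matched.length / expected.length) * 100);
  return { matched, missing, bonus, coveragePct };
}

const coverage = skillCoverage(
  ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
  ["nextjs", "ai-sdk", "shadcn", "vercel-functions", "cron-jobs"],
);
console.log(coverage.coveragePct, coverage.missing, coverage.bonus);
```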

Proven Results (2026-03-10)

Across 34 scenarios run in 5 batches:

Metric	Best	Typical
Skills per scenario	31 (ai-interior-designer)	12-24
Expected skill coverage	100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6)	50-86%
User stories verified	3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board)	varies
Files built per scenario	37 (student-study-groups)	6-25
Build time	5-11 min	5-7 min

Key findings:

User-story-focused prompts (no tech name-dropping) work: the plugin detects patterns from actual code
`ai-sdk`, `shadcn`, `nextjs`, `vercel-functions` are the most consistently detected skills
`cron-jobs`, `routing-middleware` need Claude to write specific file patterns to trigger
Lexical prompt inject (UserPromptSubmit) is working: skills are injected before any files are written
`session-end-cleanup` deletes claim dirs, so use poll history for final skill counts
Enterprise tier (vercel-labs): no sandbox time cap; builds ran 10+ minutes

Known Limitations


Snapshot stops the source sandbox: `sandbox.snapshot()` stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
v2 beta incompatible: the named sandbox endpoint of `@vercel/sandbox@2.0.0-beta.3` returns 404 for this team. Stick with v1.8.0.
Artifact window: Extract artifacts before `sandbox.stop()`, because the filesystem is ephemeral. The session cleanup hook may delete claim dirs before extraction.
Amazon Linux paths: User is `vercel-sandbox` (home at `/home/vercel-sandbox/`), NOT `/home/user/` or `/root/`.
`--dangerously-skip-permissions` parity: Sandbox evals auto-approve all tool calls. WezTerm evals use the normal permission flow. Coverage results may differ.
`runCommand` timeout: Use `{ signal: AbortSignal.timeout(ms) }`; the `{ timeout }` option is silently ignored.
BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic is recommended for production runs.
Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable `*.vercel.app` URL. The haiku scoring step attempts to extract the URL as a fallback.
Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.
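Two of the limitations above, the ignored `{ timeout }` option and transient sandbox creation failures, lend themselves to small wrappers. The following is a sketch under stated assumptions rather than the runner's actual code: `withRetry` and `runWithTimeout` are hypothetical helpers, the retry count and backoff delays are arbitrary, and `run` stands in for any call (such as `runCommand`) that accepts a `{ signal }` option.

```typescript
// Hypothetical helper: retry an async operation that may fail transiently,
// for example sandbox creation hitting a BrotliDecompressionError.
// This sketch retries on any error; a real runner might filter by error type.
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}

// Hypothetical helper: enforce a time limit through an AbortSignal,
// since a plain { timeout } option is reported above as silently ignored.
function runWithTimeout<T>(
  run: (opts: { signal: AbortSignal }) => Promise<T>,
  ms: number,
): Promise<T> {
  return run({ signal: AbortSignal.timeout(ms) });
}
```

In a runner, `op` would wrap the sandbox creation call and `run` would forward `signal` into the `runCommand` options. The time limit only takes effect if the wrapped call actually honors the signal.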

