benchmark-sandbox
Benchmark Sandbox: Remote Eval via Vercel Sandboxes
Run benchmark scenarios inside Vercel Sandboxes, which are ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:
- Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`
- Phase 2 (VERIFY): A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout)
- Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
Skills are tracked across all 3 phases: each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.
Proven Working Script
Use `run-eval.ts`, the proven eval runner:
bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts
# With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json
# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8
# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy
# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
CLI Flags
| Flag | Default | Description |
|---|---|---|
| | 5 | Max parallel sandboxes (max 10) |
| | 1800000 (30 min) | Per-phase timeout in ms |
| `--keep-alive` | off | Keep sandboxes running after eval |
| `--keep-hours` | 8 | Hours to keep alive (with `--keep-alive`) |
| `--skip-verify` | off | Skip the agent-browser verification phase |
| `--skip-deploy` | off | Skip the Vercel deploy phase |
| `--scenarios` | all | Only run specific scenarios by slug |
| `--scenarios-file` | (none) | Load scenarios from a JSON file instead of built-in defaults |
Dynamic Scenarios (Recommended Approach)
Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.
Scenario JSON Format
json
[
{
"slug": "pet-adoption-board",
"prompt": "Build me a pet adoption listing board where shelters can post animals...",
"expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
"userStories": [
"As a visitor, I can see a grid of pet listings with photos and names",
"As a visitor, I can click a pet card to see a detail page",
"As a visitor, I can filter pets by type"
]
}
]
Each scenario needs: `slug` (string), `prompt` (string), `expectedSkills` (string[]), `userStories` (tuple of exactly 3 strings).
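The scenario shape described above can be checked before a run starts. The following is a minimal sketch, not the validation that `run-eval.ts` actually performs; the `Scenario` type and the `validateScenarios` and `parseScenarios` helpers are hypothetical names introduced here for illustration.

```typescript
// Hypothetical types and helpers; not part of run-eval.ts.
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Returns human-readable problems; an empty list means the input is valid.
function validateScenarios(input: unknown): string[] {
  if (!Array.isArray(input)) return ["top-level value must be an array"];
  const problems: string[] = [];
  input.forEach((item, i) => {
    const s = (item ?? {}) as Record<string, unknown>;
    if (typeof s.slug !== "string" || s.slug.length === 0) {
      problems.push(`[${i}] slug must be a non-empty string`);
    }
    if (typeof s.prompt !== "string" || s.prompt.length === 0) {
      problems.push(`[${i}] prompt must be a non-empty string`);
    }
    if (!isStringArray(s.expectedSkills)) {
      problems.push(`[${i}] expectedSkills must be an array of strings`);
    }
    if (!isStringArray(s.userStories) || s.userStories.length !== 3) {
      problems.push(`[${i}] userStories must contain exactly 3 strings`);
    }
  });
  return problems;
}

// Throws with all problems listed, otherwise returns the typed scenarios.
function parseScenarios(input: unknown): Scenario[] {
  const problems = validateScenarios(input);
  if (problems.length > 0) {
    throw new Error(`Invalid scenarios:\n${problems.join("\n")}`);
  }
  return input as Scenario[];
}
```

Collecting every problem before throwing means a malformed scenarios file can be fixed in one pass instead of one error at a time.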
Prompt Design Guidelines
- Focus on what the user wants, not what tech to use
- Describe real-world apps that solve real problems with friendly, stylish UX
- Include AI features naturally (recommendations, analysis, generation)
- Always end with: `"Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with \`npx next dev --port 3000\`."`
- Include storage needs (photos, uploads) to trigger vercel-storage
- Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
- Include auth/middleware to trigger routing-middleware
Structured Scoring (Haiku)
Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass with no tools and no hooks: it just reads the phase output and returns structured data.
Build Score Schema
json
{
"completeness": "complete|partial|minimal|empty",
"hasApiRoutes": true,
"hasUIComponents": true,
"hasAIFeature": true,
"devServerRunning": true,
"missingFeatures": ["feature1"],
"summary": "Brief assessment"
}
Verify Score Schema (per user story)
json
{
"stories": [
{ "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
]
}
Deploy Score Schema
json
{
"deployed": true,
"url": "https://xxx.vercel.app",
"buildSucceeded": true,
"errors": [],
"summary": "Brief assessment"
}
Important: The `claude -p --output-format json` response wraps results. The actual schema data is in `parsed.structured_output`, not the top-level object.
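The unwrapping step can be sketched as follows. This is an illustrative helper, not the code in `run-eval.ts`: `extractStructuredOutput` and the `DeployScore` type are hypothetical names, and the only thing assumed about the wrapper is the `structured_output` field named above.

```typescript
// Shape copied from the Deploy Score Schema; the type name is hypothetical.
interface DeployScore {
  deployed: boolean;
  url: string;
  buildSucceeded: boolean;
  errors: string[];
  summary: string;
}

// Hypothetical helper: parses raw CLI stdout and returns the wrapped payload.
function extractStructuredOutput<T>(stdout: string): T {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null || !("structured_output" in parsed)) {
    throw new Error("response has no structured_output field");
  }
  const payload = (parsed as { structured_output: unknown }).structured_output;
  if (payload === undefined || payload === null) {
    throw new Error("structured_output is empty");
  }
  // The cast trusts the JSON schema passed to the CLI; add runtime checks if needed.
  return payload as T;
}
```

Reading fields such as `deployed` directly from the top-level parsed object would yield `undefined`, which is why the helper fails loudly when the wrapper field is missing.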
Critical Sandbox Environment Facts
| Property | Value |
|---|---|
| Home directory | `/home/vercel-sandbox/` |
| User | `vercel-sandbox` |
| Claude binary | |
| PATH (via sh -c) | Includes |
| Port exposure | |
| Snapshot persistence | Files AND npm globals survive snapshot restore; use `sandbox.snapshot()` |
| SDK version | v1.8.0 |
| Team tier | Enterprise (vercel-labs), no known sandbox time cap |
Key Discoveries (Hard-Won)
- Snapshots work: `sandbox.snapshot()` preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox, so create a new one from the snapshot to continue.
- Plugin install: Use `npx add-plugin <path> -s project -y --target claude-code`. It works because claude is in PATH after `npm install -g`. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized `~/.claude/` dir.
- File uploads: Use `sandbox.writeFiles([{ path, content: Buffer }])`, NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
- Claude flags: Always use `--dangerously-skip-permissions --debug`. The `--debug` flag writes to `~/.claude/debug/`.
- Auth: API key from macOS Keychain (`ANTHROPIC_AUTH_TOKEN`, a `vck_*` Vercel Claude Key for AI Gateway), Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
- OIDC for sandbox SDK: Run `npx vercel link --scope vercel-labs -y` + `npx vercel env pull` once before first use.
- Port exposure: Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0: the URL is assigned at creation time, before anything listens.
- extendTimeout: Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working: it extends by the requested duration. Use this for overnight keep-alive.
- Background commands: `runCommand` with backgrounded processes (`&` or `nohup`) may throw ZodError on v1. Write a script file first, then execute it.
- Session cleanup race: The `session-end-cleanup.mjs` hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
- agent-browser works in sandboxes: Install via `npm install -g agent-browser`. Claude Code can use it for browser-based verification inside the sandbox.
- No hobby tier cap: Early 301s timeouts were from lower default timeout values in earlier script iterations, not a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap; sandboxes ran 10+ minutes successfully.
- `claude -p` works inside sandboxes: `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. No nesting issue when running inside a sandbox (it only fails when running Claude inside Claude on the same machine).
- Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects, and we generate many per day. Format: `<slug>-<YYYYMMDDHHMM>`
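The `<slug>-<YYYYMMDDHHMM>` naming rule from the last bullet can be sketched as a small helper. The function name `timestampedProjectName` is hypothetical, and formatting the timestamp in UTC is an assumption; the document does not say which time zone the runner uses.

```typescript
// Hypothetical helper; UTC is an assumption, not something the source specifies.
function timestampedProjectName(slug: string, now: Date = new Date()): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp =
    String(now.getUTCFullYear()) +
    pad(now.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(now.getUTCDate()) +
    pad(now.getUTCHours()) +
    pad(now.getUTCMinutes());
  return `${slug}-${stamp}`;
}
```

Minute precision keeps names unique across runs started at least a minute apart; two runs of the same slug within the same minute would still collide.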
When to Use This vs benchmark-agents
| | benchmark-agents (WezTerm) | benchmark-sandbox |
|---|---|---|
| Environment | Local macOS terminal panes | Remote Vercel Sandboxes (Amazon Linux) |
| Parallelism | Limited by local resources | Up to 10 (Hobby) or 2,000 (Pro) concurrent |
| Session type | Interactive TTY via | Direct |
| Artifact access | Direct filesystem ( | |
| Port exposure | | Public |
| Verification | Manual browser check | Automated agent-browser in Phase 2 |
| Deploy | Manual | Automated Phase 3 → permanent |
| Scoring | Manual review | Haiku structured JSON scoring per phase |
| Best for | Manual eval + iteration loop | Automated parallel coverage + verification + deploy runs |
How It Works
- Create fresh sandbox: `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })` (no snapshot)
- Install tools: `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
- Auth Vercel CLI: Write token to `~/.local/share/com.vercel.cli/auth.json`
- Upload plugin: `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
- Phase 1 (BUILD): Claude Code builds the app (30 min timeout)
- Score build: Haiku evaluates completeness, API routes, UI, AI features
- Start dev server: If not already running, start `npx next dev --port 3000`
- Extend timeout: `sandbox.extendTimeout()` for verify + deploy + keep-alive
- Phase 2 (VERIFY): Claude Code uses `agent-browser` to test user stories (20 min timeout). Prompt tells Claude to start dev server itself if not running.
- Score verify: Haiku evaluates each user story as pass/fail with reasons
- Re-extract skills: Skills re-collected after verify phase (agent-browser + code fixes trigger more)
- Phase 3 (DEPLOY): Claude Code runs `vercel link` + `vercel deploy`, fixes build errors (30 min timeout)
- Score deploy: Haiku evaluates deploy success, URL extraction, errors
- Re-extract skills: Skills re-collected after deploy phase
- Write incremental results: Each scenario writes its own `result.json` immediately on completion (survives crashes)
- Extract source archive: `source.tar.gz` of project files saved locally
- Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs
Sandbox Session Flow (Per Scenario)
Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
│
├─ npm install -g @anthropic-ai/claude-code vercel agent-browser (~20s)
├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/ (80 files, ~945KB)
├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
│
├─ Phase 1: BUILD
│ ├─ sandbox.writeFiles() → /tmp/prompt.txt
│ ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
│ │ (with AbortSignal.timeout(TIMEOUT_MS))
│ ├─ Poll every 20s:
│ │ ├─ ls /tmp/vercel-plugin-*-seen-skills.d/ (claimed skills)
│ │ ├─ cat /tmp/vercel-plugin-*-seen-skills.txt (seen skills snapshot)
│ │ ├─ find ~/.claude/debug -type f (debug log count)
│ │ ├─ find <project> -newer /tmp/prompt.txt (new project files)
│ │ └─ curl localhost:3000 (port status)
│ ├─ Extract build artifacts
│ └─ Haiku build score (structured JSON)
│
├─ Start dev server (if not already running)
├─ sandbox.extendTimeout(...)
│
├─ Phase 2: VERIFY (if >1 project file exists)
│ ├─ sandbox.writeFiles() → /tmp/verify.txt (agent-browser verification prompt)
│ ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
│ │ (with AbortSignal.timeout(1_200_000) — 20 min)
│ ├─ Re-extract skills (verify phase triggers more)
│ └─ Haiku verify score (per-story pass/fail JSON)
│
├─ Phase 3: DEPLOY (if >3 project files)
│ ├─ sandbox.writeFiles() → /tmp/deploy.txt
│ ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
│ │ (links to vercel-labs, deploys, fixes build errors up to 3x)
│ ├─ Extract deploy URL from output (*.vercel.app)
│ ├─ Re-extract skills (deploy phase triggers more)
│ └─ Haiku deploy score (structured JSON)
│
├─ Write <slug>/result.json immediately (crash-safe)
├─ Update aggregate results.json (complete: false until all done)
├─ Extract source.tar.gz
└─ sandbox.stop() (skipped if --keep-alive)
Verification Phase Details
The verify phase is the "closer" — its job is to make the app work and prove it. Key behaviors:
- Always runs if >1 project file exists (no longer gated on port 3000 being up)
- Starts dev server itself if not already running: the prompt tells Claude to check `localhost:3000` and run `npx next dev --port 3000` if needed
- 20 minute timeout: enough for agent-browser to open pages, screenshot, interact, fix broken code, restart server, and re-verify
- Triggers skill injection: the verify session creates/edits files, triggering PreToolUse and PostToolUse hooks
- Uses agent-browser workflow: `open` → `wait --load networkidle` → `screenshot --annotate` → `snapshot -i` → interact → fix → re-verify
- Results scored by haiku: no more parsing `STORY_1: PASS` from free text
Deploy Phase Details
The deploy phase uses a full Claude Code session (for skill tracking) to:
- Run `vercel link --yes --scope vercel-labs --project <slug>-YYYYMMDD`
- Run `vercel deploy --yes`
- If build fails, fix code and retry (up to 3 attempts)
- Important: unsets the `VERCEL_TOKEN` env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`
- Deployment protection is enabled by default on vercel-labs team
Deploy URL is extracted by regex from Claude's output, with haiku as fallback URL extractor.
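A regex pass of this kind might look like the sketch below. The actual pattern used by the runner is not shown in this document, so the expression and the `extractDeployUrl` name are assumptions; picking the last match (on the guess that retries print earlier, superseded URLs first) is also a design choice of this sketch.

```typescript
// Hypothetical helper; the real pattern in run-eval.ts may differ.
function extractDeployUrl(output: string): string | null {
  // Matches https://<name>.vercel.app where <name> is letters, digits, and hyphens.
  const matches = output.match(/https:\/\/[a-z0-9][a-z0-9-]*\.vercel\.app\b/gi);
  const last = matches ? matches[matches.length - 1] : undefined;
  return last ?? null;
}
```

When this returns `null`, the caller would fall back to the `url` field of the haiku deploy score.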
DO NOT (Hard Rules)
Same rules as `benchmark-agents`, plus sandbox-specific:
- DO NOT use `claude --print` or `-p` flag for BUILD/VERIFY/DEPLOY phases: hooks don't fire without tool-calling sessions (use `-p` only for haiku scoring passes)
- DO NOT let sandboxes run without extracting artifacts: ephemeral filesystem is lost on stop
- DO NOT pass API keys via `writeFiles()`; use `Sandbox.create({ env: { ... } })`
- DO NOT skip snapshotting after build: it's your safety net if verify/deploy kills the sandbox
- DO NOT use v2 beta SDK: named sandbox endpoint returns 404 for this team; use v1.8.0
- DO NOT use `runCommand` heredocs to write file content; use `sandbox.writeFiles()` instead
- DO NOT assume `/home/user/` exists: the home dir is `/home/vercel-sandbox/`
- DO NOT use simple project names without timestamps: always append `-YYYYMMDDHHMM` to avoid collisions across runs
Prerequisites
bash
# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local
# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh
Commands
Run eval with dynamic scenarios (recommended)
bash
# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json
# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8
# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy
# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
Monitoring While Running
The orchestrator prints live status. For manual checks on a running sandbox:
typescript
// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
"ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);
// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
"find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);
// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
"curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);
// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);
Artifact Export Layout
Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:
<run-id>/
  results.json       # Aggregate results (complete: false until all done, then true)
  report.md          # Markdown report with scores, coverage, URLs
  <slug>/
    result.json      # Per-scenario result (written immediately on completion)
    source.tar.gz    # Project source archive
Each scenario result includes:
- `slug`, `sandboxId`, `success`, `durationMs`
- `claimedSkills[]`, `expectedSkills[]`, `projectFiles[]`
- `appUrl`: public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
- `deployUrl`: permanent `https://xxx.vercel.app` URL (if deploy succeeded)
- `pollHistory[]`: timestamped skill/file/port snapshots
- `verification`: `{ ran, exitCode, stories: [{ index, status }], output }`
- `buildScore`: haiku structured completeness assessment
- `deployScore`: haiku structured deploy assessment

The markdown report (`report.md` / `.reports/<timestamp>.md`) includes:
- Summary table: slug, build status, skills, files, verify results, deploy URL, duration
- Per-scenario details: build score, deploy score, verification per-story pass/fail
- Skill coverage: expected vs actual per scenario, missing/bonus breakdown
- Total unique skills across all scenarios
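The expected-vs-actual skill breakdown can be computed from the `expectedSkills[]` and `claimedSkills[]` fields of a scenario result. The sketch below is one plausible way to do it; `skillCoverage` and `SkillCoverage` are hypothetical names, and treating an empty expected list as full coverage is an assumption.

```typescript
// Hypothetical result shape for the coverage breakdown.
interface SkillCoverage {
  matched: string[]; // expected and claimed
  missing: string[]; // expected but never claimed
  bonus: string[];   // claimed but not expected
  coverage: number;  // matched / expected, in the range 0..1
}

function skillCoverage(expectedSkills: string[], claimedSkills: string[]): SkillCoverage {
  const expected = new Set(expectedSkills);
  const claimed = new Set(claimedSkills);
  const matched = Array.from(expected).filter((s) => claimed.has(s));
  const missing = Array.from(expected).filter((s) => !claimed.has(s));
  const bonus = Array.from(claimed).filter((s) => !expected.has(s));
  // Assumption: a scenario that expects nothing counts as fully covered.
  const coverage = expected.size === 0 ? 1 : matched.length / expected.size;
  return { matched, missing, bonus, coverage };
}
```

Using sets deduplicates skills that were claimed in more than one phase, so re-extraction after verify and deploy does not inflate the counts.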
Proven Results (2026-03-10)
Across 34 scenarios run in 5 batches:
| Metric | Best | Typical |
|---|---|---|
| Skills per scenario | 31 (ai-interior-designer) | 12-24 |
| Expected skill coverage | 100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6) | 50-86% |
| User stories verified | 3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board) | varies |
| Files built per scenario | 37 (student-study-groups) | 6-25 |
| Build time | 5-11 min | 5-7 min |
Key findings:
- User-story-focused prompts (no tech name-dropping) work — plugin detects patterns from actual code
- `ai-sdk`, `shadcn`, `nextjs`, `vercel-functions` are the most consistently detected skills
- `cron-jobs`, `routing-middleware` need Claude to write specific file patterns to trigger
- Lexical prompt inject (UserPromptSubmit) working: skills injected before any files written
- `session-end-cleanup` deletes claim dirs; use poll history for final skill counts
- Enterprise tier (vercel-labs): no sandbox time cap; builds ran 10+ minutes
Known Limitations
- Snapshot stops the source sandbox: `sandbox.snapshot()` stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
- v2 beta incompatible: `@vercel/sandbox@2.0.0-beta.3`'s named sandbox endpoint returns 404 for this team. Stick with v1.8.0.
- Artifact window: Must extract before `sandbox.stop()` because the filesystem is ephemeral. Session cleanup hook may delete claim dirs before extraction.
- Amazon Linux paths: User is `vercel-sandbox` (home at `/home/vercel-sandbox/`). NOT `/home/user/` or `/root/`.
- `--dangerously-skip-permissions` parity: Sandbox evals auto-approve all tool calls. WezTerm evals use normal permission flow. Coverage results may differ.
- `runCommand` timeout: Use `{ signal: AbortSignal.timeout(ms) }`; the `{ timeout }` option is silently ignored.
- BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic recommended for production runs.
- Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable `*.vercel.app` URL. The haiku scoring step provides a fallback URL extraction attempt.
- Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.