benchmark-sandbox

Benchmark Sandbox — Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs running node24. Each sandbox gets a fresh install of Claude Code, the Vercel CLI, and agent-browser, has the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:
  • Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`
  • Phase 2 (VERIFY): A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout)
  • Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
Skills are tracked across all 3 phases; each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.

Proven Working Script

Use `run-eval.ts`, the proven eval runner:
bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8

# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy

# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--concurrency N` | 5 | Max parallel sandboxes (max 10) |
| `--timeout MS` | 1800000 (30 min) | Per-phase timeout in ms |
| `--keep-alive` | off | Keep sandboxes running after eval |
| `--keep-hours N` | 8 | Hours to keep alive (with `--keep-alive`) |
| `--skip-verify` | off | Skip the agent-browser verification phase |
| `--skip-deploy` | off | Skip the Vercel deploy phase |
| `--scenarios a,b,c` | all | Only run specific scenarios by slug |
| `--scenarios-file path` | | Load scenarios from a JSON file instead of built-in defaults |
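To make the flag defaults concrete, here is a minimal sketch of a parser for the flags listed above. It is not the parser used by `run-eval.ts`: the `EvalOptions` shape, the `parseFlags` name, and the choice to clamp `--concurrency` to 10 (rather than reject larger values) are all assumptions made for illustration.

```typescript
// Hypothetical option shape; field names are illustrative, not taken from run-eval.ts.
interface EvalOptions {
  concurrency: number;
  timeoutMs: number;
  keepAlive: boolean;
  keepHours: number;
  skipVerify: boolean;
  skipDeploy: boolean;
  scenarios: string[] | null; // null means "all"
  scenariosFile: string | null;
}

const MAX_CONCURRENCY = 10;

function toNumber(flag: string, raw: string): number {
  const n = Number(raw);
  if (!Number.isFinite(n)) throw new Error(`${flag} expects a number, got "${raw}"`);
  return n;
}

function parseFlags(argv: string[]): EvalOptions {
  // Defaults mirror the flag list above.
  const opts: EvalOptions = {
    concurrency: 5,
    timeoutMs: 1_800_000,
    keepAlive: false,
    keepHours: 8,
    skipVerify: false,
    skipDeploy: false,
    scenarios: null,
    scenariosFile: null,
  };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i] ?? "";
    // Consume the value that follows a flag, failing loudly if it is missing.
    const takeValue = (): string => {
      i += 1;
      const value = argv[i];
      if (value === undefined) throw new Error(`Missing value for ${flag}`);
      return value;
    };
    switch (flag) {
      case "--concurrency":
        // Assumption: values above the documented maximum are clamped.
        opts.concurrency = Math.min(MAX_CONCURRENCY, Math.max(1, toNumber(flag, takeValue())));
        break;
      case "--timeout":
        opts.timeoutMs = toNumber(flag, takeValue());
        break;
      case "--keep-alive":
        opts.keepAlive = true;
        break;
      case "--keep-hours":
        opts.keepHours = toNumber(flag, takeValue());
        break;
      case "--skip-verify":
        opts.skipVerify = true;
        break;
      case "--skip-deploy":
        opts.skipDeploy = true;
        break;
      case "--scenarios":
        opts.scenarios = takeValue().split(",").map((s) => s.trim()).filter((s) => s !== "");
        break;
      case "--scenarios-file":
        opts.scenariosFile = takeValue();
        break;
      default:
        throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return opts;
}
```

For example, `parseFlags(["--keep-alive", "--keep-hours", "8"])` yields the defaults with `keepAlive` set to `true`.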

Dynamic Scenarios (Recommended Approach)


Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build, expressed as user stories, with no tech name-dropping. Let the plugin figure out which Vercel tech to inject.

Scenario JSON Format


json
[
  {
    "slug": "pet-adoption-board",
    "prompt": "Build me a pet adoption listing board where shelters can post animals...",
    "expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
    "userStories": [
      "As a visitor, I can see a grid of pet listings with photos and names",
      "As a visitor, I can click a pet card to see a detail page",
      "As a visitor, I can filter pets by type"
    ]
  }
]
Each scenario needs: `slug` (string), `prompt` (string), `expectedSkills` (string[]), `userStories` (tuple of exactly 3 strings).
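The field requirements above can be checked before a run starts. The following is a minimal validation sketch, assuming the scenario file is read into a string first; `Scenario`, `parseScenarios`, and the error messages are illustrative names, not part of `run-eval.ts`.

```typescript
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parse a scenarios JSON document and reject entries that do not match the format.
function parseScenarios(json: string): Scenario[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) throw new Error("Scenario file must contain a JSON array");
  return data.map((entry: unknown, i: number): Scenario => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`Scenario ${i}: expected an object`);
    }
    const { slug, prompt, expectedSkills, userStories } = entry as Record<string, unknown>;
    if (typeof slug !== "string" || slug === "") {
      throw new Error(`Scenario ${i}: "slug" must be a non-empty string`);
    }
    if (typeof prompt !== "string" || prompt === "") {
      throw new Error(`Scenario ${slug}: "prompt" must be a non-empty string`);
    }
    if (!isStringArray(expectedSkills)) {
      throw new Error(`Scenario ${slug}: "expectedSkills" must be a string array`);
    }
    if (!isStringArray(userStories) || userStories.length !== 3) {
      throw new Error(`Scenario ${slug}: "userStories" must contain exactly 3 strings`);
    }
    return {
      slug,
      prompt,
      expectedSkills,
      userStories: userStories as [string, string, string],
    };
  });
}
```

Failing fast here avoids spending sandbox time on a scenario that the verify phase could not score per story.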

Prompt Design Guidelines

  • Focus on what the user wants, not what tech to use
  • Describe real-world apps that solve real problems with friendly, stylish UX
  • Include AI features naturally (recommendations, analysis, generation)
  • Always end with: "Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with `npx next dev --port 3000`."
  • Include storage needs (photos, uploads) to trigger vercel-storage
  • Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
  • Include auth/middleware to trigger routing-middleware
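Since every prompt must end with the same sentence, a scenario generator can append it mechanically. This is a small sketch under that assumption; `REQUIRED_SUFFIX` and `withRequiredSuffix` are hypothetical helper names.

```typescript
// The closing sentence quoted in the guidelines above.
const REQUIRED_SUFFIX =
  "Link the project to my vercel-labs team. After building all files, " +
  "start the dev server on port 3000 with `npx next dev --port 3000`.";

// Append the required closing sentence unless the prompt already ends with it.
function withRequiredSuffix(prompt: string): string {
  const trimmed = prompt.trimEnd();
  return trimmed.endsWith(REQUIRED_SUFFIX) ? trimmed : `${trimmed} ${REQUIRED_SUFFIX}`;
}
```

The helper is idempotent, so it is safe to run over hand-written prompts that already include the sentence.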

Structured Scoring (Haiku)


Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass with no tools and no hooks: it just reads the phase output and returns structured data.

Build Score Schema


json
{
  "completeness": "complete|partial|minimal|empty",
  "hasApiRoutes": true,
  "hasUIComponents": true,
  "hasAIFeature": true,
  "devServerRunning": true,
  "missingFeatures": ["feature1"],
  "summary": "Brief assessment"
}

Verify Score Schema (per user story)


json
{
  "stories": [
    { "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
  ]
}

Deploy Score Schema


json
{
  "deployed": true,
  "url": "https://xxx.vercel.app",
  "buildSucceeded": true,
  "errors": [],
  "summary": "Brief assessment"
}
Important: The `claude -p --output-format json` response wraps results. The actual schema data is in `parsed.structured_output`, not the top-level object.
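The unwrapping step described above can be sketched as follows. Only the `structured_output` field comes from this document; the function name and the error message are illustrative, and no other field of the wrapper is assumed.

```typescript
// Pull the schema-conforming payload out of a `claude -p --output-format json` response.
function extractStructuredOutput(stdout: string): unknown {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null || !("structured_output" in parsed)) {
    throw new Error("No structured_output field in claude -p response");
  }
  return (parsed as { structured_output: unknown }).structured_output;
}
```

The return type is deliberately `unknown`: the caller still needs to check the payload against the build, verify, or deploy schema before trusting it.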

Critical Sandbox Environment Facts


| Property | Value |
| --- | --- |
| Home directory | `/home/vercel-sandbox` (NOT `/home/user/` or `/root/`) |
| User | `vercel-sandbox` (NOT `root`) |
| Claude binary | `/home/vercel-sandbox/.global/npm/bin/claude` |
| PATH (via `sh -c`) | Includes `~/.global/npm/bin`, so claude is findable by name |
| Port exposure | `sandbox.domain(3000)` → `https://subdomain.vercel.run` |
| Snapshot persistence | Files AND npm globals survive snapshot restore. Use `sandbox.snapshot()` → `Sandbox.create({ source: { type: "snapshot", snapshotId } })` |
| SDK version | `@vercel/sandbox@1.8.0` (v2 beta's named sandbox endpoint returns 404 for this team) |
| Team tier | Enterprise (vercel-labs), no known sandbox time cap |

Key Discoveries (Hard-Won)


  1. Snapshots work: `sandbox.snapshot()` preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox, so create a new one from the snapshot to continue.
  2. Plugin install: Use `npx add-plugin <path> -s project -y --target claude-code`. This works because claude is in PATH after `npm install -g`. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized `~/.claude/` dir.
  3. File uploads: Use `sandbox.writeFiles([{ path, content: Buffer }])`, NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
  4. Claude flags: Always use `--dangerously-skip-permissions --debug`. The `--debug` flag writes to `~/.claude/debug/`.
  5. Auth: API key from macOS Keychain (`ANTHROPIC_AUTH_TOKEN`, a `vck_*` Vercel Claude Key for AI Gateway), Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
  6. OIDC for sandbox SDK: Run `npx vercel link --scope vercel-labs -y` + `npx vercel env pull` once before first use.
  7. Port exposure: Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0; the URL is assigned at creation time, before anything listens.
  8. extendTimeout: Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working: it extends by the requested duration. Use this for overnight keep-alive.
  9. Background commands: `runCommand` with backgrounded processes (`&` or `nohup`) may throw ZodError on v1. Write a script file first, then execute it.
  10. Session cleanup race: The `session-end-cleanup.mjs` hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
  11. agent-browser works in sandboxes: Install via `npm install -g agent-browser`. Claude Code can use it for browser-based verification inside the sandbox.
  12. No hobby tier cap: Early 301s timeouts were from lower default timeout values in earlier script iterations, not a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap; sandboxes ran 10+ minutes successfully.
  13. `claude -p` works inside sandboxes: `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. No nesting issue when running inside a sandbox (it only fails when running Claude inside Claude on the same machine).
  14. Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects, and many are generated per day. Format: `<slug>-<YYYYMMDDHHMM>`.

When to Use This vs benchmark-agents


| | benchmark-agents (WezTerm) | benchmark-sandbox |
| --- | --- | --- |
| Environment | Local macOS terminal panes | Remote Vercel Sandboxes (Amazon Linux) |
| Parallelism | Limited by local resources | Up to 10 (Hobby) or 2,000 (Pro) concurrent |
| Session type | Interactive TTY via `/bin/zsh -ic` | Direct `sh -c` invocation (PTY not required) |
| Artifact access | Direct filesystem (`~/.claude/debug/`) | `sandbox.readFile()` / poll via `runCommand` |
| Port exposure | `localhost:3000` | Public `https://sb-XXX.vercel.run` URLs |
| Verification | Manual browser check | Automated agent-browser in Phase 2 |
| Deploy | Manual | Automated Phase 3 → permanent `*.vercel.app` URLs |
| Scoring | Manual review | Haiku structured JSON scoring per phase |
| Best for | Manual eval + iteration loop | Automated parallel coverage + verification + deploy runs |

How It Works


  1. Create fresh sandbox: `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })` (no snapshot)
  2. Install tools: `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
  3. Auth Vercel CLI: Write token to `~/.local/share/com.vercel.cli/auth.json`
  4. Upload plugin: `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
  5. Phase 1 (BUILD): Claude Code builds the app (30 min timeout)
  6. Score build: Haiku evaluates completeness, API routes, UI, AI features
  7. Start dev server: If not already running, start `npx next dev --port 3000`
  8. Extend timeout: `sandbox.extendTimeout()` for verify + deploy + keep-alive
  9. Phase 2 (VERIFY): Claude Code uses `agent-browser` to test user stories (20 min timeout). The prompt tells Claude to start the dev server itself if it is not running.
  10. Score verify: Haiku evaluates each user story as pass/fail with reasons
  11. Re-extract skills: Skills re-collected after verify phase (agent-browser + code fixes trigger more)
  12. Phase 3 (DEPLOY): Claude Code runs `vercel link` + `vercel deploy`, fixes build errors (30 min timeout)
  13. Score deploy: Haiku evaluates deploy success, URL extraction, errors
  14. Re-extract skills: Skills re-collected after deploy phase
  15. Write incremental results: Each scenario writes its own `result.json` immediately on completion (survives crashes)
  16. Extract source archive: `source.tar.gz` of project files saved locally
  17. Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs

Sandbox Session Flow (Per Scenario)


Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
  ├─ npm install -g @anthropic-ai/claude-code vercel agent-browser   (~20s)
  ├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
  ├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
  ├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/  (80 files, ~945KB)
  ├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
  ├─ Phase 1: BUILD
  │   ├─ sandbox.writeFiles() → /tmp/prompt.txt
  │   ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
  │   │   (with AbortSignal.timeout(TIMEOUT_MS))
  │   ├─ Poll every 20s:
  │   │   ├─ ls /tmp/vercel-plugin-*-seen-skills.d/     (claimed skills)
  │   │   ├─ cat /tmp/vercel-plugin-*-seen-skills.txt    (seen skills snapshot)
  │   │   ├─ find ~/.claude/debug -type f                (debug log count)
  │   │   ├─ find <project> -newer /tmp/prompt.txt       (new project files)
  │   │   └─ curl localhost:3000                         (port status)
  │   ├─ Extract build artifacts
  │   └─ Haiku build score (structured JSON)
  ├─ Start dev server (if not already running)
  ├─ sandbox.extendTimeout(...)
  ├─ Phase 2: VERIFY (if >1 project file exists)
  │   ├─ sandbox.writeFiles() → /tmp/verify.txt  (agent-browser verification prompt)
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
  │   │   (with AbortSignal.timeout(1_200_000) — 20 min)
  │   ├─ Re-extract skills (verify phase triggers more)
  │   └─ Haiku verify score (per-story pass/fail JSON)
  ├─ Phase 3: DEPLOY (if >3 project files)
  │   ├─ sandbox.writeFiles() → /tmp/deploy.txt
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
  │   │   (links to vercel-labs, deploys, fixes build errors up to 3x)
  │   ├─ Extract deploy URL from output (*.vercel.app)
  │   ├─ Re-extract skills (deploy phase triggers more)
  │   └─ Haiku deploy score (structured JSON)
  ├─ Write <slug>/result.json immediately (crash-safe)
  ├─ Update aggregate results.json (complete: false until all done)
  ├─ Extract source.tar.gz
  └─ sandbox.stop()  (skipped if --keep-alive)
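The phase gates in the flow above (verify needs more than 1 project file, deploy needs more than 3, and either phase can be skipped by flag) can be expressed as a small pure function. This is a sketch of that decision only; `planPhases` and its parameter names are hypothetical, not identifiers from `run-eval.ts`.

```typescript
interface PhaseFlags {
  skipVerify: boolean; // --skip-verify
  skipDeploy: boolean; // --skip-deploy
}

interface PhasePlan {
  verify: boolean;
  deploy: boolean;
}

// Decide which follow-up phases run after BUILD, based on how many project files exist.
function planPhases(projectFileCount: number, flags: PhaseFlags): PhasePlan {
  return {
    verify: !flags.skipVerify && projectFileCount > 1,
    deploy: !flags.skipDeploy && projectFileCount > 3,
  };
}
```

A build that produced only 2 files would therefore be verified but not deployed.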

Verification Phase Details


The verify phase is the "closer": its job is to make the app work and prove it. Key behaviors:
  • Always runs if >1 project file exists (no longer gated on port 3000 being up)
  • Starts the dev server itself if not already running: the prompt tells Claude to check `localhost:3000` and run `npx next dev --port 3000` if needed
  • 20 minute timeout: enough for agent-browser to open pages, screenshot, interact, fix broken code, restart the server, and re-verify
  • Triggers skill injection: the verify session creates/edits files, triggering PreToolUse and PostToolUse hooks
  • Uses agent-browser workflow: `open` → `wait --load networkidle` → `screenshot --annotate` → `snapshot -i` → interact → fix → re-verify
  • Results scored by haiku: no more parsing `STORY_1: PASS` from free text

Deploy Phase Details


The deploy phase uses a full Claude Code session (for skill tracking) to:
  1. Run `vercel link --yes --scope vercel-labs --project <slug>-YYYYMMDDHHMM`
  2. Run `vercel deploy --yes`
  3. If build fails, fix code and retry (up to 3 attempts)
  4. Important: the session unsets the `VERCEL_TOKEN` env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`
  5. Deployment protection is enabled by default on the vercel-labs team
The deploy URL is extracted by regex from Claude's output, with haiku as a fallback URL extractor.
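The regex step can be sketched as below. The actual pattern used by `run-eval.ts` is not shown in this document, so both the expression and the choice to return the last match (on the assumption that the final URL printed belongs to the attempt that succeeded) are illustrative.

```typescript
// Matches https://<subdomain>.vercel.app; the exact pattern is an assumption.
const VERCEL_APP_URL = /https:\/\/[a-z0-9][a-z0-9-]*\.vercel\.app\b/gi;

// Return the last *.vercel.app URL found in the session output, or null if none.
function extractDeployUrl(output: string): string | null {
  const matches = output.match(VERCEL_APP_URL);
  const last = matches?.[matches.length - 1];
  return last ?? null;
}
```

When this returns `null`, the haiku deploy score is the remaining source for the URL.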

DO NOT (Hard Rules)


Same rules as `benchmark-agents`, plus sandbox-specific:
  • DO NOT use `claude --print` or the `-p` flag for BUILD/VERIFY/DEPLOY phases: hooks don't fire without tool-calling sessions (use `-p` only for haiku scoring passes)
  • DO NOT let sandboxes run without extracting artifacts: the ephemeral filesystem is lost on stop
  • DO NOT pass API keys via `writeFiles()`; use `Sandbox.create({ env: { ... } })`
  • DO NOT skip snapshotting after build: it's your safety net if verify/deploy kills the sandbox
  • DO NOT use the v2 beta SDK: its named sandbox endpoint returns 404 for this team; use v1.8.0
  • DO NOT use `runCommand` heredocs to write file content; use `sandbox.writeFiles()` instead
  • DO NOT assume `/home/user/` exists: the home dir is `/home/vercel-sandbox/`
  • DO NOT use simple project names without timestamps: always append `-YYYYMMDDHHMM` to avoid collisions across runs
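The timestamp rule is easy to get wrong by hand, so here is a minimal sketch of a name builder. The document does not say which time zone the stamp uses; UTC is assumed here, and `timestampedProjectName` is a hypothetical helper name.

```typescript
// Build "<slug>-<YYYYMMDDHHMM>" with minute precision. UTC is an assumption.
function timestampedProjectName(slug: string, now: Date = new Date()): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp =
    String(now.getUTCFullYear()) +
    pad(now.getUTCMonth() + 1) +
    pad(now.getUTCDate()) +
    pad(now.getUTCHours()) +
    pad(now.getUTCMinutes());
  return `${slug}-${stamp}`;
}
```

With `new Date(Date.UTC(2026, 2, 10, 18, 53))` and the slug `pet-adoption-board`, this produces `pet-adoption-board-202603101853`, matching the example format used elsewhere in this document.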

Prerequisites


bash
# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local

# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh

Commands


Run eval with dynamic scenarios (recommended)

bash
# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8

# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy

# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

Monitoring While Running


The orchestrator prints live status. For manual checks on a running sandbox:
typescript
// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
  "ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);

// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
  "find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);

// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
  "curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);

// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);
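The hook-count command above runs `grep -c` over several files, which prints one `path:count` line per file (or a bare count when only one file is searched). A small sketch for turning that text into a single number, assuming the command's standard output is already available as a string; `sumGrepCounts` is a hypothetical helper name.

```typescript
// Sum the per-file counts printed by `grep -c` (lines look like "path:count" or "count").
function sumGrepCounts(stdout: string): number {
  let total = 0;
  for (const line of stdout.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    // Take the text after the last colon so paths containing ':' still parse.
    const countText = trimmed.slice(trimmed.lastIndexOf(":") + 1);
    const count = Number.parseInt(countText, 10);
    if (!Number.isNaN(count)) total += count;
  }
  return total;
}
```

A total of 0 while project files are being written suggests the hooks are not firing.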

Artifact Export Layout


Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:
<run-id>/
  results.json             # Aggregate results (complete: false until all done, then true)
  report.md                # Markdown report with scores, coverage, URLs
  <slug>/
    result.json            # Per-scenario result (written immediately on completion)
    source.tar.gz          # Project source archive
Each scenario result includes:
  • `slug`, `sandboxId`, `success`, `durationMs`
  • `claimedSkills[]`, `expectedSkills[]`, `projectFiles[]`
  • `appUrl`: public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
  • `deployUrl`: permanent `https://xxx.vercel.app` URL (if deploy succeeded)
  • `pollHistory[]`: timestamped skill/file/port snapshots
  • `verification`: `{ ran, exitCode, stories: [{ index, status }], output }`
  • `buildScore`: haiku structured completeness assessment
  • `deployScore`: haiku structured deploy assessment
The markdown report (`report.md` / `.reports/<timestamp>.md`) includes:
  1. Summary table: slug, build status, skills, files, verify results, deploy URL, duration
  2. Per-scenario details: build score, deploy score, verification per-story pass/fail
  3. Skill coverage: expected vs actual per scenario, missing/bonus breakdown
  4. Total unique skills across all scenarios
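The expected-vs-actual skill breakdown in the report can be computed from `expectedSkills[]` and `claimedSkills[]` alone. The sketch below shows one way to do it; `skillCoverage`, the `SkillCoverage` shape, and the rounding are assumptions rather than the report generator's actual code.

```typescript
interface SkillCoverage {
  matched: string[]; // expected and claimed
  missing: string[]; // expected but never claimed
  bonus: string[]; // claimed but not expected
  coveragePct: number; // matched / expected, rounded to a whole percent
}

function skillCoverage(expected: string[], claimed: string[]): SkillCoverage {
  const expectedSet = new Set(expected);
  const claimedSet = new Set(claimed);
  const matched = Array.from(expectedSet).filter((skill) => claimedSet.has(skill));
  const missing = Array.from(expectedSet).filter((skill) => !claimedSet.has(skill));
  const bonus = Array.from(claimedSet).filter((skill) => !expectedSet.has(skill));
  // Assumption: a scenario with no expected skills counts as fully covered.
  const coveragePct =
    expectedSet.size === 0 ? 100 : Math.round((matched.length / expectedSet.size) * 100);
  return { matched, missing, bonus, coveragePct };
}
```

Because the session cleanup hook may delete claim directories, the `claimed` input is best taken from the last `pollHistory[]` snapshot.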

Proven Results (2026-03-10)


Across 34 scenarios run in 5 batches:

| Metric | Best | Typical |
| --- | --- | --- |
| Skills per scenario | 31 (ai-interior-designer) | 12-24 |
| Expected skill coverage | 100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6) | 50-86% |
| User stories verified | 3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board) | varies |
| Files built per scenario | 37 (student-study-groups) | 6-25 |
| Build time | 5-11 min | 5-7 min |

Key findings:
  • User-story-focused prompts (no tech name-dropping) work: the plugin detects patterns from actual code
  • `ai-sdk`, `shadcn`, `nextjs`, `vercel-functions` are the most consistently detected skills
  • `cron-jobs`, `routing-middleware` need Claude to write specific file patterns to trigger
  • Lexical prompt inject (UserPromptSubmit) is working: skills are injected before any files are written
  • `session-end-cleanup` deletes claim dirs, so use poll history for final skill counts
  • Enterprise tier (vercel-labs): no sandbox time cap; builds ran 10+ minutes

Known Limitations


  1. Snapshot stops the source sandbox: `sandbox.snapshot()` stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
  2. v2 beta incompatible: The named sandbox endpoint in `@vercel/sandbox@2.0.0-beta.3` returns 404 for this team. Stick with v1.8.0.
  3. Artifact window: Must extract before `sandbox.stop()` because the filesystem is ephemeral. The session cleanup hook may delete claim dirs before extraction.
  4. Amazon Linux paths: User is `vercel-sandbox` (home at `/home/vercel-sandbox/`). NOT `/home/user/` or `/root/`.
  5. `--dangerously-skip-permissions` parity: Sandbox evals auto-approve all tool calls. WezTerm evals use normal permission flow. Coverage results may differ.
  6. `runCommand` timeout: Use `{ signal: AbortSignal.timeout(ms) }`; the `{ timeout }` option is silently ignored.
  7. BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic recommended for production runs.
  8. Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable `*.vercel.app` URL. The haiku scoring step provides a fallback URL extraction attempt.
  9. Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.
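For the retry logic recommended in item 7, a generic wrapper with exponential backoff is one option. This sketch is not part of `run-eval.ts`: `withRetry`, `backoffDelays`, the default of 3 attempts, and the 1-second base delay are all assumptions, and it retries on any error rather than only on `BrotliDecompressionError`.

```typescript
// Delays between attempts: baseMs, 2*baseMs, 4*baseMs, ... (none after the final attempt).
function backoffDelays(attempts: number, baseMs: number): number[] {
  return Array.from({ length: Math.max(0, attempts - 1) }, (_, k) => baseMs * 2 ** k);
}

// Run an async operation, retrying with exponential backoff when it throws.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseMs = 1000,
): Promise<T> {
  if (attempts < 1) throw new RangeError("attempts must be at least 1");
  const delays = backoffDelays(attempts, baseMs);
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      const delay = delays[attempt];
      if (delay === undefined) break; // no attempts left
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Sandbox creation would be passed in as the `operation` callback, so a transient failure costs a few seconds instead of a whole scenario.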