benchmark-sandbox

Benchmark Sandbox — Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:

- Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`.
- Phase 2 (VERIFY): A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout).
- Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.

Skills are tracked across all 3 phases; each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.

Proven Working Script

Use `run-eval.ts`, the proven eval runner:

```bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8

# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy

# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
```

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--concurrency N` | 5 | Max parallel sandboxes (max 10) |
| `--timeout MS` | 1800000 (30 min) | Per-phase timeout in ms |
| `--keep-alive` | off | Keep sandboxes running after eval |
| `--keep-hours N` | 8 | Hours to keep alive (with `--keep-alive`) |
| `--skip-verify` | off | Skip the agent-browser verification phase |
| `--skip-deploy` | off | Skip the Vercel deploy phase |
| `--scenarios a,b,c` | all | Only run specific scenarios by slug |
| `--scenarios-file path` | n/a | Load scenarios from a JSON file instead of built-in defaults |
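
To make the defaults and the concurrency cap concrete, here is a minimal sketch of how flags like these could be parsed. It is not the parser inside `run-eval.ts`; the `EvalOptions` shape, the `parseEvalArgs` and `toPositiveInt` names, and the choice to clamp (rather than reject) a concurrency above 10 are all assumptions made for illustration.

```typescript
// Hypothetical option shape: field names are illustrative, not taken from run-eval.ts.
interface EvalOptions {
  concurrency: number;
  timeoutMs: number;
  keepAlive: boolean;
  keepHours: number;
  skipVerify: boolean;
  skipDeploy: boolean;
  scenarios: string[] | null; // null means "run all scenarios"
  scenariosFile: string | null;
}

// Parse a numeric flag value, rejecting anything that is not a positive integer.
function toPositiveInt(flag: string, value: string): number {
  const n = Number(value);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${flag} expects a positive integer, got "${value}"`);
  }
  return n;
}

function parseEvalArgs(argv: string[]): EvalOptions {
  // Defaults mirror the flag table above.
  const opts: EvalOptions = {
    concurrency: 5,
    timeoutMs: 1_800_000, // 30 min per phase
    keepAlive: false,
    keepHours: 8,
    skipVerify: false,
    skipDeploy: false,
    scenarios: null,
    scenariosFile: null,
  };
  let i = 0;
  // Read the token that follows a value-taking flag.
  const takeValue = (flag: string): string => {
    const value = argv[i++];
    if (value === undefined) throw new Error(`Missing value for ${flag}`);
    return value;
  };
  while (i < argv.length) {
    const flag = argv[i++] as string;
    switch (flag) {
      case "--concurrency":
        // Assumption: values above the documented maximum of 10 are clamped.
        opts.concurrency = Math.min(10, toPositiveInt(flag, takeValue(flag)));
        break;
      case "--timeout":
        opts.timeoutMs = toPositiveInt(flag, takeValue(flag));
        break;
      case "--keep-alive":
        opts.keepAlive = true;
        break;
      case "--keep-hours":
        opts.keepHours = toPositiveInt(flag, takeValue(flag));
        break;
      case "--skip-verify":
        opts.skipVerify = true;
        break;
      case "--skip-deploy":
        opts.skipDeploy = true;
        break;
      case "--scenarios":
        opts.scenarios = takeValue(flag)
          .split(",")
          .map((slug) => slug.trim())
          .filter((slug) => slug.length > 0);
        break;
      case "--scenarios-file":
        opts.scenariosFile = takeValue(flag);
        break;
      default:
        throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return opts;
}
```

Rejecting unknown flags outright is a deliberate choice in this sketch: a mistyped `--skip-verfy` would otherwise silently run the full pipeline.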

Dynamic Scenarios (Recommended Approach)

Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.

Scenario JSON Format

```json
[
  {
    "slug": "pet-adoption-board",
    "prompt": "Build me a pet adoption listing board where shelters can post animals...",
    "expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
    "userStories": [
      "As a visitor, I can see a grid of pet listings with photos and names",
      "As a visitor, I can click a pet card to see a detail page",
      "As a visitor, I can filter pets by type"
    ]
  }
]
```

Each scenario needs: `slug` (string), `prompt` (string), `expectedSkills` (string[]), `userStories` (tuple of exactly 3 strings).
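
A scenarios file can be checked against these four requirements before any sandbox is created. The sketch below is one way to do that; the `Scenario` interface and the `validateScenarios` and `isStringArray` names are illustrative assumptions, and the runner's own validation (if any) may differ.

```typescript
// Shape described above; the interface name itself is an assumption.
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Validate the parsed contents of a scenarios file; throws on the first problem found.
function validateScenarios(raw: unknown): Scenario[] {
  if (!Array.isArray(raw)) {
    throw new Error("Scenario file must contain a JSON array");
  }
  return raw.map((entry: unknown, i: number): Scenario => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`Scenario ${i}: expected an object`);
    }
    const { slug, prompt, expectedSkills, userStories } = entry as Record<string, unknown>;
    if (typeof slug !== "string" || slug.length === 0) {
      throw new Error(`Scenario ${i}: "slug" must be a non-empty string`);
    }
    if (typeof prompt !== "string" || prompt.length === 0) {
      throw new Error(`Scenario ${slug}: "prompt" must be a non-empty string`);
    }
    if (!isStringArray(expectedSkills)) {
      throw new Error(`Scenario ${slug}: "expectedSkills" must be an array of strings`);
    }
    if (!isStringArray(userStories) || userStories.length !== 3) {
      throw new Error(`Scenario ${slug}: "userStories" must contain exactly 3 strings`);
    }
    const [first, second, third] = userStories as [string, string, string];
    return { slug, prompt, expectedSkills, userStories: [first, second, third] };
  });
}
```

Typical use would be `validateScenarios(JSON.parse(fileContents))` right after reading the file passed to `--scenarios-file`, so a malformed scenario fails fast instead of after a sandbox has been provisioned.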

Prompt Design Guidelines

- Focus on what the user wants, not what tech to use
- Describe real-world apps that solve real problems with friendly, stylish UX
- Include AI features naturally (recommendations, analysis, generation)
- Always end with: "Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with `npx next dev --port 3000`."
- Include storage needs (photos, uploads) to trigger vercel-storage
- Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
- Include auth/middleware to trigger routing-middleware

Structured Scoring (Haiku)

Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass (no tools, no hooks) that just reads the phase output and returns structured data.

Build Score Schema

```json
{
  "completeness": "complete|partial|minimal|empty",
  "hasApiRoutes": true,
  "hasUIComponents": true,
  "hasAIFeature": true,
  "devServerRunning": true,
  "missingFeatures": ["feature1"],
  "summary": "Brief assessment"
}
```

Verify Score Schema (per user story)

```json
{
  "stories": [
    { "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
  ]
}
```

Deploy Score Schema

```json
{
  "deployed": true,
  "url": "https://xxx.vercel.app",
  "buildSucceeded": true,
  "errors": [],
  "summary": "Brief assessment"
}
```

Important: The `claude -p --output-format json` response wraps results; the actual schema data is in `parsed.structured_output`, not the top-level object.
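
The unwrapping step is easy to get wrong, so a small helper may be worth having. The sketch below assumes only what is stated above (the schema data lives under `structured_output`); the helper name `extractStructuredOutput` and the `BuildScore` interface name are illustrative, and no other fields of the wrapper object are relied on.

```typescript
// Mirrors the Build Score Schema; the interface name is an assumption.
interface BuildScore {
  completeness: "complete" | "partial" | "minimal" | "empty";
  hasApiRoutes: boolean;
  hasUIComponents: boolean;
  hasAIFeature: boolean;
  devServerRunning: boolean;
  missingFeatures: string[];
  summary: string;
}

// Pull the schema payload out of the wrapper printed by `claude -p --output-format json`.
// Note: this only checks that the field exists; it does not validate the payload against T.
function extractStructuredOutput<T>(stdout: string): T {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null || !("structured_output" in parsed)) {
    throw new Error("claude -p response has no structured_output field");
  }
  return (parsed as { structured_output: T }).structured_output;
}
```

A call such as `extractStructuredOutput<BuildScore>(stdout)` then yields the score object directly. Throwing when the field is missing keeps a malformed scoring pass from being mistaken for an empty score.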

Critical Sandbox Environment Facts

| Property | Value |
| --- | --- |
| Home directory | `/home/vercel-sandbox` (NOT `/home/user/` or `/root/`) |
| User | `vercel-sandbox` (NOT `root`) |
| Claude binary | `/home/vercel-sandbox/.global/npm/bin/claude` |
| PATH (via sh -c) | Includes `~/.global/npm/bin`, so claude is findable by name |
| Port exposure | `sandbox.domain(3000)` → `https://subdomain.vercel.run` |
| Snapshot persistence | Files AND npm globals survive snapshot restore. Use `sandbox.snapshot()` → `Sandbox.create({ source: { type: "snapshot", snapshotId } })` |
| SDK version | `@vercel/sandbox@1.8.0` (v2 beta's named sandbox endpoint returns 404 for this team) |
| Team tier | Enterprise (vercel-labs), no known sandbox time cap |

Key Discoveries (Hard-Won)

- Snapshots work: `sandbox.snapshot()` preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox; create a new one from the snapshot to continue.
- Plugin install: Use `npx add-plugin <path> -s project -y --target claude-code`. This works because claude is in PATH after `npm install -g`. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized `~/.claude/` dir.
- File uploads: Use `sandbox.writeFiles([{ path, content: Buffer }])`, NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
- Claude flags: Always use `--dangerously-skip-permissions --debug`. The `--debug` flag writes to `~/.claude/debug/`.
- Auth: API key from macOS Keychain (`ANTHROPIC_AUTH_TOKEN`, a `vck_*` Vercel Claude Key for AI Gateway), Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
- OIDC for sandbox SDK: Run `npx vercel link --scope vercel-labs -y` and `npx vercel env pull` once before first use.
- Port exposure: Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0; the URL is assigned at creation time, before anything listens.
- extendTimeout: Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working: it extends by the requested duration. Use this for overnight keep-alive.
- Background commands: `runCommand` with backgrounded processes (`&` or `nohup`) may throw ZodError on v1. Write a script file first, then execute it.
- Session cleanup race: The `session-end-cleanup.mjs` hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
- agent-browser works in sandboxes: Install via `npm install -g agent-browser`. Claude Code can use it for browser-based verification inside the sandbox.
- No hobby tier cap: Early 301s timeouts were from lower default timeout values in earlier script iterations, not a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap; sandboxes ran 10+ minutes successfully.
- claude -p works inside sandboxes: `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. No nesting issue when running inside a sandbox (only fails when running Claude inside Claude on the same machine).
- Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects, and we generate many per day. Format: `<slug>-<YYYYMMDDHHMM>`.
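
The `<slug>-<YYYYMMDDHHMM>` naming rule can be captured in a few lines. This is a sketch only: the function name `deployProjectName` is hypothetical, and using UTC for the timestamp is an assumption (the source does not say which time zone the runner uses).

```typescript
// Build a collision-resistant project name of the form <slug>-<YYYYMMDDHHMM>.
// Assumption: the timestamp is taken in UTC so names do not depend on the host time zone.
function deployProjectName(slug: string, now: Date = new Date()): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp =
    String(now.getUTCFullYear()) +
    pad(now.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(now.getUTCDate()) +
    pad(now.getUTCHours()) +
    pad(now.getUTCMinutes());
  return `${slug}-${stamp}`;
}

// Example with a fixed instant: 2026-03-10 18:53 UTC.
const exampleName = deployProjectName("pet-adoption-board", new Date(Date.UTC(2026, 2, 10, 18, 53)));
console.log(exampleName); // pet-adoption-board-202603101853
```

Minute precision still allows a collision if the same slug is deployed twice within one minute, so a runner that retries quickly may want to detect that case separately.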

When to Use This vs benchmark-agents

| | benchmark-agents (WezTerm) | benchmark-sandbox |
| --- | --- | --- |
| Environment | Local macOS terminal panes | Remote Vercel Sandboxes (Amazon Linux) |
| Parallelism | Limited by local resources | Up to 10 (Hobby) or 2,000 (Pro) concurrent |
| Session type | Interactive TTY via `/bin/zsh -ic` | Direct `sh -c` invocation (PTY not required) |
| Artifact access | Direct filesystem (`~/.claude/debug/`) | `sandbox.readFile()` / poll via `runCommand` |
| Port exposure | `localhost:3000` | Public `https://sb-XXX.vercel.run` URLs |
| Verification | Manual browser check | Automated agent-browser in Phase 2 |
| Deploy | Manual | Automated Phase 3 → permanent `*.vercel.app` URLs |
| Scoring | Manual review | Haiku structured JSON scoring per phase |
| Best for | Manual eval + iteration loop | Automated parallel coverage + verification + deploy runs |

How It Works

1. Create fresh sandbox: `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })` (no snapshot)
2. Install tools: `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
3. Auth Vercel CLI: Write token to `~/.local/share/com.vercel.cli/auth.json`
4. Upload plugin: `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
5. Phase 1 (BUILD): Claude Code builds the app (30 min timeout)
6. Score build: Haiku evaluates completeness, API routes, UI, AI features
7. Start dev server: If not already running, start `npx next dev --port 3000`
8. Extend timeout: `sandbox.extendTimeout()` for verify + deploy + keep-alive
9. Phase 2 (VERIFY): Claude Code uses `agent-browser` to test user stories (20 min timeout). Prompt tells Claude to start dev server itself if not running.
10. Score verify: Haiku evaluates each user story as pass/fail with reasons
11. Re-extract skills: Skills re-collected after verify phase (agent-browser + code fixes trigger more)
12. Phase 3 (DEPLOY): Claude Code runs `vercel link` + `vercel deploy`, fixes build errors (30 min timeout)
13. Score deploy: Haiku evaluates deploy success, URL extraction, errors
14. Re-extract skills: Skills re-collected after deploy phase
15. Write incremental results: Each scenario writes its own `result.json` immediately on completion (survives crashes)
16. Extract source archive: `source.tar.gz` of project files saved locally
17. Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs

Sandbox Session Flow (Per Scenario)

```
Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
  │
  ├─ npm install -g @anthropic-ai/claude-code vercel agent-browser   (~20s)
  ├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
  ├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
  ├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/  (80 files, ~945KB)
  ├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
  │
  ├─ Phase 1: BUILD
  │   ├─ sandbox.writeFiles() → /tmp/prompt.txt
  │   ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
  │   │   (with AbortSignal.timeout(TIMEOUT_MS))
  │   ├─ Poll every 20s:
  │   │   ├─ ls /tmp/vercel-plugin-*-seen-skills.d/     (claimed skills)
  │   │   ├─ cat /tmp/vercel-plugin-*-seen-skills.txt    (seen skills snapshot)
  │   │   ├─ find ~/.claude/debug -type f                (debug log count)
  │   │   ├─ find <project> -newer /tmp/prompt.txt       (new project files)
  │   │   └─ curl localhost:3000                         (port status)
  │   ├─ Extract build artifacts
  │   └─ Haiku build score (structured JSON)
  │
  ├─ Start dev server (if not already running)
  ├─ sandbox.extendTimeout(...)
  │
  ├─ Phase 2: VERIFY (if >1 project file exists)
  │   ├─ sandbox.writeFiles() → /tmp/verify.txt  (agent-browser verification prompt)
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
  │   │   (with AbortSignal.timeout(1_200_000), 20 min)
  │   ├─ Re-extract skills (verify phase triggers more)
  │   └─ Haiku verify score (per-story pass/fail JSON)
  │
  ├─ Phase 3: DEPLOY (if >3 project files)
  │   ├─ sandbox.writeFiles() → /tmp/deploy.txt
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
  │   │   (links to vercel-labs, deploys, fixes build errors up to 3x)
  │   ├─ Extract deploy URL from output (*.vercel.app)
  │   ├─ Re-extract skills (deploy phase triggers more)
  │   └─ Haiku deploy score (structured JSON)
  │
  ├─ Write <slug>/result.json immediately (crash-safe)
  ├─ Update aggregate results.json (complete: false until all done)
  ├─ Extract source.tar.gz
  └─ sandbox.stop()  (skipped if --keep-alive)
```
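
The flow gates Phase 2 and Phase 3 on how many project files the build produced, combined with the skip flags. That decision can be sketched as a pure function; the names `PhaseFlags`, `PhasePlan`, and `planPhases` are hypothetical, and only the `>1` and `>3` thresholds and the two skip flags come from the text above.

```typescript
interface PhaseFlags {
  skipVerify: boolean; // --skip-verify
  skipDeploy: boolean; // --skip-deploy
}

interface PhasePlan {
  verify: boolean;
  deploy: boolean;
}

// Decide which follow-up phases to run once the build phase has finished.
function planPhases(projectFileCount: number, flags: PhaseFlags): PhasePlan {
  return {
    // VERIFY runs when more than 1 project file exists.
    verify: !flags.skipVerify && projectFileCount > 1,
    // DEPLOY runs when more than 3 project files exist.
    deploy: !flags.skipDeploy && projectFileCount > 3,
  };
}
```

With these thresholds, a build that produced 2 or 3 files is verified but not deployed, which keeps near-empty projects from consuming deploy slots.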

Verification Phase Details

The verify phase is the "closer": its job is to make the app work and prove it. Key behaviors:

- Always runs if >1 project file exists (no longer gated on port 3000 being up)
- Starts dev server itself if not already running: the prompt tells Claude to check `localhost:3000` and run `npx next dev --port 3000` if needed
- 20 minute timeout: enough for agent-browser to open pages, screenshot, interact, fix broken code, restart server, and re-verify
- Triggers skill injection: the verify session creates/edits files, triggering PreToolUse and PostToolUse hooks
- Uses agent-browser workflow: `open` → `wait --load networkidle` → `screenshot --annotate` → `snapshot -i` → interact → fix → re-verify
- Results scored by haiku: no more parsing `STORY_1: PASS` from free text

Deploy Phase Details

The deploy phase uses a full Claude Code session (for skill tracking) to:

- Run `vercel link --yes --scope vercel-labs --project <slug>-<YYYYMMDDHHMM>`
- Run `vercel deploy --yes`
- If build fails, fix code and retry (up to 3 attempts)
- Important: unsets the `VERCEL_TOKEN` env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`
- Deployment protection is enabled by default on the vercel-labs team

Deploy URL is extracted by regex from Claude's output, with haiku as fallback URL extractor.
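
The regex step might look like the sketch below. The exact pattern used by `run-eval.ts` is not shown in this document, so the pattern, the function name `extractDeployUrl`, and the choice to return the last match (on the assumption that a retried deploy prints its final URL last) are all illustrative.

```typescript
// Find a *.vercel.app URL in free-form session output.
// Returns null when nothing matches, which is the cue to fall back to the haiku extractor.
function extractDeployUrl(output: string): string | null {
  const matches = output.match(/https:\/\/[a-z0-9][a-z0-9-]*\.vercel\.app\b/gi);
  // Assumption: when several URLs appear (e.g. after retries), the last one is the final deploy.
  const last = matches?.[matches.length - 1];
  return last ?? null;
}
```

Returning `null` rather than an empty string makes the "no URL found" case explicit for the caller that decides whether to invoke the fallback.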

DO NOT (Hard Rules)

Same rules as `benchmark-agents`, plus sandbox-specific:

- DO NOT use `claude --print` or the `-p` flag for BUILD/VERIFY/DEPLOY phases: hooks don't fire without tool-calling sessions (use `-p` only for haiku scoring passes)
- DO NOT let sandboxes run without extracting artifacts: the ephemeral filesystem is lost on stop
- DO NOT pass API keys via `writeFiles()`; use `Sandbox.create({ env: { ... } })`
- DO NOT skip snapshotting after build: it's your safety net if verify/deploy kills the sandbox
- DO NOT use v2 beta SDK: named sandbox endpoint returns 404 for this team; use v1.8.0
- DO NOT use `runCommand` heredocs to write file content; use `sandbox.writeFiles()` instead
- DO NOT assume `/home/user/` exists: the home dir is `/home/vercel-sandbox/`
- DO NOT use simple project names without timestamps: always append `-YYYYMMDDHHMM` to avoid collisions across runs

Prerequisites

```bash
# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local

# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh
```

Commands

Run eval with dynamic scenarios (recommended)

```bash
# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8

# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy

# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
```

Monitoring While Running

The orchestrator prints live status. For manual checks on a running sandbox:

```typescript
// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
  "ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);

// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
  "find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);

// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
  "curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);

// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);
```

Artifact Export Layout

Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:

```
<run-id>/
  results.json             # Aggregate results (complete: false until all done, then true)
  report.md                # Markdown report with scores, coverage, URLs
  <slug>/
    result.json            # Per-scenario result (written immediately on completion)
    source.tar.gz          # Project source archive
```

Each scenario result includes:

- `slug`, `sandboxId`, `success`, `durationMs`
- `claimedSkills[]`, `expectedSkills[]`, `projectFiles[]`
- `appUrl`: public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
- `deployUrl`: permanent `https://xxx.vercel.app` URL (if deploy succeeded)
- `pollHistory[]`: timestamped skill/file/port snapshots
- `verification`: `{ ran, exitCode, stories: [{ index, status }], output }`
- `buildScore`: haiku structured completeness assessment
- `deployScore`: haiku structured deploy assessment

The markdown report (`report.md` and `.reports/<timestamp>.md`) includes:

- Summary table: slug, build status, skills, files, verify results, deploy URL, duration
- Per-scenario details: build score, deploy score, verification per-story pass/fail
- Skill coverage: expected vs actual per scenario, missing/bonus breakdown
- Total unique skills across all scenarios
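
The skill coverage section of the report boils down to a set comparison between `expectedSkills[]` and `claimedSkills[]`. The sketch below shows one way to compute the missing/bonus breakdown; the `SkillCoverage` shape, the `skillCoverage` name, and rounding to a whole percent are assumptions rather than the report generator's actual code.

```typescript
interface SkillCoverage {
  matched: string[]; // expected skills that were claimed
  missing: string[]; // expected skills that never fired
  bonus: string[];   // claimed skills that were not expected
  coveragePct: number;
}

function skillCoverage(expectedSkills: string[], claimedSkills: string[]): SkillCoverage {
  // De-duplicate first so repeated entries do not skew the percentage.
  const expected = Array.from(new Set(expectedSkills));
  const claimed = Array.from(new Set(claimedSkills));
  const expectedSet = new Set(expected);
  const claimedSet = new Set(claimed);

  const matched = expected.filter((skill) => claimedSet.has(skill));
  const missing = expected.filter((skill) => !claimedSet.has(skill));
  const bonus = claimed.filter((skill) => !expectedSet.has(skill));
  // Assumption: a scenario with no expected skills counts as fully covered.
  const coveragePct =
    expected.length === 0 ? 100 : Math.round((matched.length / expected.length) * 100);

  return { matched, missing, bonus, coveragePct };
}
```

Because the cleanup hook can delete the claim directories at session end, the `claimedSkills` input would typically be assembled from poll history rather than from a final directory listing.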

Proven Results (2026-03-10)

Across 34 scenarios run in 5 batches:

| Metric | Best | Typical |
| --- | --- | --- |
| Skills per scenario | 31 (ai-interior-designer) | 12-24 |
| Expected skill coverage | 100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6) | 50-86% |
| User stories verified | 3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board) | varies |
| Files built per scenario | 37 (student-study-groups) | 6-25 |
| Build time | 5-11 min | 5-7 min |

Key findings:

- User-story-focused prompts (no tech name-dropping) work: the plugin detects patterns from actual code
- `ai-sdk`, `shadcn`, `nextjs`, `vercel-functions` are the most consistently detected skills
- `cron-jobs`, `routing-middleware` need Claude to write specific file patterns to trigger
- Lexical prompt inject (UserPromptSubmit) working: skills injected before any files written
- `session-end-cleanup` deletes claim dirs; use poll history for final skill counts
- Enterprise tier (vercel-labs): no sandbox time cap; builds ran 10+ minutes

Known Limitations

- Snapshot stops the source sandbox: `sandbox.snapshot()` stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
- v2 beta incompatible: `@vercel/sandbox@2.0.0-beta.3`'s named sandbox endpoint returns 404 for this team. Stick with v1.8.0.
- Artifact window: Must extract before `sandbox.stop()` because the filesystem is ephemeral. Session cleanup hook may delete claim dirs before extraction.
- Amazon Linux paths: User is `vercel-sandbox` (home at `/home/vercel-sandbox/`). NOT `/home/user/` or `/root/`.
- `--dangerously-skip-permissions` parity: Sandbox evals auto-approve all tool calls. WezTerm evals use normal permission flow. Coverage results may differ.
- `runCommand` timeout: Use `{ signal: AbortSignal.timeout(ms) }`; the `{ timeout }` option is silently ignored.
- BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic recommended for production runs.
- Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable `*.vercel.app` URL. The haiku scoring step provides a fallback URL extraction attempt.
- Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.
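
For the transient creation failures mentioned above, a generic retry wrapper with exponential backoff is one option. This is a sketch under stated assumptions: `withRetry` is a hypothetical helper, it retries on any thrown error (not only `BrotliDecompressionError`), and the attempt count and delays are arbitrary defaults rather than tuned values.

```typescript
// Retry an async operation with exponential backoff.
// Assumption: every failure is treated as transient; a real runner may want to
// inspect the error and rethrow immediately on non-retryable failures.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts: number = 3,
  baseDelayMs: number = 1000,
): Promise<T> {
  let lastError: unknown = new Error("withRetry: attempts must be at least 1");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

In the runner this would wrap the creation call, for example `withRetry(() => Sandbox.create({ ... }))`, so a single failed API response does not abort an entire batch.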
