Benchmark Sandbox — Remote Eval via Vercel Sandboxes
Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs running node24. Each sandbox gets a fresh install of Claude Code, Vercel CLI, and agent-browser, has the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:
- Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`
- Phase 2 (VERIFY): A follow-up Claude Code session uses agent-browser to walk through the user stories, fixing issues until all pass (20 min timeout)
- Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, deploys, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
Skills are tracked across all 3 phases; each phase may trigger additional skill injections as new files and patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.
Proven Working Script
Use `run-eval.ts`, the proven eval runner:
Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts
With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json
Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8
Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy
Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
| Flag | Default | Description |
|---|---|---|
| | 5 | Max parallel sandboxes (max 10) |
| | 1800000 (30 min) | Per-phase timeout in ms |
| `--keep-alive` | off | Keep sandboxes running after eval |
| `--keep-hours` | 8 | Hours to keep alive (with `--keep-alive`) |
| `--skip-verify` | off | Skip the agent-browser verification phase |
| `--skip-deploy` | off | Skip the Vercel deploy phase |
| `--scenarios` | all | Only run specific scenarios by slug |
| `--scenarios-file` | none | Load scenarios from a JSON file instead of built-in defaults |
Dynamic Scenarios (Recommended Approach)
Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.
Scenario JSON Format
json
[
{
"slug": "pet-adoption-board",
"prompt": "Build me a pet adoption listing board where shelters can post animals...",
"expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
"userStories": [
"As a visitor, I can see a grid of pet listings with photos and names",
"As a visitor, I can click a pet card to see a detail page",
"As a visitor, I can filter pets by type"
]
}
]
Each scenario needs `slug` (string), `prompt` (string), `expectedSkills` (string[]), and `userStories` (a tuple of exactly 3 strings).
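A scenarios file can be checked against this shape before any sandbox is started. The following is a minimal sketch, not the runner's actual validation; `validateScenario` and `parseScenarios` are hypothetical helper names introduced here for illustration.

```typescript
// Shape of one entry in the scenarios file, as described above.
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string];
}

// Hypothetical helper: returns a list of problems, empty when the entry is valid.
function validateScenario(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["scenario must be an object"];
  }
  const s = value as Record<string, unknown>;
  const isStringArray = (v: unknown): v is string[] =>
    Array.isArray(v) && v.every((item) => typeof item === "string");

  const problems: string[] = [];
  if (typeof s.slug !== "string" || s.slug.length === 0) {
    problems.push("slug must be a non-empty string");
  }
  if (typeof s.prompt !== "string" || s.prompt.length === 0) {
    problems.push("prompt must be a non-empty string");
  }
  if (!isStringArray(s.expectedSkills)) {
    problems.push("expectedSkills must be a string[]");
  }
  if (!isStringArray(s.userStories) || s.userStories.length !== 3) {
    problems.push("userStories must contain exactly 3 strings");
  }
  return problems;
}

// Hypothetical helper: parses the file contents and throws on the first invalid entry.
function parseScenarios(json: string): Scenario[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("scenarios file must contain a JSON array");
  }
  parsed.forEach((entry, i) => {
    const problems = validateScenario(entry);
    if (problems.length > 0) {
      throw new Error(`scenario ${i}: ${problems.join("; ")}`);
    }
  });
  return parsed as Scenario[];
}
```

Failing fast on a malformed file is cheaper than discovering the problem after a sandbox has already been created and provisioned.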
Prompt Design Guidelines
- Focus on what the user wants, not what tech to use
- Describe real-world apps that solve real problems with friendly, stylish UX
- Include AI features naturally (recommendations, analysis, generation)
- Always end with: "Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with `npx next dev --port 3000`."
- Include storage needs (photos, uploads) to trigger vercel-storage
- Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
- Include auth/middleware to trigger routing-middleware
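Because every prompt must end with the same closing instruction, appending it programmatically avoids drift between scenarios. This is a small sketch under that assumption; `withRequiredSuffix` is a hypothetical helper, and only the suffix text itself comes from the guideline above.

```typescript
// Closing instruction quoted from the guidelines above.
const REQUIRED_SUFFIX =
  "Link the project to my vercel-labs team. After building all files, " +
  "start the dev server on port 3000 with `npx next dev --port 3000`.";

// Hypothetical helper: appends the suffix once, leaving prompts that already end with it unchanged.
function withRequiredSuffix(prompt: string): string {
  const trimmed = prompt.trimEnd();
  return trimmed.endsWith(REQUIRED_SUFFIX) ? trimmed : `${trimmed} ${REQUIRED_SUFFIX}`;
}
```

The helper is idempotent, so it is safe to apply to hand-written scenarios that already include the suffix.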
Structured Scoring (Haiku)
Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass (no tools, no hooks) that only reads the phase output and returns structured data.
Build Score Schema
json
{
"completeness": "complete|partial|minimal|empty",
"hasApiRoutes": true,
"hasUIComponents": true,
"hasAIFeature": true,
"devServerRunning": true,
"missingFeatures": ["feature1"],
"summary": "Brief assessment"
}
Verify Score Schema (per user story)
json
{
"stories": [
{ "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
]
}
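Once parsed, the verify score above can be reduced to a one-line summary for a report. The sketch below assumes the schema exactly as shown; `summarizeVerify` and the `label` format are illustrative choices, not the runner's own output.

```typescript
// Types mirroring the verify score schema above.
interface StoryResult {
  index: number;
  status: "pass" | "fail";
  reason: string;
}

interface VerifyScore {
  stories: StoryResult[];
}

// Hypothetical helper: counts passing stories and builds a short label such as "3/3 PASS".
function summarizeVerify(score: VerifyScore): { passed: number; total: number; label: string } {
  const total = score.stories.length;
  const passed = score.stories.filter((story) => story.status === "pass").length;
  const verdict = total > 0 && passed === total ? "PASS" : "FAIL";
  return { passed, total, label: `${passed}/${total} ${verdict}` };
}
```

An empty story list is reported as a failure here, on the assumption that a verify pass with no evaluated stories proves nothing.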
Deploy Score Schema
json
{
"deployed": true,
"url": "https://xxx.vercel.app",
"buildSucceeded": true,
"errors": [],
"summary": "Brief assessment"
}
Important: The `claude -p --output-format json` response wraps results, so the actual schema data sits in a nested field of the response, not on the top-level object.
Critical Sandbox Environment Facts
| Property | Value |
|---|---|
| Home directory | `/home/vercel-sandbox` |
| User | `vercel-sandbox` |
| Claude binary | `/home/vercel-sandbox/.global/npm/bin/claude` |
| PATH (via sh -c) | Includes the global npm bin directory, so `claude` is findable by name |
| Port exposure | `sandbox.domain(3000)` → `https://subdomain.vercel.run` |
| Snapshot persistence | Files AND npm globals survive snapshot restore; restore with `Sandbox.create({ source: { type: "snapshot", snapshotId } })` |
| SDK version | `@vercel/sandbox` v1.8.0 (v2 beta's named sandbox endpoint returns 404 for this team) |
| Team tier | Enterprise (vercel-labs), no known sandbox time cap |
Key Discoveries (Hard-Won)
- Snapshots work: A snapshot preserves files AND npm globals. Take one after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox, so create a new one from the snapshot to continue.
- Plugin install: Use `npx add-plugin <path> -s project -y --target claude-code`. This works because `claude` is in PATH after the global npm install. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized dir.
- File uploads: Use `sandbox.writeFiles([{ path, content: Buffer }])`, NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
- Claude flags: Always use `--dangerously-skip-permissions --debug`. The `--debug` flag writes logs to `~/.claude/debug`.
- Auth: API key from macOS Keychain (`ANTHROPIC_AUTH_TOKEN`, a `vck_*` Vercel Claude Key for AI Gateway), Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
- OIDC for sandbox SDK: Run `npx vercel link --scope vercel-labs -y` + `npx vercel env pull .env.local` once before first use.
- Port exposure: Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0; the URL is assigned at creation time, before anything listens.
- extendTimeout: Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working: it extends by the requested duration. Use this for overnight keep-alive.
- Background commands: Commands with backgrounded processes may throw ZodError on v1. Write a script file first, then execute it.
- Session cleanup race: The session cleanup hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
- agent-browser works in sandboxes: Install via `npm install -g agent-browser`. Claude Code can use it for browser-based verification inside the sandbox.
- No hobby tier cap: Early 301s timeouts came from lower default timeout values in earlier script iterations, not from a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap; sandboxes ran 10+ minutes successfully.
- claude -p works inside sandboxes: `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. There is no nesting issue inside a sandbox (it only fails when running Claude inside Claude on the same machine).
- Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects and we generate many per day. Format: the slug followed by a `YYYYMMDDHHMM` timestamp.
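The naming rule in the last item can be captured in a few lines. This is a minimal sketch: `timestampedSlug` is a hypothetical helper, and formatting the timestamp in UTC is an assumption made here (the source does not say which time zone the runner uses).

```typescript
// Hypothetical helper: appends a minute-precision YYYYMMDDHHMM suffix to a scenario slug.
// Assumption: UTC is used so that concurrent runs on different machines agree on the suffix.
function timestampedSlug(slug: string, now: Date = new Date()): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp =
    String(now.getUTCFullYear()) +
    pad(now.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(now.getUTCDate()) +
    pad(now.getUTCHours()) +
    pad(now.getUTCMinutes());
  return `${slug}-${stamp}`;
}
```

Two runs of the same scenario within the same minute would still collide; if that matters, extending the suffix to seconds is a straightforward change.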
When to Use This vs benchmark-agents
| | benchmark-agents (WezTerm) | benchmark-sandbox |
|---|---|---|
| Environment | Local macOS terminal panes | Remote Vercel Sandboxes (Amazon Linux) |
| Parallelism | Limited by local resources | Up to 10 (Hobby) or 2,000 (Pro) concurrent |
| Session type | Interactive TTY | Direct invocation (PTY not required) |
| Artifact access | Direct filesystem | Polled via `sandbox.runCommand` |
| Port exposure | | Public `https://sb-XXX.vercel.run` URLs |
| Verification | Manual browser check | Automated agent-browser in Phase 2 |
| Deploy | Manual | Automated Phase 3 → permanent URLs |
| Scoring | Manual review | Haiku structured JSON scoring per phase |
| Best for | Manual eval + iteration loop | Automated parallel coverage + verification + deploy runs |
- Create fresh sandbox: `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })`, with no snapshot
- Install tools: `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
- Auth Vercel CLI: Write token to `~/.local/share/com.vercel.cli/auth.json`
- Upload plugin: `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
- Phase 1 (BUILD): Claude Code builds the app (30 min timeout)
- Score build: Haiku evaluates completeness, API routes, UI, AI features
- Start dev server: If not already running, start `npx next dev --port 3000`
- Extend timeout: `sandbox.extendTimeout()` for verify + deploy + keep-alive
- Phase 2 (VERIFY): Claude Code uses agent-browser to test user stories (20 min timeout). The prompt tells Claude to start the dev server itself if it is not running.
- Score verify: Haiku evaluates each user story as pass/fail with reasons
- Re-extract skills: Skills are re-collected after the verify phase (agent-browser + code fixes trigger more)
- Phase 3 (DEPLOY): Claude Code runs `vercel link`, deploys, and fixes build errors (30 min timeout)
- Score deploy: Haiku evaluates deploy success, URL extraction, errors
- Re-extract skills: Skills are re-collected after the deploy phase
- Write incremental results: Each scenario writes its own `result.json` immediately on completion (survives crashes)
- Extract source archive: `source.tar.gz` of project files saved locally
- Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs
Sandbox Session Flow (Per Scenario)
Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
│
├─ npm install -g @anthropic-ai/claude-code vercel agent-browser (~20s)
├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/ (80 files, ~945KB)
├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
│
├─ Phase 1: BUILD
│ ├─ sandbox.writeFiles() → /tmp/prompt.txt
│ ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
│ │ (with AbortSignal.timeout(TIMEOUT_MS))
│ ├─ Poll every 20s:
│ │ ├─ ls /tmp/vercel-plugin-*-seen-skills.d/ (claimed skills)
│ │ ├─ cat /tmp/vercel-plugin-*-seen-skills.txt (seen skills snapshot)
│ │ ├─ find ~/.claude/debug -type f (debug log count)
│ │ ├─ find <project> -newer /tmp/prompt.txt (new project files)
│ │ └─ curl localhost:3000 (port status)
│ ├─ Extract build artifacts
│ └─ Haiku build score (structured JSON)
│
├─ Start dev server (if not already running)
├─ sandbox.extendTimeout(...)
│
├─ Phase 2: VERIFY (if >1 project file exists)
│ ├─ sandbox.writeFiles() → /tmp/verify.txt (agent-browser verification prompt)
│ ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
│ │ (with AbortSignal.timeout(1_200_000) — 20 min)
│ ├─ Re-extract skills (verify phase triggers more)
│ └─ Haiku verify score (per-story pass/fail JSON)
│
├─ Phase 3: DEPLOY (if >3 project files)
│ ├─ sandbox.writeFiles() → /tmp/deploy.txt
│ ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
│ │ (links to vercel-labs, deploys, fixes build errors up to 3x)
│ ├─ Extract deploy URL from output (*.vercel.app)
│ ├─ Re-extract skills (deploy phase triggers more)
│ └─ Haiku deploy score (structured JSON)
│
├─ Write <slug>/result.json immediately (crash-safe)
├─ Update aggregate results.json (complete: false until all done)
├─ Extract source.tar.gz
└─ sandbox.stop() (skipped if --keep-alive)
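The gating conditions in the flow above (verify needs more than 1 project file, deploy needs more than 3) combine with the skip flags roughly as follows. This is an illustrative sketch only; `phasesAfterBuild` and `PhaseFlags` are hypothetical names, not the runner's code.

```typescript
// Flags corresponding to --skip-verify and --skip-deploy.
interface PhaseFlags {
  skipVerify: boolean;
  skipDeploy: boolean;
}

type LaterPhase = "verify" | "deploy";

// Hypothetical helper: decides which phases follow BUILD, given how many project files BUILD produced.
function phasesAfterBuild(projectFileCount: number, flags: PhaseFlags): LaterPhase[] {
  const phases: LaterPhase[] = [];
  if (!flags.skipVerify && projectFileCount > 1) {
    phases.push("verify"); // Phase 2 runs only if >1 project file exists
  }
  if (!flags.skipDeploy && projectFileCount > 3) {
    phases.push("deploy"); // Phase 3 runs only if >3 project files exist
  }
  return phases;
}
```

Keeping the decision in one pure function makes the thresholds easy to test without creating a sandbox.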
Verification Phase Details
The verify phase is the "closer": its job is to make the app work and prove it. Key behaviors:
- Always runs if >1 project file exists (no longer gated on port 3000 being up)
- Starts the dev server itself if it is not already running; the prompt tells Claude to check and start it if needed
- 20 minute timeout, which is enough for agent-browser to open pages, screenshot, interact, fix broken code, restart the server, and re-verify
- Triggers skill injection, because the verify session creates and edits files, firing PreToolUse and PostToolUse hooks
- Uses the agent-browser workflow: open pages → screenshot → interact → fix → re-verify
- Results are scored by haiku, so pass/fail no longer has to be parsed from free text
Deploy Phase Details
The deploy phase uses a full Claude Code session (for skill tracking) to:
- Run `vercel link --yes --scope vercel-labs --project <slug>-YYYYMMDD`
- Run the deploy
- If the build fails, fix the code and retry (up to 3 attempts)
- Important: the session unsets the Vercel token env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`
- Deployment protection is enabled by default on the vercel-labs team
The deploy URL is extracted by regex from Claude's output, with haiku as the fallback URL extractor.
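The regex step might look like the sketch below. The exact pattern the runner uses is not shown in the source, so both the pattern and the choice to return the last match (on the assumption that retries print earlier, failed URLs first) are illustrative; `extractDeployUrl` is a hypothetical name.

```typescript
// Hypothetical helper: pulls a *.vercel.app URL out of free-form session output.
// Returns null when nothing matches, at which point a fallback extractor would take over.
function extractDeployUrl(output: string): string | null {
  const matches = output.match(/https:\/\/[a-z0-9][a-z0-9-]*\.vercel\.app/gi) ?? [];
  const last = matches[matches.length - 1];
  return last === undefined ? null : last;
}
```

Returning `null` rather than throwing keeps the fallback path explicit at the call site.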
DO NOT (Hard Rules)
Same rules as benchmark-agents, plus sandbox-specific:
- DO NOT use the `-p` (print) flag for BUILD/VERIFY/DEPLOY phases: hooks don't fire without tool-calling sessions (use it only for haiku scoring passes)
- DO NOT let sandboxes run without extracting artifacts: the ephemeral filesystem is lost on stop
- DO NOT pass API keys by any route other than `Sandbox.create({ env: { ... } })`
- DO NOT skip snapshotting after build: it's your safety net if verify/deploy kills the sandbox
- DO NOT use the v2 beta SDK: the named sandbox endpoint returns 404 for this team; use v1.8.0
- DO NOT use heredocs to write file content: use `sandbox.writeFiles()` instead
- DO NOT assume a conventional home path exists: the home dir is `/home/vercel-sandbox`
- DO NOT use simple project names without timestamps: always append a timestamp to avoid collisions across runs
One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local
Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
- ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
- VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
Run eval with dynamic scenarios (recommended)
Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json
With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8
Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy
Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
Monitoring While Running
The orchestrator prints live status. For manual checks on a running sandbox:
typescript
// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
"ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);
// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
"find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);
// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
"curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);
// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);
Artifact Export Layout
Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:
<run-id>/
results.json # Aggregate results (complete: false until all done, then true)
report.md # Markdown report with scores, coverage, URLs
<slug>/
result.json # Per-scenario result (written immediately on completion)
source.tar.gz # Project source archive
Each scenario result includes, among other fields:
- the public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
- the permanent deploy URL (if deploy succeeded)
- timestamped skill/file/port snapshots
- the verification result: `{ ran, exitCode, stories: [{ index, status }], output }`
- the haiku structured completeness assessment
- the haiku structured deploy assessment
The markdown report (`report.md`) includes:
- Summary table — slug, build status, skills, files, verify results, deploy URL, duration
- Per-scenario details — build score, deploy score, verification per-story pass/fail
- Skill coverage — expected vs actual per scenario, missing/bonus breakdown
- Total unique skills across all scenarios
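The "expected vs actual" breakdown in the report amounts to a set comparison. The following sketch shows one way to compute it; `skillCoverage` and its field names are hypothetical, chosen to mirror the missing/bonus wording above.

```typescript
// Result of comparing a scenario's expectedSkills with the skills actually injected.
interface SkillCoverage {
  matched: string[]; // expected and seen
  missing: string[]; // expected but not seen
  bonus: string[];   // seen but not expected
  ratio: number;     // matched / expected, 1 when nothing was expected
}

// Hypothetical helper: computes coverage for one scenario.
function skillCoverage(expected: string[], actual: string[]): SkillCoverage {
  const actualSet = new Set(actual);
  const expectedSet = new Set(expected);
  const matched = expected.filter((skill) => actualSet.has(skill));
  const missing = expected.filter((skill) => !actualSet.has(skill));
  const bonus = actual.filter((skill) => !expectedSet.has(skill));
  const ratio = expected.length === 0 ? 1 : matched.length / expected.length;
  return { matched, missing, bonus, ratio };
}
```

For example, a scenario expecting four skills that triggers all four plus one extra would report a ratio of 1 with a single bonus skill.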
Proven Results (2026-03-10)
Across 34 scenarios run in 5 batches:
| Metric | Best | Typical |
|---|---|---|
| Skills per scenario | 31 (ai-interior-designer) | 12-24 |
| Expected skill coverage | 100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6) | 50-86% |
| User stories verified | 3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board) | varies |
| Files built per scenario | 37 (student-study-groups) | 6-25 |
| Build time | 5-11 min | 5-7 min |
Key findings:
- User-story-focused prompts (no tech name-dropping) work — plugin detects patterns from actual code
- , , , are the most consistently detected skills
- , need Claude to write specific file patterns to trigger
- Lexical prompt inject (UserPromptSubmit) working — skills injected before any files written
- The session cleanup hook deletes claim dirs, so use poll history for final skill counts
- Enterprise tier (vercel-labs) — no sandbox time cap; builds ran 10+ minutes
Known Limitations
- Snapshot stops the source sandbox: Taking a snapshot stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
- v2 beta incompatible: The named sandbox endpoint of `@vercel/sandbox@2.0.0-beta.3` returns 404 for this team. Stick with v1.8.0.
- Artifact window: Artifacts must be extracted before `sandbox.stop()` because the filesystem is ephemeral. The session cleanup hook may delete claim dirs before extraction.
- Amazon Linux paths: The user is `vercel-sandbox`, with home at `/home/vercel-sandbox`.
- `--dangerously-skip-permissions` parity: Sandbox evals auto-approve all tool calls. WezTerm evals use the normal permission flow. Coverage results may differ.
- Command timeout: Use `{ signal: AbortSignal.timeout(ms) }`; a plain timeout option is silently ignored.
- BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic is recommended for production runs.
- Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable URL. The haiku scoring step provides a fallback URL extraction attempt.
- Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.
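For the transient sandbox-creation failures mentioned above, a generic retry wrapper is one option. This is a sketch under stated assumptions: `withRetry`, the attempt count, and the exponential backoff are illustrative defaults, not something the runner is known to implement.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
// Rethrows the last error once maxAttempts is exhausted.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

In this setting the wrapped operation would be the sandbox creation call, for example `withRetry(() => Sandbox.create({ ... }))`. The wrapper retries on any error, so in practice it may be worth narrowing it to errors known to be transient.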