benchmark-agents

Benchmark Agents — Advanced AI Systems


Launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.

How Evals Work (The Only Correct Method)


Evals are run by you, in this conversation, not by scripts. The process is:
  1. You create directories and install the plugin via Bash tool calls
  2. You spawn WezTerm panes with `wezterm cli spawn` — each pane runs an independent Claude Code interactive session
  3. You wait, then check debug logs and claim dirs to see what the plugin injected
  4. You inspect the generated source code for correctness
  5. You read conversation logs to find what the user had to correct
  6. You update skills/hooks, run `/release`, and spawn more evals
Never use `claude --print`, eval scripts, or `Bun.spawn(["claude", ...])`. These do not work because:
  • Plugin hooks (PreToolUse, PostToolUse, UserPromptSubmit) only fire during interactive tool-calling sessions
  • `--print` mode generates text without executing tools — no files are created, no deps installed, no dev servers started
  • No `session_id` means dedup, profiler, and claim files don't work
The WezTerm interactive approach is the only method that exercises the plugin correctly. Every eval in our history (60+ sessions) used this approach.

DO NOT (Hard Rules)


These are absolute prohibitions. Violating any of them wastes the entire eval run:
  • DO NOT use `claude --print` or the `-p` flag — hooks don't fire, no files created
  • DO NOT use `--dangerously-skip-permissions` — changes agent behavior
  • DO NOT create projects in `/tmp/` — always use `~/dev/vercel-plugin-testing/`
  • DO NOT manually create `settings.local.json` or wire hooks by hand — use `npx add-plugin`
  • DO NOT set `CLAUDE_PLUGIN_ROOT` manually — the plugin manages this
  • DO NOT use `bash -c` or `bash -lc` in WezTerm — always use `/bin/zsh -ic`
  • DO NOT use the full path to claude — use the `x` alias (it's configured in zsh)
  • DO NOT create custom `debug.log` files with stderr redirects — debug logs go to `~/.claude/debug/`
  • DO NOT write eval runner scripts in TypeScript/JavaScript — do everything as Bash tool calls in the conversation
  • DO NOT try to `git init` or create `package.json` manually — `npx add-plugin` + the WezTerm session handle all scaffolding
  • DO NOT use uppercase letters in directory names — npm rejects them (e.g. the `T` in ISO timestamps breaks `create-next-app`)
Copy the exact commands below. Do not improvise.
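The uppercase rule can be enforced with a small guard before scaffolding. `check_slug` is a hypothetical helper, not part of the plugin; a minimal sketch:

```shell
# Hypothetical guard for the uppercase-directory pitfall (not part of the plugin).
check_slug() {
  case "$1" in
    *[A-Z]*) echo "REJECT: uppercase in '$1'"; return 1 ;;
    *)       echo "OK: $1" ;;
  esac
}

check_slug "my-app-20260309-1227"
check_slug "my-app-20260309T1227"   # the ISO-8601 T is what breaks create-next-app
```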

Setup & Launch (Exact Commands)


Naming convention


Always append a timestamp to directory names so reruns don't overwrite old projects: `<slug>-<yyyymmdd>-<hhmm>`
Examples: `tarot-card-deck-20260309-1227`, `interior-designer-20260309-1227`
Generate the timestamp with: `date +%Y%m%d-%H%M`

1. Create test directory and install plugin


bash
TS=$(date +%Y%m%d-%H%M)
SLUG="my-app-$TS"
mkdir -p ~/dev/vercel-plugin-testing/$SLUG
cd ~/dev/vercel-plugin-testing/$SLUG
npx add-plugin https://github.com/vercel/vercel-plugin -s project -y

2. Launch session via WezTerm


bash
wezterm cli spawn --cwd /Users/johnlindquist/dev/vercel-plugin-testing/$SLUG -- /bin/zsh -ic \
  "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '<PROMPT>' --settings .claude/settings.json; exec zsh"
Key flags:
  • `unset CLAUDECODE` — prevents nested session detection error
  • `VERCEL_PLUGIN_LOG_LEVEL=debug` — enables hook debug output in `~/.claude/debug/`
  • `x` — alias for the `claude` CLI
  • `--settings .claude/settings.json` — loads project-level plugin settings

3. Find the debug log (wait ~25s for SessionStart hooks)


bash
find ~/.claude/debug -name "*.txt" -mmin -2 -exec grep -l "$SLUG" {} +
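Once a matching log is found, its basename is the session id (the logs follow the `~/.claude/debug/<session-id>.txt` pattern used in the monitoring section). A sketch, assuming one recent match:

```shell
# Sketch: capture the newest matching log and derive the session id from its filename.
LOG=$(find ~/.claude/debug -name "*.txt" -mmin -2 -exec grep -l "$SLUG" {} + 2>/dev/null | head -1)
SESSION_ID=$(basename "$LOG" .txt)
echo "SESSION_ID=$SESSION_ID"
```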

4. Launch multiple sessions in parallel


Create dirs and install plugin in a loop, then spawn each WezTerm pane:
bash
TS=$(date +%Y%m%d-%H%M)
cd ~/dev/vercel-plugin-testing
for name in tarot-deck interior-designer superhero-origin; do
  d="${name}-${TS}"
  mkdir -p "$d" && (cd "$d" && npx add-plugin https://github.com/vercel/vercel-plugin -s project -y)
done

Then spawn each (these run in separate terminal panes)


bash
wezterm cli spawn --cwd .../tarot-deck-$TS -- /bin/zsh -ic "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '...' --settings .claude/settings.json; exec zsh"
wezterm cli spawn --cwd .../interior-designer-$TS -- /bin/zsh -ic "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '...' --settings .claude/settings.json; exec zsh"
wezterm cli spawn --cwd .../superhero-origin-$TS -- /bin/zsh -ic "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '...' --settings .claude/settings.json; exec zsh"
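The per-project spawn commands can also be generated in a loop. This dry-run sketch prints each command so you can check the paths first; drop the leading `echo` to actually spawn, and `<PROMPT>` is a placeholder for the real natural-language prompt:

```shell
# Dry run: print one spawn command per project; remove `echo` to execute for real.
TS=$(date +%Y%m%d-%H%M)
for name in tarot-deck interior-designer superhero-origin; do
  echo wezterm cli spawn --cwd "$HOME/dev/vercel-plugin-testing/${name}-${TS}" -- /bin/zsh -ic \
    "unset CLAUDECODE; VERCEL_PLUGIN_LOG_LEVEL=debug x '<PROMPT>' --settings .claude/settings.json; exec zsh"
done
```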

Monitoring


Skill injection claims (the key metric)


bash
TMPDIR=$(node -e "import {tmpdir} from 'os'; console.log(tmpdir())" --input-type=module)
CLAIMDIR="$TMPDIR/vercel-plugin-<session-id>-seen-skills.d"

List all injected skills


ls "$CLAIMDIR"

Count


ls "$CLAIMDIR" | wc -l

Check specific skill


ls "$CLAIMDIR/workflow" && echo "YES" || echo "NO"

Hook firing


bash
LOG=~/.claude/debug/<session-id>.txt

SessionStart hooks


grep -c 'SessionStart.*success' "$LOG"

PreToolUse calls and injections


grep -c 'executePreToolHooks' "$LOG"          # total calls
grep -c 'provided additionalContext' "$LOG"   # actual injections

PostToolUse validation catches


grep 'VALIDATION' "$LOG" | head -10

UserPromptSubmit


grep -c 'UserPromptSubmit.*success' "$LOG"

Quick status check for multiple sessions


bash
TMPDIR=$(node -e "import {tmpdir} from 'os'; console.log(tmpdir())" --input-type=module 2>/dev/null)

for label_id in "slug1:SESSION_ID_1" "slug2:SESSION_ID_2" "slug3:SESSION_ID_3"; do
  label="${label_id%%:*}"
  id="${label_id##*:}"
  claimdir="$TMPDIR/vercel-plugin-$id-seen-skills.d"
  echo "=== $label ==="
  count=$(ls "$claimdir" 2>/dev/null | wc -l | tr -d ' ')
  claims=$(ls "$claimdir" 2>/dev/null | sort | tr '\n' ', ')
  echo "Skills ($count): $claims"
done
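The hook-firing greps can be bundled into one helper per session. `session_summary` is a hypothetical function (not part of the plugin), assuming the `~/.claude/debug/<session-id>.txt` layout from the "Hook firing" section:

```shell
# Hypothetical helper: hook-firing counts for one session's debug log.
session_summary() {
  log="$HOME/.claude/debug/$1.txt"
  echo "SessionStart: $(grep -c 'SessionStart.*success' "$log" 2>/dev/null)"
  echo "Injections:   $(grep -c 'provided additionalContext' "$log" 2>/dev/null)"
  echo "Validations:  $(grep -c 'VALIDATION' "$log" 2>/dev/null)"
}
```

Call it with each session id from the quick status check to get one compact block per session.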

Verification — What to Check in Generated Code

验证——生成代码需要检查的内容

After sessions build, verify these patterns in the generated projects:

Project structure


bash
echo -n "src/: "; test -d "$base/src" && echo YES || echo NO          # Should be NO for WDK projects
echo -n "workflows/: "; test -d "$base/workflows" && echo YES || echo NO
echo -n "withWorkflow: "; grep -q "withWorkflow" "$base"/next.config.* && echo YES || echo NO
echo -n "components.json: "; test -f "$base/components.json" && echo YES || echo NO

Image generation model


bash
# Should use gemini-3.1-flash-image-preview, NOT dall-e-3 or older gemini models
grep -rnE "gemini.*image|dall-e|experimental_generateImage|result\.files" "$base/workflows/" "$base/app/" 2>/dev/null | grep ".ts"

Gateway vs direct provider


bash
# Should use gateway() or plain "provider/model" strings, NOT openai("gpt-4o") directly
grep -rnE 'from.*@ai-sdk/openai|openai\(' "$base" 2>/dev/null | grep ".ts" | grep -v node_modules
grep -rnE 'gateway\(|model:.*"openai/' "$base" 2>/dev/null | grep ".ts" | grep -v node_modules

AI Elements installed


bash
find "$base" -path "*/ai-elements/*.tsx" 2>/dev/null | grep -v node_modules | wc -l

Workflow API usage


bash
wf=$(find "$base" -name "*.ts" -path "*/workflow*" 2>/dev/null | grep -v node_modules | head -1)
head -5 "$wf"   # Should show: import { getWritable } from "workflow"

Prompt Design Rules


Describe products, not technologies. Let the plugin infer which skills to inject. This tests whether the plugin's pattern matching and prompt signals work from natural language.

DO:


  • "runs a multi-step creation pipeline that streams each phase"
  • "generates a portrait image"
  • "users can chat with an AI advisor"
  • "store all designs in a gallery"

DON'T:


  • "use Vercel Workflow DevKit with getWritable"
  • "use gateway('google/gemini-3.1-flash-image-preview')"
  • "install npx ai-elements"
  • "add withWorkflow to next.config.ts"

Always end prompts with:


"Link the project to my vercel-labs team so we can deploy it later. Skip any planning and just build it. Get the dev server running."

Phrases that trigger key skills (via promptSignals):


  • workflow: "multi-step pipeline", "streams progress", "streams each phase", "durable pipeline", "creation pipeline"
  • ai-sdk: triggered by import/install patterns (very broad)
  • shadcn: triggered by the `create-next-app` bash pattern
  • ai-elements: triggered when ai-sdk is active + chat UI patterns
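Putting the rules together, a full eval prompt might look like this. The tarot theme is illustrative, not a required scenario; note the product language, the trigger phrases, and the standard closing line:

```shell
# Illustrative prompt assembly: product description + trigger phrases + standard closer.
PROMPT="Build a tarot card reading app where users draw cards and it runs a multi-step \
creation pipeline that streams each phase and generates a portrait image for each card. \
Link the project to my vercel-labs team so we can deploy it later. Skip any planning \
and just build it. Get the dev server running."
echo "$PROMPT"
```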

Common Issues Found in Evals (and Fixes Applied)


| Issue | Cause | Plugin Fix (version) |
| --- | --- | --- |
| Workflow not triggered from natural language | promptSignals too narrow | Broadened phrases, lowered minScore 6→4 (v0.9.5) |
| Agent uses `openai("gpt-4o")` instead of gateway | Agent's training data defaults to openai | PostToolUse validate warns "your knowledge is outdated" (v0.9.9) |
| Agent uses `dall-e-3` for images | Agent doesn't know about gemini image gen | PostToolUse validate warns; capabilities table in ai-sdk (v0.9.7) |
| Agent uses `experimental_generateImage` | Old API | PostToolUse validate warns; recommends `generateText` + `result.files` (v0.9.9) |
| Raw markdown rendering (`**bold**` visible) | Agent skips AI Elements | `MessageResponse` documented as universal renderer (v0.9.2) |
| `@/../../workflows/` broken import | Workflows outside `@` alias root | Canonical structure docs: no `src/` for WDK (v0.8.3) |
| `withWorkflow` missing from next.config | Agent skipped setup step | Marked as "Required" in workflow skill (v0.8.1) |
| `defineHook` but no resume route | Agent didn't wire the 3-piece pattern | Documented as 3 required pieces (v0.9.3) |
| `generateObject()` used (removed in v6) | Agent's training data | PostToolUse validate catches as error (v0.9.3) |
| `getWritable()` in workflow scope | Sandbox violation | Strengthened warning in skill (v0.8.1) |
| Missing `vercel link` + `vercel env pull` | No OIDC credentials | Added as "Required" setup step (v0.9.1) |
| `getStepMetadata().retryCount` undefined on first attempt | WDK quirk | Documented: guard with `?? 0` (v0.9.1) |
| shadcn not installed | No trigger for scaffolding | Added `create-next-app` bashPattern to shadcn (v0.8.0) |
| Skill cap too low (3) | Only 3 skills injected per tool call | Raised to 5 with 18KB budget (v0.8.0) |

Agent-Browser Verification


After dev server starts, verify with agent-browser. Note: agents currently DO NOT self-verify despite the skill being injected. You must launch verification manually:
bash
agent-browser open http://localhost:<port>
agent-browser wait --load networkidle
agent-browser screenshot
agent-browser snapshot -i

Coverage Report


Write results to `.notes/COVERAGE.md` with:
  1. Session index — slug, session ID, unique skills, dedup status
  2. Hook coverage matrix — which hooks fired in which sessions
  3. Skill injection table — which of the 43 skills triggered
  4. Code quality checks — gateway vs direct, image model, withWorkflow, AI Elements
  5. PostToolUse validation catches — outdated models, deprecated APIs
  6. Issues found — bugs, pattern gaps, new findings to feed back into skills
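A minimal sketch for seeding the report file with those six sections (the heading wording is illustrative):

```shell
# Seed .notes/COVERAGE.md with the six report sections; fill each in as evals complete.
mkdir -p .notes
cat > .notes/COVERAGE.md <<'EOF'
# Coverage Report

## 1. Session Index
## 2. Hook Coverage Matrix
## 3. Skill Injection Table
## 4. Code Quality Checks
## 5. PostToolUse Validation Catches
## 6. Issues Found
EOF
```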

Release → Eval Loop


The standard improvement cycle:
  1. Run evals — launch 3 sessions with natural language prompts
  2. Check results — skill claims, project structure, code quality
  3. Identify gaps — what skills didn't trigger, what patterns are wrong
  4. Read conversation logs — find user follow-up corrections
  5. Fix skills — update SKILL.md content, patterns, validate rules
  6. Run gates — `bun run typecheck && bun test && bun run validate`
  7. Release — bump version, `bun run build`, commit, push
  8. Repeat — launch 3 more evals to verify fixes

Scenario Table


| # | Slug | Prompt Summary | Expected Skills |
| --- | --- | --- | --- |
| 01 | doc-qa-agent | PDF Q&A with embeddings, citations, multi-step reasoning | ai-sdk, nextjs, vercel-storage, ai-elements |
| 02 | customer-support-agent | Durable support agent, escalation, confidence tracking | ai-sdk, workflow, nextjs, ai-elements |
| 03 | deploy-monitor | Uptime monitoring, AI incident responder, durable investigation | workflow, cron-jobs, observability, ai-sdk |
| 04 | multi-model-router | Side-by-side model comparison, parallel streaming, cost tracking | ai-gateway, ai-sdk, nextjs, ai-elements |
| 05 | slack-pr-reviewer | Multi-platform chat bot, PR review, threaded conversations | chat-sdk, ai-sdk, nextjs |
| 06 | content-pipeline | Durable multi-step content production with image generation | workflow, ai-sdk, satori, nextjs |
| 07 | feature-rollout | Feature flags, A/B testing, AI experiment analysis | vercel-flags, ai-sdk, nextjs |
| 08 | event-driven-crm | Event-driven CRM, churn prediction, re-engagement emails | vercel-queues, workflow, ai-sdk, email |
| 09 | code-sandbox-tutor | AI coding tutor with sandbox execution, auto-fix | vercel-sandbox, ai-sdk, nextjs, ai-elements |
| 10 | multi-agent-research | Parallel sub-agents, durable orchestration, streaming synthesis | workflow, ai-sdk, ai-elements, nextjs |
| 11 | discord-game-master | RPG bot, persistent game state, scene illustration generation | chat-sdk, ai-sdk, vercel-storage, nextjs |
| 12 | compliance-auditor | Scheduled AI audits, durable approval workflow, deploy blocking | workflow, cron-jobs, ai-sdk, vercel-firewall |

Complexity Tiers


Tier 1 — Core AI (30-45 min, `--quick`)

Scenarios 01, 04, 09 — AI SDK, Gateway, Sandbox, AI Elements without durable workflows.

Tier 2 — Durable Agents (45-60 min)


Scenarios 02, 03, 06, 10 — Workflow DevKit, multi-step durability, agent orchestration.

Tier 3 — Platform Integration (45-60 min)


Scenarios 05, 07, 08, 11, 12 — Chat SDK, Queues, Flags, Firewall, cross-platform messaging.

Full Suite


All 12 scenarios, ~3-4 hours.

Cleanup


bash
rm -rf ~/dev/vercel-plugin-testing
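If you want to keep recent runs around for comparison, a gentler variant removes only stale projects. The 7-day window is an arbitrary choice, not a plugin convention:

```shell
# Sketch: delete only eval projects older than 7 days; keep recent runs for comparison.
find ~/dev/vercel-plugin-testing -mindepth 1 -maxdepth 1 -type d -mtime +7 \
  -exec rm -rf {} + 2>/dev/null || true
```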