agent-experience

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Audit Agent Experience

审核Agent体验

Evaluate how well a product/SDK/docs surface works when an AI agent actually tries to onboard and do a realistic task — starting from a short one-sentence prompt, with nothing pasted in. The agent must find the docs, install what it needs, and attempt real work. That's the only honest test of agent DX.

The skill spawns multiple subagents in parallel, captures each one's tool-call trace, and scores the experience using the same dimensions as the Skill Test Arena dashboard: Setup Friction, Speed, Efficiency, Error Recovery, Doc Quality.

评估当AI Agent实际尝试入职并完成一项真实任务时，产品/SDK/文档平台的表现——仅从一句简短的提示开始，不提供任何额外粘贴内容。代理必须自行查找文档、安装所需组件并尝试完成实际工作。这是测试Agent DX（开发者体验）最真实的方式。

该技能会并行生成多个子代理，捕获每个代理的工具调用轨迹，并使用与Skill Test Arena仪表板相同的维度对体验进行评分：设置摩擦度、速度、效率、错误恢复能力、文档质量。

Core principle

核心原则

Do not spoonfeed. The subagent gets a tiny prompt like "Get started with {product} and {do its primary thing}". It must discover the docs, choose the path, and hit real failures. A good doc survives this; a bad doc does not.

不要喂饭式指导。子代理只会收到类似*"开始使用{product}并完成其核心功能"*的简短提示。它必须自行发现文档、选择路径并直面真实的失败。优秀的文档能通过这种测试，糟糕的文档则无法通过。

Workflow

工作流程

Execute these steps in order. Do not skip ahead.

按顺序执行以下步骤，请勿跳过。

Step 1 — Identify the target and define the abstract goal

步骤1 — 确定目标并定义抽象目标

Resolve what the user is asking to audit. The target may arrive in one of three forms:

URL — a docs site or product page (e.g.,
```
https://docs.example.com
```
). This is the seed the subagents start from.
Repo / file path — for SKILL.md audits or SDK repos.
Product name — if the user is vague ("test my product"), ask via
```
AskUserQuestion
```
for the URL or repo.

This skill is product-agnostic. Never assume what the user wants to audit. Do not infer a target from environment signals (operator's email domain, git remote, repo name, recent files, memory, CLAUDE.md). Even if context strongly suggests a particular company, the user-facing question must NOT pre-fill or default to any specific product, URL, or company name. Ask open-endedly with neutral options only: e.g., "Paste a URL", "Paste a local path", "Type a product name". If the user did not name a target in their invocation, ask them — start fresh, no priors.

Research lightly after the user has named a target. 1 WebFetch max, enough to confirm: what is this product, and does it have a getting-started guide? You're identifying that there is a flow to follow, not extracting the steps. The whole point is to let the docs dictate the path.

Define ONE abstract goal, not a step-by-step checklist. The goal should be at the level of "complete the onboarding" or "make the product do its primary thing once" — NOT a list of specific actions.

Why: prescriptive checklists steer agents. If you tell them "navigate to example.com" but the docs' quickstart navigates to a different URL, the agent is torn between your instruction and the docs. That pollutes the test.

Examples of good abstract goals (the target product is supplied by the user — the examples below are illustrative only, not defaults):

A search API → "Complete the getting-started guide. Success = your code successfully calls the API and prints whatever the docs treat as a meaningful result."
A payments API → "Complete the getting-started flow for making a test charge. Success = you have a charge ID or equivalent confirmation."
A browser-automation SDK → "Complete the getting-started guide end-to-end. Success = you have code that runs a cloud browser session using whatever approach the docs recommend."
A SKILL.md → "Follow the skill's instructions and produce a successful outcome for its advertised job."

Examples of BAD goals (too prescriptive — don't do this):

~~"Navigate to https://example.com"~~ (steers — the docs may pick a different URL)
~~"Use Playwright"~~ (the docs may recommend Stagehand or Selenium)
~~"Print the page title"~~ (the docs may print session ID, response body, anything)

The subagent will self-report against the abstract goal: did I complete the onboarding as the docs described? (yes / no / partial). The concrete sub-outcomes the agent actually achieved live in their trace under

primary_outcome_achieved

, not in a pre-defined checklist.

If the target has no clear getting-started flow (rare — even a README is a flow), ask the user what "done" means before continuing.

明确用户要求审核的对象。目标可能以三种形式呈现：

URL — 文档站点或产品页面（例如
```
https://docs.example.com
```
）。这是子代理的起始种子。
仓库/文件路径 — 用于SKILL.md审核或SDK仓库。
产品名称 — 如果用户表述模糊（如"测试我的产品"），通过
```
AskUserQuestion
```
询问其URL或仓库地址。

本技能与产品无关。绝不假设用户想要审核的对象。不要从环境信号（操作者的邮箱域名、Git远程仓库、仓库名称、最近文件、记忆、CLAUDE.md）中推断目标。即使上下文强烈指向某家公司，面向用户的问题也不得预先填充或默认任何特定产品、URL或公司名称。仅使用中立选项进行开放式提问：例如"粘贴URL"、"粘贴本地路径"、"输入产品名称"。如果用户在调用时未指定目标，请询问他们——从零开始，无任何预设。

在用户指定目标后进行轻度调研。最多调用1次WebFetch，仅需确认：这是什么产品？它是否有入门指南？你只需确认存在可遵循的流程，无需提取具体步骤。核心是让文档主导路径。

定义一个抽象目标，而非分步清单。目标应处于"完成入职流程"或"让产品完成一次核心功能"的层面——而非一系列具体操作。

原因：指令性清单会引导代理。如果你告诉它"导航到example.com"，但文档的快速入门指南指向另一个URL，代理会在你的指令和文档之间陷入两难，这会污染测试结果。

优秀抽象目标示例（目标产品由用户提供——以下示例仅作说明，非默认值）：

搜索API → "完成入门指南。成功标准 = 你的代码成功调用API并打印文档中定义的有意义结果。"
支付API → "完成发起测试支付的入门流程。成功标准 = 你获得支付ID或等效确认信息。"
浏览器自动化SDK → "完整完成入门指南。成功标准 = 你拥有一段使用文档推荐方法运行云端浏览器会话的代码。"
SKILL.md → "遵循技能说明并为其宣传的任务生成成功结果。"

糟糕目标示例（过于指令化——请勿这样做）：

~~"导航到https://example.com"~~（会引导代理——文档可能选择不同URL）
~~"使用Playwright"~~（文档可能推荐Stagehand或Selenium）
~~"打印页面标题"~~（文档可能打印会话ID、响应体或其他内容）

子代理会根据抽象目标自我报告：我是否按照文档描述完成了入职？（是/否/部分完成）。代理实际达成的具体子结果记录在其轨迹的

primary_outcome_achieved

字段中，而非预定义清单。

如果目标没有明确的入门流程（罕见——即使README也是一种流程），请在继续前询问用户"完成"的定义。

Step 2 — Gather audit config via AskUserQuestion

步骤2 — 通过AskUserQuestion收集审核配置

Use

AskUserQuestion

in a single call with 4 questions. Options: max 4 per question.

Test depth (single-select, header:
```
"Depth"
```
):
- ```
5 agents (Recommended)
```
  — balanced coverage
- ```
3 agents
```
  — quick sanity check
- ```
10 agents
```
  — thorough, higher cost
Programming languages (multiSelect, header:
```
"Languages"
```
): pick up to 4 —
```
Python
```
,
```
TypeScript
```
,
```
Go
```
,
```
Shell/Bash
```
(let user deselect).
Personas (multiSelect, header:
```
"Personas"
```
):
- ```
Standard (Recommended)
```
  — neutral baseline, no behavioral flavoring. Just "do the task." Best for unbiased measurement.
- ```
Pragmatic
```
  — just get it working, fastest path
- ```
Thorough
```
  — read the docs end-to-end before coding
- ```
Skeptical
```
  — verify claims the docs make
Execution mode (single-select, header:
```
"Exec mode"
```
):
- ```
Allow Bash (Recommended)
```
  — subagents can run
```
npm install
```
  ,
```
curl
```
  , etc. on your machine. Most realistic.
- ```
Draft-only
```
  — subagents may fetch docs and write code but won't execute anything. Safer.

After the user answers, gather one more question about model choice:

Model (single-select, header:
```
"Model"
```
):
- ```
Sonnet (Recommended)
```
  — balanced cost/quality, default for most audits
- ```
Opus
```
  — strongest reasoning, highest cost; good for dense/ambiguous docs
- ```
Haiku
```
  — cheapest, fastest; good for checking if docs are agent-friendly to smaller models
- ```
Mixed comparison
```
  — split agents across Opus + Sonnet + Haiku so you can see how doc quality varies by model size. Useful for "are my docs robust even to weaker models?"

Pass the chosen model to each

Agent

invocation via the

model

parameter. If

Mixed

, distribute N agents roughly equally across the 3 models (round-robin by slot index) and record which model each agent used in the trace + report.

After the user answers, you have:

depth

(N),

languages[]

personas[]

exec_mode

model

If
exec_mode = "Allow Bash"
, follow up with a second AskUserQuestion asking about credentials:

Credentials (single-select, header:
```
"Credentials"
```
):
- ```
Auto-discover (Recommended)
```
  — skill checks the user's env vars, common dotfiles, and credential managers; only prompts for paste if nothing found. Best for repeat use and for cases where another operator is running the audit.
- ```
None — let agents block (friction test)
```
  — agents hit the credential wall, counts as Setup Friction. Best for pure docs audits.
- ```
Paste manually
```
  — you paste keys directly; skill injects them. Use when you don't have keys stored locally yet.

If user picks

Auto-discover

, run Step 2.5 below before continuing. If

Paste manually

(or auto-discover falls back), AskUserQuestion asks for the credential values — not the names. The skill then writes them to each workspace

.env

using generic, product-agnostic names:

Primary credential →
```
API_KEY
```
Secondary (e.g. project/org ID) →
```
PROJECT_ID
```
Third (e.g. webhook secret) →
```
SECRET
```

Do NOT use product-specific names like

BROWSERBASE_API_KEY

EXA_API_KEY

STRIPE_SECRET_KEY

. Those names steer the agent — they see

BROWSERBASE_API_KEY

in env and skip ever reading the docs to find out what env var the SDK actually expects. The generic name forces them to:

Read the docs to discover the product's actual env var name (e.g.
```
BROWSERBASE_API_KEY
```
).
Map the generic
```
API_KEY
```
value into whatever form the SDK requires — either re-export (
```
export BROWSERBASE_API_KEY=$API_KEY
```
) or pass inline in code (
```
new Browserbase({ apiKey: process.env.API_KEY })
```
).

If an agent fails to figure out the mapping, that's a doc quality signal — the docs weren't clear about credential naming.

使用

AskUserQuestion

进行单次调用，包含4个问题。每个问题最多提供4个选项。

测试深度（单选，标题：
```
"深度"
```
）：
- ```
5个代理（推荐）
```
  — 平衡覆盖范围
- ```
3个代理
```
  — 快速 sanity 检查
- ```
10个代理
```
  — 全面测试，成本更高
编程语言（多选，标题：
```
"语言"
```
）：最多选4种 —
```
Python
```
、
```
TypeScript
```
、
```
Go
```
、
```
Shell/Bash
```
（允许用户取消选择）。
角色 persona（多选，标题：
```
"角色"
```
）：
- ```
标准（推荐）
```
  — 中性基准，无行为倾向。仅需"完成任务"。最适合无偏见测量。
- ```
务实型
```
  — 只求快速完成任务，选择最快路径
- ```
严谨型
```
  — 在编码前完整阅读文档
- ```
怀疑型
```
  — 验证文档中的声明
执行模式（单选，标题：
```
"执行模式"
```
）：
- ```
允许Bash（推荐）
```
  — 子代理可在你的机器上运行
```
npm install
```
  、
```
curl
```
  等命令。最贴近真实场景。
- ```
仅草稿
```
  — 子代理可获取文档并编写代码，但不会执行任何操作。更安全。

用户回答后，再询问一个关于模型选择的问题：

模型（单选，标题：
```
"模型"
```
）：
- ```
Sonnet（推荐）
```
  — 成本/质量平衡，适合大多数审核场景
- ```
Opus
```
  — 推理能力最强，成本最高；适用于复杂/模糊的文档
- ```
Haiku
```
  — 最便宜、最快；适用于检查文档对小型模型是否友好
- ```
混合对比
```
  — 将代理分配到Opus + Sonnet + Haiku三种模型，以便观察文档质量如何随模型大小变化。适用于"我的文档是否对较弱模型也足够健壮？"的场景

通过

model

参数将所选模型传递给每个

Agent

调用。如果选择

混合对比

，则大致平均地将N个代理分配到3种模型（按插槽索引循环），并在轨迹和报告中记录每个代理使用的模型。

用户回答后，你将获得：

depth

（N）、

languages[]

、

personas[]

、

exec_mode

、

model

。

如果
exec_mode = "允许Bash"
，通过第二次AskUserQuestion询问凭据相关问题：

凭据（单选，标题：
```
"凭据"
```
）：
- ```
自动发现（推荐）
```
  — 技能检查用户的环境变量、常见点文件和凭据管理器；仅在未找到时提示粘贴。最适合重复使用或由其他操作者运行审核的场景。
- ```
无 — 让代理受阻（摩擦测试）
```
  — 代理会遇到凭据壁垒，计入设置摩擦度。最适合纯文档审核。
- ```
手动粘贴
```
  — 你直接粘贴密钥；技能会注入这些密钥。适用于本地未存储密钥的场景。

如果用户选择

自动发现

，在继续前执行步骤2.5。如果选择

手动粘贴

（或自动发现失败回退），AskUserQuestion会要求提供凭据值——而非名称。技能随后会使用通用、与产品无关的名称将其写入每个工作区的

.env

文件：

主凭据 →
```
API_KEY
```
次要凭据（如项目/组织ID） →
```
PROJECT_ID
```
第三凭据（如Webhook密钥） →
```
SECRET
```

请勿使用产品特定名称，如

BROWSERBASE_API_KEY

、

EXA_API_KEY

、

STRIPE_SECRET_KEY

。这些名称会引导代理——它们在环境中看到

BROWSERBASE_API_KEY

后，会跳过阅读文档以了解SDK实际期望的环境变量名称。通用名称会迫使它们：

阅读文档以发现产品的实际环境变量名称（例如
```
BROWSERBASE_API_KEY
```
）。
将通用的
```
API_KEY
```
值映射为SDK所需的形式——要么重新导出（
```
export BROWSERBASE_API_KEY=$API_KEY
```
），要么在代码中内联传递（
```
new Browserbase({ apiKey: process.env.API_KEY })
```
）。

如果代理无法完成映射，这是文档质量的信号——文档未明确说明凭据命名规则。

Step 2.5 — Credential auto-discovery (only if user picked

Auto-discover

)

步骤2.5 — 凭据自动发现（仅当用户选择

自动发现

时）

Run a tiered lookup. Stop at the first tier that produces a usable candidate. Never print credential values to chat — only names and source paths. The user picks by name; the skill internally maps name → value → workspace

.env

Derive the product slug from the target URL/repo to bias toward relevant matches. e.g.

https://docs.browserbase.com

→ slug

browserbase

. Use lowercase substring match (case-insensitive) when ranking candidates.

Tier 1 — Already-exported env vars (free, zero side effects):

bash

printenv | grep -iE '^[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' | cut -d= -f1

This returns names only. If any names contain the product slug, those are top candidates.

Tier 2 — Narrow dotfile scan (a hardcoded short list, NOT a recursive grep):

bash

grep -hE '^[[:space:]]*export[[:space:]]+[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' \
  ~/.zshrc ~/.bashrc ~/.bash_profile ~/.zprofile ~/.env ./.env ./.envrc 2>/dev/null \
  | sed -E 's/^[[:space:]]*export[[:space:]]+([A-Z0-9_]+)=.*/\1/' \
  | sort -u

Files allowed:

~/.zshrc

~/.bashrc

~/.bash_profile

~/.zprofile

~/.env

./.env

./.envrc

. Do NOT expand this list. Do NOT recurse. Do NOT scan
~/Library
,
~/.config/
,
~/Documents
, etc. This is the entire allowlist; anything else is out of scope and risks leaking unrelated secrets.

For each match, record

(NAME, source_path)

. Read the value lazily — only when the user has confirmed the choice — by re-grepping the specific source file for that exact name.

Tier 3 — Credential manager (only if
op
or
security
is on PATH AND tiers 1–2 had no good match):

1Password CLI: skip unless
```
op account list
```
exits 0 (i.e. user is signed in). Don't trigger an interactive auth flow inside the skill.
macOS Keychain:
```
security find-generic-password -l "<expected-name>" -w
```
— try once with the most likely name (e.g.
```
BROWSERBASE_API_KEY
```
); silent failure means not stored.

If a credential manager produces hits, list them as candidates the same way as tiers 1–2.

Tier 4 — Fallback to paste: If all tiers above produced zero candidates, fall through to the manual paste flow described in Step 2.

Presenting candidates to the user. After tiers 1–3:

If exactly 1 candidate and its name contains the product slug → use it silently. Log a one-line confirmation in chat:
```
Using BROWSERBASE_API_KEY from ~/.zshrc.
```
(Name + source only — never the value.)
If multiple candidates, AskUserQuestion (single-select, header:
```
"Use which credential?"
```
) with up to 4 options:
- One option per top candidate, formatted
```
<NAME> (from <source>)
```
- Plus a
```
Paste manually instead
```
  escape hatch
- If >3 candidates, show the top 3 by slug-relevance and add a
```
Show all
```
  option that re-asks with the rest.
If no candidates → fall through to Tier 4 (paste).

Reading the value. Once the user has confirmed a choice (or it was auto-selected), read the value:

Tier 1:
```
printenv <NAME>
```
(capture stdout, do not echo).
Tier 2: re-grep the specific source file for the exact
```
export <NAME>=
```
line and parse the RHS, stripping surrounding quotes.

Tier 3:

op read "op://<vault>/<item>/<field>"

security find-generic-password -l <NAME> -w

Write the value into per-agent workspace

.env

files using the same generic names (

API_KEY

PROJECT_ID

SECRET

) as the paste flow — see Step 2. The discovery layer is upstream of injection; downstream behavior (generic names, agent must read docs to map them) is unchanged.

Orchestrator-retained credentials. After writing per-agent

.env

files, the orchestrator keeps the original product-specific names → values (e.g.

BROWSERBASE_API_KEY

) available to itself for downstream verification work in Steps 6 / 6.5 / 8 — for example, calling the product's API with

curl

to confirm that a session ID an agent reported actually resolves, or fetching session metadata to enrich the report. The orchestrator can read them with

printenv

(no need to store anywhere — the parent shell already has them since auto-discover sourced them from there).

This is asymmetric on purpose: the subagents see only generic

API_KEY

PROJECT_ID

SECRET

so the doc-quality test stays honest (they must read the docs to discover the real var name). The orchestrator is not being audited, so it can use the real names freely for verification.

Privacy guarantees the skill must uphold:

Never write a credential value to chat output, the trace, the report, or any file outside the per-agent workspace
```
.env
```
.
Never re-export the value into a subagent's workspace under a product-specific name. Subagents only see the generic names.
Treat values as opaque strings — do not log length, prefix, or fingerprint.
The HTML report records that auto-discovery happened (and which name was used) but never the value.

运行分层查找。在第一个产生可用候选的层级停止。绝不将凭据值打印到聊天中——仅打印名称和源路径。用户通过名称选择；技能在内部将名称→值映射到工作区的

.env

文件。

从目标URL/仓库派生产品slug，以便偏向相关匹配。例如

https://docs.browserbase.com

→ slug

browserbase

。排名候选时使用小写子串匹配（不区分大小写）。

层级1 — 已导出的环境变量（免费，无副作用）：

bash

printenv | grep -iE '^[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' | cut -d= -f1

此命令仅返回名称。如果任何名称包含产品slug，则为顶级候选。

层级2 — 窄范围点文件扫描（硬编码短列表，非递归 grep）：

bash

grep -hE '^[[:space:]]*export[[:space:]]+[A-Z][A-Z0-9_]*_(API_KEY|TOKEN|SECRET|KEY)=' \
  ~/.zshrc ~/.bashrc ~/.bash_profile ~/.zprofile ~/.env ./.env ./.envrc 2>/dev/null \
  | sed -E 's/^[[:space:]]*export[[:space:]]+([A-Z0-9_]+)=.*/\1/' \
  | sort -u

允许扫描的文件：

~/.zshrc

、

~/.bashrc

、

~/.bash_profile

、

~/.zprofile

、

~/.env

、

./.env

、

./.envrc

。请勿扩展此列表。请勿递归扫描。请勿扫描
~/Library
、
~/.config/
、
~/Documents
等目录。这是完整的允许列表；其他任何内容均超出范围，可能泄露无关机密。

对于每个匹配项，记录

(NAME, source_path)

。仅当用户确认选择后才延迟读取值——通过重新 grep 特定源文件中该精确名称的行。

层级3 — 凭据管理器（仅当
op
或
security
在PATH中且层级1–2未找到合适匹配时）：

1Password CLI：除非
```
op account list
```
返回0（即用户已登录），否则跳过。请勿在技能内触发交互式认证流程。
macOS Keychain：
```
security find-generic-password -l "<expected-name>" -w
```
— 使用最可能的名称（例如
```
BROWSERBASE_API_KEY
```
）尝试一次；静默失败表示未存储。

如果凭据管理器产生匹配项，按与层级1–2相同的方式将其列为候选。

层级4 — 回退到粘贴： 如果上述所有层级均未产生候选，则回退到步骤2中描述的手动粘贴流程。

向用户展示候选。完成层级1–3后：

如果恰好有1个候选且其名称包含产品slug → 静默使用。在聊天中记录一行确认信息：
```
Using BROWSERBASE_API_KEY from ~/.zshrc.
```
（仅名称+源——绝不包含值）。
如果有多个候选，使用AskUserQuestion（单选，标题：
```
"使用哪个凭据？"
```
）提供最多4个选项：
- 每个顶级候选对应一个选项，格式为
```
<NAME> (from <source>)
```
- 加上
```
改为手动粘贴
```
  的退出选项
- 如果候选>3个，显示前3个与slug最相关的候选，并添加
```
显示全部
```
  选项以重新询问剩余候选。
如果没有候选 → 回退到层级4（粘贴）。

读取值。一旦用户确认选择（或自动选择），读取值：

层级1：
```
printenv <NAME>
```
（捕获标准输出，不回显）。
层级2：重新 grep 特定源文件中精确的
```
export <NAME>=
```
行并解析右侧内容，去除周围引号。

层级3：

op read "op://<vault>/<item>/<field>"

或

security find-generic-password -l <NAME> -w

。

使用与粘贴流程相同的通用名称（

API_KEY

、

PROJECT_ID

、

SECRET

）将值写入每个代理工作区的

.env

文件——参见步骤2。发现层位于注入层上游；下游行为（通用名称，代理必须阅读文档以映射）保持不变。

编排器保留的凭据。写入每个代理的

.env

文件后，编排器会保留原始产品特定名称→值（例如

BROWSERBASE_API_KEY

），供自身在步骤6/6.5/8中进行下游验证工作——例如，使用

curl

调用产品API以确认代理报告的会话ID是否实际有效，或获取会话元数据以丰富报告。编排器可通过

printenv

读取这些值（无需存储在其他地方——父shell已通过自动发现从源加载了这些值）。

这种不对称设计是有意为之：子代理仅能看到通用的

API_KEY

PROJECT_ID

SECRET

，以保持文档质量测试的公正性（它们必须阅读文档以发现真实变量名称）。编排器并非审核对象，因此可自由使用真实名称进行验证。

技能必须遵守的隐私保证：

绝不将凭据值写入聊天输出、轨迹、报告或每个代理工作区
```
.env
```
之外的任何文件。
绝不以产品特定名称将值重新导出到子代理的工作区。子代理仅能看到通用名称。
将值视为不透明字符串——不记录长度、前缀或指纹。
HTML报告会记录自动发现已发生（以及使用的名称），但绝不包含值。

Step 3 — Safety check

步骤3 — 安全检查

exec_mode = "Allow Bash"

, print a brief warning to chat before spawning: "Agents may run real shell commands (npm install, curl, pip install, git clone) on this machine. Make sure you're in a directory you're okay with agents modifying. Continue in 5 seconds or Ctrl-C to abort." — then continue.

Do not run

sleep

— just proceed after printing. The user reads the warning before the agents start working.

如果

exec_mode = "允许Bash"

，在生成代理前向聊天打印简短警告："代理可能会在此机器上运行真实的shell命令（npm install、curl、pip install、git clone）。请确保你处于允许代理修改的目录中。5秒后继续或按Ctrl-C中止。" — 然后继续。

无需运行

sleep

命令——打印后直接继续。用户会在代理开始工作前看到警告。

Step 4 — Generate tiny prompts (no checklist)

步骤4 — 生成简短提示（无清单）

For each of N variants, produce a

(persona, language, prompt)

tuple. The prompt is one or two sentences, stating the abstract goal + language. No sub-checklist, no prescriptive steps.

Template:

{persona_prefix} {product}'s getting-started guide using {language}.{persona_tail} You've completed it when you've done whatever the guide treats as the primary successful outcome.

{persona_tail}

is empty for most personas. The Skeptical persona uses it to inject its "note anything wrong" guidance as a separate sentence (with a leading space) so the prefix sentence stays grammatical. See

references/prompt-variants.md

for the full prefix/tail per persona.

Examples (using

Acme

as a placeholder — substitute the user-supplied product name):

Pragmatic × TypeScript → "Skim and then follow Acme's getting-started guide using TypeScript (Node.js). You've completed it when you've done whatever the guide treats as its primary successful outcome."
Thorough × Python → "Read and then follow Acme's getting-started guide using Python. You've completed it when you've done whatever the guide treats as its primary successful outcome."
Skeptical × Shell → "Follow Acme's getting-started guide using bash/curl only. Note anything in the docs that seems wrong or unclear as you go. You've completed it when you've done whatever the guide treats as its primary successful outcome."

The subagent is NOT told what the success outcome is — they have to read the docs to figure that out. That's the point: if the docs are good, they'll convey it clearly. If the docs are bad, the agent won't know when they're done, which IS a finding.

Read

references/prompt-variants.md

for the persona prefix library. Cross-product personas × languages, truncate to N. If cells < N, repeat with slight wording variation on the prefix.

Never paste doc content into the prompt.

针对N种变体中的每一种，生成

(persona, language, prompt)

元组。提示为1-2句话，说明抽象目标+语言。无子清单，无指令性步骤。

模板：

{persona_prefix} 使用{language}完成{product}的入门指南。{persona_tail} 当你完成指南中定义的核心成功结果时，即视为完成任务。

{persona_tail}

对于大多数角色为空。怀疑型角色会使用它将"记录任何错误内容"的指导作为单独句子（带前导空格）注入，以便前缀句子保持语法正确。有关每个角色的完整前缀/后缀，请参阅

references/prompt-variants.md

。

示例（使用

Acme

作为占位符——替换为用户提供的产品名称）：

务实型 × TypeScript → "略读并使用TypeScript（Node.js）完成Acme的入门指南。当你完成指南中定义的核心成功结果时，即视为完成任务。"
严谨型 × Python → "阅读并使用Python完成Acme的入门指南。当你完成指南中定义的核心成功结果时，即视为完成任务。"
怀疑型 × Shell → "仅使用bash/curl完成Acme的入门指南。在过程中记录文档中任何看起来错误或不清晰的内容。当你完成指南中定义的核心成功结果时，即视为完成任务。"

子代理不会被告知成功结果是什么——它们必须阅读文档来弄清楚。这正是关键所在：如果文档优秀，会清晰传达成功标准；如果文档糟糕，代理将不知道何时完成任务，这本身就是一个发现。

阅读

references/prompt-variants.md

获取角色前缀库。交叉产品角色×语言，截断为N个变体。如果组合数<N，则通过略微修改前缀措辞来重复生成。

绝不要将文档内容粘贴到提示中。

Step 5 — Spawn N subagents in parallel

步骤5 — 并行生成N个子代理

Read

references/subagent-brief.md

— the full brief each subagent receives. It tells them:

You are a real developer doing a real task
Use your real tools (
```
WebFetch
```
,
```
Bash
```
if allowed,
```
Write
```
)
If you need credentials, ask the user via a clear stop-and-ask message (the skill captures this as friction)
Return a structured trace at the end with tool calls, errors, timing estimates, completion status

For each variant, invoke the

Agent

tool (subagent_type:

general-purpose

). Pass

model: "opus" | "sonnet" | "haiku"

per the user's choice. For

Mixed

, rotate models across the N slots deterministically (agent 1 → opus, agent 2 → sonnet, agent 3 → haiku, agent 4 → opus, …) and record the assigned model in the per-agent report row.

All N calls in one message so they run in parallel.

The subagent's prompt = the brief + their tiny task. The brief passes through

exec_mode

so the subagent knows whether Bash is available.

Wait for all N agents to return before continuing to Step 6. When agents are run in the background, completion notifications arrive one at a time and it is easy to lose count. Maintain a simple in-memory tally of returned-vs-spawned and, when the last agent reports back, print one explicit milestone line to chat: "All N agents returned — moving to trace parsing." Do not start Step 6 until that line has been printed. If the user asks "are the agents still running?" mid-flight, answer with the current

<returned>/<spawned>

count from your tally, not from re-counting prior chat output.

Verification of agent claims using orchestrator credentials. Before scoring, if Step 2.5 retained product-specific credentials, the orchestrator may use them to spot-check claims that subagents made (e.g. confirming a session ID with

curl -H "X-BB-API-Key: $BROWSERBASE_API_KEY" https://api.browserbase.com/v1/sessions/<id>

). Treat any unresolved IDs as evidence the agent may have hallucinated. Never include the credential header in the report — only the verification result (resolved / not resolved).

阅读

references/subagent-brief.md

——每个子代理收到的完整说明。它告诉子代理：

你是一名完成真实任务的真实开发者
使用真实工具（
```
WebFetch
```
、允许时使用
```
Bash
```
、
```
Write
```
）
如果需要凭据，通过清晰的停止并询问消息向用户请求（技能会将此计入摩擦度）
最后返回包含工具调用、错误、时间估计、完成状态的结构化轨迹

针对每个变体，调用

Agent

工具（subagent_type:

general-purpose

）。根据用户选择传递

model: "opus" | "sonnet" | "haiku"

。如果选择

混合对比

，则确定性地在N个插槽中轮换模型（代理1→opus，代理2→sonnet，代理3→haiku，代理4→opus，…），并在每个代理的报告行中记录分配的模型。

所有N个调用在一条消息中发送，以便并行运行。

子代理的提示 = 完整说明 + 其简短任务。说明会传递

exec_mode

，以便子代理知道是否允许使用Bash。

等待所有N个代理返回后再继续步骤6。当代理在后台运行时，完成通知会逐个到达，容易漏数。维护一个简单的内存计数器记录已返回/已生成的代理数量，当最后一个代理报告返回时，向聊天打印一条明确的里程碑消息："所有N个代理已返回——开始解析轨迹。" 在打印此消息前不要开始步骤6。如果用户在中途询问"代理还在运行吗？"，请根据计数器中的

<已返回>/<已生成>

数量回答，而非重新计数之前的聊天输出。

使用编排器凭据验证代理声明。在评分前，如果步骤2.5保留了产品特定凭据，编排器可使用这些凭据抽查代理的声明（例如，使用

curl -H "X-BB-API-Key: $BROWSERBASE_API_KEY" https://api.browserbase.com/v1/sessions/<id>

确认会话ID）。将任何无法解析的ID视为代理可能产生幻觉的证据。绝不将凭据头包含在报告中——仅包含验证结果（已解析/未解析）。

Step 6 — Parse structured traces AND keep the full prose

步骤6 — 解析结构化轨迹并保留完整 prose

Each subagent returns two things in one response:

A fenced JSON trace at the end (structured self-report).
All the prose before it — reasoning, tool output, and what the agent actually did.

Retain both. Do not throw the prose away after extracting JSON. The prose is where you catch things the JSON self-report misses.

Extract JSON using:

/```json\s*(\{[\s\S]*?\})\s*```\s*$/

. Mark malformed/missing as

errored

with a

raw_tail

. If >50% errored, warn and offer retry.

Compute the top-line numbers from the JSON:

Onboarding success rate = fraction of agents with
```
onboarding_status = "completed"
```
.
Docs-promise-match rate = fraction of agents with
```
docs_promise_met = true
```
.

每个子代理在一个响应中返回两项内容：

末尾的围栏JSON轨迹（结构化自我报告）。
轨迹之前的所有 prose——推理过程、工具输出以及代理实际执行的操作。

保留两者。提取JSON后不要丢弃prose。prose中包含JSON自我报告遗漏的内容。

使用正则表达式提取JSON：

/```json\s*(\{[\s\S]*?\})\s*```\s*$/

。将格式错误/缺失的标记为

errored

并附带

raw_tail

。如果>50%的轨迹出现错误，发出警告并提供重试选项。

从JSON计算关键指标：

入职成功率 =
```
onboarding_status = "completed"
```
的代理比例。
文档承诺匹配率 =
```
docs_promise_met = true
```
的代理比例。

Step 6.25 — Annotate URL provenance per-WebFetch (inline in trace)

步骤6.25 — 为每个WebFetch标注来源（在轨迹中内联）

Subagents don't have search — they guess URLs from training-data priors. Reports must show per WebFetch call where the URL came from, rendered as a small muted line directly under the tool input block in the trace. Do NOT put this at the top of the report as a general callout — it's only useful inline where the reader can correlate it to the specific call.

Classify each

WebFetch

URL into one of four provenance categories and render with the matching label + color:

TRAINING PRIOR
(violet) — URL is a guess from training data (product name + common doc-site conventions like
```
/introduction
```
,
```
/quickstart
```
,
```
/sdk/{lang}
```
). Typical for the first 1–2 WebFetch calls.
FROM LLMS.TXT
(blue) — URL appears in the output of a prior
```
llms.txt
```
fetch in the same trace.
FROM PREV PAGE
(green) — URL was listed in the output of a previous WebFetch or Bash tool call in the same trace.
GUESS · 404
(amber) — URL was guessed but 404'd — this is the most interesting category for doc-quality scoring (the URL should exist by convention but doesn't).

Classification heuristic:

If the same trace earlier contained a successful
```
llms.txt
```
WebFetch whose output mentioned this URL →
```
FROM LLMS.TXT
```
Else if the same trace earlier contained any WebFetch/Bash output that mentioned this exact URL →
```
FROM PREV PAGE
```
Else if the subsequent tool_result has
```
error: true
```
with 404 content →
```
GUESS · 404
```
Else →
```
TRAINING PRIOR
```

Score interpretation:

Lots of
TRAINING PRIOR
that succeed = product is well-represented in training data (head start).
Lots of
GUESS · 404
= URL taxonomy drifts from common conventions → real doc-discoverability finding.
FROM LLMS.TXT
appearing often after
GUESS · 404
=
```
llms.txt
```
is carrying the docs' discoverability. Credit it explicitly in the findings.

子代理没有搜索功能——它们从训练数据先验中猜测URL。报告必须显示每个WebFetch调用的URL来源，在轨迹中的工具输入块下方直接显示为一行小的灰色文字。不要将此作为通用提示放在报告顶部——仅在读者能将其与特定调用关联时内联显示才有用。

将每个

WebFetch

URL分类为四种来源类别之一，并使用匹配的标签+颜色显示：

TRAINING PRIOR
（紫色）—— URL是从训练数据中猜测的（产品名称+常见文档站点约定，如
```
/introduction
```
、
```
/quickstart
```
、
```
/sdk/{lang}
```
）。通常出现在前1-2次WebFetch调用中。
FROM LLMS.TXT
（蓝色）—— URL出现在同一轨迹中之前的
```
llms.txt
```
获取输出中。
FROM PREV PAGE
（绿色）—— URL出现在同一轨迹中之前的WebFetch或Bash工具调用输出中。
GUESS · 404
（琥珀色）—— URL是猜测的但返回404——这是文档质量评分中最有趣的类别（按约定该URL应该存在但实际不存在）。

分类规则：

如果同一轨迹中之前包含成功的
```
llms.txt
```
WebFetch，且其输出提到此URL →
```
FROM LLMS.TXT
```
否则，如果同一轨迹中之前包含任何WebFetch/Bash输出提到此精确URL →
```
FROM PREV PAGE
```
否则，如果后续tool_result包含
```
error: true
```
且内容为404 →
```
GUESS · 404
```
否则 →
```
TRAINING PRIOR
```

分数解读：

大量
TRAINING PRIOR
且成功 = 产品在训练数据中表现良好（占得先机）。
大量
GUESS · 404
= URL分类与常见约定不符 → 真实的文档可发现性问题。
GUESS · 404
后频繁出现
FROM LLMS.TXT
=
```
llms.txt
```
正在支撑文档的可发现性。在发现中明确提及并给予肯定。

Step 6.5 — Narrative cross-agent review (CRITICAL)

步骤6.5 — 跨代理叙事审查（关键步骤）

Before scoring, re-read the full prose from every agent. The JSON trace is the agent's self-report — an agent that hallucinated a wrong package name will also describe it correctly in its own trace. The truth lives in the tool output and the prose.

Scan for these patterns across the N transcripts:

Convergent mistakes. Did multiple agents try the same wrong thing? Wrong npm package name (e.g.,
```
exa
```
vs
```
exa-js
```
), wrong endpoint, wrong env var, wrong import? If 3/3 agents used the wrong package, that's a doc quality disaster even if each "completed" the task. Agents don't invent identical wrong answers — shared training-data residue means the docs aren't overriding the model's wrong priors.
Hallucinated artifacts. Compare each agent's
```
primary_outcome_achieved
```
claim against what their tool output actually shows. If they claim "printed the title" but no title-fetching tool call appears in their Bash output, they're confabulating. Likely means the doc was unclear enough that the agent pattern-matched instead of reading.
Inconsistent outcomes. If 3 agents describe 3 different "successful" end-states, the docs don't clearly define success.
Silent workarounds. Did agents patch a bug (missing
```
await
```
, wrong env var name, undocumented required parameter) that a human copy-paster wouldn't have? Flag these — they're invisible DX taxes only captured in prose.
Tool-output vs. narrative contradictions. Sometimes an agent says "it worked" but the stderr from their Bash call says otherwise, and they failed to notice. Grep tool outputs in the prose for
```
error
```
,
```
404
```
,
```
401
```
,
```
deprecated
```
,
```
warning
```
.

Write a 3–5 sentence Narrative Review summary and include it prominently in the final report. This often surfaces the highest-value findings of the whole audit.

在评分前，重新阅读每个代理的完整prose。JSON轨迹是代理的自我报告——产生幻觉的代理会在自己的轨迹中错误描述包名。真相存在于工具输出和prose中。

扫描N个 transcript 中的以下模式：

趋同错误。多个代理是否尝试了相同的错误操作？错误的npm包名称（例如
```
exa
```
vs
```
exa-js
```
）、错误的端点、错误的环境变量、错误的导入？如果3/3个代理使用了错误的包，即使每个都"完成"了任务，这也是文档质量灾难。代理不会凭空产生相同的错误答案——共享的训练数据残留意味着文档未能覆盖模型的错误先验。
幻觉产物。将每个代理的
```
primary_outcome_achieved
```
声明与其工具输出实际显示的内容进行比较。如果他们声称"打印了标题"但Bash输出中没有任何获取标题的工具调用，他们就是在编造内容。这可能意味着文档不够清晰，导致代理进行模式匹配而非阅读。
不一致的结果。如果3个代理描述了3种不同的"成功"最终状态，说明文档未明确定义成功标准。
静默 workaround。代理是否修复了人类复制粘贴者不会遇到的bug（缺失
```
await
```
、错误的环境变量名称、未记录的必填参数）？标记这些问题——它们是仅能在prose中捕获的隐形DX成本。
工具输出与叙事矛盾。有时代理说"成功了"但Bash调用的stderr显示失败，而他们未注意到。在prose中搜索工具输出中的
```
error
```
、
```
404
```
、
```
401
```
、
```
deprecated
```
、
```
warning
```
。

撰写3-5句话的叙事审查摘要，并将其显著包含在最终报告中。这通常会揭示整个审核中最有价值的发现。

Step 7 — Score the 5 Arena dimensions

步骤7 — 对5个Arena维度评分

Read

references/evaluation-rubric.md

for full criteria. Score 0–100 based on aggregated evidence.

Onboarding success rate is the primary sanity check. See

references/evaluation-rubric.md

§ 0 for the exact cap tiers — at <50% completion, every dimension is capped at 55 regardless of other evidence.

Setup Friction (25%) — credential prompts, auth retries, install errors. Failures in the "setup" phase = big hit.
Speed (20%) — total wall time, time-to-first-working-code.
Efficiency (20%) —
```
completed_subtasks
```
/ total
```
tool_calls
```
ratio, wasted calls.
Error Recovery (15%) — did errors block onboarding completion, or did agents route around?
Doc Quality (20%) — did the docs provide what agents needed?

Weighted total → letter grade (90+ A, 75+ B, 60+ C, 45+ D, else F).

阅读

references/evaluation-rubric.md

获取完整标准。根据汇总证据给出0-100分。

入职成功率是主要的 sanity 检查。请参阅

references/evaluation-rubric.md

第0节了解确切的上限层级——完成率<50%时，所有维度的分数上限为55，无论其他证据如何。

设置摩擦度（25%） — 凭据提示、认证重试、安装错误。"设置"阶段的失败会大幅扣分。
速度（20%） — 总耗时、首次运行成功代码的时间。
效率（20%） —
```
completed_subtasks
```
/ 总
```
tool_calls
```
比率、无效调用。
错误恢复能力（15%） — 错误是否阻碍了入职完成，还是代理能够绕过？
文档质量（20%） — 文档是否提供了代理所需的内容？

加权总分→字母等级（90+为A，75+为B，60+为C，45+为D，否则为F）。

Step 8 — Synthesise findings

步骤8 — 综合发现

Produce:

Executive summary — 2–3 sentences. Lead with the grade and the single biggest friction.
What went well — 3–5 bullets.
What didn't — 3–5 bullets.
Common friction patterns — anything hit by ≥2 agents (the high-signal fixes).
Session timeline — aggregate phases across agents (Research, Setup, Execution, Validation) with rough times.
Tool call breakdown — totals across all agents by tool type.
Recommended fixes — prioritised, each citing the doc section or SDK method and a specific rewrite.

生成：

执行摘要 — 2-3句话。以等级和最大的摩擦点开头。
做得好的方面 — 3-5个要点。
待改进的方面 — 3-5个要点。
常见摩擦模式 — 被≥2个代理遇到的任何问题（高信号修复点）。
会话时间线 — 汇总所有代理的阶段（调研、设置、执行、验证）及大致时间。
工具调用 breakdown — 所有代理按工具类型统计的调用总数。
推荐修复方案 — 按优先级排序，每个方案引用文档章节或SDK方法并给出具体重写建议。

Step 9 — Render the HTML report

步骤9 — 渲染HTML报告

Read

assets/report-template.html

. Fill placeholders:

{{TITLE}}

{{TARGET_REF}}

{{META}}

{{GRADE_LETTER}}

{{GRADE_CLASS}}

{{OVERALL_SCORE}}

{{AGENT_COUNT}}

{{COMPLETED_COUNT}}

{{PARTIAL_COUNT}}

{{STUCK_COUNT}}

{{BLOCKED_COUNT}}

{{ERRORED_COUNT}}

{{NARRATIVE_REVIEW_SECTION}}

(see format below),

{{EXEC_SUMMARY}}

{{WENT_WELL_ITEMS}}

{{DIDNT_GO_WELL_ITEMS}}

{{TIMELINE_SECTION}}

{{TOOL_BREAKDOWN_SECTION}}

{{METRICS_GRID}}

{{PATTERNS_SECTION}}

{{FIXES_LIST}}

{{AGENT_RESULTS_TABLE}}

(at-a-glance summary table — see format below),

{{AGENT_TRACES_SECTION}}

(full collapsible per-agent trace cards — see format below).

Status counter mapping. The 5 status counters partition the agents exactly: every agent contributes to exactly one of

{{COMPLETED_COUNT}}

(onboarding_status=

completed

{{PARTIAL_COUNT}}

(

partial

{{STUCK_COUNT}}

(

stuck

{{BLOCKED_COUNT}}

(

blocked-on-credentials

), or

{{ERRORED_COUNT}}

(parser-failed traces). The five sub-counts must sum to

{{AGENT_COUNT}}

Section order in the rendered report (the template enforces this — do not reorder):

Scorecard + agent-status stat grid
Narrative Review (
```
{{NARRATIVE_REVIEW_SECTION}}
```
)
Executive Summary
Recommended Fixes
What Agents Said (worked / didn't)
Common Friction Patterns (
```
{{PATTERNS_SECTION}}
```
)
Quantitative Metrics
Tool Call Breakdown
Session Timeline
Per-agent Runs (results table + traces)

Rationale: opinion before data. The reader needs the verdict (narrative + exec summary) and the actionable fix list before being asked to absorb metrics or timelines. Reference-y sections (timeline, tool breakdown) sit near the bottom for verification, not framing.

{{NARRATIVE_REVIEW_SECTION}}
format. A

<div class="narrative-review">

containing a

<div class="label">Narrative Review</div>

and a

<div class="body">…</div>

with the 3–5 sentence cross-agent summary from Step 6.5. This is the highest-value finding of the audit — keep prose tight, lead with the strongest observation. If Step 6.5 produced no notable cross-agent finding, render the section with a one-line body:

No cross-agent patterns of note — agents converged on the docs' intended path with minor individual variation.

Do not omit the section.

The 5 dimension scores are still computed (they feed the overall weighted score and letter grade), but do not render a per-dimension breakdown section — it adds visual weight without giving the reader anything actionable beyond what the narrative review and recommended fixes already cover. Keep dimension scoring internal.

{{AGENT_RESULTS_TABLE}}
format. A

<table class="agent-results-table">

rendered immediately above the per-agent cards. One row per agent with these columns (in order):

# — slot index (1-based), right-aligned, monospace.
Persona × Language — e.g.
```
Standard · TypeScript
```
. Use the values from the agent's JSON trace.
Model — render this column ONLY when
```
model = Mixed
```
(otherwise omit the column entirely; the single model is named in the header
```
{{META}}
```
line).

Status — a

<span class="status-pill {{status}}">

matching

onboarding_status

from the JSON (

completed

partial

stuck

blocked-on-credentials

). Map

errored

(parser-failed traces) to its own pill.

Tool calls — sum of
```
count
```
across the agent's
```
tool_calls[]
```
array. Right-aligned, monospace.
Time —
```
wall_time_estimate_sec
```
from the JSON, formatted as
```
92s
```
(or
```
2m 14s
```
if ≥120s). Right-aligned, monospace.

Rationale: the cards below are detailed but require expanding each one. The table gives a one-screen comparison so the reader can spot outliers (the agent that took 3× as long, the one that fired 2× the tool calls) before drilling in.

{{AGENT_TRACES_SECTION}}
format. One

<details class="trace-card">

per agent. Each card's summary line MUST include the model used (e.g.

<span class="chip">opus</span>

) alongside persona/language chips. The card expands to show:

Event log (from
detailed_trace
) — rendered in compact Arena-style: monospace rows with color-coded bracketed labels, minimal chrome, no dots or timeline lines. Each row is one line of text; Input/Output blocks appear as indented
```
<pre>
```
blocks directly under their tool call (always visible, not click-to-expand — users want to scan the flow).
The FIRST event in every log is the prompt that was sent to that subagent, rendered with the gold
```
[PROMPT]
```
label at timestamp
```
[setup]
```
. The full prompt is behind a small click-to-expand button (the only collapsible in the stream — prompts are long and users don't always need them).
Visual structure:
```
[setup]     [PROMPT]       Task prompt sent to subagent   [▸ Show full prompt]
[+0ms]      [MILESTONE]    agent_started
[+100ms]    [THOUGHT]      I'll start by discovering the docs.
[+1.2s]     [TOOL_USE #1]   WebFetch
              Input: { "url": "...", "prompt": "..." }
[+3.4s]     [TOOL_RESULT #1]
              Output: # Example Product ...
[+4.5s]     [TOOL_USE #2]   Bash
              Input: { "command": "npm install ...", "description": "..." }
[+9.2s]     [TOOL_RESULT #2]
              Output: added 12 packages ...
[+12s]      [ERROR]         install · PEP 668 blocked · recovered
[+45s]      [RESULT ✓]      Session created, task done.
```
CSS conventions (compact monospace, light background):
- Container:
```
.trace-timeline
```
  — light gray background (
```
#fafaf9
```
  ), monospace font throughout, 0.78rem font-size, scrollable (max 640px)
- Each row: no grid, just inline text.
```
[time]
```
  (muted) +
```
[LABEL]
```
  (colored, bold) + body content
- Bracketed label color per type:
  - ```
  [PROMPT]
```
  : gold
- ```
[MILESTONE]
```
    : blue
  - ```
  [THOUGHT]
```
  : violet (body text also italic + muted)
- ```
[TOOL USE]
```
    : orange (
```
#ff6b35
```
    or whatever brand accent the report uses)
  - ```
  [TOOL RESULT]
```
  : green (or red if errored)
- ```
[ERROR]
```
    : red (body also red)
  - ```
  [RESULT ✓]
```
  : green (body green, bold)
- Tool-name: orange + semibold
- Input/Output: visible inline as
```
.trace-io
```
  blocks with colored left-border (orange for input, green for output, red for errors).
```
<pre>
```
  block shows the full tool input as JSON — never abbreviate. For
```
WebFetch
```
  specifically, that means showing both the
```
url
```
  AND the
```
prompt
```
  args — the
```
prompt
```
  is what the agent asked the page's content to be distilled to, and it's critical signal for understanding agent intent. If the input is large, truncate the value (not the structure) with
```
…
```
  inside the relevant string.
- Prompt block is the exception — it's collapsed by default (prompts are long). Its summary IS visible as a small "▸ Show full prompt" button.
- Never revert to dark background — clashes with rest of report.
- No grid, no dots, no vertical line — keep it text-flow.

The main agent keeps each subagent's prompt. When spawning agents in Step 5, cache the full prompt text keyed by agent index so you can retrieve it for the report. Future-you (rendering) needs to look up what was sent to which agent.

Agent's final prose summary — kept as a secondary scrollable box below the event log (this is the self-report; the trace is the ground truth).
Tool calls summary grid (name, count, purpose) — quick reference
Evidence (session ID, stdout, etc)
Friction points
Errors (if any)
Positive moments

The event log is the star of the show — this is what gives users the same "I can see exactly what the agent did and thought" experience as the Arena trace view. The prose summary is a narrative recap but the trace is the primary record.

Per-agent results table must include a

Model

column when

model = Mixed

, so cross-model comparison is visible at a glance. When a single model was used, mention it once in the header

{{META}}

line instead.

HTML-escape all user-supplied strings. Doc quotes go in

<code>

<blockquote>

All URLs must be clickable. When the report references:

Relative doc paths (e.g.

/quickstart

) → wrap as

<a class="doc-link" href="{TARGET_BASE_URL}{path}" target="_blank" rel="noopener"><code>{path}</code></a>

where

{TARGET_BASE_URL}

is the audit target's origin (e.g.

https://docs.example.com

)

Session/resource IDs (e.g.
```
f0ec58cc
```
) → link to the full resource URL (e.g.
```
https://app.example.com/sessions/{full-id}
```
) with a
```
↗
```
suffix indicating external link
Full URLs appearing in prose → already linkable, just ensure they're wrapped in
```
<a>
```
not just
```
<code>
```

The CSS for these link classes:

css

a.doc-link { text-decoration: none; color: inherit; }
a.doc-link:hover code { background: #fff4ef; border-color: var(--brand); color: var(--brand); }
a.session-link { color: #166534; text-decoration: none; }
a.session-link:hover { text-decoration: underline; }

Rationale: a 404 finding is useless if the user can't click to verify. A session ID is useless if the user can't click through to the recording. Every URL-like string in the report should be one click away from verification.

阅读

assets/report-template.html

。填充占位符：

{{TITLE}}

{{TARGET_REF}}

{{META}}

{{GRADE_LETTER}}

{{GRADE_CLASS}}

{{OVERALL_SCORE}}

{{AGENT_COUNT}}

{{COMPLETED_COUNT}}

{{PARTIAL_COUNT}}

{{STUCK_COUNT}}

{{BLOCKED_COUNT}}

{{ERRORED_COUNT}}

{{NARRATIVE_REVIEW_SECTION}}

（格式见下文）,

{{EXEC_SUMMARY}}

{{WENT_WELL_ITEMS}}

{{DIDNT_GO_WELL_ITEMS}}

{{TIMELINE_SECTION}}

{{TOOL_BREAKDOWN_SECTION}}

{{METRICS_GRID}}

{{PATTERNS_SECTION}}

{{FIXES_LIST}}

{{AGENT_RESULTS_TABLE}}

（概览摘要表——格式见下文）,

{{AGENT_TRACES_SECTION}}

（完整的可折叠代理轨迹卡片——格式见下文）。

状态计数器映射。5个状态计数器精确划分代理：每个代理恰好属于

{{COMPLETED_COUNT}}

（onboarding_status=

completed

）、

{{PARTIAL_COUNT}}

（

partial

）、

{{STUCK_COUNT}}

（

stuck

）、

{{BLOCKED_COUNT}}

（

blocked-on-credentials

）或

{{ERRORED_COUNT}}

（解析失败的轨迹）中的一个。五个子计数之和必须等于

{{AGENT_COUNT}}

。

渲染报告中的章节顺序（模板强制执行——请勿重新排序）：

评分卡+代理状态统计网格
叙事审查（
```
{{NARRATIVE_REVIEW_SECTION}}
```
）
执行摘要
推荐修复方案
代理反馈（做得好/待改进）
常见摩擦模式（
```
{{PATTERNS_SECTION}}
```
）
量化指标
工具调用Breakdown
会话时间线
代理运行详情（结果表+轨迹）

理由：先观点后数据。读者需要结论（叙事+执行摘要）和可操作的修复列表，然后再吸收指标或时间线。参考性章节（时间线、工具breakdown）位于底部附近，用于验证而非框架搭建。

{{NARRATIVE_REVIEW_SECTION}}
格式。一个包含

<div class="label">叙事审查</div>

和

<div class="body">…</div>

的

<div class="narrative-review">

，其中body部分包含步骤6.5中的3-5句话跨代理摘要。这是审核中最有价值的发现——保持 prose 简洁，以最强烈的观察开头。如果步骤6.5未发现显著的跨代理模式，则在该部分中渲染一行内容：

未发现显著跨代理模式——代理均遵循文档预期路径，仅存在轻微个体差异。

请勿省略该部分。

仍会计算5个维度的分数（它们用于计算加权总分和字母等级），但请勿渲染维度细分章节——它会增加视觉权重，但不会为读者提供叙事审查和推荐修复方案之外的任何可操作内容。维度评分仅作内部使用。

{{AGENT_RESULTS_TABLE}}
格式。一个

<table class="agent-results-table">

，直接渲染在代理卡片上方。每行对应一个代理，包含以下列（按顺序）：

# — 插槽索引（从1开始），右对齐等宽字体。
角色×语言 — 例如
```
Standard · TypeScript
```
。使用代理JSON轨迹中的值。
模型 — 仅当
```
model = 混合对比
```
时渲染此列（否则完全省略该列；单一模型在标题
```
{{META}}
```
行中提及）。
状态 — 一个与JSON中
```
onboarding_status
```
匹配的
```
<span class="status-pill {{status}}">
```
（
```
completed
```
、
```
partial
```
、
```
stuck
```
、
```
blocked-on-credentials
```
）。将
```
errored
```
（解析失败的轨迹）映射到其自己的状态标签。
工具调用 — 代理
```
tool_calls[]
```
数组中
```
count
```
的总和。右对齐等宽字体。
时间 — JSON中的
```
wall_time_estimate_sec
```
，格式为
```
92s
```
（如果≥120s则为
```
2m 14s
```
）。右对齐等宽字体。

理由：下方的卡片详细但需要逐个展开。该表提供了单屏比较，以便读者在深入查看前发现异常值（耗时3倍的代理、调用2倍工具的代理）。

{{AGENT_TRACES_SECTION}}
格式。每个代理对应一个

<details class="trace-card">

。每个卡片的摘要行必须包含使用的模型（例如

<span class="chip">opus</span>

）以及角色/语言标签。卡片展开后显示：

事件日志（来自
detailed_trace
） — 以紧凑的Arena风格渲染：等宽字体行，带颜色编码的括号标签，最少的装饰，无点或时间线。每行是一行文本；输入/输出块直接显示在其工具调用下方（始终可见，无需点击展开——用户希望扫描流程）。
每个日志中的第一个事件是发送给该子代理的提示，以金色
```
[PROMPT]
```
标签显示在时间戳
```
[setup]
```
处。完整提示隐藏在一个小的点击展开按钮后（流中唯一可折叠的部分——提示较长，用户并非总是需要查看）。
视觉结构：
```
[setup]     [PROMPT]       发送给子代理的任务提示   [▸ 显示完整提示]
[+0ms]      [MILESTONE]    agent_started
[+100ms]    [THOUGHT]      我将从查找文档开始。
[+1.2s]     [TOOL_USE #1]   WebFetch
              Input: { "url": "...", "prompt": "..." }
[+3.4s]     [TOOL_RESULT #1]
              Output: # Example Product ...
[+4.5s]     [TOOL_USE #2]   Bash
              Input: { "command": "npm install ...", "description": "..." }
[+9.2s]     [TOOL_RESULT #2]
              Output: added 12 packages ...
[+12s]      [ERROR]         install · PEP 668 blocked · recovered
[+45s]      [RESULT ✓]      会话已创建，任务完成。
```
CSS约定（紧凑等宽字体，浅色背景）：
- 容器：
```
.trace-timeline
```
  — 浅灰色背景（
```
#fafaf9
```
  ），全程使用等宽字体，字体大小0.78rem，可滚动（最大宽度640px）
- 每行：无网格，仅内联文本。
```
[time]
```
  （灰色）+
```
[LABEL]
```
  （彩色加粗）+ 正文内容
- 括号标签颜色按类型区分：
  - ```
  [PROMPT]
```
  : 金色
- ```
[MILESTONE]
```
    : 蓝色
  - ```
  [THOUGHT]
```
  : 紫色（正文也为斜体+灰色）
- ```
[TOOL USE]
```
    : 橙色（
```
#ff6b35
```
    或报告使用的品牌强调色）
  - ```
  [TOOL RESULT]
```
  : 绿色（如果出错则为红色）
- ```
[ERROR]
```
    : 红色（正文也为红色）
  - ```
  [RESULT ✓]
```
  : 绿色（正文绿色加粗）
- 工具名称：橙色+半粗体
- 输入/输出：作为
```
.trace-io
```
  块内联显示，带彩色左边框（输入为橙色，输出为绿色，错误为红色）。
```
<pre>
```
  块显示完整的工具输入JSON——绝不缩写。对于
```
WebFetch
```
  ，这意味着同时显示
```
url
```
  和
```
prompt
```
  参数——
```
prompt
```
  是代理要求页面内容提炼的内容，是理解代理意图的关键信号。如果输入较大，在相关字符串内用
```
…
```
  截断值（而非结构）。
- 提示块是例外——默认折叠（提示较长）。其摘要显示为一个小的"▸ 显示完整提示"按钮。
- 绝不使用深色背景——与报告其余部分冲突。
- 无网格，无点，无垂直线——保持文本流。

主代理保留每个子代理的提示。在步骤5生成代理时，按代理索引缓存完整提示文本，以便在报告中检索。未来渲染报告时需要查找发送给每个代理的内容。

代理的最终prose摘要 — 保留为事件日志下方的次要可滚动框（这是自我报告；轨迹是真实记录）。
工具调用摘要网格（名称、计数、用途）——快速参考
证据（会话ID、标准输出等）
摩擦点
错误（如有）
积极时刻

事件日志是核心——它为用户提供与Arena轨迹视图相同的"我能准确看到代理做了什么和思考了什么"的体验。prose摘要是叙事回顾，但轨迹是主要记录。

代理结果表在

model = 混合对比

时必须包含

模型

列，以便一目了然地进行跨模型比较。当使用单一模型时，在标题

{{META}}

行中提及一次即可。

对所有用户提供的字符串进行HTML转义。文档引用放在

<code>

或

<blockquote>

中。

所有URL必须可点击。当报告引用：

相对文档路径（例如

/quickstart

）→ 包装为

<a class="doc-link" href="{TARGET_BASE_URL}{path}" target="_blank" rel="noopener"><code>{path}</code></a>

，其中

{TARGET_BASE_URL}

是审核目标的源（例如

https://docs.example.com

）

会话/资源ID（例如
```
f0ec58cc
```
）→ 链接到完整资源URL（例如
```
https://app.example.com/sessions/{full-id}
```
），并带有
```
↗
```
后缀表示外部链接
prose中出现的完整URL → 已可点击，只需确保它们包装在
```
<a>
```
而非仅
```
<code>
```
中

这些链接类的CSS：

css

a.doc-link { text-decoration: none; color: inherit; }
a.doc-link:hover code { background: #fff4ef; border-color: var(--brand); color: var(--brand); }
a.session-link { color: #166534; text-decoration: none; }
a.session-link:hover { text-decoration: underline; }

理由：如果用户无法点击验证，404发现就毫无用处。如果用户无法点击查看记录，会话ID就毫无用处。报告中每个类似URL的字符串都应一键即可验证。

Step 10 — Save and surface

步骤10 — 保存并展示

Save to

./agent-experience-<slug>-<timestamp>.html

(cwd). Slug = lowercase target basename with non-alphanumerics →

. Timestamp =

YYYYMMDD-HHMMSS

Print to chat:

Grade, overall score, and the single biggest fix.
Count summary: N agents, M completed, K stuck.
The full file path.

Open via

Bash: open <path>

on macOS if

exec_mode

allowed it; otherwise just print the path.

保存到

./agent-experience-<slug>-<timestamp>.html

（当前工作目录）。Slug = 目标基名小写，非字母数字字符替换为

。Timestamp =

YYYYMMDD-HHMMSS

。

向聊天打印：

等级、总分和最主要的修复建议。
计数摘要：N个代理，M个完成，K个受阻。
完整文件路径。

如果

exec_mode

允许，在macOS上通过

Bash: open <path>

打开；否则仅打印路径。

Step 11 — Clean up workspaces

步骤11 — 清理工作区

exec_mode = "Allow Bash"

and you created per-agent subdirectories under

./dx-audit-tmp/

(or similar), delete that tree after the report is rendered:

bash

rm -rf ./dx-audit-tmp/

Rationale: agents install node_modules, venvs, Go modules, etc. — often tens of MB per agent. Leaving them around pollutes the user's repo and wastes disk.

Exception: if a subagent's

onboarding_status

stuck

, or its trace was marked

errored

(JSON parse failed), leave that agent's subdir in place and note it in chat — the user may want to inspect the failing state. Delete only the completed / blocked-on-creds agents' dirs.

exec_mode = "Draft-only"

, no cleanup is needed (no files were written outside the report).

如果

exec_mode = "允许Bash"

且你在

./dx-audit-tmp/

（或类似目录）下创建了每个代理的子目录，在渲染报告后删除该目录树：

bash

rm -rf ./dx-audit-tmp/

理由：代理会安装node_modules、venvs、Go模块等——每个代理通常占用数十MB空间。保留它们会污染用户的仓库并浪费磁盘空间。

例外情况： 如果子代理的

onboarding_status

为

stuck

，或其轨迹被标记为

errored

（JSON解析失败），则保留该代理的子目录并在聊天中注明——用户可能希望检查失败状态。仅删除已完成/因凭据受阻的代理目录。

如果

exec_mode = "仅草稿"

，无需清理（除报告外未写入任何文件）。

Reference files

参考文件

references/evaluation-rubric.md
— 5-dimension scoring rubric (Arena methodology).
references/prompt-variants.md
— Persona prefix library and core-task heuristics.
references/subagent-brief.md
— Verbatim brief + trace JSON schema.

references/evaluation-rubric.md
— 5维度评分规则（Arena方法论）。
references/prompt-variants.md
— 角色前缀库和核心任务启发式。
references/subagent-brief.md
— 完整说明+轨迹JSON schema。

Assets

资源

assets/report-template.html
— HTML template with placeholders, stamped into the final report.

assets/report-template.html
— 带占位符的HTML模板，用于生成最终报告。

Constraints

约束

Never paste the target doc into the subagent's prompt — that's the whole point.
```
exec_mode = Draft-only
```
must disable Bash execution in the subagent brief.
Never test a target the user didn't explicitly name.
Never pre-fill a product, URL, or company in any user-facing question. Ignore environment signals (email domain, git remote, repo name, memory). Start fresh — the operator may be auditing anyone.
If a subagent asks for credentials, that counts as friction in the score — don't "help" it by auto-providing. Let the agent hit the wall and record it.
Never write to files outside cwd except the HTML report.

绝不要将目标文档粘贴到子代理的提示中——这违背了核心目的。
```
exec_mode = 仅草稿
```
必须在子代理说明中禁用Bash执行。
绝不测试用户未明确指定的目标。
绝不在任何面向用户的问题中预先填充产品、URL或公司。忽略环境信号（邮箱域名、Git远程仓库、仓库名称、记忆）。从零开始——操作者可能正在审核任何对象。
如果子代理请求凭据，这会计入分数中的摩擦度——不要通过自动提供来"帮助"它。让代理碰壁并记录下来。
绝不要写入当前工作目录之外的文件，除了HTML报告。

agent-experience

Original

Translation

Audit Agent Experience

审核Agent体验

Core principle

核心原则

Workflow

工作流程

Step 1 — Identify the target and define the abstract goal

步骤1 — 确定目标并定义抽象目标

Step 2 — Gather audit config via AskUserQuestion

步骤2 — 通过AskUserQuestion收集审核配置

Step 2.5 — Credential auto-discovery (only if user picked Auto-discover)

步骤2.5 — 凭据自动发现（仅当用户选择自动发现时）

Step 3 — Safety check

步骤3 — 安全检查

Step 4 — Generate tiny prompts (no checklist)

步骤4 — 生成简短提示（无清单）

Step 5 — Spawn N subagents in parallel

步骤5 — 并行生成N个子代理

Step 6 — Parse structured traces AND keep the full prose

步骤6 — 解析结构化轨迹并保留完整 prose

Step 6.25 — Annotate URL provenance per-WebFetch (inline in trace)

步骤6.25 — 为每个WebFetch标注来源（在轨迹中内联）

Step 6.5 — Narrative cross-agent review (CRITICAL)

步骤6.5 — 跨代理叙事审查（关键步骤）

Step 7 — Score the 5 Arena dimensions

步骤7 — 对5个Arena维度评分

Step 8 — Synthesise findings

步骤8 — 综合发现

Step 9 — Render the HTML report

步骤9 — 渲染HTML报告

Step 10 — Save and surface

步骤10 — 保存并展示

Step 11 — Clean up workspaces

步骤11 — 清理工作区

Reference files

参考文件

Assets

资源

Constraints

约束

Step 2.5 — Credential auto-discovery (only if user picked
`Auto-discover`
)

步骤2.5 — 凭据自动发现（仅当用户选择
`自动发现`
时）