terminal-bench-loop

Terminal-Bench Loop

A repeatable operating skill for driving one Terminal-Bench problem to a passing smoke through Paperclip, with explicit issue topology, bounded runs, board-gated product fixes, and worktree continuity.

This skill is operational + diagnostic, not engineering. It coordinates issues, artifacts, and approvals around a Terminal-Bench loop. It does not authorize code changes — every accepted product fix lands as a separate implementation child issue after a board confirmation.

Canonical execution model: read `doc/execution-semantics.md` before starting a loop or moving any loop issue. Every loop issue must rest in a state the doc allows: terminal (`done`, `cancelled`), explicitly live (active run / queued wake), explicitly waiting (`in_review` with participant/interaction/approval), or explicit recovery/blocker (`blocked` with `blockedByIssueIds` and a named owner).

When to use

Trigger on an assignment whose title or body matches any of:

- "run Terminal-Bench in a loop", "loop <task-name> through Paperclip"
- "drive Terminal-Bench fix-git", "iterate on Terminal-Bench until it passes"
- "Terminal-Bench smoke loop", "bench loop", "smoke loop on <task-name>"
- An attached link to a Terminal-Bench loop parent issue, plus a request to do another iteration

Also use when the user hands you an existing top-level loop issue and asks for the next iteration, diagnosis, or rerun.

When NOT to use

- The assignment is to build or change `paperclip-bench` itself (Harbor adapter, wrapper, telemetry). Use normal engineering flow on that repo.
- The assignment is to submit a benchmark result for ranking. This skill produces smoke/non-comparable runs by design; escalate full-suite or comparable runs to BenchmarkQualityManager.
- The assignment is a normal Paperclip product bug not surfaced by a Terminal-Bench loop. Use normal investigation.
- You have not been granted permission to install or assign company skills, and the asker actually wants library mutation. Hand that step to an authorized skill-library owner.

Three invariants you must preserve

Every loop iteration and every proposed product fix must hold these three invariants together. They come from `/diagnose-why-work-stopped`, and the user has restated them across the liveness work:

- Productive work continues. Each loop issue must always have a clear next action owner: agent, board, user, or named blocker. No silent `in_review` with nothing waiting on it.
- Only real blockers stop work. Stops happen when something genuinely cannot proceed (board confirmation, QA, missing credentials, exhausted budget). Pseudo-stops must be detected and routed.
- No infinite loops. Iteration count, wall-clock budget, and a board gate before product fixes are applied keep the loop bounded.

If a proposed iteration violates any of the three, drop it or rework it. State explicitly in the loop issue how each invariant is held this iteration.

Inputs

Collect these on the top-level loop issue before iteration 1. Any input that cannot be supplied is a blocker — name the unblock owner and stop.

- Source issue. The Paperclip issue that asked for the loop. The loop parent links back to it.
- Terminal-Bench task name. Single-task identifier (e.g. `terminal-bench/fix-git`). Multi-task suites are out of scope for this skill.
- Iteration budget. Maximum number of iterations before the loop must stop without further fixes (typical: 3–5). Also record a per-iteration wall-clock cap.
- Paperclip App worktree issue. The implementation-side issue under the Paperclip App project whose execution workspace owns the isolated worktree. First iteration creates it; later iterations reuse it via `inheritExecutionWorkspaceFromIssueId` or equivalent.
- Benchmark command. The exact `paperclip-bench` invocation, including the `PAPERCLIPAI_CMD` (or equivalent) binding pinned to the Paperclip App worktree under test. Record verbatim on the loop issue.
- Dispatch runner config. The exact Harbor/Paperclip runner dispatch config required for the smoke to actually start a Paperclip heartbeat. For the current Harbor wrapper, record the `PAPERCLIP_HARBOR_RUNNER_CONFIG` JSON (or equivalent config file) verbatim enough to preserve: `assignee`, `heartbeat_strategy`, `agent_adapter` / `agent_adapters`, `reuse_host_home` when local credentials are intentionally needed, and the stop budget. A bare Harbor command that creates `BEN-1` as unassigned `todo` with zero heartbeat-enabled agents is a harness/setup failure, not a valid product diagnosis.
- Latest artifact root. Filesystem or storage path under which `paperclip-bench` writes run artifacts (manifest, `results.jsonl`, Harbor raw job folders, redacted telemetry). Each iteration appends; nothing is overwritten.
- Approval policy. Who must accept a proposed product fix before implementation (default: board via `request_confirmation`; CTO if delegated; never the loop driver alone).

Record each input on the top-level loop issue (description or a dedicated `inputs` document). If any input changes mid-loop, note the change and the iteration it took effect.
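
As a minimal sketch, the inputs listed above could be held in one record and checked for gaps before iteration 1. The `LoopInputs` class, its field names, and the `missing_inputs` helper are hypothetical illustrations; they are not part of Paperclip or `paperclip-bench`.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LoopInputs:
    """Hypothetical record of the loop inputs; field names are illustrative only."""
    source_issue_id: Optional[str] = None
    task_name: Optional[str] = None
    iteration_budget: Optional[int] = None
    per_iteration_cap_seconds: Optional[int] = None
    worktree_issue_id: Optional[str] = None
    benchmark_command: Optional[str] = None
    dispatch_runner_config: Optional[str] = None
    artifact_root: Optional[str] = None
    approval_policy: Optional[str] = None


def missing_inputs(inputs: LoopInputs) -> list[str]:
    """Return the names of inputs that are unset or empty; each one is a blocker."""
    return [f.name for f in fields(inputs) if getattr(inputs, f.name) in (None, "")]
```

A non-empty result means the loop should not start: each missing name needs an unblock owner recorded on the loop issue.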

Issue topology

The loop must be representable as a tree, not as prose in comments:

- Top-level loop issue. Long-lived. Holds inputs, iteration counter, current state, links to every iteration child, and the product-rule history. Rests in `in_progress` while an iteration is running; `in_review` only when a typed waiter sits directly on the loop parent (execution-policy participant, `request_confirmation` / `ask_user_questions` / `suggest_tasks` interaction, approval, or named human owner); `blocked` with `blockedByIssueIds` while a child issue is the gating work (iteration child holding the fix-proposal `request_confirmation`, or implementation, QA, or CTO review children); `done` on pass; or `cancelled` on board-rejection / budget exhaustion.
- Iteration child issues. One per iteration. Each carries: a bounded run issue (smoke), a diagnosis issue (applies `/diagnose-why-work-stopped`), a fix-proposal document with a `request_confirmation` interaction, and, only after acceptance, implementation, QA, CTO review, and rerun children. Iteration children are blocked by their predecessors so the executor wakes them in order.
- Paperclip App implementation issue. The first iteration creates a fresh Paperclip App child whose project policy spawns an isolated worktree. Every later iteration's implementation/rerun child references that same execution workspace via `inheritExecutionWorkspaceFromIssueId` so the same worktree is amended and tested.

Wire dependencies with `blockedByIssueIds`, never with prose like "blocked by X". When a dependent child is `done`, the executor auto-wakes the next.
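
The ordering rule above can be sketched as a small helper that computes which `blockedByIssueIds` each child in a chain should carry. The function name and the issue ids are hypothetical, and the call that actually writes the field to Paperclip is not shown because its shape is not described here.

```python
def wire_in_order(issue_ids: list[str]) -> dict[str, list[str]]:
    """Map each issue id to the blockedByIssueIds it should carry.

    The first issue has no blocker; every later issue is blocked by the one
    immediately before it, so the chain can only wake strictly in order.
    """
    wiring: dict[str, list[str]] = {}
    previous = None
    for issue_id in issue_ids:
        wiring[issue_id] = [previous] if previous is not None else []
        previous = issue_id
    return wiring
```

For example, passing the implementation, QA, CTO review, and rerun children in that order yields a chain in which the rerun cannot start until CTO review is done.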

Procedure

0. Read the current execution contract

Before opening or advancing a loop, read `doc/execution-semantics.md`. Use that document's terms intact when classifying loop-issue state: live path / waiting path / recovery path; post-run disposition; bounded continuation; productivity review; pause-hold; watchdog. Do not invent a new state.

1. Open or reuse the top-level loop issue

- If an existing loop issue is supplied, read it: inputs, iteration counter, last iteration's stop reason, current Paperclip App worktree pointer, latest benchmark command.
- If no loop issue exists, create one under the Paperclip App project (or the project the source issue points at). Title: `Terminal-Bench loop: <task-name>`. Description captures the inputs above, the iteration budget, and a link to the source issue.
- Verify the worktree pointer still resolves. If the recorded execution workspace was discarded (worktree pruned, project changed), the loop is blocked: name the unblock owner (CodexCoder or the Paperclip App owner) and stop.

2. Open the iteration child

- Increment the iteration counter on the loop issue.
- Create an iteration child titled `Iteration N: <task-name>`. Its description repeats the inputs and references the loop parent. Block it on the prior iteration's terminal child (if any) so the executor cannot start two iterations in parallel.
- If the iteration counter would exceed the budget, do not create the child. Move the loop issue to `cancelled` (budget exhausted) or `in_review` if the user must decide whether to extend the budget.
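
The budget gate in this step can be sketched as a single decision. The function and its return labels are hypothetical; only the `cancelled` and `in_review` statuses come from the text above.

```python
def next_iteration_action(counter: int, budget: int, user_may_extend: bool) -> str:
    """Decide what to do before opening iteration `counter + 1`.

    `counter` is the number of iterations already opened on the loop issue.
    """
    if counter + 1 <= budget:
        return "create_iteration_child"
    # Budget would be exceeded: never create the child silently.
    return "in_review" if user_may_extend else "cancelled"
```

If the result is `in_review`, the loop issue still needs a typed waiter (for example the named user who decides on the extension), otherwise the state is not a valid resting state.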

3. Run the bounded smoke

- The benchmark command must use the Paperclip App worktree under test. Set `PAPERCLIPAI_CMD` (or the equivalent command binding) to the CLI entrypoint inside that worktree. Never let the smoke run against the operator's current Paperclip checkout.
- The same command block must include the runner dispatch config that makes the benchmark issue actionable. For the current Harbor wrapper, export `PAPERCLIP_HARBOR_RUNNER_CONFIG` with the intended assignee, heartbeat strategy, agent adapter, credential/home mode, and stop budget. Do not treat a bare `uvx harbor run ...` as the canonical smoke if it omits the dispatch config; record that as a harness/setup miss and rerun with the recorded config.
- Bound the run by wall-clock and by Paperclip's run-budget controls. If the smoke would exceed the per-iteration cap, kill it and record the truncation reason.
- Capture, in the iteration child or a dedicated `run` document:
  - Paperclip run id and heartbeat run ids
  - benchmark run id, manifest, `results.jsonl` row, Harbor raw job folder
  - dispatch config used (`PAPERCLIP_HARBOR_RUNNER_CONFIG` or equivalent), including assignee and adapter type
  - the exact stop reason reported by the harness (pass, harness fail, verifier fail, timeout, agent gave up, infrastructure error)
  - heartbeat-enabled and heartbeat-observed agent counts when Paperclip telemetry exports them
  - failure taxonomy bucket (task/model, Paperclip product, harness/setup, verifier/infrastructure, security, unclear)
  - artifact paths under the latest artifact root
- Label the iteration as smoke / non-comparable. Comparable runs are out of scope for this skill.
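
A rough sketch of assembling the environment for the smoke follows, so that neither binding is dropped. The two environment variable names and the config keys (`assignee`, `heartbeat_strategy`, `agent_adapter` / `agent_adapters`) come from this document; the helper itself, the choice of which keys to treat as mandatory, and the config values are assumptions, not the wrapper's actual schema.

```python
import json
import os

# Assumption: these keys are treated as mandatory for a dispatchable smoke.
REQUIRED_DISPATCH_KEYS = ("assignee", "heartbeat_strategy")


def build_smoke_env(worktree_cli: str, runner_config: dict, base_env=None) -> dict:
    """Return an environment for the smoke with both bindings set.

    Raises ValueError if the dispatch config lacks an assignee, a heartbeat
    strategy, or an agent adapter, since such a run would be a harness/setup miss.
    """
    missing = [k for k in REQUIRED_DISPATCH_KEYS if not runner_config.get(k)]
    if not (runner_config.get("agent_adapter") or runner_config.get("agent_adapters")):
        missing.append("agent_adapter/agent_adapters")
    if missing:
        raise ValueError(f"dispatch config incomplete: {missing}")
    env = dict(os.environ if base_env is None else base_env)
    env["PAPERCLIPAI_CMD"] = worktree_cli
    env["PAPERCLIP_HARBOR_RUNNER_CONFIG"] = json.dumps(runner_config)
    return env
```

The resulting mapping could be passed as `env=` to `subprocess.run` together with `timeout=` set to the per-iteration cap; `subprocess.run` raises `subprocess.TimeoutExpired` when the cap is hit, which is the point at which the truncation reason should be recorded. The benchmark command itself is not reproduced here because it must be copied verbatim from the loop issue.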

4. Diagnose the exact stop point

Apply the `/diagnose-why-work-stopped` pattern to the iteration's run, scoped to this loop only; do not pull in unrelated forensic boilerplate. Specifically:

- Walk the Paperclip issue tree the smoke produced under the Paperclip App worktree, node by node, and find the exact `(issue, status)` combination that stopped progress. Quote evidence: run ids, comment timestamps, status transitions.
- Classify every non-progressing issue in that subtree as truly needs human/board intervention, agent-actionable but not currently routed, or already covered.
- State whether the failure is task/model, Paperclip product, harness/setup, verifier/infrastructure, security, or unclear. Be explicit when evidence is inferred (e.g. cross-company API boundary blocks direct reads).
- If the failure is a Paperclip product gap, frame the fix as a general product rule stated as a contract, and check it against the three invariants above. If the rule would have blocked a recent productive run, narrow it.

Record the diagnosis on the iteration child as a `diagnosis` document. Do not propose code yet.

5. Decide the next move

Based on the diagnosis, the iteration ends in exactly one of these terminal-for-iteration states:

- Pass. Smoke verifier reports pass. Move the iteration child and the loop parent toward QA/CTO review (Step 8).
- Product fix proposed. A Paperclip product gap was identified. Write the fix proposal as a `plan` document on the iteration child, then go to Step 6.
- Non-product failure with retry. Failure is harness/setup/infrastructure or model flakiness, the iteration budget is not exhausted, and the loop driver believes a rerun without code changes has signal (e.g. transient infra). Record the rationale on the iteration child and go to Step 7 with no implementation step.
- Real blocker. Named external blocker (credentials, quota, third-party outage, security review). Move the loop issue to `blocked`, set `blockedByIssueIds` to the blocker issue (creating one if needed), and name the unblock owner. Stop.
- Budget or board stop. Iteration budget reached, or the board has rejected the next fix proposal. Move the loop issue to `cancelled` with a comment that summarizes the run history and the reason for stopping.

6. Request board confirmation before any product fix

When the iteration ends in product fix proposed:

- Update the iteration child's `plan` document with the proposed contract, the three-invariant check, the affected Paperclip surfaces, and the phased subtasks (implementation, QA, CTO review, rerun), but do not create those subtasks.
- Open the `request_confirmation` interaction on the iteration child (the same issue that owns the `plan` document), targeting the latest plan revision. Idempotency key: `confirmation:{iterationIssueId}:plan:{revisionId}`. Set `continuationPolicy` to `wake_assignee`.
- Move the iteration child to `in_review`. The typed waiter (the `request_confirmation` interaction) sits directly on it, so its `in_review` is healthy. Comment links the plan document and names the pending confirmation.
- Move the loop parent to `blocked` with `blockedByIssueIds: [iterationChildId]` and a comment naming the board (or whichever approver the approval policy designates) as the unblock owner. Do not move the loop parent to `in_review` here: the typed waiter lives on the iteration child, not on the parent, so the parent's wait path is the child blocker. This matches the topology rule that the loop parent only sits in `in_review` when a typed waiter is attached directly to the parent.
- Wait for acceptance. If the board posts a superseding comment that changes the plan, revise the document, then open a fresh confirmation tied to the new revision on the iteration child; the prior one is invalidated. The loop parent's `blockedByIssueIds` already points at the iteration child, so it does not need to change.
- On rejection, end the loop per the Budget or board stop rule; do not silently retry the same proposal.
- On acceptance, create the implementation, QA, CTO review, and rerun child issues with `blockedByIssueIds` wired in order, and update the loop parent's `blockedByIssueIds` to point at the new gating child (typically the implementation child) so the parent stays `blocked` against real downstream work. The implementation child must inherit the Paperclip App execution workspace (`inheritExecutionWorkspaceFromIssueId` to the worktree-owning issue) so the fix lands in the same isolated worktree the smoke ran against.
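
The idempotency key format above can be illustrated with a short sketch. The key template, `continuationPolicy`, and `wake_assignee` come from this step; the helper names and the other payload field names (`issueId`, `idempotencyKey`) are hypothetical and do not claim to match the real interaction API.

```python
def confirmation_key(iteration_issue_id: str, revision_id: str) -> str:
    """Build the idempotency key for a plan confirmation on an iteration child."""
    return f"confirmation:{iteration_issue_id}:plan:{revision_id}"


def confirmation_request(iteration_issue_id: str, revision_id: str) -> dict:
    """Assemble an illustrative request body; field names other than
    continuationPolicy are assumptions made for this sketch."""
    return {
        "issueId": iteration_issue_id,
        "idempotencyKey": confirmation_key(iteration_issue_id, revision_id),
        "continuationPolicy": "wake_assignee",
    }
```

Because the revision id is part of the key, retrying the same request for one revision is a no-op, while a revised plan naturally produces a fresh confirmation.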

7. Rerun against the same worktree

After implementation and QA complete (or immediately, in the non-product failure with retry case), the rerun child runs the same `paperclip-bench` invocation with `PAPERCLIPAI_CMD` still pinned to the Paperclip App worktree under test.

- The rerun must use the same worktree the fix landed in. If the workspace was reset between iterations, the loop is invalid: open a blocker on the loop issue and stop.
- On completion, the rerun child becomes the next iteration's run record. If the smoke now passes, jump to Step 8. Otherwise return to Step 4 with a new iteration child (subject to the iteration budget).

8. Pass: QA, CTO review, close

When the smoke passes:

- Create QA and CTO review children if they are not already in the dependency chain (CTO review blocked by QA, so the chain wakes in order). Move the loop parent to `blocked` with `blockedByIssueIds` set to the QA / CTO review chain, and post a comment that names QA and CTO as the unblock owners and links the children. The loop parent stays `blocked`, not `in_review`, because the typed waiter lives on the children, not on the parent.
- If you instead want the loop parent itself to sit in `in_review` during this phase (for example because a board user has explicitly volunteered to drive the review), put a typed waiter directly on the parent (execution-policy participant, `request_confirmation` / `ask_user_questions` / `suggest_tasks` interaction, approval, or named human owner) and do not rely on the child chain alone. Do not combine `in_review` on the parent with QA/CTO children acting as the blocker; that is the ambiguous review shape this skill exists to prevent.
- QA validates artifacts (manifest, `results.jsonl`, Harbor raw job, redacted telemetry) and the rerun reproducibility against the same worktree.
- CTO reviews the technical scope of any product fixes that landed during the loop.
- On QA + CTO acceptance, close the loop issue with a board-level summary comment: task name, iteration count, stop reason (pass), worktree pointer, link to the final artifact root, and the list of accepted product fixes (each with its implementation issue id).

9. Stop rules

The loop must stop, with state explicitly recorded on the loop issue, when any of these is true:

- Pass. Smoke verifier reports pass and QA + CTO accept (Step 8). Loop issue → `done`.
- Board rejection. Board rejects a fix proposal and does not request a revision. Loop issue → `cancelled`. Comment names the rejected proposal and the reason.
- Iteration budget reached. Iteration counter reaches the budget without a pass. Loop issue → `cancelled` (or `in_review` if the user must decide whether to extend the budget). Never silently start iteration N+1.
- Real blocker named. External blocker (credentials, quota, infra, security, missing skill) cannot be resolved by the loop driver. Loop issue → `blocked` with `blockedByIssueIds` to the blocker issue and the unblock owner named.

A loop must never end on a prose comment alone. Every stop is a status transition with a named next-action owner.
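
The stop rules can be read as a mapping from stop reason to status transition. The sketch below assumes hypothetical reason labels and a hypothetical transition record; only the statuses and `blockedByIssueIds` are taken from the rules above.

```python
# Hypothetical reason labels mapped to the statuses named in the stop rules.
STOP_STATUS = {
    "pass": "done",
    "board_rejection": "cancelled",
    "budget_reached": "cancelled",
    "real_blocker": "blocked",
}


def stop_transition(reason, owner, blocker_issue_id=None, user_decides_budget=False):
    """Return the status transition for a stop, refusing prose-only stops."""
    if reason not in STOP_STATUS:
        raise ValueError(f"not a recognised stop reason: {reason}")
    if not owner:
        raise ValueError("every stop must name a next-action owner")
    status = STOP_STATUS[reason]
    if reason == "budget_reached" and user_decides_budget:
        status = "in_review"  # the user decides whether to extend the budget
    transition = {"status": status, "owner": owner}
    if status == "blocked":
        if not blocker_issue_id:
            raise ValueError("a blocked loop needs a blocker issue id")
        transition["blockedByIssueIds"] = [blocker_issue_id]
    return transition
```

Any reason outside the four listed raises an error, which mirrors the rule that a loop never ends on an unclassified or comment-only stop.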

Worktree rule

The loop must not test whatever Paperclip checkout happens to be current for the heartbeat. It must test the same isolated Paperclip App worktree where proposed fixes are applied.

- The first iteration creates the Paperclip App implementation child; that project's git-worktree policy spawns a fresh worktree.
- The loop issue records the worktree-owning issue id and the workspace path (or workspace id).
- Every later implementation, QA, and rerun child sets `inheritExecutionWorkspaceFromIssueId` to that worktree-owning issue, so all subsequent loop work shares one workspace.
- The benchmark command always sets `PAPERCLIPAI_CMD` (or the equivalent command binding) to the CLI entrypoint inside that worktree, and it carries the recorded dispatch runner config (`PAPERCLIP_HARBOR_RUNNER_CONFIG` or equivalent) needed to assign the benchmark issue and start the heartbeat. The benchmark command stored on the loop issue is the source of truth: if a heartbeat needs to run the smoke from a different shell, it copies the recorded command block verbatim, not only the Harbor invocation line.
- If the workspace is pruned or the worktree path no longer resolves, the loop is invalid until rebuilt. Mark the loop `blocked` and name the unblock owner (typically CodexCoder or the Paperclip App owner).
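
One way to check the worktree rule before launching a run is a path test like the sketch below. The function name is hypothetical, and it assumes the worktree root and the CLI entrypoint are both available as filesystem paths on the machine running the smoke.

```python
from pathlib import Path


def cli_is_inside_worktree(cli_path: str, worktree_root: str) -> bool:
    """True only if the worktree exists and the CLI entrypoint resolves inside it."""
    root = Path(worktree_root).resolve()
    if not root.is_dir():
        return False  # pruned or missing worktree: the loop is invalid until rebuilt
    cli = Path(cli_path).resolve()
    return cli.is_file() and root in cli.parents
```

A `False` result covers both failure modes named above: the worktree no longer resolves, or `PAPERCLIPAI_CMD` points at some other checkout such as the operator's own.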

Liveness rule

Every loop issue, at the end of every heartbeat, must rest in one of:

- Terminal: `done` or `cancelled`. No further action.
- Explicitly live: `in_progress` with an active run, an upcoming queued wake, or a child issue actively executing under it.
- Explicitly waiting: `in_review` with a typed waiter: execution-policy participant, `request_confirmation` / `ask_user_questions` / `suggest_tasks` interaction, approval, or a named human owner.
- Explicit recovery / blocker: `blocked` with `blockedByIssueIds` set to a real blocking issue, plus a comment naming the unblock owner and the action needed.

If a loop issue does not fit one of these on exit, the heartbeat is not done. Fix the state before exiting.
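
The four resting states above can be expressed as an exit check. In this sketch the issue is a plain dictionary; apart from `status` values and `blockedByIssueIds`, the field names (`activeRun`, `queuedWake`, `childExecuting`, `typedWaiter`, `unblockOwner`) are hypothetical stand-ins for whatever the real issue record exposes.

```python
TERMINAL = {"done", "cancelled"}


def rests_in_allowed_state(issue: dict) -> bool:
    """Return True if the issue sits in one of the four allowed resting states."""
    status = issue.get("status")
    if status in TERMINAL:
        return True
    if status == "in_progress":
        # Live only with an active run, a queued wake, or an executing child.
        return bool(issue.get("activeRun") or issue.get("queuedWake")
                    or issue.get("childExecuting"))
    if status == "in_review":
        # Waiting only counts when a typed waiter is attached.
        return bool(issue.get("typedWaiter"))
    if status == "blocked":
        return bool(issue.get("blockedByIssueIds")) and bool(issue.get("unblockOwner"))
    return False
```

Running this over every loop and iteration issue touched in a heartbeat gives a simple gate: any `False` means the state must be fixed before the heartbeat exits.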

Pitfalls

- Running the smoke against the operator's Paperclip checkout. The whole point of the worktree rule is that the bench tests the worktree the fix lands in. Always set `PAPERCLIPAI_CMD` and verify the path before launching the run.
- Dropping the dispatch config. A Harbor run that omits `PAPERCLIP_HARBOR_RUNNER_CONFIG` (or equivalent) may boot Paperclip and create `BEN-1`, but leave it unassigned with zero heartbeat-enabled agents. That is not a Terminal-Bench product signal. Preserve and rerun the full command block, including assignee and adapter config.
- Coding before approval. No implementation child exists until a board confirmation accepts the iteration's `plan` document. Do not push code in the diagnostic phase.
- Skipping the recent-work survey. When proposing a Paperclip product rule, check what already shipped in the affected liveness/execution area in the last few days. A rule that contradicts last week's accepted contract is rework.
- Letting `in_review` mean done. A loop or iteration child sitting in `in_review` with no participant, no interaction, no approval, and no human owner is a stop, not progress. Treat it as a liveness violation and route it.
- Silent iteration N+1. If the iteration budget is reached, never start another iteration without an explicit budget extension recorded on the loop issue.
- Comparable-run drift. This skill produces smoke runs only. If the asker wants a comparable benchmark submission, hand off to BenchmarkQualityManager and BenchmarkForensics; do not relabel a smoke as comparable.
- Recursive recovery. Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop. If a diagnosis surfaces it inside the smoke's subtree, refuse to deepen and route to `/diagnose-why-work-stopped` for a product-rule fix.
- Skill-library mutation. This skill never installs, edits, or assigns company skills as part of a loop iteration. Library changes go to an authorized skill-library owner via a separate issue.
- Hiding the chain. Do not silently delete or hide failed iteration children, retracted proposals, or rejected confirmations. The audit trail is the loop's evidence.

Verification checklist (before exiting a heartbeat that touched the loop)

All inputs are recorded on the top-level loop issue, including the exact benchmark command, `PAPERCLIPAI_CMD` binding, and dispatch runner config.
Iteration counter is up to date and within budget.
The Paperclip App worktree pointer still resolves, and the iteration's run/implementation/rerun children share that workspace.
The smoke run is captured with run ids, manifest, `results.jsonl`, Harbor raw job folder, and stop reason.
Paperclip telemetry shows the benchmark issue was assigned and a heartbeat was enabled/observed, or the iteration is explicitly classified as harness/setup no-dispatch.
Diagnosis applies the `/diagnose-why-work-stopped` pattern, classifies every non-progressing issue, and checks the three invariants.
No implementation child exists for an unapproved fix proposal; if one was proposed, a `request_confirmation` is open against the latest plan revision.
Every loop and iteration issue rests in a terminal, explicitly-live, explicitly-waiting, or named-blocker state.
The stop reason — if the loop stopped this heartbeat — is one of pass, board rejection, budget exhausted, or named real blocker.
No company-skill library mutation happened in this heartbeat.
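
Several of the checklist items above are mechanical enough to sketch as a pre-exit check. The following is a minimal illustration, not Paperclip's actual API: only the status values (`done`, `cancelled`, `in_review`, `blocked`) and `blockedByIssueIds` come from this skill; every other field name (`hasActiveRun`, `hasQueuedWake`, `waitingOn`, `ownerId`), the stop-reason strings, and both function names are hypothetical.

```typescript
// Hypothetical issue shape; only `status` values and `blockedByIssueIds`
// are named in the skill text. All other fields are assumptions.
interface LoopIssue {
  id: string;
  status: string;              // e.g. "done", "cancelled", "in_review", "blocked"
  hasActiveRun: boolean;       // assumed flag: a run is currently executing
  hasQueuedWake: boolean;      // assumed flag: a wake is queued
  waitingOn: string[];         // assumed: participant / interaction / approval ids
  blockedByIssueIds: string[];
  ownerId?: string;            // assumed: named owner of a blocker
}

type RestingState = "terminal" | "live" | "waiting" | "blocked" | "violation";

// Map an issue onto one of the allowed resting states, or "violation".
function classifyRestingState(issue: LoopIssue): RestingState {
  if (issue.status === "done" || issue.status === "cancelled") return "terminal";
  if (issue.hasActiveRun || issue.hasQueuedWake) return "live";
  if (issue.status === "in_review" && issue.waitingOn.length > 0) return "waiting";
  if (
    issue.status === "blocked" &&
    issue.blockedByIssueIds.length > 0 &&
    issue.ownerId !== undefined
  ) {
    return "blocked";
  }
  return "violation"; // e.g. in_review with nothing to wait on
}

// Assumed string encoding of the four allowed stop reasons.
const STOP_REASONS: readonly string[] = [
  "pass",
  "board_rejection",
  "budget_exhausted",
  "named_blocker",
];

interface HeartbeatSnapshot {
  iteration: number;
  iterationBudget: number;
  issues: LoopIssue[];
  stopReason?: string; // set only if the loop stopped this heartbeat
}

// Collect human-readable problems; an empty array means these checks pass.
function exitProblems(s: HeartbeatSnapshot): string[] {
  const problems: string[] = [];
  if (s.iteration > s.iterationBudget) {
    problems.push(`iteration ${s.iteration} exceeds budget ${s.iterationBudget}`);
  }
  for (const issue of s.issues) {
    if (classifyRestingState(issue) === "violation") {
      problems.push(`issue ${issue.id} rests in an unexplained "${issue.status}" state`);
    }
  }
  if (s.stopReason !== undefined && STOP_REASONS.indexOf(s.stopReason) === -1) {
    problems.push(`stop reason "${s.stopReason}" is not an allowed stop reason`);
  }
  return problems;
}
```

This covers only the budget, resting-state, and stop-reason items. The artifact, telemetry, and worktree items depend on data this sketch does not model.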

Deterministic smoke

Run this smoke after installing or changing the skill, before treating it as operational for a live Terminal-Bench loop:

pnpm smoke:terminal-bench-loop-skill

The command uses the current Paperclip API token and company from `PAPERCLIP_API_URL`, `PAPERCLIP_API_KEY`, and `PAPERCLIP_COMPANY_ID`. When `PAPERCLIP_TASK_ID` is set, it attaches the smoke issues under that source issue and inherits its project/goal context. By default it cancels the short-lived smoke issues after verification; pass `-- --keep` to leave the verified `blocked` loop parent, `in_review` iteration child, and pending confirmation available for manual inspection.
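
Before invoking the command, it can help to confirm that the environment described above is actually populated. The sketch below is illustrative only: the variable names and the `--keep` flag come from this skill, but `preflightSmoke`, its return shape, and the assumption that all three connection variables are required are hypothetical.

```typescript
// Variables the smoke command reads, per the paragraph above.
// Treating all three as required is an assumption of this sketch.
const REQUIRED_VARS: readonly string[] = [
  "PAPERCLIP_API_URL",
  "PAPERCLIP_API_KEY",
  "PAPERCLIP_COMPANY_ID",
];

interface SmokePreflight {
  missing: string[];             // required variables that are unset or blank
  attachToTaskId: string | null; // PAPERCLIP_TASK_ID if set, else null
  keepIssues: boolean;           // true when --keep was passed
}

function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// Pure function: takes the environment and argument list as inputs so it
// can be checked without touching the real process environment.
function preflightSmoke(
  env: Record<string, string | undefined>,
  args: string[],
): SmokePreflight {
  const missing = REQUIRED_VARS.filter((name) => isBlank(env[name]));
  const taskId = env["PAPERCLIP_TASK_ID"];
  return {
    missing,
    attachToTaskId: taskId === undefined || isBlank(taskId) ? null : taskId,
    keepIssues: args.indexOf("--keep") !== -1,
  };
}
```

Under Node, a caller would typically pass `process.env` and `process.argv.slice(2)`, then refuse to start the smoke if `missing` is non-empty.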

The smoke is deterministic and intentionally non-comparable. It does not start Terminal-Bench, Harbor, an agent model, or a provider runtime. It verifies only the control-plane shape:

local `skills/terminal-bench-loop/SKILL.md` contains the loop contract terms;
a top-level loop issue can be created and updated into a blocker posture;
an iteration child issue can be created under the loop parent;
mocked benchmark artifact paths are recorded on a `run` document;
a `diagnosis` document names the exact stop point and next-action owner;
a `request_confirmation` interaction is created and the iteration child rests in `in_review` with a typed waiting path rather than silent review.
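
The six checks above can be read as assertions over a snapshot of what the smoke created. The following sketch shows one way to express them. It is not the smoke script itself, whose implementation is not shown here: `SmokeSnapshot`, `IssueRef`, `verifyControlPlaneShape`, and the `contractTerms` list are hypothetical, and it checks only that the `run` and `diagnosis` documents are present and non-empty, not what they say.

```typescript
// Hypothetical snapshot of the objects the smoke created.
interface IssueRef {
  id: string;
  status: string;
  parentId?: string;
}

interface SmokeSnapshot {
  skillText: string;            // contents of skills/terminal-bench-loop/SKILL.md
  contractTerms: string[];      // assumed list of terms the skill must mention
  loopParent: IssueRef | null;
  iterationChild: IssueRef | null;
  documents: Record<string, string | undefined>; // keyed by name: "run", "diagnosis"
  interactions: { kind: string; issueId: string }[];
}

// Returns one message per failed shape check; empty means the shape holds.
function verifyControlPlaneShape(s: SmokeSnapshot): string[] {
  const failures: string[] = [];

  for (const term of s.contractTerms) {
    if (s.skillText.indexOf(term) === -1) {
      failures.push(`SKILL.md is missing contract term "${term}"`);
    }
  }

  if (s.loopParent === null || s.loopParent.status !== "blocked") {
    failures.push("loop parent is not in a blocker posture");
  }

  if (
    s.iterationChild === null ||
    s.loopParent === null ||
    s.iterationChild.parentId !== s.loopParent.id
  ) {
    failures.push("iteration child is not under the loop parent");
  }

  for (const name of ["run", "diagnosis"]) {
    const body = s.documents[name];
    if (body === undefined || body.trim() === "") {
      failures.push(`"${name}" document is missing or empty`);
    }
  }

  const child = s.iterationChild;
  const hasConfirmation =
    child !== null &&
    s.interactions.some(
      (i) => i.kind === "request_confirmation" && i.issueId === child.id,
    );
  if (child === null || child.status !== "in_review" || !hasConfirmation) {
    failures.push("iteration child is not waiting in in_review on a request_confirmation");
  }

  return failures;
}
```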
