ce-optimize


Iterative Optimization Loop


Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.

Interaction Method


Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.

Input


<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."

Optimization Spec Schema


Reference the spec schema for validation: `references/optimize-spec-schema.yaml`

Experiment Log Schema


Reference the experiment log schema for state management: `references/experiment-log-schema.yaml`

Quick Start


For a first run, optimize for signal and safety, not maximum throughput:
  • Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
  • Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
  • Prefer `execution.mode: serial` and `execution.max_concurrent: 1`
  • Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1`
  • Avoid new dependencies until the baseline and measurement harness are trusted
  • For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`
For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see `references/usage-guide.md`.


Persistence Discipline


CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under `.context/compound-engineering/ce-optimize/<spec-name>/` are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.

Core Rules


  1. Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
  2. VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
  3. Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
  4. The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted.
  5. Per-experiment result markers for crash recovery — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
  6. Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
  7. Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
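Rules 1, 2, and 4 combine into a single append-then-verify step. A minimal shell sketch (the `append_experiment` helper and entry fields here are illustrative; real entries follow `references/experiment-log-schema.yaml`):

```shell
# Sketch: append one experiment entry to the log, then immediately verify it
# landed. The entry fields are illustrative, not the full schema.
append_experiment() {
  local log="$1" exp_id="$2" score="$3"
  # Append-only: never rewrite the existing file (Rule 4).
  cat >> "$log" <<EOF
  - id: ${exp_id}
    primary_score: ${score}
EOF
  # Read back and confirm before touching the next experiment (Rule 2).
  grep -q "id: ${exp_id}" "$log"
}
```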

Mandatory Disk Checkpoints


These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
| --- | --- | --- |
| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval |
| CP-1: Baseline recorded | `experiment-log.yaml` (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | `experiment-log.yaml` (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | `experiment-log.yaml` (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | `experiment-log.yaml` (outcomes + best) + `strategy-digest.md` | Phase 3.5, after batch evaluation |
| CP-5: Final summary | `experiment-log.yaml` (final state) | Phase 4, at wrap-up |
Format of a verification step:
  1. Write the file using the native file-write tool
  2. Read the file back using the native file-read tool
  3. Confirm the expected content is present
  4. If verification fails, retry the write. If it fails twice, alert the user.
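The four steps reduce to a small write-then-verify loop. A sketch, with the retry-twice-then-alert behavior (the `write_and_verify` helper name and file contents are hypothetical):

```shell
# Sketch of the four-step verification loop.
write_and_verify() {
  local file="$1" content="$2" attempt
  for attempt in 1 2; do
    printf '%s\n' "$content" > "$file"      # step 1: write
    if grep -qF "$content" "$file"; then    # steps 2-3: read back, confirm
      return 0
    fi
  done
  echo "ALERT: verification failed twice for $file" >&2   # step 4
  return 1
}
```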

File Locations (all under `.context/compound-engineering/ce-optimize/<spec-name>/`)


| File | Purpose | Written When |
| --- | --- | --- |
| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| `strategy-digest.md` | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |

On Resume


When Phase 0.4 detects an existing run:
  1. Read the experiment log from disk — this is the ground truth
  2. Scan worktree directories for `result.yaml` markers not yet in the log
  3. Recover any measured-but-unlogged experiments
  4. Continue from where the log left off
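The marker scan in steps 2-3 might look like the sketch below. The `recover_unlogged` helper, the worktree layout, and the log-entry format are assumptions for illustration:

```shell
# Sketch: find result.yaml markers whose experiment id is absent from the
# experiment log, i.e. experiments measured but never logged before a crash.
recover_unlogged() {
  local log="$1" worktree_root="$2" marker exp_id
  find "$worktree_root" -mindepth 2 -maxdepth 2 -name result.yaml 2>/dev/null |
  while read -r marker; do
    exp_id=$(basename "$(dirname "$marker")")   # e.g. exp-003
    if ! grep -q "id: ${exp_id}" "$log"; then
      echo "$exp_id"                            # measured but not in the log
    fi
  done
}
```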


Phase 0: Setup


0.1 Determine Input Type


Check whether the input is:
  • A spec file path (ends in `.yaml` or `.yml`): read and validate it
  • A description of the optimization goal: help the user create a spec interactively
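This dispatch-on-extension check is a one-liner; a sketch (the `classify_input` helper name is illustrative):

```shell
# Sketch: classify the optimization input as a spec path, a goal
# description, or empty (which triggers the prompt from the Input section).
classify_input() {
  case "$1" in
    *.yaml|*.yml) echo "spec-file" ;;
    "")           echo "empty" ;;
    *)            echo "description" ;;
  esac
}
```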

0.2 Load or Create Spec


If spec file provided:
  1. Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
  2. Validate against `references/optimize-spec-schema.yaml`:
    • All required fields present
    • `name` is lowercase kebab-case and safe to use in git refs / worktree paths
    • `metric.primary.type` is `hard` or `judge`
    • If type is `judge`, a `metric.judge` section exists with `rubric` and `scoring`
    • At least one degenerate gate defined
    • `measurement.command` is non-empty
    • `scope.mutable` and `scope.immutable` each have at least one entry
    • Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
    • `execution.max_concurrent` is at least 1
    • `execution.max_concurrent` does not exceed 6 when backend is `worktree`
  3. If validation fails, report errors and ask the user to fix them
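Putting those validation rules together, a minimal `type: hard` spec might look like the sketch below. Field placement is inferred from the rules above, so treat `references/optimize-spec-schema.yaml` as the authoritative shape; every value here is illustrative.

```yaml
# Minimal hard-metric spec sketch (illustrative field names and values).
name: build-time-opt            # lowercase kebab-case, git-ref safe
metric:
  primary:
    type: hard
gates:
  - check: "tests_passed == 1"  # at least one degenerate gate
measurement:
  command: "bash scripts/bench.sh"
scope:
  mutable: ["src/build/"]
  immutable: ["scripts/bench.sh"]
execution:
  mode: serial
  max_concurrent: 1
stopping:
  max_iterations: 4
  max_hours: 1
```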
If description provided:
  1. Analyze the project to understand what can be measured
  2. Detect whether the optimization target is qualitative or quantitative — this determines `type: hard` vs `type: judge` and is the single most important spec decision:
    Use `type: hard` when:
    • The metric is a scalar number with a clear "better" direction
    • The metric is objectively measurable (build time, test pass rate, latency, memory usage)
    • No human judgment is needed to evaluate "is this result actually good?"
    • Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size
    Use `type: judge` when:
    • The quality of the output requires semantic understanding to evaluate
    • A human reviewer would need to look at the results to say "this is better"
    • Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
    • The optimization could produce degenerate solutions that look good on paper
    • Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance
    IMPORTANT: If the target is qualitative, strongly recommend `type: judge`. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
    • Degenerate gates (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
    • LLM-as-judge (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
    • Diagnostics (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.
    If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
  3. Design the sampling strategy (for `type: judge`):
    Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
    Walk through these questions:
    • What does one "item" look like? (a cluster, a search result page, a summary, etc.)
    • What are the natural size/quality strata? (e.g., large clusters vs small clusters vs singletons)
    • Where are quality failures most likely? (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
    • What total sample size balances cost vs signal? (default: 30 items, adjust based on output volume)
    Example stratified sampling for clustering:
    ```yaml
    stratification:
      - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
        count: 10
      - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
        count: 10
      - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
        count: 10
    singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
    ```
    The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
    Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
  4. Design the rubric (for `type: judge`):
    Help the user define the scoring rubric. A good rubric:
    • Has a 1-5 scale (or similar) with concrete descriptions for each level
    • Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`)
    • Is specific enough that two judges would give similar scores
    • Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad
    Example for clustering:
    ```yaml
    rubric: |
      Rate this cluster 1-5:
      - 5: All items clearly about the same issue/feature
      - 4: Strong theme, minor outliers
      - 3: Related but covers 2-3 sub-topics that could reasonably be split
      - 2: Weak connection — items share superficial similarity only
      - 1: Unrelated items grouped together
      Also report: distinct_topics (integer), outlier_count (integer)
    ```
  5. Guide the user through the remaining spec fields:
    • What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters)
    • What command runs the measurement?
    • What files can be modified? What is immutable?
    • Any constraints or dependencies?
    • If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
    • If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted
  6. Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`
  7. Present the spec to the user for approval before proceeding

0.3 Search Prior Learnings


Dispatch `compound-engineering:research:learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.

0.4 Run Identity Detection


Check if the `optimize/<spec-name>` branch already exists:
```bash
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
```
If the branch exists, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`.
Present the user with a choice via the platform question tool:
  • Resume: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log.
  • Fresh start: archive the old branch to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch

0.5 Create Optimization Branch and Scratch Space


```bash
git checkout -b "optimize/<spec-name>"  # or switch to existing if resuming
```
Create the scratch directory:
```bash
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
```


Phase 1: Measurement Scaffolding


This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.

1.1 Clean-Tree Gate


Verify there are no uncommitted changes to files within `scope.mutable` or `scope.immutable`:
```bash
git status --porcelain
```
Filter the output against the scope paths. If any in-scope files have uncommitted changes:
  • Report which files are dirty
  • Ask the user to commit or stash before proceeding
  • Do NOT continue until the working tree is clean for in-scope files
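The scope filter can be sketched as follows; the porcelain output is passed in as a string purely so the logic is easy to exercise, and this simple prefix match ignores porcelain edge cases such as renames and quoted paths:

```shell
# Sketch: reduce `git status --porcelain` output to dirty files that fall
# inside the spec's scope path prefixes.
dirty_in_scope() {
  local porcelain="$1"; shift       # remaining args: scope path prefixes
  local status path prefix
  printf '%s\n' "$porcelain" | while read -r status path; do
    for prefix in "$@"; do
      case "$path" in "$prefix"*) echo "$path" ;; esac
    done
  done
}
```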

1.2 Build or Validate Measurement Harness


If the user provides a measurement harness (the `measurement.command` already exists):
  1. Run it once via the measurement script:
    ```bash
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
    ```
  2. Validate the JSON output:
    • Contains keys for all degenerate gate metric names
    • Contains keys for all diagnostic metric names
    • Values are numeric or boolean as expected
  3. If validation fails, report what is missing and ask the user to fix the harness
If the agent must build the harness:
  1. Analyze the codebase to understand the current approach and what should be measured
  2. Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent)
  3. Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it
  4. Run it once and validate the output
  5. Present the harness and its output to the user for review
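A rough sketch of the key check in step 2; a real validator should also type-check the values, and the key names below are examples, not part of any schema:

```shell
# Sketch: confirm the harness's JSON output mentions every expected metric
# key. A simple substring grep, so it is a smoke test rather than a parser.
validate_metrics_json() {
  local json="$1"; shift
  local key missing=0
  for key in "$@"; do
    if ! printf '%s' "$json" | grep -q "\"${key}\""; then
      echo "missing metric: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```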

1.3 Establish Baseline


Run the measurement harness on the current code.
If stability mode is `repeat`:
  1. Run the harness `repeat_count` times
  2. Aggregate results using the configured aggregation method (median, mean, min, max)
  3. Calculate variance across runs
  4. If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`
Record the baseline in the experiment log:
```yaml
baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...
```
If the primary type is `judge`, also run the judge evaluation on the baseline output to establish the starting judge score.
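The repeat-and-aggregate logic for the median case can be sketched as below; `run_harness` stands in for `measurement.command` and is assumed to print a single number per run:

```shell
# Sketch: run the harness repeat_count times and report the median, as in
# stability mode `repeat` with aggregation: median.
median_of_runs() {
  local repeat_count="$1"; shift
  local i
  for i in $(seq "$repeat_count"); do
    "$@"                          # one harness run, printing a number
  done | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR/2] + v[NR/2 + 1]) / 2
    }'
}
```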

1.4 Parallelism Readiness Probe


Run the parallelism probe script:
```bash
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
```
Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.

1.5 Worktree Budget Check


Count existing worktrees:
```bash
bash scripts/experiment-worktree.sh count
```
If count + `execution.max_concurrent` would exceed 12:
  • Warn the user
  • Suggest cleaning up existing worktrees or reducing `max_concurrent`
  • Do NOT block -- the user may proceed at their own risk
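The budget arithmetic is a one-line comparison (the helper name is illustrative; counts come from `experiment-worktree.sh count` in practice):

```shell
# Sketch: the 12-worktree budget check from this section.
worktree_budget_ok() {
  local current="$1" max_concurrent="$2"
  [ $((current + max_concurrent)) -le 12 ]
}
```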

1.6 Write Baseline to Disk (CP-1)


MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:
  1. Create the experiment log file at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`
  2. Include all required top-level sections from `references/experiment-log-schema.yaml`: `spec`, `run_id`, `started_at`, `baseline`, `experiments`, and `best`
  3. Seed `experiments` as an empty array and seed `best` from the baseline snapshot (use `iteration: 0`, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
  4. Optionally seed `hypothesis_backlog: []` here as well so the log shape is stable before Phase 2 populates it
  5. Verify: read the file back and confirm the required sections are present and the baseline values match
  6. Only THEN present results to the user

1.7 User Approval Gate


Present to the user via the platform question tool:
  • Baseline metrics: all gate values, diagnostic values, and judge scores (if applicable)
  • Experiment log location: show the file path so the user knows where results are saved
  • Parallel readiness: probe results, any blockers, mitigations applied
  • Clean-tree status: confirmed clean
  • Worktree budget: current count and projected usage
  • Judge budget: estimated per-experiment judge cost and the configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped)
Options:
  1. Proceed -- approve baseline and parallel config, move to Phase 2
  2. Adjust spec -- modify spec settings before proceeding
  3. Fix issues -- user needs to resolve blockers first
Do NOT proceed to Phase 2 until the user explicitly approves.
If the primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding.
State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.


Phase 2: Hypothesis Generation


2.1 Analyze Current Approach


Read the code within `scope.mutable` to understand:
  • The current implementation approach
  • Obvious improvement opportunities
  • Constraints and dependencies between components
Optionally dispatch `compound-engineering:research:repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar.

2.2 Generate Hypothesis List


Generate an initial set of hypotheses. Each hypothesis should have:
  • Description: what to try
  • Category: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
  • Priority: high, medium, or low based on expected impact and feasibility
  • Required dependencies: any new packages or tools needed
Include user-provided hypotheses if any were given as input.
Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.

2.3 Dependency Pre-Approval


Collect all unique new dependencies across all hypotheses.
If any hypotheses require new dependencies:
  1. Present the full dependency list to the user via the platform question tool
  2. Ask for bulk approval
  3. Mark each hypothesis's `dep_status` as `approved` or `needs_approval`
Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.

2.4 Record Hypothesis Backlog (CP-2)


MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:
```yaml
hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]
```


Phase 3: Optimization Loop


This phase repeats in batches until a stopping criterion is met.

3.1 Batch Selection


Select hypotheses for this batch:
  • Build a runnable backlog by excluding hypotheses with `dep_status: needs_approval`
  • If `execution.mode` is `serial`, force `batch_size = 1`
  • Otherwise, `batch_size = min(runnable_backlog_size, execution.max_concurrent)`
  • Prefer diversity: select from different categories when possible
  • Within a category, select by priority (high first)
If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
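The batch-size rules above reduce to a small function (a sketch; the helper name is illustrative):

```shell
# Sketch: compute batch size per the selection rules. Serial mode forces a
# batch of one; otherwise the runnable backlog is capped by max_concurrent.
batch_size() {
  local mode="$1" runnable="$2" max_concurrent="$3"
  if [ "$mode" = "serial" ]; then
    echo 1
  elif [ "$runnable" -lt "$max_concurrent" ]; then
    echo "$runnable"
  else
    echo "$max_concurrent"
  fi
}
```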

3.2 Dispatch Experiments

3.2 分发实验

For each hypothesis in the batch, dispatch according to
execution.mode
. In
serial
mode, run exactly one experiment to completion before selecting the next hypothesis. In
parallel
mode, dispatch the full batch concurrently.
Worktree backend:
  1. Create experiment worktree:
    bash
    WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # creates optimize-exp/<spec_name>/exp-<NNN>
  2. Apply port parameterization if configured (set env vars for the measurement script)
  3. Fill the experiment prompt template (
    references/experiment-prompt-template.md
    ) with:
    • Iteration number, spec name
    • Hypothesis description and category
    • Current best and baseline metrics
    • Mutable and immutable scope
    • Constraints and approved dependencies
    • Rolling window of last 10 experiments (concise summaries)
  4. Dispatch a subagent with the filled prompt, working in the experiment worktree
Codex backend:
  1. Check environment guard -- do NOT delegate if already inside a Codex sandbox:
    bash
    # If these exist, we're already in Codex -- fall back to subagent
    test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
  2. Fill the experiment prompt template
  3. Write the filled prompt to a temp file
  4. Dispatch via Codex:
    bash
    cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
  5. Security posture: use the user's selection (ask once per session if not set in spec)
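Filling the experiment prompt template is plain string substitution. A sketch assuming `$`-style placeholders; the placeholder names below are illustrative, and the actual field names in references/experiment-prompt-template.md may differ:

```python
from string import Template

# Placeholder names here are illustrative assumptions, not the
# template file's actual fields.
EXAMPLE_TEMPLATE = Template(
    "Iteration $iteration of $spec_name\n"
    "Hypothesis ($category): $hypothesis\n"
    "Current best: $best  Baseline: $baseline\n"
)

def fill_prompt(fields):
    # safe_substitute leaves unknown placeholders intact instead of raising,
    # which makes template/field mismatches visible in the output.
    return EXAMPLE_TEMPLATE.safe_substitute(fields)
```

The filled string is what gets written to the temp file for Codex dispatch, or handed directly to a subagent.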
对于批次中的每个假设,根据
execution.mode
分发。在
serial
模式下,先将一个实验运行至完成,再选择下一个假设。在
parallel
模式下,并发分发整个批次。
Worktree后端:
  1. 创建实验工作目录:
    bash
    WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # 创建optimize-exp/<spec_name>/exp-<NNN>
  2. 如果配置了端口参数化,应用设置(为测量脚本设置环境变量)
  3. 使用以下内容填充实验提示模板(
    references/experiment-prompt-template.md
    ):
    • 迭代次数、规格名称
    • 假设描述和类别
    • 当前最优和基线指标
    • 可变和不可变范围
    • 约束和已批准依赖
    • 最近10次实验的滚动窗口(简洁摘要)
  4. 使用填充后的提示调用子Agent,在实验工作目录中运行
Codex后端:
  1. 检查环境防护 — 如果已在Codex沙箱内,不要委托:
    bash
    # 如果这些存在,说明已在Codex中 — 回退到子Agent
    test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
  2. 填充实验提示模板
  3. 将填充后的提示写入临时文件
  4. 通过Codex分发:
    bash
    cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
  5. 安全策略:使用用户选择(如果规格中未设置,每会话询问一次)

3.3 Collect and Persist Results

3.3 收集并持久化结果

Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.
For each completed experiment, immediately:
  1. Run measurement in the experiment's worktree:
    bash
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
    • If stability mode is
      repeat
      , run the measurement harness
      repeat_count
      times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
    • Use the aggregated metrics as the experiment's score; if variance exceeds
      noise_threshold
      , record that in learnings so the operator knows the result is noisy.
  2. Write crash-recovery marker — immediately after measurement, write
    result.yaml
    in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.
  3. Read raw JSON output from the measurement script
  4. Evaluate degenerate gates:
    • For each gate in
      metric.degenerate_gates
      , parse the operator and threshold
    • Compare the metric value against the threshold
    • If ANY gate fails: mark outcome as
      degenerate
      , skip judge evaluation, save money
  5. If gates pass AND primary type is
    judge
    :
    • Read the experiment's output (cluster assignments, search results, etc.)
    • Apply stratified sampling per
      metric.judge.stratification
      config (using
      sample_seed
      )
    • Group samples into batches of
      metric.judge.batch_size
    • Fill the judge prompt template (
      references/judge-prompt-template.md
      ) for each batch
    • Dispatch
      ceil(sample_size / batch_size)
      parallel judge sub-agents
    • Each sub-agent returns structured JSON scores
    • Aggregate scores: compute the configured primary judge field from
      metric.judge.scoring.primary
      (which should match
      metric.primary.name
      ) plus any
      scoring.secondary
      values
    • If
      singleton_sample > 0
      : also dispatch singleton evaluation sub-agents
  6. If gates pass AND primary type is
    hard
    :
    • Use the metric value directly from the measurement output
  7. IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to
    .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
    right now. Use the transitional outcome
    measured
    once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to
    kept
    ,
    reverted
    , or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.
  8. VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.
Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to
results.tsv
after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
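The degenerate-gate evaluation in step 4 reduces to a handful of comparisons. A sketch, assuming each gate is a string like "num_clusters >= 3" (metric name, operator, threshold separated by spaces); the actual gate format is defined by the spec schema:

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq, "!=": operator.ne}

def evaluate_gates(gates, metrics):
    """Return the list of failed gate strings; empty list means all pass."""
    failed = []
    for gate in gates:
        name, op, threshold = gate.split()
        if op not in OPS:
            raise ValueError(f"unknown operator in gate: {gate!r}")
        if not OPS[op](metrics[name], float(threshold)):
            failed.append(gate)
    return failed
```

If the returned list is non-empty, the experiment is marked degenerate and judge evaluation is skipped, which is exactly the cost-saving short-circuit described above.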
实验完成后立即处理 — 不要等待整个批次完成后再写入结果。
对于每个已完成的实验,立即执行以下操作:
  1. 在实验工作目录中运行测量
    bash
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
    • 如果稳定性模式为
      repeat
      ,在该工作目录中运行测量工具
      repeat_count
      次,并完全按照Phase 1的方式汇总结果,然后再评估门槛或对实验排名。
    • 使用汇总后的指标作为实验分数;如果方差超过
      noise_threshold
      ,将其记录到学习成果中,以便操作人员知道结果存在噪声。
  2. 写入崩溃恢复标记 — 测量完成后立即在实验工作目录中写入包含原始指标的
    result.yaml
    。这确保即使Agent在更新主日志前崩溃,测量结果也可恢复。
  3. 读取测量脚本的原始JSON输出
  4. 评估退化门槛
    • 对于
      metric.degenerate_gates
      中的每个门槛,解析运算符和阈值
    • 将指标值与阈值比较
    • 如果任何门槛未通过:标记结果为
      degenerate
      ,跳过judge评估,节省成本
  5. 如果门槛通过且主类型为
    judge
    • 读取实验的输出(聚类分配、搜索结果等)
    • 根据
      metric.judge.stratification
      配置应用分层抽样(使用
      sample_seed
    • 将样本分组为
      metric.judge.batch_size
      大小的批次
    • 为每个批次填充judge提示模板(
      references/judge-prompt-template.md
    • 调用
      ceil(sample_size / batch_size)
      个并行judge子Agent
    • 每个子Agent返回结构化JSON分数
    • 汇总分数:根据
      metric.judge.scoring.primary
      (应与
      metric.primary.name
      匹配)计算配置的主judge字段,以及任何
      scoring.secondary
    • 如果
      singleton_sample > 0
      :同时调用单例评估子Agent
  6. 如果门槛通过且主类型为
    hard
    • 直接使用测量输出中的指标值
  7. 立即追加到磁盘上的实验日志(CP-3) — 不要推迟到批量评估时。立即将实验条目(迭代次数、假设、结果、指标、学习成果)写入
    .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
    。当实验有有效指标但尚未与当前最优解比较时,使用过渡结果
    measured
    。在评估步骤中将结果更新为
    kept
    reverted
    或其他最终状态,但原始指标已存储在磁盘上,不会因上下文压缩丢失。
  8. 验证写入(CP-3验证) — 从磁盘重新读取实验日志,确认刚写入的条目存在。如果验证失败,重试写入。在确认条目存在前,不要进行下一个实验。
为什么要立即写入并验证? Agent的上下文窗口并非持久存储。在长时间运行中,上下文压缩、会话崩溃和重启是预期情况。如果结果仅存在于Agent内存中,将会丢失。Karpathy的autoresearch在每次实验后写入
results.tsv
— 本技能必须对实验日志执行相同操作。验证步骤可捕获否则会导致数据丢失的静默写入失败。

3.4 Evaluate Batch

3.4 评估批次

After all experiments in the batch have been measured:
  1. Rank experiments by primary metric improvement:
    • For hard metrics: compare to the current best using
      metric.primary.direction
      (
      maximize
      means higher is better,
      minimize
      means lower is better), and require the absolute improvement to exceed
      measurement.stability.noise_threshold
      before treating it as a real win
    • For judge metrics: compare the configured primary judge score (
      metric.judge.scoring.primary
      /
      metric.primary.name
      ) to the current best, and require it to exceed
      minimum_improvement
  2. Identify the best experiment that passes all gates and improves the primary metric
  3. If best improves on current best: KEEP
    • Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
    • Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
    • Merge the committed experiment branch into the optimization branch
    • Use the message
      optimize(<spec-name>): <hypothesis description>
      for the experiment commit
    • After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
    • This is now the new baseline for subsequent batches
  4. Check file-disjoint runners-up (up to
    max_runner_up_merges_per_batch
    ):
    • For each runner-up that also improved, check file-level disjointness with the kept experiment
    • File-level disjointness: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines.
    • If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
    • If combined measurement is strictly better: keep the cherry-pick (outcome:
      runner_up_kept
      ), then clean up that runner-up's experiment worktree and branch
    • Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome:
      runner_up_reverted
      ), then clean up the runner-up's experiment worktree and branch
    • Stop after first failed combination
  5. Handle deferred deps: experiments that need unapproved dependencies get outcome
    deferred_needs_approval
  6. Revert all others: cleanup worktrees, log as
    reverted
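The ranking comparison in step 1 can be sketched as one predicate plus a sort. A sketch for hard metrics; judge metrics would substitute minimum_improvement for the noise threshold:

```python
def is_real_win(candidate, current_best, direction, noise_threshold):
    """True if candidate beats current_best by more than the noise floor."""
    if direction == "maximize":
        delta = candidate - current_best
    elif direction == "minimize":
        delta = current_best - candidate
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return delta > noise_threshold

def rank_batch(results, current_best, direction, noise_threshold):
    """Keep only real wins, best first."""
    wins = [r for r in results
            if is_real_win(r["score"], current_best, direction, noise_threshold)]
    return sorted(wins, key=lambda r: r["score"],
                  reverse=(direction == "maximize"))
```

The head of the ranked list is the KEEP candidate; the rest are the runners-up considered for file-disjoint cherry-picks in step 4.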
批次中所有实验测量完成后:
  1. 按主指标改进程度排名
    • 对于硬指标:根据
      metric.primary.direction
      maximize
      表示越高越好,
      minimize
      表示越低越好)与当前最优解比较,要求绝对改进超过
      measurement.stability.noise_threshold
      才视为真正的提升
    • 对于judge指标:将配置的主judge分数(
      metric.judge.scoring.primary
      /
      metric.primary.name
      )与当前最优解比较,要求超过
      minimum_improvement
  2. 确定通过所有门槛且主指标有改进的最优实验
  3. 如果最优实验优于当前最优解:保留
    • 先提交实验分支,以便在合并或挑选前,获胜的差异作为真实提交存在
    • 提交中仅包含可变范围的更改;如果没有合格的差异,将实验视为无改进并回退
    • 将提交的实验分支合并到优化分支
    • 实验提交的消息使用
      optimize(<spec-name>): <假设描述>
    • 合并成功后,清理获胜实验的工作目录和分支;优化分支上的集成提交是持久化工件
    • 这成为后续批次的新基线
  4. 检查文件不相交的候选最优解(最多
    max_runner_up_merges_per_batch
    个):
    • 对于每个同样有改进的候选最优解,检查其与保留实验的文件级不相交性
    • 文件级不相交性:如果两个实验修改的文件完全不同,则它们是不相交的。即使修改的是同一文件的不同行,也视为重叠。
    • 如果不相交:将候选最优解挑选到新基线上,重新运行完整测量
    • 如果合并后的测量结果严格更优:保留挑选的更改(结果:
      runner_up_kept
      ),然后清理该候选最优解的工作目录和分支
    • 否则:回退挑选的更改,记录为“单独使用有前景但组合使用无影响/有害”(结果:
      runner_up_reverted
      ),然后清理该候选最优解的工作目录和分支
    • 首次组合失败后停止
  5. 处理延迟依赖:需要未批准依赖的实验结果标记为
    deferred_needs_approval
  6. 回退所有其他实验:清理工作目录,记录为
    reverted
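The file-level disjointness rule in step 4 reduces to a set intersection over modified file paths. A sketch; in practice each file list would come from running `git diff --name-only` on the experiment branch:

```python
def are_disjoint(files_a, files_b):
    """File-level disjointness: no file modified by both experiments.
    Same file counts as overlap even when different lines changed."""
    return not (set(files_a) & set(files_b))
```

Only runners-up that are disjoint from the kept experiment are eligible for cherry-picking.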

3.5 Update State (CP-4)

3.5 更新状态(CP-4)

MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.
  1. Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.
  2. Finalize outcomes — update experiment entries from step 3.4 evaluation (mark
    kept
    ,
    reverted
    ,
    runner_up_kept
    , etc.). Write these outcome updates to disk immediately.
  3. Update the
    best
    section
    in the experiment log if a new best was found. Write to disk.
  4. Write strategy digest to
    .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
    :
    • Categories tried so far (with success/failure counts)
    • Key learnings from this batch and overall
    • Exploration frontier: what categories and approaches remain untried
    • Current best metrics and improvement from baseline
  5. Generate new hypotheses based on learnings:
    • Re-read the strategy digest from disk (not from memory)
    • Read the rolling window (last 10 experiments from the log on disk)
    • Do NOT read the full experiment log -- use the digest for broad context
    • Add new hypotheses to the backlog and write the updated backlog to disk
  6. Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.
CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the
best
section reflects the current best, (c) the hypothesis backlog is updated. Read
strategy-digest.md
back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.
Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
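CP-4 verification can be expressed as assertions over the re-read log. A sketch over an already-parsed log dict (loading the YAML itself, e.g. via yaml.safe_load, is assumed); field names follow the experiment log examples in this document, and the outcome vocabulary is collected from the states named in Phase 3:

```python
TERMINAL_OUTCOMES = {"kept", "reverted", "runner_up_kept", "runner_up_reverted",
                     "degenerate", "error", "timeout", "deferred_needs_approval"}

def verify_cp4(log, batch_iterations):
    """Raise AssertionError if the batch state is not fully persisted."""
    by_iter = {e["iteration"]: e for e in log.get("experiments", [])}
    for it in batch_iterations:
        entry = by_iter.get(it)
        assert entry is not None, f"experiment {it} missing from log"
        # 'measured' is transitional; every batch entry must be finalized.
        assert entry["outcome"] in TERMINAL_OUTCOMES, \
            f"experiment {it} still has outcome {entry['outcome']!r}"
    assert "best" in log, "best section missing"
    assert "hypothesis_backlog" in log, "hypothesis backlog missing"
```

A raised AssertionError means the write must be retried before moving on to the next batch.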
强制检查点。此时,单个实验结果已写入磁盘(步骤3.3中完成)。此步骤更新聚合状态并验证。
  1. 从磁盘重新读取实验日志 — 不要信任内存中的状态。日志是可信数据源。
  2. 最终确定结果 — 根据步骤3.4的评估更新实验条目(标记
    kept
    reverted
    runner_up_kept
    等)。立即将这些结果更新写入磁盘。
  3. 如果找到新的最优解,更新实验日志中的
    best
    部分
    。写入磁盘。
  4. 将策略摘要写入
    .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
    • 迄今为止尝试的类别(包含成功/失败计数)
    • 此批次及整体的关键学习成果
    • 探索前沿:尚未尝试的类别和方法
    • 当前最优指标及与基线的改进
  5. 根据学习成果生成新假设
    • 从磁盘重新读取策略摘要(而非内存)
    • 读取滚动窗口(磁盘日志中的最近10次实验)
    • 不要读取完整实验日志 — 使用摘要获取广泛上下文
    • 将新假设添加到待办项,并将更新后的待办项写入磁盘
  6. 将更新后的假设待办项写入磁盘 — 实验日志的待办项部分必须反映新增的假设和已移除(已测试)的假设。
CP-4验证:从磁盘重新读取实验日志。确认:(a) 此批次的所有实验结果已最终确定,(b)
best
部分反映当前最优解,(c) 假设待办项已更新。重新读取
strategy-digest.md
并确认其存在。完成后再进入下一批次或检查停止条件。
检查点:此时,此批次的所有状态已存储在磁盘上。如果Agent崩溃并重启,可从实验日志恢复,无数据丢失。

3.6 Check Stopping Criteria

3.6 检查停止条件

Stop the loop if ANY of these are true:
  • Target reached:
    stopping.target_reached
    is true,
    metric.primary.target
    is set, and the primary metric reaches that target according to
    metric.primary.direction
    (
    >=
    for
    maximize
    ,
    <=
    for
    minimize
    )
  • Max iterations: total experiments run >=
    stopping.max_iterations
  • Max hours: wall-clock time since Phase 3 start >=
    stopping.max_hours
  • Judge budget exhausted: cumulative judge spend >=
    metric.judge.max_total_cost_usd
    (if set)
  • Plateau: no improvement for
    stopping.plateau_iterations
    consecutive experiments
  • Manual stop: user interrupts (save state and proceed to Phase 4)
  • Empty backlog: no hypotheses remain and no new ones can be generated
If no stopping criterion is met, proceed to the next batch (step 3.1).
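The checks above can be sketched as a single predicate returning the first matched reason. A sketch, assuming the spec and runtime state have been loaded into plain dicts whose keys mirror the spec fields above:

```python
def stopping_reason(spec, state):
    """Return the name of the first stopping criterion met, or None."""
    stop = spec["stopping"]
    primary = spec["metric"]["primary"]
    target = primary.get("target")
    if stop.get("target_reached") and target is not None:
        hit = (state["best_score"] >= target if primary["direction"] == "maximize"
               else state["best_score"] <= target)
        if hit:
            return "target_reached"
    if state["total_experiments"] >= stop["max_iterations"]:
        return "max_iterations"
    if state["elapsed_hours"] >= stop["max_hours"]:
        return "max_hours"
    judge_budget = spec["metric"].get("judge", {}).get("max_total_cost_usd")
    if judge_budget is not None and state["judge_cost_usd"] >= judge_budget:
        return "judge_budget"
    if state["iterations_without_improvement"] >= stop["plateau_iterations"]:
        return "plateau"
    if not state["backlog"] and not state["can_generate_hypotheses"]:
        return "empty_backlog"
    return None  # keep looping
```

Manual stop is not modeled here because it is an interrupt, not a state check.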
如果满足以下任一条件,停止循环:
  • 达到目标
    stopping.target_reached
    为true,
    metric.primary.target
    已设置,且主指标根据
    metric.primary.direction
    达到目标(
    maximize
    >=
    minimize
    <=
  • 最大迭代次数:已运行的实验总数 >=
    stopping.max_iterations
  • 最长运行时间:自Phase 3开始的挂钟时间 >=
    stopping.max_hours
  • Judge预算耗尽:累计judge支出 >=
    metric.judge.max_total_cost_usd
    (如果已设置)
  • 进入平台期:连续
    stopping.plateau_iterations
    次实验无改进
  • 手动停止:用户中断(保存状态并进入Phase 4)
  • 待办项为空:无剩余假设且无法生成新假设
如果未满足任何停止条件,进入下一批次(步骤3.1)。

3.7 Cross-Cutting Concerns

3.7 跨领域关注点

Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:
  • Log as outcome
    error
    or
    timeout
    with the error message
  • Revert the experiment (cleanup worktree)
  • The loop continues with remaining experiments in the batch
Progress reporting: After each batch, report:
  • Batch N of estimated M (based on backlog size)
  • Experiments run this batch and total
  • Current best metric and improvement from baseline
  • Cumulative judge cost (if applicable)
Crash recovery: See Persistence Discipline section. Per-experiment
result.yaml
markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any
result.yaml
markers not yet reflected in the log.
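The resume-time scan can be sketched as matching marker paths against iterations already in the log. A sketch; the exp-<NNN> naming follows the worktree layout created in step 3.2, and in practice the path list would come from a glob over optimize-exp/:

```python
import re

def orphaned_markers(marker_paths, logged_iterations):
    """Return result.yaml paths whose experiment is not yet in the log.
    Paths are expected to look like optimize-exp/<spec>/exp-<NNN>/result.yaml."""
    orphans = []
    for path in marker_paths:
        m = re.search(r"exp-(\d+)/result\.yaml$", path)
        if m and int(m.group(1)) not in logged_iterations:
            orphans.append(path)
    return orphans
```

Each orphan's metrics are re-read from its result.yaml and appended to the log before the loop resumes.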

Codex失败连锁反应:跟踪连续的Codex委托失败。连续失败3次后,自动为剩余实验禁用Codex,回退到子Agent分发。记录切换操作。
错误处理:如果实验的测量命令崩溃、超时或生成格式错误的输出:
  • 将结果记录为
    error
    timeout
    ,并包含错误消息
  • 回退实验(清理工作目录)
  • 循环继续处理批次中的剩余实验
进度报告:每个批次完成后,报告:
  • 第N批次,共预计M批次(基于待办项大小)
  • 此批次及总共运行的实验数
  • 当前最优指标及与基线的改进
  • 累计judge成本(如适用)
崩溃恢复:请查看持久化规范部分。步骤3.3中写入了每实验
result.yaml
标记。单个实验结果在步骤3.3中立即追加到日志中。批次级状态(结果、最优解、摘要)在步骤3.5中写入。恢复时(Phase 0.4),磁盘上的日志是真实状态 — 扫描所有尚未反映在日志中的
result.yaml
标记。

Phase 4: Wrap-Up

Phase 4: 收尾阶段

4.1 Present Deferred Hypotheses

4.1 展示延迟的假设

If any hypotheses were deferred due to unapproved dependencies:
  1. List them with their dependency requirements
  2. Ask the user whether to approve, skip, or save for a future run
  3. If approved: add to backlog and offer to re-enter Phase 3 for one more round
如果存在因未批准依赖而延迟的假设:
  1. 列出它们及其依赖要求
  2. 询问用户是否批准、跳过或保存供未来运行使用
  3. 如果批准:添加到待办项,并提供重新进入Phase 3进行额外一轮实验的选项

4.2 Summarize Results

4.2 总结结果

Present a comprehensive summary:
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
  Kept: <count> (including <runner_up_kept_count> runner-up merges)
  Reverted: <count>
  Degenerate: <count>
  Errors: <count>
  Deferred: <count>

Baseline -> Final:
  <primary_metric>: <baseline_value> -> <final_value> (<delta>)
  <gate_metrics>: ...
  <diagnostics>: ...

Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
  1. <kept experiment 1 hypothesis> (+<delta>)
  2. <kept experiment 2 hypothesis> (+<delta>)
  ...
展示全面总结:
优化任务:<spec-name>
持续时间:<挂钟时间>
总实验数:<数量>
  保留:<数量>(包含<runner_up_kept_count>个候选最优解合并)
  回退:<数量>
  退化:<数量>
  错误:<数量>
  延迟:<数量>

基线 -> 最终:
  <primary_metric>:<baseline_value> -> <final_value>(<delta>)
  <gate_metrics>:...
  <diagnostics>:...

Judge成本:$<total_judge_cost_usd>(如适用)

关键改进:
  1. <保留的实验1假设>(+<delta>)
  2. <保留的实验2假设>(+<delta>)
  ...

4.3 Preserve and Offer Next Steps

4.3 保留并提供后续步骤选项

The optimization branch (
optimize/<spec-name>
) is preserved with all commits from kept experiments. The experiment log and strategy digest remain in local
.context/...
scratch space for resume and audit on this machine only; they do not travel with the branch because
.context/
is gitignored.
Present post-completion options via the platform question tool:
  1. Run
    /ce:review
    on the cumulative diff (baseline to final). Load the
    ce:review
    skill with
    mode:autofix
    on the optimization branch.
  2. Run
    /ce:compound
    to document the winning strategy as an institutional learning.
  3. Create PR from the optimization branch to the default branch.
  4. Continue with more experiments: re-enter Phase 3 with the current state, re-reading it from disk first.
  5. Done -- leave the optimization branch for manual review.
优化分支(
optimize/<spec-name>
)会保留所有保留实验的提交。实验日志和策略摘要保留在本地
.context/...
临时空间中,仅可在此机器上恢复和审核;由于
.context/
被git忽略,它们不会随分支同步。
通过平台提问工具展示完成后的选项:
  1. 运行
    /ce:review
    检查累积差异(基线到最终版本)。在优化分支上以
    mode:autofix
    模式加载
    ce:review
    技能。
  2. 运行
    /ce:compound
    将获胜策略记录为机构学习成果。
  3. 从优化分支向默认分支创建PR。
  4. 继续进行更多实验:使用当前状态重新进入Phase 3。先重新读取状态。
  5. 完成 — 保留优化分支供人工审核。

4.4 Cleanup

4.4 清理

Clean up scratch space:
bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md

Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup.
Do NOT delete experiment worktrees that are still being referenced.
清理临时空间:
bash
# 保留实验日志以便在此机器上本地恢复/审核
# 删除临时批次工件
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md

如果用户可能在本地恢复或需要本地审核跟踪,不要删除实验日志。如果需要持久化的共享工件,在清理前将结果汇总或导出到受跟踪的路径。
不要删除仍被引用的实验工作目录。