ce-optimize


Iterative Optimization Loop


Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.

Interaction Method


Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.

Input


<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."

Optimization Spec Schema


Reference the spec schema for validation: `references/optimize-spec-schema.yaml`

Experiment Log Schema


Reference the experiment log schema for state management: `references/experiment-log-schema.yaml`

Quick Start


For a first run, optimize for signal and safety, not maximum throughput:
  • Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
  • Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
  • Prefer `execution.mode: serial` and `execution.max_concurrent: 1`
  • Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1`
  • Avoid new dependencies until the baseline and measurement harness are trusted
  • For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`
For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see `references/usage-guide.md`.


Persistence Discipline


CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under `.context/compound-engineering/ce-optimize/<spec-name>/` are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.

Core Rules


  1. Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
  2. VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
  3. Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
  4. The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted.
  5. Per-experiment result markers for crash recovery — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
  6. Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
  7. Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
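Rules 1, 2, and 4 combine into a single append-then-verify step. A minimal shell sketch (the `append_experiment` helper and entry fields here are illustrative; real entries follow `references/experiment-log-schema.yaml`):

```shell
# Sketch: append one experiment entry to the log, then immediately verify it
# landed. The entry fields are illustrative, not the full schema.
append_experiment() {
  local log="$1" exp_id="$2" score="$3"
  # Append-only: never rewrite the existing file (Rule 4).
  cat >> "$log" <<EOF
  - id: ${exp_id}
    primary_score: ${score}
EOF
  # Read back and confirm before touching the next experiment (Rule 2).
  grep -q "id: ${exp_id}" "$log"
}
```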

Mandatory Disk Checkpoints


These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
| --- | --- | --- |
| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval |
| CP-1: Baseline recorded | `experiment-log.yaml` (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | `experiment-log.yaml` (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | `experiment-log.yaml` (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | `experiment-log.yaml` (outcomes + best) + `strategy-digest.md` | Phase 3.5, after batch evaluation |
| CP-5: Final summary | `experiment-log.yaml` (final state) | Phase 4, at wrap-up |
Format of a verification step:
  1. Write the file using the native file-write tool
  2. Read the file back using the native file-read tool
  3. Confirm the expected content is present
  4. If verification fails, retry the write. If it fails twice, alert the user.
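The four steps reduce to a small write-then-verify loop. A sketch, with the retry-twice-then-alert behavior (the `write_and_verify` helper name and file contents are hypothetical):

```shell
# Sketch of the four-step verification loop.
write_and_verify() {
  local file="$1" content="$2" attempt
  for attempt in 1 2; do
    printf '%s\n' "$content" > "$file"      # step 1: write
    if grep -qF "$content" "$file"; then    # steps 2-3: read back, confirm
      return 0
    fi
  done
  echo "ALERT: verification failed twice for $file" >&2   # step 4
  return 1
}
```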

File Locations (all under `.context/compound-engineering/ce-optimize/<spec-name>/`)


| File | Purpose | Written When |
| --- | --- | --- |
| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| `strategy-digest.md` | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |

On Resume


When Phase 0.4 detects an existing run:
  1. Read the experiment log from disk — this is the ground truth
  2. Scan worktree directories for `result.yaml` markers not yet in the log
  3. Recover any measured-but-unlogged experiments
  4. Continue from where the log left off
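The marker scan in steps 2-3 might look like the sketch below. The `recover_unlogged` helper, the worktree layout, and the log-entry format are assumptions for illustration:

```shell
# Sketch: find result.yaml markers whose experiment id is absent from the
# experiment log, i.e. experiments measured but never logged before a crash.
recover_unlogged() {
  local log="$1" worktree_root="$2" marker exp_id
  find "$worktree_root" -mindepth 2 -maxdepth 2 -name result.yaml 2>/dev/null |
  while read -r marker; do
    exp_id=$(basename "$(dirname "$marker")")   # e.g. exp-003
    if ! grep -q "id: ${exp_id}" "$log"; then
      echo "$exp_id"                            # measured but not in the log
    fi
  done
}
```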


Phase 0: Setup


0.1 Determine Input Type


Check whether the input is:
  • A spec file path (ends in `.yaml` or `.yml`): read and validate it
  • A description of the optimization goal: help the user create a spec interactively
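This dispatch-on-extension check is a one-liner; a sketch (the `classify_input` helper name is illustrative):

```shell
# Sketch: classify the optimization input as a spec path, a goal
# description, or empty (which triggers the prompt from the Input section).
classify_input() {
  case "$1" in
    *.yaml|*.yml) echo "spec-file" ;;
    "")           echo "empty" ;;
    *)            echo "description" ;;
  esac
}
```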

0.2 Load or Create Spec


If spec file provided:
  1. Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
  2. Validate against `references/optimize-spec-schema.yaml`:
    • All required fields present
    • `name` is lowercase kebab-case and safe to use in git refs / worktree paths
    • `metric.primary.type` is `hard` or `judge`
    • If type is `judge`, a `metric.judge` section exists with `rubric` and `scoring`
    • At least one degenerate gate defined
    • `measurement.command` is non-empty
    • `scope.mutable` and `scope.immutable` each have at least one entry
    • Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
    • `execution.max_concurrent` is at least 1
    • `execution.max_concurrent` does not exceed 6 when backend is `worktree`
  3. If validation fails, report errors and ask the user to fix them
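Putting those validation rules together, a minimal `type: hard` spec might look like the sketch below. Field placement is inferred from the rules above, so treat `references/optimize-spec-schema.yaml` as the authoritative shape; every value here is illustrative.

```yaml
# Minimal hard-metric spec sketch (illustrative field names and values).
name: build-time-opt            # lowercase kebab-case, git-ref safe
metric:
  primary:
    type: hard
gates:
  - check: "tests_passed == 1"  # at least one degenerate gate
measurement:
  command: "bash scripts/bench.sh"
scope:
  mutable: ["src/build/"]
  immutable: ["scripts/bench.sh"]
execution:
  mode: serial
  max_concurrent: 1
stopping:
  max_iterations: 4
  max_hours: 1
```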
If description provided:
  1. Analyze the project to understand what can be measured
  2. Detect whether the optimization target is qualitative or quantitative — this determines `type: hard` vs `type: judge` and is the single most important spec decision:
    Use `type: hard` when:
    • The metric is a scalar number with a clear "better" direction
    • The metric is objectively measurable (build time, test pass rate, latency, memory usage)
    • No human judgment is needed to evaluate "is this result actually good?"
    • Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size
    Use `type: judge` when:
    • The quality of the output requires semantic understanding to evaluate
    • A human reviewer would need to look at the results to say "this is better"
    • Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
    • The optimization could produce degenerate solutions that look good on paper
    • Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance
    IMPORTANT: If the target is qualitative, strongly recommend `type: judge`. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
    • Degenerate gates (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
    • LLM-as-judge (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
    • Diagnostics (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.
    If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
  3. Design the sampling strategy (for `type: judge`):
    Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
    Walk through these questions:
    • What does one "item" look like? (a cluster, a search result page, a summary, etc.)
    • What are the natural size/quality strata? (e.g., large clusters vs small clusters vs singletons)
    • Where are quality failures most likely? (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
    • What total sample size balances cost vs signal? (default: 30 items, adjust based on output volume)
    Example stratified sampling for clustering:
    ```yaml
    stratification:
      - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
        count: 10
      - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
        count: 10
      - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
        count: 10
    singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
    ```
    The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
    Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
  4. Design the rubric (for `type: judge`):
    Help the user define the scoring rubric. A good rubric:
    • Has a 1-5 scale (or similar) with concrete descriptions for each level
    • Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`)
    • Is specific enough that two judges would give similar scores
    • Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad
    Example for clustering:
    ```yaml
    rubric: |
      Rate this cluster 1-5:
      - 5: All items clearly about the same issue/feature
      - 4: Strong theme, minor outliers
      - 3: Related but covers 2-3 sub-topics that could reasonably be split
      - 2: Weak connection — items share superficial similarity only
      - 1: Unrelated items grouped together
      Also report: distinct_topics (integer), outlier_count (integer)
    ```
  5. Guide the user through the remaining spec fields:
    • What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters)
    • What command runs the measurement?
    • What files can be modified? What is immutable?
    • Any constraints or dependencies?
    • If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
    • If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted
  6. Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`
  7. Present the spec to the user for approval before proceeding

0.3 Search Prior Learnings


Dispatch `compound-engineering:research:learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.

0.4 Run Identity Detection


Check if the `optimize/<spec-name>` branch already exists:
```bash
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
```
If the branch exists, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`.
Present the user with a choice via the platform question tool:
  • Resume: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log.
  • Fresh start: archive the old branch to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch

0.5 Create Optimization Branch and Scratch Space


```bash
git checkout -b "optimize/<spec-name>"  # or switch to existing if resuming
```
Create the scratch directory:
```bash
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
```


Phase 1: Measurement Scaffolding


This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.

1.1 Clean-Tree Gate


Verify there are no uncommitted changes to files within `scope.mutable` or `scope.immutable`:
```bash
git status --porcelain
```
Filter the output against the scope paths. If any in-scope files have uncommitted changes:
  • Report which files are dirty
  • Ask the user to commit or stash before proceeding
  • Do NOT continue until the working tree is clean for in-scope files
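The scope filter can be sketched as follows; the porcelain output is passed in as a string purely so the logic is easy to exercise, and this simple prefix match ignores porcelain edge cases such as renames and quoted paths:

```shell
# Sketch: reduce `git status --porcelain` output to dirty files that fall
# inside the spec's scope path prefixes.
dirty_in_scope() {
  local porcelain="$1"; shift       # remaining args: scope path prefixes
  local status path prefix
  printf '%s\n' "$porcelain" | while read -r status path; do
    for prefix in "$@"; do
      case "$path" in "$prefix"*) echo "$path" ;; esac
    done
  done
}
```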

1.2 Build or Validate Measurement Harness


If the user provides a measurement harness (the `measurement.command` already exists):
  1. Run it once via the measurement script:
    ```bash
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
    ```
  2. Validate the JSON output:
    • Contains keys for all degenerate gate metric names
    • Contains keys for all diagnostic metric names
    • Values are numeric or boolean as expected
  3. If validation fails, report what is missing and ask the user to fix the harness
If the agent must build the harness:
  1. Analyze the codebase to understand the current approach and what should be measured
  2. Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent)
  3. Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it
  4. Run it once and validate the output
  5. Present the harness and its output to the user for review
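A rough sketch of the key check in step 2; a real validator should also type-check the values, and the key names below are examples, not part of any schema:

```shell
# Sketch: confirm the harness's JSON output mentions every expected metric
# key. A simple substring grep, so it is a smoke test rather than a parser.
validate_metrics_json() {
  local json="$1"; shift
  local key missing=0
  for key in "$@"; do
    if ! printf '%s' "$json" | grep -q "\"${key}\""; then
      echo "missing metric: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```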

1.3 Establish Baseline


Run the measurement harness on the current code.
If stability mode is `repeat`:
  1. Run the harness `repeat_count` times
  2. Aggregate results using the configured aggregation method (median, mean, min, max)
  3. Calculate variance across runs
  4. If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`
Record the baseline in the experiment log:
```yaml
baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...
```
If the primary type is `judge`, also run the judge evaluation on the baseline output to establish the starting judge score.
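The repeat-and-aggregate logic for the median case can be sketched as below; `run_harness` stands in for `measurement.command` and is assumed to print a single number per run:

```shell
# Sketch: run the harness repeat_count times and report the median, as in
# stability mode `repeat` with aggregation: median.
median_of_runs() {
  local repeat_count="$1"; shift
  local i
  for i in $(seq "$repeat_count"); do
    "$@"                          # one harness run, printing a number
  done | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR/2] + v[NR/2 + 1]) / 2
    }'
}
```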

1.4 Parallelism Readiness Probe


Run the parallelism probe script:
```bash
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
```
Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.

1.5 Worktree Budget Check


Count existing worktrees:
```bash
bash scripts/experiment-worktree.sh count
```
If count + `execution.max_concurrent` would exceed 12:
  • Warn the user
  • Suggest cleaning up existing worktrees or reducing `max_concurrent`
  • Do NOT block -- the user may proceed at their own risk
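The budget arithmetic is a one-line comparison (the helper name is illustrative; counts come from `experiment-worktree.sh count` in practice):

```shell
# Sketch: the 12-worktree budget check from this section.
worktree_budget_ok() {
  local current="$1" max_concurrent="$2"
  [ $((current + max_concurrent)) -le 12 ]
}
```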

1.6 Write Baseline to Disk (CP-1)


MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:
  1. Create the experiment log file at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`
  2. Include all required top-level sections from `references/experiment-log-schema.yaml`: `spec`, `run_id`, `started_at`, `baseline`, `experiments`, and `best`
  3. Seed `experiments` as an empty array and seed `best` from the baseline snapshot (use `iteration: 0`, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
  4. Optionally seed `hypothesis_backlog: []` here as well so the log shape is stable before Phase 2 populates it
  5. Verify: read the file back and confirm the required sections are present and the baseline values match
  6. Only THEN present results to the user

1.7 User Approval Gate


Present to the user via the platform question tool:
  • Baseline metrics: all gate values, diagnostic values, and judge scores (if applicable)
  • Experiment log location: show the file path so the user knows where results are saved
  • Parallel readiness: probe results, any blockers, mitigations applied
  • Clean-tree status: confirmed clean
  • Worktree budget: current count and projected usage
  • Judge budget: estimated per-experiment judge cost and the configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped)
Options:
  1. Proceed -- approve baseline and parallel config, move to Phase 2
  2. Adjust spec -- modify spec settings before proceeding
  3. Fix issues -- user needs to resolve blockers first
Do NOT proceed to Phase 2 until the user explicitly approves.
If the primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding.
State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.


Phase 2: Hypothesis Generation


2.1 Analyze Current Approach


Read the code within `scope.mutable` to understand:
  • The current implementation approach
  • Obvious improvement opportunities
  • Constraints and dependencies between components
Optionally dispatch `compound-engineering:research:repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar.

2.2 Generate Hypothesis List


Generate an initial set of hypotheses. Each hypothesis should have:
  • Description: what to try
  • Category: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
  • Priority: high, medium, or low based on expected impact and feasibility
  • Required dependencies: any new packages or tools needed
Include user-provided hypotheses if any were given as input.
Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.

2.3 Dependency Pre-Approval


Collect all unique new dependencies across all hypotheses.
If any hypotheses require new dependencies:
  1. Present the full dependency list to the user via the platform question tool
  2. Ask for bulk approval
  3. Mark each hypothesis's `dep_status` as `approved` or `needs_approval`
Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.

2.4 Record Hypothesis Backlog (CP-2)


MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:
```yaml
hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]
```


Phase 3: Optimization Loop


This phase repeats in batches until a stopping criterion is met.

3.1 Batch Selection


Select hypotheses for this batch:
  • Build a runnable backlog by excluding hypotheses with `dep_status: needs_approval`
  • If `execution.mode` is `serial`, force `batch_size = 1`
  • Otherwise, `batch_size = min(runnable_backlog_size, execution.max_concurrent)`
  • Prefer diversity: select from different categories when possible
  • Within a category, select by priority (high first)
If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
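The batch-size rules above reduce to a small function (a sketch; the helper name is illustrative):

```shell
# Sketch: compute batch size per the selection rules. Serial mode forces a
# batch of one; otherwise the runnable backlog is capped by max_concurrent.
batch_size() {
  local mode="$1" runnable="$2" max_concurrent="$3"
  if [ "$mode" = "serial" ]; then
    echo 1
  elif [ "$runnable" -lt "$max_concurrent" ]; then
    echo "$runnable"
  else
    echo "$max_concurrent"
  fi
}
```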

3.2 Dispatch Experiments

3.2 分发实验

For each hypothesis in the batch, dispatch according to
execution.mode
. In
serial
mode, run exactly one experiment to completion before selecting the next hypothesis. In
parallel
mode, dispatch the full batch concurrently.
Worktree backend:
  1. Create experiment worktree:
    bash
    WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # creates optimize-exp/<spec_name>/exp-<NNN>
  2. Apply port parameterization if configured (set env vars for the measurement script)
  3. Fill the experiment prompt template (
    references/experiment-prompt-template.md
    ) with:
    • Iteration number, spec name
    • Hypothesis description and category
    • Current best and baseline metrics
    • Mutable and immutable scope
    • Constraints and approved dependencies
    • Rolling window of last 10 experiments (concise summaries)
  4. Dispatch a subagent with the filled prompt, working in the experiment worktree
Codex backend:
  1. Check environment guard -- do NOT delegate if already inside a Codex sandbox:
    bash
    # If these exist, we're already in Codex -- fall back to subagent
    test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
  2. Fill the experiment prompt template
  3. Write the filled prompt to a temp file
  4. Dispatch via Codex:
    bash
    cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
  5. Security posture: use the user's selection (ask once per session if not set in spec)
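Filling the experiment prompt template is plain string substitution. A sketch assuming `$`-style placeholders; the placeholder names below are illustrative, and the actual field names in references/experiment-prompt-template.md may differ:

```python
from string import Template

# Placeholder names here are illustrative assumptions, not the
# template file's actual fields.
EXAMPLE_TEMPLATE = Template(
    "Iteration $iteration of $spec_name\n"
    "Hypothesis ($category): $hypothesis\n"
    "Current best: $best  Baseline: $baseline\n"
)

def fill_prompt(fields):
    # safe_substitute leaves unknown placeholders intact instead of raising,
    # which makes template/field mismatches visible in the output.
    return EXAMPLE_TEMPLATE.safe_substitute(fields)
```

The filled string is what gets written to the temp file for Codex dispatch, or handed directly to a subagent.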
对于批次中的每个假设,根据
execution.mode
分发。在
serial
模式下,先将一个实验运行至完成,再选择下一个假设。在
parallel
模式下,并发分发整个批次。
Worktree后端:
  1. 创建实验工作目录:
    bash
    WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # 创建optimize-exp/<spec_name>/exp-<NNN>
  2. 如果配置了端口参数化,应用设置(为测量脚本设置环境变量)
  3. 使用以下内容填充实验提示模板(
    references/experiment-prompt-template.md
    ):
    • 迭代次数、规格名称
    • 假设描述和类别
    • 当前最优和基线指标
    • 可变和不可变范围
    • 约束和已批准依赖
    • 最近10次实验的滚动窗口(简洁摘要)
  4. 使用填充后的提示调用子Agent,在实验工作目录中运行
Codex后端:
  1. 检查环境防护 — 如果已在Codex沙箱内,不要委托:
    bash
    # 如果这些存在,说明已在Codex中 — 回退到子Agent
    test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
  2. 填充实验提示模板
  3. 将填充后的提示写入临时文件
  4. 通过Codex分发:
    bash
    cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
  5. 安全策略:使用用户选择(如果规格中未设置,每会话询问一次)

3.3 Collect and Persist Results

3.3 收集并持久化结果

Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.
For each completed experiment, immediately:
  1. Run measurement in the experiment's worktree:
    bash
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
    • If stability mode is
      repeat
      , run the measurement harness
      repeat_count
      times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
    • Use the aggregated metrics as the experiment's score; if variance exceeds
      noise_threshold
      , record that in learnings so the operator knows the result is noisy.
  2. Write crash-recovery marker — immediately after measurement, write
    result.yaml
    in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.
  3. Read raw JSON output from the measurement script
  4. Evaluate degenerate gates:
    • For each gate in
      metric.degenerate_gates
      , parse the operator and threshold
    • Compare the metric value against the threshold
    • If ANY gate fails: mark outcome as
      degenerate
      , skip judge evaluation, save money
  5. If gates pass AND primary type is
    judge
    :
    • Read the experiment's output (cluster assignments, search results, etc.)
    • Apply stratified sampling per
      metric.judge.stratification
      config (using
      sample_seed
      )
    • Group samples into batches of
      metric.judge.batch_size
    • Fill the judge prompt template (
      references/judge-prompt-template.md
      ) for each batch
    • Dispatch
      ceil(sample_size / batch_size)
      parallel judge sub-agents
    • Each sub-agent returns structured JSON scores
    • Aggregate scores: compute the configured primary judge field from
      metric.judge.scoring.primary
      (which should match
      metric.primary.name
      ) plus any
      scoring.secondary
      values
    • If
      singleton_sample > 0
      : also dispatch singleton evaluation sub-agents
  6. If gates pass AND primary type is
    hard
    :
    • Use the metric value directly from the measurement output
  7. IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to
    .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
    right now. Use the transitional outcome
    measured
    once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to
    kept
    ,
    reverted
    , or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.
  8. VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.
Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to
results.tsv
after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
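The degenerate-gate evaluation in step 4 reduces to a handful of comparisons. A sketch, assuming each gate is a string like "num_clusters >= 3" (metric name, operator, threshold separated by spaces); the actual gate format is defined by the spec schema:

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq, "!=": operator.ne}

def evaluate_gates(gates, metrics):
    """Return the list of failed gate strings; empty list means all pass."""
    failed = []
    for gate in gates:
        name, op, threshold = gate.split()
        if op not in OPS:
            raise ValueError(f"unknown operator in gate: {gate!r}")
        if not OPS[op](metrics[name], float(threshold)):
            failed.append(gate)
    return failed
```

If the returned list is non-empty, the experiment is marked degenerate and judge evaluation is skipped, which is exactly the cost-saving short-circuit described above.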
实验完成后立即处理 — 不要等待整个批次完成后再写入结果。
对于每个已完成的实验,立即执行以下操作:
  1. 在实验工作目录中运行测量
    bash
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
    • 如果稳定性模式为
      repeat
      ,在该工作目录中运行测量工具
      repeat_count
      次,并完全按照Phase 1的方式汇总结果,然后再评估门槛或对实验排名。
    • 使用汇总后的指标作为实验分数;如果方差超过
      noise_threshold
      ,将其记录到学习成果中,以便操作人员知道结果存在噪声。
  2. 写入崩溃恢复标记 — 测量完成后立即在实验工作目录中写入包含原始指标的
    result.yaml
    。这确保即使Agent在更新主日志前崩溃,测量结果也可恢复。
  3. 读取测量脚本的原始JSON输出
  4. 评估退化门槛
    • 对于
      metric.degenerate_gates
      中的每个门槛,解析运算符和阈值
    • 将指标值与阈值比较
    • 如果任何门槛未通过:标记结果为
      degenerate
      ,跳过judge评估,节省成本
  5. 如果门槛通过且主类型为
    judge
    • 读取实验的输出(聚类分配、搜索结果等)
    • 根据
      metric.judge.stratification
      配置应用分层抽样(使用
      sample_seed
    • 将样本分组为
      metric.judge.batch_size
      大小的批次
    • 为每个批次填充judge提示模板(
      references/judge-prompt-template.md
    • 调用
      ceil(sample_size / batch_size)
      个并行judge子Agent
    • 每个子Agent返回结构化JSON分数
    • 汇总分数:根据
      metric.judge.scoring.primary
      (应与
      metric.primary.name
      匹配)计算配置的主judge字段,以及任何
      scoring.secondary
    • 如果
      singleton_sample > 0
      :同时调用单例评估子Agent
  6. 如果门槛通过且主类型为
    hard
    • 直接使用测量输出中的指标值
  7. 立即追加到磁盘上的实验日志(CP-3) — 不要推迟到批量评估时。立即将实验条目(迭代次数、假设、结果、指标、学习成果)写入
    .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
    。当实验有有效指标但尚未与当前最优解比较时,使用过渡结果
    measured
    。在评估步骤中将结果更新为
    kept
    reverted
    或其他最终状态,但原始指标已存储在磁盘上,不会因上下文压缩丢失。
  8. 验证写入(CP-3验证) — 从磁盘重新读取实验日志,确认刚写入的条目存在。如果验证失败,重试写入。在确认条目存在前,不要进行下一个实验。
为什么要立即写入并验证? Agent的上下文窗口并非持久存储。在长时间运行中,上下文压缩、会话崩溃和重启是预期情况。如果结果仅存在于Agent内存中,将会丢失。Karpathy的autoresearch在每次实验后写入
results.tsv
— 本技能必须对实验日志执行相同操作。验证步骤可捕获否则会导致数据丢失的静默写入失败。

3.4 Evaluate Batch

3.4 评估批次

After all experiments in the batch have been measured:
  1. Rank experiments by primary metric improvement:
    • For hard metrics: compare to the current best using
      metric.primary.direction
      (
      maximize
      means higher is better,
      minimize
      means lower is better), and require the absolute improvement to exceed
      measurement.stability.noise_threshold
      before treating it as a real win
    • For judge metrics: compare the configured primary judge score (
      metric.judge.scoring.primary
      /
      metric.primary.name
      ) to the current best, and require it to exceed
      minimum_improvement
  2. Identify the best experiment that passes all gates and improves the primary metric
  3. If best improves on current best: KEEP
    • Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
    • Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
    • Merge the committed experiment branch into the optimization branch
    • Use the message
      optimize(<spec-name>): <hypothesis description>
      for the experiment commit
    • After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
    • This is now the new baseline for subsequent batches
  4. Check file-disjoint runners-up (up to
    max_runner_up_merges_per_batch
    ):
    • For each runner-up that also improved, check file-level disjointness with the kept experiment
    • File-level disjointness: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines.
    • If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
    • If combined measurement is strictly better: keep the cherry-pick (outcome:
      runner_up_kept
      ), then clean up that runner-up's experiment worktree and branch
    • Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome:
      runner_up_reverted
      ), then clean up the runner-up's experiment worktree and branch
    • Stop after first failed combination
  5. Handle deferred deps: experiments that need unapproved dependencies get outcome
    deferred_needs_approval
  6. Revert all others: cleanup worktrees, log as
    reverted
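The ranking comparison in step 1 can be sketched as one predicate plus a sort. A sketch for hard metrics; judge metrics would substitute minimum_improvement for the noise threshold:

```python
def is_real_win(candidate, current_best, direction, noise_threshold):
    """True if candidate beats current_best by more than the noise floor."""
    if direction == "maximize":
        delta = candidate - current_best
    elif direction == "minimize":
        delta = current_best - candidate
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return delta > noise_threshold

def rank_batch(results, current_best, direction, noise_threshold):
    """Keep only real wins, best first."""
    wins = [r for r in results
            if is_real_win(r["score"], current_best, direction, noise_threshold)]
    return sorted(wins, key=lambda r: r["score"],
                  reverse=(direction == "maximize"))
```

The head of the ranked list is the KEEP candidate; the rest are the runners-up considered for file-disjoint cherry-picks in step 4.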
批次中所有实验测量完成后:
  1. 按主指标改进程度排名
    • 对于硬指标:根据
      metric.primary.direction
      maximize
      表示越高越好,
      minimize
      表示越低越好)与当前最优解比较,要求绝对改进超过
      measurement.stability.noise_threshold
      才视为真正的提升
    • 对于judge指标:将配置的主judge分数(
      metric.judge.scoring.primary
      /
      metric.primary.name
      )与当前最优解比较,要求超过
      minimum_improvement
  2. 确定通过所有门槛且主指标有改进的最优实验
  3. 如果最优实验优于当前最优解:保留
    • 先提交实验分支,以便在合并或挑选前,获胜的差异作为真实提交存在
    • 提交中仅包含可变范围的更改;如果没有合格的差异,将实验视为无改进并回退
    • 将提交的实验分支合并到优化分支
    • 实验提交的消息使用
      optimize(<spec-name>): <假设描述>
    • 合并成功后,清理获胜实验的工作目录和分支;优化分支上的集成提交是持久化工件
    • 这成为后续批次的新基线
  4. 检查文件不相交的候选最优解(最多
    max_runner_up_merges_per_batch
    个):
    • 对于每个同样有改进的候选最优解,检查其与保留实验的文件级不相交性
    • 文件级不相交性:如果两个实验修改的文件完全不同,则它们是不相交的。即使修改的是同一文件的不同行,也视为重叠。
    • 如果不相交:将候选最优解挑选到新基线上,重新运行完整测量
    • 如果合并后的测量结果严格更优:保留挑选的更改(结果:
      runner_up_kept
      ),然后清理该候选最优解的工作目录和分支
    • 否则:回退挑选的更改,记录为“单独使用有前景但组合使用无影响/有害”(结果:
      runner_up_reverted
      ),然后清理该候选最优解的工作目录和分支
    • 首次组合失败后停止
  5. 处理延迟依赖:需要未批准依赖的实验结果标记为
    deferred_needs_approval
  6. 回退所有其他实验:清理工作目录,记录为
    reverted
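The file-level disjointness rule in step 4 reduces to a set intersection over modified file paths. A sketch; in practice each file list would come from running `git diff --name-only` on the experiment branch:

```python
def are_disjoint(files_a, files_b):
    """File-level disjointness: no file modified by both experiments.
    Same file counts as overlap even when different lines changed."""
    return not (set(files_a) & set(files_b))
```

Only runners-up that are disjoint from the kept experiment are eligible for cherry-picking.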

3.5 Update State (CP-4)

3.5 更新状态(CP-4)

MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.
  1. Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.
  2. Finalize outcomes — update experiment entries from step 3.4 evaluation (mark
    kept
    ,
    reverted
    ,
    runner_up_kept
    , etc.). Write these outcome updates to disk immediately.
  3. Update the
    best
    section
    in the experiment log if a new best was found. Write to disk.
  4. Write strategy digest to
    .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
    :
    • Categories tried so far (with success/failure counts)
    • Key learnings from this batch and overall
    • Exploration frontier: what categories and approaches remain untried
    • Current best metrics and improvement from baseline
  5. Generate new hypotheses based on learnings:
    • Re-read the strategy digest from disk (not from memory)
    • Read the rolling window (last 10 experiments from the log on disk)
    • Do NOT read the full experiment log -- use the digest for broad context
    • Add new hypotheses to the backlog and write the updated backlog to disk
  6. Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.
CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the
best
section reflects the current best, (c) the hypothesis backlog is updated. Read
strategy-digest.md
back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.
Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
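CP-4 verification can be expressed as assertions over the re-read log. A sketch over an already-parsed log dict (loading the YAML itself, e.g. via yaml.safe_load, is assumed); field names follow the experiment log examples in this document, and the outcome vocabulary is collected from the states named in Phase 3:

```python
TERMINAL_OUTCOMES = {"kept", "reverted", "runner_up_kept", "runner_up_reverted",
                     "degenerate", "error", "timeout", "deferred_needs_approval"}

def verify_cp4(log, batch_iterations):
    """Raise AssertionError if the batch state is not fully persisted."""
    by_iter = {e["iteration"]: e for e in log.get("experiments", [])}
    for it in batch_iterations:
        entry = by_iter.get(it)
        assert entry is not None, f"experiment {it} missing from log"
        # 'measured' is transitional; every batch entry must be finalized.
        assert entry["outcome"] in TERMINAL_OUTCOMES, \
            f"experiment {it} still has outcome {entry['outcome']!r}"
    assert "best" in log, "best section missing"
    assert "hypothesis_backlog" in log, "hypothesis backlog missing"
```

A raised AssertionError means the write must be retried before moving on to the next batch.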
强制检查点。此时,单个实验结果已写入磁盘(步骤3.3中完成)。此步骤更新聚合状态并验证。
  1. 从磁盘重新读取实验日志 — 不要信任内存中的状态。日志是可信数据源。
  2. 最终确定结果 — 根据步骤3.4的评估更新实验条目(标记
    kept
    reverted
    runner_up_kept
    等)。立即将这些结果更新写入磁盘。
  3. 如果找到新的最优解,更新实验日志中的
    best
    部分
    。写入磁盘。
  4. 将策略摘要写入
    .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
    • 迄今为止尝试的类别(包含成功/失败计数)
    • 此批次及整体的关键学习成果
    • 探索前沿:尚未尝试的类别和方法
    • 当前最优指标及与基线的改进
  5. 根据学习成果生成新假设
    • 从磁盘重新读取策略摘要(而非内存)
    • 读取滚动窗口(磁盘日志中的最近10次实验)
    • 不要读取完整实验日志 — 使用摘要获取广泛上下文
    • 将新假设添加到待办项,并将更新后的待办项写入磁盘
  6. 将更新后的假设待办项写入磁盘 — 实验日志的待办项部分必须反映新增的假设和已移除(已测试)的假设。
CP-4验证:从磁盘重新读取实验日志。确认:(a) 此批次的所有实验结果已最终确定,(b)
best
部分反映当前最优解,(c) 假设待办项已更新。重新读取
strategy-digest.md
并确认其存在。完成后再进入下一批次或检查停止条件。
检查点:此时,此批次的所有状态已存储在磁盘上。如果Agent崩溃并重启,可从实验日志恢复,无数据丢失。

3.6 Check Stopping Criteria

3.6 检查停止条件

Stop the loop if ANY of these are true:
  • Target reached:
    stopping.target_reached
    is true,
    metric.primary.target
    is set, and the primary metric reaches that target according to
    metric.primary.direction
    (
    >=
    for
    maximize
    ,
    <=
    for
    minimize
    )
  • Max iterations: total experiments run >=
    stopping.max_iterations
  • Max hours: wall-clock time since Phase 3 start >=
    stopping.max_hours
  • Judge budget exhausted: cumulative judge spend >=
    metric.judge.max_total_cost_usd
    (if set)
  • Plateau: no improvement for
    stopping.plateau_iterations
    consecutive experiments
  • Manual stop: user interrupts (save state and proceed to Phase 4)
  • Empty backlog: no hypotheses remain and no new ones can be generated
If no stopping criterion is met, proceed to the next batch (step 3.1).
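The checks above can be sketched as a single predicate returning the first matched reason. A sketch, assuming the spec and runtime state have been loaded into plain dicts whose keys mirror the spec fields above:

```python
def stopping_reason(spec, state):
    """Return the name of the first stopping criterion met, or None."""
    stop = spec["stopping"]
    primary = spec["metric"]["primary"]
    target = primary.get("target")
    if stop.get("target_reached") and target is not None:
        hit = (state["best_score"] >= target if primary["direction"] == "maximize"
               else state["best_score"] <= target)
        if hit:
            return "target_reached"
    if state["total_experiments"] >= stop["max_iterations"]:
        return "max_iterations"
    if state["elapsed_hours"] >= stop["max_hours"]:
        return "max_hours"
    judge_budget = spec["metric"].get("judge", {}).get("max_total_cost_usd")
    if judge_budget is not None and state["judge_cost_usd"] >= judge_budget:
        return "judge_budget"
    if state["iterations_without_improvement"] >= stop["plateau_iterations"]:
        return "plateau"
    if not state["backlog"] and not state["can_generate_hypotheses"]:
        return "empty_backlog"
    return None  # keep looping
```

Manual stop is not modeled here because it is an interrupt, not a state check.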
如果满足以下任一条件,停止循环:
  • 达到目标
    stopping.target_reached
    为true,
    metric.primary.target
    已设置,且主指标根据
    metric.primary.direction
    达到目标(
    maximize
    >=
    minimize
    <=
  • 最大迭代次数:已运行的实验总数 >=
    stopping.max_iterations
  • 最长运行时间:自Phase 3开始的挂钟时间 >=
    stopping.max_hours
  • Judge预算耗尽:累计judge支出 >=
    metric.judge.max_total_cost_usd
    (如果已设置)
  • 进入平台期:连续
    stopping.plateau_iterations
    次实验无改进
  • 手动停止:用户中断(保存状态并进入Phase 4)
  • 待办项为空:无剩余假设且无法生成新假设
如果未满足任何停止条件,进入下一批次(步骤3.1)。

3.7 Cross-Cutting Concerns

3.7 跨领域关注点

Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:
  • Log as outcome
    error
    or
    timeout
    with the error message
  • Revert the experiment (cleanup worktree)
  • The loop continues with remaining experiments in the batch
Progress reporting: After each batch, report:
  • Batch N of estimated M (based on backlog size)
  • Experiments run this batch and total
  • Current best metric and improvement from baseline
  • Cumulative judge cost (if applicable)
Crash recovery: See Persistence Discipline section. Per-experiment
result.yaml
markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any
result.yaml
markers not yet reflected in the log.
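The resume-time scan can be sketched as matching marker paths against iterations already in the log. A sketch; the exp-<NNN> naming follows the worktree layout created in step 3.2, and in practice the path list would come from a glob over optimize-exp/:

```python
import re

def orphaned_markers(marker_paths, logged_iterations):
    """Return result.yaml paths whose experiment is not yet in the log.
    Paths are expected to look like optimize-exp/<spec>/exp-<NNN>/result.yaml."""
    orphans = []
    for path in marker_paths:
        m = re.search(r"exp-(\d+)/result\.yaml$", path)
        if m and int(m.group(1)) not in logged_iterations:
            orphans.append(path)
    return orphans
```

Each orphan's metrics are re-read from its result.yaml and appended to the log before the loop resumes.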

Codex失败连锁反应:跟踪连续的Codex委托失败。连续失败3次后,自动为剩余实验禁用Codex,回退到子Agent分发。记录切换操作。
错误处理:如果实验的测量命令崩溃、超时或生成格式错误的输出:
  • 将结果记录为
    error
    timeout
    ,并包含错误消息
  • 回退实验(清理工作目录)
  • 循环继续处理批次中的剩余实验
进度报告:每个批次完成后,报告:
  • 第N批次,共预计M批次(基于待办项大小)
  • 此批次及总共运行的实验数
  • 当前最优指标及与基线的改进
  • 累计judge成本(如适用)
崩溃恢复:请查看持久化规范部分。步骤3.3中写入了每实验
result.yaml
标记。单个实验结果在步骤3.3中立即追加到日志中。批次级状态(结果、最优解、摘要)在步骤3.5中写入。恢复时(Phase 0.4),磁盘上的日志是真实状态 — 扫描所有尚未反映在日志中的
result.yaml
标记。

Phase 4: Wrap-Up

Phase 4: 收尾阶段

4.1 Present Deferred Hypotheses

4.1 展示延迟的假设

If any hypotheses were deferred due to unapproved dependencies:
  1. List them with their dependency requirements
  2. Ask the user whether to approve, skip, or save for a future run
  3. If approved: add to backlog and offer to re-enter Phase 3 for one more round
如果存在因未批准依赖而延迟的假设:
  1. 列出它们及其依赖要求
  2. 询问用户是否批准、跳过或保存供未来运行使用
  3. 如果批准:添加到待办项,并提供重新进入Phase 3进行额外一轮实验的选项

4.2 Summarize Results

4.2 总结结果

Present a comprehensive summary:
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
  Kept: <count> (including <runner_up_kept_count> runner-up merges)
  Reverted: <count>
  Degenerate: <count>
  Errors: <count>
  Deferred: <count>

Baseline -> Final:
  <primary_metric>: <baseline_value> -> <final_value> (<delta>)
  <gate_metrics>: ...
  <diagnostics>: ...

Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
  1. <kept experiment 1 hypothesis> (+<delta>)
  2. <kept experiment 2 hypothesis> (+<delta>)
  ...
展示全面总结:
优化任务:<spec-name>
持续时间:<挂钟时间>
总实验数:<数量>
  保留:<数量>(包含<runner_up_kept_count>个候选最优解合并)
  回退:<数量>
  退化:<数量>
  错误:<数量>
  延迟:<数量>

基线 -> 最终:
  <primary_metric>:<baseline_value> -> <final_value>(<delta>)
  <gate_metrics>:...
  <diagnostics>:...

Judge成本:$<total_judge_cost_usd>(如适用)

关键改进:
  1. <保留的实验1假设>(+<delta>)
  2. <保留的实验2假设>(+<delta>)
  ...

4.3 Preserve and Offer Next Steps

4.3 保留并提供后续步骤选项

The optimization branch (
optimize/<spec-name>
) is preserved with all commits from kept experiments. The experiment log and strategy digest remain in local
.context/...
scratch space for resume and audit on this machine only; they do not travel with the branch because
.context/
is gitignored.
Present post-completion options via the platform question tool:
  1. Run
    /ce:review
    on the cumulative diff (baseline to final). Load the
    ce:review
    skill with
    mode:autofix
    on the optimization branch.
  2. Run
    /ce:compound
    to document the winning strategy as an institutional learning.
  3. Create PR from the optimization branch to the default branch.
  4. Continue with more experiments: re-enter Phase 3 with the current state, re-reading it from disk first.
  5. Done -- leave the optimization branch for manual review.
优化分支(
optimize/<spec-name>
)会保留所有保留实验的提交。实验日志和策略摘要保留在本地
.context/...
临时空间中,仅可在此机器上恢复和审核;由于
.context/
被git忽略,它们不会随分支同步。
通过平台提问工具展示完成后的选项:
  1. 运行
    /ce:review
    检查累积差异(基线到最终版本)。在优化分支上以
    mode:autofix
    模式加载
    ce:review
    技能。
  2. 运行
    /ce:compound
    将获胜策略记录为机构学习成果。
  3. 从优化分支向默认分支创建PR。
  4. 继续进行更多实验:使用当前状态重新进入Phase 3。先重新读取状态。
  5. 完成 — 保留优化分支供人工审核。

4.4 Cleanup

4.4 清理

Clean up scratch space:
bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md

Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup.
Do NOT delete experiment worktrees that are still being referenced.
清理临时空间:
bash
# 保留实验日志以便在此机器上本地恢复/审核
# 删除临时批次工件
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md

如果用户可能在本地恢复或需要本地审核跟踪,不要删除实验日志。如果需要持久化的共享工件,在清理前将结果汇总或导出到受跟踪的路径。
不要删除仍被引用的实验工作目录。