# ce-optimize: Iterative Optimization Loop

Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.

## Interaction Method

Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.

## Input

<optimization_input> #$ARGUMENTS </optimization_input>

If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."
## Optimization Spec Schema

Reference the spec schema for validation: `references/optimize-spec-schema.yaml`

## Experiment Log Schema

Reference the experiment log schema for state management: `references/experiment-log-schema.yaml`

## Quick Start
For a first run, optimize for signal and safety, not maximum throughput:

- Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
- Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
- Prefer `execution.mode: serial` and `execution.max_concurrent: 1`
- Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1`
- Avoid new dependencies until the baseline and measurement harness are trusted
- For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`
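In spec form, that first-run posture looks roughly like this (nesting is illustrative; `references/optimize-spec-schema.yaml` is the authority):

```yaml
execution:
  mode: serial
  max_concurrent: 1
stopping:
  max_iterations: 4
  max_hours: 1
# For judge mode, additionally cap sampling and spend:
#   sample_size: 10, batch_size: 5, max_total_cost_usd: 5
```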
For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see `references/usage-guide.md`.

## Persistence Discipline
CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.

The files under `.context/compound-engineering/ce-optimize/<spec-name>/` are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.

This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.

If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.
### Core Rules

- Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
- VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
- Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
- The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted.
- Per-experiment result markers for crash recovery — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
- Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
- Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
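A minimal sketch of the first two rules in shell (the log path and entry fields here are illustrative, not the real experiment-log schema):

```shell
#!/bin/sh
# Hypothetical sketch of rules 1 and 2: append an experiment entry the
# moment its metrics are known, then re-read the log from disk to confirm
# the write landed before touching the next experiment.
LOG="/tmp/ce-optimize-demo/experiment-log.yaml"
mkdir -p "$(dirname "$LOG")"
: > "$LOG"

append_experiment() {
  # $1: experiment id, $2: primary metric value (illustrative fields)
  printf '%s\n' "- id: $1" "  primary_metric: $2" >> "$LOG"
  # Verification: trust only what can be re-read from disk.
  grep -q "^- id: $1\$" "$LOG"
}

append_experiment exp-001 0.42 && echo "exp-001 persisted and verified"
```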
### Mandatory Disk Checkpoints

These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.

| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval |
| CP-1: Baseline recorded | `experiment-log.yaml` | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | `experiment-log.yaml` | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | `experiment-log.yaml` | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | `experiment-log.yaml` and the strategy digest | Phase 3.5, after batch evaluation |
| CP-5: Final summary | final summary file | Phase 4, at wrap-up |

Format of a verification step:

- Write the file using the native file-write tool
- Read the file back using the native file-read tool
- Confirm the expected content is present
- If verification fails, retry the write. If it fails twice, alert the user.
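The verification format can be sketched as a small helper (hypothetical, not one of the skill's scripts; in the real flow the native file tools play the role of `printf` and `grep`):

```shell
#!/bin/sh
# Write a checkpoint file, read it back, retry once on failure, and alert
# the user if it fails twice -- the shape described in the steps above.
write_checkpoint() {
  # $1: file path, $2: content, $3: marker expected on read-back
  for attempt in 1 2; do
    printf '%s\n' "$2" > "$1"
    if grep -qF -- "$3" "$1"; then
      return 0
    fi
  done
  echo "ALERT: checkpoint write failed twice: $1" >&2
  return 1
}

write_checkpoint /tmp/cp0-spec.yaml "name: demo-spec" "name: demo-spec" \
  && echo "CP-0 verified"
```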
### File Locations (all under `.context/compound-engineering/ce-optimize/<spec-name>/`)

| File | Purpose | Written When |
|---|---|---|
| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| strategy digest | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| `result.yaml` (per experiment worktree) | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |
### On Resume

When Phase 0.4 detects an existing run:

- Read the experiment log from disk — this is the ground truth
- Scan worktree directories for `result.yaml` markers not yet in the log
- Recover any measured-but-unlogged experiments
- Continue from where the log left off
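The marker scan can be as simple as a `find` over the experiment worktrees (the layout below simulates what `scripts/experiment-worktree.sh` creates as `optimize-exp/<spec_name>/exp-<NNN>`):

```shell
#!/bin/sh
# Simulated worktree layout standing in for optimize-exp/<spec_name>/exp-<NNN>.
mkdir -p /tmp/optimize-exp/demo-spec/exp-001
printf 'primary_metric: 0.42\n' > /tmp/optimize-exp/demo-spec/exp-001/result.yaml

# Scan for measured-but-possibly-unlogged results to reconcile with the log.
find /tmp/optimize-exp/demo-spec -name result.yaml -type f
```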
## Phase 0: Setup

### 0.1 Determine Input Type

Check whether the input is:

- A spec file path (ends in `.yaml` or `.yml`): read and validate it
- A description of the optimization goal: help the user create a spec interactively
### 0.2 Load or Create Spec

If spec file provided:

- Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
- Validate against `references/optimize-spec-schema.yaml`:
  - All required fields present
  - `name` is lowercase kebab-case and safe to use in git refs / worktree paths
  - `metric.primary.type` is `hard` or `judge`
  - If type is `judge`, `metric.judge` section exists with `rubric` and `scoring`
  - At least one degenerate gate defined
  - `measurement.command` is non-empty
  - `scope.mutable` and `scope.immutable` each have at least one entry
  - Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
  - `execution.max_concurrent` is at least 1
  - `execution.max_concurrent` does not exceed 6 when backend is `worktree`
- If validation fails, report errors and ask the user to fix them
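For orientation, a minimal `type: hard` spec that would satisfy these checks might look like the following sketch (field layout and the gate syntax are assumptions; `references/optimize-spec-schema.yaml` remains the authority):

```yaml
name: build-time-opt            # lowercase kebab-case, safe for git refs
metric:
  primary:
    type: hard
    name: build_seconds
  degenerate_gates:
    - check: "test_pass_rate >= 1.0"   # at least one gate is required
measurement:
  command: "python evaluate.py"
scope:
  mutable:
    - "src/build/"
  immutable:
    - "evaluate.py"
execution:
  mode: serial
  max_concurrent: 1
stopping:
  max_iterations: 4
  max_hours: 1
```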
If description provided:

- Analyze the project to understand what can be measured

- Detect whether the optimization target is qualitative or quantitative — this determines `type: hard` vs `type: judge` and is the single most important spec decision:

  Use `type: hard` when:
  - The metric is a scalar number with a clear "better" direction
  - The metric is objectively measurable (build time, test pass rate, latency, memory usage)
  - No human judgment is needed to evaluate "is this result actually good?"
  - Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size

  Use `type: judge` when:
  - The quality of the output requires semantic understanding to evaluate
  - A human reviewer would need to look at the results to say "this is better"
  - Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
  - The optimization could produce degenerate solutions that look good on paper
  - Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance

  IMPORTANT: If the target is qualitative, strongly recommend `type: judge`. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
  - Degenerate gates (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
  - LLM-as-judge (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
  - Diagnostics (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.

  If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy.

- Design the sampling strategy (for `type: judge`):

  Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"

  Walk through these questions:
  - What does one "item" look like? (a cluster, a search result page, a summary, etc.)
  - What are the natural size/quality strata? (e.g., large clusters vs small clusters vs singletons)
  - Where are quality failures most likely? (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
  - What total sample size balances cost vs signal? (default: 30 items, adjust based on output volume)

  Example stratified sampling for clustering:

  ```yaml
  stratification:
    - bucket: "top_by_size"    # largest clusters — check for degenerate mega-clusters
      count: 10
    - bucket: "mid_range"      # middle of non-solo cluster size range — representative quality
      count: 10
    - bucket: "small_clusters" # clusters with 2-3 items — check if connections are real
      count: 10
  singleton_sample: 15         # singletons — check for false negatives (items that should cluster)
  ```

  The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".

  Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.

- Design the rubric (for `type: judge`):

  Help the user define the scoring rubric. A good rubric:
  - Has a 1-5 scale (or similar) with concrete descriptions for each level
  - Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`)
  - Is specific enough that two judges would give similar scores
  - Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad

  Example for clustering:

  ```yaml
  rubric: |
    Rate this cluster 1-5:
    - 5: All items clearly about the same issue/feature
    - 4: Strong theme, minor outliers
    - 3: Related but covers 2-3 sub-topics that could reasonably be split
    - 2: Weak connection — items share superficial similarity only
    - 1: Unrelated items grouped together
    Also report: distinct_topics (integer), outlier_count (integer)
  ```

- Guide the user through the remaining spec fields:
  - What degenerate cases should be rejected? (gates — e.g., `solo_pct <= 0.95` catches all-singletons, `max_cluster_size <= 500` catches mega-clusters)
  - What command runs the measurement?
  - What files can be modified? What is immutable?
  - Any constraints or dependencies?
  - If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
  - If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted

- Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`

- Present the spec to the user for approval before proceeding
### 0.3 Search Prior Learnings

Dispatch `compound-engineering:research:learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.

### 0.4 Run Identity Detection
Check if branch `optimize/<spec-name>` already exists:

```bash
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
```

If branch exists, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`.

Present the user with a choice via the platform question tool:

- Resume: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log.
- Fresh start: archive the old branch to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch
### 0.5 Create Optimization Branch and Scratch Space

```bash
git checkout -b "optimize/<spec-name>"  # or switch to existing if resuming
```

Create scratch directory:

```bash
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
```

## Phase 1: Measurement Scaffolding
This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.
### 1.1 Clean-Tree Gate
Verify no uncommitted changes to files within `scope.mutable` or `scope.immutable`:

```bash
git status --porcelain
```

Filter the output against the scope paths. If any in-scope files have uncommitted changes:

- Report which files are dirty
- Ask the user to commit or stash before proceeding
- Do NOT continue until the working tree is clean for in-scope files
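One way to filter porcelain output down to in-scope paths (an illustrative helper, not one of the skill's scripts; renamed entries and other exotic porcelain formats are out of scope for the sketch):

```shell
#!/bin/sh
# Keep only `git status --porcelain` lines whose path falls under one of
# the given scope prefixes. Porcelain lines look like "XY <path>", with
# the path starting at column 4.
dirty_in_scope() {
  porcelain="$1"; shift
  for prefix in "$@"; do
    printf '%s\n' "$porcelain" | awk -v p="$prefix" 'substr($0, 4, length(p)) == p'
  done
}

status=' M src/cluster.py
?? docs/notes.md
 M src/embed.py'

dirty_in_scope "$status" "src/"
```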
### 1.2 Build or Validate Measurement Harness

If user provides a measurement harness (the `measurement.command` already exists):

- Run it once via the measurement script:

  ```bash
  bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
  ```

- Validate the JSON output:
  - Contains keys for all degenerate gate metric names
  - Contains keys for all diagnostic metric names
  - Values are numeric or boolean as expected
- If validation fails, report what is missing and ask the user to fix the harness

If agent must build the harness:

- Analyze the codebase to understand the current approach and what should be measured
- Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent)
- Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it
- Run it once and validate the output
- Present the harness and its output to the user for review
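The key-presence part of that validation can be sketched as follows (key names are made-up examples; a real check should parse the JSON properly, e.g. with `jq`, rather than grep for quoted names):

```shell
#!/bin/sh
# Crude presence check: confirm the harness's JSON output mentions every
# required gate and diagnostic metric name before trusting it in the loop.
OUTPUT='{"solo_pct": 0.31, "max_cluster_size": 42, "runtime_seconds": 3.9}'
missing=0
for key in solo_pct max_cluster_size runtime_seconds; do
  if ! printf '%s' "$OUTPUT" | grep -q "\"$key\""; then
    echo "missing metric key: $key"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "harness output looks complete"
```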
### 1.3 Establish Baseline

Run the measurement harness on the current code.

If stability mode is `repeat`:

- Run the harness `repeat_count` times
- Aggregate results using the configured aggregation method (median, mean, min, max)
- Calculate variance across runs
- If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`

Record the baseline in the experiment log:

```yaml
baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...
```

If primary type is `judge`, also run the judge evaluation on baseline output to establish the starting judge score.

### 1.4 Parallelism Readiness Probe
Run the parallelism probe script:

```bash
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
```

Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.
### 1.5 Worktree Budget Check

Count existing worktrees:

```bash
bash scripts/experiment-worktree.sh count
```

If count + `execution.max_concurrent` would exceed 12:

- Warn the user
- Suggest cleaning up existing worktrees or reducing `max_concurrent`
- Do NOT block -- the user may proceed at their own risk
### 1.6 Write Baseline to Disk (CP-1)

MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:

- Create the experiment log file at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`
- Include all required top-level sections from `references/experiment-log-schema.yaml`: `spec`, `run_id`, `started_at`, `baseline`, `experiments`, and `best`
- Seed `experiments` as an empty array and seed `best` from the baseline snapshot (use `iteration: 0`, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
- Optionally seed `hypothesis_backlog: []` here as well so the log shape is stable before Phase 2 populates it
- Verify: read the file back and confirm the required sections are present and the baseline values match
- Only THEN present results to the user
### 1.7 User Approval Gate

Present to the user via the platform question tool:

- Baseline metrics: all gate values, diagnostic values, and judge scores (if applicable)
- Experiment log location: show the file path so the user knows where results are saved
- Parallel readiness: probe results, any blockers, mitigations applied
- Clean-tree status: confirmed clean
- Worktree budget: current count and projected usage
- Judge budget: estimated per-experiment judge cost and configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped)

Options:

- Proceed -- approve baseline and parallel config, move to Phase 2
- Adjust spec -- modify spec settings before proceeding
- Fix issues -- user needs to resolve blockers first

Do NOT proceed to Phase 2 until the user explicitly approves.

If primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding.

State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.
## Phase 2: Hypothesis Generation

### 2.1 Analyze Current Approach
Read the code within `scope.mutable` to understand:

- The current implementation approach
- Obvious improvement opportunities
- Constraints and dependencies between components

Optionally dispatch `compound-engineering:research:repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar.

### 2.2 Generate Hypothesis List
Generate an initial set of hypotheses. Each hypothesis should have:

- Description: what to try
- Category: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
- Priority: high, medium, or low based on expected impact and feasibility
- Required dependencies: any new packages or tools needed

Include user-provided hypotheses if any were given as input.

Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.
### 2.3 Dependency Pre-Approval

Collect all unique new dependencies across all hypotheses.

If any hypotheses require new dependencies:

- Present the full dependency list to the user via the platform question tool
- Ask for bulk approval
- Mark each hypothesis's `dep_status` as `approved` or `needs_approval`

Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.
### 2.4 Record Hypothesis Backlog (CP-2)

MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:

```yaml
hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]
```

## Phase 3: Optimization Loop

This phase repeats in batches until a stopping criterion is met.
### 3.1 Batch Selection

Select hypotheses for this batch:

- Build a runnable backlog by excluding hypotheses with `dep_status: needs_approval`
- If `execution.mode` is `serial`, force `batch_size = 1`
- Otherwise, `batch_size = min(runnable_backlog_size, execution.max_concurrent)`
- Prefer diversity: select from different categories when possible
- Within a category, select by priority (high first)

If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up).

If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
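The selection rule reduces to a small computation (values below are illustrative):

```shell
#!/bin/sh
# batch_size per the rules above: 1 in serial mode, otherwise the smaller
# of the runnable backlog and execution.max_concurrent.
mode="parallel"
runnable_backlog_size=7
max_concurrent=3

if [ "$mode" = "serial" ]; then
  batch_size=1
elif [ "$runnable_backlog_size" -lt "$max_concurrent" ]; then
  batch_size=$runnable_backlog_size
else
  batch_size=$max_concurrent
fi
echo "batch_size=$batch_size"
```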
### 3.2 Dispatch Experiments

For each hypothesis in the batch, dispatch according to `execution.mode`. In `serial` mode, run exactly one experiment to completion before selecting the next hypothesis. In `parallel` mode, dispatch the full batch concurrently.

Worktree backend:

- Create experiment worktree:

  ```bash
  WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)
  # creates optimize-exp/<spec_name>/exp-<NNN>
  ```

- Apply port parameterization if configured (set env vars for the measurement script)
- Fill the experiment prompt template (`references/experiment-prompt-template.md`) with:
  - Iteration number, spec name
  - Hypothesis description and category
  - Current best and baseline metrics
  - Mutable and immutable scope
  - Constraints and approved dependencies
  - Rolling window of last 10 experiments (concise summaries)
- Dispatch a subagent with the filled prompt, working in the experiment worktree

Codex backend:

- Check environment guard -- do NOT delegate if already inside a Codex sandbox:

  ```bash
  # If these exist, we're already in Codex -- fall back to subagent
  test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
  ```

- Fill the experiment prompt template
- Write the filled prompt to a temp file
- Dispatch via Codex:

  ```bash
  cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
  ```

- Security posture: use the user's selection (ask once per session if not set in spec)
3.3 Collect and Persist Results
3.3 收集并持久化结果
Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.
For each completed experiment, immediately:

- Run measurement in the experiment's worktree:

  ```bash
  bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
  ```

  - If stability mode is `repeat`, run the measurement harness `repeat_count` times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
  - Use the aggregated metrics as the experiment's score; if variance exceeds `noise_threshold`, record that in learnings so the operator knows the result is noisy.
- Write crash-recovery marker — immediately after measurement, write `result.yaml` in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.
- Read raw JSON output from the measurement script
- Evaluate degenerate gates:
  - For each gate in `metric.degenerate_gates`, parse the operator and threshold
  - Compare the metric value against the threshold
  - If ANY gate fails: mark outcome as `degenerate`, skip judge evaluation, save money
- If gates pass AND primary type is `judge`:
  - Read the experiment's output (cluster assignments, search results, etc.)
  - Apply stratified sampling per `metric.judge.stratification` config (using `sample_seed`)
  - Group samples into batches of `metric.judge.batch_size`
  - Fill the judge prompt template (`references/judge-prompt-template.md`) for each batch
  - Dispatch `ceil(sample_size / batch_size)` parallel judge sub-agents
  - Each sub-agent returns structured JSON scores
  - Aggregate scores: compute the configured primary judge field from `metric.judge.scoring.primary` (which should match `metric.primary.name`) plus any `scoring.secondary` values
  - If `singleton_sample > 0`: also dispatch singleton evaluation sub-agents
- If gates pass AND primary type is `hard`:
  - Use the metric value directly from the measurement output
- IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml` right now. Use the transitional outcome `measured` once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to `kept`, `reverted`, or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.
- VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.

Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
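The gate evaluation step can be sketched as a tiny shell helper. This is an illustrative sketch, not the skill's actual implementation; the gate string and measured value are hypothetical:

```bash
# Hypothetical gate of the form "<metric> <op> <threshold>", as in metric.degenerate_gates
gate="duplicate_rate <= 0.05"
value=0.12   # measured value for that metric (illustrative)

# Parse the three fields, then compare using the stated operator
read -r name op threshold <<< "$gate"
pass=$(awk -v v="$value" -v t="$threshold" -v op="$op" 'BEGIN {
  if (op == "<=")      ok = (v <= t)
  else if (op == ">=") ok = (v >= t)
  else                 ok = 0        # unknown operator: fail closed
  if (ok) print "pass"; else print "fail"
}')
echo "$name: $pass"   # any "fail" marks the experiment outcome as degenerate
```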
3.4 Evaluate Batch
After all experiments in the batch have been measured:

- Rank experiments by primary metric improvement:
  - For hard metrics: compare to the current best using `metric.primary.direction` (`maximize` means higher is better, `minimize` means lower is better), and require the absolute improvement to exceed `measurement.stability.noise_threshold` before treating it as a real win
  - For judge metrics: compare the configured primary judge score (`metric.judge.scoring.primary` / `metric.primary.name`) to the current best, and require it to exceed `minimum_improvement`
- Identify the best experiment that passes all gates and improves the primary metric
- If best improves on current best: KEEP
  - Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
  - Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
  - Merge the committed experiment branch into the optimization branch
  - Use the message `optimize(<spec-name>): <hypothesis description>` for the experiment commit
  - After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
  - This is now the new baseline for subsequent batches
- Check file-disjoint runners-up (up to `max_runner_up_merges_per_batch`):
  - For each runner-up that also improved, check file-level disjointness with the kept experiment
  - File-level disjointness: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines.
  - If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
  - If combined measurement is strictly better: keep the cherry-pick (outcome: `runner_up_kept`), then clean up that runner-up's experiment worktree and branch
  - Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome: `runner_up_reverted`), then clean up the runner-up's experiment worktree and branch
  - Stop after first failed combination
- Handle deferred deps: experiments that need unapproved dependencies get outcome `deferred_needs_approval`
- Revert all others: cleanup worktrees, log as `reverted`
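The file-level disjointness check can be done with sorted changed-file lists and `comm`. A minimal sketch with hypothetical file names; in the real loop the two lists would come from `git diff --name-only` against the baseline:

```bash
# Changed files for the kept experiment and a runner-up (illustrative lists)
kept_files=$(printf '%s\n' src/rank.py src/index.py | sort)
runner_files=$(printf '%s\n' docs/tuning.md | sort)

# comm -12 prints lines common to both sorted inputs; empty output means disjoint
overlap=$(comm -12 <(printf '%s\n' "$kept_files") <(printf '%s\n' "$runner_files"))
if [ -z "$overlap" ]; then
  verdict=disjoint      # safe to cherry-pick onto the new baseline
else
  verdict=overlapping   # skip: same file touched, even if on different lines
fi
echo "$verdict"
```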
3.5 Update State (CP-4)
MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.

- Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.
- Finalize outcomes — update experiment entries from step 3.4 evaluation (mark `kept`, `reverted`, `runner_up_kept`, etc.). Write these outcome updates to disk immediately.
- Update the `best` section in the experiment log if a new best was found. Write to disk.
- Write strategy digest to `.context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md`:
  - Categories tried so far (with success/failure counts)
  - Key learnings from this batch and overall
  - Exploration frontier: what categories and approaches remain untried
  - Current best metrics and improvement from baseline
- Generate new hypotheses based on learnings:
  - Re-read the strategy digest from disk (not from memory)
  - Read the rolling window (last 10 experiments from the log on disk)
  - Do NOT read the full experiment log -- use the digest for broad context
  - Add new hypotheses to the backlog and write the updated backlog to disk
- Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.

CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the `best` section reflects the current best, (c) the hypothesis backlog is updated. Read `strategy-digest.md` back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.

Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
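The digest itself is free-form markdown. A hypothetical skeleton covering the four bullets above (all names and numbers are invented for illustration):

```markdown
# Strategy Digest: <spec-name>

## Categories tried
- caching: 3 kept / 1 reverted
- prompt-tuning: 0 kept / 4 reverted

## Key learnings
- Batch-size changes dominate latency; prompt tweaks were noise-level.

## Exploration frontier
- Index-structure changes untried; parallel I/O untried.

## Current best
- p50_latency_ms: 412 (baseline 540)
```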
3.6 Check Stopping Criteria
Stop the loop if ANY of these are true:

- Target reached: `stopping.target_reached` is true, `metric.primary.target` is set, and the primary metric reaches that target according to `metric.primary.direction` (`>=` for `maximize`, `<=` for `minimize`)
- Max iterations: total experiments run >= `stopping.max_iterations`
- Max hours: wall-clock time since Phase 3 start >= `stopping.max_hours`
- Judge budget exhausted: cumulative judge spend >= `metric.judge.max_total_cost_usd` (if set)
- Plateau: no improvement for `stopping.plateau_iterations` consecutive experiments
- Manual stop: user interrupts (save state and proceed to Phase 4)
- Empty backlog: no hypotheses remain and no new ones can be generated

If no stopping criterion is met, proceed to the next batch (step 3.1).
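The direction-aware target check can be sketched as follows; the direction, target, and current value are illustrative stand-ins for the spec fields named above:

```bash
direction=maximize   # metric.primary.direction
target=0.90          # metric.primary.target
current=0.92         # primary metric after the latest batch (illustrative)

# >= for maximize, <= for minimize
reached=$(awk -v c="$current" -v t="$target" -v d="$direction" 'BEGIN {
  if (d == "maximize") { if (c >= t) print 1; else print 0 }
  else                 { if (c <= t) print 1; else print 0 }
}')
echo "target_reached=$reached"
```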
3.7 Cross-Cutting Concerns
Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:
- Log as outcome `error` or `timeout` with the error message
- The loop continues with remaining experiments in the batch
Progress reporting: After each batch, report:
- Batch N of estimated M (based on backlog size)
- Experiments run this batch and total
- Current best metric and improvement from baseline
- Cumulative judge cost (if applicable)
Crash recovery: See Persistence Discipline section. Per-experiment `result.yaml` markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any `result.yaml` markers not yet reflected in the log.
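The Codex failure cascade amounts to a small circuit breaker. A sketch with an invented sequence of dispatch outcomes, assuming a success resets the consecutive-failure streak:

```bash
codex_enabled=1
consecutive_failures=0

for outcome in ok fail fail fail; do   # illustrative dispatch results
  if [ "$outcome" = fail ]; then
    consecutive_failures=$((consecutive_failures + 1))
  else
    consecutive_failures=0             # a success resets the streak
  fi
  if [ "$consecutive_failures" -ge 3 ]; then
    codex_enabled=0                    # auto-disable; fall back to subagent dispatch
  fi
done
echo "codex_enabled=$codex_enabled"
```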
result.yamlresult.yamlPhase 4: Wrap-Up
4.1 Present Deferred Hypotheses
If any hypotheses were deferred due to unapproved dependencies:
- List them with their dependency requirements
- Ask the user whether to approve, skip, or save for a future run
- If approved: add to backlog and offer to re-enter Phase 3 for one more round
4.2 Summarize Results
Present a comprehensive summary:

```
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
Kept: <count> (including <runner_up_kept_count> runner-up merges)
Reverted: <count>
Degenerate: <count>
Errors: <count>
Deferred: <count>

Baseline -> Final:
<primary_metric>: <baseline_value> -> <final_value> (<delta>)
<gate_metrics>: ...
<diagnostics>: ...

Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
1. <kept experiment 1 hypothesis> (+<delta>)
2. <kept experiment 2 hypothesis> (+<delta>)
...
```
4.3 Preserve and Offer Next Steps
The optimization branch (`optimize/<spec-name>`) is preserved with all commits from kept experiments.
The experiment log and strategy digest remain in `.context/...` local scratch space for resume and audit on this machine only; they do not travel with the branch because `.context/` is gitignored.

Present post-completion options via the platform question tool:

- Run `/ce:review` on the cumulative diff (baseline to final). Load the `ce:review` skill with `mode:autofix` on the optimization branch.
- Run `/ce:compound` to document the winning strategy as an institutional learning.
- Create PR from the optimization branch to the default branch.
- Continue with more experiments: re-enter Phase 3 with the current state. State is re-read from disk first.
- Done -- leave the optimization branch for manual review.
4.4 Cleanup
Clean up scratch space:

```bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
```

Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup.
Do NOT delete experiment worktrees that are still being referenced.