Iterative Optimization Loop
Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.
Interaction Method
Use the platform's blocking question tool when available (in Claude Code, in Codex, in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.
Input
<optimization_input> #$ARGUMENTS </optimization_input>
If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."
Optimization Spec Schema
Reference the spec schema for validation:
references/optimize-spec-schema.yaml
Experiment Log Schema
Reference the experiment log schema for state management:
references/experiment-log-schema.yaml
Quick Start
For a first run, optimize for signal and safety, not maximum throughput:
- Start from references/example-hard-spec.yaml when the metric is objective and cheap to measure
- Use references/example-judge-spec.yaml only when actual quality requires semantic judgment
- Prefer and execution.max_concurrent: 1
- Cap the first run with stopping.max_iterations: 4 and
- Avoid new dependencies until the baseline and measurement harness are trusted
- For judge mode, start with , , and
For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see:
references/usage-guide.md
Persistence Discipline
CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.
The files under .context/compound-engineering/ce-optimize/<spec-name>/ are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.
This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.
If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.
Core Rules
- Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.
- VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.
- Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.
- The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the section in place only when a new best is found. This prevents data loss if a write is interrupted.
- Per-experiment result markers for crash recovery — each experiment writes a marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.
- Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.
- Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.
Mandatory Disk Checkpoints
These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.
| Checkpoint | File Written | Phase |
|---|---|---|
| CP-0: Spec saved | | Phase 0, after user approval |
| CP-1: Baseline recorded | (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | (hypothesis_backlog section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | (outcomes + best) + | Phase 3.5, after batch evaluation |
| CP-5: Final summary | (final state) | Phase 4, at wrap-up |
Format of a verification step:
- Write the file using the native file-write tool
- Read the file back using the native file-read tool
- Confirm the expected content is present
- If verification fails, retry the write. If it fails twice, alert the user.
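The write-then-verify-with-retry pattern above can be sketched in shell. The spec name ("demo"), the entry id ("exp-007"), and the "id:" field are hypothetical illustrations — the real entry shape comes from references/experiment-log-schema.yaml:

```bash
# Sketch of write -> read back -> confirm -> retry-once-then-alert.
# Paths and entry fields are illustrative assumptions.
LOG=".context/compound-engineering/ce-optimize/demo/experiment-log.yaml"
mkdir -p "$(dirname "$LOG")"

append_and_verify() {
  entry_id="$1"; entry_yaml="$2"
  for attempt in 1 2; do
    printf '%s\n' "$entry_yaml" >> "$LOG"     # write
    if grep -q "id: $entry_id" "$LOG"; then   # read back and confirm
      return 0
    fi
  done
  echo "ALERT: write verification failed twice for $entry_id" >&2
  return 1
}

append_and_verify "exp-007" "- id: exp-007
  outcome: measured"
```

The retry only fires if the read-back misses the entry, so a successful first write appends exactly once.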
File Locations (all under .context/compound-engineering/ce-optimize/<spec-name>/)
| File | Purpose | Written When |
|---|---|---|
| | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |
On Resume
When Phase 0.4 detects an existing run:
- Read the experiment log from disk — this is the ground truth
- Scan worktree directories for markers not yet in the log
- Recover any measured-but-unlogged experiments
- Continue from where the log left off
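The resume scan above can be sketched in shell. The marker filename ("result.json"), the worktree layout, and the spec name ("demo") are hypothetical assumptions for illustration — the real marker name is defined earlier in this skill:

```bash
# Sketch: find experiments that were measured (marker exists) but never
# made it into the experiment log. All names here are illustrative.
LOG=".context/compound-engineering/ce-optimize/demo/experiment-log.yaml"
mkdir -p "$(dirname "$LOG")" optimize-exp/demo/exp-003
echo "metrics: {}" > optimize-exp/demo/exp-003/result.json   # simulated marker
echo "experiments: []" > "$LOG"                              # log missing exp-003

recovered=""
for marker in optimize-exp/demo/exp-*/result.json; do
  [ -f "$marker" ] || continue
  exp_id=$(basename "$(dirname "$marker")")    # e.g. exp-003
  grep -q "id: $exp_id" "$LOG" || recovered="$recovered $exp_id"
done
echo "to recover:$recovered"    # -> to recover: exp-003
```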
Phase 0: Setup
0.1 Determine Input Type
Check whether the input is:
- A spec file path (ends in or ): read and validate it
- A description of the optimization goal: help the user create a spec interactively
0.2 Load or Create Spec
If spec file provided:
- Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
- Validate against references/optimize-spec-schema.yaml:
- All required fields present
- is lowercase kebab-case and safe to use in git refs / worktree paths
- is or
- If type is , section exists with and
- At least one degenerate gate defined
- is non-empty
- and each have at least one entry
- Gate check operators are valid (, , , , , )
- is at least 1
- does not exceed 6 when backend is
- If validation fails, report errors and ask the user to fix them
If description provided:
- Analyze the project to understand what can be measured
- Detect whether the optimization target is qualitative or quantitative — this determines vs and is the single most important spec decision:
- The metric is a scalar number with a clear "better" direction
- The metric is objectively measurable (build time, test pass rate, latency, memory usage)
- No human judgment is needed to evaluate "is this result actually good?"
- Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size
- The quality of the output requires semantic understanding to evaluate
- A human reviewer would need to look at the results to say "this is better"
- Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
- The optimization could produce degenerate solutions that look good on paper
- Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance
IMPORTANT: If the target is qualitative, strongly recommend . Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:
- Degenerate gates (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
- LLM-as-judge (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
- Diagnostics (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.
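The three-tier split above can be sketched as a spec fragment. The exact keys are illustrative assumptions — the authoritative field names live in references/optimize-spec-schema.yaml:

```yaml
metric:
  type: judge                # tier 2: the actual optimization target
gates:                       # tier 1: hard, cheap, fast — run before the judge
  - metric: solo_pct
    check: "<= 0.95"         # catches the all-singletons degenerate solution
  - metric: max_cluster_size
    check: "<= 500"          # catches one mega-cluster
diagnostics:                 # tier 3: logged, not gated
  - cluster_count
  - median_cluster_size
  - wall_clock_seconds
```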
If the user insists on for a qualitative target, proceed but warn that the results may optimize a misleading proxy.
- Design the sampling strategy (for ):
Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"
Walk through these questions:
- What does one "item" look like? (a cluster, a search result page, a summary, etc.)
- What are the natural size/quality strata? (e.g., large clusters vs small clusters vs singletons)
- Where are quality failures most likely? (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
- What total sample size balances cost vs signal? (default: 30 items, adjust based on output volume)
Example stratified sampling for clustering:
```yaml
stratification:
  - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
    count: 10
  - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
    count: 10
  - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
    count: 10
singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
```
The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".
Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.
- Design the rubric (for ):
Help the user define the scoring rubric. A good rubric:
- Has a 1-5 scale (or similar) with concrete descriptions for each level
- Includes supplementary fields that help diagnose issues (e.g., , )
- Is specific enough that two judges would give similar scores
- Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad
Example for clustering:
```yaml
rubric: |
  Rate this cluster 1-5:
  - 5: All items clearly about the same issue/feature
  - 4: Strong theme, minor outliers
  - 3: Related but covers 2-3 sub-topics that could reasonably be split
  - 2: Weak connection — items share superficial similarity only
  - 1: Unrelated items grouped together
  Also report: distinct_topics (integer), outlier_count (integer)
```
- Guide the user through the remaining spec fields:
- What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters)
- What command runs the measurement?
- What files can be modified? What is immutable?
- Any constraints or dependencies?
- If this is the first run: recommend , execution.max_concurrent: 1, stopping.max_iterations: 4, and
- If : recommend , , and until the rubric and harness are trusted
- Write the spec to .context/compound-engineering/ce-optimize/<spec-name>/spec.yaml
- Present the spec to the user for approval before proceeding
0.3 Search Prior Learnings
Dispatch compound-engineering:research:learnings-researcher to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.
0.4 Run Identity Detection
Check if the branch already exists:
```bash
git rev-parse --verify "optimize/<spec-name>" 2>/dev/null
```
If branch exists, check for an existing experiment log at .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml.
Present the user with a choice via the platform question tool:
- Resume: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for markers. Continue from the last iteration number in the log.
- Fresh start: archive the old branch to optimize-archive/<spec-name>/archived-<timestamp>, clear the experiment log, start from scratch
0.5 Create Optimization Branch and Scratch Space
```bash
git checkout -b "optimize/<spec-name>"  # or switch to existing if resuming
```
Create scratch directory:
```bash
mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/
```
Phase 1: Measurement Scaffolding
This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.
1.1 Clean-Tree Gate
Verify no uncommitted changes to files within or :
Filter the output against the scope paths. If any in-scope files have uncommitted changes:
- Report which files are dirty
- Ask the user to commit or stash before proceeding
- Do NOT continue until the working tree is clean for in-scope files
1.2 Build or Validate Measurement Harness
If user provides a measurement harness (the already exists):
- Run it once via the measurement script:
```bash
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
```
- Validate the JSON output:
- Contains keys for all degenerate gate metric names
- Contains keys for all diagnostic metric names
- Values are numeric or boolean as expected
- If validation fails, report what is missing and ask the user to fix the harness
If agent must build the harness:
- Analyze the codebase to understand the current approach and what should be measured
- Build an evaluation script (e.g., , , or equivalent)
- Add the evaluation script path to -- the experiment agent must not modify it
- Run it once and validate the output
- Present the harness and its output to the user for review
1.3 Establish Baseline
Run the measurement harness on the current code.
- Run the harness times
- Aggregate results using the configured aggregation method (median, mean, min, max)
- Calculate variance across runs
- If variance exceeds , warn the user and suggest increasing
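The repeated-run aggregation above can be sketched in shell. This assumes each run boils down to a single numeric value; the real harness emits JSON, so this is an illustration of the median step only:

```bash
# Sketch of median aggregation across repeated harness runs.
# The three values are hypothetical run results.
runs="12.1
11.8
12.4"
median=$(printf '%s\n' "$runs" | sort -n | awk '{v[NR]=$1} END{
  if (NR % 2) print v[(NR+1)/2];
  else print (v[NR/2] + v[NR/2+1]) / 2
}')
echo "median: $median"    # -> median: 12.1
```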
Record the baseline in the experiment log:
```yaml
baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...
```
If primary type is , also run the judge evaluation on baseline output to establish the starting judge score.
1.4 Parallelism Readiness Probe
Run the parallelism probe script:
```bash
bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>
```
Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.
1.5 Worktree Budget Check
Count existing worktrees:
```bash
bash scripts/experiment-worktree.sh count
```
If count + would exceed 12:
- Warn the user
- Suggest cleaning up existing worktrees or reducing
- Do NOT block -- the user may proceed at their own risk
1.6 Write Baseline to Disk (CP-1)
MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:
- Create the experiment log file at .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
- Include all required top-level sections from references/experiment-log-schema.yaml: , , , , , and
- Seed as an empty array and seed from the baseline snapshot (use , baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
- Optionally seed here as well so the log shape is stable before Phase 2 populates it
- Verify: read the file back and confirm the required sections are present and the baseline values match
- Only THEN present results to the user
1.7 User Approval Gate
Present to the user via the platform question tool:
- Baseline metrics: all gate values, diagnostic values, and judge scores (if applicable)
- Experiment log location: show the file path so the user knows where results are saved
- Parallel readiness: probe results, any blockers, mitigations applied
- Clean-tree status: confirmed clean
- Worktree budget: current count and projected usage
- Judge budget: estimated per-experiment judge cost and configured cap (or an explicit note that spend is uncapped)
Options:
- Proceed -- approve baseline and parallel config, move to Phase 2
- Adjust spec -- modify spec settings before proceeding
- Fix issues -- user needs to resolve blockers first
Do NOT proceed to Phase 2 until the user explicitly approves.
If primary type is and is null, call that out as uncapped spend and require explicit approval before proceeding.
State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.
Phase 2: Hypothesis Generation
2.1 Analyze Current Approach
Read the code within to understand:
- The current implementation approach
- Obvious improvement opportunities
- Constraints and dependencies between components
Optionally dispatch compound-engineering:research:repo-research-analyst for deeper codebase analysis if the scope is large or unfamiliar.
2.2 Generate Hypothesis List
Generate an initial set of hypotheses. Each hypothesis should have:
- Description: what to try
- Category: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
- Priority: high, medium, or low based on expected impact and feasibility
- Required dependencies: any new packages or tools needed
Include user-provided hypotheses if any were given as input.
Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.
2.3 Dependency Pre-Approval
Collect all unique new dependencies across all hypotheses.
If any hypotheses require new dependencies:
- Present the full dependency list to the user via the platform question tool
- Ask for bulk approval
- Mark each hypothesis's as or
Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.
2.4 Record Hypothesis Backlog (CP-2)
MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:
```yaml
hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]
```
Phase 3: Optimization Loop
This phase repeats in batches until a stopping criterion is met.
3.1 Batch Selection
Select hypotheses for this batch:
- Build a runnable backlog by excluding hypotheses with dep_status: needs_approval
- If is , force
- Otherwise, batch_size = min(runnable_backlog_size, execution.max_concurrent)
- Prefer diversity: select from different categories when possible
- Within a category, select by priority (high first)
If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up).
If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
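The batch-size rule above can be sketched in shell, with hypothetical counts standing in for the values read from the spec and backlog:

```bash
# Sketch of batch_size = min(runnable_backlog_size, execution.max_concurrent).
# Both counts are illustrative assumptions.
runnable_backlog_size=7
max_concurrent=3

if [ "$runnable_backlog_size" -lt "$max_concurrent" ]; then
  batch_size=$runnable_backlog_size
else
  batch_size=$max_concurrent
fi
echo "batch_size: $batch_size"    # -> batch_size: 3
```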
3.2 Dispatch Experiments
For each hypothesis in the batch, dispatch according to . In mode, run exactly one experiment to completion before selecting the next hypothesis. In mode, dispatch the full batch concurrently.
Worktree backend:
- Create experiment worktree:
```bash
WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # creates optimize-exp/<spec_name>/exp-<NNN>
```
- Apply port parameterization if configured (set env vars for the measurement script)
- Fill the experiment prompt template (references/experiment-prompt-template.md) with:
- Iteration number, spec name
- Hypothesis description and category
- Current best and baseline metrics
- Mutable and immutable scope
- Constraints and approved dependencies
- Rolling window of last 10 experiments (concise summaries)
- Dispatch a subagent with the filled prompt, working in the experiment worktree
Codex backend:
- Check environment guard -- do NOT delegate if already inside a Codex sandbox:
```bash
# If these exist, we're already in Codex -- fall back to subagent
test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
```
- Fill the experiment prompt template
- Write the filled prompt to a temp file
- Dispatch via Codex:
```bash
cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
```
- Security posture: use the user's selection (ask once per session if not set in spec)
3.3 Collect and Persist Results
Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.
For each completed experiment, immediately:
- Run measurement in the experiment's worktree:
```bash
bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>
```
  - If stability mode is , run the measurement harness times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
  - Use the aggregated metrics as the experiment's score; if variance exceeds , record that in learnings so the operator knows the result is noisy.
- Write crash-recovery marker — immediately after measurement, write in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.
- Read raw JSON output from the measurement script
- Evaluate degenerate gates:
  - For each gate in , parse the operator and threshold
  - Compare the metric value against the threshold
  - If ANY gate fails: mark outcome as , skip judge evaluation, save money
- If gates pass AND primary type is :
  - Read the experiment's output (cluster assignments, search results, etc.)
  - Apply stratified sampling per metric.judge.stratification config (using )
  - Group samples into batches of
  - Fill the judge prompt template (references/judge-prompt-template.md) for each batch
  - Dispatch ceil(sample_size / batch_size) parallel judge sub-agents
  - Each sub-agent returns structured JSON scores
  - Aggregate scores: compute the configured primary judge field from metric.judge.scoring.primary (which should match ) plus any values
  - If : also dispatch singleton evaluation sub-agents
- If gates pass AND primary type is :
  - Use the metric value directly from the measurement output
- IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml right now. Use the transitional outcome once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to , , or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.
- VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.
Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
3.4 Evaluate Batch
After all experiments in the batch have been measured:
- Rank experiments by primary metric improvement:
  - For hard metrics: compare to the current best using ( means higher is better, means lower is better), and require the absolute improvement to exceed measurement.stability.noise_threshold before treating it as a real win
  - For judge metrics: compare the configured primary judge score (metric.judge.scoring.primary / ) to the current best, and require it to exceed
- Identify the best experiment that passes all gates and improves the primary metric
- If best improves on current best: KEEP
  - Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
  - Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
  - Merge the committed experiment branch into the optimization branch
  - Use the message optimize(<spec-name>): <hypothesis description> for the experiment commit
  - After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
  - This is now the new baseline for subsequent batches
- Check file-disjoint runners-up (up to max_runner_up_merges_per_batch):
  - For each runner-up that also improved, check file-level disjointness with the kept experiment
  - File-level disjointness: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines.
  - If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
  - If combined measurement is strictly better: keep the cherry-pick (outcome: ), then clean up that runner-up's experiment worktree and branch
  - Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome: ), then clean up the runner-up's experiment worktree and branch
  - Stop after first failed combination
- Handle deferred deps: experiments that need unapproved dependencies get outcome
- Revert all others: cleanup worktrees, log as
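The file-level disjointness check above can be sketched in bash. The file lists are hypothetical; in the real loop they would come from something like `git diff --name-only` between the baseline and each experiment branch:

```bash
# Sketch: two experiments are disjoint iff their changed-file sets do not
# intersect. File names are illustrative assumptions.
kept_files="src/cluster.py
src/embed.py"
runner_up_files="src/preprocess.py
src/io.py"

overlap=$(comm -12 <(printf '%s\n' "$kept_files" | sort) \
                   <(printf '%s\n' "$runner_up_files" | sort))
if [ -z "$overlap" ]; then
  echo "disjoint — eligible for cherry-pick"
else
  echo "overlapping files: $overlap"
fi
```

Note that same-file changes count as overlapping even when the edited lines differ, so a plain set intersection on file paths is the right granularity.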
3.5 Update State (CP-4)
MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.
- Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.
- Finalize outcomes — update experiment entries from step 3.4 evaluation (mark , , , etc.). Write these outcome updates to disk immediately.
- Update the section in the experiment log if a new best was found. Write to disk.
- Write strategy digest to .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md:
  - Categories tried so far (with success/failure counts)
  - Key learnings from this batch and overall
  - Exploration frontier: what categories and approaches remain untried
  - Current best metrics and improvement from baseline
- Generate new hypotheses based on learnings:
  - Re-read the strategy digest from disk (not from memory)
  - Read the rolling window (last 10 experiments from the log on disk)
  - Do NOT read the full experiment log -- use the digest for broad context
  - Add new hypotheses to the backlog and write the updated backlog to disk
- Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.
CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the section reflects the current best, (c) the hypothesis backlog is updated. Read back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.
Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
3.6 Check Stopping Criteria
Stop the loop if ANY of these are true:
- Target reached: is true, is set, and the primary metric reaches that target according to ( for , for )
- Max iterations: total experiments run >=
- Max hours: wall-clock time since Phase 3 start >=
- Judge budget exhausted: cumulative judge spend >= metric.judge.max_total_cost_usd (if set)
- Plateau: no improvement for stopping.plateau_iterations consecutive experiments
- Manual stop: user interrupts (save state and proceed to Phase 4)
- Empty backlog: no hypotheses remain and no new ones can be generated
If no stopping criterion is met, proceed to the next batch (step 3.1).
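The direction-aware target check from the criteria above can be sketched in shell, with hypothetical values standing in for the current metric and configured target:

```bash
# Sketch of "target reached": the comparison flips with the optimization
# direction. Values and the direction name are illustrative assumptions.
current=0.82; target=0.80; direction="maximize"

reached=$(awk -v c="$current" -v t="$target" -v d="$direction" 'BEGIN {
  r = (d == "maximize") ? (c >= t) : (c <= t)
  print r
}')
if [ "$reached" -eq 1 ]; then
  echo "target reached — stop the loop"
fi
```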
3.7 Cross-Cutting Concerns
Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:
- Log as outcome or with the error message
- Revert the experiment (cleanup worktree)
- The loop continues with remaining experiments in the batch
Progress reporting: After each batch, report:
- Batch N of estimated M (based on backlog size)
- Experiments run this batch and total
- Current best metric and improvement from baseline
- Cumulative judge cost (if applicable)
Crash recovery: See Persistence Discipline section. Per-experiment markers are written in step 3.3. Individual experiment results are appended to the log immediately in step 3.3. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any markers not yet reflected in the log.
Phase 4: Wrap-Up
4.1 Present Deferred Hypotheses
If any hypotheses were deferred due to unapproved dependencies:
- List them with their dependency requirements
- Ask the user whether to approve, skip, or save for a future run
- If approved: add to backlog and offer to re-enter Phase 3 for one more round
4.2 Summarize Results
Present a comprehensive summary:
```
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
Kept: <count> (including <runner_up_kept_count> runner-up merges)
Reverted: <count>
Degenerate: <count>
Errors: <count>
Deferred: <count>

Baseline -> Final:
  <primary_metric>: <baseline_value> -> <final_value> (<delta>)
  <gate_metrics>: ...
  <diagnostics>: ...
Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
1. <kept experiment 1 hypothesis> (+<delta>)
2. <kept experiment 2 hypothesis> (+<delta>)
...
```
4.3 Preserve and Offer Next Steps
The optimization branch () is preserved with all commits from kept experiments.
The experiment log and strategy digest remain in local scratch space for resume and audit on this machine only; they do not travel with the branch because is gitignored.
Present post-completion options via the platform question tool:
- Run on the cumulative diff (baseline to final). Load the skill with on the optimization branch.
- Run to document the winning strategy as an institutional learning.
- Create PR from the optimization branch to the default branch.
- Continue with more experiments: re-enter Phase 3 with the current state. State re-read first.
- Done -- leave the optimization branch for manual review.
4.4 Cleanup
Clean up scratch space:
```bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
```
Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup.
Do NOT delete experiment worktrees that are still being referenced.