experiment-pipeline

Experiment Pipeline
A structured 4-stage framework for executing research experiments from initial implementation through ablation study, with attempt budgets and gate conditions that prevent wasted effort. This follows the Experiment Tree Search design from the EvoScientist paper, where the engineer agent iteratively generates executable code, runs experiments, and records structured execution results at each stage.
When to Use This Skill
- User has a planned experiment and needs to organize the execution workflow
- User wants to systematically validate a novel method against baselines
- User asks about experiment stages, attempt budgets, or when to move on
- User needs to reproduce baseline results before testing their method
- User mentions "experiment pipeline", "baseline first", "ablation study", "stage budget", "experiment execution"
The Pipeline Mindset
Experiments fail for two reasons: wrong order and no stopping criteria. Most researchers jump straight to testing their novel method without verifying their baseline setup, then wonder why results don't make sense. Others spend weeks tuning hyperparameters without a budget, hoping the next run will work.
The 4-stage pipeline solves both problems. It enforces a strict order (each stage validates assumptions the next stage depends on) and assigns attempt budgets (forcing systematic thinking over brute-force iteration).
Before Starting: Load Prior Knowledge
If coming from idea-tournament, your research proposal (Phase 4) provides the experiment plan — datasets, baselines, metrics, and ablation design — that maps directly to Stages 1-4 below.
Before entering the pipeline, load Experimentation Memory (M_E) from prior cycles:
- Refer to the evo-memory skill → Read M_E at /memory/experiment-memory.md
- Select the top-1 entry (k_E=1) most relevant to the current experiment domain by comparing each entry's Context and Category against the current problem
- The selected strategy informs hyperparameter ranges (Stage 2), debugging approaches (Stages 1-3), and training configurations across all stages
- If M_E doesn't exist yet (first cycle), skip this step and proceed — your results will seed M_E via ESE after pipeline completion
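The selection step above can be sketched as follows. This assumes each M_E entry has already been parsed into a dict with "context" and "category" fields, and uses a naive keyword-overlap heuristic as the relevance measure — the source specifies only k_E=1, not the scoring method:

```python
def select_top_entry(entries, current_context, current_category):
    """Pick the single most relevant M_E entry (k_E=1) by comparing each
    entry's Context and Category against the current problem."""
    def score(entry):
        # Crude relevance proxy: shared words in Context, bonus for Category match.
        ctx_overlap = len(set(entry["context"].lower().split())
                          & set(current_context.lower().split()))
        cat_match = 2 if entry["category"] == current_category else 0
        return ctx_overlap + cat_match
    # First cycle: M_E is empty, so there is nothing to load.
    return max(entries, key=score) if entries else None
```

Any richer similarity measure (embeddings, tags) slots into `score` without changing the k_E=1 selection.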
4-Stage Pipeline Overview
Each stage follows a generate → execute → record → diagnose → revise loop:
| Stage | Goal | Budget (N_E^s) | Gate Condition |
|---|---|---|---|
| 1. Initial Implementation | Get baseline code running and reproduce known results | ≤20 attempts | Metrics within 2% of reported values (or within reported variance) |
| 2. Hyperparameter Tuning | Optimize config for your setup | ≤12 attempts | Stable config, variance < 5% across 3 runs |
| 3. Proposed Method | Implement & validate novel method | ≤12 attempts | Outperforms tuned baseline on primary metric, consistent across 3 runs |
| 4. Ablation Study | Prove each component's contribution | ≤18 attempts | All claims evidenced with controlled experiments |
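The budgets and gates above can be transcribed into a small configuration sketch (the dict layout and helper are assumptions; the numbers and gate wording come from the table):

```python
# Attempt budgets (N_E^s) and gate conditions per stage, from the table above.
PIPELINE = {
    1: {"name": "initial_implementation", "budget": 20,
        "gate": "metrics within 2% of reported values"},
    2: {"name": "hyperparameter_tuning", "budget": 12,
        "gate": "stable config, variance < 5% across 3 runs"},
    3: {"name": "proposed_method", "budget": 12,
        "gate": "outperforms tuned baseline, consistent across 3 runs"},
    4: {"name": "ablation_study", "budget": 18,
        "gate": "all claims evidenced with controlled experiments"},
}

def budget_remaining(stage, attempts_used):
    """How many attempts are left in this stage's budget."""
    return max(0, PIPELINE[stage]["budget"] - attempts_used)
```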
Each stage saves artifacts to /experiments/stageN_name/.
The Stage Loop
Within every stage, repeat this cycle for each attempt:
- Generate: Form a hypothesis or plan for this attempt. What specifically will you try? What do you expect to happen?
- Execute: Run the experiment. Record exact configuration, code changes, and runtime.
- Record: Log results immediately using the stage log template. Include both metrics and observations.
- Diagnose: Compare results to expectations. If they match, assess the gate condition. If they don't, load experiment-craft for the 5-step diagnostic flow.
- Revise: Based on diagnosis, either advance to the next stage (gate met) or plan the next attempt (gate not met).
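The generate → execute → record → diagnose → revise cycle can be sketched as a single loop. The callables here (`generate`, `execute`, `gate_met`, `diagnose`, `log`) are placeholders for your experiment code and the experiment-craft diagnostic flow:

```python
def run_stage(budget, generate, execute, gate_met, diagnose, log):
    """One pipeline stage: loop attempts until the gate condition is met
    or the attempt budget (N_E^s) is exhausted."""
    prescription = None
    for attempt in range(1, budget + 1):
        plan = generate(attempt, prescription)   # Generate: hypothesis/plan
        result = execute(plan)                   # Execute: run, capture config
        log(attempt, plan, result)               # Record: log immediately
        if gate_met(result):                     # Diagnose: gate assessment
            return {"passed": True, "attempts": attempt}
        prescription = diagnose(result)          # Revise: plan next attempt
    return {"passed": False, "attempts": budget}  # budget exhausted: escalate
```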
Stage 1: Initial Implementation
Goal: Find or generate executable baseline code and verify it reproduces published results. This stage corresponds to the paper's "initial implementation" — the engineer agent searches for working code, runs it, and records structured execution results.
Why this matters: If you can't get the baseline running and reproducing known results, every subsequent comparison is meaningless. Initial implementation validates your data pipeline, evaluation code, training infrastructure, and understanding of prior work.
Budget: ≤20 attempts (N_E^1=20). Baselines can be tricky — missing details in papers, version mismatches, unreported preprocessing steps. 20 attempts gives enough room to debug without allowing infinite tinkering.
Gate: Primary metrics within 2% of reported values (or within the reported variance if provided).
Process:
- Find the original baseline code (official repo, re-implementations, or write from paper description)
- Get the code running in your environment — resolve dependencies, fix compatibility issues
- Match the exact training configuration from the paper (dataset splits, preprocessing, hyperparameters)
- Run and compare metrics. If off by >2%, diagnose the gap
- Common pitfalls: different random seeds, different data splits, unreported data augmentation, framework version differences
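The Stage 1 gate can be checked mechanically. A minimal sketch, assuming a relative-difference reading of "within 2%" and falling back to the reported standard deviation when the paper provides one:

```python
def stage1_gate(measured, reported, reported_std=None):
    """Stage 1 gate: within 2% of the reported value, or within the
    reported variance when the paper provides one."""
    if reported_std is not None:
        return abs(measured - reported) <= reported_std
    return abs(measured - reported) / abs(reported) <= 0.02
```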
When to load experiment-craft: If attempts 1-5 all fail significantly (>10% gap), switch to the 5-step diagnostic flow to isolate the cause before burning more attempts.
Output: /experiments/stage1_baseline/ containing results, config, and verified baseline code.
See references/stage-protocols.md for detailed initial implementation checklists.
Stage 2: Hyperparameter Tuning
Goal: Find the optimal hyperparameter configuration for YOUR specific setup.
Why this matters: Published hyperparameters are tuned for the authors' setup. Your hardware, data version, framework version, or subtle implementation differences mean their config may not be optimal for you. Tuning now prevents confounding your novel method's results with suboptimal baselines.
Budget: ≤12 attempts. Hyperparameter tuning has diminishing returns. If 12 structured attempts don't find a stable config, the problem is likely deeper than hyperparameters.
Gate: Stable configuration found — variance < 5% across 3 independent runs with different random seeds.
Process:
- Identify the most sensitive hyperparameters (usually: learning rate, batch size, loss weights)
- Start with coarse search on the most sensitive parameter
- Narrow the range based on results, then move to the next parameter
- Validate final config with 3 independent runs
Priority order for tuning: Learning rate → batch size → loss weights → regularization → architecture-specific params. This order reflects typical sensitivity.
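The stability gate can be sketched as a check on the relative spread of the primary metric across seeded runs. Interpreting "variance < 5%" as a coefficient-of-variation threshold is an assumption; the source does not pin down the exact statistic:

```python
import statistics

def stage2_gate(run_metrics, threshold=0.05):
    """Stage 2 gate: stable config, i.e. the relative spread of the
    primary metric across at least 3 independent runs is under 5%."""
    if len(run_metrics) < 3:
        return False  # need 3 seeded runs before claiming stability
    mean = statistics.mean(run_metrics)
    return statistics.stdev(run_metrics) / abs(mean) < threshold
```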
When to load experiment-craft: If results are highly unstable (variance > 20%) across runs, there's likely a training instability issue. Use the diagnostic flow.
Output: /experiments/stage2_tuning/ containing tuning logs, final config, and stability verification.
See references/attempt-budget-guide.md for budget rationale and adjustment rules.
Stage 3: Proposed Method
Goal: Implement and validate your novel method, demonstrating improvement over the tuned baseline.
Why this matters: This is the core contribution. But because you've verified the baseline (Stage 1) and optimized the config (Stage 2), any improvement you see is genuinely attributable to your method — not to a better-tuned setup or a broken baseline.
Budget: ≤12 attempts. Your method should work within a reasonable number of iterations if the underlying idea is sound. Excessive attempts suggest a fundamental problem, not a tuning issue.
Gate: Outperforms the tuned baseline on the primary metric. The improvement should be consistent across at least 3 runs.
Process:
- Implement the core method incrementally — don't add everything at once
- Test each component's integration with the baseline pipeline
- Run full training and compare against Stage 2 results
- If underperforming, isolate which component causes the gap
Integration strategy: Add your method's components one at a time to the working baseline. Each added component should stay within 20% of the baseline's performance — if a single component causes a >20% regression, isolate and debug it before proceeding. Never integrate the full method in one shot.
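The one-component-at-a-time guard can be sketched as follows, where `train_and_eval` is a placeholder for your full training run and the 20% threshold comes from the strategy above (this assumes a higher-is-better metric):

```python
def integrate_incrementally(baseline_metric, components, train_and_eval):
    """Add method components one at a time to the working baseline;
    stop and flag any single component causing a >20% regression."""
    active = []
    for component in components:
        metric = train_and_eval(active + [component])
        if metric < 0.8 * baseline_metric:  # >20% regression: isolate and debug
            return {"ok": False, "culprit": component, "active": active}
        active.append(component)
    return {"ok": True, "culprit": None, "active": active}
```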
When to load experiment-craft: When your method underperforms the baseline despite correct implementation. The 5-step diagnostic flow will help distinguish between implementation bugs and fundamental issues.
Critical decision — failure classification: If the method underperforms the baseline after exhausting the attempt budget, hand off to evo-memory for IVE (Idea Validation Evolution) — this is evo-memory's job, not this skill's. IVE triggers under two conditions:
- No executable code: Cannot find working code within the budget at any stage.
- Worse than baseline: Experiments complete but the method underperforms.
The evo-memory skill will classify the failure as:
- Implementation failure: Bugs or missing tricks → retryable in a future cycle.
- Fundamental direction failure: Core idea doesn't work → update ideation memory to prevent retrying.
Output: /experiments/stage3_method/ containing method code, results, comparison with baseline.
Stage 4: Ablation Study
Goal: Prove that each component of your method contributes meaningfully to the final result.
Why this matters: Reviewers will ask "is component X really necessary?" for every part of your method. Without ablation, you can't answer. More importantly, ablation helps YOU understand why your method works — sometimes components you thought were important aren't, and vice versa.
Budget: ≤18 attempts. Ablation requires multiple controlled experiments — one per component being ablated, plus interaction effects. 18 attempts covers a method with 4-5 components.
Gate: Every claimed contribution is supported by a controlled experiment showing its effect.
Process:
- List all components of your method that you claim contribute to performance
- Design ablation experiments: remove ONE component at a time, measure the impact
- For components that interact, test interaction effects
- Verify that no single component's removal improves results (would invalidate the claim)
Three ablation designs:
- Leave-one-out: Remove each component individually. Shows each component's marginal contribution.
- Additive: Start from baseline, add components one at a time. Shows incremental gains.
- Substitution: Replace your component with an alternative approach. Shows your component is better than alternatives, not just better than nothing.
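The leave-one-out design can be sketched as a small runner, where `train_and_eval` stands in for a full controlled training run on a given component set:

```python
def leave_one_out(components, train_and_eval):
    """Leave-one-out ablation: remove each component individually and
    report its marginal contribution relative to the full method."""
    full_score = train_and_eval(components)
    report = {}
    for component in components:
        reduced = [c for c in components if c != component]
        # Positive contribution means removing the component hurt performance.
        report[component] = full_score - train_and_eval(reduced)
    return full_score, report
```

A negative entry in `report` is exactly the red flag named in the process above: removing that component improved results.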
When to load experiment-craft: If ablation results contradict your hypothesis (removing a component improves results), use the diagnostic flow to understand why.
Output: /experiments/stage4_ablation/ containing the ablation results table and per-component analysis.
See references/stage-protocols.md for detailed ablation design patterns.
Integrating experiment-craft for Diagnosis
When a stage attempt fails, refer to the experiment-craft skill for structured diagnosis:
- Follow the experiment-craft diagnostic protocol
- Run the 5-step diagnostic flow (observe, hypothesize, test, conclude, prescribe)
- The diagnosis does NOT consume your stage budget — it's a free analysis step
- The diagnosis output (a prescription) becomes the plan for your next attempt
- Return to the pipeline and record the diagnosis in your trajectory log
Trigger points: After any failed attempt in any stage. Especially important:
- Stage 1: After 5+ failed attempts (>10% gap from reported metrics)
- Stage 2: When variance > 20% across runs
- Stage 3: When method consistently underperforms baseline
- Stage 4: When ablation results contradict your hypothesis
Code Trajectory Logging
Every attempt across all stages should be logged in a structured format that captures not just WHAT you did but WHY and WHAT YOU LEARNED. These logs feed into evo-memory's Experiment Strategy Evolution (ESE) mechanism.
For each attempt, record:
- Attempt number and stage
- Hypothesis: What you expected and why
- Code changes: Summary of what was modified (not a full diff, but the key changes)
- Result: Metrics and observations
- Analysis: Whether the hypothesis was confirmed or refuted, and what you learned
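One way to capture those fields per attempt is a JSON Lines logger. This is a minimal sketch; the authoritative format lives in references/code-trajectory-logging.md:

```python
import json
import time

def log_attempt(path, stage, attempt, hypothesis, code_changes,
                result, analysis):
    """Append one structured trajectory entry capturing what was tried,
    why, and what was learned."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "stage": stage,
        "attempt": attempt,
        "hypothesis": hypothesis,      # what you expected and why
        "code_changes": code_changes,  # key changes, not a full diff
        "result": result,              # metrics and observations
        "analysis": analysis,          # confirmed/refuted + what you learned
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```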
See references/code-trajectory-logging.md for the full logging format and how logs feed into evo-memory.
Counterintuitive Pipeline Rules
Prioritize these rules during experiment execution:
- Initial implementation is not wasted time: It validates your entire infrastructure — data pipeline, evaluation code, training setup. Skipping it means every subsequent result is built on unverified ground. Most "method doesn't work" bugs are actually baseline setup bugs.
- Budget limits prevent rabbit holes: Fixed attempt budgets force you to think systematically. When you know you have 12 attempts, you design each one to maximize information. Without limits, attempt #47 is rarely more informative than attempt #12 — it's just more desperate.
- Stage order is non-negotiable: Each stage validates assumptions the next depends on. Skipping Stage 1 means Stage 3 results could be wrong due to a broken baseline. Skipping Stage 2 means Stage 3 improvements might just be better hyperparameters, not a better method. There are no shortcuts.
- Ablation is not optional cleanup: It's the primary evidence that your method works for the right reasons. A method that outperforms the baseline but has no ablation is a method you don't understand. Reviewers know this.
- Failed attempts are data, not waste: Each failed attempt narrows the search space and reveals something about the problem. Log failures carefully — they feed into evo-memory and prevent future researchers from repeating the same mistakes.
- Early termination is a feature: Stopping before budget exhaustion is smart, not lazy. If the gate is clearly unachievable after systematic attempts, escalate to evo-memory IVE rather than burning remaining budget on increasingly random variations.
Handoff to Paper Writing
When all four stages are complete, pass these artifacts to paper-writing:
| Artifact | Source Stage | Used By |
|---|---|---|
| Initial implementation results | Stage 1 | Comparison tables, setup verification |
| Optimal hyperparameter config | Stage 2 | Reproducibility section |
| Method vs baseline comparison | Stage 3 | Main results table |
| Ablation study results | Stage 4 | Ablation table, contribution claims |
| Code trajectory logs (all stages) | All stages | Method section details, supplementary |
| Implementation details and tricks | Stages 1-3 | Method section, reproducibility (captured in trajectory log Analysis fields) |
Also pass results to evo-memory for evolution updates:
- If any stage exhausts budget without executable code, OR Stage 3 method underperforms the tuned baseline → trigger IVE (Idea Validation Evolution)
- If all stages succeeded → trigger ESE (Experiment Strategy Evolution)
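The two routing rules above can be stated as a small decision function (the argument names are assumptions; the logic is exactly the two bullets):

```python
def evolution_trigger(has_executable_code, method_beats_baseline):
    """Route pipeline outcomes to evo-memory: IVE on failure (no executable
    code, or method worse than the tuned baseline), ESE on success."""
    if not has_executable_code or not method_beats_baseline:
        return "IVE"  # Idea Validation Evolution
    return "ESE"      # Experiment Strategy Evolution
```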
Skill Integration
Before Starting (load memory)
Refer to the evo-memory skill to read Experimentation Memory:
→ Read M_E at /memory/experiment-memory.md
On Failure (within any stage)
Refer to the experiment-craft skill for 5-step diagnostic:
→ Run diagnosis → Return to pipeline
On IVE Trigger (budget exhausted or method underperforms)
Refer to the evo-memory skill for failure classification:
→ Run IVE protocol
On Pipeline Success (all 4 stages complete)
Refer to the evo-memory skill for strategy extraction:
→ Run ESE protocol with trajectory logs
Handoff to Paper Writing
Refer to the paper-writing skill:
→ Pass all stage artifacts
Reference Navigation
| Topic | Reference File | When to Use |
|---|---|---|
| Per-stage checklists and patterns | stage-protocols.md | Detailed guidance for each stage |
| Budget rationale and adjustment | attempt-budget-guide.md | When budgets feel too tight or too loose |
| Code trajectory logging format | code-trajectory-logging.md | Recording attempts for evo-memory |
| Stage log template | stage-log-template.md | Logging a single stage's progress |
| Pipeline tracker template | pipeline-tracker-template.md | Tracking the full 4-stage pipeline |