experiment-iterative-coder

Iterative Coder

Iterative code refinement through structured plan → code → evaluate → refine cycles. Each cycle runs objective checks (lint, tests) and self-evaluation, then diagnoses failures and plans targeted improvements. Reaches production quality in 3-8 iterations.

When to Use This Skill

- Main agent delegates a code task prefixed with "MODE: MORE_EFFORT"
- User selected "More Effort" mode for code generation
- Task requires high code quality with verified correctness
- Task involves complex implementation (5+ files, multiple modules)
- You want to iterate on code quality rather than submit first-pass code
- You mention "iterative refinement", "code quality loop", "plan-code-evaluate"

The Iteration Mindset

Code quality comes from fast feedback loops, not careful first attempts. A fast plan → code → evaluate → fix cycle beats spending 30 minutes on a "perfect" first implementation. The evaluate step reveals problems you cannot predict by thinking alone — lint errors, import failures, test regressions, and missing edge cases all surface immediately when you actually run the code.

Before Starting: Load Context

- Read `/memory/experiment-memory.md` for proven strategies from past cycles (skip if it doesn't exist).
- Identify existing tests, linting config (pyproject.toml, ruff.toml), or CI setup in the workspace.
- Check available tools:

```bash
ruff --version 2>&1; echo "---"; python -m pytest --version 2>&1
```

If either is missing, skip that check during evaluation (do not fail the iteration).
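
The same availability check can be done from Python when the loop is driven by a script. This is a minimal sketch, not part of the skill itself; the function name `available_checks` is hypothetical, and it assumes `ruff` is invoked as a command on `PATH` and pytest as `python -m pytest`, as in the command above.

```python
import importlib.util
import shutil


def available_checks() -> dict:
    """Report which evaluation tools can be run in this environment."""
    return {
        # ruff is called as a standalone command, so look it up on PATH
        "ruff": shutil.which("ruff") is not None,
        # pytest is called as `python -m pytest`, so check that it is importable
        "pytest": importlib.util.find_spec("pytest") is not None,
    }


# Print a short summary; missing tools are skipped later, not treated as failures
for tool, present in available_checks().items():
    print(f"{tool}: {'available' if present else 'missing, skip this check'}")
```

The result can be carried into the evaluation step so that skipped checks are recorded rather than counted as failures.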

Phase Decomposition

Before iterating, analyze the task and break it into sequential phases:

| Task Complexity | Recommended Phases |
| --- | --- |
| Single file, well-defined function | 1 phase |
| 2-4 files, clear interfaces | 2 phases |
| 5+ files, multiple interacting modules | 3-5 phases |

For each phase, define:

- Name: concise label (e.g., "Data loading pipeline")
- Goal: what "done" looks like for this phase
- Verification signal: how to confirm the phase is complete (specific test, lint clean, output matches)

Order phases by dependency — later phases may build on earlier ones.
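
One way to hold a decomposition in code is sketched below. The `Phase` dataclass and `recommended_phase_range` are hypothetical names introduced for illustration; the thresholds simply restate the complexity table above, and the example phases other than "Data loading pipeline" are made up.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str          # concise label
    goal: str          # what "done" looks like
    verification: str  # how to confirm the phase is complete


def recommended_phase_range(file_count: int) -> tuple:
    """Map a file count to the (min, max) phase count from the table above."""
    if file_count <= 1:
        return (1, 1)
    if file_count <= 4:
        return (2, 2)
    return (3, 5)


# Hypothetical plan, ordered by dependency: later phases build on earlier ones
plan = [
    Phase("Data loading pipeline", "Loader returns parsed records", "tests for the loader pass"),
    Phase("Reporting", "Summary is written to disk", "lint clean and output matches"),
]
print(recommended_phase_range(3), [p.name for p in plan])
```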

The Iteration Loop

For each phase, iterate up to 3 times. Global maximum: 10 iterations across all phases.

Step 1: Plan

Read current code and previous evaluation feedback (if any). Write a concise improvement plan.

First iteration of a phase: Write an initial implementation plan based on the phase goal.

Subsequent iterations: Analyze the last evaluation's feedback and diagnose the root cause of failures before planning changes. Do not repeat the same approach that already failed.

Adapt your plan based on the failure mode from the last evaluation:

| Last Failure | Planned Response |
| --- | --- |
| Timeout | Add `--quick` / `--smoke` mode, reduce data size, add early stopping |
| Syntax Error | Simplify logic, run `python -c "import ast; ast.parse(open('file.py').read())"` to validate before running |
| Import Error | Check `pip list`, use only installed packages, add missing deps to requirements |
| Test Failure | Focus on the specific failing test, make minimal targeted changes |
| Lint Failure | Run `ruff check --fix . && ruff format .` before any logic changes |
| Low self-assessment | Re-read the original task requirements, check for missing functionality |
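
The syntax-validation one-liner in the table can also be written as a small helper that reports where parsing failed. This is a sketch under the assumption that the files are UTF-8 Python sources; `syntax_error_in` is a hypothetical name, not something the skill defines.

```python
import ast
from pathlib import Path
from typing import Optional


def syntax_error_in(path: str) -> Optional[str]:
    """Return a short description of the first syntax error, or None if the file parses."""
    source = Path(path).read_text(encoding="utf-8")
    try:
        # ast.parse only parses the source; it never executes it
        ast.parse(source, filename=path)
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None
```

Calling it on each changed file before running anything else gives a cheap early signal, in the spirit of the "validate before running" response above.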

Step 2: Code

Implement the plan. Keep changes focused on what the plan specifies.

- Do not rewrite working files unless the plan explicitly requires it
- After writing code, do a quick sanity read of the changed files

Step 3: Evaluate

CRITICAL: You MUST run these commands every iteration. Do not skip evaluation.

```bash
# Each check is piped through tail, so $? would report tail's exit code.
# ${PIPESTATUS[0]} (bash) reports the exit code of the first command in the pipe.

# 1. Lint check
ruff check . 2>&1 | tail -20
echo "LINT_EXIT: ${PIPESTATUS[0]}"

# 2. Format check
ruff format --check . 2>&1 | tail -10
echo "FORMAT_EXIT: ${PIPESTATUS[0]}"

# 3. Run tests (only if test files exist in workspace)
python -m pytest -x -q --tb=short 2>&1 | tail -30
echo "TEST_EXIT: ${PIPESTATUS[0]}"
```

If `ruff` is not installed, skip checks 1-2. If `pytest` is not installed or no test files exist, skip check 3. Record which checks were skipped.
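
If the evaluation is driven from Python rather than a shell, the exit code and the trimmed output can be captured together. The helper below is an illustrative sketch; `run_check` is a hypothetical name, and it assumes the caller has already confirmed that the tool exists (a missing executable raises `FileNotFoundError`).

```python
import subprocess
import sys


def run_check(cmd: list, tail: int = 20) -> tuple:
    """Run one check and return (exit_code, last `tail` lines of combined output)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
        text=True,
    )
    last_lines = "\n".join(proc.stdout.splitlines()[-tail:])
    # returncode belongs to the check itself, not to any output trimming
    return proc.returncode, last_lines


# Harmless demonstration using the current interpreter instead of a real check
code, output = run_check([sys.executable, "-c", "print('ok')"])
print(code, output)
```

For the checks above, the calls would look like `run_check(["ruff", "check", "."], tail=20)` and `run_check([sys.executable, "-m", "pytest", "-x", "-q", "--tb=short"], tail=30)`.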

Step 4: Score

Compute a composite score from objective signals and self-assessment.

Objective signals (from Step 3 exit codes):

- `LINT_EXIT=0` → lint_score = 1.0, else lint_score = 0.0
- `FORMAT_EXIT=0` → format_score = 1.0, else format_score = 0.0
- `TEST_EXIT=0` → test_score = 1.0, else parse pass ratio from pytest output (e.g., "3 passed, 1 failed" → 0.75)
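
Parsing the pass ratio could look like the sketch below. It assumes the pytest summary is the last line of the captured output that contains counts, and it treats collection errors as failures; both are assumptions of this example, and `pass_ratio` is a hypothetical name.

```python
import re

# Matches summary fragments such as "3 passed", "1 failed", "2 errors"
_COUNT = re.compile(r"(\d+) (passed|failed|errors?)")


def pass_ratio(pytest_output: str) -> float:
    """Fraction of tests that passed, read from the pytest summary line."""
    # Scan from the bottom: the summary is normally the last line with counts
    for line in reversed(pytest_output.splitlines()):
        counts = _COUNT.findall(line)
        if counts:
            passed = sum(int(n) for n, label in counts if label == "passed")
            total = sum(int(n) for n, _ in counts)
            return passed / total if total else 0.0
    # No summary found (for example, no tests ran)
    return 0.0


print(pass_ratio("3 passed, 1 failed in 0.12s"))  # 0.75
```

Note that the `-x` flag used in Step 3 stops at the first failure, so the ratio reflects only the tests that ran before that point.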

Self-assessment (rate 0.0 – 1.0): evaluate correctness (does the code do what was asked?), completeness (are all requirements addressed?), error handling (are reasonable edge cases covered?), and readability (clear names and structure).

Composite score — dynamic weighting based on available signals:

- Lint + tests available: `0.2 × lint + 0.1 × format + 0.3 × test + 0.4 × self`
- Lint only (no tests): `0.3 × lint + 0.1 × format + 0.6 × self`
- Tests only (no ruff): `0.4 × test + 0.6 × self`
- Neither available: `1.0 × self`

Hard caps (these prevent an optimistic self-assessment from inflating the composite score):

- If lint check FAILED → composite capped at 0.4, regardless of self-assessment
- If any test FAILED → composite capped at 0.6
- Only claim composite ≥ 0.85 if BOTH lint and tests pass AND implementation is complete
- Deductions: missing error handling for obvious cases (−0.1), hardcoded absolute paths (−0.05)

See references/evaluation-protocol.md for detailed scoring edge cases.
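
The weighting and the two numeric caps can be sketched as a single function. This is an illustration under stated assumptions, not the protocol in references/evaluation-protocol.md: `composite_score` is a hypothetical name, a signal passed as `None` means that check was skipped, and the deductions and the "implementation is complete" condition are left to the caller because they depend on judgment.

```python
from typing import Optional


def composite_score(
    self_score: float,
    lint: Optional[float] = None,
    fmt: Optional[float] = None,
    test: Optional[float] = None,
) -> float:
    """Combine available signals using the weights above, then apply the hard caps."""
    fmt_score = fmt if fmt is not None else 0.0
    if lint is not None and test is not None:
        score = 0.2 * lint + 0.1 * fmt_score + 0.3 * test + 0.4 * self_score
    elif lint is not None:
        score = 0.3 * lint + 0.1 * fmt_score + 0.6 * self_score
    elif test is not None:
        score = 0.4 * test + 0.6 * self_score
    else:
        score = self_score
    # Hard caps: a failed objective check limits the composite score
    if lint is not None and lint < 1.0:
        score = min(score, 0.4)
    if test is not None and test < 1.0:
        score = min(score, 0.6)
    return round(score, 4)


print(composite_score(0.9, lint=1.0, fmt=1.0, test=1.0))   # 0.96
print(composite_score(1.0, lint=1.0, fmt=1.0, test=0.75))  # 0.6 (test cap)
```

The second call shows why the caps matter: the weighted sum alone would be 0.925, which would clear the 0.85 threshold despite a failing test.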

Step 5: Decide

- Composite score ≥ 0.85 → advance to next phase (or finish if last phase)
- Composite score < 0.85 → return to Step 1 with evaluation feedback
- Phase iteration limit reached (3 per phase) → advance to next phase anyway, note remaining issues
- Global iteration limit reached (10 total) → stop, output current best result
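
The decision rules can be read as a small function. The sketch below uses the return values from the log template's "Next" field (`continue`, `next_phase`, `done`); the function name `decide` and the order in which the limits are checked are assumptions of this example, since the text does not say which limit wins when both are reached.

```python
def decide(
    score: float,
    phase_iterations: int,
    total_iterations: int,
    is_last_phase: bool,
    threshold: float = 0.85,
    phase_limit: int = 3,
    global_limit: int = 10,
) -> str:
    """Return 'continue', 'next_phase', or 'done' for the iteration just scored."""
    if score >= threshold:
        return "done" if is_last_phase else "next_phase"
    if total_iterations >= global_limit:
        # Global budget exhausted: stop and output the current best result
        return "done"
    if phase_iterations >= phase_limit:
        # Phase budget exhausted: move on anyway and note remaining issues
        return "done" if is_last_phase else "next_phase"
    return "continue"


print(decide(0.70, phase_iterations=1, total_iterations=1, is_last_phase=False))  # continue
print(decide(0.70, phase_iterations=3, total_iterations=3, is_last_phase=False))  # next_phase
```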

Step 6: Log

CRITICAL: Append to `/artifacts/iteration_log.md` after every iteration.

Use the template at assets/iteration-log-template.md:

```markdown
Iteration {N} (Phase {M}/{T})

Score: {composite} (lint={X} format={X} test={X} self={X})
Lint: passed/failed ({N} issues)
Tests: passed/failed ({passed}/{total})
Changes: [{files changed}]
Feedback: [{key evaluation findings}]
Next: continue / next_phase / done
```
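
Appending an entry in this shape could be scripted as below. This is a sketch only: `render_entry` and `append_entry` are hypothetical helpers, the field layout follows the template fields listed above, and the exact markup in assets/iteration-log-template.md may differ.

```python
from pathlib import Path


def render_entry(iteration, phase, total_phases, scores, lint_issues,
                 tests_passed, tests_total, changes, feedback, next_step) -> str:
    """Format one log entry; `scores` needs keys composite, lint, format, test, self."""
    lint_state = "passed" if lint_issues == 0 else "failed"
    test_state = "passed" if tests_passed == tests_total else "failed"
    return "\n".join([
        f"Iteration {iteration} (Phase {phase}/{total_phases})",
        "",
        f"Score: {scores['composite']} (lint={scores['lint']} format={scores['format']} "
        f"test={scores['test']} self={scores['self']})",
        f"Lint: {lint_state} ({lint_issues} issues)",
        f"Tests: {test_state} ({tests_passed}/{tests_total})",
        f"Changes: [{', '.join(changes)}]",
        f"Feedback: [{feedback}]",
        f"Next: {next_step}",
        "",
    ])


def append_entry(log_path: str, entry: str) -> None:
    """Append (never overwrite) so earlier iterations stay in the log."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

Opening the file in append mode is the important design choice here: the log is meant to accumulate one entry per iteration so that later steps can read the full history.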

Completion

After all phases complete or global iteration limit is reached:

- Report to the caller:
  - Total iterations used
  - Final composite score
  - Key improvements per phase (1-2 sentences each)
- List all output file paths (code, configs, tests)
- Note remaining issues: lint warnings, missing tests, known limitations, TODOs

Counterintuitive Iteration Rules

- Fix lint before logic: Lint errors compound; one import error masks all test failures downstream. Always run `ruff check --fix .` before investigating logic bugs.
- 3 iterations is enough per phase: If you cannot fix it in 3 targeted iterations, the problem is architectural (wrong decomposition), not incremental. Advance to the next phase or re-plan rather than iterating further.
- Tests reveal more than reading: Running tests for 10 seconds teaches you more about correctness than reading code for 5 minutes. Always run tests, even when you are confident the code is correct.
- Score drops are information: If your composite score drops after a change, that is a signal about what matters. Analyze why it dropped before undoing the change.
- Don't gold-plate: 0.85 is the target, not 1.0. Diminishing returns kick in hard above 0.9. Ship and iterate in the next conversation if needed.

Skill Integration

Before Starting (load memory)

Refer to evo-memory → Read `/memory/experiment-memory.md` for prior strategies

On Failure (stuck after max iterations)

Refer to experiment-craft → 5-step diagnostic flow to understand the root cause before retrying

On Success (all phases complete, score ≥ 0.85)

Report to the main agent → main agent continues pipeline (data-analysis, writing, etc.)

Handoff Artifacts

| Artifact | Location | Used By |
| --- | --- | --- |
| Iteration log | `/artifacts/iteration_log.md` | Main agent summary, evo-memory ESE |
| Final code | Workspace root | Next pipeline step |
| Test results | Iteration log entries | data-analysis-agent |

Reference Navigation

| Topic | Reference File | When to Use |
| --- | --- | --- |
| Scoring rules and edge cases | evaluation-protocol.md | When scoring edge cases arise (partial tests, missing tools) |
| Iteration log template | iteration-log-template.md | Every iteration (Step 6) |
