experiment-iterative-coder
Iterative Coder
Iterative code refinement through structured plan → code → evaluate → refine cycles. Each cycle runs objective checks (lint, tests) and self-evaluation, then diagnoses failures and plans targeted improvements. Typically reaches production quality in 3-8 iterations.
When to Use This Skill
- Main agent delegates a code task prefixed with "MODE: MORE_EFFORT"
- User selected "More Effort" mode for code generation
- Task requires high code quality with verified correctness
- Task involves complex implementation (5+ files, multiple modules)
- You want to iterate on code quality rather than submit first-pass code
- You mention "iterative refinement", "code quality loop", "plan-code-evaluate"
The Iteration Mindset
Code quality comes from fast feedback loops, not careful first attempts. A fast plan → code → evaluate → fix cycle beats spending 30 minutes on a "perfect" first implementation. The evaluate step reveals problems you cannot predict by thinking alone — lint errors, import failures, test regressions, and missing edge cases all surface immediately when you actually run the code.
Before Starting: Load Context
- Read /memory/experiment-memory.md for proven strategies from past cycles (skip if it doesn't exist)
- Identify existing tests, linting config (pyproject.toml, ruff.toml), or CI setup in the workspace
- Check available tools:

```bash
ruff --version 2>&1; echo "---"; python -m pytest --version 2>&1
```

If either is missing, you will skip that check during evaluation (do not fail the iteration).
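The same availability check can be done from Python when the agent needs the result as data rather than as printed text. This is a minimal sketch, not part of the skill itself: the function name `available_checks` is hypothetical, and it assumes `ruff` is run as a command on `PATH` while `pytest` is run as `python -m pytest`.

```python
import importlib.util
import shutil


def available_checks() -> dict[str, bool]:
    """Report which evaluation tools can be used in this environment."""
    return {
        # ruff is invoked as a command-line program, so look it up on PATH.
        "ruff": shutil.which("ruff") is not None,
        # pytest is invoked as `python -m pytest`, so check that it is importable.
        "pytest": importlib.util.find_spec("pytest") is not None,
    }


if __name__ == "__main__":
    for tool, present in available_checks().items():
        status = "available" if present else "missing, skip its checks"
        print(f"{tool}: {status}")
```

A missing tool maps to `False` instead of raising, which matches the rule that a missing tool skips a check rather than failing the iteration.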
Phase Decomposition
Before iterating, analyze the task and break it into sequential phases:
| Task Complexity | Recommended Phases |
|---|---|
| Single file, well-defined function | 1 phase |
| 2-4 files, clear interfaces | 2 phases |
| 5+ files, multiple interacting modules | 3-5 phases |
For each phase, define:
- Name: concise label (e.g., "Data loading pipeline")
- Goal: what "done" looks like for this phase
- Verification signal: how to confirm the phase is complete (specific test, lint clean, output matches)
Order phases by dependency — later phases may build on earlier ones.
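The decomposition above can be sketched as a small data structure. This is an illustrative sketch only: the `Phase` class and `recommended_phase_range` function are hypothetical names, and file count is used as a rough proxy for the complexity column, which also depends on how well-defined the interfaces are.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str          # concise label, e.g. "Data loading pipeline"
    goal: str          # what "done" looks like for this phase
    verification: str  # how completion is confirmed (test, lint clean, output)


def recommended_phase_range(file_count: int) -> tuple[int, int]:
    """Map a file count to the (min, max) phase count from the table above."""
    if file_count < 1:
        raise ValueError("file_count must be at least 1")
    if file_count == 1:
        return (1, 1)
    if file_count <= 4:
        return (2, 2)
    return (3, 5)


# Phases are kept in dependency order: later entries may build on earlier ones.
phases = [
    Phase("Data loading pipeline", "loader returns parsed records", "loader tests pass"),
    Phase("Processing module", "records are transformed", "lint clean and tests pass"),
]
```

The second phase name and the goal and verification strings are made-up examples; only "Data loading pipeline" comes from the text.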
The Iteration Loop
For each phase, iterate up to 3 times. Global maximum: 10 iterations across all phases.
Step 1: Plan
Read current code and previous evaluation feedback (if any). Write a concise improvement plan.
First iteration of a phase: Write an initial implementation plan based on the phase goal.
Subsequent iterations: Analyze the last evaluation's feedback and diagnose the root cause of failures before planning changes. Do not repeat the same approach that already failed.
Adapt your plan based on the failure mode from the last evaluation:
| Last Failure | Planned Response |
|---|---|
| Timeout | Add |
| Syntax Error | Simplify logic, run |
| Import Error | Check |
| Test Failure | Focus on the specific failing test, make minimal targeted changes |
| Lint Failure | Run `ruff check --fix .` before changing logic |
| Low self-assessment | Re-read the original task requirements, check for missing functionality |
Step 2: Code
Implement the plan. Keep changes focused on what the plan specifies.
- Do not rewrite working files unless the plan explicitly requires it
- After writing code, do a quick sanity read of the changed files
Step 3: Evaluate
CRITICAL: You MUST run these commands every iteration. Do not skip evaluation.
```bash
# 1. Lint check
ruff check . 2>&1 | tail -20
# PIPESTATUS[0] (bash-specific) is ruff's exit code; $? would report tail's.
echo "LINT_EXIT: ${PIPESTATUS[0]}"

# 2. Format check
ruff format --check . 2>&1 | tail -10
echo "FORMAT_EXIT: ${PIPESTATUS[0]}"

# 3. Run tests (only if test files exist in workspace)
python -m pytest -x -q --tb=short 2>&1 | tail -30
echo "TEST_EXIT: ${PIPESTATUS[0]}"
```

If `ruff` is not installed, skip checks 1-2. If `pytest` is not installed or no test files exist, skip check 3. Record which checks were skipped.
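In a shell pipeline, `$?` reports the status of the last command in the pipe (here `tail`), so the scoring step needs each tool's own exit code. One way to get it is to run the checks from Python. The following is a minimal sketch under that assumption; `run_check` is a hypothetical helper, not something the skill defines.

```python
import subprocess
import sys


def run_check(cmd: list[str], tail_lines: int = 20) -> tuple[int, str]:
    """Run one check; return its own exit code and the last lines of its output."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
        text=True,
    )
    tail = "\n".join(proc.stdout.splitlines()[-tail_lines:])
    return proc.returncode, tail


# Example with a stand-in command that prints two lines and exits with code 3.
demo_cmd = [sys.executable, "-c", "import sys; print('a'); print('b'); sys.exit(3)"]
exit_code, output_tail = run_check(demo_cmd, tail_lines=1)
```

For the real checks, the command lists would be `["ruff", "check", "."]`, `["ruff", "format", "--check", "."]`, and `[sys.executable, "-m", "pytest", "-x", "-q", "--tb=short"]`. `subprocess.run` raises `FileNotFoundError` when the executable is missing, which a caller can catch and record as a skipped check.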
Step 4: Score
Compute a composite score from objective signals and self-assessment.
Objective signals (from Step 3 exit codes):
- `LINT_EXIT=0` → lint_score = 1.0, else lint_score = 0.0
- `FORMAT_EXIT=0` → format_score = 1.0, else format_score = 0.0
- `TEST_EXIT=0` → test_score = 1.0, else parse pass ratio from pytest output (e.g., "3 passed, 1 failed" → 0.75)
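The pass-ratio fallback for failing test runs could be parsed roughly as follows. This is a sketch with stated assumptions: `parse_pass_ratio` is a hypothetical name, it expects the pytest summary text with counts such as "3 passed, 1 failed", and it counts collection or setup errors as failures in the denominator, which the scoring rules do not specify.

```python
import re


def parse_pass_ratio(summary: str) -> float:
    """Return passed / (passed + failed + errors) from pytest summary text."""
    counts: dict[str, int] = {}
    for number, outcome in re.findall(r"(\d+) (passed|failed|errors?)", summary):
        key = "error" if outcome.startswith("error") else outcome
        counts[key] = counts.get(key, 0) + int(number)
    passed = counts.get("passed", 0)
    total = passed + counts.get("failed", 0) + counts.get("error", 0)
    # With no recognizable counts, report 0.0 rather than dividing by zero.
    return passed / total if total else 0.0


ratio = parse_pass_ratio("3 passed, 1 failed")  # 0.75, matching the example above
```

Because the test command uses `-x`, pytest stops at the first failure, so the ratio only reflects the tests that ran before that point.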
Self-assessment (rate 0.0 – 1.0):
Evaluate on: correctness (does the code do what was asked?), completeness (all requirements addressed?), error handling (reasonable edge cases covered?), readability (clear names, structure).
Composite score — dynamic weighting based on available signals:
- Lint + tests available: `0.2 × lint + 0.1 × format + 0.3 × test + 0.4 × self`
- Lint only (no tests): `0.3 × lint + 0.1 × format + 0.6 × self`
- Tests only (no ruff): `0.4 × test + 0.6 × self`
- Neither available: `1.0 × self`
Self-assessment hard caps — prevent score inflation from self-assessment:
- If lint check FAILED → composite capped at 0.4, regardless of self-assessment
- If any test FAILED → composite capped at 0.6
- Only claim composite ≥ 0.85 if BOTH lint and tests pass AND implementation is complete
- Deductions: missing error handling for obvious cases (−0.1), hardcoded absolute paths (−0.05)
See references/evaluation-protocol.md for detailed scoring edge cases.
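The weighting and caps can be checked with a short worked example. The sketch below is one possible reading of the rules, not the reference implementation: `composite_score` is a hypothetical name, `None` marks a skipped check, a missing format score alongside a present lint score is treated as 0.0, and the deductions and the "implementation is complete" judgment are left to the caller because the text does not say where they apply.

```python
from typing import Optional


def composite_score(
    self_score: float,
    lint: Optional[float] = None,
    fmt: Optional[float] = None,
    test: Optional[float] = None,
) -> float:
    """Combine the available signals using the weights and hard caps above."""
    has_lint = lint is not None
    has_test = test is not None
    fmt_value = fmt if fmt is not None else 0.0  # assumption: missing format counts as 0

    if has_lint and has_test:
        score = 0.2 * lint + 0.1 * fmt_value + 0.3 * test + 0.4 * self_score
    elif has_lint:
        score = 0.3 * lint + 0.1 * fmt_value + 0.6 * self_score
    elif has_test:
        score = 0.4 * test + 0.6 * self_score
    else:
        score = self_score

    # Hard caps: an objective failure bounds the composite regardless of self-assessment.
    if has_lint and lint < 1.0:
        score = min(score, 0.4)
    if has_test and test < 1.0:
        score = min(score, 0.6)
    return round(score, 4)


# Lint and format pass, 3 of 4 tests pass, self-assessment 0.9:
# weighted sum = 0.2 + 0.1 + 0.225 + 0.36 = 0.885, then capped at 0.6 by the failing test.
example = composite_score(0.9, lint=1.0, fmt=1.0, test=0.75)
```

The example shows why the caps matter: without them, a single failing test would still clear the 0.85 threshold on the strength of self-assessment.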
Step 5: Decide
- Composite score ≥ 0.85 → advance to next phase (or finish if last phase)
- Composite score < 0.85 → return to Step 1 with evaluation feedback
- Phase iteration limit reached (3 per phase) → advance to next phase anyway, note remaining issues
- Global iteration limit reached (10 total) → stop, output current best result
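The four rules above can be expressed as a single decision function. This is a sketch under one assumption the text leaves open: when several rules apply at once, the global limit is checked first. The function name `decide` and the constant names are hypothetical.

```python
TARGET_SCORE = 0.85
MAX_PER_PHASE = 3
MAX_GLOBAL = 10


def decide(score: float, phase_iters: int, global_iters: int, is_last_phase: bool) -> str:
    """Return "continue", "next_phase", or "done" for the iteration just scored.

    phase_iters and global_iters count completed iterations, including this one.
    """
    if global_iters >= MAX_GLOBAL:
        return "done"  # stop and output the current best result
    if score >= TARGET_SCORE or phase_iters >= MAX_PER_PHASE:
        # Either the phase met the target or its budget ran out;
        # in the second case the caller should note the remaining issues.
        return "done" if is_last_phase else "next_phase"
    return "continue"  # return to Step 1 with the evaluation feedback


next_step = decide(score=0.6, phase_iters=3, global_iters=3, is_last_phase=False)
```

Here the phase has used its three iterations without reaching 0.85, so the result is `"next_phase"`.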
Step 6: Log
CRITICAL: Append to /artifacts/iteration_log.md after every iteration.
Use the template at assets/iteration-log-template.md:

```markdown
Iteration {N} (Phase {M}/{T})
- Score: {composite} (lint={X} format={X} test={X} self={X})
- Lint: passed/failed ({N} issues)
- Tests: passed/failed ({passed}/{total})
- Changes: [{files changed}]
- Feedback: [{key evaluation findings}]
- Next: continue / next_phase / done
```
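Filling the log template by hand is error-prone, so an entry could be rendered and appended programmatically. The sketch below assumes the field layout shown in the template; `format_log_entry` and `append_log_entry` are hypothetical helpers, and the real template file may contain details not visible here.

```python
from pathlib import Path


def format_log_entry(
    iteration: int,
    phase: int,
    total_phases: int,
    scores: dict[str, float],
    lint_issues: int,
    tests_passed: int,
    tests_total: int,
    changes: list[str],
    feedback: list[str],
    next_step: str,
) -> str:
    """Render one iteration using the fields of the log template."""
    lint_state = "passed" if lint_issues == 0 else "failed"
    test_state = "passed" if tests_passed == tests_total else "failed"
    score_line = (
        f"- Score: {scores['composite']} (lint={scores['lint']} "
        f"format={scores['format']} test={scores['test']} self={scores['self']})"
    )
    lines = [
        f"Iteration {iteration} (Phase {phase}/{total_phases})",
        score_line,
        f"- Lint: {lint_state} ({lint_issues} issues)",
        f"- Tests: {test_state} ({tests_passed}/{tests_total})",
        f"- Changes: [{', '.join(changes)}]",
        f"- Feedback: [{'; '.join(feedback)}]",
        f"- Next: {next_step}",
    ]
    return "\n".join(lines) + "\n"


def append_log_entry(entry: str, log_path: Path) -> None:
    """Append an entry so that earlier iterations are never overwritten."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

In the skill's setup, `log_path` would be `Path("/artifacts/iteration_log.md")`. Opening the file in append mode keeps the full history available for the final report.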
Completion
After all phases complete or global iteration limit is reached:
- Report to the caller:
  - Total iterations used
  - Final composite score
  - Key improvements per phase (1-2 sentences each)
- List all output file paths (code, configs, tests)
- Note remaining issues: lint warnings, missing tests, known limitations, TODOs
Counterintuitive Iteration Rules
- Fix lint before logic: Lint errors compound; one import error masks all test failures downstream. Always run `ruff check --fix .` before investigating logic bugs.
- 3 iterations is enough per phase: If you cannot fix it in 3 targeted iterations, the problem is architectural (wrong decomposition), not incremental. Advance to the next phase or re-plan rather than iterating further.
- Tests reveal more than reading: Running tests for 10 seconds teaches you more about correctness than reading code for 5 minutes. Always run tests, even when you are confident the code is correct.
- Score drops are information: If your composite score drops after a change, that is a signal about what matters. Analyze why it dropped before undoing the change.
- Don't gold-plate: 0.85 is the target, not 1.0. Diminishing returns kick in hard above 0.9. Ship and iterate in the next conversation if needed.
Skill Integration
Before Starting (load memory)
Refer to evo-memory → Read /memory/experiment-memory.md for prior strategies
On Failure (stuck after max iterations)
Refer to experiment-craft → 5-step diagnostic flow to understand the root cause before retrying
On Success (all phases complete, score ≥ 0.85)
Report to the main agent → main agent continues pipeline (data-analysis, writing, etc.)
Handoff Artifacts
| Artifact | Location | Used By |
|---|---|---|
| Iteration log | /artifacts/iteration_log.md | Main agent summary, evo-memory ESE |
| Final code | Workspace root | Next pipeline step |
| Test results | Iteration log entries | data-analysis-agent |
Reference Navigation
| Topic | Reference File | When to Use |
|---|---|---|
| Scoring rules and edge cases | evaluation-protocol.md | When scoring edge cases arise (partial tests, missing tools) |
| Iteration log template | iteration-log-template.md | Every iteration (Step 6) |