auto-review-loop


Auto Review Loop: Autonomous Research Improvement


Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.

Context: $ARGUMENTS


Constants


  • MAX_ROUNDS = 4
  • POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", or "ready for submission"
  • REVIEW_DOC: `AUTO_REVIEW.md` in project root (cumulative log)
  • REVIEWER_MODEL = `gpt-5.4` — model used via Codex MCP. Must be an OpenAI model (e.g., `gpt-5.4`, `o3`, `gpt-4o`)
  • HUMAN_CHECKPOINT = false — when `true`, pause after each round's review (Phase B) and present the score and weaknesses to the user. Wait for user input before proceeding to Phase C. The user can approve the suggested fixes, provide custom modification instructions, skip specific fixes, or stop the loop early. When `false` (default), the loop runs fully autonomously.

💡 Override: `/auto-review-loop "topic" — human checkpoint: true`

State Persistence (Compact Recovery)


Long-running loops may hit the context window limit, triggering automatic compaction. To survive this, persist state to `REVIEW_STATE.json` after each round:

```json
{
  "round": 2,
  "threadId": "019cd392-...",
  "status": "in_progress",
  "last_score": 5.0,
  "last_verdict": "not ready",
  "pending_experiments": ["screen_name_1"],
  "timestamp": "2026-03-13T21:00:00"
}
```
Write this file at the end of every Phase E (after documenting the round). Overwrite each time — only the latest state matters.
On completion (positive assessment or max rounds), set `"status": "completed"` so future invocations don't accidentally resume a finished loop.
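The persistence step can be sketched as follows. Field names follow the JSON schema above; the temp-file-then-rename step is an added safeguard against a crash mid-write, not something the spec requires:

```python
import json
import os
import tempfile
from datetime import datetime

STATE_FILE = "REVIEW_STATE.json"

def save_state(round_num, thread_id, score, verdict, pending, status="in_progress"):
    """Overwrite REVIEW_STATE.json with the latest round's state."""
    state = {
        "round": round_num,
        "threadId": thread_id,
        "status": status,
        "last_score": score,
        "last_verdict": verdict,
        "pending_experiments": pending,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    # Write to a temp file first, then atomically rename over the target,
    # so a crash never leaves a half-written state file behind.
    fd, tmp = tempfile.mkstemp(dir=".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, STATE_FILE)
    return state
```

On completion, the same helper is called once more with `status="completed"`.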

Workflow


Initialization


  1. Check for `REVIEW_STATE.json` in project root:
    • If it does not exist: fresh start (normal case, identical to behavior before this feature existed)
    • If it exists AND `status` is `"completed"`: fresh start (previous loop finished normally)
    • If it exists AND `status` is `"in_progress"` AND `timestamp` is older than 24 hours: fresh start (stale state from a killed/abandoned run — delete the file and start over)
    • If it exists AND `status` is `"in_progress"` AND `timestamp` is within 24 hours: resume
      • Read the state file to recover `round`, `threadId`, `last_score`, `pending_experiments`
      • Read `AUTO_REVIEW.md` to restore full context of prior rounds
      • If `pending_experiments` is non-empty, check whether they have completed (e.g., check screen sessions)
      • Resume from the next round (round = saved round + 1)
      • Log: "Recovered from context compaction. Resuming at Round N."
  2. Read project narrative documents, memory files, and any prior review documents
  3. Read recent experiment results (check output directories, logs)
  4. Identify current weaknesses and open TODOs from prior reviews
  5. Initialize round counter = 1 (unless recovered from state file)
  6. Create/update `AUTO_REVIEW.md` with header and timestamp
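The branching in step 1 can be sketched as a small helper. This is a hypothetical function (the spec prescribes only the decisions, not this implementation); it assumes the state file follows the schema shown earlier:

```python
import json
import os
from datetime import datetime, timedelta

def startup_mode(state_path="REVIEW_STATE.json", max_age_hours=24):
    """Decide between a fresh start and resuming a prior loop.

    Returns ("fresh", None) or ("resume", state_dict).
    """
    if not os.path.exists(state_path):
        return "fresh", None                  # no prior state: normal case
    with open(state_path) as f:
        state = json.load(f)
    if state.get("status") == "completed":
        return "fresh", None                  # previous loop finished normally
    saved = datetime.fromisoformat(state["timestamp"])
    if datetime.now() - saved > timedelta(hours=max_age_hours):
        os.remove(state_path)                 # stale state: delete and start over
        return "fresh", None
    return "resume", state                    # caller resumes at state["round"] + 1
```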

Loop (repeat up to MAX_ROUNDS)


Phase A: Review


Send comprehensive context to the external reviewer:

```
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N/MAX_ROUNDS of autonomous review loop]

    [Full research context: claims, methods, results, known weaknesses]
    [Changes since last round, if any]

    Please act as a senior ML reviewer (NeurIPS/ICML level).

    1. Score this work 1-10 for a top venue
    2. List remaining critical weaknesses (ranked by severity)
    3. For each weakness, specify the MINIMUM fix (experiment, analysis, or reframing)
    4. State clearly: is this READY for submission? Yes/No/Almost

    Be brutally honest. If the work is ready, say so clearly.
```

If this is round 2+, use `mcp__codex__codex-reply` with the saved threadId to maintain conversation context.

Phase B: Parse Assessment


CRITICAL: Save the FULL raw response from the external reviewer verbatim (store in a variable for Phase E). Do NOT discard or summarize — the raw text is the primary record.
Then extract structured fields:
  • Score (numeric 1-10)
  • Verdict ("ready" / "almost" / "not ready")
  • Action items (ranked list of fixes)
STOP CONDITION: If score >= 6 AND verdict contains "ready" or "almost" → stop loop, document final state.
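One way to extract the structured fields and apply the stop condition. The regex and keyword matching here are illustrative assumptions — the reviewer's phrasing can vary, so a real parser should fall back to re-asking when nothing matches:

```python
import re

def parse_assessment(raw):
    """Extract (score, verdict, stop) from the reviewer's raw response text."""
    # Score: first "N/10" pattern, e.g. "Score: 7/10" or "6.5 / 10".
    m = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", raw)
    score = float(m.group(1)) if m else None
    # Verdict: check "not ready" before "ready" so the substring doesn't mislead.
    lower = raw.lower()
    if "not ready" in lower:
        verdict = "not ready"
    elif "almost" in lower:
        verdict = "almost"
    elif "ready" in lower:
        verdict = "ready"
    else:
        verdict = "unknown"
    # Stop condition from the spec: score >= 6 AND verdict "ready" or "almost".
    stop = score is not None and score >= 6 and verdict in ("ready", "almost")
    return score, verdict, stop
```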

Human Checkpoint (if enabled)

人工检查点(若启用)

Skip this step entirely if `HUMAN_CHECKPOINT = false`.

When `HUMAN_CHECKPOINT = true`, present the review results and wait for user input:

```
📋 Round N/MAX_ROUNDS review complete.

Score: X/10 — [verdict]
Top weaknesses:
1. [weakness 1]
2. [weakness 2]
3. [weakness 3]

Suggested fixes:
1. [fix 1]
2. [fix 2]
3. [fix 3]

Options:
- Reply "go" or "continue" → implement all suggested fixes
- Reply with custom instructions → implement your modifications instead
- Reply "skip 2" → skip fix #2, implement the rest
- Reply "stop" → end the loop, document current state
```

Wait for the user's response. Parse their input:
  • Approval ("go", "continue", "ok", "proceed"): proceed to Phase C with all suggested fixes
  • Custom instructions (any other text): treat as additional/replacement guidance for Phase C. Merge with reviewer suggestions where appropriate
  • Skip specific fixes ("skip 1,3"): remove those fixes from the action list
  • Stop ("stop", "enough", "done"): terminate the loop, jump to Termination
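The reply handling above could look like this. `apply_user_reply` is a hypothetical helper; the keyword sets mirror the options listed, and fix numbers are 1-based to match the numbered list shown to the user:

```python
import re

APPROVE = {"go", "continue", "ok", "proceed"}
STOP = {"stop", "enough", "done"}

def apply_user_reply(reply, fixes):
    """Return (action, remaining_fixes) for a checkpoint reply.

    action is "proceed", "stop", or "custom".
    """
    text = reply.strip().lower()
    if text in STOP:
        return "stop", []
    if text in APPROVE:
        return "proceed", fixes
    skip = re.match(r"skip\s+([\d,\s]+)$", text)
    if skip:
        # "skip 1,3" -> drop fixes #1 and #3, keep the rest (1-based numbering).
        drop = {int(n) for n in re.findall(r"\d+", skip.group(1))}
        kept = [f for i, f in enumerate(fixes, 1) if i not in drop]
        return "proceed", kept
    # Anything else is custom guidance, merged with the reviewer's suggestions.
    return "custom", fixes + [reply.strip()]
```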

Feishu Notification (if configured)


After parsing the score, check whether `~/.claude/feishu.json` exists and mode is not `"off"`:
  • Send a `review_scored` notification: "Round N: X/10 — [verdict]" with top 3 weaknesses
  • If interactive mode and verdict is "almost": send as checkpoint, wait for user reply on whether to continue or stop
  • If config absent or mode off: skip entirely (no-op)
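The gating check might be sketched as below. Assumption: the config holds a top-level `"mode"` key — the spec only names the file and an `"off"` mode, so adapt the key to the actual schema:

```python
import json
import os

def feishu_mode(config_path=os.path.expanduser("~/.claude/feishu.json")):
    """Return the notification mode, or None when notifications are a no-op."""
    if not os.path.exists(config_path):
        return None                       # config absent: skip entirely
    with open(config_path) as f:
        mode = json.load(f).get("mode", "off")
    return None if mode == "off" else mode
```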

Phase C: Implement Fixes (if not stopping)


For each action item (highest priority first):
  1. Code changes: Write/modify experiment scripts, model code, analysis scripts
  2. Run experiments: Deploy to GPU server via SSH + screen/tmux
  3. Analysis: Run evaluation, collect results, update figures/tables
  4. Documentation: Update project notes and review document
Prioritization rules:
  • Skip fixes requiring excessive compute (flag for manual follow-up)
  • Skip fixes requiring external data/models not available
  • Prefer reframing/analysis over new experiments when both address the concern
  • Always implement metric additions (cheap, high impact)

Phase D: Wait for Results


If experiments were launched:
  • Monitor remote sessions for completion
  • Collect results from output files and logs
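The monitoring step could be sketched as below. These are hypothetical helpers; in practice `screen -ls` would be run over SSH on the GPU host, and the session-name matching is a simplifying assumption:

```python
import subprocess
import time

def screen_ls():
    """Listing of screen sessions on the experiment host."""
    return subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout

def wait_for_sessions(names, list_fn=screen_ls, poll_secs=60, timeout_secs=6 * 3600):
    """Poll until none of the named sessions appear in the listing.

    Returns True when all sessions have finished, False on timeout
    (remaining names should go into pending_experiments).
    """
    deadline = time.time() + timeout_secs
    while True:
        remaining = [n for n in names if n in list_fn()]
        if not remaining:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll_secs)
```

Injecting `list_fn` keeps the polling logic testable without a live screen server.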

Phase E: Document Round


Append to `AUTO_REVIEW.md`:

Round N (timestamp)


Assessment (Summary)


  • Score: X/10
  • Verdict: [ready/almost/not ready]
  • Key criticisms: [bullet list]

Reviewer Raw Response


<details> <summary>Click to expand full reviewer response</summary>
[Paste the COMPLETE raw response from the external reviewer here — verbatim, unedited. This is the authoritative record. Do NOT truncate or paraphrase.]
</details>

Actions Taken


  • [what was implemented/changed]

Results


  • [experiment outcomes, if any]

Status


  • [continuing to round N+1 / stopping]

**Write `REVIEW_STATE.json`** with current round, threadId, score, verdict, and any pending experiments.

Increment round counter → back to Phase A.

Termination


When the loop ends (positive assessment or max rounds):
  1. Update `REVIEW_STATE.json` with `"status": "completed"`
  2. Write final summary to `AUTO_REVIEW.md`
  3. Update project notes with conclusions
  4. Write method/pipeline description to `AUTO_REVIEW.md` under a `## Method Description` section — a concise 1-2 paragraph description of the final method, its architecture, and data flow. This serves as input for `/paper-illustration` in Workflow 3 (so it can generate architecture diagrams automatically).
  5. If stopped at max rounds without positive assessment:
    • List remaining blockers
    • Estimate effort needed for each
    • Suggest whether to continue manually or pivot
  6. Feishu notification (if configured): Send `pipeline_done` with final score progression table

Key Rules


  • Large file handling: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
  • ALWAYS use `config: {"model_reasoning_effort": "xhigh"}` for maximum reasoning depth
  • Save the threadId from the first call; use `mcp__codex__codex-reply` for subsequent rounds
  • Be honest — include negative results and failed experiments
  • Do NOT hide weaknesses to game a positive score
  • Implement fixes BEFORE re-reviewing (don't just promise to fix)
  • If an experiment takes > 30 minutes, launch it and continue with other fixes while waiting
  • Document EVERYTHING — the review log should be self-contained
  • Update project notes after each round, not just at the end

Prompt Template for Round 2+


```
mcp__codex__codex-reply:
  threadId: [saved from round 1]
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N update]

    Since your last review, we have:
    1. [Action 1]: [result]
    2. [Action 2]: [result]
    3. [Action 3]: [result]

    Updated results table:
    [paste metrics]

    Please re-score and re-assess. Are the remaining concerns addressed?
    Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.
```