auto-review-loop
Auto Review Loop: Autonomous Research Improvement
Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.
Context: $ARGUMENTS
Constants
- MAX_ROUNDS = 4
- POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", "ready for submission"
- REVIEW_DOC: `AUTO_REVIEW.md` in project root (cumulative log)
- REVIEWER_MODEL = `gpt-5.4` — Model used via Codex MCP. Must be an OpenAI model (e.g., `gpt-5.4`, `o3`, `gpt-4o`)
- HUMAN_CHECKPOINT = `false` — When `true`, pause after each round's review (Phase B) and present the score + weaknesses to the user. Wait for user input before proceeding to Phase C. The user can: approve the suggested fixes, provide custom modification instructions, skip specific fixes, or stop the loop early. When `false` (default), the loop runs fully autonomously.

💡 Override: `/auto-review-loop "topic" — human checkpoint: true`
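The POSITIVE_THRESHOLD rule can be sketched as a small helper. This is an illustrative sketch, not part of the command itself; the function name `is_positive` is hypothetical, and the keyword tuple mirrors the constant above:

```python
POSITIVE_KEYWORDS = ("accept", "sufficient", "ready for submission")

def is_positive(score: float, verdict: str) -> bool:
    """True when the review clears POSITIVE_THRESHOLD: a numeric
    score of at least 6/10, or a verdict containing one of the
    accepting phrases (case-insensitive)."""
    if score >= 6:
        return True
    verdict_lower = verdict.lower()
    return any(kw in verdict_lower for kw in POSITIVE_KEYWORDS)
```

Note the check is an OR, so a low-scoring review whose verdict says "sufficient" still counts as positive.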
State Persistence (Compact Recovery)
Long-running loops may hit the context window limit, triggering automatic compaction. To survive this, persist state to `REVIEW_STATE.json` after each round:

```json
{
  "round": 2,
  "threadId": "019cd392-...",
  "status": "in_progress",
  "last_score": 5.0,
  "last_verdict": "not ready",
  "pending_experiments": ["screen_name_1"],
  "timestamp": "2026-03-13T21:00:00"
}
```

Write this file at the end of every Phase E (after documenting the round). Overwrite each time — only the latest state matters.
On completion (positive assessment or max rounds), set `"status": "completed"` so future invocations don't accidentally resume a finished loop.
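The overwrite-on-every-round behavior can be sketched as follows. This is a minimal illustration, not the command's implementation; the helper name `save_state` and its signature are assumptions, but the JSON keys match the schema above:

```python
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("REVIEW_STATE.json")

def save_state(round_num, thread_id, score, verdict, pending,
               status="in_progress"):
    """Overwrite REVIEW_STATE.json with the latest round's state.
    Only the most recent state matters, so no history is kept."""
    state = {
        "round": round_num,
        "threadId": thread_id,
        "status": status,
        "last_score": score,
        "last_verdict": verdict,
        "pending_experiments": pending,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return state
```

On termination, the same helper would be called with `status="completed"`.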
"status": "completed"长时间运行的循环可能触发上下文窗口限制,进而自动触发紧凑处理。为应对此情况,每轮结束后将状态持久化到:
REVIEW_STATE.jsonjson
{
"round": 2,
"threadId": "019cd392-...",
"status": "in_progress",
"last_score": 5.0,
"last_verdict": "not ready",
"pending_experiments": ["screen_name_1"],
"timestamp": "2026-03-13T21:00:00"
}在每轮Phase E结束时(完成本轮记录后)写入该文件。每次覆盖文件 — 仅保留最新状态。
循环完成时(获得正面评估或达到最大轮次),将写入文件,避免后续调用意外恢复已完成的循环。
"status": "completed"Workflow
Initialization
- Check for `REVIEW_STATE.json` in project root:
  - If it does not exist: fresh start (normal case, identical to behavior before this feature existed)
  - If it exists AND `status` is `"completed"`: fresh start (previous loop finished normally)
  - If it exists AND `status` is `"in_progress"` AND `timestamp` is older than 24 hours: fresh start (stale state from a killed/abandoned run — delete the file and start over)
  - If it exists AND `status` is `"in_progress"` AND `timestamp` is within 24 hours: resume
    - Read the state file to recover `round`, `threadId`, `last_score`, `pending_experiments`
    - Read `AUTO_REVIEW.md` to restore full context of prior rounds
    - If `pending_experiments` is non-empty, check if they have completed (e.g., check screen sessions)
    - Resume from the next round (round = saved round + 1)
    - Log: "Recovered from context compaction. Resuming at Round N."
- Read project narrative documents, memory files, and any prior review documents
- Read recent experiment results (check output directories, logs)
- Identify current weaknesses and open TODOs from prior reviews
- Initialize round counter = 1 (unless recovered from state file)
- Create/update `AUTO_REVIEW.md` with header and timestamp
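The fresh-start vs. resume decision above can be sketched as a single function. This is an illustrative sketch under the stated rules; the function name `decide_start` is hypothetical, and it assumes the `timestamp` field is ISO 8601 as in the example state file:

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

def decide_start(state_path=Path("REVIEW_STATE.json"), now=None):
    """Return ('fresh', None) or ('resume', state) per the
    initialization rules: missing file, completed status, or a
    stale (>24h) in-progress file all mean a fresh start."""
    now = now or datetime.now()
    if not state_path.exists():
        return ("fresh", None)
    state = json.loads(state_path.read_text())
    if state.get("status") == "completed":
        return ("fresh", None)
    ts = datetime.fromisoformat(state["timestamp"])
    if now - ts > timedelta(hours=24):
        state_path.unlink()  # stale state from a killed run: delete it
        return ("fresh", None)
    return ("resume", state)
```

On `"resume"`, the caller would continue at `state["round"] + 1` after re-reading `AUTO_REVIEW.md`.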
Loop (repeat up to MAX_ROUNDS)
Phase A: Review
Send comprehensive context to the external reviewer:

```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N/MAX_ROUNDS of autonomous review loop]
    [Full research context: claims, methods, results, known weaknesses]
    [Changes since last round, if any]

    Please act as a senior ML reviewer (NeurIPS/ICML level).
    1. Score this work 1-10 for a top venue
    2. List remaining critical weaknesses (ranked by severity)
    3. For each weakness, specify the MINIMUM fix (experiment, analysis, or reframing)
    4. State clearly: is this READY for submission? Yes/No/Almost
    Be brutally honest. If the work is ready, say so clearly.
```

If this is round 2+, use `mcp__codex__codex-reply` with the saved threadId to maintain conversation context.
Phase B: Parse Assessment
CRITICAL: Save the FULL raw response from the external reviewer verbatim (store in a variable for Phase E). Do NOT discard or summarize — the raw text is the primary record.
Then extract structured fields:
- Score (numeric 1-10)
- Verdict ("ready" / "almost" / "not ready")
- Action items (ranked list of fixes)
STOP CONDITION: If score >= 6 AND verdict contains "ready" or "almost" → stop loop, document final state.
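The field extraction and stop condition can be sketched as below. This is an illustrative sketch only: the reviewer's response is free-form text, so the regex assumes the "Score: X/10" and Yes/No/Almost conventions that the Phase A prompt requests, and the function name `parse_review` is hypothetical:

```python
import re

def parse_review(raw: str):
    """Extract (score, verdict, should_stop) from a raw review.
    Assumes a line like 'Score: 7/10' and a readiness statement
    mentioning 'ready' or 'almost', per the Phase A prompt."""
    m = re.search(r"[Ss]core[:\s]+(\d+(?:\.\d+)?)\s*/\s*10", raw)
    score = float(m.group(1)) if m else 0.0
    low = raw.lower()
    if "almost" in low:
        verdict = "almost"
    elif "ready" in low and "not ready" not in low:
        verdict = "ready"
    else:
        verdict = "not ready"
    # STOP CONDITION: score >= 6 AND verdict "ready" or "almost"
    should_stop = score >= 6 and verdict in ("ready", "almost")
    return score, verdict, should_stop
```

The parsed fields drive the loop control; the raw text itself is still saved verbatim for Phase E.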
Human Checkpoint (if enabled)
Skip this step entirely if `HUMAN_CHECKPOINT = false`.

When `HUMAN_CHECKPOINT = true`, present the review results and wait for user input:

```
📋 Round N/MAX_ROUNDS review complete.
Score: X/10 — [verdict]

Top weaknesses:
1. [weakness 1]
2. [weakness 2]
3. [weakness 3]

Suggested fixes:
1. [fix 1]
2. [fix 2]
3. [fix 3]

Options:
- Reply "go" or "continue" → implement all suggested fixes
- Reply with custom instructions → implement your modifications instead
- Reply "skip 2" → skip fix #2, implement the rest
- Reply "stop" → end the loop, document current state
```

Wait for the user's response. Parse their input:
- Approval ("go", "continue", "ok", "proceed"): proceed to Phase C with all suggested fixes
- Custom instructions (any other text): treat as additional/replacement guidance for Phase C. Merge with reviewer suggestions where appropriate
- Skip specific fixes ("skip 1,3"): remove those fixes from the action list
- Stop ("stop", "enough", "done"): terminate the loop, jump to Termination
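The reply parsing above can be sketched as follows. A minimal sketch under the stated keyword lists; the function name `parse_user_reply` and its return shape are assumptions for illustration:

```python
import re

def parse_user_reply(reply: str, fixes: list):
    """Map a checkpoint reply to an action:
    ('proceed', fixes)       -- approval, keep all fixes
    ('proceed', kept_fixes)  -- 'skip 1,3' removes those fixes
    ('stop', None)           -- end the loop
    ('custom', text)         -- any other text becomes guidance"""
    text = reply.strip().lower()
    if text in ("go", "continue", "ok", "proceed"):
        return ("proceed", fixes)
    if text in ("stop", "enough", "done"):
        return ("stop", None)
    m = re.match(r"skip\s+([\d,\s]+)$", text)
    if m:
        skipped = {int(n) for n in re.findall(r"\d+", m.group(1))}
        kept = [f for i, f in enumerate(fixes, start=1) if i not in skipped]
        return ("proceed", kept)
    return ("custom", reply.strip())
```

Fix numbers are 1-based to match the numbered list shown to the user.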
Feishu Notification (if configured)
After parsing the score, check if `~/.claude/feishu.json` exists and mode is not `"off"`:
- Send a `review_scored` notification: "Round N: X/10 — [verdict]" with top 3 weaknesses
- If interactive mode and verdict is "almost": send as checkpoint, wait for user reply on whether to continue or stop
- If config absent or mode off: skip entirely (no-op)
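The config gate can be sketched as below. This is purely illustrative: the source does not specify the schema of `feishu.json`, so the top-level `"mode"` key and the helper name `feishu_mode` are assumptions:

```python
import json
from pathlib import Path

def feishu_mode(config_path=Path.home() / ".claude" / "feishu.json"):
    """Return the notification mode, or None when notifications
    are a no-op (config absent, or mode set to "off").
    Assumes a top-level "mode" key in the config -- hypothetical."""
    if not config_path.exists():
        return None
    mode = json.loads(config_path.read_text()).get("mode", "off")
    return None if mode == "off" else mode
```

A `None` return means skip the notification entirely, matching the no-op rule above.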
Phase C: Implement Fixes (if not stopping)
For each action item (highest priority first):
- Code changes: Write/modify experiment scripts, model code, analysis scripts
- Run experiments: Deploy to GPU server via SSH + screen/tmux
- Analysis: Run evaluation, collect results, update figures/tables
- Documentation: Update project notes and review document
Prioritization rules:
- Skip fixes requiring excessive compute (flag for manual follow-up)
- Skip fixes requiring external data/models not available
- Prefer reframing/analysis over new experiments when both address the concern
- Always implement metric additions (cheap, high impact)
Phase D: Wait for Results
If experiments were launched:
- Monitor remote sessions for completion
- Collect results from output files and logs
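One way to monitor for completion is to poll for done-marker files that the remote scripts touch when they finish (e.g. synced from the GPU server). This is a hedged sketch, not the command's actual mechanism; the marker-file convention and the name `wait_for_results` are assumptions:

```python
import time
from pathlib import Path

def wait_for_results(done_markers, poll_seconds=60,
                     timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll until every experiment has written its done-marker file,
    or until the timeout expires. Returns the markers still missing
    (empty list = all experiments finished). `sleep` is injectable
    so the loop can be tested without real waiting."""
    deadline = time.monotonic() + timeout_seconds
    pending = [Path(p) for p in done_markers]
    while pending and time.monotonic() < deadline:
        pending = [p for p in pending if not p.exists()]
        if pending:
            sleep(poll_seconds)
    return [str(p) for p in pending]
```

Any markers still pending at timeout would be recorded in `pending_experiments` so a resumed loop can re-check them.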
Phase E: Document Round
Append to `AUTO_REVIEW.md`:

```markdown
## Round N (timestamp)

### Assessment (Summary)
- Score: X/10
- Verdict: [ready/almost/not ready]
- Key criticisms: [bullet list]

### Reviewer Raw Response
<details>
<summary>Click to expand full reviewer response</summary>

[Paste the COMPLETE raw response from the external reviewer here — verbatim, unedited.
This is the authoritative record. Do NOT truncate or paraphrase.]

</details>

### Actions Taken
- [what was implemented/changed]

### Results
- [experiment outcomes, if any]

### Status
- [continuing to round N+1 / stopping]
```

**Write `REVIEW_STATE.json`** with current round, threadId, score, verdict, and any pending experiments.

Increment round counter → back to Phase A.

Termination
When loop ends (positive assessment or max rounds):
- Update `REVIEW_STATE.json` with `"status": "completed"`
- Write final summary to `AUTO_REVIEW.md`
- Update project notes with conclusions
- Write method/pipeline description to `AUTO_REVIEW.md` under a `## Method Description` section — a concise 1-2 paragraph description of the final method, its architecture, and data flow. This serves as input for `/paper-illustration` in Workflow 3 (so it can generate architecture diagrams automatically).
- If stopped at max rounds without positive assessment:
  - List remaining blockers
  - Estimate effort needed for each
  - Suggest whether to continue manually or pivot
- Feishu notification (if configured): Send `pipeline_done` with final score progression table
Key Rules
- Large file handling: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- ALWAYS use `config: {"model_reasoning_effort": "xhigh"}` for maximum reasoning depth
- Save threadId from first call, use `mcp__codex__codex-reply` for subsequent rounds
- Be honest — include negative results and failed experiments
- Do NOT hide weaknesses to game a positive score
- Implement fixes BEFORE re-reviewing (don't just promise to fix)
- If an experiment takes > 30 minutes, launch it and continue with other fixes while waiting
- Document EVERYTHING — the review log should be self-contained
- Update project notes after each round, not just at the end
Prompt Template for Round 2+
```yaml
mcp__codex__codex-reply:
  threadId: [saved from round 1]
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N update]

    Since your last review, we have:
    1. [Action 1]: [result]
    2. [Action 2]: [result]
    3. [Action 3]: [result]

    Updated results table:
    [paste metrics]

    Please re-score and re-assess. Are the remaining concerns addressed?
    Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.
```