research-refine
Research Refine: Problem-Anchored, Elegant, Frontier-Aware Plan Refinement
Refine and concretize: $ARGUMENTS
Overview
Use this skill when the research problem is already visible but the technical route is still fuzzy. The goal is not to produce a bloated proposal or a benchmark shopping list. The goal is to turn a vague direction into a problem -> focused method -> minimal validation document that is concrete enough to implement, elegant enough to feel paper-worthy, and current enough to resonate in the foundation-model era.
Four principles dominate this skill:
- Do not lose the original problem. Freeze an immutable Problem Anchor and reuse it in every round.
- The smallest adequate mechanism wins. Prefer the minimal intervention that directly fixes the bottleneck.
- One paper, one dominant contribution. Prefer one sharp thesis plus at most one supporting contribution.
- Modern leverage is a prior, not a decoration. When LLM / VLM / Diffusion / RL / distillation / inference-time scaling naturally fit the bottleneck, use them concretely. Do not bolt them on as buzzwords.
User input (PROBLEM + vague APPROACH)
-> Phase 0 (Claude): Freeze Problem Anchor
-> Phase 1 (Claude): Scan grounding papers -> identify technical gap -> choose the sharpest route -> write focused proposal
-> Phase 2 (Codex/GPT-5.4): Review for fidelity, specificity, contribution quality, and frontier leverage
-> Phase 3 (Claude): Anchor check + simplicity check -> revise method -> rewrite full proposal
-> Phase 4 (Codex, same thread): Re-evaluate revised proposal
-> Repeat Phase 3-4 until OVERALL SCORE >= 9 or MAX_ROUNDS reached
-> Phase 5: Save full history to refine-logs/
-> Optional handoff: /experiment-plan for a detailed execution-ready experiment roadmap

Constants
- REVIEWER_MODEL = gpt-5.4 — Reviewer model used via Codex MCP.
- MAX_ROUNDS = 5 — Maximum review-revise rounds.
- SCORE_THRESHOLD = 9 — Minimum overall score to stop.
- OUTPUT_DIR = refine-logs/ — Directory for round files and final report.
- MAX_LOCAL_PAPERS = 15 — Maximum local papers/notes to scan for grounding.
- MAX_CORE_EXPERIMENTS = 3 — Default cap for core validation blocks inside this skill.
- MAX_PRIMARY_CLAIMS = 2 — Soft cap for paper-level claims. Prefer one dominant claim plus one supporting claim.
- MAX_NEW_TRAINABLE_COMPONENTS = 2 — Soft cap for genuinely new trainable pieces. Exceed only if the paper breaks otherwise.

Override via argument if needed, e.g. /research-refine "problem | approach" -- max rounds: 3, threshold: 9
Output Structure
refine-logs/
├── round-0-initial-proposal.md
├── round-1-review.md
├── round-1-refinement.md
├── round-2-review.md
├── round-2-refinement.md
├── ...
├── REVIEW_SUMMARY.md
├── FINAL_PROPOSAL.md
├── REFINEMENT_REPORT.md
└── score-history.md

Every round-N-refinement.md must contain a full anchored proposal, not just incremental fixes.

Workflow
Phase 0: Freeze the Problem Anchor
Before proposing anything, extract the user's immutable bottom-line problem. This anchor must be copied verbatim into every proposal and every refinement round.
Write:
- Bottom-line problem: What technical problem must be solved?
- Must-solve bottleneck: What specific weakness in current methods is unacceptable?
- Non-goals: What is explicitly not the goal of this project?
- Constraints: Compute, data, time, tooling, venue, deployment limits.
- Success condition: What evidence would make the user say "yes, this method addresses the actual problem"?
If later reviewer feedback would change the problem being solved, mark that as drift and push back or adapt carefully.
Phase 1: Build the Initial Proposal
Step 1.1: Scan Grounding Material
Check papers/ and literature/ first. Read only the relevant parts needed to answer:
- What mechanism do current methods use?
- Where exactly do they fail for this problem?
- Which recent LLM / VLM / Diffusion / RL era techniques are actually relevant here?
- What training objectives, representations, or interfaces are reusable?
- What details distinguish a real method from a renamed high-level idea?
If local material is insufficient, search recent top-venue/arXiv work online. Focus on method sections, training setup, and failure modes, not just abstracts.
Step 1.2: Identify the Technical Gap
Do not stop at generic research questions. Make the gap operational:
- Current pipeline failure point: where does the baseline break?
- Why naive fixes are insufficient: larger context, more data, prompting, memory bank, or stacking more modules.
- Smallest adequate intervention: what is the least additional mechanism that could plausibly fix the bottleneck?
- Frontier-native alternative: is there a more current route using foundation-model-era primitives that better matches the bottleneck?
- Core technical claim: what exact mechanism claim could survive top-venue scrutiny?
- Required evidence: what minimum proof is needed to defend that claim?
Step 1.3: Choose the Sharpest Route
Before locking the method, compare two candidate routes if both are plausible:
- Route A: Elegant minimal route — the smallest mechanism that directly targets the bottleneck.
- Route B: Frontier-native route — a more modern route that uses LLM / VLM / Diffusion / RL / distillation / inference-time scaling only if it gives a cleaner or stronger story.
Then decide:
- Which route is more likely to become a strong paper under the stated constraints?
- Which route has the cleaner novelty story relative to the closest work?
- Which route avoids contribution sprawl?
If both routes are weak, rethink the framing instead of combining them into a larger system by default.
Step 1.4: Concretize the Method First
The proposal must answer "how would we actually build this?" Prefer method detail over broad experimentation and prefer reuse over invention.
Cover:
- One-sentence method thesis: the single strongest mechanism claim.
- Contribution focus: one dominant contribution and at most one supporting contribution.
- Complexity budget: what is frozen or reused, what is new, and what tempting additions are intentionally excluded.
- System graph: modules, data flow, inputs, outputs.
- Representation design: what latent, embedding, plan token, reward signal, memory state, or alignment space is used?
- Training recipe: data source, supervision, pseudo-labeling, negatives, curriculum, losses, weighting, stagewise vs joint training.
- Inference path: how the trained components are used at test time and what signals flow where.
- Why the mechanism stays small: why a larger stack is unnecessary.
- Exact role of any frontier primitive: if you use an LLM / VLM / Diffusion / RL component, specify whether it acts as planner, teacher, critic, reward model, generator prior, search controller, or distillation source.
- Failure handling: what could go wrong and what fallback or diagnostic exists?
- Novelty and elegance argument: why this is more than naming a module and why the paper still looks focused.
If the method is still only described as "add a module" or "use a planner," it is not concrete enough.
Step 1.5: Design Minimal Claim-Driven Validation
Experiments exist to validate the method, not to dominate the document.
For each core claim, define the smallest strong experiment that can validate it:
- the claim being tested
- the necessary baseline or ablation
- the decisive metric
- the expected directional outcome
Additional rules:
- Ensure one experiment block directly supports the Problem Anchor.
- If complexity risk exists, include one simplification or deletion check.
- If a frontier primitive is central, include one necessity check showing why that choice matters.
- Default to 1-3 core experiment blocks and leave the full execution roadmap to /experiment-plan.
Step 1.6: Write the Initial Proposal
Save to refine-logs/round-0-initial-proposal.md. Use this structure:

Research Proposal: [Title]
Problem Anchor
- Bottom-line problem:
- Must-solve bottleneck:
- Non-goals:
- Constraints:
- Success condition:
Technical Gap
[Why current methods fail, why naive bigger systems are not enough, and what mechanism is missing]

Method Thesis
- One-sentence thesis:
- Why this is the smallest adequate intervention:
- Why this route is timely in the foundation-model era:
Contribution Focus
- Dominant contribution:
- Optional supporting contribution:
- Explicit non-contributions:

Proposed Method

Complexity Budget
- Frozen / reused backbone:
- New trainable components:
- Tempting additions intentionally not used:
System Overview
[Step-by-step pipeline or ASCII graph]

Core Mechanism
- Input / output:
- Architecture or policy:
- Training signal / loss:
- Why this is the main novelty:

Optional Supporting Component
- Only include if truly necessary:
- Input / output:
- Training signal / loss:
- Why it does not create contribution sprawl:

Modern Primitive Usage
- Which LLM / VLM / Diffusion / RL-era primitive is used:
- Exact role in the pipeline:
- Why it is more natural than an old-school alternative:
Integration into Base Generator / Downstream Pipeline
[Where the new method attaches, what is frozen, what is trainable, inference order]

Training Plan
[Stagewise or joint training, losses, data construction, pseudo-labels, schedules]

Failure Modes and Diagnostics
- [Failure mode]:
- [How to detect]:
- [Fallback or mitigation]:

Novelty and Elegance Argument
[Closest work, exact difference, why this is a focused mechanism-level contribution rather than a module pile-up]
Claim-Driven Validation Sketch
Claim 1: [Main claim]
- Minimal experiment:
- Baselines / ablations:
- Metric:
- Expected evidence:

Claim 2: [Optional]
- Minimal experiment:
- Baselines / ablations:
- Metric:
- Expected evidence:
Experiment Handoff Inputs
- Must-prove claims:
- Must-run ablations:
- Critical datasets / metrics:
- Highest-risk assumptions:

Compute & Timeline Estimate
- Estimated GPU-hours:
- Data / annotation cost:
- Timeline:

Phase 2: External Method Review (Round 1)
Send the full proposal to GPT-5.4 for an elegance-first, frontier-aware, method-first review. The reviewer should spend most of the critique budget on the method itself, not on expanding the experiment menu.
mcp__codex__codex:
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are a senior ML reviewer for a top venue (NeurIPS/ICML/ICLR).
This is an early-stage, method-first research proposal.
Your job is NOT to reward extra modules, contribution sprawl, or a giant benchmark checklist.
Your job IS to stress-test whether the proposed method:
(1) still solves the original anchored problem,
(2) is concrete enough to implement,
(3) presents a focused, elegant contribution,
(4) uses foundation-model-era techniques appropriately when they are the natural fit.
Review principles:
- Prefer the smallest adequate mechanism over a larger system.
- Penalize parallel contributions that make the paper feel unfocused.
- If a modern LLM / VLM / Diffusion / RL route would clearly produce a better paper, say so concretely.
- If the proposal is already modern enough, do NOT force trendy components.
- Do not ask for extra experiments unless they are needed to prove the core claims.
Read the Problem Anchor first. If your suggested fix would change the problem being solved,
call that out explicitly as drift instead of treating it as a normal revision request.
=== PROPOSAL ===
[Paste the FULL proposal from Phase 1]
=== END PROPOSAL ===
Score these 7 dimensions from 1-10:
1. **Problem Fidelity**: Does the method still attack the original bottleneck, or has it drifted into solving something easier or different?
2. **Method Specificity**: Are the interfaces, representations, losses, training stages, and inference path concrete enough that an engineer could start implementing?
3. **Contribution Quality**: Is there one dominant mechanism-level contribution with real novelty, good parsimony, and no obvious contribution sprawl?
4. **Frontier Leverage**: Does the proposal use current foundation-model-era primitives appropriately when they are the right tool, instead of defaulting to old-school module stacking?
5. **Feasibility**: Can this method be trained and integrated with the stated resources and data assumptions?
6. **Validation Focus**: Are the proposed experiments minimal but sufficient to validate the core claims? Is there unnecessary experimental bloat?
7. **Venue Readiness**: If executed well, would the contribution feel sharp and timely enough for a top venue?
**OVERALL SCORE** (1-10): Weighted toward Problem Fidelity, Method Specificity, Contribution Quality, and Frontier Leverage.
Use this weighting: Problem Fidelity 15%, Method Specificity 25%, Contribution Quality 25%, Frontier Leverage 15%, Feasibility 10%, Validation Focus 5%, Venue Readiness 5%.
For each dimension scoring < 7, provide:
- The specific weakness
- A concrete fix at the method level (interface / loss / training recipe / integration point / deletion of unnecessary parts)
- Priority: CRITICAL / IMPORTANT / MINOR
Then add:
- **Simplification Opportunities**: 1-3 concrete ways to delete, merge, or reuse components while preserving the main claim. Write "NONE" if already tight.
- **Modernization Opportunities**: 1-3 concrete ways to replace old-school pieces with more natural foundation-model-era primitives if genuinely better. Write "NONE" if already modern enough.
- **Drift Warning**: "NONE" if the proposal still solves the anchored problem; otherwise explain the drift clearly.
- **Verdict**: READY / REVISE / RETHINK
Verdict rule:
- READY: overall score >= 9, no meaningful drift, one focused dominant contribution, and no obvious complexity bloat remains
- REVISE: the direction is promising but not yet at READY bar
- RETHINK: the core mechanism or framing is still fundamentally off

CRITICAL: Save the threadId from this call for all later rounds.
CRITICAL: Save the FULL raw response verbatim.
Save review to refine-logs/round-1-review.md with the raw response in a <details> block.

Phase 3: Parse Feedback and Revise the Method
Step 3.1: Parse the Review
Extract:
- Problem Fidelity
- Method Specificity
- Contribution Quality
- Frontier Leverage
- Feasibility
- Validation Focus
- Venue Readiness
- Overall score
- Verdict
- Drift Warning
- Simplification Opportunities
- Modernization Opportunities
- Action items ranked by priority
Update refine-logs/score-history.md:
undefinedScore Evolution
| Round | Problem Fidelity | Method Specificity | Contribution Quality | Frontier Leverage | Feasibility | Validation Focus | Venue Readiness | Overall | Verdict |
|---|---|---|---|---|---|---|---|---|---|
| 1 | X | X | X | X | X | X | X | X | REVISE |
**STOP CONDITION**: If overall score >= SCORE_THRESHOLD, verdict is READY, and there is no unresolved drift warning, skip to Phase 5.

Step 3.2: Revise With an Anchor Check and a Simplicity Check
Before changing anything:
- Copy the Problem Anchor verbatim.
- Write an Anchor Check:
- What is the original bottleneck?
- Does the current method still solve it?
- Which reviewer suggestions would cause drift if followed blindly?
- Write a Simplicity Check:
- What is the dominant contribution now?
- What components can be removed, merged, or kept frozen?
- Which reviewer suggestions add unnecessary complexity?
- If a frontier primitive is central, is its role still crisp and justified?
Then process reviewer feedback:
- If valid: sharpen the mechanism, simplify if possible, or modernize if the paper really improves.
- If debatable: revise, but explain your reasoning with evidence.
- If wrong, drifting, or over-complicating: push back with evidence from local papers and the Problem Anchor.
Bias the revisions toward:
- a sharper central contribution
- fewer moving parts
- cleaner reuse of strong existing backbones
- more natural foundation-model-era leverage when it improves the paper
- leaner, claim-driven experiments
Do not add multiple parallel contributions just to chase score. If the reviewer requests another module, first ask whether the same gain can come from a better interface, distillation signal, reward model, or inference policy on top of an existing backbone.
Save to refine-logs/round-N-refinement.md:
undefinedRound N Refinement
Problem Anchor
[Copy verbatim from round 0]

Anchor Check
- Original bottleneck:
- Why the revised method still addresses it:
- Reviewer suggestions rejected as drift:

Simplicity Check
- Dominant contribution after revision:
- Components removed or merged:
- Reviewer suggestions rejected as unnecessary complexity:
- Why the remaining mechanism is still the smallest adequate route:

Changes Made
1. [Method section changed]
- Reviewer said:
- Action:
- Reasoning:
- Impact on core method:
2. [Novelty / modernity / feasibility / validation change]
- Reviewer said:
- Action:
- Reasoning:
- Impact on core method:

Revised Proposal
[Full updated proposal from Problem Anchor through Claim-Driven Validation Sketch]

Phase 4: Re-evaluation (Round 2+)
Send the revised proposal back to GPT-5.4 in the same thread:
mcp__codex__codex-reply:
threadId: [saved from Phase 2]
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
[Round N re-evaluation]
I revised the proposal based on your feedback.
First, check whether the original Problem Anchor is still preserved.
Second, judge whether the method is now more concrete, more focused, and more current.
Key changes:
1. [Method change 1]
2. [Method change 2]
3. [Simplification / modernization / pushback if any]
=== REVISED PROPOSAL ===
[Paste the FULL revised proposal]
=== END REVISED PROPOSAL ===
Please:
- Re-score the same 7 dimensions and overall
- State whether the Problem Anchor is preserved or drifted
- State whether the dominant contribution is now sharper or still too broad
- State whether the method is simpler or still overbuilt
- State whether the frontier leverage is now appropriate or still old-school / forced
- Focus new critiques on missing mechanism, weak training signal, weak integration point, pseudo-novelty, or unnecessary complexity
- Use the same verdict rule: READY only if overall score >= 9 and no blocking issue remains
Same output format: 7 scores, overall score, verdict, drift warning, simplification opportunities, modernization opportunities, remaining action items.

Save review to refine-logs/round-N-review.md. Then return to Phase 3 until:
- Overall score >= SCORE_THRESHOLD and verdict is READY and no unresolved drift
- or MAX_ROUNDS reached
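The review-revise loop above amounts to a simple control flow. A minimal Python sketch follows; `run_review` and `revise` are hypothetical stand-ins for the Codex review call and the Phase 3 revision step, and the review dict keys are assumptions, not a real API:

```python
MAX_ROUNDS = 5
SCORE_THRESHOLD = 9

def refine(proposal, run_review, revise):
    """Iterate review -> revise until READY or the round budget is spent."""
    for round_num in range(1, MAX_ROUNDS + 1):
        # Phase 4: re-evaluate; the reviewer returns score, verdict, and drift flag
        review = run_review(proposal)
        ready = (
            review["overall"] >= SCORE_THRESHOLD
            and review["verdict"] == "READY"
            and not review["drift"]
        )
        if ready:
            return proposal, round_num
        # Phase 3: anchor check + simplicity check, then rewrite the proposal
        proposal = revise(proposal, review)
    return proposal, MAX_ROUNDS

# Stub reviewer that rejects round 1 and accepts round 2
scores = iter([
    {"overall": 7, "verdict": "REVISE", "drift": False},
    {"overall": 9, "verdict": "READY", "drift": False},
])
final, rounds = refine("draft", lambda p: next(scores), lambda p, r: p + "+rev")
# rounds == 2: one revision happened before the READY verdict
```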
Phase 5: Final Report and Logs
Step 5.1: Write refine-logs/REVIEW_SUMMARY.md
This file is the high-level round-by-round review record. It should answer: what each round was trying to solve, what changed, what got resolved, and what remained.

Review Summary
Problem: [user's problem]
Initial Approach: [user's vague approach]
Date: [today]
Rounds: N / MAX_ROUNDS
Final Score: X / 10
Final Verdict: [READY / REVISE / RETHINK]
Problem Anchor
[Verbatim anchor used across all rounds]
Round-by-Round Resolution Log
| Round | Main Reviewer Concerns | What This Round Simplified / Modernized | Solved? | Remaining Risk |
|---|---|---|---|---|
| 1 | [top issues from review] | [main method changes] | [yes / partial / no] | [if any] |
| 2 | ... | ... | ... | ... |
Overall Evolution
- [How the method became more concrete]
- [How the dominant contribution became more focused]
- [How unnecessary complexity was removed]
- [How modern technical leverage improved or stayed intentionally minimal]
- [How drift was avoided or corrected]
Final Status
- Anchor status: [preserved / corrected / unresolved]
- Focus status: [tight / slightly broad / still diffuse]
- Modernity status: [appropriately frontier-aware / intentionally conservative / still old-school]
- Strongest parts of final method:
- Remaining weaknesses:

Step 5.2: Write refine-logs/FINAL_PROPOSAL.md

This file is the clean final version document. It should contain only the final proposal itself, without review chatter, round history, or raw reviewer output.

Research Proposal: [Title]
[Paste the final refined proposal only]
If the final verdict is not READY, still write the best current final version here.

Step 5.3: Write refine-logs/REFINEMENT_REPORT.md

Refinement Report
Problem: [user's problem]
Initial Approach: [user's vague approach]
Date: [today]
Rounds: N / MAX_ROUNDS
Final Score: X / 10
Final Verdict: [READY / REVISE / RETHINK]
Problem Anchor
[Verbatim anchor used across all rounds]
Output Files
- Review summary: refine-logs/REVIEW_SUMMARY.md
- Final proposal: refine-logs/FINAL_PROPOSAL.md
Score Evolution
| Round | Problem Fidelity | Method Specificity | Contribution Quality | Frontier Leverage | Feasibility | Validation Focus | Venue Readiness | Overall | Verdict |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Round-by-Round Review Record
| Round | Main Reviewer Concerns | What Was Changed | Result |
|---|---|---|---|
| 1 | [top issues] | [main fixes] | [resolved / partial / unresolved] |
| 2 | ... | ... | ... |
Final Proposal Snapshot
- Canonical clean version lives in refine-logs/FINAL_PROPOSAL.md
- Summarize the final thesis in 3-5 bullets here
Method Evolution Highlights
- [Most important simplification or focusing move]
- [Most important mechanism upgrade]
- [Most important modernization or justification for staying simple]
Pushback / Drift Log
| Round | Reviewer Said | Author Response | Outcome |
|---|---|---|---|
| 1 | [criticism] | [pushback + anchor / evidence] | [accepted / rejected] |
Remaining Weaknesses
[Honest unresolved issues]
Raw Reviewer Responses
<details>
<summary>Round 1 Review</summary>
[Full verbatim response from GPT-5.4]
</details>
...
Next Steps
- If READY: proceed to /experiment-plan for a full experiment roadmap, then /run-experiment
- If REVISE: manually address the remaining mechanism weaknesses, then re-run /research-refine
- If RETHINK: revisit the core mechanism, possibly with /idea-creator

Step 5.4: Finalize score-history.md

Ensure it contains the complete score evolution table using the new dimensions.
Step 5.5: Present a Brief Summary to the User
Refinement complete after N rounds.
Final score: X/10 (Verdict: READY / REVISE / RETHINK)
Anchor status:
- [preserved / drift corrected / unresolved concern]
Focus status:
- [tight / slightly broad / still diffuse]
Modernity status:
- [appropriately frontier-aware / intentionally conservative / still old-school]
Key method upgrades:
- [method change 1]
- [method change 2]
Remaining concerns:
- [if any]
Review summary: refine-logs/REVIEW_SUMMARY.md
Full report: refine-logs/REFINEMENT_REPORT.md
Final proposal: refine-logs/FINAL_PROPOSAL.md
Suggested next step: /experiment-plan
Key Rules
- Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
- Anchor first, every round. Always carry forward the same Problem Anchor.
- One paper, one dominant contribution. Avoid multiple parallel contributions unless the paper truly needs them.
- The smallest adequate mechanism wins. Bigger is not automatically better.
- Prefer reuse over invention. Start from strong existing backbones and add only what the bottleneck requires.
- Modern techniques are a prior, not a decoration. Use LLM / VLM / Diffusion / RL-era components when they sharpen the method, not when they only make the proposal sound trendy.
- Minimal experiments. Inside this skill, experiments only need to prove the core claims.
- Review the mechanism, not the parts count. A long module list is not novelty.
- Pushback is encouraged. If reviewer feedback causes drift or unnecessary complexity, argue back with evidence.
- ALWAYS use config: {"model_reasoning_effort": "xhigh"} for all Codex review calls.
- Save threadId from Phase 2 and use mcp__codex__codex-reply for later rounds.
- Do not fabricate results. Only describe expected evidence and planned experiments.
- Be specific about compute and data assumptions. Vague "we'll train a model" is not enough.
- Document everything. Save every raw review, every anchor check, every simplicity check, and every major method change.
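The two Codex rules above can be illustrated with a sketch of the tool-call arguments. The tool names and the model_reasoning_effort setting come from this skill's text; the surrounding field names and the example threadId value are assumptions, not a verified API schema.

```shell
# Hypothetical argument payloads for the Codex MCP review tools.
# Field names and the threadId value are illustrative assumptions.
PHASE2_ARGS='{
  "prompt": "Review the proposal for fidelity, specificity, contribution quality, and frontier leverage.",
  "config": {"model_reasoning_effort": "xhigh"}
}'

# Phase 2 returns a threadId; later rounds pass it to mcp__codex__codex-reply
# so the reviewer keeps the full conversation context across rounds.
PHASE4_ARGS='{
  "threadId": "thread_abc123",
  "prompt": "Re-evaluate the revised proposal against your earlier review."
}'
```

Keeping every round in one thread is what lets the Phase 4 reviewer check whether its Phase 2 criticisms were actually addressed.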
Composing with Other Skills
This skill sits between idea discovery and execution:
/research-refine-pipeline -> one-shot refine + experiment planning
/idea-creator "direction" -> candidate ideas
/research-refine "PROBLEM: ... | APPROACH: ..." <- you are here
/experiment-plan -> detailed experiment roadmap
/run-experiment -> execute the chosen method
/auto-review-loop -> iterate on results and paper

Typical flow:
- /idea-creator or local reading gives you a problem and a vague method direction
- /research-refine turns that into an anchored, elegant, frontier-aware method plan
- /experiment-plan turns the final proposal into a detailed claim-driven experiment roadmap
- /research-refine-pipeline is the one-shot wrapper when the user wants both stages in a single request
- /run-experiment executes the chosen runs
- Later loops operate on results, not just ideas
This skill also works standalone if you already know the problem and just need the method to become concrete.