Turn a vague research direction into a problem-anchored, elegant, frontier-aware, implementation-oriented method plan via iterative GPT-5.4 review. Use when the user says "refine my approach", "帮我细化方案", "decompose this problem", "打磨idea", "refine research plan", "细化研究方案", or wants a concrete research method that stays simple, focused, and top-venue ready instead of a vague or overbuilt idea.
Install:

```shell
npx skill4agent add wanshuiyin/auto-claude-code-research-in-sleep research-refine
```

Pipeline:

User input (PROBLEM + vague APPROACH)
-> Phase 0 (Claude): Freeze Problem Anchor
-> Phase 1 (Claude): Scan grounding papers -> identify technical gap -> choose the sharpest route -> write focused proposal
-> Phase 2 (Codex/GPT-5.4): Review for fidelity, specificity, contribution quality, and frontier leverage
-> Phase 3 (Claude): Anchor check + simplicity check -> revise method -> rewrite full proposal
-> Phase 4 (Codex, same thread): Re-evaluate revised proposal
-> Repeat Phase 3-4 until OVERALL SCORE >= 9 or MAX_ROUNDS reached
-> Phase 5: Save full history to refine-logs/
-> Optional handoff: /experiment-plan for a detailed execution-ready experiment roadmap

Defaults: reviewer model `gpt-5.4`, logs under `refine-logs/`. Override via argument if needed, e.g. `/research-refine "problem | approach" -- max rounds: 3, threshold: 9`
refine-logs/
├── round-0-initial-proposal.md
├── round-1-review.md
├── round-1-refinement.md
├── round-2-review.md
├── round-2-refinement.md
├── ...
├── REVIEW_SUMMARY.md
├── FINAL_PROPOSAL.md
├── REFINEMENT_REPORT.md
└── score-history.md

Per-round refinements are written to `round-N-refinement.md`, grounding papers are read from `papers/` or `literature/`, and the finished proposal hands off to `/experiment-plan`.

Template: `refine-logs/round-0-initial-proposal.md`

# Research Proposal: [Title]
## Problem Anchor
- Bottom-line problem:
- Must-solve bottleneck:
- Non-goals:
- Constraints:
- Success condition:
## Technical Gap
[Why current methods fail, why naive bigger systems are not enough, and what mechanism is missing]
## Method Thesis
- One-sentence thesis:
- Why this is the smallest adequate intervention:
- Why this route is timely in the foundation-model era:
## Contribution Focus
- Dominant contribution:
- Optional supporting contribution:
- Explicit non-contributions:
## Proposed Method
### Complexity Budget
- Frozen / reused backbone:
- New trainable components:
- Tempting additions intentionally not used:
### System Overview
[Step-by-step pipeline or ASCII graph]
### Core Mechanism
- Input / output:
- Architecture or policy:
- Training signal / loss:
- Why this is the main novelty:
### Optional Supporting Component
- Only include if truly necessary:
- Input / output:
- Training signal / loss:
- Why it does not create contribution sprawl:
### Modern Primitive Usage
- Which LLM / VLM / Diffusion / RL-era primitive is used:
- Exact role in the pipeline:
- Why it is more natural than an old-school alternative:
### Integration into Base Generator / Downstream Pipeline
[Where the new method attaches, what is frozen, what is trainable, inference order]
### Training Plan
[Stagewise or joint training, losses, data construction, pseudo-labels, schedules]
### Failure Modes and Diagnostics
- [Failure mode]:
- [How to detect]:
- [Fallback or mitigation]:
### Novelty and Elegance Argument
[Closest work, exact difference, why this is a focused mechanism-level contribution rather than a module pile-up]
## Claim-Driven Validation Sketch
### Claim 1: [Main claim]
- Minimal experiment:
- Baselines / ablations:
- Metric:
- Expected evidence:
### Claim 2: [Optional]
- Minimal experiment:
- Baselines / ablations:
- Metric:
- Expected evidence:
## Experiment Handoff Inputs
- Must-prove claims:
- Must-run ablations:
- Critical datasets / metrics:
- Highest-risk assumptions:
## Compute & Timeline Estimate
- Estimated GPU-hours:
- Data / annotation cost:
- Timeline:

Phase 2 review call:

mcp__codex__codex:
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are a senior ML reviewer for a top venue (NeurIPS/ICML/ICLR).
This is an early-stage, method-first research proposal.
Your job is NOT to reward extra modules, contribution sprawl, or a giant benchmark checklist.
Your job IS to stress-test whether the proposed method:
(1) still solves the original anchored problem,
(2) is concrete enough to implement,
(3) presents a focused, elegant contribution,
(4) uses foundation-model-era techniques appropriately when they are the natural fit.
Review principles:
- Prefer the smallest adequate mechanism over a larger system.
- Penalize parallel contributions that make the paper feel unfocused.
- If a modern LLM / VLM / Diffusion / RL route would clearly produce a better paper, say so concretely.
- If the proposal is already modern enough, do NOT force trendy components.
- Do not ask for extra experiments unless they are needed to prove the core claims.
Read the Problem Anchor first. If your suggested fix would change the problem being solved,
call that out explicitly as drift instead of treating it as a normal revision request.
=== PROPOSAL ===
[Paste the FULL proposal from Phase 1]
=== END PROPOSAL ===
Score these 7 dimensions from 1-10:
1. **Problem Fidelity**: Does the method still attack the original bottleneck, or has it drifted into solving something easier or different?
2. **Method Specificity**: Are the interfaces, representations, losses, training stages, and inference path concrete enough that an engineer could start implementing?
3. **Contribution Quality**: Is there one dominant mechanism-level contribution with real novelty, good parsimony, and no obvious contribution sprawl?
4. **Frontier Leverage**: Does the proposal use current foundation-model-era primitives appropriately when they are the right tool, instead of defaulting to old-school module stacking?
5. **Feasibility**: Can this method be trained and integrated with the stated resources and data assumptions?
6. **Validation Focus**: Are the proposed experiments minimal but sufficient to validate the core claims? Is there unnecessary experimental bloat?
7. **Venue Readiness**: If executed well, would the contribution feel sharp and timely enough for a top venue?
**OVERALL SCORE** (1-10): Weighted toward Problem Fidelity, Method Specificity, Contribution Quality, and Frontier Leverage.
Use this weighting: Problem Fidelity 15%, Method Specificity 25%, Contribution Quality 25%, Frontier Leverage 15%, Feasibility 10%, Validation Focus 5%, Venue Readiness 5%.
For each dimension scoring < 7, provide:
- The specific weakness
- A concrete fix at the method level (interface / loss / training recipe / integration point / deletion of unnecessary parts)
- Priority: CRITICAL / IMPORTANT / MINOR
Then add:
- **Simplification Opportunities**: 1-3 concrete ways to delete, merge, or reuse components while preserving the main claim. Write "NONE" if already tight.
- **Modernization Opportunities**: 1-3 concrete ways to replace old-school pieces with more natural foundation-model-era primitives if genuinely better. Write "NONE" if already modern enough.
- **Drift Warning**: "NONE" if the proposal still solves the anchored problem; otherwise explain the drift clearly.
- **Verdict**: READY / REVISE / RETHINK
Verdict rule:
- READY: overall score >= 9, no meaningful drift, one focused dominant contribution, and no obvious complexity bloat remains
- REVISE: the direction is promising but not yet at READY bar
- RETHINK: the core mechanism or framing is still fundamentally off

Save the returned `threadId` for later rounds. Write the full review to `refine-logs/round-1-review.md` (raw response wrapped in `<details>`), and append the scores to `refine-logs/score-history.md`:

# Score Evolution
| Round | Problem Fidelity | Method Specificity | Contribution Quality | Frontier Leverage | Feasibility | Validation Focus | Venue Readiness | Overall | Verdict |
|-------|------------------|--------------------|----------------------|-------------------|-------------|------------------|-----------------|---------|---------|
| 1 | X | X | X | X | X | X | X | X | REVISE |

Template: `refine-logs/round-N-refinement.md`

# Round N Refinement
## Problem Anchor
[Copy verbatim from round 0]
## Anchor Check
- Original bottleneck:
- Why the revised method still addresses it:
- Reviewer suggestions rejected as drift:
## Simplicity Check
- Dominant contribution after revision:
- Components removed or merged:
- Reviewer suggestions rejected as unnecessary complexity:
- Why the remaining mechanism is still the smallest adequate route:
## Changes Made
### 1. [Method section changed]
- Reviewer said:
- Action:
- Reasoning:
- Impact on core method:
### 2. [Novelty / modernity / feasibility / validation change]
- Reviewer said:
- Action:
- Reasoning:
- Impact on core method:
## Revised Proposal
[Full updated proposal from Problem Anchor through Claim-Driven Validation Sketch]

Phase 4 re-review (same thread):

mcp__codex__codex-reply:
threadId: [saved from Phase 2]
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
[Round N re-evaluation]
I revised the proposal based on your feedback.
First, check whether the original Problem Anchor is still preserved.
Second, judge whether the method is now more concrete, more focused, and more current.
Key changes:
1. [Method change 1]
2. [Method change 2]
3. [Simplification / modernization / pushback if any]
=== REVISED PROPOSAL ===
[Paste the FULL revised proposal]
=== END REVISED PROPOSAL ===
Please:
- Re-score the same 7 dimensions and overall
- State whether the Problem Anchor is preserved or drifted
- State whether the dominant contribution is now sharper or still too broad
- State whether the method is simpler or still overbuilt
- State whether the frontier leverage is now appropriate or still old-school / forced
- Focus new critiques on missing mechanism, weak training signal, weak integration point, pseudo-novelty, or unnecessary complexity
- Use the same verdict rule: READY only if overall score >= 9 and no blocking issue remains
Same output format: 7 scores, overall score, verdict, drift warning, simplification opportunities, modernization opportunities, remaining action items.

Save each re-review to `refine-logs/round-N-review.md`.

Template: `refine-logs/REVIEW_SUMMARY.md`

# Review Summary
**Problem**: [user's problem]
**Initial Approach**: [user's vague approach]
**Date**: [today]
**Rounds**: N / MAX_ROUNDS
**Final Score**: X / 10
**Final Verdict**: [READY / REVISE / RETHINK]
## Problem Anchor
[Verbatim anchor used across all rounds]
## Round-by-Round Resolution Log
| Round | Main Reviewer Concerns | What This Round Simplified / Modernized | Solved? | Remaining Risk |
|-------|-------------------------|------------------------------------------|---------|----------------|
| 1 | [top issues from review] | [main method changes] | [yes / partial / no] | [if any] |
| 2 | ... | ... | ... | ... |
## Overall Evolution
- [How the method became more concrete]
- [How the dominant contribution became more focused]
- [How unnecessary complexity was removed]
- [How modern technical leverage improved or stayed intentionally minimal]
- [How drift was avoided or corrected]
## Final Status
- Anchor status: [preserved / corrected / unresolved]
- Focus status: [tight / slightly broad / still diffuse]
- Modernity status: [appropriately frontier-aware / intentionally conservative / still old-school]
- Strongest parts of final method:
- Remaining weaknesses:

Template: `refine-logs/FINAL_PROPOSAL.md`

# Research Proposal: [Title]
[Paste the final refined proposal only]

Template: `refine-logs/REFINEMENT_REPORT.md`

# Refinement Report
**Problem**: [user's problem]
**Initial Approach**: [user's vague approach]
**Date**: [today]
**Rounds**: N / MAX_ROUNDS
**Final Score**: X / 10
**Final Verdict**: [READY / REVISE / RETHINK]
## Problem Anchor
[Verbatim anchor used across all rounds]
## Output Files
- Review summary: `refine-logs/REVIEW_SUMMARY.md`
- Final proposal: `refine-logs/FINAL_PROPOSAL.md`
## Score Evolution
| Round | Problem Fidelity | Method Specificity | Contribution Quality | Frontier Leverage | Feasibility | Validation Focus | Venue Readiness | Overall | Verdict |
|-------|------------------|--------------------|----------------------|-------------------|-------------|------------------|-----------------|---------|---------|
| 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... |
## Round-by-Round Review Record
| Round | Main Reviewer Concerns | What Was Changed | Result |
|-------|-------------------------|------------------|--------|
| 1 | [top issues] | [main fixes] | [resolved / partial / unresolved] |
| 2 | ... | ... | ... |
## Final Proposal Snapshot
- Canonical clean version lives in `refine-logs/FINAL_PROPOSAL.md`
- Summarize the final thesis in 3-5 bullets here
## Method Evolution Highlights
1. [Most important simplification or focusing move]
2. [Most important mechanism upgrade]
3. [Most important modernization or justification for staying simple]
## Pushback / Drift Log
| Round | Reviewer Said | Author Response | Outcome |
|-------|---------------|-----------------|---------|
| 1 | [criticism] | [pushback + anchor / evidence] | [accepted / rejected] |
## Remaining Weaknesses
[Honest unresolved issues]
## Raw Reviewer Responses
<details>
<summary>Round 1 Review</summary>
[Full verbatim response from GPT-5.4]
</details>
...
## Next Steps
- If READY: proceed to `/experiment-plan` for a full experiment roadmap, then `/run-experiment`
- If REVISE: manually address the remaining mechanism weaknesses, then re-run `/research-refine`
- If RETHINK: revisit the core mechanism, possibly with `/idea-creator`

Final summary printed to the user (full score history stays in `score-history.md`):

Refinement complete after N rounds.
Final score: X/10 (Verdict: READY / REVISE / RETHINK)
Anchor status:
- [preserved / drift corrected / unresolved concern]
Focus status:
- [tight / slightly broad / still diffuse]
Modernity status:
- [appropriately frontier-aware / intentionally conservative / still old-school]
Key method upgrades:
- [method change 1]
- [method change 2]
Remaining concerns:
- [if any]
Review summary: refine-logs/REVIEW_SUMMARY.md
Full report: refine-logs/REFINEMENT_REPORT.md
Final proposal: refine-logs/FINAL_PROPOSAL.md
Suggested next step: /experiment-plan

Tips:
- Write log files with `cat << 'EOF' > file` heredocs
- Always pass `config: {"model_reasoning_effort": "xhigh"}` to the reviewer
- Reuse the saved `threadId` via `mcp__codex__codex-reply` so every re-review stays in the same thread

Related commands:

/research-refine-pipeline -> one-shot refine + experiment planning
/idea-creator "direction" -> candidate ideas
/research-refine "PROBLEM: ... | APPROACH: ..." <- you are here
/experiment-plan -> detailed experiment roadmap
/run-experiment -> execute the chosen method
/auto-review-loop -> iterate on results and paper
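Each log artifact above is plain markdown, so a phase can persist it verbatim with a heredoc, as the heredoc tip suggests. A minimal sketch (the file name and body here are illustrative):

```shell
mkdir -p refine-logs

# Quoting 'EOF' disables expansion, so reviewer text containing
# $variables or backticks is written out exactly as received.
cat << 'EOF' > refine-logs/round-1-review.md
# Round 1 Review
(verbatim reviewer response pasted here)
EOF
```

The quoted delimiter matters: an unquoted `EOF` would let the shell expand `$`-sequences inside the pasted review.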