result-to-claim

Result-to-Claim Gate


Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a Codex judgment, then auto-route based on the verdict.

Context: $ARGUMENTS


When to Use


  • After a set of experiments completes (main results, not just sanity checks)
  • Before committing to claims in a paper or review response
  • When results are ambiguous and you need an objective second opinion

Workflow


Step 1: Collect Results


Gather experiment data from whatever sources are available in the project:
  1. W&B (preferred): wandb.Api().run("<entity>/<project>/<run_id>").history() for metrics, training curves, comparisons
  2. EXPERIMENT_LOG.md: full results table with baselines and verdicts
  3. EXPERIMENT_TRACKER.md: check which experiments are DONE vs still running
  4. Log files: ssh server "tail -100 /path/to/training.log" (only if no other source is available)
  5. docs/research_contract.md: intended claims and experiment design
Assemble the key information:
  • What experiments were run (method, dataset, config)
  • Main metrics and baseline comparisons (deltas)
  • The intended claim these experiments were designed to test
  • Any known confounds or caveats
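
The assembled information can be held in one small structure so that the later judgment prompt is built from a single source. The following is a minimal sketch only: the names `Experiment`, `EvidenceBundle`, and `delta` are hypothetical and not part of the original workflow, and the values in the usage lines are placeholders, not real results.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One completed run: what was tested and how it compares to a baseline."""
    method: str
    dataset: str
    metric: str
    value: float
    baseline: float
    baseline_source: str = "reproduced"  # or "from paper"

    @property
    def delta(self) -> float:
        # Difference from the baseline; positive means the method scored higher.
        return self.value - self.baseline


@dataclass
class EvidenceBundle:
    """Hypothetical container for the Step 1 output: claim, runs, caveats."""
    intended_claim: str
    experiments: list = field(default_factory=list)
    caveats: list = field(default_factory=list)

    def datasets(self) -> set:
        # Useful for spotting single-dataset evidence before making broad claims.
        return {e.dataset for e in self.experiments}


# Placeholder values for illustration only.
bundle = EvidenceBundle(
    intended_claim="Method A improves accuracy over the baseline",
    experiments=[Experiment("A", "dataset-1", "accuracy", 0.75, 0.5)],
    caveats=["single dataset", "one seed"],
)
```

Keeping the baseline source next to each number makes it harder to lose track of whether a comparison was reproduced or copied from a paper.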

Step 2: Codex Judgment


Send the collected results to Codex for objective evaluation:
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources — reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.
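
Filling the template above by hand is error-prone, so the prompt text can be generated from the collected evidence. This is a rough sketch under assumptions: `build_prompt` and `_section` are hypothetical helper names, and the function only builds the prompt string; how it is passed to the mcp__codex__codex tool is outside this example.

```python
EVALUATION_REQUEST = """Please evaluate:
1. claim_supported: yes | partial | no
2. what_results_support: what the data actually shows
3. what_results_dont_support: where the data falls short of the claim
4. missing_evidence: specific evidence gaps
5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
6. next_experiments_needed: specific experiments to fill gaps (if any)
7. confidence: high | medium | low

Be honest. Do not inflate claims beyond what the data supports.
A single positive result on one dataset does not support a general claim."""


def _section(title, items):
    # Show an explicit marker for empty sections so gaps stay visible to the judge.
    body = "\n".join(f"- {item}" for item in items) or "- (none provided)"
    return f"{title}:\n{body}"


def build_prompt(claim, experiments, results, baselines, caveats):
    """Fill the RESULT-TO-CLAIM template with the evidence from Step 1."""
    return "\n\n".join([
        "RESULT-TO-CLAIM EVALUATION",
        "I need you to judge whether experimental results support the intended claim.",
        f"Intended claim: {claim}",
        _section("Experiments run", experiments),
        _section("Results", results),
        _section("Baselines", baselines),
        _section("Known caveats", caveats),
        EVALUATION_REQUEST,
    ])
```

Rendering an empty section as "(none provided)" rather than omitting it is a deliberate choice in this sketch: a missing baseline is itself evidence the judge should see.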

Step 3: Parse and Normalize


Extract structured fields from the Codex response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
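
One possible way to do this extraction is a line-based parser. The sketch below assumes the response lists each field on its own line as `name: value`, optionally prefixed by a number or bullet; `parse_verdict` is a hypothetical name, and multi-line field values are not handled (only the first line of each value is kept). Values outside the allowed sets are normalized to `None` instead of being guessed.

```python
import re

FIELDS = (
    "claim_supported", "what_results_support", "what_results_dont_support",
    "missing_evidence", "suggested_claim_revision", "next_experiments_needed",
    "confidence",
)
ALLOWED = {
    "claim_supported": {"yes", "partial", "no"},
    "confidence": {"high", "medium", "low"},
}

# Matches lines such as "1. claim_supported: partial" or "- confidence: low".
_LINE = re.compile(r"^\s*(?:\d+\.|[-*•])?\s*\**(\w+)\**\s*:\s*(.*)$")


def parse_verdict(text: str) -> dict:
    """Extract the seven fields; missing or invalid values become None."""
    parsed = {name: None for name in FIELDS}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match or match.group(1) not in parsed:
            continue
        name = match.group(1)
        value = match.group(2).strip(' *"')  # drop quotes and bold markers
        if name in ALLOWED:
            value = value.lower().rstrip(".")
            if value not in ALLOWED[name]:
                value = None  # e.g. an echoed "yes | partial | no" template
        parsed[name] = value or None
    return parsed
```

Treating an unparseable verdict as `None` keeps the routing step from silently acting on a malformed response.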

Step 4: Route Based on Verdict


no: Claim not supported


  1. Record postmortem in findings.md (Research Findings section):
    • What was tested, what failed, hypotheses for why
    • Constraints for future attempts (what NOT to try again)
  2. Update CLAUDE.md Pipeline Status
  3. Decide whether to pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach

partial: Claim partially supported


  1. Update the working claim to reflect what IS supported
  2. Record the gap in findings.md
  3. Design and run supplementary experiments to fill evidence gaps
  4. Re-run result-to-claim after supplementary experiments complete
  5. Multiple rounds of partial on the same claim → record analysis in findings.md, then consider whether to narrow the claim scope or switch ideas

yes
— Claim supported


  1. Record confirmed claim in project notes
  2. If ablation studies are incomplete → trigger /ablation-planner
  3. If all evidence is in → ready for paper writing
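
The three branches of Step 4 can be summarized as a small dispatch function. This is an illustrative sketch, not the pipeline's actual implementation: `route` and `MAX_PARTIAL_ROUNDS` are hypothetical names, the threshold of 2 for "multiple rounds" is an assumption (the text gives no number), and `ablations_complete` stands in for "all evidence is in".

```python
MAX_PARTIAL_ROUNDS = 2  # assumed threshold; the workflow only says "multiple rounds"


def route(claim_supported, ablations_complete=True, partial_rounds=1):
    """Return the ordered follow-up actions for a Step 3 verdict."""
    if claim_supported == "no":
        return [
            "record postmortem in findings.md",
            "update CLAUDE.md Pipeline Status",
            "pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach",
        ]
    if claim_supported == "partial":
        actions = [
            "update the working claim to what is supported",
            "record the gap in findings.md",
            "design and run supplementary experiments",
            "re-run result-to-claim",
        ]
        if partial_rounds >= MAX_PARTIAL_ROUNDS:
            # Repeated partial verdicts: step back instead of looping forever.
            actions.append("record analysis in findings.md; narrow scope or switch ideas")
        return actions
    if claim_supported == "yes":
        actions = ["record confirmed claim in project notes"]
        if ablations_complete:
            actions.append("ready for paper writing")
        else:
            actions.append("trigger /ablation-planner")
        return actions
    # Anything else (including None from a failed parse) must not be routed.
    raise ValueError(f"unknown verdict: {claim_supported!r}")
```

Raising on an unknown verdict, rather than defaulting to a branch, avoids quietly rounding a malformed judgment up to "yes".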

Rules


  • Codex is the judge, not CC (Claude Code). CC collects evidence and routes; Codex evaluates. This prevents post-hoc rationalization.
  • Do not inflate claims beyond what the data supports. If Codex says "partial", do not round up to "yes".
  • A single positive result on one dataset does not support a general claim. Be honest about scope.
  • If confidence is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
  • If Codex MCP is unavailable (call fails), CC makes its own judgment and marks it [pending Codex review]; do not block the pipeline.
  • Always record the verdict and reasoning in findings.md, regardless of outcome.
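
Two of these rules, the fallback when Codex is unavailable and the handling of low confidence, can be expressed as a thin wrapper around the judgment call. The sketch below is an assumption-laden illustration: `judge`, `call_codex`, and `cc_judgment` are hypothetical names, both callables are expected to return an already-parsed verdict dict, and any exception from `call_codex` is treated as "unavailable".

```python
def judge(call_codex, cc_judgment):
    """Prefer the Codex verdict; fall back to CC's own judgment on failure."""
    try:
        verdict = dict(call_codex())
        verdict["source"] = "codex"
    except Exception:  # any call failure counts as "unavailable" in this sketch
        verdict = dict(cc_judgment())
        verdict["source"] = "cc"
        verdict["note"] = "[pending Codex review]"  # do not block the pipeline
    # A low-confidence verdict is inconclusive: add experiments, do not commit.
    verdict["conclusive"] = verdict.get("confidence") != "low"
    return verdict


# Usage with stand-in callables (placeholders, not real Codex output).
def codex_down():
    raise RuntimeError("MCP call failed")


def cc_own_view():
    return {"claim_supported": "partial", "confidence": "medium"}


fallback = judge(codex_down, cc_own_view)
```

Recording the `source` and the pending-review note on the verdict itself means the entry written to findings.md carries that context without extra bookkeeping.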