result-to-claim

Result-to-Claim Gate

Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a Codex judgment, then auto-route based on the verdict.

Context: $ARGUMENTS

When to Use

After a set of experiments completes (main results, not just sanity checks)
Before committing to claims in a paper or review response
When results are ambiguous and you need an objective second opinion

Workflow

Step 1: Collect Results

Gather experiment data from whatever sources are available in the project:

W&B (preferred): `wandb.Api().run("<entity>/<project>/<run_id>").history()` for metrics, training curves, and comparisons
EXPERIMENT_LOG.md: full results table with baselines and verdicts
EXPERIMENT_TRACKER.md: check which experiments are DONE vs still running
Log files: `ssh server "tail -100 /path/to/training.log"` if no other source is available
docs/research_contract.md: intended claims and experiment design

Assemble the key information:

What experiments were run (method, dataset, config)
Main metrics and baseline comparisons (deltas)
The intended claim these experiments were designed to test
Any known confounds or caveats
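
The assembled information can be held in one small structure before it is sent for judgment. The following is a minimal sketch, not part of the original workflow: the `Evidence` class, its field names, and the `deltas` helper are all hypothetical, and it assumes each metric is a single number that can be compared directly against a baseline of the same name.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical container for the information assembled in Step 1."""

    intended_claim: str
    experiments: list[str]        # one entry per run, e.g. "method / dataset / config"
    metrics: dict[str, float]     # main metric name -> our value
    baselines: dict[str, float]   # metric name -> baseline value
    caveats: list[str] = field(default_factory=list)

    def deltas(self) -> dict[str, float]:
        """Ours minus baseline, for every metric that has a baseline."""
        return {
            name: value - self.baselines[name]
            for name, value in self.metrics.items()
            if name in self.baselines
        }
```

Metrics without a matching baseline are left out of `deltas()` rather than guessed, which makes a missing comparison visible as a caveat instead of hiding it.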

Step 2: Codex Judgment

Send the collected results to Codex for objective evaluation:

mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources — reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.
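
Filling the bracketed placeholders by hand is error-prone, so the evidence section of the prompt above could be generated. This is a sketch under assumptions: `build_prompt` and its parameters are hypothetical names, and it only produces the part of the prompt up to "Known caveats"; the fixed "Please evaluate" questions and the closing instructions would be appended unchanged.

```python
PROMPT_HEADER = """\
RESULT-TO-CLAIM EVALUATION

I need you to judge whether experimental results support the intended claim.

Intended claim: {claim}

Experiments run:
{experiments}

Results:
{results}

Baselines:
{baselines}

Known caveats:
{caveats}
"""


def _bullets(items):
    # Render a list as "- item" lines; say so explicitly when it is empty.
    return "\n".join(f"- {item}" for item in items) or "- (none reported)"


def build_prompt(claim, experiments, results, baselines, caveats):
    """Fill the evidence section of the result-to-claim prompt."""
    return PROMPT_HEADER.format(
        claim=claim,
        experiments=_bullets(experiments),
        results=_bullets(results),
        baselines=_bullets(baselines),
        caveats=_bullets(caveats),
    )
```

Writing "(none reported)" for an empty list keeps the absence of caveats or baselines explicit in what the judge sees.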

Step 3: Parse and Normalize

Extract the structured fields from the Codex response:

- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
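
One way to extract and normalize these fields is a line-oriented parser. The sketch below is an assumption about the response layout, not a documented format: it expects each field on its own line as `name: value`, optionally preceded by a list marker, and `parse_verdict` is a hypothetical helper name. It is deliberately strict about the two enumerated fields so that a malformed verdict fails loudly instead of being routed.

```python
import re

FIELDS = (
    "claim_supported",
    "what_results_support",
    "what_results_dont_support",
    "missing_evidence",
    "suggested_claim_revision",
    "next_experiments_needed",
    "confidence",
)
ALLOWED = {
    "claim_supported": {"yes", "partial", "no"},
    "confidence": {"high", "medium", "low"},
}


def parse_verdict(text: str) -> dict[str, str]:
    """Extract the seven fields; free-text fields default to an empty string."""
    result = {}
    for name in FIELDS:
        # Accept an optional list marker ("1.", "-", "*") before the field name.
        pattern = rf"^\s*(?:\d+\.|[-*])?\s*{name}\s*:\s*(.+)$"
        match = re.search(pattern, text, re.MULTILINE)
        value = match.group(1).strip().strip('"') if match else ""
        if name in ALLOWED:
            value = value.lower()
            if value not in ALLOWED[name]:
                raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}, got {value!r}")
        result[name] = value
    return result
```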

Step 4: Route Based on Verdict

`no`: Claim not supported

Record postmortem in findings.md (Research Findings section):
- What was tested, what failed, hypotheses for why
- Constraints for future attempts (what NOT to try again)
Update CLAUDE.md Pipeline Status
Decide whether to pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach

`partial`: Claim partially supported

Update the working claim to reflect what IS supported
Record the gap in findings.md
Design and run supplementary experiments to fill evidence gaps
Re-run result-to-claim after supplementary experiments complete
Multiple rounds of `partial` on the same claim → record analysis in findings.md, and consider whether to narrow the claim scope or switch ideas

`yes`: Claim supported

Record confirmed claim in project notes
If ablation studies are incomplete → trigger `/ablation-planner`
If all evidence is in → ready for paper writing
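
The three branches above can be summarized as a small routing function. This is a sketch only: `route` and the action strings are hypothetical, and letting a low `confidence` override the verdict is one reading of the workflow's rule about inconclusive judgments, not something the branches themselves state.

```python
def route(claim_supported: str, confidence: str, ablations_complete: bool = False) -> list[str]:
    """Return the follow-up actions for a parsed verdict, in order."""
    # The verdict is always recorded, whatever the outcome.
    actions = ["record verdict and reasoning in findings.md"]
    if confidence == "low":
        # Assumption: low confidence makes the judgment inconclusive.
        actions.append("treat judgment as inconclusive and add experiments")
        return actions
    if claim_supported == "no":
        actions += [
            "record postmortem in findings.md",
            "update CLAUDE.md Pipeline Status",
            "decide: pivot to next idea or try an alternative approach",
        ]
    elif claim_supported == "partial":
        actions += [
            "update working claim to what is supported",
            "record the gap in findings.md",
            "design and run supplementary experiments",
            "re-run result-to-claim",
        ]
    elif claim_supported == "yes":
        actions.append("record confirmed claim in project notes")
        if ablations_complete:
            actions.append("ready for paper writing")
        else:
            actions.append("trigger /ablation-planner")
    else:
        raise ValueError(f"unknown verdict: {claim_supported!r}")
    return actions
```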

Rules

Codex is the judge, not CC. CC collects evidence and routes; Codex evaluates. This prevents post-hoc rationalization.
Do not inflate claims beyond what the data supports. If Codex says "partial", do not round up to "yes".
A single positive result on one dataset does not support a general claim. Be honest about scope.
If `confidence` is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
If Codex MCP is unavailable (call fails), CC makes its own judgment and marks it `[pending Codex review]`; do not block the pipeline.
Always record the verdict and reasoning in findings.md, regardless of outcome.
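
The fallback rule for an unavailable Codex MCP can be sketched as a wrapper around the judging call. Everything here is illustrative: `judge_with_fallback`, `codex_judge`, and `local_judge` are hypothetical names, the two judges are plain callables supplied by the caller, and catching every `Exception` is a simplification of "call fails".

```python
from typing import Callable

PENDING_TAG = "[pending Codex review]"


def judge_with_fallback(
    prompt: str,
    codex_judge: Callable[[str], str],
    local_judge: Callable[[str], str],
) -> str:
    """Prefer the Codex judgment; fall back to a tagged local one on failure."""
    try:
        return codex_judge(prompt)
    except Exception:
        # The Codex call failed: judge locally and tag the result so it is
        # re-reviewed later, instead of blocking the pipeline.
        return f"{PENDING_TAG} {local_judge(prompt)}"
```

Tagging the fallback text itself means the marker travels with the verdict when it is written to findings.md.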
