result-to-claim
Result-to-Claim Gate
Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a Codex judgment, then auto-route based on the verdict.
Context: $ARGUMENTS
When to Use
- After a set of experiments completes (main results, not just sanity checks)
- Before committing to claims in a paper or review response
- When results are ambiguous and you need an objective second opinion
Workflow
Step 1: Collect Results
Gather experiment data from whatever sources are available in the project:
- W&B (preferred): wandb.Api().run("<entity>/<project>/<run_id>").history() for metrics, training curves, comparisons
- EXPERIMENT_LOG.md: full results table with baselines and verdicts
- EXPERIMENT_TRACKER.md: check which experiments are DONE vs still running
- Log files: ssh server "tail -100 /path/to/training.log" if no other source
- docs/research_contract.md: intended claims and experiment design
Assemble the key information:
- What experiments were run (method, dataset, config)
- Main metrics and baseline comparisons (deltas)
- The intended claim these experiments were designed to test
- Any known confounds or caveats
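The assembled information can be held in one small structure so that baseline deltas are computed rather than typed by hand. This is a minimal sketch: the class name `EvidenceBundle`, its field names, and the numbers in the usage example are illustrative placeholders, not part of the skill.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceBundle:
    """Key information assembled in Step 1 (hypothetical field names)."""
    intended_claim: str
    experiments: list[str]
    results: dict[str, float]      # metric name -> our number
    baselines: dict[str, float]    # metric name -> baseline number
    caveats: list[str] = field(default_factory=list)

    def deltas(self) -> dict[str, float]:
        """Return result minus baseline for every metric that has a baseline."""
        return {
            name: round(value - self.baselines[name], 6)
            for name, value in self.results.items()
            if name in self.baselines
        }


# Placeholder values for illustration only, not real results.
bundle = EvidenceBundle(
    intended_claim="Method A improves accuracy over the baseline",
    experiments=["method A on dataset X, default config"],
    results={"accuracy": 0.75, "f1": 0.5},
    baselines={"accuracy": 0.5},
    caveats=["single dataset", "one seed"],
)
print(bundle.deltas())  # {'accuracy': 0.25}
```

Metrics without a baseline (here `f1`) are left out of the deltas, which makes missing comparisons visible before the judgment step.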
Step 2: Codex Judgment
Send the collected results to Codex for objective evaluation:
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources: reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.
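Filling the bracketed placeholders by hand is error-prone, so the evidence half of the prompt can be generated. The sketch below covers only the sections from "Intended claim" through "Known caveats"; the "Please evaluate" instructions would be appended unchanged. The helper names `bullet` and `build_prompt` are assumptions made for illustration.

```python
PROMPT_TEMPLATE = """RESULT-TO-CLAIM EVALUATION

I need you to judge whether experimental results support the intended claim.

Intended claim: {claim}

Experiments run:
{experiments}

Results:
{results}

Baselines:
{baselines}

Known caveats:
{caveats}
"""


def bullet(items):
    """Render a list as '- item' lines, or an explicit marker when it is empty."""
    if not items:
        return "- (none reported)"
    return "\n".join(f"- {item}" for item in items)


def build_prompt(claim, experiments, results, baselines, caveats):
    """Fill the evidence sections of the evaluation prompt (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(
        claim=claim,
        experiments=bullet(experiments),
        results=bullet(results),
        baselines=bullet(baselines),
        caveats=bullet(caveats),
    )


# Placeholder inputs for illustration only.
prompt = build_prompt(
    claim="Method A improves accuracy over the baseline",
    experiments=["method A on dataset X"],
    results=["accuracy 0.75 (delta +0.25)"],
    baselines=["accuracy 0.50, reproduced"],
    caveats=[],
)
print(prompt)
```

An empty caveat list is rendered as "(none reported)" rather than dropped, so the judge can see that the section was considered and left blank on purpose.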
Step 3: Parse and Normalize
Extract these structured fields from the Codex response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
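One way to do the extraction is a line-based parser. This sketch assumes the response puts each field on its own line as `key: value`, optionally prefixed by a list marker or number; that layout is an assumption, and multi-line values are not handled. The function name `parse_verdict` is hypothetical.

```python
import re

FIELDS = (
    "claim_supported",
    "what_results_support",
    "what_results_dont_support",
    "missing_evidence",
    "suggested_claim_revision",
    "next_experiments_needed",
    "confidence",
)
# Fields with a closed vocabulary are normalized to lowercase and validated.
ALLOWED = {
    "claim_supported": {"yes", "partial", "no"},
    "confidence": {"high", "medium", "low"},
}
# Optional "-", "*" or "1." prefix, then "key: value".
LINE_RE = re.compile(r"^\s*(?:[-*]|\d+\.)?\s*(\w+)\s*:\s*(.*)$")


def parse_verdict(text):
    """Extract known fields; enum fields become None when missing or invalid."""
    parsed = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(1) in FIELDS:
            parsed[match.group(1)] = match.group(2).strip().strip('"')
    for key, allowed in ALLOWED.items():
        value = parsed.get(key, "").lower()
        parsed[key] = value if value in allowed else None
    return parsed


sample = '1. claim_supported: Partial\n- confidence: low\nmissing_evidence: "second dataset"'
print(parse_verdict(sample))
```

Returning `None` for an unrecognized verdict, instead of guessing, keeps the routing step from acting on a malformed response.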
Step 4: Route Based on Verdict
no: Claim not supported
- Record postmortem in findings.md (Research Findings section):
  - What was tested, what failed, hypotheses for why
  - Constraints for future attempts (what NOT to try again)
- Update CLAUDE.md Pipeline Status
- Decide whether to pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach
partial: Claim partially supported
- Update the working claim to reflect what IS supported
- Record the gap in findings.md
- Design and run supplementary experiments to fill evidence gaps
- Re-run result-to-claim after supplementary experiments complete
- Multiple rounds of partial on the same claim → record analysis in findings.md and consider whether to narrow the claim scope or switch ideas
yes: Claim supported
- Record confirmed claim in project notes
- If ablation studies are incomplete → trigger /ablation-planner
- If all evidence is in → ready for paper writing
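The three branches above amount to a lookup from verdict to follow-up actions. A minimal sketch, assuming the actions are tracked as plain strings and that the `ablations_complete` flag is supplied by the caller (both are illustrative choices, not part of the skill):

```python
ROUTES = {
    "no": [
        "record postmortem in findings.md (Research Findings section)",
        "update CLAUDE.md Pipeline Status",
        "pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach",
    ],
    "partial": [
        "update the working claim to reflect what is supported",
        "record the gap in findings.md",
        "design and run supplementary experiments",
        "re-run result-to-claim after supplementary experiments complete",
    ],
    "yes": [
        "record confirmed claim in project notes",
    ],
}


def route(verdict, ablations_complete=False):
    """Return the ordered follow-up actions for a normalized verdict."""
    if verdict not in ROUTES:
        raise ValueError(f"unknown verdict: {verdict!r}")
    actions = list(ROUTES[verdict])  # copy so callers cannot mutate the table
    if verdict == "yes":
        actions.append(
            "ready for paper writing" if ablations_complete
            else "trigger /ablation-planner"
        )
    return actions


for step in route("yes"):
    print(step)
```

Raising on an unknown verdict is deliberate: a response that could not be normalized should stop the pipeline at this point rather than fall through to a default branch.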
Rules
- Codex is the judge, not CC. CC collects evidence and routes; Codex evaluates. This prevents post-hoc rationalization.
- Do not inflate claims beyond what the data supports. If Codex says "partial", do not round up to "yes".
- A single positive result on one dataset does not support a general claim. Be honest about scope.
- If confidence is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
- If Codex MCP is unavailable (call fails), CC makes its own judgment and marks it [pending Codex review]; do not block the pipeline.
- Always record the verdict and reasoning in findings.md, regardless of outcome.
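The low-confidence and fallback rules can be expressed as a thin wrapper around the judgment call. In this sketch `call_codex` and `local_judgment` are hypothetical callables supplied by the caller, each returning a dict of parsed fields; the `source` and `inconclusive` keys are likewise assumptions for illustration.

```python
def judge_with_fallback(call_codex, local_judgment):
    """Prefer the Codex verdict; fall back to a marked local one if the call fails."""
    try:
        verdict = dict(call_codex())
        verdict["source"] = "codex"
    except Exception:
        # Do not block the pipeline: use the local judgment, clearly marked.
        verdict = dict(local_judgment())
        verdict["source"] = "[pending Codex review]"
    # Low confidence means the judgment is treated as inconclusive.
    verdict["inconclusive"] = verdict.get("confidence") == "low"
    return verdict


def unavailable():
    raise ConnectionError("Codex MCP unavailable")


def local():
    return {"claim_supported": "partial", "confidence": "low"}


result = judge_with_fallback(unavailable, local)
print(result["source"], result["inconclusive"])
```

Whatever the source, the returned verdict and its reasoning would still be written to findings.md, as the last rule requires.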