paper-claim-audit
# Paper Claim Audit: Zero-Context Evidence Verification
Verify that every claim in the paper matches raw evidence for: $ARGUMENTS
## Why This Exists
The executor writes experiments AND writes the paper. It "knows" what the results should be. This creates confirmation bias:
- Rounding 84.7% up to 85.3%
- Reporting best seed instead of average
- Citing metrics from a different experiment config
- Claiming "improves by 15%" when the delta is actually 12.8%
A fresh reviewer with zero prior context catches these because it has no expectations — it just compares paper text vs raw files.
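The delta-error case above is purely arithmetic, so it can be machine-checked. A minimal sketch (the helper name is illustrative, not part of this skill's tooling):

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

# A paper claiming "improves by 15%" on raw values 85.3 vs 73.1
# actually understates the delta:
print(round(relative_improvement(85.3, 73.1), 1))  # 16.7
```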
## How This Differs From Other Audit Skills
| Skill | Question it answers |
|---|---|
| `/experiment-audit` | Is the experiment code honest? (fake GT, normalization fraud) |
| | Does the data scientifically support this claim? |
| `/paper-claim-audit` (this skill) | Does the paper report the data truthfully and precisely? |
## Core Principle
Zero-context, fresh reviewer. The auditor receives ONLY:
- Paper .tex files (the claims)
- Raw result files (the evidence)
It does NOT receive:
- ❌ EXPERIMENT_LOG.md
- ❌ EXPERIMENT_TRACKER.md
- ❌ AUTO_REVIEW.md
- ❌ NARRATIVE_REPORT.md
- ❌ Any executor summary or interpretation
- ❌ Any prior audit results
- ❌ Any conversation history
This is stricter than reviewer-independence — it's zero-context evidence audit.
## Workflow

### Step 1: Collect Files (Executor — Claude)
Locate paper and result files WITHOUT reading or interpreting them.

Paper files (claims) — paths shown relative to the shell's working directory so you can find them with `ls`; when writing them into `audited_input_hashes`, use paths relative to the paper dir (no `paper/` prefix) per the "Submission Artifact Emission" section below:

```text
paper/main.tex                    # → hash key: main.tex
paper/sections/*.tex              # → hash key: sections/*.tex
paper/tables/*.tex (if separate)  # → hash key: tables/*.tex
```

Result files (evidence):

```text
results/*.json, results/*.jsonl, results/*.csv, results/*.tsv
outputs/*.json, outputs/*.csv
wandb-summary.json (if exists)
**/metrics.json, **/eval_results.json
**/config.yaml, **/args.json (experiment configs)
```

Exclude (no summaries, no interpretations):

```text
EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, AUTO_REVIEW*.md
NARRATIVE_REPORT.md, PAPER_PLAN.md, findings.md
Any .md file that is an executor-written summary
```
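The collection step above can be sketched in Python (glob patterns mirror the lists above; `collect_audit_files` is a hypothetical helper, not an existing tool in this repo):

```python
from pathlib import Path

# Executor-written summaries that must never reach the fresh reviewer.
EXCLUDED_SUMMARIES = {"EXPERIMENT_LOG.md", "EXPERIMENT_TRACKER.md",
                      "NARRATIVE_REPORT.md", "PAPER_PLAN.md", "findings.md"}

EVIDENCE_PATTERNS = ("**/*.json", "**/*.jsonl", "**/*.csv",
                     "**/*.tsv", "**/*.yaml")

def collect_audit_files(root: str) -> dict:
    """Return paper claim files and raw evidence files under `root`."""
    root_path = Path(root)
    paper = sorted(str(p) for p in root_path.glob("paper/**/*.tex"))
    evidence = sorted(
        str(p)
        for pattern in EVIDENCE_PATTERNS
        for p in root_path.glob(pattern)
        # evidence lives outside paper/
        if "paper" not in p.relative_to(root_path).parts
    )
    # Belt and braces: drop any summary file that slipped in.
    evidence = [p for p in evidence if Path(p).name not in EXCLUDED_SUMMARIES]
    return {"paper": paper, "evidence": evidence}
```

Only the resulting file paths are passed to the reviewer; no content is read or summarized here.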
### Step 2: Fresh Reviewer Audit (GPT-5.4 — NEW thread, no reply)
CRITICAL: Use `mcp__codex__codex` (new thread), NEVER `mcp__codex__codex-reply`. Every run must be a fresh context.

```yaml
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are a paper-to-evidence auditor. You have ZERO prior context about
    this research. You will receive only paper source files and raw result
    files. Your job is to verify that every number in the paper exactly
    matches the raw evidence.

    Paper files to read:
    [list .tex file paths]

    Result files to read:
    [list .json/.csv/.yaml file paths]

    ## Audit Protocol

    ### A. Extract Every Quantitative Claim
    For each number, percentage, comparison, or scope statement in the paper:
    - Location (section, table, caption, or inline text)
    - Exact claim text
    - The number or comparison being made

    ### B. Trace Each Claim to Evidence
    For each extracted claim, find the supporting raw data:
    - Which result file contains this number?
    - What is the EXACT value in that file?
    - Match status: exact_match / rounding_ok / mismatch

    ### C. Check These Specific Failure Modes
    1. **Number inflation**: Paper says 85.3%, raw file says 84.7%
       Rule: only standard rounding to displayed precision is allowed
    2. **Best-seed cherry-pick**: Paper says "achieves 90.2%" but
       that's the best of 5 seeds; mean is 87.1%
       Rule: check if paper specifies "average" / "best" / "median"
    3. **Config mismatch**: Paper compares Method A vs Baseline B,
       but they used different hyperparameters / datasets / splits
       Rule: verify config files show same settings for compared methods
    4. **Aggregation mismatch**: Paper says "average over 5 seeds"
       but result files show only 3 runs
       Rule: count actual runs vs claimed count
    5. **Delta error**: Paper says "improves by 15%" but
       actual delta is (85.3 - 73.1) / 73.1 = 16.7%
       Rule: verify arithmetic of all relative improvements
    6. **Caption-table mismatch**: Figure caption describes
       something different from what the figure/table actually shows
       Rule: cross-check every caption against its content
    7. **Scope overclaim**: Paper says "consistently outperforms"
       but only tested on 2 datasets
       Rule: check if language matches actual evaluation scope

    ## Output Format (per claim)
    For each claim, report:
    - claim_id: sequential number
    - location: section/table/figure
    - paper_text: exact quote from paper
    - paper_value: the number claimed
    - evidence_file: which raw file
    - evidence_value: the actual number
    - status: exact_match | rounding_ok | ambiguous_mapping |
      missing_evidence | config_mismatch | aggregation_mismatch |
      number_mismatch | scope_overclaim | unsupported_claim
    - details: explanation if not exact_match

    Overall verdict: PASS | WARN | FAIL
```

### Step 3: Write Report (Executor — Claude)
Parse the reviewer's response and write `PAPER_CLAIM_AUDIT.md`:

```markdown
# Paper Claim Audit Report

Date: [today]
Auditor: GPT-5.4 xhigh (fresh zero-context thread)
Paper: [paper title from tex]

Overall Verdict: [PASS | WARN | FAIL]

Claims Verified: [N total]
- exact_match: [count]
- rounding_ok: [count]
- ambiguous_mapping: [count]
- missing_evidence: [count]
- mismatch: [count]

## Issues Found

### [FAIL/WARN] Claim #N: [description]
- Location: Section X / Table Y / Figure Z
- Paper says: "..."
- Evidence shows: ...
- Status: [status]
- Fix: [specific correction needed]

## All Claims (detailed)

| # | Location | Paper Value | Evidence Value | Status |
|---|---|---|---|---|
| 1 | Table 2 | 85.3% | 85.28% | rounding_ok |
| 2 | Abstract | "15% improvement" | 12.8% | number_mismatch |
| ... |
```

Also write `PAPER_CLAIM_AUDIT.json` for machine consumption.
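Rendering the detailed claims table from the reviewer's per-claim records can be sketched as follows (record keys follow the prompt's output format; the function name is illustrative):

```python
def claims_table(claims: list[dict]) -> str:
    """Render reviewer claim records as the Markdown table used in the report."""
    lines = ["| # | Location | Paper Value | Evidence Value | Status |",
             "|---|---|---|---|---|"]
    for c in claims:
        lines.append(f"| {c['claim_id']} | {c['location']} | {c['paper_value']} "
                     f"| {c['evidence_value']} | {c['status']} |")
    return "\n".join(lines)
```

Feeding it the reviewer's records yields rows such as `| 1 | Table 2 | 85.3% | 85.28% | rounding_ok |`.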
### Step 4: Print Summary
```text
📋 Paper Claim Audit Complete
Claims verified: 24
  exact_match: 18
  rounding_ok: 3
  ambiguous: 1
  ⚠️ mismatch: 2
Overall: ⚠️ WARN
See PAPER_CLAIM_AUDIT.md for details.
```

## When to Run
- After `/paper-write` — first check before improvement loop
- After `/auto-paper-improvement-loop` — recheck if improvement loop changed numbers
- Before submission — final verification
## Integration with Other Skills
Read by `/auto-paper-improvement-loop` (if it exists):

```text
if PAPER_CLAIM_AUDIT.json exists:
    read mismatched claims
    fix them as priority items in the improvement round
```

## Advisory, Never Blocking
Same pattern as `/experiment-audit`:

- PASS → continue normally
- WARN → print warning, continue, flag draft as "check numbers before submission"
- FAIL → print alert, continue, but do NOT mark as submission-ready
## Key Rules
- Fresh thread EVERY run. Never use `codex-reply`. Never carry context.
- Zero executor interpretation. Only file paths. No summaries.
- Only raw results. No EXPERIMENT_LOG, no AUTO_REVIEW, no human summaries.
- Rounding rule. Only standard rounding to displayed precision. 84.7% → 84.7% or 85% is OK. 84.7% → 85.3% is NOT OK.
- Cross-model. Reviewer must be a different model family from executor.
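The rounding rule can be checked mechanically. A sketch (hypothetical helper; assumes the displayed precision is given as a decimal-place count):

```python
def is_valid_rounding(paper_value: float, evidence_value: float,
                      decimals: int) -> bool:
    """True iff paper_value equals evidence_value rounded to `decimals` places."""
    return round(evidence_value, decimals) == round(paper_value, decimals)

print(is_valid_rounding(84.7, 84.7, 1))  # True: value reported as-is
print(is_valid_rounding(85.0, 84.7, 0))  # True: 84.7 → 85 at integer precision
print(is_valid_rounding(85.3, 84.7, 1))  # False: inflation, not rounding
```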
## Review Tracing
After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `trace` parameter (default: `full`).

## Submission Artifact Emission
This skill always writes `paper/PAPER_CLAIM_AUDIT.json`, regardless of caller or detector outcome. A detector-negative run (paper has no numeric claims) emits verdict `NOT_APPLICABLE`; a paper-with-numeric-claims-but-no-raw-results run emits `BLOCKED`. Silent skip is forbidden — `paper-writing` Phase 6 and `tools/verify_paper_audits.sh` both rely on this artifact existing at a predictable path.

The artifact conforms to the schema in `shared-references/assurance-contract.md`:

```json
{
  "audit_skill": "paper-claim-audit",
  "verdict": "PASS | WARN | FAIL | NOT_APPLICABLE | BLOCKED | ERROR",
  "reason_code": "all_numbers_match | rounding_drift | missing_raw_results | ...",
  "summary": "One-line human-readable verdict summary.",
  "audited_input_hashes": {
    "main.tex": "sha256:...",
    "sections/5.evidence.tex": "sha256:...",
    "/abs/path/to/results/run_2026_04_19.json": "sha256:..."
  },
  "trace_path": ".aris/traces/paper-claim-audit/<date>_run<NN>/",
  "thread_id": "<codex mcp thread id>",
  "reviewer_model": "gpt-5.4",
  "reviewer_reasoning": "xhigh",
  "generated_at": "<UTC ISO-8601>",
  "details": {
    "total_claims": <int>,
    "mismatches": [ ... per-claim issue records ... ],
    "result_files": [ ... raw files consulted ... ]
  }
}
```

### `audited_input_hashes` scope
Hash the declared input set passed into this audit invocation — i.e. the exact `.tex` files and raw result / config files this run read — not a repo-wide union and not the reviewer's self-reported subset. If a caller passed only `main.tex` + a single result file, hash those two files and no others. The external verifier rehashes these entries; any mismatch flags `STALE`.

Path convention (must match what `tools/verify_paper_audits.sh` expects): keys are paths relative to the paper directory (the arg passed to the verifier) for in-paper files — so `main.tex`, not `paper/main.tex` — and absolute paths for out-of-paper files such as external `results/` dirs. The verifier resolves relative entries via `os.path.join(paper_dir, key)`; prefixing with `paper/` produces `paper/paper/main.tex` and false-fails as STALE.
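The hashing and key convention can be sketched as follows (illustrative helper, not the verifier's actual code; assumes Python's `hashlib` and `pathlib`):

```python
import hashlib
from pathlib import Path

def hash_inputs(paper_dir: str, files: list[str]) -> dict:
    """sha256 each declared input; key in-paper files relative to paper_dir."""
    base = Path(paper_dir).resolve()
    hashes = {}
    for f in files:
        p = Path(f).resolve()
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        try:
            key = str(p.relative_to(base))  # e.g. "main.tex", never "paper/main.tex"
        except ValueError:
            key = str(p)                    # out-of-paper files keep absolute paths
        hashes[key] = f"sha256:{digest}"
    return hashes
```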
### Verdict decision table
| Input state | Verdict |
|---|---|
| No numeric claims detected in paper | NOT_APPLICABLE |
| Numeric claims detected, no raw result files found | BLOCKED |
| All claims reconcile to raw data | PASS |
| Minor rounding drift only, no material mismatch | WARN |
| Any material mismatch (wrong number, config mismatch) | FAIL |
| Reviewer invocation failed (network / malformed) | ERROR |
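The table's precedence can be sketched as a small function (names are illustrative; assumes claim extraction and file discovery have already run):

```python
def decide_verdict(num_claims: int, num_result_files: int,
                   material_mismatches: int, rounding_drift: int,
                   reviewer_error: bool = False) -> str:
    """Map audit input state to an overall verdict, per the decision table."""
    if reviewer_error:
        return "ERROR"           # reviewer invocation failed
    if num_claims == 0:
        return "NOT_APPLICABLE"  # nothing numeric to audit
    if num_result_files == 0:
        return "BLOCKED"         # claims exist but no raw evidence
    if material_mismatches > 0:
        return "FAIL"
    if rounding_drift > 0:
        return "WARN"
    return "PASS"
```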
### Thread independence
Every `mcp__codex__codex` invocation uses a fresh thread. Never `codex-reply`. Do not accept prior audit outputs (PROOF_AUDIT, CITATION_AUDIT, EXPERIMENT_LOG, AUTO_REVIEW summaries) as input to this audit — the fresh thread preserves reviewer independence per `shared-references/reviewer-independence.md`.

### Human-readable sibling
Alongside the JSON artifact, write `paper/PAPER_CLAIM_AUDIT.md` for human readers; `tools/verify_paper_audits.sh` and the `paper-writing` phase consume the JSON.