paper-claim-audit


Paper Claim Audit: Zero-Context Evidence Verification

Verify that every claim in the paper matches raw evidence for: $ARGUMENTS

Why This Exists

The executor writes experiments AND writes the paper. It "knows" what the results should be. This creates confirmation bias:
  • Rounding 84.7% up to 85.3%
  • Reporting best seed instead of average
  • Citing metrics from a different experiment config
  • Claiming "improves by 15%" when the delta is actually 12.8%
A fresh reviewer with zero prior context catches these because it has no expectations — it just compares paper text vs raw files.
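The fourth bullet above is a relative-vs-absolute confusion; a minimal sketch of the arithmetic check, using the numbers from the audit prompt below:

```python
baseline, method = 73.1, 85.3
absolute_gain = method - baseline                     # 12.2 points
relative_gain = (method - baseline) / baseline * 100  # ~16.7%
# A paper claiming "improves by 15%" matches neither figure;
# the auditor must pin down which definition the paper intends.
```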

How This Differs From Other Audit Skills

| Skill | Question it answers |
| --- | --- |
| /experiment-audit | Is the experiment code honest? (fake GT, normalization fraud) |
| /result-to-claim | Does the data scientifically support this claim? |
| /paper-claim-audit | Does the paper report the data truthfully and precisely? |

Core Principle

Zero-context, fresh reviewer. The auditor receives ONLY:
  • Paper .tex files (the claims)
  • Raw result files (the evidence)
It does NOT receive:
  • ❌ EXPERIMENT_LOG.md
  • ❌ EXPERIMENT_TRACKER.md
  • ❌ AUTO_REVIEW.md
  • ❌ NARRATIVE_REPORT.md
  • ❌ Any executor summary or interpretation
  • ❌ Any prior audit results
  • ❌ Any conversation history
This is stricter than reviewer-independence — it's zero-context evidence audit.

Workflow

Step 1: Collect Files (Executor — Claude)

Locate paper and result files WITHOUT reading or interpreting them.
Paper files (claims) — paths shown relative to the shell's working directory so you can find them with `ls`; when writing them into `audited_input_hashes`, use paths relative to the paper dir (no `paper/` prefix) per the "Submission Artifact Emission" section below:
paper/main.tex                # → hash key: main.tex
paper/sections/*.tex          # → hash key: sections/*.tex
paper/tables/*.tex (if separate)   # → hash key: tables/*.tex
Result files (evidence):
results/*.json, results/*.jsonl, results/*.csv, results/*.tsv
outputs/*.json, outputs/*.csv
wandb-summary.json (if exists)
**/metrics.json, **/eval_results.json
**/config.yaml, **/args.json (experiment configs)
Exclude (no summaries, no interpretations):
EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, AUTO_REVIEW*.md
NARRATIVE_REPORT.md, PAPER_PLAN.md, findings.md
Any .md file that is an executor-written summary
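A minimal collection sketch in Python, assuming the glob patterns and exclusion names listed above (the helper name is illustrative, not part of the skill):

```python
from pathlib import Path

PAPER_GLOBS = ["paper/main.tex", "paper/sections/*.tex", "paper/tables/*.tex"]
RESULT_GLOBS = [
    "results/*.json", "results/*.jsonl", "results/*.csv", "results/*.tsv",
    "outputs/*.json", "outputs/*.csv", "wandb-summary.json",
    "**/metrics.json", "**/eval_results.json", "**/config.yaml", "**/args.json",
]
# Executor-written summaries must never reach the reviewer.
EXCLUDED = {"EXPERIMENT_LOG.md", "EXPERIMENT_TRACKER.md",
            "NARRATIVE_REPORT.md", "PAPER_PLAN.md", "findings.md"}

def collect_files(root: str) -> tuple[list[Path], list[Path]]:
    base = Path(root)
    papers = sorted({p for g in PAPER_GLOBS for p in base.glob(g)})
    results = sorted({
        p for g in RESULT_GLOBS for p in base.glob(g)
        if p.name not in EXCLUDED and not p.name.startswith("AUTO_REVIEW")
    })
    return papers, results
```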

Step 2: Fresh Reviewer Audit (GPT-5.4 — NEW thread, no reply)

CRITICAL: Use `mcp__codex__codex` (new thread), NEVER `mcp__codex__codex-reply`.
Every run must be a fresh context.
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are a paper-to-evidence auditor. You have ZERO prior context about
    this research. You will receive only paper source files and raw result
    files. Your job is to verify that every number in the paper exactly
    matches the raw evidence.

    Paper files to read:
    [list .tex file paths]

    Result files to read:
    [list .json/.csv/.yaml file paths]

    ## Audit Protocol

    ### A. Extract Every Quantitative Claim
    For each number, percentage, comparison, or scope statement in the paper:
    - Location (section, table, caption, or inline text)
    - Exact claim text
    - The number or comparison being made

    ### B. Trace Each Claim to Evidence
    For each extracted claim, find the supporting raw data:
    - Which result file contains this number?
    - What is the EXACT value in that file?
    - Match status: exact_match / rounding_ok / mismatch

    ### C. Check These Specific Failure Modes

    1. **Number inflation**: Paper says 85.3%, raw file says 84.7%
       Rule: only standard rounding to displayed precision is allowed

    2. **Best-seed cherry-pick**: Paper says "achieves 90.2%" but
       that's the best of 5 seeds; mean is 87.1%
       Rule: check if paper specifies "average" / "best" / "median"

    3. **Config mismatch**: Paper compares Method A vs Baseline B,
       but they used different hyperparameters / datasets / splits
       Rule: verify config files show same settings for compared methods

    4. **Aggregation mismatch**: Paper says "average over 5 seeds"
       but result files show only 3 runs
       Rule: count actual runs vs claimed count

    5. **Delta error**: Paper says "improves by 15%" but
       actual delta is (85.3 - 73.1) / 73.1 = 16.7%
       Rule: verify arithmetic of all relative improvements

    6. **Caption-table mismatch**: Figure caption describes
       something different from what the figure/table actually shows
       Rule: cross-check every caption against its content

    7. **Scope overclaim**: Paper says "consistently outperforms"
       but only tested on 2 datasets
       Rule: check if language matches actual evaluation scope

    ## Output Format (per claim)
    For each claim, report:
    - claim_id: sequential number
    - location: section/table/figure
    - paper_text: exact quote from paper
    - paper_value: the number claimed
    - evidence_file: which raw file
    - evidence_value: the actual number
    - status: exact_match | rounding_ok | ambiguous_mapping |
              missing_evidence | config_mismatch | aggregation_mismatch |
              number_mismatch | scope_overclaim | unsupported_claim
    - details: explanation if not exact_match

    Overall verdict: PASS | WARN | FAIL

Step 3: Write Report (Executor — Claude)

Parse the reviewer's response and write `PAPER_CLAIM_AUDIT.md`:

Paper Claim Audit Report

Date: [today]
Auditor: GPT-5.4 xhigh (fresh zero-context thread)
Paper: [paper title from tex]

Overall Verdict: [PASS | WARN | FAIL]

Claims Verified: [N total]

  • exact_match: [count]
  • rounding_ok: [count]
  • ambiguous_mapping: [count]
  • missing_evidence: [count]
  • mismatch: [count]

Issues Found

[FAIL/WARN] Claim #N: [description]

  • Location: Section X / Table Y / Figure Z
  • Paper says: "..."
  • Evidence shows: ...
  • Status: [status]
  • Fix: [specific correction needed]

All Claims (detailed)

| # | Location | Paper Value | Evidence Value | Status |
| --- | --- | --- | --- | --- |
| 1 | Table 2 | 85.3% | 85.28% | rounding_ok |
| 2 | Abstract | "15% improvement" | 12.8% | number_mismatch |
| ... | | | | |

Also write `PAPER_CLAIM_AUDIT.json` for machine consumption.
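A minimal emission sketch for the JSON sibling, assuming the schema fields from the "Submission Artifact Emission" section below (values are illustrative):

```python
import json
from datetime import datetime, timezone

artifact = {
    "audit_skill": "paper-claim-audit",
    "verdict": "WARN",
    "reason_code": "rounding_drift",
    "summary": "24 claims checked; 3 rounded, none materially mismatched.",
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "details": {"total_claims": 24, "mismatches": [], "result_files": []},
}
with open("paper/PAPER_CLAIM_AUDIT.json", "w") as f:
    json.dump(artifact, f, indent=2)
```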

Step 4: Print Summary

📋 Paper Claim Audit Complete

  Claims verified: 24
  exact_match:     18
  rounding_ok:      3
  ambiguous:         1
  ⚠️ mismatch:      2

  Overall: ⚠️ WARN

  See PAPER_CLAIM_AUDIT.md for details.

When to Run

  1. After `/paper-write` — first check before improvement loop
  2. After `/auto-paper-improvement-loop` — recheck if improvement loop changed numbers
  3. Before submission — final verification

Integration with Other Skills

Read by `/auto-paper-improvement-loop` (if exists)

if PAPER_CLAIM_AUDIT.json exists:
    read mismatched claims
    fix them as priority items in the improvement round
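A sketch of that consumer step in Python, assuming the JSON schema from the "Submission Artifact Emission" section below:

```python
import json
from pathlib import Path

audit_path = Path("paper/PAPER_CLAIM_AUDIT.json")
if audit_path.exists():
    audit = json.loads(audit_path.read_text())
    # Mismatched claims become priority fix items for the improvement round.
    priority_fixes = audit.get("details", {}).get("mismatches", [])
```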

Advisory, Never Blocking

Same pattern as `/experiment-audit`:
  • PASS → continue normally
  • WARN → print warning, continue, flag draft as "check numbers before submission"
  • FAIL → print alert, continue, but do NOT mark as submission-ready

Key Rules

  • Fresh thread EVERY run. Never use `codex-reply`. Never carry context.
  • Zero executor interpretation. Only file paths. No summaries.
  • Only raw results. No EXPERIMENT_LOG, no AUTO_REVIEW, no human summaries.
  • Rounding rule. Only standard rounding to displayed precision: 84.7% → 84.7% or 85% is OK; 84.7% → 85.3% is NOT OK (see the sketch after this list).
  • Cross-model. Reviewer must be a different model family from the executor.
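A minimal sketch of the rounding rule in Python (note that `round()` uses banker's rounding at exact .5 ties):

```python
def is_standard_rounding(paper_value: float, evidence_value: float,
                         displayed_decimals: int) -> bool:
    """True iff the paper's value is the evidence value rounded to the displayed precision."""
    return round(evidence_value, displayed_decimals) == paper_value

assert is_standard_rounding(85.0, 84.7, 0)       # 84.7 shown as 85 is OK
assert not is_standard_rounding(85.3, 84.7, 1)   # 84.7 shown as 85.3 is not
```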

Review Tracing

After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).

Submission Artifact Emission

This skill always writes `paper/PAPER_CLAIM_AUDIT.json`, regardless of caller or detector outcome. A detector-negative run (paper has no numeric claims) emits verdict `NOT_APPLICABLE`; a paper-with-numeric-claims-but-no-raw-results run emits `BLOCKED`. Silent skip is forbidden — `paper-writing` Phase 6 and `tools/verify_paper_audits.sh` both rely on this artifact existing at a predictable path.
The artifact conforms to the schema in `shared-references/assurance-contract.md`:
```json
{
  "audit_skill":      "paper-claim-audit",
  "verdict":          "PASS | WARN | FAIL | NOT_APPLICABLE | BLOCKED | ERROR",
  "reason_code":      "all_numbers_match | rounding_drift | missing_raw_results | ...",
  "summary":          "One-line human-readable verdict summary.",
  "audited_input_hashes": {
    "main.tex":                              "sha256:...",
    "sections/5.evidence.tex":               "sha256:...",
    "/abs/path/to/results/run_2026_04_19.json": "sha256:..."
  },
  "trace_path":       ".aris/traces/paper-claim-audit/<date>_run<NN>/",
  "thread_id":        "<codex mcp thread id>",
  "reviewer_model":   "gpt-5.4",
  "reviewer_reasoning": "xhigh",
  "generated_at":     "<UTC ISO-8601>",
  "details": {
    "total_claims":   <int>,
    "mismatches":     [ ... per-claim issue records ... ],
    "result_files":   [ ... raw files consulted ... ]
  }
}
```

`audited_input_hashes` scope

Hash the declared input set passed into this audit invocation — i.e. the exact `.tex` files and raw result / config files this run read — not a repo-wide union and not the reviewer's self-reported subset. If a caller passed only `main.tex` + a single result file, hash those two files and no others. The external verifier rehashes these entries; any mismatch flags `STALE`.
Path convention (must match what `tools/verify_paper_audits.sh` expects): keys are paths relative to the paper directory (the arg passed to the verifier) for in-paper files — so `main.tex`, not `paper/main.tex` — and absolute paths for out-of-paper files such as external `results/` dirs. The verifier resolves relative entries via `os.path.join(paper_dir, key)`; prefixing with `paper/` produces `paper/paper/main.tex` and false-fails as `STALE`.
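A minimal sketch of this hashing convention, assuming Python (helper names are illustrative):

```python
import hashlib
import os

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

def audited_input_hashes(paper_dir: str, files: list[str]) -> dict[str, str]:
    """Keys relative to paper_dir for in-paper files, absolute otherwise."""
    paper_abs = os.path.abspath(paper_dir)
    hashes = {}
    for path in files:
        abs_path = os.path.abspath(path)
        if abs_path.startswith(paper_abs + os.sep):
            key = os.path.relpath(abs_path, paper_abs)   # e.g. "main.tex"
        else:
            key = abs_path                               # e.g. "/abs/.../run.json"
        hashes[key] = sha256_of(abs_path)
    return hashes
```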

Verdict decision table

| Input state | Verdict | reason_code example |
| --- | --- | --- |
| No numeric claims detected in paper | NOT_APPLICABLE | no_numeric_claims |
| Numeric claims detected, no raw result files found | BLOCKED | no_raw_evidence |
| All claims reconcile to raw data | PASS | all_numbers_match |
| Minor rounding drift only, no material mismatch | WARN | rounding_drift |
| Any material mismatch (wrong number, config mismatch) | FAIL | claim_mismatch |
| Reviewer invocation failed (network / malformed) | ERROR | reviewer_error |
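A sketch of that decision order in Python (the input flags are illustrative; the table above is authoritative):

```python
def decide_verdict(reviewer_failed: bool, has_numeric_claims: bool,
                   has_raw_results: bool, material_mismatches: int,
                   rounding_drifts: int) -> tuple[str, str]:
    # Checks are ordered so the most blocking condition wins.
    if reviewer_failed:
        return "ERROR", "reviewer_error"
    if not has_numeric_claims:
        return "NOT_APPLICABLE", "no_numeric_claims"
    if not has_raw_results:
        return "BLOCKED", "no_raw_evidence"
    if material_mismatches > 0:
        return "FAIL", "claim_mismatch"
    if rounding_drifts > 0:
        return "WARN", "rounding_drift"
    return "PASS", "all_numbers_match"
```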

Thread independence

Every invocation uses a fresh `mcp__codex__codex` thread. Never `codex-reply`. Do not accept prior audit outputs (PROOF_AUDIT, CITATION_AUDIT, EXPERIMENT_LOG, AUTO_REVIEW summaries) as input to this audit — the fresh thread preserves reviewer independence per `shared-references/reviewer-independence.md`.

Human-readable sibling

`paper/PAPER_CLAIM_AUDIT.md` is written alongside the JSON for readers. The JSON is authoritative for `tools/verify_paper_audits.sh`; the Markdown is for humans. The parent skill (`paper-writing` Phase 6) plus the verifier decide whether the verdict blocks finalization — this skill itself never blocks; it only emits.