# paper-claim-audit
Zero-context verification that every number, comparison, and scope claim in the paper matches raw result files. Uses a fresh cross-model reviewer with NO prior context to prevent confirmation bias. Use when user says "审查论文数据", "check paper claims", "verify numbers", "论文数字核对", or before submission to ensure paper-to-evidence fidelity.
## Install

`npx skill4agent add wanshuiyin/auto-claude-code-research-in-sleep paper-claim-audit`

## How it relates to sibling audit skills

| Skill | Question it answers |
|---|---|
| experiment-audit | Is the experiment code honest? (fake GT, normalization fraud) |
| | Does the data scientifically support this claim? |
| paper-claim-audit | Does the paper report the data truthfully and precisely? |
## Inputs

Paper sources (hashed into `audited_input_hashes`):

paper/main.tex                    # → hash key: main.tex
paper/sections/*.tex              # → hash key: sections/*.tex
paper/tables/*.tex (if separate)  # → hash key: tables/*.tex

Raw result files (the evidence):

results/*.json, results/*.jsonl, results/*.csv, results/*.tsv
outputs/*.json, outputs/*.csv
wandb-summary.json (if exists)
**/metrics.json, **/eval_results.json
**/config.yaml, **/args.json (experiment configs)

Executor-written summaries are NOT passed to the reviewer, so the thread stays zero-context:

EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, AUTO_REVIEW*.md
NARRATIVE_REPORT.md, PAPER_PLAN.md, findings.md
Any .md file that is an executor-written summary

## Reviewer invocation

Uses `mcp__codex__codex` to start a fresh thread and `mcp__codex__codex-reply` for follow-ups:

mcp__codex__codex:
model: gpt-5.4
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are a paper-to-evidence auditor. You have ZERO prior context about
this research. You will receive only paper source files and raw result
files. Your job is to verify that every number in the paper exactly
matches the raw evidence.
Paper files to read:
[list .tex file paths]
Result files to read:
[list .json/.csv/.yaml file paths]
## Audit Protocol
### A. Extract Every Quantitative Claim
For each number, percentage, comparison, or scope statement in the paper:
- Location (section, table, caption, or inline text)
- Exact claim text
- The number or comparison being made
### B. Trace Each Claim to Evidence
For each extracted claim, find the supporting raw data:
- Which result file contains this number?
- What is the EXACT value in that file?
- Match status: exact_match / rounding_ok / mismatch
### C. Check These Specific Failure Modes
1. **Number inflation**: Paper says 85.3%, raw file says 84.7%
Rule: only standard rounding to displayed precision is allowed
2. **Best-seed cherry-pick**: Paper says "achieves 90.2%" but
that's the best of 5 seeds; mean is 87.1%
Rule: check if paper specifies "average" / "best" / "median"
3. **Config mismatch**: Paper compares Method A vs Baseline B,
but they used different hyperparameters / datasets / splits
Rule: verify config files show same settings for compared methods
4. **Aggregation mismatch**: Paper says "average over 5 seeds"
but result files show only 3 runs
Rule: count actual runs vs claimed count
5. **Delta error**: Paper says "improves by 15%" but
actual delta is (85.3 - 73.1) / 73.1 = 16.7%
Rule: verify arithmetic of all relative improvements
6. **Caption-table mismatch**: Figure caption describes
something different from what the figure/table actually shows
Rule: cross-check every caption against its content
7. **Scope overclaim**: Paper says "consistently outperforms"
but only tested on 2 datasets
Rule: check if language matches actual evaluation scope
## Output Format (per claim)
For each claim, report:
- claim_id: sequential number
- location: section/table/figure
- paper_text: exact quote from paper
- paper_value: the number claimed
- evidence_file: which raw file
- evidence_value: the actual number
- status: exact_match | rounding_ok | ambiguous_mapping |
missing_evidence | config_mismatch | aggregation_mismatch |
number_mismatch | scope_overclaim | unsupported_claim
- details: explanation if not exact_match
Overall verdict: PASS | WARN | FAIL

The audit report is written to PAPER_CLAIM_AUDIT.md:

# Paper Claim Audit Report
**Date**: [today]
**Auditor**: GPT-5.4 xhigh (fresh zero-context thread)
**Paper**: [paper title from tex]
## Overall Verdict: [PASS | WARN | FAIL]
## Claims Verified: [N total]
- exact_match: [count]
- rounding_ok: [count]
- ambiguous_mapping: [count]
- missing_evidence: [count]
- mismatch: [count]
## Issues Found
### [FAIL/WARN] Claim #N: [description]
- **Location**: Section X / Table Y / Figure Z
- **Paper says**: "..."
- **Evidence shows**: ...
- **Status**: [status]
- **Fix**: [specific correction needed]
## All Claims (detailed)
| # | Location | Paper Value | Evidence Value | Status |
|---|----------|-------------|---------------|--------|
| 1 | Table 2 | 85.3% | 85.28% | rounding_ok |
| 2 | Abstract | "15% improvement" | 12.8% | number_mismatch |
| ... | ... | ... | ... | ... |

A machine-readable PAPER_CLAIM_AUDIT.json is written alongside the report (schema below). The console summary looks like:

📋 Paper Claim Audit Complete
Claims verified: 24
exact_match: 18
rounding_ok: 3
ambiguous: 1
⚠️ mismatch: 2
Overall: ⚠️ WARN
See PAPER_CLAIM_AUDIT.md for details.

## Integration with /paper-write/auto-paper-improvement-loop

The improvement loop consumes this audit's output:

if PAPER_CLAIM_AUDIT.json exists:
    read mismatched claims
    fix them as priority items in the improvement round

Related skill: /experiment-audit (experiment-code honesty, as opposed to paper-to-evidence fidelity).

## Tracing

- Verdicts: PASS / WARN / FAIL (plus NOT_APPLICABLE / BLOCKED; see the contract below)
- Reviewer thread: mcp__codex__codex, follow-ups via mcp__codex__codex-reply
- Trace conventions: shared-references/review-tracing.md; traces saved by tools/save_trace.sh under .aris/traces/<skill>/<date>_run<NN>/ (--- trace:full)

## Assurance contract: paper/PAPER_CLAIM_AUDIT.json

Consumed by the paper-writing skill via tools/verify_paper_audits.sh (see shared-references/assurance-contract.md):

{
"audit_skill": "paper-claim-audit",
"verdict": "PASS | WARN | FAIL | NOT_APPLICABLE | BLOCKED | ERROR",
"reason_code": "all_numbers_match | rounding_drift | missing_raw_results | ...",
"summary": "One-line human-readable verdict summary.",
"audited_input_hashes": {
"main.tex": "sha256:...",
"sections/5.evidence.tex": "sha256:...",
"/abs/path/to/results/run_2026_04_19.json": "sha256:..."
},
"trace_path": ".aris/traces/paper-claim-audit/<date>_run<NN>/",
"thread_id": "<codex mcp thread id>",
"reviewer_model": "gpt-5.4",
"reviewer_reasoning": "xhigh",
"generated_at": "<UTC ISO-8601>",
"details": {
"total_claims": <int>,
"mismatches": [ ... per-claim issue records ... ],
"result_files": [ ... raw files consulted ... ]
}
}

Hash keys in audited_input_hashes: .tex files are keyed by their path relative to paper/ (so main.tex means paper/main.tex), while result files under results/ are keyed by absolute path. If any hashed input changes after the audit, the verdict is treated as STALE; tools/verify_paper_audits.sh resolves each relative key as os.path.join(paper_dir, key).

| Input state | Verdict |
|---|---|
| No numeric claims detected in paper | NOT_APPLICABLE |
| Numeric claims detected, no raw result files found | BLOCKED |
| All claims reconcile to raw data | PASS |
| Minor rounding drift only, no material mismatch | WARN |
| Any material mismatch (wrong number, config mismatch) | FAIL |
| Reviewer invocation failed (network / malformed) | ERROR |
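As a concrete illustration of the boundary between rounding_ok, number_mismatch, and a delta error, the two arithmetic rules from the audit protocol can be sketched in a few lines. This is a hypothetical helper for illustration only; `rounding_ok` and `relative_improvement` are not part of the skill's shipped tooling:

```python
def rounding_ok(paper_value: float, raw_value: float, decimals: int) -> bool:
    """True when the paper number is just the raw number rounded to the
    precision displayed in the paper (e.g. 85.28 reported as 85.3)."""
    return round(raw_value, decimals) == round(paper_value, decimals)

def relative_improvement(new: float, baseline: float) -> float:
    """Relative delta in percent, for auditing "improves by X%" claims."""
    return (new - baseline) / baseline * 100.0

# 85.28 reported as 85.3 passes; 84.7 reported as 85.3 is number inflation.
# (85.3 - 73.1) / 73.1 * 100 ≈ 16.7, so a paper claiming "improves by 15%"
# has a delta error even though it understates the improvement.
```

Any failure of the first check maps to number_mismatch; a wrong relative delta maps to the delta-error rule regardless of direction.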
See also: shared-references/reviewer-independence.md, paper/PAPER_CLAIM_AUDIT.md, tools/verify_paper_audits.sh, and the paper-writing skill.
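For staleness checking, tools/verify_paper_audits.sh has to recompute every entry in audited_input_hashes. A minimal sketch of that check, assuming the key-resolution behavior described in the contract (the shipped script may differ):

```python
import hashlib
import os

def file_sha256(path: str) -> str:
    # Same "sha256:<hexdigest>" format as the audited_input_hashes values.
    with open(path, "rb") as f:
        return "sha256:" + hashlib.sha256(f.read()).hexdigest()

def find_stale(audited_hashes: dict, paper_dir: str) -> list:
    """Hash keys whose files changed or disappeared since the audit ran.
    Relative keys (the .tex files) resolve under paper_dir; absolute
    result-file paths are used as-is."""
    stale = []
    for key, recorded in audited_hashes.items():
        path = key if os.path.isabs(key) else os.path.join(paper_dir, key)
        if not os.path.exists(path) or file_sha256(path) != recorded:
            stale.append(key)
    return stale
```

A non-empty return corresponds to the STALE state: the audit no longer vouches for the current paper and must be re-run before the paper-writing gate passes.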