# Experiment Audit: Cross-Model Integrity Verification
Audit experiment integrity for: **$ARGUMENTS**
## Why This Exists
LLM agents can produce fraudulent experimental results through:
- Fake ground truth — creating synthetic "reference" from model outputs, then reporting high agreement as performance
- Score normalization — dividing metrics by the model's own max to get 0.99+
- Phantom results — claiming numbers from files that don't exist or functions never called
- Insufficient scope — reporting 2-scene pilots as "comprehensive evaluation"
These are NOT intentional deception — they are failure modes of optimizing agents that lack integrity constraints. This skill adds that constraint.
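The score-normalization failure mode is easiest to see concretely. A minimal sketch (all values invented) of why dividing by the model's own maximum manufactures near-perfect numbers:

```python
# Anti-pattern: normalize a metric by the model's OWN output statistics.
# The best prediction then maps to 1.0 by construction, regardless of quality.
raw_scores = [0.21, 0.47, 0.89]  # invented per-scene raw metric values

self_normalized = [s / max(raw_scores) for s in raw_scores]
assert self_normalized[-1] == 1.0  # looks like a headline result; it is an artifact

# Honest reporting: keep the raw scores, and only normalize against a bound
# defined by the dataset or benchmark, never by the predictions themselves.
print("raw:", raw_scores, "self-normalized:", self_normalized)
```

Check B of the audit below flags exactly this pattern: a normalization denominator derived from prediction statistics.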
## Core Principle
The executor (Claude) collects file paths. The reviewer (GPT-5.4) reads code and judges integrity. The executor does NOT participate in integrity judgment.
This follows `shared-references/reviewer-independence.md` and `shared-references/experiment-integrity.md`.

## Constants
- `REVIEWER_BACKEND = codex` — Default: Codex MCP (xhigh). Override with `reviewer: oracle-pro` for GPT-5.4 Pro via Oracle MCP. See `shared-references/reviewer-routing.md`.
## Workflow
### Step 1: Collect Artifacts (Executor — Claude)
Locate and list these files WITHOUT reading or summarizing their content:
Scan project directory for:
1. Evaluation scripts: *eval*.py, *metric*.py, *test*.py, *benchmark*.py
2. Result files: *.json, *.csv in results/, outputs/, logs/
3. Ground truth paths: look in eval scripts for data loading (dataset paths, GT references)
4. Experiment tracker: EXPERIMENT_TRACKER.md, EXPERIMENT_LOG.md
5. Paper claims: NARRATIVE_REPORT.md, paper/sections/*.tex, PAPER_PLAN.md
6. Config files: *.yaml, *.toml, *.json configs with metric definitions

DO NOT summarize, interpret, or explain any file content. Only collect paths.
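The collection step amounts to a pure path scan. A sketch (the glob patterns and directory names come from the list above; the helper function itself is illustrative):

```python
from pathlib import Path

EVAL_GLOBS = ["*eval*.py", "*metric*.py", "*test*.py", "*benchmark*.py"]
RESULT_DIRS = ["results", "outputs", "logs"]

def collect_artifacts(root: str) -> dict:
    """Collect file PATHS only; never read or summarize their content."""
    root_path = Path(root)
    artifacts = {"eval_scripts": [], "result_files": []}
    for pattern in EVAL_GLOBS:
        artifacts["eval_scripts"] += sorted(str(p) for p in root_path.rglob(pattern))
    for d in RESULT_DIRS:
        result_dir = root_path / d
        if result_dir.is_dir():
            for ext in ("*.json", "*.csv"):
                artifacts["result_files"] += sorted(str(p) for p in result_dir.rglob(ext))
    return artifacts
```

The executor hands these lists to the reviewer verbatim; any summarization here would leak the executor's interpretation into the audit.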
### Step 2: Send to Reviewer (GPT-5.4 via Codex MCP)
Pass ONLY file paths and the audit checklist to the reviewer. The reviewer reads everything directly.
```yaml
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  sandbox: read-only
  cwd: [project directory]
  prompt: |
    You are an experiment integrity auditor. Read ALL files listed below
    and check for the following fraud patterns.

    Files to read:
    - Evaluation scripts: [list paths]
    - Result files: [list paths]
    - Experiment tracker: [list paths]
    - Paper claims: [list paths]
    - Config files: [list paths]

    ## Audit Checklist

    ### A. Ground Truth Provenance
    For each evaluation script:
    1. Where does "ground truth" / "reference" / "target" come from?
    2. Is it loaded from the DATASET, or generated/derived from MODEL OUTPUTS?
    3. If derived: is it explicitly labeled as proxy evaluation?
    4. Are official eval scripts used when available for this benchmark?
    FAIL if: GT is derived from model outputs without explicit proxy labeling.

    ### B. Score Normalization
    For each metric computation:
    1. Is any metric divided by max/min/mean of the model's OWN output?
    2. Are raw scores reported alongside any normalized scores?
    3. Are any scores suspiciously close to 1.0 or 100%?
    FAIL if: Normalization denominator comes from prediction statistics.

    ### C. Result File Existence
    For each claim in the paper/narrative:
    1. Does the referenced result file actually exist?
    2. Does the claimed metric key exist in that file?
    3. Does the claimed NUMBER match what's in the file?
    4. Is the experiment tracker status DONE (not TODO/IN_PROGRESS)?
    FAIL if: Claimed results reference nonexistent files or mismatched numbers.

    ### D. Dead Code Detection
    For each metric function defined in eval scripts:
    1. Is it actually CALLED in any evaluation pipeline?
    2. Does its output appear in any result file?
    WARN if: Metric functions exist but are never called.

    ### E. Scope Assessment
    1. How many scenes/datasets/configurations were actually tested?
    2. How many seeds/runs per configuration?
    3. Does the paper use words like "comprehensive", "extensive", "robust"?
    4. Is the actual scope sufficient for those claims?
    WARN if: Scope language exceeds actual evidence.

    ### F. Evaluation Type Classification
    Classify each evaluation as:
    - real_gt: uses dataset-provided ground truth
    - synthetic_proxy: uses model-generated reference
    - self_supervised_proxy: no GT by design
    - simulation_only: simulated environment
    - human_eval: human judges

    ## Output Format
    For each check (A-F), report:
    - Status: PASS | WARN | FAIL
    - Evidence: exact file:line references
    - Details: what specifically was found

    Overall verdict: PASS | WARN | FAIL

    Be thorough. Read every eval script line by line.
```
### Step 3: Parse and Write Report (Executor — Claude)
Parse the reviewer's response and write `EXPERIMENT_AUDIT.md`:

```markdown
# Experiment Audit Report

Date: [today]
Auditor: GPT-5.4 xhigh (cross-model, read-only)
Project: [project name]

Overall Verdict: [PASS | WARN | FAIL]
Integrity Status: [pass | warn | fail]

## Checks

### A. Ground Truth Provenance: [PASS|WARN|FAIL]
[details + file:line evidence]

### B. Score Normalization: [PASS|WARN|FAIL]
[details]

### C. Result File Existence: [PASS|WARN|FAIL]
[details]

### D. Dead Code Detection: [PASS|WARN|FAIL]
[details]

### E. Scope Assessment: [PASS|WARN|FAIL]
[details]

### F. Evaluation Type: [real_gt | synthetic_proxy | ...]
[classification + evidence]

## Action Items

- [specific fixes if WARN or FAIL]

## Claim Impact

- Claim 1: [supported | needs qualifier | unsupported]
- Claim 2: ...
```
Also write `EXPERIMENT_AUDIT.json` for machine consumption:
```json
{
  "date": "2026-04-10",
  "auditor": "gpt-5.4-xhigh",
  "overall_verdict": "warn",
  "integrity_status": "warn",
  "checks": {
    "gt_provenance": {"status": "pass", "details": "..."},
    "score_normalization": {"status": "warn", "details": "..."},
    "result_existence": {"status": "pass", "details": "..."},
    "dead_code": {"status": "pass", "details": "..."},
    "scope": {"status": "warn", "details": "..."},
    "eval_type": "real_gt"
  },
  "claims": [
    {"id": "C1", "impact": "supported"},
    {"id": "C2", "impact": "needs_qualifier"}
  ]
}
```
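Downstream tools may want to sanity-check this file before trusting it. A minimal validation sketch (key names are taken from the schema above; the function itself is illustrative):

```python
import json

VALID_STATUSES = {"pass", "warn", "fail"}
REQUIRED_CHECKS = {
    "gt_provenance", "score_normalization", "result_existence",
    "dead_code", "scope", "eval_type",
}

def validate_audit(raw: str) -> dict:
    """Parse EXPERIMENT_AUDIT.json and fail fast on a malformed report."""
    audit = json.loads(raw)
    assert audit["overall_verdict"] in VALID_STATUSES
    missing = REQUIRED_CHECKS - audit["checks"].keys()
    assert not missing, f"missing checks: {missing}"
    for name, check in audit["checks"].items():
        if name != "eval_type":  # eval_type is a bare string, not a status dict
            assert check["status"] in VALID_STATUSES
    return audit
```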
### Step 4: Print Summary
```
🔬 Experiment Audit Complete
GT Provenance:       ✅ PASS — real dataset GT used
Score Normalization: ⚠️ WARN — boundary metric uses self-reference
Result Existence:    ✅ PASS — all files exist, numbers match
Dead Code:           ✅ PASS — all metric functions called
Scope:               ⚠️ WARN — 2 scenes, paper says "comprehensive"
Overall: ⚠️ WARN
See EXPERIMENT_AUDIT.md for details.
```

## Integration with Other Skills
### Automatic in /research-pipeline (advisory, never blocks)
When integrated into the pipeline, this skill runs automatically after `/experiment-bridge` and before `/auto-review-loop`:

```
/experiment-bridge → results ready
        ↓
/experiment-audit (automatic, advisory)
  ├── PASS → continue normally
  ├── WARN → print ⚠️ warning, continue, tag claims as [INTEGRITY: WARN]
  └── FAIL → print 🔴 alert, continue, tag claims as [INTEGRITY CONCERN]
        ↓
/auto-review-loop → proceeds with integrity tags visible to reviewer
```

Never blocks the pipeline. Even on FAIL, the pipeline continues — but claims carry visible integrity tags.
### Read by /result-to-claim (if exists)
```
if EXPERIMENT_AUDIT.json exists:
    read integrity_status
    attach to verdict: {claim_supported: "yes", integrity_status: "warn"}
    if integrity_status == "fail":
        downgrade verdict display: "yes [INTEGRITY CONCERN]"
else:
    verdict as normal, integrity_status = "unavailable"
    mark as "provisional — no integrity audit"
```
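In executable form, that consumer logic might look like the following sketch (only the `EXPERIMENT_AUDIT.json` filename and the status values come from this document; the function and verdict shape are illustrative):

```python
import json
from pathlib import Path

def attach_integrity(verdict: dict, audit_path: str = "EXPERIMENT_AUDIT.json") -> dict:
    """Annotate a claim verdict with the audit status; never block on a missing audit."""
    path = Path(audit_path)
    if not path.exists():
        # File-as-switch: no audit file means zero impact on existing behavior.
        verdict["integrity_status"] = "unavailable"
        verdict["display"] = f'{verdict["claim_supported"]} (provisional, no integrity audit)'
        return verdict
    audit = json.loads(path.read_text())
    verdict["integrity_status"] = audit["integrity_status"]
    if audit["integrity_status"] == "fail":
        verdict["display"] = f'{verdict["claim_supported"]} [INTEGRITY CONCERN]'
    else:
        verdict["display"] = verdict["claim_supported"]
    return verdict
```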
### Read by /paper-write (if exists)
```
if EXPERIMENT_AUDIT.json exists AND integrity_status == "fail":
    add footnote to affected claims: "Note: integrity audit flagged concerns with this evaluation"
```
## Key Rules
- Reviewer independence: executor collects paths, reviewer judges. Period.
- Never block: warn loudly, never halt the pipeline.
- File-as-switch: no EXPERIMENT_AUDIT.md = skill was never run = zero impact on existing behavior.
- Cross-model: the reviewer MUST be a different model family from the executor.
- Honest about limits: the audit catches common patterns, not all possible fraud. It is a safety net, not a guarantee.
## Acknowledgements
Motivated by community-reported integrity issues (#57, #131) where executor agents created fake ground truth and self-normalized scores.
## Review Tracing
After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).
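Resolving the trace directory can be sketched as follows (the `<skill>/<date>_run<NN>` layout comes from the path template above; the helper itself is illustrative):

```python
from datetime import date
from pathlib import Path

def next_trace_dir(skill: str, root: str = ".aris/traces") -> Path:
    """Create and return .aris/traces/<skill>/<date>_runNN/ with the next run number."""
    today = date.today().isoformat()  # e.g. 2026-04-10
    parent = Path(root) / skill
    existing = [p for p in parent.glob(f"{today}_run*") if p.is_dir()]
    run = len(existing) + 1  # simple counter; assumes runs are never deleted mid-day
    trace_dir = parent / f"{today}_run{run:02d}"
    trace_dir.mkdir(parents=True, exist_ok=False)
    return trace_dir
```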