
# Experiment Audit: Cross-Model Integrity Verification

Audit experiment integrity for: $ARGUMENTS

## Why This Exists

LLM agents can produce fraudulent experimental results through:
  1. Fake ground truth — creating synthetic "reference" from model outputs, then reporting high agreement as performance
  2. Score normalization — dividing metrics by the model's own max to get 0.99+
  3. Phantom results — claiming numbers from files that don't exist or functions never called
  4. Insufficient scope — reporting 2-scene pilots as "comprehensive evaluation"
These are NOT acts of intentional deception — they are failure modes of optimizing agents that lack integrity constraints. This skill adds that constraint. Pattern 2, for example, can look as innocuous as the snippet below.
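To make pattern 2 concrete, here is a contrived sketch of self-normalization next to an honest comparison (the variable names and numbers are hypothetical, not quoted from any audited project):

```python
import numpy as np

predictions = np.array([0.61, 0.58, 0.64])  # the model's own scores

# Fraudulent: dividing by the model's own maximum guarantees a 1.0 somewhere,
# no matter how poor the predictions actually are.
normalized = predictions / predictions.max()  # -> [0.95, 0.91, 1.0]

# Honest: compare against dataset-provided ground truth and report raw numbers.
ground_truth = np.array([0.90, 0.85, 0.95])  # loaded from the dataset, not the model
mean_abs_error = np.abs(predictions - ground_truth).mean()
```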

## Core Principle

The executor (Claude) collects file paths. The reviewer (GPT-5.4) reads code and judges integrity. The executor does NOT participate in integrity judgment.
This follows `shared-references/reviewer-independence.md` and `shared-references/experiment-integrity.md`.

## Constants

- REVIEWER_BACKEND = `codex` — Default: Codex MCP (xhigh). Override with `--- reviewer: oracle-pro` for GPT-5.4 Pro via Oracle MCP. See `shared-references/reviewer-routing.md`.

## Workflow

### Step 1: Collect Artifacts (Executor — Claude)

Locate and list these files WITHOUT reading or summarizing their content:

```
Scan project directory for:
1. Evaluation scripts:    *eval*.py, *metric*.py, *test*.py, *benchmark*.py
2. Result files:          *.json, *.csv in results/, outputs/, logs/
3. Ground truth paths:    look in eval scripts for data loading (dataset paths, GT references)
4. Experiment tracker:    EXPERIMENT_TRACKER.md, EXPERIMENT_LOG.md
5. Paper claims:          NARRATIVE_REPORT.md, paper/sections/*.tex, PAPER_PLAN.md
6. Config files:          *.yaml, *.toml, *.json configs with metric definitions
```

DO NOT summarize, interpret, or explain any file content. Only collect paths.
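A minimal sketch of what this collection step could look like in Python (`collect_artifacts` and the exact glob set are illustrative assumptions, not part of the skill contract):

```python
from pathlib import Path

# Hypothetical sketch: gather candidate paths without ever opening the files.
EVAL_GLOBS = ["*eval*.py", "*metric*.py", "*test*.py", "*benchmark*.py"]
RESULT_DIRS = ["results", "outputs", "logs"]
TRACKER_NAMES = ["EXPERIMENT_TRACKER.md", "EXPERIMENT_LOG.md"]

def collect_artifacts(root: str) -> dict[str, list[str]]:
    base = Path(root)
    eval_scripts = sorted({str(p) for g in EVAL_GLOBS for p in base.rglob(g)})
    result_files = sorted(
        str(p)
        for d in RESULT_DIRS
        if (base / d).is_dir()
        for ext in ("*.json", "*.csv")
        for p in (base / d).rglob(ext)
    )
    trackers = [str(base / n) for n in TRACKER_NAMES if (base / n).exists()]
    # Paths only: the executor never reads or summarizes file content here.
    return {"eval_scripts": eval_scripts, "result_files": result_files, "trackers": trackers}
```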

### Step 2: Send to Reviewer (GPT-5.4 via Codex MCP)

Pass ONLY file paths and the audit checklist to the reviewer. The reviewer reads everything directly.
```yaml
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  sandbox: read-only
  cwd: [project directory]
  prompt: |
    You are an experiment integrity auditor. Read ALL files listed below
    and check for the following fraud patterns.

    Files to read:
    - Evaluation scripts: [list paths]
    - Result files: [list paths]
    - Experiment tracker: [list paths]
    - Paper claims: [list paths]
    - Config files: [list paths]

    ## Audit Checklist

    ### A. Ground Truth Provenance
    For each evaluation script:
    1. Where does "ground truth" / "reference" / "target" come from?
    2. Is it loaded from the DATASET, or generated/derived from MODEL OUTPUTS?
    3. If derived: is it explicitly labeled as proxy evaluation?
    4. Are official eval scripts used when available for this benchmark?
    FAIL if: GT is derived from model outputs without explicit proxy labeling.

    ### B. Score Normalization
    For each metric computation:
    1. Is any metric divided by max/min/mean of the model's OWN output?
    2. Are raw scores reported alongside any normalized scores?
    3. Are any scores suspiciously close to 1.0 or 100%?
    FAIL if: Normalization denominator comes from prediction statistics.

    ### C. Result File Existence
    For each claim in the paper/narrative:
    1. Does the referenced result file actually exist?
    2. Does the claimed metric key exist in that file?
    3. Does the claimed NUMBER match what's in the file?
    4. Is the experiment tracker status DONE (not TODO/IN_PROGRESS)?
    FAIL if: Claimed results reference nonexistent files or mismatched numbers.

    ### D. Dead Code Detection
    For each metric function defined in eval scripts:
    1. Is it actually CALLED in any evaluation pipeline?
    2. Does its output appear in any result file?
    WARN if: Metric functions exist but are never called.

    ### E. Scope Assessment
    1. How many scenes/datasets/configurations were actually tested?
    2. How many seeds/runs per configuration?
    3. Does the paper use words like "comprehensive", "extensive", "robust"?
    4. Is the actual scope sufficient for those claims?
    WARN if: Scope language exceeds actual evidence.

    ### F. Evaluation Type Classification
    Classify each evaluation as:
    - real_gt: uses dataset-provided ground truth
    - synthetic_proxy: uses model-generated reference
    - self_supervised_proxy: no GT by design
    - simulation_only: simulated environment
    - human_eval: human judges

    ## Output Format

    For each check (A-F), report:
    - Status: PASS | WARN | FAIL
    - Evidence: exact file:line references
    - Details: what specifically was found

    Overall verdict: PASS | WARN | FAIL
    
    Be thorough. Read every eval script line by line.
```
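One way the executor might assemble that prompt before dispatching the MCP call (a sketch; `build_audit_prompt` and the `AUDIT_CHECKLIST` constant are hypothetical names):

```python
# Hypothetical helper: interpolate collected paths into the audit prompt.
# AUDIT_CHECKLIST would hold the literal checklist text from the block above.
AUDIT_CHECKLIST = "## Audit Checklist\n..."  # elided; see the prompt above

def build_audit_prompt(artifacts: dict[str, list[str]]) -> str:
    def fmt(paths: list[str]) -> str:
        return "\n".join(f"    - {p}" for p in paths) or "    - (none found)"

    return (
        "You are an experiment integrity auditor. Read ALL files listed below\n"
        "and check for the following fraud patterns.\n\n"
        "Files to read:\n"
        f"- Evaluation scripts:\n{fmt(artifacts['eval_scripts'])}\n"
        f"- Result files:\n{fmt(artifacts['result_files'])}\n"
        f"- Experiment tracker:\n{fmt(artifacts['trackers'])}\n\n"
        f"{AUDIT_CHECKLIST}"
    )
```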

### Step 3: Parse and Write Report (Executor — Claude)

Parse the reviewer's response and write `EXPERIMENT_AUDIT.md`:

```markdown
# Experiment Audit Report

Date: [today]
Auditor: GPT-5.4 xhigh (cross-model, read-only)
Project: [project name]

## Overall Verdict: [PASS | WARN | FAIL]

Integrity Status: [pass | warn | fail]

## Checks

### A. Ground Truth Provenance: [PASS|WARN|FAIL]
[details + file:line evidence]

### B. Score Normalization: [PASS|WARN|FAIL]
[details]

### C. Result File Existence: [PASS|WARN|FAIL]
[details]

### D. Dead Code Detection: [PASS|WARN|FAIL]
[details]

### E. Scope Assessment: [PASS|WARN|FAIL]
[details]

### F. Evaluation Type: [real_gt | synthetic_proxy | ...]
[classification + evidence]

## Action Items

- [specific fixes if WARN or FAIL]

## Claim Impact

- Claim 1: [supported | needs qualifier | unsupported]
- Claim 2: ...
```

Also write `EXPERIMENT_AUDIT.json` for machine consumption:

```json
{
  "date": "2026-04-10",
  "auditor": "gpt-5.4-xhigh",
  "overall_verdict": "warn",
  "integrity_status": "warn",
  "checks": {
    "gt_provenance": {"status": "pass", "details": "..."},
    "score_normalization": {"status": "warn", "details": "..."},
    "result_existence": {"status": "pass", "details": "..."},
    "dead_code": {"status": "pass", "details": "..."},
    "scope": {"status": "warn", "details": "..."},
    "eval_type": "real_gt"
  },
  "claims": [
    {"id": "C1", "impact": "supported"},
    {"id": "C2", "impact": "needs_qualifier"}
  ]
}
```
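A sketch of how the executor might derive and write that JSON (the worst-of aggregation for `overall_verdict` is an assumption about how verdicts combine, not something the skill mandates):

```python
import json
from datetime import date

SEVERITY = {"pass": 0, "warn": 1, "fail": 2}

def write_audit_json(checks: dict, eval_type: str, claims: list[dict],
                     path: str = "EXPERIMENT_AUDIT.json") -> None:
    # Assumed aggregation: overall verdict is the worst individual check status.
    worst = max((c["status"] for c in checks.values()), key=SEVERITY.__getitem__)
    report = {
        "date": date.today().isoformat(),
        "auditor": "gpt-5.4-xhigh",
        "overall_verdict": worst,
        "integrity_status": worst,
        "checks": {**checks, "eval_type": eval_type},
        "claims": claims,
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
```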

### Step 4: Print Summary

```
🔬 Experiment Audit Complete

  GT Provenance:       ✅ PASS — real dataset GT used
  Score Normalization: ⚠️ WARN — boundary metric uses self-reference
  Result Existence:    ✅ PASS — all files exist, numbers match
  Dead Code:           ✅ PASS — all metric functions called
  Scope:               ⚠️ WARN — 2 scenes, paper says "comprehensive"

  Overall: ⚠️ WARN

  See EXPERIMENT_AUDIT.md for details.
```
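If the summary is rendered from `EXPERIMENT_AUDIT.json`, the mapping could be as simple as the sketch below (icon choices are illustrative, and mapping check keys to display labels is elided):

```python
ICONS = {"pass": "✅ PASS", "warn": "⚠️ WARN", "fail": "🔴 FAIL"}

def print_summary(report: dict) -> None:
    print("🔬 Experiment Audit Complete\n")
    for name, check in report["checks"].items():
        if isinstance(check, dict):  # skip the eval_type string entry
            print(f"  {name}: {ICONS[check['status']]} — {check['details']}")
    print(f"\n  Overall: {ICONS[report['overall_verdict']]}")
    print("\n  See EXPERIMENT_AUDIT.md for details.")
```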

## Integration with Other Skills

### Automatic in /research-pipeline (advisory, never blocks)

When integrated into the pipeline, this skill runs automatically after `/experiment-bridge` and before `/auto-review-loop`:

```
/experiment-bridge → results ready
/experiment-audit (automatic, advisory)
    ├── PASS  → continue normally
    ├── WARN  → print ⚠️ warning, continue, tag claims as [INTEGRITY: WARN]
    └── FAIL  → print 🔴 alert, continue, tag claims as [INTEGRITY CONCERN]
/auto-review-loop → proceeds with integrity tags visible to reviewer
```

Never blocks the pipeline. Even on FAIL, the pipeline continues — but claims carry visible integrity tags.
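The tagging step could look like this (a sketch; the tag strings come from the diagram above, the function name is hypothetical):

```python
VERDICT_TAGS = {
    "pass": None,
    "warn": "[INTEGRITY: WARN]",
    "fail": "[INTEGRITY CONCERN]",
}

def tag_claim(claim_text: str, verdict: str) -> str:
    # Advisory only: the claim always survives; a tag is appended, never a block.
    tag = VERDICT_TAGS[verdict]
    return f"{claim_text} {tag}" if tag else claim_text
```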

### Read by /result-to-claim (if exists)

```
if EXPERIMENT_AUDIT.json exists:
    read integrity_status
    attach to verdict: {claim_supported: "yes", integrity_status: "warn"}
    if integrity_status == "fail":
        downgrade verdict display: "yes [INTEGRITY CONCERN]"
else:
    verdict as normal, integrity_status = "unavailable"
    mark as "provisional — no integrity audit"
```
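A runnable rendering of that pseudocode, assuming a verdict dict shaped like the JSON above (field names beyond that are illustrative):

```python
import json
from pathlib import Path

def attach_integrity(verdict: dict, audit_path: str = "EXPERIMENT_AUDIT.json") -> dict:
    p = Path(audit_path)
    if p.exists():  # file-as-switch: no file means the audit never ran
        status = json.loads(p.read_text())["integrity_status"]
        verdict["integrity_status"] = status
        if status == "fail":
            verdict["display"] = f'{verdict["claim_supported"]} [INTEGRITY CONCERN]'
    else:
        verdict["integrity_status"] = "unavailable"
        verdict["note"] = "provisional — no integrity audit"
    return verdict
```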

### Read by /paper-write (if exists)

```
if EXPERIMENT_AUDIT.json exists AND integrity_status == "fail":
    add footnote to affected claims: "Note: integrity audit flagged concerns with this evaluation"
```
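In Python terms (a sketch; the function name and the LaTeX footnote form are assumptions):

```python
FOOTNOTE = "Note: integrity audit flagged concerns with this evaluation"

def add_integrity_footnotes(claims: list[str], integrity_status: str) -> list[str]:
    if integrity_status != "fail":
        return claims
    # Hypothetical: append a LaTeX footnote to each affected claim.
    return [f"{c}\\footnote{{{FOOTNOTE}}}" for c in claims]
```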

## Key Rules

- Reviewer independence: executor collects paths, reviewer judges. Period.
- Never block: warn loudly, never halt the pipeline.
- File-as-switch: no EXPERIMENT_AUDIT.md = skill was never run = zero impact on existing behavior.
- Cross-model: the reviewer MUST be a different model family from the executor.
- Honest about limits: the audit catches common patterns, not all possible fraud. It is a safety net, not a guarantee.

## Acknowledgements

Motivated by community-reported integrity issues (#57, #131) where executor agents created fake ground truth and self-normalized scores.

## Review Tracing

After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md`. Use `tools/save_trace.sh` or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).
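For the direct-write route, the trace directory can be derived like this (a sketch; run numbering and collision handling are assumptions, defer to `shared-references/review-tracing.md` where they differ):

```python
from datetime import date
from pathlib import Path

def next_trace_dir(skill: str, root: str = ".aris/traces") -> Path:
    # Builds .aris/traces/<skill>/<date>_run<NN>/, numbering by counting
    # existing same-day runs (naive: assumes no runs were deleted).
    day = date.today().isoformat()
    base = Path(root) / skill
    base.mkdir(parents=True, exist_ok=True)
    runs = len(list(base.glob(f"{day}_run*")))
    trace_dir = base / f"{day}_run{runs + 1:02d}"
    trace_dir.mkdir()
    return trace_dir
```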