baseline-selection-audit


Baseline Selection Audit


Turn a claim, method, draft experiment plan, or literature map into a reviewer-proof baseline set and fairness ledger.
Use this skill when:
  • experiments are being planned and the right baselines are unclear
  • a literature review found competitors but they have not been converted into comparisons
  • a paper may be missing SOTA, direct competitors, classics, ablation baselines, or control baselines
  • a reviewer might complain about unfair tuning, scale, data, compute, protocol, or metric differences
  • a rebuttal or revision needs to decide which additional baseline experiment is worth running
  • the user needs to justify why a baseline is excluded as not comparable
Do not use this skill for citation metadata checks. Use citation-audit for BibTeX and LaTeX correctness. Use citation-coverage-audit when the primary question is missing references rather than missing comparisons.
Pair this skill with:
  • literature-review-sprint: before this skill, when the competing paper map is incomplete
  • algorithm-design-planner: when the closest baseline changes the method design
  • experiment-design-planner: after this skill, to turn selected baselines into a concrete experiment matrix
  • run-experiment: only after baseline scope, fairness rules, and stop conditions are clear
  • result-diagnosis: when baseline results are surprising, unstable, or stronger than the proposed method
  • paper-evidence-board: when baseline risks must be linked to paper claims, figures, and sections
  • research-project-memory: when baseline decisions, risks, and actions should persist across sessions

Skill Directory Layout


<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── baseline-taxonomy.md
    ├── fairness-ledger.md
    ├── memory-writeback.md
    ├── report-template.md
    └── reviewer-risk.md

Progressive Loading


  • Always read references/baseline-taxonomy.md, references/fairness-ledger.md, and references/reviewer-risk.md.
  • Read references/report-template.md before writing the final audit.
  • Read references/memory-writeback.md when the project has memory/, component .agent/ folders, or the user asks for persistent project memory.
  • If the baseline set depends on current SOTA, recent concurrent work, or venue expectations, verify with current sources through web search, OpenReview, proceedings, arXiv, PMLR, ACL Anthology, CVF, DBLP, Semantic Scholar, or user-provided papers.
  • If current verification is unavailable, mark baseline status as provisional and identify the missing search needed before final experiment planning.

Core Principles


  • Baselines exist to defend a claim, not to decorate a table.
  • Separate closest conceptual competitor, strongest empirical baseline, standard benchmark baseline, ablation baseline, and control baseline.
  • A baseline can be missing for citation purposes, comparison purposes, or both. Name which one.
  • Fairness must cover data, model size, compute, tuning, metric, protocol, code availability, and reporting.
  • Do not ask the user to run every possible baseline. Rank by reviewer impact and decision value.
  • Excluding a baseline requires a defensible reason and often a citation or limitation statement.
  • A strong baseline beating the method is project information, not merely an experiment failure.
  • The output must hand off directly to experiment-design-planner.
Step 1 - Recover Claim and Comparison Surface


Collect:
  • paper claim or experiment claim
  • proposed method and closest baseline, if known
  • target task, dataset, benchmark, metric, and protocol
  • target venue or community expectations
  • existing results, draft tables, or planned experiments
  • literature-review outputs, if available
  • code availability and compute budget
  • project memory IDs such as CLM-###, EVD-###, RSK-###, or ACT-###
Rewrite the claim into:
We need to show that [method] improves [property] over [comparison set] under [task/protocol], without the result being explained by [confound].
If this cannot be written, route to research-idea-validator, algorithm-design-planner, or paper-evidence-board.
Step 2 - Build Candidate Baseline Pool


Use:
  • literature review outputs
  • cited related work
  • benchmark leaderboards or official baselines
  • recent accepted papers at the target venue
  • code repositories or model releases
  • reviewer comments, if this is rebuttal mode
Classify each candidate using references/baseline-taxonomy.md.
The pool should include:
  • direct competitor
  • strongest current method
  • standard benchmark baseline
  • classic baseline
  • previous version or nearest ablation of the user's method
  • no-method or trivial control baseline
  • oracle, upper bound, or diagnostic baseline when appropriate
  • resource-matched baseline
  • domain-specific baseline expected by the venue

Step 3 - Assign Baseline Requirement Level


For each candidate, assign exactly one:
  • must-have: paper is hard to defend without it
  • should-have: materially improves reviewer confidence, but omission may be defensible
  • optional: useful context, low acceptance impact
  • not-comparable: related but unfair or invalid as a direct comparison
  • citation-only: should be discussed/cited but does not need an experiment
Every must-have baseline needs an owner, experiment form, fairness constraints, and a fallback if it is impossible.
Every not-comparable baseline needs a reason:
  • different task or data
  • incompatible metric
  • unavailable code and reproduction too expensive
  • different resource regime
  • uses extra supervision or data
  • no public details sufficient for faithful reproduction
  • evaluates a different claim
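The requirement-level rules above can be expressed as a record check. A sketch under assumed field names (the record shape is illustrative, not defined by the skill):

```python
# Hypothetical sketch of the Step 3 requirement-level rules.
LEVELS = {"must-have", "should-have", "optional", "not-comparable", "citation-only"}

def validate_baseline(rec: dict) -> list:
    """Return a list of problems; an empty list means the record satisfies Step 3."""
    problems = []
    if rec.get("level") not in LEVELS:
        problems.append("level must be exactly one of the five levels")
    if rec.get("level") == "must-have":
        # must-have baselines need an owner, experiment form,
        # fairness constraints, and a fallback
        for field in ("owner", "experiment_form", "fairness_constraints", "fallback"):
            if not rec.get(field):
                problems.append(f"must-have baseline missing {field}")
    if rec.get("level") == "not-comparable" and not rec.get("reason"):
        problems.append("not-comparable baseline needs a reason")
    return problems
```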

Step 4 - Audit Fairness


Read references/fairness-ledger.md.
For each must-have and should-have baseline, check:
  • same data split and preprocessing
  • same training data and extra-data policy
  • comparable model size or explicit scale control
  • comparable compute or explicit compute-normalized metric
  • comparable tuning budget
  • comparable evaluation metric and decoding/sampling protocol
  • correct official code or faithful reimplementation
  • enough seeds, confidence intervals, or variance reporting
  • same reporting unit: tokens, examples, images, FLOPs, wall-clock, NFE, parameters, or memory
If fairness cannot be achieved, decide whether to:
  • change claim
  • add a matched subset comparison
  • run a smaller diagnostic comparison
  • mark baseline as citation-only with clear limitation
  • defer to rebuttal risk
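The checklist above amounts to a boolean ledger row per baseline. A sketch with paraphrased axis names (the dict shape and names are assumptions, not the ledger's actual schema):

```python
# Hypothetical sketch of the Step 4 fairness check as a boolean ledger row.
FAIRNESS_AXES = (
    "same_split_and_preprocessing", "same_training_data_policy",
    "matched_model_size", "matched_compute", "matched_tuning_budget",
    "matched_metric_and_protocol", "faithful_implementation",
    "variance_reported", "same_reporting_unit",
)

def unmet_axes(ledger_row: dict) -> list:
    """Return the fairness axes a comparison fails or leaves unchecked."""
    # A missing key counts as unchecked, which is treated as unmet.
    return [axis for axis in FAIRNESS_AXES if not ledger_row.get(axis, False)]
```

A non-empty result triggers one of the fallback decisions listed above (change claim, matched subset, diagnostic comparison, citation-only, or accepted rebuttal risk).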

Step 5 - Forecast Reviewer Attacks


Read references/reviewer-risk.md.
For each missing, weak, or unfair baseline, write the likely reviewer objection:
Reviewer could say: [attack].
Severity: fatal / major / medium / minor
Mitigation: run / cite / justify / narrow claim / move to appendix / accept risk
Prioritize by acceptance impact:
  • fatal novelty or comparison threat
  • required benchmark/SOTA omission
  • unfair tuning or compute
  • weak ablation baseline
  • unclear protocol
  • missing control
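The prioritization step above is a severity sort over the forecast objections. A sketch, assuming a minimal record shape that is not prescribed by the skill:

```python
# Hypothetical sketch: rank forecast reviewer attacks by severity (Step 5).
# Severity order mirrors the template: fatal / major / medium / minor.
SEVERITY_ORDER = {"fatal": 0, "major": 1, "medium": 2, "minor": 3}

def prioritize(attacks: list) -> list:
    """Sort attack records so the highest-severity objections come first."""
    return sorted(attacks, key=lambda a: SEVERITY_ORDER[a["severity"]])
```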

Step 6 - Produce Experiment Handoff


For experiment-design-planner, output:
  • selected baselines and requirement levels
  • exact comparison table rows
  • fairness ledger fields to log
  • metrics and protocol constraints
  • ablation/control baselines
  • stop conditions
  • expected reviewer question each baseline answers
  • fallback plan if a baseline is impossible
If compute is limited, propose a staged plan:
  1. minimal reviewer-proof set
  2. high-impact optional additions
  3. appendix or deferred baselines
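The staged plan above can be sketched as a split over requirement levels. One possible mapping, assuming the stages track the levels from Step 3 (field names are illustrative):

```python
# Hypothetical sketch of the Step 6 staged plan when compute is limited:
# must-have -> minimal reviewer-proof set, should-have -> high-impact
# additions, everything else -> appendix or deferred.
def stage_plan(baselines: list) -> dict:
    """Split selected baselines into the three compute stages."""
    plan = {"minimal": [], "high_impact": [], "deferred": []}
    for b in baselines:
        if b["level"] == "must-have":
            plan["minimal"].append(b["name"])
        elif b["level"] == "should-have":
            plan["high_impact"].append(b["name"])
        else:
            plan["deferred"].append(b["name"])
    return plan
```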

Step 7 - Write the Baseline Audit Report


Read references/report-template.md.
If saving to a project and no path is given, use:
docs/experiments/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:
docs/reports/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
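The path convention can be sketched as a small helper; the slugging rule for `<short-name>` is an assumption, since the skill only fixes the directory and date parts:

```python
# Hypothetical sketch of the Step 7 default report path convention.
import datetime
import re

def report_path(short_name, in_code_repo=False, today=None):
    """Build the dated audit report path; slugifies the short name (assumed rule)."""
    today = today or datetime.date.today()
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    root = "docs/reports" if in_code_repo else "docs/experiments"
    return f"{root}/baseline_selection_audit_{today.isoformat()}_{slug}.md"
```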
The report must include:
  • claim under audit
  • candidate baseline pool
  • requirement-level table
  • fairness ledger
  • reviewer attack forecast
  • selected experiment matrix handoff
  • baselines excluded and why
  • memory update section

Step 8 - Write Back to Project Memory


Read references/memory-writeback.md when memory exists.
Update the smallest useful set of entries:
  • memory/risk-board.md: missing, unfair, unavailable, or not-comparable baseline risks
  • memory/evidence-board.md: planned baseline comparisons and ablations
  • memory/action-board.md: implementation, run, citation, or justification actions
  • memory/claim-board.md: claims narrowed by baseline feasibility
  • memory/decision-log.md: durable decisions to include, exclude, or stage baselines
  • worktree .agent/worktree-status.md: baseline implementation purpose and exit condition
  • paper/.agent/: table/section implications when a draft exists
Use certainty labels:
  • verified: for baselines checked against primary sources or official code
  • user-stated: for constraints supplied by the user
  • inferred: for reviewer risks and fairness judgments
  • unverified: for candidates not yet checked

Final Sanity Check


Before finalizing:
  • every paper claim has at least one direct comparison or control
  • closest conceptual competitor and strongest empirical baseline are not conflated
  • must-have baselines are explicit
  • excluded baselines have defensible reasons
  • fairness constraints are concrete enough to run
  • reviewer attacks are written in reviewer language
  • the output can feed directly into experiment-design-planner
  • project memory is updated when present