section-mapper

Section Mapper

Create a paper→subsection map that supports evidence building and later synthesis.
Good mapping is diverse (avoids reusing the same paper everywhere) and explainable (short semantic “why”, not just keyword overlap).

When to use

  • You have outline/outline.yml and papers/core_set.csv, and need coverage per subsection.
  • You want to identify weak-signal subsections early (so you can adjust scope or add papers).

Inputs

  • papers/core_set.csv
  • outline/outline.yml

Outputs

  • outline/mapping.tsv
  • outline/mapping_report.md (diagnostics: reuse hotspots, weak-signal subsections)

Freeze marker (explicit)

To prevent accidental overwrites after you refine mapping rationales:
  • Create outline/mapping.refined.ok.
If you rerun the script without this marker, it will back up the previous mapping to a timestamped file:
  • outline/mapping.tsv.bak.<timestamp>
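
The marker-and-backup behavior described above can be sketched as follows. This is a minimal illustration, not the helper script's actual code: the function name prepare_mapping_write is hypothetical, the timestamp format is an assumption (the text only says <timestamp>), and skipping the rewrite when the marker exists is an assumed reading of "prevent accidental overwrites".

```python
from datetime import datetime
from pathlib import Path
import shutil


def prepare_mapping_write(outline_dir: Path) -> bool:
    """Return True if mapping.tsv may be (re)written, False if it is frozen.

    Assumed behavior: when the freeze marker exists, the refined mapping is
    left untouched; otherwise any existing mapping is backed up first.
    """
    mapping = outline_dir / "mapping.tsv"
    marker = outline_dir / "mapping.refined.ok"

    if marker.exists():
        # Frozen: keep the hand-refined rationales as they are.
        return False

    if mapping.exists():
        # Timestamp format is an assumption; the source only says "<timestamp>".
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup = mapping.with_name(f"mapping.tsv.bak.{stamp}")
        shutil.copy2(mapping, backup)
    return True
```

Creating the marker by hand is then a one-liner such as `Path("outline/mapping.refined.ok").touch()`.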

Workflow (heuristic)

  1. Start from the outline subsections (each subsection should be “mappable”).
  2. For each subsection, pick enough papers to support evidence-first writing (A150++ default: 28; smaller runs: ~12–20; lightweight: ~3–6). Choose papers that are:
    • representative (canonical / frequently-cited)
    • complementary (different design choices, different eval setups)
    • not overly reused elsewhere unless truly foundational
  3. Fill the why column with a short semantic rationale (one line is enough), e.g.:
    • mechanism: “decouples planner/executor; tool calling API”
    • evaluation: “interactive web tasks; strong tool error analysis”
    • safety: “agentic jailbreak surface; mitigation study”
  4. After initial mapping, scan for:
    • subsections with <3 papers → either broaden, merge, or expand retrieval
    • a few papers mapped everywhere → diversify; reserve “foundational” papers for only the truly relevant parts
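
The scan in step 4 can be approximated with a short script. This is a sketch under stated assumptions: the column names section_id and paper_id are hypothetical (the document does not give the mapping.tsv header), and the reuse threshold of 6 is an arbitrary illustrative default.

```python
import csv
from collections import Counter, defaultdict
from pathlib import Path


def scan_mapping(rows, min_papers=3, reuse_threshold=6):
    """Return (weak_subsections, reuse_hotspots) for a list of mapping rows.

    rows: iterable of dicts with 'section_id' and 'paper_id' keys
    (assumed column names, not confirmed by the source).
    """
    papers_by_section = defaultdict(set)
    for row in rows:
        papers_by_section[row["section_id"]].add(row["paper_id"])

    # Count in how many distinct subsections each paper appears.
    reuse = Counter()
    for papers in papers_by_section.values():
        reuse.update(papers)

    weak = sorted(s for s, p in papers_by_section.items() if len(p) < min_papers)
    hotspots = sorted(p for p, n in reuse.items() if n >= reuse_threshold)
    return weak, hotspots


def load_rows(path: Path):
    """Read a tab-separated mapping file that has a header row."""
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))
```

Subsections with no mapped papers never appear in the mapping file, so a complete check would also compare against the subsection list in outline/outline.yml.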

Quality checklist

  • outline/mapping.tsv exists and is non-empty.
  • Most subsections have ≥3 mapped papers (or a clear exception noted in why).
  • why is semantic (not just matched_terms=...).
  • No single paper dominates unrelated subsections.
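
A rough way to flag rationales that still need rewriting is sketched below. It is a crude heuristic, not the pipeline's actual quality gate: it assumes a generic placeholder is either empty or begins with matched_terms=, based only on the example in the checklist, and the function name is hypothetical.

```python
import re

# Assumed shape of an auto-generated placeholder, taken from the
# "matched_terms=..." example in the checklist.
_GENERIC = re.compile(r"^\s*matched_terms\s*=", re.IGNORECASE)


def is_generic_why(why: str) -> bool:
    """Return True if the rationale is empty or looks like keyword overlap only."""
    text = why.strip()
    return not text or bool(_GENERIC.match(text))
```

A rationale such as "decouples planner/executor; tool calling API" passes, while "matched_terms=agent,tool" is flagged for a manual rewrite.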

Helper script (optional)

Quick Start

  • python .codex/skills/section-mapper/scripts/run.py --help
  • python .codex/skills/section-mapper/scripts/run.py --workspace <workspace_dir> --per-subsection 28

All Options

  • --per-subsection <n>: target mapped papers per subsection
  • --diversity-penalty <float>: penalize repeated reuse of the same paper across many subsections
  • --soft-limit <n> / --hard-limit <n>: caps for per-paper reuse (0 = auto)
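
To build intuition for how a diversity penalty and a hard reuse cap might interact, here is a toy greedy assignment. It is an illustration only and not the script's scoring logic: the relevance input, the linear penalty, and the function name are all assumptions, and the soft limit and the "0 = auto" behavior are not modeled.

```python
from collections import Counter


def assign_papers(relevance, per_subsection, diversity_penalty, hard_limit):
    """Greedy toy model of penalized paper selection.

    relevance: {subsection: {paper_id: score}} (hypothetical input).
    Each pick is scored as relevance minus a penalty that grows with how
    often the paper has already been used; papers at the hard limit are
    skipped entirely.
    """
    reuse = Counter()
    mapping = {}
    for subsection, scores in relevance.items():
        chosen = []
        candidates = dict(scores)
        while candidates and len(chosen) < per_subsection:
            eligible = {p: s for p, s in candidates.items() if reuse[p] < hard_limit}
            if not eligible:
                break
            best = max(eligible, key=lambda p: eligible[p] - diversity_penalty * reuse[p])
            chosen.append(best)
            reuse[best] += 1
            del candidates[best]
        mapping[subsection] = chosen
    return mapping
```

In this model, raising the penalty makes a paper that was already used in an earlier subsection lose out to a slightly less relevant but unused one, which is the effect the --diversity-penalty option is described as targeting.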

Examples

  • Higher diversity (reduce over-reuse):
    • python .codex/skills/section-mapper/scripts/run.py --workspace <ws> --per-subsection 4 --diversity-penalty 0.25
  • Tighter reuse caps:
    • python .codex/skills/section-mapper/scripts/run.py --workspace <ws> --per-subsection 3 --soft-limit 6 --hard-limit 10

Notes

  • Writes outline/mapping_report.md diagnostics.
  • In pipeline.py --strict, mapping may be blocked until generic why rationales are replaced with semantic ones.

Troubleshooting

Common Issues

Issue: outline/mapping.tsv is empty or low-coverage

Symptom:
  • Mapping has few rows, or many subsections have <3 papers.
Causes:
  • Core set is too small or outline is too fine-grained.
Solutions:
  • Increase core set size (rerun dedupe-rank with a larger --core-size).
  • Merge weak-signal subsections or broaden the scope/queries.

Issue: Mapping over-reuses the same papers

Symptom:
  • Quality gate reports repeated papers across many unrelated subsections.
Causes:
  • Diversity penalty too low; limited core set.
Solutions:
  • Raise --diversity-penalty and/or set tighter --soft-limit/--hard-limit.
  • Manually diversify mappings for unrelated sections.

Recovery Checklist

  • Each subsection has ≥3 mapped papers (target).
  • why column contains semantic rationale (not just token overlap).