pdf-text-extractor

PDF Text Extractor

Optionally collect full-text snippets to deepen evidence beyond abstracts.
This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.

Inputs

  • papers/core_set.csv (expects paper_id, title, and ideally pdf_url / arxiv_id / url)
  • Optional: outline/mapping.tsv (to prioritize mapped papers)
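
The input expectations above can be checked before any extraction is attempted. The following is a minimal sketch, not the script's actual loader: the function name load_core_set and the warning list are hypothetical, and only the column names come from this document.

```python
import csv
import io

REQUIRED = ("paper_id", "title")          # columns the skill expects
LOCATORS = ("pdf_url", "arxiv_id", "url")  # at least one is ideally present


def load_core_set(handle):
    """Read core_set.csv rows from an open text handle (hypothetical helper).

    Returns (rows, paper_ids_without_any_locator).
    """
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    missing = [col for col in REQUIRED if col not in fields]
    if missing:
        raise ValueError(f"core_set.csv is missing required columns: {missing}")
    rows, no_locator = [], []
    for row in reader:
        # A paper with no pdf_url / arxiv_id / url cannot be resolved to a PDF.
        if not any((row.get(col) or "").strip() for col in LOCATORS):
            no_locator.append(row["paper_id"])
        rows.append(row)
    return rows, no_locator


# Illustrative data only; these IDs and titles are made up.
sample = io.StringIO(
    "paper_id,title,pdf_url,arxiv_id,url\n"
    "P0001,First paper,,2101.00001,\n"
    "P0002,Second paper,,,\n"
)
rows, no_locator = load_core_set(sample)
```

Papers listed in no_locator would need a manually supplied PDF if fulltext evidence is wanted for them.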

Outputs

  • papers/fulltext_index.jsonl (one record per attempted paper)
  • Side artifacts:
    • papers/pdfs/<paper_id>.pdf (cached downloads)
    • papers/fulltext/<paper_id>.txt (extracted text)

Decision: evidence mode

  • queries.md can set evidence_mode: "abstract" | "fulltext".
    • abstract (default template): do not download; write an index that clearly records skipping.
    • fulltext: download PDFs (when possible) and extract text to papers/fulltext/.
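
As an illustration of the mode decision, the sketch below reads evidence_mode from the text of queries.md and falls back to "abstract" when the setting is absent. It assumes the setting appears on its own line, optionally as a list item and optionally quoted; the function name read_evidence_mode and the regular expression are hypothetical, not the script's real parser.

```python
import re

# Matches lines such as:  - evidence_mode: "fulltext"   or   evidence_mode: abstract
_MODE_RE = re.compile(
    r'^\s*-?\s*evidence_mode\s*:\s*"?(\w+)"?\s*$', re.MULTILINE
)

VALID_MODES = ("abstract", "fulltext")


def read_evidence_mode(queries_md_text, default="abstract"):
    """Return the evidence mode found in queries.md text (hypothetical helper)."""
    match = _MODE_RE.search(queries_md_text)
    if not match:
        return default  # assumption: a missing setting means abstract mode
    mode = match.group(1).lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported evidence_mode: {mode!r}")
    return mode


mode = read_evidence_mode('- evidence_mode: "fulltext"\n- fulltext_max_papers: 20\n')
```

Failing loudly on an unknown value is a design choice in this sketch; it avoids silently downloading (or silently skipping) when the setting is mistyped.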

Local PDFs Mode

When you cannot/should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in “local PDFs only” mode.
  • PDF naming convention: papers/pdfs/<paper_id>.pdf, where <paper_id> matches the paper_id in papers/core_set.csv.
  • Set - evidence_mode: "fulltext" in queries.md.
  • Run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
If PDFs are missing, the script writes a to-do list:
  • output/MISSING_PDFS.md (human-readable summary)
  • papers/missing_pdfs.csv (machine-readable list)
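
To make the local-PDF check concrete, here is a rough sketch of how missing files could be detected against the naming convention above. The helper names and the CSV columns (paper_id, expected_path) are assumptions for illustration; the real layout of papers/missing_pdfs.csv is not specified in this document.

```python
import csv
import tempfile
from pathlib import Path


def find_missing_pdfs(paper_ids, pdf_dir):
    """Return the paper IDs that have no <paper_id>.pdf in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    return [pid for pid in paper_ids if not (pdf_dir / f"{pid}.pdf").is_file()]


def write_missing_csv(missing, out_path):
    """Write a machine-readable to-do list (column names are hypothetical)."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["paper_id", "expected_path"])
        for pid in missing:
            writer.writerow([pid, f"papers/pdfs/{pid}.pdf"])


# Demo in a throwaway workspace; the paper IDs are made up.
with tempfile.TemporaryDirectory() as tmp:
    pdf_dir = Path(tmp) / "papers" / "pdfs"
    pdf_dir.mkdir(parents=True)
    (pdf_dir / "P0001.pdf").write_bytes(b"%PDF-1.4\n")
    missing = find_missing_pdfs(["P0001", "P0002"], pdf_dir)
    out_csv = Path(tmp) / "papers" / "missing_pdfs.csv"
    write_missing_csv(missing, out_csv)
    csv_lines = out_csv.read_text(encoding="utf-8").splitlines()
```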

Workflow (heuristic)

  1. Read papers/core_set.csv.
  2. If outline/mapping.tsv exists, prioritize mapped papers first.
  3. For each selected paper (fulltext mode):
    • resolve the PDF URL (use pdf_url if present, else derive it from arxiv_id / url when possible)
    • download to papers/pdfs/<paper_id>.pdf if missing
    • extract a reasonable prefix of text (at most --max-pages pages) to papers/fulltext/<paper_id>.txt
    • append/update a JSONL record in papers/fulltext_index.jsonl with status + stats
  4. Never overwrite existing extracted text unless explicitly requested (delete the .txt to re-extract).
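
The URL-resolution step in the workflow above can be sketched as a simple fallback chain. This is an assumption-laden illustration rather than the script's logic: the function name resolve_pdf_url is hypothetical, and the only derivation shown is the public arXiv convention that https://arxiv.org/abs/<id> has its PDF at https://arxiv.org/pdf/<id>.

```python
import re

# Captures the identifier from an arXiv abstract-page URL.
_ARXIV_ABS_RE = re.compile(r"arxiv\.org/abs/([^\s?#]+)")


def resolve_pdf_url(row):
    """Pick a PDF URL for one core_set.csv row, or None if nothing usable.

    Order: explicit pdf_url, then arxiv_id, then url (hypothetical policy).
    """
    pdf_url = (row.get("pdf_url") or "").strip()
    if pdf_url:
        return pdf_url
    arxiv_id = (row.get("arxiv_id") or "").strip()
    if arxiv_id:
        return f"https://arxiv.org/pdf/{arxiv_id}"
    url = (row.get("url") or "").strip()
    match = _ARXIV_ABS_RE.search(url)
    if match:
        return f"https://arxiv.org/pdf/{match.group(1)}"
    if url.lower().endswith(".pdf"):
        return url  # the landing URL already points at a PDF
    return None


resolved = resolve_pdf_url({"paper_id": "P0001", "arxiv_id": "2101.00001"})
```

Returning None instead of guessing keeps the behavior conservative: a paper with no resolvable URL can be recorded in the index as not downloaded rather than fetched from a wrong location.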

Quality checklist

  • papers/fulltext_index.jsonl exists and is non-empty.
  • If evidence_mode: "fulltext": at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
  • If evidence_mode: "abstract": the index records clearly reflect skip status (no downloads attempted).
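
A coverage check along the lines of this checklist might look like the sketch below. The record schema is an assumption: this document does not specify the fields in papers/fulltext_index.jsonl, so the status field and the "ok" / "skipped" values used here are hypothetical placeholders.

```python
import json


def extraction_coverage(jsonl_text, ok_status="ok"):
    """Count (ok_records, total_records) in JSONL index text.

    Assumes each record carries a "status" field (hypothetical schema).
    """
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    ok = sum(1 for rec in records if rec.get("status") == ok_status)
    return ok, len(records)


# Illustrative index content; field names and values are made up.
sample_index = "\n".join(
    [
        json.dumps({"paper_id": "P0001", "status": "ok", "chars": 5400}),
        json.dumps({"paper_id": "P0002", "status": "skipped"}),
        json.dumps({"paper_id": "P0003", "status": "ok", "chars": 3100}),
    ]
)
ok_count, total = extraction_coverage(sample_index)
```

An empty index yields (0, 0), which a caller can treat as a failure of the first checklist item.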

Script

Quick Start

  • python .codex/skills/pdf-text-extractor/scripts/run.py --help
  • python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>

All Options

  • --max-papers <n>: cap number of papers processed (can be overridden by queries.md)
  • --max-pages <n>: extract at most N pages per PDF
  • --min-chars <n>: minimum extracted chars to count as OK
  • --sleep <sec>: delay between downloads
  • --local-pdfs-only: do not download; only use papers/pdfs/<paper_id>.pdf if present
  • queries.md supports: evidence_mode, fulltext_max_papers, fulltext_max_pages, fulltext_min_chars
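
For readers who want to picture how these flags and the queries.md override could fit together, here is a hedged sketch using argparse. It is not the contents of run.py: the defaults of None, the helper effective_max_papers, and the dictionary of parsed queries.md settings are all assumptions; only the flag and setting names come from the list above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--workspace", required=True)
    parser.add_argument("--max-papers", type=int, default=None)
    parser.add_argument("--max-pages", type=int, default=None)
    parser.add_argument("--min-chars", type=int, default=None)
    parser.add_argument("--sleep", type=float, default=None)
    parser.add_argument("--local-pdfs-only", action="store_true")
    return parser


def effective_max_papers(cli_value, queries_settings):
    """Apply the documented rule that queries.md can override --max-papers."""
    override = queries_settings.get("fulltext_max_papers")
    return int(override) if override is not None else cli_value


args = build_parser().parse_args(
    ["--workspace", "ws", "--max-papers", "20", "--local-pdfs-only"]
)
limit = effective_max_papers(args.max_papers, {"fulltext_max_papers": 5})
```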

Examples

  • Abstract mode (no downloads):
    • Set - evidence_mode: "abstract" in queries.md, then run the script (it will emit papers/fulltext_index.jsonl with skip statuses)
  • Fulltext mode with local PDFs only:
    • Set - evidence_mode: "fulltext" in queries.md, put PDFs under papers/pdfs/, then run: python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only
  • Fulltext mode with smaller budget:
    • python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200

Notes

  • Downloads are cached under papers/pdfs/; extracted text is cached under papers/fulltext/.
  • The script does not overwrite existing extracted text unless you delete the .txt file.

Troubleshooting

Issue: no PDFs are available to download

Fix:
  • Use evidence_mode: abstract (default) or provide local PDFs under papers/pdfs/ and rerun with --local-pdfs-only.

Issue: extracted text is empty/garbled

Fix:
  • Try a different extraction backend if supported; otherwise mark the paper as abstract evidence level and avoid strong fulltext claims.
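
One way to spot empty or garbled output before relying on it is a cheap text-quality heuristic. The sketch below is only an illustration and is not part of the script: the function name looks_usable and the 0.5 alphanumeric-ratio threshold are assumptions, and the 1200-character floor simply reuses the --min-chars value from the budget example in this document.

```python
def looks_usable(text, min_chars=1200, min_alnum_ratio=0.5):
    """Heuristically decide whether extracted text is worth citing.

    Rejects text that is too short or dominated by non-alphanumeric symbols,
    which is a common symptom of a failed extraction (assumed thresholds).
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    visible = [ch for ch in stripped if not ch.isspace()]
    if not visible:
        return False
    alnum = sum(ch.isalnum() for ch in visible)
    return alnum / len(visible) >= min_alnum_ratio


good = looks_usable("word " * 300)         # plain words, long enough
garbled = looks_usable("\ufffd#" * 1000)   # replacement characters and symbols
```

A paper failing such a check could be treated as abstract-level evidence, in line with the fix described above.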