pdf-text-extractor

PDF Text Extractor

Optionally collect full-text snippets to deepen evidence beyond abstracts.

This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.

Inputs

- `papers/core_set.csv` (expects `paper_id`, `title`, and ideally `pdf_url` / `arxiv_id` / `url`)
- Optional: `outline/mapping.tsv` (to prioritize mapped papers)

Outputs

- `papers/fulltext_index.jsonl` (one record per attempted paper)
- Side artifacts:
  - `papers/pdfs/<paper_id>.pdf` (cached downloads)
  - `papers/fulltext/<paper_id>.txt` (extracted text)

Decision: evidence mode

`queries.md` can set `evidence_mode: "abstract" | "fulltext"`.
- `abstract` (default template): do not download; write an index that clearly records skipping.
- `fulltext`: download PDFs (when possible) and extract text to `papers/fulltext/`.
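
As a rough illustration of how the mode could be read, here is a minimal sketch. It assumes `queries.md` stores settings as lines such as `- evidence_mode: "fulltext"` (the form used elsewhere in this document); the function name `read_evidence_mode` is hypothetical and the actual parser in `run.py` may behave differently.

```python
import re

# Assumption: settings appear one per line, e.g.  - evidence_mode: "fulltext"
# The leading "- " and the quotes are both treated as optional here.
_SETTING = re.compile(r'^\s*-?\s*([A-Za-z_]+)\s*:\s*"?([^"\n]*?)"?\s*$')


def read_evidence_mode(queries_text: str, default: str = "abstract") -> str:
    """Return "abstract" or "fulltext"; fall back to the default otherwise."""
    for line in queries_text.splitlines():
        match = _SETTING.match(line)
        if match and match.group(1) == "evidence_mode":
            value = match.group(2).strip().lower()
            if value in ("abstract", "fulltext"):
                return value
    return default
```

Falling back to `abstract` for missing or unrecognized values matches the conservative default described above: nothing is downloaded unless full text is requested explicitly.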

Local PDFs Mode

When you cannot or should not download PDFs (restricted network, rate limits, no permission), provide the PDFs manually and run in "local PDFs only" mode.

- PDF naming convention: `papers/pdfs/<paper_id>.pdf`, where `<paper_id>` matches `papers/core_set.csv`.
- Set `- evidence_mode: "fulltext"` in `queries.md`.
- Run: `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only`

If PDFs are missing, the script writes a to-do list:

- `output/MISSING_PDFS.md` (human-readable summary)
- `papers/missing_pdfs.csv` (machine-readable list)
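
To see ahead of time which PDFs still need to be supplied, a check along these lines may help. This is a sketch under the naming convention above, not the script's own logic; the function name `find_missing_pdfs` is hypothetical.

```python
import csv
from pathlib import Path
from typing import List


def find_missing_pdfs(workspace: Path) -> List[str]:
    """List paper_ids in papers/core_set.csv that have no papers/pdfs/<paper_id>.pdf."""
    core_set = workspace / "papers" / "core_set.csv"
    pdf_dir = workspace / "papers" / "pdfs"
    missing = []
    with core_set.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            paper_id = (row.get("paper_id") or "").strip()
            # A paper counts as missing when its expected PDF file is absent.
            if paper_id and not (pdf_dir / f"{paper_id}.pdf").is_file():
                missing.append(paper_id)
    return missing
```

Calling `find_missing_pdfs(Path("<ws>"))` before a `--local-pdfs-only` run gives the same kind of list the script reports afterwards.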

Workflow (heuristic)

- Read `papers/core_set.csv`.
- If `outline/mapping.tsv` exists, prioritize mapped papers first.
- For each selected paper (fulltext mode):
  - resolve `pdf_url` (use `pdf_url`, else derive from `arxiv_id` / `url` when possible)
  - download to `papers/pdfs/<paper_id>.pdf` if missing
  - extract a reasonable prefix of text to `papers/fulltext/<paper_id>.txt`
  - append/update a JSONL record in `papers/fulltext_index.jsonl` with status + stats
- Never overwrite existing extracted text unless explicitly requested (delete the `.txt` to re-extract).
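
The URL resolution step above could look roughly like the following. This is a sketch only: the function name `resolve_pdf_url` is hypothetical, and it assumes that arXiv serves PDFs at `https://arxiv.org/pdf/<arxiv_id>` and that an arXiv abstract link has the form `https://arxiv.org/abs/<arxiv_id>`. The script's real derivation rules may cover more or fewer cases.

```python
import re
from typing import Optional

# Assumption: arXiv PDFs are reachable at this URL pattern.
ARXIV_PDF = "https://arxiv.org/pdf/{arxiv_id}"
_ARXIV_ABS = re.compile(r"^https?://arxiv\.org/abs/(?P<id>[^?#\s]+)")


def resolve_pdf_url(row: dict) -> Optional[str]:
    """Prefer pdf_url, then arxiv_id, then a url that can be mapped to a PDF."""
    pdf_url = (row.get("pdf_url") or "").strip()
    if pdf_url:
        return pdf_url
    arxiv_id = (row.get("arxiv_id") or "").strip()
    if arxiv_id:
        return ARXIV_PDF.format(arxiv_id=arxiv_id)
    url = (row.get("url") or "").strip()
    match = _ARXIV_ABS.match(url)
    if match:
        return ARXIV_PDF.format(arxiv_id=match.group("id"))
    if url.lower().endswith(".pdf"):
        return url  # the generic url already points at a PDF
    return None  # nothing usable; the paper stays at abstract-level evidence
```

Returning `None` rather than guessing keeps the step conservative: a paper without a resolvable PDF is simply recorded as not downloaded.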

Quality checklist

- `papers/fulltext_index.jsonl` exists and is non-empty.
- If `evidence_mode: "fulltext"`: at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
- If `evidence_mode: "abstract"`: the index records clearly reflect skip status (no downloads attempted).
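
The first two checklist items can be approximated from the files on disk without relying on the index's field names, which are not specified here. In the sketch below, the function name `check_fulltext_outputs` and the `min_coverage` threshold of 0.05 are assumptions chosen for illustration; the threshold that strict mode actually applies is not stated in this document.

```python
import json
from pathlib import Path
from typing import List


def check_fulltext_outputs(workspace: Path, min_coverage: float = 0.05) -> List[str]:
    """Return a list of problems for a fulltext-mode run (empty list means OK)."""
    index_path = workspace / "papers" / "fulltext_index.jsonl"
    records = []
    if index_path.is_file():
        with index_path.open(encoding="utf-8") as handle:
            # One JSON object per non-blank line.
            records = [json.loads(line) for line in handle if line.strip()]
    if not records:
        return ["papers/fulltext_index.jsonl is missing or empty"]

    text_dir = workspace / "papers" / "fulltext"
    extracted = []
    if text_dir.is_dir():
        extracted = [p for p in text_dir.glob("*.txt") if p.stat().st_size > 0]

    # Coverage = non-empty extracted text files per attempted paper.
    coverage = len(extracted) / len(records)
    if coverage < min_coverage:
        return [f"extraction coverage is near zero ({len(extracted)}/{len(records)})"]
    return []
```

In abstract mode this coverage check does not apply, since no text is expected under `papers/fulltext/`.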

Script

Quick Start

python .codex/skills/pdf-text-extractor/scripts/run.py --help

python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>

All Options

- `--max-papers <n>`: cap number of papers processed (can be overridden by `queries.md`)
- `--max-pages <n>`: extract at most N pages per PDF
- `--min-chars <n>`: minimum extracted chars to count as OK
- `--sleep <sec>`: delay between downloads
- `--local-pdfs-only`: do not download; only use `papers/pdfs/<paper_id>.pdf` if present

`queries.md` supports: `evidence_mode`, `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`
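
One plausible reading of how the flags and the `queries.md` keys interact is sketched below. Only the `--max-papers` override is stated in this document; treating `fulltext_max_pages` and `fulltext_min_chars` the same way is an assumption, and the function name `effective_limits` is hypothetical.

```python
# Assumed pairing of CLI options with queries.md keys (inferred from the names).
_OVERRIDES = {
    "max_papers": "fulltext_max_papers",
    "max_pages": "fulltext_max_pages",
    "min_chars": "fulltext_min_chars",
}


def effective_limits(cli: dict, queries: dict) -> dict:
    """Start from the CLI values, then let queries.md settings override them."""
    limits = dict(cli)
    for cli_key, query_key in _OVERRIDES.items():
        raw = queries.get(query_key)
        if raw is not None and str(raw).strip():
            limits[cli_key] = int(raw)
    return limits
```

For example, `effective_limits({"max_papers": 50, "max_pages": 4, "min_chars": 1200}, {"fulltext_max_papers": "20"})` keeps the page and character limits from the command line and lowers the paper cap to 20.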

Examples

Abstract mode (no downloads):
- Set `- evidence_mode: "abstract"` in `queries.md`, then run the script (it will emit `papers/fulltext_index.jsonl` with skip statuses)

Fulltext mode with local PDFs only:
- Set `- evidence_mode: "fulltext"` in `queries.md`, put PDFs under `papers/pdfs/`, then run: `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only`

Fulltext mode with smaller budget:
- `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200`

Notes

- Downloads are cached under `papers/pdfs/`; extracted text is cached under `papers/fulltext/`.
- The script does not overwrite existing extracted text unless you delete the `.txt` file.

Troubleshooting

Issue: no PDFs are available to download

Fix:

- Use `evidence_mode: abstract` (default) or provide local PDFs under `papers/pdfs/` and rerun with `--local-pdfs-only`.

Issue: extracted text is empty/garbled

Fix:

- Try a different extraction backend if one is supported; otherwise mark the paper as `abstract` evidence level and avoid strong claims that depend on its full text.
