pdf-text-extractor
PDF Text Extractor
Optionally collect full-text snippets to deepen evidence beyond abstracts.
This skill is intentionally conservative: in many survey runs, abstract/snippet mode is enough and avoids heavy downloads.
Inputs
- `papers/core_set.csv` (expects `paper_id`, `title`, and ideally `pdf_url`/`arxiv_id`/`url`)
- Optional: `outline/mapping.tsv` (to prioritize mapped papers)
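As a rough illustration of the input contract above, the sketch below checks a core set for the expected columns and flags papers that have no PDF locator. The function name `load_core_set` and the choice to raise on missing columns are assumptions made for illustration, not the behavior of the actual script.

```python
import csv
import io

REQUIRED = ("paper_id", "title")           # columns the skill expects
LOCATORS = ("pdf_url", "arxiv_id", "url")  # at least one is ideally present


def load_core_set(csv_text):
    """Parse core_set.csv content; return (rows, paper_ids_without_locator)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        # Assumed policy: fail fast when a required column is absent.
        raise ValueError(f"core_set.csv is missing required columns: {missing}")
    rows, unlocatable = [], []
    for row in reader:
        rows.append(row)
        # A paper with no pdf_url, arxiv_id, or url cannot be resolved to a PDF.
        if not any((row.get(col) or "").strip() for col in LOCATORS):
            unlocatable.append(row["paper_id"])
    return rows, unlocatable
```

Papers reported as unlocatable are natural candidates for abstract-level evidence or for manually supplied PDFs.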
Outputs
- `papers/fulltext_index.jsonl` (one record per attempted paper)
- Side artifacts:
  - `papers/pdfs/<paper_id>.pdf` (cached downloads)
  - `papers/fulltext/<paper_id>.txt` (extracted text)
Decision: evidence mode
- `queries.md` can set `evidence_mode: "abstract" | "fulltext"`.
  - `abstract` (default template): do not download; write an index that clearly records skipping.
  - `fulltext`: download PDFs (when possible) and extract text to `papers/fulltext/`.
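One way the mode could be read from `queries.md` is sketched below. It assumes the setting appears on its own line, optionally as a list item such as `- evidence_mode: "fulltext"`, and falls back to `abstract` when nothing matches. The helper name `read_evidence_mode` is hypothetical, and the real script may parse the file differently.

```python
import re

# Matches lines like:  - evidence_mode: "fulltext"   or   evidence_mode: abstract
_MODE_RE = re.compile(
    r'^\s*-?\s*evidence_mode\s*:\s*"?(abstract|fulltext)"?\s*$',
    re.MULTILINE,
)


def read_evidence_mode(queries_md_text, default="abstract"):
    """Return the configured evidence mode, or the default if none is set."""
    match = _MODE_RE.search(queries_md_text)
    return match.group(1) if match else default
```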
Local PDFs Mode
When you cannot/should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in “local PDFs only” mode.
- PDF naming convention: `papers/pdfs/<paper_id>.pdf`, where `<paper_id>` matches `papers/core_set.csv`.
- Set `- evidence_mode: "fulltext"` in `queries.md`.
- Run: `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only`
If PDFs are missing, the script writes a to-do list:
- `output/MISSING_PDFS.md` (human-readable summary)
- `papers/missing_pdfs.csv` (machine-readable list)
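Before running in local-only mode, it can help to see which PDFs are still absent. The sketch below applies the naming convention above; the function name `find_missing_pdfs` is an assumption, and it only approximates the check the script performs when it builds its to-do list.

```python
from pathlib import Path


def find_missing_pdfs(paper_ids, pdf_dir):
    """Return the paper IDs that have no <paper_id>.pdf file in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    return [pid for pid in paper_ids if not (pdf_dir / f"{pid}.pdf").is_file()]
```

For example, `find_missing_pdfs(ids, "<ws>/papers/pdfs")` lists the IDs for which a PDF still has to be provided by hand.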
Workflow (heuristic)
- Read `papers/core_set.csv`.
- If `outline/mapping.tsv` exists, prioritize mapped papers first.
- For each selected paper (fulltext mode):
  - resolve `pdf_url` (use `pdf_url`, else derive from `arxiv_id`/`url` when possible)
  - download to `papers/pdfs/<paper_id>.pdf` if missing
  - extract a reasonable prefix of text to `papers/fulltext/<paper_id>.txt`
  - append/update a JSONL record in `papers/fulltext_index.jsonl` with status + stats
- Never overwrite existing extracted text unless explicitly requested (delete the `.txt` to re-extract).
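The resolve step above might look roughly like the following. This is a sketch under assumptions: the fallback order (`pdf_url`, then `arxiv_id`, then an arXiv-style `url`) mirrors the description, but the regular expression and the helper name `resolve_pdf_url` are illustrative rather than taken from the script.

```python
import re

# Captures the identifier from arXiv abs/pdf links, dropping a trailing ".pdf".
_ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/([^\s?#]+?)(?:\.pdf)?(?:[?#].*)?$")


def resolve_pdf_url(row):
    """Return a PDF URL for a core-set row, or None if none can be derived."""
    pdf_url = (row.get("pdf_url") or "").strip()
    if pdf_url:
        return pdf_url
    arxiv_id = (row.get("arxiv_id") or "").strip()
    if arxiv_id:
        return f"https://arxiv.org/pdf/{arxiv_id}"
    url = (row.get("url") or "").strip()
    match = _ARXIV_RE.search(url)
    if match:
        return f"https://arxiv.org/pdf/{match.group(1)}"
    if url.lower().endswith(".pdf"):
        return url  # the landing URL already points at a PDF
    return None
```

Rows for which this returns `None` cannot be downloaded and would need a local PDF or abstract-level evidence.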
Quality checklist
- `papers/fulltext_index.jsonl` exists and is non-empty.
- If `evidence_mode: "fulltext"`: at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
- If `evidence_mode: "abstract"`: the index records clearly reflect skip status (no downloads attempted).
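A coverage check along the lines of this checklist could be sketched as follows. The record layout is not specified here, so the `status` field and the `"ok"` value are hypothetical placeholders; adjust them to whatever the index actually records.

```python
import json


def extraction_coverage(jsonl_text, ok_status="ok"):
    """Return (records_with_ok_status, total_records) for an index in JSONL form."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # "status" is an assumed field name; records without it count as not OK.
    ok = sum(1 for record in records if record.get("status") == ok_status)
    return ok, len(records)
```

A total of zero means the index is empty; in fulltext mode, an OK count near zero corresponds to the near-zero coverage case described above.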
Script
Quick Start
- `python .codex/skills/pdf-text-extractor/scripts/run.py --help`
- `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>`
All Options
- `--max-papers <n>`: cap number of papers processed (can be overridden by `queries.md`)
- `--max-pages <n>`: extract at most N pages per PDF
- `--min-chars <n>`: minimum extracted chars to count as OK
- `--sleep <sec>`: delay between downloads
- `--local-pdfs-only`: do not download; only use `papers/pdfs/<paper_id>.pdf` if present
- `queries.md` supports: `evidence_mode`, `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`
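To make the option surface concrete, here is a minimal `argparse` sketch that accepts the flags listed above. It is not the actual `run.py`: the defaults (all unset) and the helper `effective_max_papers` are assumptions, with the precedence rule taken from the note that `queries.md` can override `--max-papers`.

```python
import argparse


def build_parser():
    """Hypothetical parser; flag names follow the list above, defaults are assumed."""
    parser = argparse.ArgumentParser(description="Extract text from paper PDFs (sketch).")
    parser.add_argument("--workspace", required=True, help="workspace directory")
    parser.add_argument("--max-papers", type=int, default=None, help="cap on papers processed")
    parser.add_argument("--max-pages", type=int, default=None, help="max pages extracted per PDF")
    parser.add_argument("--min-chars", type=int, default=None, help="min chars to count as OK")
    parser.add_argument("--sleep", type=float, default=None, help="delay between downloads (sec)")
    parser.add_argument("--local-pdfs-only", action="store_true", help="never download")
    return parser


def effective_max_papers(cli_value, queries_value):
    """Assumed precedence: a value set in queries.md wins over the CLI flag."""
    return queries_value if queries_value is not None else cli_value
```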
Examples
- Abstract mode (no downloads):
  - Set `- evidence_mode: "abstract"` in `queries.md`, then run the script (it will emit `papers/fulltext_index.jsonl` with skip statuses)
- Fulltext mode with local PDFs only:
  - Set `- evidence_mode: "fulltext"` in `queries.md`, put PDFs under `papers/pdfs/`, then run: `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only`
- Fulltext mode with smaller budget:
  - `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200`
Notes
- Downloads are cached under `papers/pdfs/`; extracted text is cached under `papers/fulltext/`.
- The script does not overwrite existing extracted text unless you delete the `.txt` file.
Troubleshooting
Issue: no PDFs are available to download
Fix:
- Use `evidence_mode: abstract` (default) or provide local PDFs under `papers/pdfs/` and rerun with `--local-pdfs-only`.
Issue: extracted text is empty/garbled
Fix:
- Try a different extraction backend if supported; otherwise mark the paper as `abstract` evidence level and avoid strong fulltext claims.