split-pdf
Split-PDF: Download, Split, and Deep-Read Academic Papers
CRITICAL RULE: Never read a full PDF. Only read the 4-page split files, and only 3 splits at a time (~12 pages). Reading a full PDF will either crash the session with a "context limit exceeded" error or produce shallow, hallucinated output.
When This Skill Is Invoked
You want to read, review, or analyze an academic paper for either:
- Teaching workflow: Reading papers to prepare lectures, understand context, extract key findings
- Research workflow: Reading papers for a research project, building background, identifying methodology
The input is either:
- A file path to a local PDF (e.g., `/path/to/articles/smith_2024.pdf`)
- A search query or paper title (e.g., `"Gentzkow Shapiro Sinkinson 2014 competition newspapers"`)
Important: You cannot search for a paper you don't know exists. The user MUST provide either a file path or a specific search query — an author name, a title, keywords, a year, or some combination that identifies the paper. If invoked without specifying what paper to read, ask the user.
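The input rule above can be sketched as a small dispatcher. This is only an illustration: `classify_input` and its return labels are hypothetical names, and treating any string ending in `.pdf` as a local path is an assumed heuristic. Existence of the file is still checked in Step 1.

```python
from pathlib import Path


def classify_input(user_input: str) -> tuple[str, str]:
    """Classify the input as a local PDF path or a search query.

    Hypothetical helper: returns ("local_pdf", text) or ("search_query", text).
    Raises ValueError on empty input so the caller can ask the user.
    """
    text = user_input.strip()
    if not text:
        raise ValueError("No paper specified: ask the user for a path or a query.")
    # Assumed heuristic: anything ending in .pdf is meant as a file path.
    if Path(text).suffix.lower() == ".pdf":
        return ("local_pdf", text)
    return ("search_query", text)
```

A `ValueError` here corresponds to the "ask the user" case: the skill was invoked without naming a paper.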
Step 1: Acquire the PDF
If a local file path is provided:
- Verify the file exists
- If the file is NOT already inside `articles/`, copy it there (preserve the original location)
- Proceed to Step 2
If a search query or paper title is provided:
- Use WebSearch to find the paper
- Download the PDF (request user permission if required)
- Save it to `articles/` in the project directory (create the directory if needed)
- Proceed to Step 2
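For the local-file case, a minimal sketch of the copy step might look like the following. The function name `ensure_in_articles` is hypothetical, and returning the existing file when `articles/` already holds one with the same name is an assumption chosen so that nothing is ever overwritten.

```python
import shutil
from pathlib import Path


def ensure_in_articles(pdf_path: str, articles_dir: str = "articles") -> Path:
    """Return a path to the PDF inside articles_dir, copying it there if needed."""
    src = Path(pdf_path).resolve()
    if not src.is_file():
        raise FileNotFoundError(f"PDF not found: {src}")
    dest_dir = Path(articles_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    if dest_dir in src.parents:
        return src  # already somewhere inside articles/
    dest = dest_dir / src.name
    if dest.exists():
        return dest  # assumption: keep the existing file rather than overwrite it
    shutil.copy2(src, dest)  # copy, not move: the source stays where it was
    return dest
```

Using a copy rather than a move keeps the file at its original location, as the first bullet list requires.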
CRITICAL: Always preserve the original PDF. The PDF in `articles/` must NEVER be deleted, moved, or overwritten. Split files are derivatives; the original is permanent.
Step 2: Split the PDF into 4-Page Chunks
Create a subdirectory for the splits and run the splitting script:
```python
import os

from PyPDF2 import PdfReader, PdfWriter


def split_pdf(input_path, output_dir, pages_per_chunk=4):
    """Split PDF into 4-page chunks. Preserves original."""
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_path)
    total = len(reader.pages)
    # Name each chunk after the source file, e.g. smith_2024_pp1-4.pdf
    prefix = os.path.splitext(os.path.basename(input_path))[0]
    for start in range(0, total, pages_per_chunk):
        end = min(start + pages_per_chunk, total)  # the last chunk may be shorter
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        # File names use 1-based page numbers
        out_name = f"{prefix}_pp{start+1}-{end}.pdf"
        out_path = os.path.join(output_dir, out_name)
        with open(out_path, "wb") as f:
            writer.write(f)
    # -(-a // b) is ceiling division
    print(f"Split {total} pages into {-(-total // pages_per_chunk)} chunks in {output_dir}")
```
Directory convention:
```
articles/
├── smith_2024.pdf              # original, NEVER DELETE
└── split_smith_2024/           # split subdirectory
    ├── smith_2024_pp1-4.pdf
    ├── smith_2024_pp5-8.pdf
    ├── smith_2024_pp9-12.pdf
    └── notes.md                # structured notes
```
If PyPDF2 is not installed:
```
pip install PyPDF2
```
Step 3: Read in Batches of 3 Splits
Read exactly 3 split files at a time (~12 pages); the final batch may contain fewer. For each batch:
- Read the 3 split PDFs using Cowork's Read tool
- Update the running notes file (`notes.md` in the split subdirectory)
- Report to the user:
"I have finished reading splits [X-Y] and updated the notes. I have [N] more splits remaining. Would you like me to continue?"
- Wait for user confirmation before reading the next batch
Do NOT read ahead. Do NOT read all splits at once.
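The batching rule can be sketched as below. The helper names (`first_page`, `plan_batches`, `progress_message`) are hypothetical, and the code assumes split files follow the `<name>_pp<start>-<end>.pdf` naming used in Step 2. Sorting by the numeric start page matters because a plain alphabetical sort would place `pp13-16` before `pp5-8`.

```python
import re
from pathlib import Path

BATCH_SIZE = 3  # splits per batch, per the rule above


def first_page(split_file: Path) -> int:
    """Extract the starting page from a name like smith_2024_pp5-8.pdf."""
    match = re.search(r"_pp(\d+)-(\d+)\.pdf$", split_file.name)
    if match is None:
        raise ValueError(f"Not a split file name: {split_file.name}")
    return int(match.group(1))


def plan_batches(split_dir: str, batch_size: int = BATCH_SIZE) -> list[list[Path]]:
    """Group split PDFs into reading batches, ordered by page number."""
    splits = sorted(Path(split_dir).glob("*_pp*-*.pdf"), key=first_page)
    return [splits[i:i + batch_size] for i in range(0, len(splits), batch_size)]


def progress_message(batches: list[list[Path]], batch_index: int) -> str:
    """Build the report shown to the user after finishing one batch."""
    done_before = sum(len(b) for b in batches[:batch_index])
    first = done_before + 1
    last = done_before + len(batches[batch_index])
    remaining = sum(len(b) for b in batches[batch_index + 1:])
    return (
        f"I have finished reading splits {first}-{last} and updated the notes. "
        f"I have {remaining} more splits remaining. Would you like me to continue?"
    )
```

The plan only decides the order; each batch is still read one at a time, with a pause for user confirmation in between.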
Step 4: Structured Extraction
As you read, collect information along 8 dimensions and write them into `notes.md`:
- Research Question: What is the paper asking? Why does it matter?
- Audience: Which research community cares about this work?
- Method: How do they answer the question? Identification strategy?
- Data Sources: What data? Where from? Unit of observation? Sample size? Time period?
- Statistical Methods: What econometric or statistical techniques? Key specifications?
- Findings: Main results? Key coefficient estimates and standard errors?
- Contributions: What is new? What did we learn?
- Replication Feasibility: Is data public? Replication archive? Data appendix? URLs?
These 8 dimensions extract what a researcher needs to build on or replicate the work.
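One way to start the notes file is to generate an empty skeleton with one header per dimension. This is a sketch only: `notes_skeleton` is a hypothetical helper, and the Markdown header levels (`#` for the title, `##` for each dimension) are an assumption, not something the skill prescribes.

```python
DIMENSIONS = [
    "Research Question",
    "Audience",
    "Method",
    "Data Sources",
    "Statistical Methods",
    "Findings",
    "Contributions",
    "Replication Feasibility",
]


def notes_skeleton(paper_name: str) -> str:
    """Return an empty notes.md body with one section per dimension."""
    lines = [f"# Notes: {paper_name}", ""]
    for dimension in DIMENSIONS:
        lines += [f"## {dimension}", ""]
    return "\n".join(lines)
```

For example, `notes_skeleton("smith_2024")` produces a title line followed by eight empty sections, ready to be filled in batch by batch.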
Step 5: The Notes File
The output is `notes.md` in the split subdirectory:
`articles/split_smith_2024/notes.md`
This file is updated incrementally after each batch. Structure it with headers for each of the 8 dimensions. After each batch, update whichever dimensions have new information; do not rewrite from scratch.
By the final batch, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. A structured extraction, not a summary.
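An incremental update could be sketched as follows, assuming the notes file uses one `## <dimension>` header per dimension (that header format is an assumption). The hypothetical helper `append_to_dimension` adds one bullet to the end of a section and leaves every other line untouched.

```python
from pathlib import Path


def append_to_dimension(notes_path: str, dimension: str, entry: str) -> None:
    """Add one bullet under '## <dimension>' without touching other sections."""
    path = Path(notes_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    header = f"## {dimension}"
    if header not in lines:
        raise ValueError(f"Missing header: {header}")
    # The section runs from the header to the next '## ' header or end of file.
    start = lines.index(header) + 1
    end = next(
        (i for i in range(start, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Insert after the last non-blank line of the section.
    insert_at = end
    while insert_at > start and not lines[insert_at - 1].strip():
        insert_at -= 1
    lines.insert(insert_at, f"- {entry}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Prefixing each entry with the page range it came from (for example `pp. 5-8: ...`) makes it easier to trace a note back to its split.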
When NOT to Split
- Papers shorter than ~15 pages: Read directly using Cowork's Read tool
- Policy briefs or non-technical documents: A rough summary is acceptable
- Triage only: Read just the first split (pages 1-4, abstract + introduction)
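These exceptions can be read as a small decision rule. The sketch below is illustrative: the function name, the mode labels, and the precedence among the three exceptions are assumptions, and the 15-page cutoff is treated as a hard threshold even though the text gives it only as approximate.

```python
SHORT_PAPER_PAGES = 15  # approximate cutoff from the list above


def reading_mode(page_count: int, technical: bool = True, triage_only: bool = False) -> str:
    """Pick a reading mode for a document (hypothetical labels)."""
    if triage_only:
        return "first_split_only"  # pages 1-4: abstract + introduction
    if not technical:
        return "rough_summary"  # policy briefs, non-technical documents
    if page_count < SHORT_PAPER_PAGES:
        return "read_directly"  # short enough to read without splitting
    return "split_and_batch"  # the default workflow in Steps 2-5
```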
Quick Reference
| Step | Action |
|---|---|
| Acquire | Download to `articles/` |
| Split | 4-page chunks into `articles/split_<name>/` |
| Read | 3 splits at a time; pause after each batch |
| Write | Update `notes.md` |
| Confirm | Ask user before continuing to next batch |
Key Differences from Original
- Cowork compatible: No `.claude/commands`, no slash commands. Works with Cowork's file system and tools.
- Dual workflow: Explicitly supports both teaching (lecture prep) and research (project work).
- PyPDF2-based splitting: Uses an industry-standard PDF library.
- Preserved originals: Split files saved to `articles/split_<name>/`, originals never deleted.
- Structured 8-dimension extraction: Methodical note-taking across research dimensions.