split-pdf

Split-PDF: Download, Split, and Deep-Read Academic Papers

CRITICAL RULE: Never read a full PDF, apart from the short-paper exception under When NOT to Split. Only read the 4-page split files, and only 3 splits at a time (~12 pages). Reading a full PDF will either crash the session with a "context limit exceeded" error or produce shallow, hallucinated output.

When This Skill Is Invoked

Invoke this skill to read, review, or analyze an academic paper for either of two workflows:
  • Teaching workflow: reading papers to prepare lectures, understand context, and extract key findings
  • Research workflow: reading papers for a research project, building background, and identifying methodology
The input is either:
  • A file path to a local PDF (e.g., /path/to/articles/smith_2024.pdf)
  • A search query or paper title (e.g., "Gentzkow Shapiro Sinkinson 2014 competition newspapers")
Important: You cannot search for a paper you don't know exists. The user MUST provide either a file path or a specific search query — an author name, a title, keywords, a year, or some combination that identifies the paper. If invoked without specifying what paper to read, ask the user.

Step 1: Acquire the PDF

If a local file path is provided:
  • Verify the file exists
  • If the file is NOT already inside articles/, copy it there and leave the original file in place
  • Proceed to Step 2
If a search query or paper title is provided:
  1. Use WebSearch to find the paper
  2. Download the PDF (request user permission if required)
  3. Save it to articles/ in the project directory (create the directory if needed)
  4. Proceed to Step 2
CRITICAL: Always preserve the original PDF. The PDF in articles/ must NEVER be deleted, moved, or overwritten. Split files are derivatives; the original is permanent.
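The local-file branch can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: the function name acquire_local_pdf, the choice to treat only direct children of articles/ as "already inside", and the refuse-to-overwrite policy are all assumptions.

```python
import shutil
from pathlib import Path


def acquire_local_pdf(source, articles_dir="articles"):
    """Copy a local PDF into articles_dir unless it is already there.

    Hypothetical helper: the name and the refuse-to-overwrite policy are
    assumptions, not part of the skill text.
    """
    src = Path(source).resolve()
    if not src.is_file():
        raise FileNotFoundError(f"No PDF found at {src}")

    dest_dir = Path(articles_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Already inside articles/: nothing to copy.
    if src.parent == dest_dir:
        return src

    dest = dest_dir / src.name
    # Never overwrite a PDF that is already stored in articles/.
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; refusing to overwrite")

    shutil.copy2(src, dest)  # copy, not move: the source file stays in place
    return dest
```

Using a copy rather than a move keeps the file at its original location, and raising on a name collision is one conservative way to honor the never-overwrite rule.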

Step 2: Split the PDF into 4-Page Chunks

Create a subdirectory for the splits and run the splitting script:
python
import os

from PyPDF2 import PdfReader, PdfWriter


def split_pdf(input_path, output_dir, pages_per_chunk=4):
    """Split a PDF into chunks of pages_per_chunk pages. The original is only read, never modified."""
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_path)
    total = len(reader.pages)
    # File name without directory or extension, e.g. "smith_2024"
    prefix = os.path.splitext(os.path.basename(input_path))[0]

    for start in range(0, total, pages_per_chunk):
        end = min(start + pages_per_chunk, total)  # the last chunk may be shorter
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])

        # Page numbers in file names are 1-based and inclusive
        out_name = f"{prefix}_pp{start+1}-{end}.pdf"
        out_path = os.path.join(output_dir, out_name)
        with open(out_path, "wb") as f:
            writer.write(f)

    chunks = -(-total // pages_per_chunk)  # ceiling division
    print(f"Split {total} pages into {chunks} chunks in {output_dir}")


# Example call, using the paths from the directory convention
split_pdf("articles/smith_2024.pdf", "articles/split_smith_2024")
Directory convention:
articles/
├── smith_2024.pdf                    # original — NEVER DELETE
└── split_smith_2024/                 # split subdirectory
    ├── smith_2024_pp1-4.pdf
    ├── smith_2024_pp5-8.pdf
    ├── smith_2024_pp9-12.pdf
    └── notes.md                      # structured notes
If PyPDF2 is not installed:
pip install PyPDF2
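To preview which files a split will produce before running it, the naming rule used by split_pdf can be reproduced without touching any PDF. The helper name chunk_names is hypothetical; the sketch only mirrors the 1-based, inclusive page ranges shown in the directory convention.

```python
def chunk_names(prefix, total_pages, pages_per_chunk=4):
    """Return the split file names for a PDF with total_pages pages."""
    names = []
    for start in range(0, total_pages, pages_per_chunk):
        end = min(start + pages_per_chunk, total_pages)
        names.append(f"{prefix}_pp{start + 1}-{end}.pdf")
    return names


print(chunk_names("smith_2024", 10))
# ['smith_2024_pp1-4.pdf', 'smith_2024_pp5-8.pdf', 'smith_2024_pp9-10.pdf']
```

Note that the last chunk covers only the remaining pages, so a 10-page paper ends with a 2-page file.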

Step 3: Read in Batches of 3 Splits

Read 3 split files at a time (~12 pages); the final batch may contain fewer. For each batch:
  1. Read the 3 split PDFs using Cowork's Read tool
  2. Update the running notes file (notes.md in the split subdirectory)
  3. Report to the user: "I have finished reading splits [X-Y] and updated the notes. I have [N] more splits remaining. Would you like me to continue?"
  4. Wait for user confirmation before reading the next batch
Do NOT read ahead. Do NOT read all splits at once.
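Grouping the split files into batches can be sketched as follows. The helper names first_page and batches are hypothetical, and the sketch assumes the file names follow the <name>_pp<start>-<end>.pdf pattern from Step 2. Sorting by the numeric start page matters because a plain alphabetical sort would place pp13-16 before pp5-8.

```python
import re
from pathlib import Path


def first_page(path):
    """Extract the starting page number from a name like smith_2024_pp5-8.pdf."""
    match = re.search(r"_pp(\d+)-(\d+)\.pdf$", Path(path).name)
    if match is None:
        raise ValueError(f"Not a split file name: {path}")
    return int(match.group(1))


def batches(split_dir, batch_size=3):
    """Group split PDFs into reading batches, ordered by page number."""
    splits = sorted(Path(split_dir).glob("*_pp*-*.pdf"), key=first_page)
    return [splits[i:i + batch_size] for i in range(0, len(splits), batch_size)]
```

Each inner list is one batch to read before pausing for user confirmation; notes.md is ignored because it does not match the glob pattern.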

Step 4: Structured Extraction

As you read, collect information along 8 dimensions and write them into notes.md:
  1. Research Question — What is the paper asking? Why does it matter?
  2. Audience — Which research community cares about this work?
  3. Method — How do they answer the question? Identification strategy?
  4. Data Sources — What data? Where from? Unit of observation? Sample size? Time period?
  5. Statistical Methods — What econometric or statistical techniques? Key specifications?
  6. Findings — Main results? Key coefficient estimates and standard errors?
  7. Contributions — What is new? What did we learn?
  8. Replication Feasibility — Is data public? Replication archive? Data appendix? URLs?
These 8 dimensions extract what a researcher needs to build on or replicate the work.
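One way to set up the notes file is to write a skeleton with one header per dimension before the first batch. This is a sketch under assumptions: the function name init_notes, the Markdown header format, and the title line are illustrative choices, not prescribed by the skill.

```python
from pathlib import Path

# The 8 dimensions listed above, in order.
DIMENSIONS = [
    "Research Question",
    "Audience",
    "Method",
    "Data Sources",
    "Statistical Methods",
    "Findings",
    "Contributions",
    "Replication Feasibility",
]


def init_notes(split_dir, paper_name):
    """Create notes.md with one header per dimension; keep an existing file."""
    notes = Path(split_dir) / "notes.md"
    if notes.exists():
        return notes  # never reset notes from earlier batches
    lines = [f"# Notes: {paper_name}", ""]
    for number, name in enumerate(DIMENSIONS, start=1):
        lines += [f"## {number}. {name}", ""]
    notes.write_text("\n".join(lines), encoding="utf-8")
    return notes
```

Returning early when the file exists means the helper can be called at the start of every batch without losing earlier notes.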

Step 5: The Notes File

The output is notes.md in the split subdirectory: articles/split_smith_2024/notes.md
This file is updated incrementally after each batch. Structure it with headers for each of the 8 dimensions. After each batch, update whichever dimensions have new information; do not rewrite from scratch.
By the final batch, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. The result should be a structured extraction, not a summary.
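An incremental update can be sketched as appending a bullet under the matching dimension header while leaving every other line untouched. The function name add_note and the assumption that each dimension has a "## ..." header ending in the dimension name are illustrative, not part of the skill.

```python
from pathlib import Path


def add_note(notes_path, dimension, note):
    """Append a bullet under the header for `dimension`, leaving the rest as is."""
    path = Path(notes_path)
    lines = path.read_text(encoding="utf-8").splitlines()

    # Find the header line for this dimension (assumed "## ..." headers).
    start = next(
        (i for i, line in enumerate(lines)
         if line.startswith("## ") and line.rstrip().endswith(dimension)),
        None,
    )
    if start is None:
        raise ValueError(f"No header found for dimension: {dimension}")

    # The section ends at the next header or at the end of the file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Skip trailing blank lines so the bullet sits directly under earlier notes.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1

    lines.insert(end, f"- {note}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Because the function only inserts one line, notes from earlier batches stay exactly as written.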

When NOT to Split

  • Papers shorter than ~15 pages: Read directly using Cowork's Read tool
  • Policy briefs or non-technical documents: A rough summary is acceptable
  • Triage only: Read just the first split (pages 1-4, abstract + introduction)
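The three cases above can be expressed as a small decision helper. This is only a sketch: the function name reading_plan, its parameters, and the order in which the cases are checked are assumptions, since the list does not say which case wins when more than one applies.

```python
def reading_plan(page_count, technical=True, triage=False, direct_read_limit=15):
    """Return which reading mode the guidance above suggests."""
    if triage:
        return "first split only"  # pages 1-4: abstract and introduction
    if not technical:
        return "rough summary"  # policy briefs, non-technical documents
    if page_count < direct_read_limit:
        return "read directly"  # short papers skip the split step
    return "split and read in batches"
```

For example, reading_plan(40) selects the full split-and-batch workflow, while reading_plan(9) selects a direct read.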

Quick Reference

Step      Action
Acquire   Download to articles/ or use existing file
Split     4-page chunks into articles/split_<name>/
Read      3 splits at a time; pause after each batch
Write     Update notes.md with 8 dimensions
Confirm   Ask user before continuing to next batch

Key Differences from Original

  • Cowork compatible: No .claude/commands, no slash commands. Works with Cowork's file system and tools.
  • Dual workflow: Explicitly supports both teaching (lecture prep) and research (project work).
  • PyPDF2-based splitting: Uses the PyPDF2 library (PdfReader and PdfWriter) to write the chunks.
  • Preserved originals: Split files are saved to articles/split_<name>/; originals are never deleted.
  • Structured 8-dimension extraction: Methodical note-taking across research dimensions.