# Split-PDF: Download, Split, and Deep-Read Academic Papers
CRITICAL RULE: Never read a full PDF. Only read the 4-page split files, and only 3 splits at a time (~12 pages). Reading a full PDF will either crash the session with a "context limit exceeded" error or produce shallow, hallucinated output.
## When This Skill Is Invoked
You want to read, review, or analyze an academic paper in one of two workflows:
- Teaching workflow: Reading papers to prepare lectures, understand context, extract key findings
- Research workflow: Reading papers for a research project, building background, identifying methodology
The input is either:
- A file path to a local PDF (e.g., `/path/to/articles/smith_2024.pdf`)
- A search query or paper title (e.g., "Gentzkow Shapiro Sinkinson 2014 competition newspapers")
Important: You cannot search for a paper you don't know exists. The user MUST provide either a file path or a specific search query — an author name, a title, keywords, a year, or some combination that identifies the paper. If invoked without specifying what paper to read, ask the user.
## Step 1: Acquire the PDF
If a local file path is provided:
- Verify the file exists
- If the file is NOT already inside `articles/`, copy it there (preserve the original location)
- Proceed to Step 2
If a search query or paper title is provided:
- Use WebSearch to find the paper
- Download the PDF (request user permission if required)
- Save it to `articles/` in the project directory (create the directory if needed)
- Proceed to Step 2
CRITICAL: Always preserve the original PDF. The PDF in `articles/` must NEVER be deleted, moved, or overwritten. Split files are derivatives; the original is permanent.
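The acquire-and-preserve logic can be sketched as follows. This is a minimal sketch, not part of the skill itself; `ensure_in_articles` is a hypothetical helper name, and the `articles/` default follows the directory convention shown below.

```python
import os
import shutil

def ensure_in_articles(pdf_path, articles_dir="articles"):
    # Hypothetical helper: make sure the PDF lives in articles/
    # without disturbing the original copy.
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)
    os.makedirs(articles_dir, exist_ok=True)
    dest = os.path.join(articles_dir, os.path.basename(pdf_path))
    if os.path.abspath(pdf_path) != os.path.abspath(dest):
        shutil.copy2(pdf_path, dest)  # copy, never move
    return dest
```

`shutil.copy2` copies rather than moves, so the file at the original location is untouched.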
## Step 2: Split the PDF into 4-Page Chunks
Create a subdirectory for the splits and run the splitting script:
```python
from PyPDF2 import PdfReader, PdfWriter
import os

def split_pdf(input_path, output_dir, pages_per_chunk=4):
    """Split PDF into 4-page chunks. Preserves original."""
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_path)
    total = len(reader.pages)
    prefix = os.path.splitext(os.path.basename(input_path))[0]
    for start in range(0, total, pages_per_chunk):
        end = min(start + pages_per_chunk, total)
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        out_name = f"{prefix}_pp{start+1}-{end}.pdf"
        out_path = os.path.join(output_dir, out_name)
        with open(out_path, "wb") as f:
            writer.write(f)
    print(f"Split {total} pages into {-(-total // pages_per_chunk)} chunks in {output_dir}")
```
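The naming and chunk-count conventions above can be sanity-checked in isolation. The helper names here are illustrative, not part of the skill:

```python
import os

def split_dir_for(pdf_path):
    # articles/smith_2024.pdf -> articles/split_smith_2024
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(os.path.dirname(pdf_path), f"split_{base}")

def num_chunks(total_pages, pages_per_chunk=4):
    # Ceiling division, matching the count printed by split_pdf
    return -(-total_pages // pages_per_chunk)

print(split_dir_for("articles/smith_2024.pdf"))  # articles/split_smith_2024
print(num_chunks(10))  # 3 chunks: pp1-4, pp5-8, pp9-10
```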
Directory convention:

```
articles/
├── smith_2024.pdf             # original — NEVER DELETE
└── split_smith_2024/          # split subdirectory
    ├── smith_2024_pp1-4.pdf
    ├── smith_2024_pp5-8.pdf
    ├── smith_2024_pp9-12.pdf
    └── notes.md               # structured notes
```
If PyPDF2 is not installed, install it first: `pip install PyPDF2`
## Step 3: Read in Batches of 3 Splits
Read exactly 3 split files at a time (~12 pages). For each batch:
- Read the 3 split PDFs using Cowork's Read tool
- Update the running notes file (`notes.md` in the split subdirectory)
- Report to the user:
"I have finished reading splits [X-Y] and updated the notes. I have [N] more splits remaining. Would you like me to continue?"
- Wait for user confirmation before reading the next batch
Do NOT read ahead. Do NOT read all splits at once.
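The batching rule reduces to a simple grouping. This is a sketch only; the actual reading happens through Cowork's Read tool, and the filenames are illustrative:

```python
def batch_splits(split_files, batch_size=3):
    # Group the ordered split files into batches of 3 (~12 pages each);
    # pause for user confirmation between batches.
    return [split_files[i:i + batch_size]
            for i in range(0, len(split_files), batch_size)]

splits = [f"smith_2024_pp{4*i+1}-{4*i+4}.pdf" for i in range(5)]
for n, batch in enumerate(batch_splits(splits), start=1):
    remaining = max(len(splits) - n * 3, 0)
    # Report format after each batch:
    print(f"Finished batch {n}; {remaining} splits remaining.")
```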
## Step 4: Structured Extraction
As you read, collect information along 8 dimensions and write them into `notes.md`:
- Research Question — What is the paper asking? Why does it matter?
- Audience — Which research community cares about this work?
- Method — How do they answer the question? Identification strategy?
- Data Sources — What data? Where from? Unit of observation? Sample size? Time period?
- Statistical Methods — What econometric or statistical techniques? Key specifications?
- Findings — Main results? Key coefficient estimates and standard errors?
- Contributions — What is new? What did we learn?
- Replication Feasibility — Is data public? Replication archive? Data appendix? URLs?
These 8 dimensions extract what a researcher needs to build on or replicate the work.
## Step 5: The Notes File
The output is `notes.md` in the split subdirectory:
articles/split_smith_2024/notes.md
This file is updated incrementally after each batch. Structure it with headers for each of the 8 dimensions. After each batch, update whichever dimensions have new information — do not rewrite from scratch.
By the final batch, the notes should contain specific data sources, variable names, equation references, sample sizes, coefficient estimates, and standard errors. A structured extraction, not a summary.
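One way to initialize the notes file so each batch can append under the right header. This is a sketch under the conventions above; `init_notes` is a hypothetical helper name, and the dimension list mirrors Step 4:

```python
import os

DIMENSIONS = [
    "Research Question", "Audience", "Method", "Data Sources",
    "Statistical Methods", "Findings", "Contributions",
    "Replication Feasibility",
]

def init_notes(notes_path):
    # Create notes.md with one header per extraction dimension;
    # later batches update sections in place rather than rewriting.
    if os.path.exists(notes_path):
        return
    with open(notes_path, "w") as f:
        f.write("# Notes\n\n")
        for dim in DIMENSIONS:
            f.write(f"## {dim}\n\n")
```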
## When NOT to Split
- Papers shorter than ~15 pages: Read directly using Cowork's Read tool
- Policy briefs or non-technical documents: A rough summary is acceptable
- Triage only: Read just the first split (pages 1-4, abstract + introduction)
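These rules collapse into a small decision function. Illustrative only; the ~15-page threshold is the one stated above:

```python
def reading_plan(total_pages, triage_only=False, split_threshold=15):
    # Decide how to approach the paper, per the rules above.
    if triage_only:
        return "read first split only (pp1-4)"
    if total_pages < split_threshold:
        return "read directly"
    return "split and batch-read"
```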
## Quick Reference
| Step | Action |
|---|---|
| Acquire | Download to `articles/` or use existing file |
| Split | 4-page chunks into `split_<name>/` |
| Read | 3 splits at a time; pause after each batch |
| Write | Update `notes.md` with 8 dimensions |
| Confirm | Ask user before continuing to next batch |
## Key Differences from Original
- Cowork compatible: No , no slash commands. Works with Cowork's file system and tools.
- Dual workflow: Explicitly supports both teaching (lecture prep) and research (project work).
- PyPDF2-based splitting: Uses industry-standard PDF library.
- Preserved originals: Split files saved to a `split_<name>/` subdirectory; originals never deleted.
- Structured 8-dimension extraction: Methodical note-taking across research dimensions.