pdf-extractor

PDF Extractor

Extract text, tables, and images from PDF files using pdfplumber - turn static PDFs into usable data.

When to Use This Skill

  • Report processing - Extract data from PDF reports
  • Table extraction - Convert PDF tables to CSV
  • Image collection - Pull images from presentations
  • Text mining - Bulk convert PDFs to searchable text
  • Research - Process academic papers and whitepapers

What Claude Does vs What You Decide

| Claude Does | You Decide |
|---|---|
| Structures analysis frameworks | Metric definitions |
| Identifies patterns in data | Business interpretation |
| Creates visualization templates | Dashboard design |
| Suggests optimization areas | Action priorities |
| Calculates statistical measures | Decision thresholds |

Dependencies

bash
pip install pdfplumber pypdf click pandas

For image extraction:

pip install Pillow

Commands

Extract Text

bash
python scripts/main.py text document.pdf
python scripts/main.py text document.pdf --pages 1-5
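
For readers who want to see roughly what a `text` command could do under the hood, here is a minimal sketch built on pdfplumber's `open`, `pages`, and `extract_text`. The function name `extract_text`, the 1-based `page_numbers` argument, and the `opener` hook (there only so the function can be exercised without a real PDF) are assumptions for illustration; this is not the actual content of scripts/main.py.

```python
def extract_text(pdf_path, page_numbers=None, opener=None):
    """Return the text of the selected 1-based pages, joined by blank lines."""
    if opener is None:
        import pdfplumber  # imported lazily so this module loads without it
        opener = pdfplumber.open

    with opener(pdf_path) as pdf:
        total = len(pdf.pages)
        wanted = page_numbers or range(1, total + 1)
        chunks = []
        for number in wanted:
            if not 1 <= number <= total:
                continue  # skip pages the document does not have
            # extract_text() can yield None or "" for pages without a text layer
            chunks.append(pdf.pages[number - 1].extract_text() or "")
    return "\n\n".join(chunks)
```

Scanned, image-only pages come back empty with this approach, since no OCR is involved.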

Extract Tables

bash
python scripts/main.py tables report.pdf --output tables.csv
python scripts/main.py tables financial.pdf --page 3
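
As a rough illustration of table extraction, the sketch below uses pdfplumber's `Page.extract_tables()`, which returns each table as a list of rows whose cells may be `None`, and serializes the rows with the standard `csv` module. The helper names `clean_table`, `table_to_csv`, and `extract_tables` are hypothetical, and the real script may well use pandas instead.

```python
import csv
import io


def clean_table(table):
    """Replace None cells with "" and collapse line breaks inside cells."""
    return [
        ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        for row in table
    ]


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(clean_table(table))
    return buffer.getvalue()


def extract_tables(pdf_path, page_number=None):
    """Yield (page_number, table) pairs; page numbers are 1-based."""
    import pdfplumber  # lazy import: only needed when a real PDF is read

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            if page_number is not None and number != page_number:
                continue
            for table in page.extract_tables():
                yield number, table
```

Cleaning cells before writing matters because wrapped cell text in a PDF usually arrives with embedded newlines, which makes the resulting CSV awkward to open in a spreadsheet.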

Extract Images

bash
python scripts/main.py images presentation.pdf --output ./images/
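
One way to collect images with pdfplumber and Pillow is to crop each entry of `page.images` to its bounding box and render that region to a PNG. The sketch below takes that approach. It is an assumption about how such a command could work, not a description of the script: it re-renders the region rather than copying the embedded image stream, and the `page{N}_img{M}.png` naming is hypothetical.

```python
from pathlib import Path


def clamp_bbox(image, page_bbox):
    """Return (x0, top, x1, bottom) for an image dict, clipped to the page."""
    px0, ptop, px1, pbottom = page_bbox
    return (
        max(px0, image["x0"]),
        max(ptop, image["top"]),
        min(px1, image["x1"]),
        min(pbottom, image["bottom"]),
    )


def save_page_images(pdf_path, output_dir, resolution=150):
    """Render every image region to a PNG and return the written paths."""
    import pdfplumber  # lazy import; rendering also needs Pillow installed

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for index, image in enumerate(page.images, start=1):
                bbox = clamp_bbox(image, page.bbox)
                if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
                    continue  # nothing left after clipping
                target = out / f"page{page_no}_img{index}.png"
                page.crop(bbox).to_image(resolution=resolution).save(str(target))
                written.append(target)
    return written
```

Clipping first avoids the error pdfplumber raises when a crop box extends past the page, which happens with images that bleed off the edge of a slide.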

Merge PDFs

bash
python scripts/main.py merge doc1.pdf doc2.pdf --output combined.pdf
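
Merging is the one command that would plausibly rely on pypdf rather than pdfplumber. A minimal sketch, assuming pypdf's `PdfWriter.append` is used to concatenate files in the order given; the function name `merge_pdfs` and the `writer_factory` hook (present only to make the function testable without pypdf) are hypothetical.

```python
from pathlib import Path


def merge_pdfs(inputs, output, writer_factory=None):
    """Append each input PDF in order and write the result to `output`."""
    if len(inputs) < 2:
        raise ValueError("merging needs at least two input files")
    if writer_factory is None:
        from pypdf import PdfWriter  # lazy import
        writer_factory = PdfWriter

    writer = writer_factory()
    for path in inputs:
        writer.append(str(path))  # copies all pages of `path`
    writer.write(str(output))
    writer.close()
    return Path(output)
```

For example, `merge_pdfs(["doc1.pdf", "doc2.pdf"], "combined.pdf")` would mirror the command above.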

PDF Info

bash
python scripts/main.py info document.pdf
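
The fields that `info` reports are not listed here. As a hedged sketch, the function below gathers what pdfplumber exposes directly: the page count, the `metadata` dictionary, and the size of the first page in PDF points. The name `pdf_info`, the returned keys, and the `opener` hook are assumptions for illustration.

```python
def pdf_info(pdf_path, opener=None):
    """Collect basic facts about a PDF into a plain dictionary."""
    if opener is None:
        import pdfplumber  # lazy import
        opener = pdfplumber.open

    with opener(pdf_path) as pdf:
        pages = pdf.pages
        first = pages[0] if pages else None
        return {
            "pages": len(pages),
            "metadata": dict(pdf.metadata or {}),
            # width and height are in PDF points (1/72 inch)
            "first_page_size": (first.width, first.height) if first else None,
        }
```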

Examples

Example 1: Extract Financial Tables

bash
python scripts/main.py tables annual-report.pdf --output financials.csv

Output: financials.csv with all tables found

Also creates individual CSVs: table_page3_1.csv, table_page5_1.csv

Example 2: Batch Convert to Text

bash
python scripts/main.py batch ./pdfs/ --output ./text/

Converts all PDFs in folder to .txt files
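
A batch run like this boils down to pairing each PDF in the folder with a `.txt` path and extracting text for each pair. The sketch below separates the planning step from the extraction step. The names `plan_batch` and `batch_convert` are hypothetical, and the choice to skip subfolders is an assumption, since the text does not say whether the command recurses.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Pair every PDF directly inside input_dir with a .txt path in output_dir."""
    src, dst = Path(input_dir), Path(output_dir)
    pdfs = sorted(
        p for p in src.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(pdf, dst / (pdf.stem + ".txt")) for pdf in pdfs]


def batch_convert(input_dir, output_dir):
    """Write one UTF-8 text file per PDF and return the written paths."""
    import pdfplumber  # lazy import

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path, txt_path in plan_batch(input_dir, output_dir):
        with pdfplumber.open(pdf_path) as pdf:
            text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```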

Example 3: Extract Specific Pages

bash
python scripts/main.py text whitepaper.pdf --pages 1,5-10,15

Extracts only pages 1, 5-10, and 15
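
A page specification such as `1,5-10,15` mixes single pages with inclusive ranges. Below is a small sketch of how a spec like that could be expanded into a list of 1-based page numbers. The function name `parse_page_spec` and its error handling are assumptions, not the script's actual parser.

```python
def parse_page_spec(spec):
    """Expand "1,5-10,15" into a sorted list of unique 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_page_spec("1,5-10,15"))  # [1, 5, 6, 7, 8, 9, 10, 15]
```

Remember to subtract 1 when indexing into `pdf.pages`, since pdfplumber's page list is zero-based.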

Skill Boundaries

What This Skill Does Well

  • Structuring data analysis
  • Identifying patterns and trends
  • Creating visualization frameworks
  • Calculating statistical measures

What This Skill Cannot Do

  • Access your actual data
  • Replace statistical expertise
  • Make business decisions
  • Guarantee prediction accuracy

Related Skills

  • web-scraper - Scrape web content
  • content-repurposer - Repurpose extracted content

Skill Metadata

  • Mode: centaur
yaml
category: automation
subcategory: document-processing
dependencies: [pdfplumber, pypdf, pandas]
difficulty: beginner
time_saved: 4+ hours/week