pdf-extractor
PDF Extractor
Extract text, tables, and images from PDF files using pdfplumber, turning static PDFs into usable data.
When to Use This Skill
- Report processing - Extract data from PDF reports
- Table extraction - Convert PDF tables to CSV
- Image collection - Pull images from presentations
- Text mining - Bulk convert PDFs to searchable text
- Research - Process academic papers and whitepapers
What Claude Does vs What You Decide
| Claude Does | You Decide |
|---|---|
| Structures analysis frameworks | Metric definitions |
| Identifies patterns in data | Business interpretation |
| Creates visualization templates | Dashboard design |
| Suggests optimization areas | Action priorities |
| Calculates statistical measures | Decision thresholds |
Dependencies
```bash
pip install pdfplumber pypdf click pandas

# For image extraction:
pip install Pillow
```
Commands
Extract Text
```bash
python scripts/main.py text document.pdf
python scripts/main.py text document.pdf --pages 1-5
```
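The source does not show how `scripts/main.py` implements the `text` command. As a rough sketch only, text extraction with pdfplumber could look like the following. The function names `join_page_text` and `extract_text` are hypothetical and are not taken from the script.

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects that expose extract_text()."""
    # extract_text() can return None or an empty string for pages with no text layer.
    chunks = [(page.extract_text() or "") for page in pages]
    return "\n\n".join(chunks)


def extract_text(pdf_path, page_numbers=None):
    """Return text for the given 1-based page numbers, or for every page when None."""
    import pdfplumber  # imported lazily so join_page_text works without it

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages
        if page_numbers is not None:
            # Hypothetical convention: callers pass 1-based page numbers.
            pages = [pages[n - 1] for n in page_numbers]
        return join_page_text(pages)
```

Scanned PDFs without a text layer would yield empty strings here, since pdfplumber reads embedded text and does not perform OCR.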
Extract Tables
```bash
python scripts/main.py tables report.pdf --output tables.csv
python scripts/main.py tables financial.pdf --page 3
```
Extract Images
```bash
python scripts/main.py images presentation.pdf --output ./images/
```
Merge PDFs
```bash
python scripts/main.py merge doc1.pdf doc2.pdf --output combined.pdf
```
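Merging is presumably where the `pypdf` dependency comes in, although the source does not confirm this. A minimal sketch under that assumption, using `PdfWriter.append` from pypdf 3.0 or later. The function name `merge_pdfs` and the up-front existence check are illustrative choices, not the script's documented behavior.

```python
from pathlib import Path


def merge_pdfs(inputs, output):
    """Append each input PDF, in order, into a single output file."""
    paths = [Path(p) for p in inputs]
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        # Fail before writing anything so a typo cannot produce a partial merge.
        raise FileNotFoundError(f"missing input PDFs: {missing}")

    from pypdf import PdfWriter  # lazy import: only needed once inputs are validated

    writer = PdfWriter()
    for path in paths:
        writer.append(str(path))  # appends every page of this file
    with open(output, "wb") as handle:
        writer.write(handle)
    return Path(output)
```

Validating the inputs first is a design choice: it keeps a failed run from leaving a half-written `combined.pdf` behind.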
PDF Info
```bash
python scripts/main.py info document.pdf
```
Examples
Example 1: Extract Financial Tables
```bash
python scripts/main.py tables annual-report.pdf --output financials.csv
# Output: financials.csv with all tables found
# Also creates individual CSVs: table_page3_1.csv, table_page5_1.csv
```
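The per-table file names above suggest a `table_page<page>_<index>.csv` pattern. That pattern is inferred from the two example names, not documented. Under that assumption, the extraction step might be sketched as follows, with hypothetical helpers `csv_name`, `table_to_csv_text`, and `extract_tables`.

```python
import csv
import io


def csv_name(page_number, table_index):
    """Build a per-table file name such as table_page3_1.csv (assumed pattern)."""
    return f"table_page{page_number}_{table_index}.csv"


def table_to_csv_text(rows):
    """Serialize one table (a list of rows of cell values) as CSV text."""
    buffer = io.StringIO()
    # csv.writer renders None cells, which pdfplumber uses for empty cells, as empty fields.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def extract_tables(pdf_path):
    """Yield (file_name, csv_text) for every table pdfplumber finds in the PDF."""
    import pdfplumber  # lazy import keeps the helpers above usable without it

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is a list of rows.
            for index, rows in enumerate(page.extract_tables(), start=1):
                yield csv_name(page.page_number, index), table_to_csv_text(rows)
```

A caller could write each yielded pair to disk and also concatenate the rows into the combined `financials.csv`.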
Example 2: Batch Convert to Text
```bash
python scripts/main.py batch ./pdfs/ --output ./text/
# Converts all PDFs in the folder to .txt files
```
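How the `batch` command maps input files to output files is not spelled out in the source. One plausible reading is that each `name.pdf` becomes `name.txt` in the output folder. The sketch below assumes that mapping and a non-recursive scan. `txt_target` and `batch_convert` are hypothetical names.

```python
from pathlib import Path


def txt_target(pdf_path, out_dir):
    """Map a PDF path to its .txt counterpart inside out_dir (assumed naming)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def batch_convert(src_dir, out_dir):
    """Convert every *.pdf directly inside src_dir to a text file in out_dir."""
    import pdfplumber  # lazy import so txt_target can be used on its own

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            # Pages without a text layer contribute an empty string.
            text = "\n\n".join((page.extract_text() or "") for page in pdf.pages)
        target = txt_target(pdf_path, out_dir)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```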
Example 3: Extract Specific Pages
```bash
python scripts/main.py text whitepaper.pdf --pages 1,5-10,15
# Extracts only pages 1, 5-10, and 15
```
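A page specification such as `1,5-10,15` has to be expanded into individual page numbers before extraction. The script's actual parser is not shown, so the following is only one way it could work. `parse_pages` is a hypothetical name, and the treatment of ranges as inclusive and 1-based is an assumption based on the example above.

```python
def parse_pages(spec):
    """Expand a spec like "1,5-10,15" into a sorted list of unique 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_pages("1,5-10,15"))  # [1, 5, 6, 7, 8, 9, 10, 15]
```

Returning a sorted, de-duplicated list means overlapping entries such as `3,2-4` extract each page only once.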
Skill Boundaries
What This Skill Does Well
- Structuring data analysis
- Identifying patterns and trends
- Creating visualization frameworks
- Calculating statistical measures
What This Skill Cannot Do
- Access your actual data
- Replace statistical expertise
- Make business decisions
- Guarantee prediction accuracy
Related Skills
- web-scraper - Scrape web content
- content-repurposer - Repurpose extracted content
Skill Metadata
- Mode: centaur
```yaml
category: automation
subcategory: document-processing
dependencies: [pdfplumber, pypdf, pandas]
difficulty: beginner
time_saved: 4+ hours/week
```