read-bin-docs
Doc Formats
Quick Start: Extract Text from PDF
Need to extract text from a PDF? Use this Python snippet:
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
# Join pages with newlines so text from adjacent pages does not run together
text = "\n".join(page.extract_text() for page in reader.pages)
print(text)
```
Or from the command line:
```bash
uvx --with pypdf python /path/to/extract_pdf_text.py document.pdf
```
PDF Text Extraction
Basic Usage
```python
from pypdf import PdfReader

# Read all pages
reader = PdfReader("file.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
```
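When printing every page in a loop, it can help to label each page's text so the output can be traced back to the PDF. The following is a minimal sketch, assuming a hypothetical helper `format_pages` (not part of pypdf or of this skill) that takes page strings that have already been extracted:

```python
def format_pages(page_texts):
    """Join per-page text, prefixing each page with a 1-based page header.

    Hypothetical helper for illustration; it is not part of pypdf.
    """
    chunks = []
    for number, text in enumerate(page_texts, start=1):
        # Treat a missing result as empty text so the join never fails
        chunks.append(f"--- Page {number} ---\n{text or ''}")
    return "\n\n".join(chunks)


if __name__ == "__main__":
    # Plain strings stand in for page.extract_text() results
    print(format_pages(["First page text", "", "Third page text"]))
```

With pypdf, the input could be built as `[page.extract_text() for page in reader.pages]`.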
Extract Specific Pages
```python
from pypdf import PdfReader

reader = PdfReader("file.pdf")

# Get pages 1-5 (slice indices are 0-based, so this is pages[0:5])
for page in reader.pages[0:5]:
    print(page.extract_text())
```
Using the Script
This skill includes `scripts/extract_pdf_text.py` for command-line extraction:
```bash
# Extract all pages to stdout
python extract_pdf_text.py document.pdf

# Extract to file
python extract_pdf_text.py document.pdf --output text.txt

# Extract specific pages
python extract_pdf_text.py document.pdf --pages 1-5
python extract_pdf_text.py document.pdf --pages 1,3,5
```
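The `--pages` values above read as 1-based page numbers, while `reader.pages` is 0-indexed. How `extract_pdf_text.py` parses them is not shown here, so the following is only a sketch of one way such a spec could be turned into indices. The name `parse_page_spec` and its error handling are assumptions, not the script's actual code:

```python
def parse_page_spec(spec, page_count):
    """Convert a 1-based spec such as "1-5" or "1,3,5" to 0-based indices.

    Hypothetical helper; the real script's parsing may differ.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            first = last = int(part)
        if first < 1 or last < first or last > page_count:
            raise ValueError(f"invalid page range {part!r} for {page_count} pages")
        # Shift from 1-based page numbers to 0-based indices
        indices.extend(range(first - 1, last))
    return indices


if __name__ == "__main__":
    print(parse_page_spec("1-5", page_count=10))    # [0, 1, 2, 3, 4]
    print(parse_page_spec("1,3,5", page_count=10))  # [0, 2, 4]
```

The result can then select pages, for example `[reader.pages[i] for i in parse_page_spec("1,3,5", len(reader.pages))]`.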
Requirements
- pypdf: `uvx --with pypdf python <script>`
- Works with most text-based PDFs
- Scanned PDFs without an OCR text layer produce no text
Common Issues
"No text extracted": The PDF may be scanned (image-based) without OCR. OCR support requires additional tools.
"Encoding errors": pypdf handles most encodings, but some PDFs may have encoding issues. Use `page.extract_text(extraction_mode="layout")` for layout-aware extraction if your pypdf version supports it.
Future: Support for DOCX, XLSX, and other formats is planned.
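For the "No text extracted" case, a quick first check is to see which pages come back empty. The following is a rough sketch, assuming that a page with no extractable characters is probably image-only. `find_empty_pages` and `check_pdf` are hypothetical helpers, and the heuristic will also flag pages that are genuinely blank:

```python
def find_empty_pages(page_texts):
    """Return 1-based numbers of pages whose extracted text is blank.

    Hypothetical helper: blank text is only a hint that a page is scanned.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def check_pdf(path):
    """Report blank pages in a PDF; requires pypdf to be installed."""
    # Imported lazily so find_empty_pages has no third-party dependency
    from pypdf import PdfReader

    reader = PdfReader(path)
    return find_empty_pages(page.extract_text() for page in reader.pages)


if __name__ == "__main__":
    # Plain strings stand in for page.extract_text() results
    print(find_empty_pages(["Some text", "   \n", ""]))  # [2, 3]
```

If every page is reported, the file most likely needs OCR before any text can be extracted.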