read-bin-docs

Doc Formats

Quick Start: Extract Text from PDF

Need to extract text from a PDF? Use this Python snippet:

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
# Join pages with newlines so the end of one page does not run into the next.
text = "\n".join(page.extract_text() for page in reader.pages)
print(text)
```

Or from the command line:

```bash
uvx --with pypdf python /path/to/extract_pdf_text.py document.pdf
```

PDF Text Extraction

Basic Usage

```python
from pypdf import PdfReader

# Read all pages
reader = PdfReader("file.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
```

Extract Specific Pages

```python
from pypdf import PdfReader

reader = PdfReader("file.pdf")

# Get pages 1-5 (the slice is 0-indexed, so this is indices 0 through 4)
for page in reader.pages[0:5]:
    print(page.extract_text())
```
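Slices only cover contiguous runs. For scattered pages, a small helper can map human-style page numbers onto zero-based indices. The sketch below is an illustration only: `pick_pages` is a hypothetical name, not part of pypdf, and it assumes it is given any indexable sequence such as `reader.pages`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_pages(pages: Sequence[T], numbers: List[int]) -> List[T]:
    """Return the pages at the given 1-based page numbers, in the order requested.

    Hypothetical helper: `pages` can be `reader.pages` or any other sequence.
    """
    picked = []
    for number in numbers:
        if not 1 <= number <= len(pages):
            raise IndexError(f"page {number} is outside 1-{len(pages)}")
        picked.append(pages[number - 1])  # page 1 lives at index 0
    return picked


# Example use with pypdf (assumes file.pdf has at least five pages):
#   reader = PdfReader("file.pdf")
#   for page in pick_pages(reader.pages, [1, 3, 5]):
#       print(page.extract_text())
```

Checking bounds up front gives a clear error for a bad page number instead of a bare index failure, or a silent wrap-around if a number were ever zero or negative.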

Using the Script

This skill includes `scripts/extract_pdf_text.py` for command-line extraction:

```bash
# Extract all pages to stdout
python extract_pdf_text.py document.pdf

# Extract to file
python extract_pdf_text.py document.pdf --output text.txt

# Extract specific pages
python extract_pdf_text.py document.pdf --pages 1-5
python extract_pdf_text.py document.pdf --pages 1,3,5
```
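The script's source is not shown here, so how it interprets `--pages` is not known. As a rough sketch of one way a spec such as `1-5` or `1,3,5` could be turned into zero-based page indices, the hypothetical `parse_page_spec` below assumes page numbers are 1-based and ranges are inclusive. Both are assumptions, not confirmed behavior of `extract_pdf_text.py`.

```python
from typing import List


def parse_page_spec(spec: str, page_count: int) -> List[int]:
    """Turn a spec such as "1-5" or "1,3,5" into sorted zero-based page indices.

    Hypothetical helper; assumes 1-based, inclusive page numbers.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        # Convert the inclusive 1-based range to zero-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)


print(parse_page_spec("1-5", page_count=10))    # [0, 1, 2, 3, 4]
print(parse_page_spec("1,3,5", page_count=10))  # [0, 2, 4]
```

Collecting indices in a set means overlapping parts such as `2-3,3` yield each page once.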

Requirements

  • pypdf, which uvx can supply on the fly: `uvx --with pypdf python <script>`
  • Works with most text-based PDFs
  • Scanned PDFs that have no OCR text layer yield no text

Common Issues

"No text extracted": The PDF is probably scanned (image-based) and has no OCR text layer. pypdf does not perform OCR, so such files need a separate OCR tool first.
"Encoding errors": pypdf handles most encodings, but some PDFs still produce garbled text. If your pypdf version supports it, try `page.extract_text(extraction_mode="layout")` for layout-aware extraction.
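
The "No text extracted" case can be caught early instead of silently returning an empty string. The sketch below is a heuristic only: `looks_scanned`, `extract_or_warn`, and the 20-characters-per-page threshold are illustrative assumptions, not part of pypdf or of this skill's script.

```python
from typing import List


def looks_scanned(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF lacks a text layer, judging by its extracted text.

    Hypothetical heuristic: the threshold is an arbitrary assumption.
    """
    if not page_texts:
        return True
    # Count only non-whitespace characters, since empty pages often yield "\n".
    total = sum(len("".join(text.split())) for text in page_texts)
    return total < min_chars_per_page * len(page_texts)


def extract_or_warn(path: str) -> str:
    """Extract all text from a PDF, raising if it appears to be image-only."""
    # Imported here so looks_scanned can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    if looks_scanned(texts):
        raise ValueError(f"{path}: little or no text found; the PDF may need OCR first")
    return "\n".join(texts)
```

A document made up mostly of figures could trip this check even though it has a text layer, so the threshold is worth tuning for the files at hand.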

Future: Support for DOCX, XLSX, and other formats coming soon.