financial-document-processor

Financial Document Processor

Overview

This skill provides guidance for extracting structured data from financial documents (invoices, receipts, statements, etc.) using OCR and PDF text extraction. It emphasizes data safety practices that prevent catastrophic failures from destructive operations.

Critical Data Safety Principles

NEVER perform destructive operations on source data without verification or backup.
Before any file processing:
  1. Create a backup of all source files before processing
  2. Work on copies, not originals
  3. Verify outputs match expectations before any cleanup
  4. Use atomic operations (copy → verify → delete) instead of direct moves

Safe File Operation Pattern

bash
# CORRECT: Copy first, verify, then clean up
cp -r /source/documents/ /backup/documents/
# ... process files ...
# ... verify outputs match expectations ...
# Only after verification: rm -r /backup/documents/

# WRONG: Delete before moving (data loss risk)
rm -f /source/*.pdf && mv /source/*.pdf /dest/  # Files deleted before move!
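The copy, verify, delete sequence can also be sketched in Python. This is a minimal illustration rather than a definitive implementation: the helper names `sha256_of` and `safe_move` are hypothetical, and comparing SHA-256 digests is just one assumed way to verify a copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_move(src: Path, dest_dir: Path) -> Path:
    """Copy src into dest_dir, verify the copy, and only then delete src."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Never overwrite an existing file silently.
        raise FileExistsError(f"Refusing to overwrite {dest}")
    shutil.copy2(src, dest)
    if sha256_of(src) != sha256_of(dest):
        # The copy is bad: remove it and leave the source untouched.
        dest.unlink()
        raise OSError(f"Copy of {src} failed verification; source left in place")
    src.unlink()
    return dest
```

The source file is deleted only on the last step, so a failure at any earlier point leaves the original in place.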

Workflow

Step 1: Assess the Environment and Requirements

Before writing any processing code:
  1. List all source files and note their exact paths
  2. Identify file types (PDF text-based, PDF scanned, JPG, PNG, etc.)
  3. Check available tools: which tesseract, which pdftotext, python3 -c "import pypdf"
  4. Understand the expected output format (CSV columns, required fields, etc.)
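The checks above can be scripted. The sketch below assumes the tool and module names from the list; `check_tools` and `inventory` are hypothetical helper names. Note that a file extension alone cannot distinguish a text-based PDF from a scanned one.

```python
import importlib.util
import shutil
from collections import Counter
from pathlib import Path


def check_tools(commands=("tesseract", "pdftotext"), modules=("pypdf",)):
    """Report which command-line tools and Python modules are available."""
    status = {name: shutil.which(name) is not None for name in commands}
    status.update(
        {name: importlib.util.find_spec(name) is not None for name in modules}
    )
    return status


def inventory(source_dir):
    """List every file under source_dir and count files by lowercase extension."""
    files = sorted(p for p in Path(source_dir).rglob("*") if p.is_file())
    return files, Counter(p.suffix.lower() for p in files)
```

Printing the result of `check_tools()` and `inventory(...)` before writing any extraction code gives a record of exact paths and of which tools are missing.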

Step 2: Create Backup

Always back up source files before any processing:
bash
# Create timestamped backup directory
BACKUP_DIR="/tmp/backup_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
cp -r /path/to/source/documents/* "$BACKUP_DIR/"
echo "Backup created at: $BACKUP_DIR"
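A backup is only useful if it is complete. The following sketch, with the hypothetical name `create_verified_backup`, copies a directory into a timestamped folder and then compares every file byte for byte. The comparison step is an added assumption and is not part of the shell commands above.

```python
import filecmp
import shutil
from datetime import datetime
from pathlib import Path


def create_verified_backup(source_dir, backup_root):
    """Copy source_dir into a timestamped folder and verify every file."""
    source = Path(source_dir)
    backup_dir = Path(backup_root) / f"backup_{datetime.now():%Y%m%d_%H%M%S}"
    shutil.copytree(source, backup_dir)

    # Compare each source file with its copy by content, not just metadata.
    rel_files = [
        str(p.relative_to(source)) for p in source.rglob("*") if p.is_file()
    ]
    _, mismatch, errors = filecmp.cmpfiles(
        source, backup_dir, rel_files, shallow=False
    )
    if mismatch or errors:
        raise OSError(f"Backup verification failed for: {mismatch + errors}")
    return backup_dir
```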

Step 3: Test Extraction on Sample Files

Before processing all documents:
  1. Select 2-3 representative files (different formats, edge cases)
  2. Test OCR/extraction on these samples
  3. Verify extracted values match visual inspection
  4. Adjust extraction logic based on sample results
python
# Test extraction on a single file first
sample_file = "/path/to/sample_invoice.pdf"
extracted_data = extract_document(sample_file)
print(f"Extracted: {extracted_data}")
# Manually verify these values match the document
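One way to make the manual check repeatable is to type in the values read from the sample document by hand and compare them with the extraction result. The function name `compare_to_expected` and the field names in the usage example are hypothetical.

```python
def compare_to_expected(extracted, expected):
    """Return {field: (expected, extracted)} for every field that differs."""
    return {
        field: (want, extracted.get(field))
        for field, want in expected.items()
        if extracted.get(field) != want
    }


# Values read by eye from the sample document (illustrative only).
expected = {"invoice_number": "INV-001", "total": 1234.56}
extracted = {"invoice_number": "INV-001", "total": 123456.0}
for field, (want, got) in compare_to_expected(extracted, expected).items():
    print(f"Mismatch in {field}: expected {want!r}, extracted {got!r}")
```

An empty result means every hand-checked field matched; any mismatch points at extraction logic that needs adjusting before the full run.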

Step 4: Handle Format Variations

Financial documents often have format variations:
  • Number formats: European (1.234,56) vs US (1,234.56)
  • Date formats: DD/MM/YYYY vs MM/DD/YYYY vs YYYY-MM-DD
  • Currency symbols: $, €, £, or spelled out
  • Empty/missing fields: VAT may be blank, not zero
python
import re

def parse_amount(text):
    """Handle multiple number format conventions."""
    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[$€£\s]', '', text)

    # Detect European format (comma as decimal separator), with or
    # without dot-grouped thousands: 1.234,56 or 1234,56
    if re.match(r'^\d+(\.\d{3})*,\d{2}$', cleaned):
        cleaned = cleaned.replace('.', '').replace(',', '.')
    # US format (comma as thousands separator)
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '')

    return float(cleaned) if cleaned else None
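Dates need similar care. The sketch below is an assumption about how the three listed formats might be handled: `parse_date` is a hypothetical helper, and the `day_first` flag exists because a value such as 03/04/2024 is ambiguous and cannot be resolved from the string alone.

```python
from datetime import date, datetime


def parse_date(text, day_first=True):
    """Parse YYYY-MM-DD, DD/MM/YYYY, or MM/DD/YYYY; return None on failure.

    day_first decides which slash format is tried first, which settles
    ambiguous values such as 03/04/2024.
    """
    slash_formats = ["%d/%m/%Y", "%m/%d/%Y"]
    if not day_first:
        slash_formats.reverse()
    for fmt in ["%Y-%m-%d"] + slash_formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    # Report an unparseable date as missing instead of guessing.
    return None
```

Whether a batch of documents is day-first or month-first is best decided from the sample files, for example by finding a date whose day is greater than 12.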

Step 5: Process All Documents

After successful sample testing:
  1. Process documents one at a time with error handling
  2. Log extraction results for each document
  3. Collect all results before writing output file
python
results = []
errors = []

for doc_path in document_paths:
    try:
        data = extract_document(doc_path)
        results.append(data)
        print(f"✓ Processed: {doc_path}")
    except Exception as e:
        errors.append((doc_path, str(e)))
        print(f"✗ Failed: {doc_path} - {e}")

if errors:
    print(f"\nWarning: {len(errors)} documents failed to process")
    for path, error in errors:
        print(f"  - {path}: {error}")
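Once all results are collected, the output file can be written in one step. This sketch assumes CSV output and a list of dictionaries; `write_results_csv` and the column names in the usage note are hypothetical. It writes to a temporary file and renames it into place, so a crash midway never leaves a half-written output file.

```python
import csv
import os
import tempfile


def write_results_csv(results, output_path, fieldnames):
    """Write all records to a temp file, then rename it over output_path."""
    out_dir = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=out_dir)
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
            )
            writer.writeheader()
            # The csv module writes None as an empty cell, so missing
            # values stay blank instead of becoming 0 or "None".
            writer.writerows(results)
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call might look like `write_results_csv(results, "output.csv", ["filename", "total", "vat"])`, where the column list comes from the required output format.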

Step 6: Verify Before File Operations

Before moving files or writing final outputs:
  1. Compare extracted record count to source file count
  2. Spot-check extracted values against source documents
  3. Verify output format matches requirements
python
import random

# Verification checklist
assert len(results) == len(document_paths), "Record count mismatch"

# Spot-check a few values
for sample in random.sample(results, min(3, len(results))):
    print(f"Please verify: {sample['filename']} -> Total: {sample['total']}")
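The third check, output format, can be sketched as a function that lists problems instead of stopping at the first one. The name `find_format_problems` and the field names in the usage note are hypothetical, and the sketch assumes each record is a dictionary in which None marks a missing value.

```python
def find_format_problems(results, required_fields, numeric_fields=()):
    """Return a list of human-readable format problems (empty if none)."""
    problems = []
    for index, record in enumerate(results):
        label = record.get("filename", f"record {index}")
        for field in required_fields:
            if field not in record:
                problems.append(f"{label}: missing field '{field}'")
        for field in numeric_fields:
            value = record.get(field)
            # None means "not extracted" and is allowed; strings are not.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if value is not None and not is_number:
                problems.append(f"{label}: '{field}' is not numeric: {value!r}")
    return problems
```

For example, `find_format_problems(results, ["filename", "total", "vat"], ["total", "vat"])` should return an empty list before any file is moved.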

Step 7: Move Files (Only After Verification)

Only after verification passes:
bash
# Move files to destination (not delete!)
for file in /source/documents/*.pdf; do
    mv "$file" /processed/
done

# Only remove backup after confirming processed files exist
ls /processed/*.pdf && rm -rf "$BACKUP_DIR"

Common Pitfalls

1. Destructive Commands Without Backup

Problem: Using rm or overwriting files before verifying success. Prevention: Always create backups first; use copy-verify-delete pattern.

2. Command Order in Shell Pipelines

Problem: rm -f *.pdf && mv *.pdf /dest/ deletes the files before the move runs. Prevention: Test commands on sample data; understand execution order.

3. Incomplete Script Verification

Problem: Running truncated or incomplete scripts on production data. Prevention: Verify script content before execution; test on samples first.

4. Fabricating Missing Data

Problem: Writing guessed values when extraction fails. Prevention: Report failures explicitly; use null/empty for missing values.
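As an illustration of reporting a missing value instead of guessing, the sketch below looks for a total in OCR text and returns None when nothing is found. The pattern and the name `extract_total_text` are assumptions; real documents will likely need their own patterns.

```python
import re

# Matches "Total", an optional colon, and a number; the word boundary
# keeps "Subtotal" from matching.
TOTAL_PATTERN = re.compile(r"\bTotal\s*:?\s*(\d[\d.,]*\d|\d)", re.IGNORECASE)


def extract_total_text(ocr_text):
    """Return the raw total string, or None when no total is found."""
    match = TOTAL_PATTERN.search(ocr_text)
    return match.group(1) if match else None
```

A None result should flow through to the output as an empty value and be listed among the failures, never replaced with 0 or an estimate.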

5. Premature Optimization

Problem: Immediately reprocessing when values look wrong without investigation. Prevention: First analyze OCR output and extraction logic issues without moving files.

6. PDF vs Image Handling

Problem: Using OCR on text-based PDFs or text extraction on scanned PDFs. Prevention: Check if PDF has extractable text before choosing extraction method.
python
def is_text_based_pdf(pdf_path):
    """Check if PDF contains extractable text."""
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        if page.extract_text().strip():
            return True
    return False
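For mixed PDFs the same idea can be applied per page. The sketch below takes the text already extracted from each page (for instance, one `page.extract_text()` result per page) and returns the pages that probably need OCR. The name `pages_needing_ocr` and the `min_chars` threshold of 20 are arbitrary assumptions to tune on sample files.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is too short."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]
```

Pages in the returned list can then be rendered to images and passed to OCR, while the remaining pages keep their directly extracted text.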

Verification Strategies

Pre-Processing Verification

  • Source files exist and are readable
  • Backup created successfully
  • Required tools installed (tesseract, pdftotext, pypdf)
  • Sample extraction produces reasonable values

Post-Processing Verification

  • Output record count matches input file count
  • No extraction errors occurred (or errors are documented)
  • Spot-checked values match source documents
  • Output format matches requirements (correct columns, types)
  • Files moved to correct destinations
  • Original backup preserved until final verification

Recovery Plan

If something goes wrong:
  1. Stop immediately - do not continue processing
  2. Restore from backup:
    cp -r "$BACKUP_DIR"/* /source/
  3. Investigate the failure before retrying
  4. Fix extraction logic on samples before reprocessing all files
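The restore step can be sketched in Python as well, assuming Python 3.8 or later for the `dirs_exist_ok` argument. The function name `restore_from_backup` is hypothetical. Files in the source directory with the same name are overwritten, and files that exist only in the source directory are left alone.

```python
import shutil
from pathlib import Path


def restore_from_backup(backup_dir, source_dir):
    """Copy every file in backup_dir back into source_dir, overwriting."""
    shutil.copytree(backup_dir, source_dir, dirs_exist_ok=True)
    # Return the restored relative paths so the caller can log them.
    return sorted(
        p.relative_to(backup_dir).as_posix()
        for p in Path(backup_dir).rglob("*")
        if p.is_file()
    )
```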

Tool Selection Guide

| File Type | Primary Tool | Fallback |
| --- | --- | --- |
| Text-based PDF | pypdf, pdftotext | - |
| Scanned PDF | tesseract (after pdf2image) | pypdf |
| JPG/PNG images | tesseract | - |
| Mixed PDF (text + scans) | pypdf first, tesseract for image pages | - |
Install dependencies:
bash
# System packages
apt-get install tesseract-ocr poppler-utils

# Python packages
pip install pypdf pytesseract pdf2image pillow