financial-document-processor
Financial Document Processor
Overview
This skill provides guidance for extracting structured data from financial documents (invoices, receipts, statements, etc.) using OCR and PDF text extraction. It emphasizes data safety practices that prevent catastrophic failures from destructive operations.
Critical Data Safety Principles
NEVER perform destructive operations on source data without verification or backup.
Before any file processing:
- Create a backup of all source files before processing
- Work on copies, not originals
- Verify outputs match expectations before any cleanup
- Use a staged copy → verify → delete sequence instead of direct moves, so a failed step never leaves you without a good copy
Safe File Operation Pattern
bash
# CORRECT: Copy first, verify, then clean up
cp -r /source/documents/ /backup/documents/
# ... process files ...
# ... verify outputs match expectations ...
# Only after verification: rm -r /backup/documents/
# WRONG: Delete before moving (data loss risk)
rm -f /source/*.pdf && mv /source/*.pdf /dest/ # Files deleted before move!
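The copy → verify → delete sequence can also be scripted. The following is a minimal Python sketch, not a definitive implementation: `sha256_of` and `safe_move` are hypothetical helper names, and comparing SHA-256 digests is one assumed way to "verify" a copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_move(src: Path, dest_dir: Path) -> Path:
    """Copy src into dest_dir, verify the copy, and only then delete src."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Refuse to overwrite: an existing file may be the only good copy.
        raise FileExistsError(f"Destination already exists: {dest}")
    shutil.copy2(src, dest)
    if sha256_of(src) != sha256_of(dest):
        dest.unlink()  # discard the bad copy, keep the original
        raise OSError(f"Checksum mismatch after copying {src}")
    src.unlink()  # a verified copy exists, so removing the source is safe
    return dest
```

If any step raises, the source file is still in place, which is the property the WRONG example above lacks.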
Workflow
Step 1: Assess the Environment and Requirements
Before writing any processing code:
- List all source files and note their exact paths
- Identify file types (PDF text-based, PDF scanned, JPG, PNG, etc.)
- Check available tools: `which tesseract`, `which pdftotext`, `python3 -c "import pypdf"`
- Understand the expected output format (CSV columns, required fields, etc.)
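The first three checks in this list can be scripted so the assessment is repeatable. This is a rough sketch using only the standard library; `inventory` and `missing_tools` are hypothetical names, and the default tool list simply mirrors the commands mentioned above.

```python
import shutil
from collections import Counter
from pathlib import Path


def inventory(source_dir: Path) -> Counter:
    """Count files under source_dir by lowercase extension."""
    return Counter(
        p.suffix.lower() or "(none)"
        for p in source_dir.rglob("*")
        if p.is_file()
    )


def missing_tools(names=("tesseract", "pdftotext")) -> list[str]:
    """Return the command-line tools from names that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

An extension count does not distinguish text-based PDFs from scanned ones; that still needs a content check.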
Step 2: Create Backup
Always back up source files before any processing:
bash
# Create timestamped backup directory
BACKUP_DIR="/tmp/backup_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
cp -r /path/to/source/documents/* "$BACKUP_DIR/"
echo "Backup created at: $BACKUP_DIR"
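A backup is only useful if it is complete. As a hedged sketch, the same timestamped backup can be made from Python with a check that every source file arrived; `create_backup` and `relative_files` are hypothetical names, and comparing relative paths is an assumed minimum check, not a content comparison.

```python
import shutil
from datetime import datetime
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Return the relative POSIX paths of all files under root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def create_backup(source_dir: Path, backup_root: Path) -> Path:
    """Copy source_dir to a new timestamped directory and check nothing is missing."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = backup_root / f"backup_{stamp}"
    shutil.copytree(source_dir, backup_dir)  # raises if backup_dir already exists
    missing = relative_files(source_dir) - relative_files(backup_dir)
    if missing:
        raise OSError(f"Backup incomplete, missing: {sorted(missing)}")
    return backup_dir
```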
Step 3: Test Extraction on Sample Files
Before processing all documents:
- Select 2-3 representative files (different formats, edge cases)
- Test OCR/extraction on these samples
- Verify extracted values match visual inspection
- Adjust extraction logic based on sample results
python
# Test extraction on a single file first
sample_file = "/path/to/sample_invoice.pdf"
extracted_data = extract_document(sample_file)
print(f"Extracted: {extracted_data}")
# Manually verify these values match the document
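Choosing the samples can be partly automated. The sketch below picks one file per extension so that each format is exercised at least once; `pick_samples` is a hypothetical helper, and grouping by extension is an assumption that covers "different formats" but not edge cases, which still need to be chosen by hand.

```python
from collections import defaultdict
from pathlib import Path


def pick_samples(paths, per_type=1):
    """Pick the first per_type files of each extension, in sorted order."""
    by_type = defaultdict(list)
    for path in sorted(Path(p) for p in paths):
        by_type[path.suffix.lower()].append(path)
    return [p for group in by_type.values() for p in group[:per_type]]
```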
Step 4: Handle Format Variations
Financial documents often have format variations:
- Number formats: European (1.234,56) vs US (1,234.56)
- Date formats: DD/MM/YYYY vs MM/DD/YYYY vs YYYY-MM-DD
- Currency symbols: $, €, £, or spelled out
- Empty/missing fields: VAT may be blank, not zero
python
import re

def parse_amount(text):
    """Handle multiple number format conventions."""
    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[$€£\s]', '', text)
    # Detect European format (comma as decimal separator), with or without
    # dot-separated thousands groups: 1.234,56 or 1234,56
    if re.match(r'^(\d{1,3}(\.\d{3})*|\d+),\d{1,2}$', cleaned):
        cleaned = cleaned.replace('.', '').replace(',', '.')
    # US format (comma as thousands separator)
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '')
    # Blank input stays missing (None) rather than becoming zero
    return float(cleaned) if cleaned else None
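Dates carry the same ambiguity as amounts: 03/04/2024 is 3 April under DD/MM/YYYY and 4 March under MM/DD/YYYY. A minimal sketch, assuming the caller knows the document's locale and passes it in explicitly; `parse_date` and its `day_first` parameter are hypothetical, and only the three formats listed above are handled.

```python
from datetime import date, datetime


def parse_date(text, day_first):
    """Parse a date string into a datetime.date, or return None if it does not parse.

    day_first must be chosen by the caller: 03/04/2024 is ambiguous on its own.
    """
    text = text.strip()
    if not text:
        return None  # a blank field stays missing
    slash_format = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    for fmt in ("%Y-%m-%d", slash_format):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable input keeps a bad OCR read visible instead of silently producing a wrong date.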
Step 5: Process All Documents
After successful sample testing:
- Process documents one at a time with error handling
- Log extraction results for each document
- Collect all results before writing output file
python
results = []
errors = []
for doc_path in document_paths:
    try:
        data = extract_document(doc_path)
        results.append(data)
        print(f"✓ Processed: {doc_path}")
    except Exception as e:
        # Record the failure and keep going; nothing is guessed
        errors.append((doc_path, str(e)))
        print(f"✗ Failed: {doc_path} - {e}")
if errors:
    print(f"\nWarning: {len(errors)} documents failed to process")
    for path, error in errors:
        print(f"  - {path}: {error}")
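Once all results are collected, the output file can be written in one step. The sketch below writes to a temporary file and renames it into place, so a crash mid-write does not leave a half-written CSV; `write_results_csv` is a hypothetical helper, and the assumption that each result is a dict keyed by column name is not stated in the workflow above.

```python
import csv
import os
import tempfile


def write_results_csv(results, fieldnames, output_path):
    """Write all rows to a temp file, then rename it over output_path."""
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            for row in results:
                # Missing or None values become empty cells, never guesses.
                writer.writerow(
                    {k: "" if row.get(k) is None else row[k] for k in fieldnames}
                )
        # The rename is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, output_path)
    except BaseException:
        os.unlink(tmp_path)  # remove the partial temp file before re-raising
        raise
```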
Step 6: Verify Before File Operations
Before moving files or writing final outputs:
- Compare extracted record count to source file count
- Spot-check extracted values against source documents
- Verify output format matches requirements
python
import random

# Verification checklist
assert len(results) == len(document_paths), "Record count mismatch"
# Spot-check a few values
for sample in random.sample(results, min(3, len(results))):
    print(f"Please verify: {sample['filename']} -> Total: {sample['total']}")
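A bare `assert` stops at the first problem. As an alternative sketch, the checks can be gathered into a report so every issue is visible at once; `verify_results` and its `required_fields` parameter are hypothetical, and the `filename` key is assumed from the spot-check code above.

```python
def verify_results(results, errors, document_paths, required_fields):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Every input must be accounted for as either a result or a logged error.
    if len(results) + len(errors) != len(document_paths):
        problems.append(
            f"{len(document_paths)} inputs but {len(results)} results "
            f"and {len(errors)} errors"
        )
    for path, message in errors:
        problems.append(f"{path}: extraction failed ({message})")
    for row in results:
        for field in required_fields:
            # Only None and empty strings count as missing; 0.0 is a real value.
            if row.get(field) in (None, ""):
                problems.append(f"{row.get('filename', '?')}: missing {field}")
    return problems
```

Files should be moved only when the returned list is empty or every entry has been reviewed.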
Step 7: Move Files (Only After Verification)
Only after verification passes:
bash
# Move files to destination (not delete!)
for file in /source/documents/*.pdf; do
    mv "$file" /processed/
done
# Only remove backup after confirming processed files exist
ls /processed/*.pdf && rm -rf "$BACKUP_DIR"
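The `ls` check above succeeds as soon as any one PDF exists in the destination. A stricter check, sketched here under the assumption that both directories are flat, confirms that every backed-up file has a byte-identical copy before the backup is removed; `unconfirmed_files` is a hypothetical helper.

```python
import filecmp
from pathlib import Path


def unconfirmed_files(backup_dir: Path, processed_dir: Path) -> list[str]:
    """List backup files that have no byte-identical copy in processed_dir."""
    unconfirmed = []
    for backed_up in sorted(backup_dir.iterdir()):
        if not backed_up.is_file():
            continue
        candidate = processed_dir / backed_up.name
        # shallow=False compares file contents, not just size and mtime.
        if not candidate.is_file() or not filecmp.cmp(
            backed_up, candidate, shallow=False
        ):
            unconfirmed.append(backed_up.name)
    return unconfirmed
```

The backup is safe to delete only when this returns an empty list.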
Common Pitfalls
1. Destructive Commands Without Backup
Problem: Using `rm` or overwriting files before verifying success.
Prevention: Always create backups first; use the copy-verify-delete pattern.
2. Command Order in Shell Pipelines
Problem: In `rm -f *.pdf && mv *.pdf /dest/`, the files are deleted before the move runs, so there is nothing left to move.
Prevention: Test commands on sample data; understand execution order.
3. Incomplete Script Verification
Problem: Running truncated or incomplete scripts on production data.
Prevention: Verify script content before execution; test on samples first.
4. Fabricating Missing Data
Problem: Writing guessed values when extraction fails.
Prevention: Report failures explicitly; use null/empty for missing values.
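In code, this means an extractor returns `None` when it finds nothing, and the caller records that as missing. A minimal sketch, assuming the total appears after a literal "Total" label; `TOTAL_PATTERN` and `extract_total_text` are hypothetical, and real documents will need their own patterns.

```python
import re

# Hypothetical pattern: a standalone "Total" label followed by an amount.
TOTAL_PATTERN = re.compile(r"\bTotal\s*:?\s*([$€£]?\s*\d[\d.,]*)", re.IGNORECASE)


def extract_total_text(ocr_text):
    """Return the raw total string found in ocr_text, or None if absent."""
    match = TOTAL_PATTERN.search(ocr_text)
    # No match means "missing": return None rather than a default such as 0.
    return match.group(1).strip() if match else None
```

The `\b` keeps "Subtotal" from being mistaken for the total.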
5. Premature Reprocessing
Problem: Immediately reprocessing when values look wrong without investigation.
Prevention: First analyze OCR output and extraction logic issues without moving files.
6. PDF vs Image Handling
Problem: Using OCR on text-based PDFs or text extraction on scanned PDFs.
Prevention: Check if PDF has extractable text before choosing extraction method.
python
def is_text_based_pdf(pdf_path):
    """Check if PDF contains extractable text."""
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    # One page with non-blank text is enough to treat the PDF as text-based
    for page in reader.pages:
        if page.extract_text().strip():
            return True
    return False
Verification Strategies
Pre-Processing Verification
- Source files exist and are readable
- Backup created successfully
- Required tools installed (tesseract, pdftotext, pypdf)
- Sample extraction produces reasonable values
Post-Processing Verification
- Output record count matches input file count
- No extraction errors occurred (or errors are documented)
- Spot-checked values match source documents
- Output format matches requirements (correct columns, types)
- Files moved to correct destinations
- Original backup preserved until final verification
Recovery Plan
If something goes wrong:
- Stop immediately - do not continue processing
- Restore from backup: `cp -r "$BACKUP_DIR"/* /source/`
- Investigate the failure before retrying
- Fix extraction logic on samples before reprocessing all files
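The `cp -r` restore above overwrites whatever is currently in the source directory. A more conservative variant, sketched here as an assumption and not the documented procedure, copies back only the files that are missing, so any surviving files can still be inspected during the investigation; `restore_missing` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def restore_missing(backup_dir: Path, source_dir: Path) -> list[str]:
    """Copy files from backup_dir that are absent in source_dir; never overwrite."""
    restored = []
    for backed_up in sorted(p for p in backup_dir.rglob("*") if p.is_file()):
        relative = backed_up.relative_to(backup_dir)
        target = source_dir / relative
        if target.exists():
            continue  # leave existing files alone so they can be inspected first
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backed_up, target)
        restored.append(relative.as_posix())
    return restored
```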
Tool Selection Guide
| File Type | Primary Tool | Fallback |
|---|---|---|
| Text-based PDF | pypdf, pdftotext | - |
| Scanned PDF | tesseract (after pdf2image) | pypdf |
| JPG/PNG images | tesseract | - |
| Mixed PDF (text + scans) | pypdf first, tesseract for image pages | - |
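The table can be expressed as a small dispatch function. This sketch is deliberately file-level, so the mixed-PDF row, which needs a per-page decision, is not covered; `choose_method`, the `"text"`/`"ocr"`/`"unsupported"` labels, and the inclusion of `.jpeg` are assumptions for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def choose_method(path, has_text_layer=None):
    """Map a file to 'text', 'ocr', or 'unsupported' based on its type.

    has_text_layer is only consulted for PDFs; pass the result of a check
    such as is_text_based_pdf.
    """
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "ocr"  # images always go through tesseract
    if suffix == ".pdf":
        if has_text_layer is None:
            raise ValueError("has_text_layer is required for PDF files")
        return "text" if has_text_layer else "ocr"
    return "unsupported"
```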
Install dependencies:
bash
# System packages
apt-get install tesseract-ocr poppler-utils
# Python packages
pip install pypdf pytesseract pdf2image pillow