financial-document-processor

Original：🇺🇸 English

Translated

Guidance for processing financial documents (invoices, receipts, statements) with OCR and text extraction. This skill should be used when tasks involve extracting data from financial PDFs or images, generating summaries (CSV/JSON), or moving/organizing processed documents. Emphasizes data safety practices to prevent catastrophic data loss.

13installs

Sourceletta-ai/skills

Added on2026-02-15

NPX Install

npx skill4agent add letta-ai/skills financial-document-processor

SKILL.md Content

View Translation Comparison →

Financial Document Processor

Overview

This skill provides guidance for extracting structured data from financial documents (invoices, receipts, statements, etc.) using OCR and PDF text extraction. It emphasizes data safety practices that prevent catastrophic failures from destructive operations.

Critical Data Safety Principles

NEVER perform destructive operations on source data without verification or backup.

Before any file processing:

Create a backup of all source files before processing
Work on copies, not originals
Verify outputs match expectations before any cleanup
Use atomic operations (copy → verify → delete) instead of direct moves

Safe File Operation Pattern

bash

# CORRECT: Copy first, verify, then clean up
cp -r /source/documents/ /backup/documents/
# ... process files ...
# ... verify outputs match expectations ...
# Only after verification: rm /backup/documents/

# WRONG: Delete before moving (data loss risk)
rm -f /source/*.pdf && mv /source/* /dest/  # Files deleted before move!

Workflow

Step 1: Assess the Environment and Requirements

Before writing any processing code:

List all source files and note their exact paths
Identify file types (PDF text-based, PDF scanned, JPG, PNG, etc.)

Check available tools:

which tesseract

which pdftotext

python3 -c "import pypdf"

Understand the expected output format (CSV columns, required fields, etc.)

Step 2: Create Backup

Always backup source files before any processing:

bash

# Create timestamped backup directory
BACKUP_DIR="/tmp/backup_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
cp -r /path/to/source/documents/* "$BACKUP_DIR/"
echo "Backup created at: $BACKUP_DIR"

Step 3: Test Extraction on Sample Files

Before processing all documents:

Select 2-3 representative files (different formats, edge cases)
Test OCR/extraction on these samples
Verify extracted values match visual inspection
Adjust extraction logic based on sample results

python

# Test extraction on a single file first
sample_file = "/path/to/sample_invoice.pdf"
extracted_data = extract_document(sample_file)
print(f"Extracted: {extracted_data}")
# Manually verify these values match the document

Step 4: Handle Format Variations

Financial documents often have format variations:

Number formats: European (1.234,56) vs US (1,234.56)
Date formats: DD/MM/YYYY vs MM/DD/YYYY vs YYYY-MM-DD
Currency symbols: $, €, £, or spelled out
Empty/missing fields: VAT may be blank, not zero

python

def parse_amount(text):
    """Handle multiple number format conventions."""
    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[$€£\s]', '', text)

    # Detect European format (comma as decimal separator)
    if re.match(r'^\d{1,3}(\.\d{3})*,\d{2}$', cleaned):
        cleaned = cleaned.replace('.', '').replace(',', '.')
    # US format (comma as thousands separator)
    elif ',' in cleaned:
        cleaned = cleaned.replace(',', '')

    return float(cleaned) if cleaned else None

Step 5: Process All Documents

After successful sample testing:

Process documents one at a time with error handling
Log extraction results for each document
Collect all results before writing output file

python

results = []
errors = []

for doc_path in document_paths:
    try:
        data = extract_document(doc_path)
        results.append(data)
        print(f"✓ Processed: {doc_path}")
    except Exception as e:
        errors.append((doc_path, str(e)))
        print(f"✗ Failed: {doc_path} - {e}")

if errors:
    print(f"\nWarning: {len(errors)} documents failed to process")
    for path, error in errors:
        print(f"  - {path}: {error}")

Step 6: Verify Before File Operations

Before moving files or writing final outputs:

Compare extracted record count to source file count
Spot-check extracted values against source documents
Verify output format matches requirements

python

# Verification checklist
assert len(results) == len(document_paths), "Record count mismatch"

# Spot-check a few values
for sample in random.sample(results, min(3, len(results))):
    print(f"Please verify: {sample['filename']} -> Total: {sample['total']}")

Step 7: Move Files (Only After Verification)

Only after verification passes:

bash

# Move files to destination (not delete!)
for file in /source/documents/*.pdf; do
    mv "$file" /processed/
done

# Only remove backup after confirming processed files exist
ls /processed/*.pdf && rm -rf "$BACKUP_DIR"

Common Pitfalls

1. Destructive Commands Without Backup

Problem: Using

rm

or overwriting files before verifying success. Prevention: Always create backups first; use copy-verify-delete pattern.

2. Command Order in Shell Pipelines

Problem:

rm -f *.pdf && mv *.pdf /dest/

- files are deleted before move. Prevention: Test commands on sample data; understand execution order.

3. Incomplete Script Verification

Problem: Running truncated or incomplete scripts on production data. Prevention: Verify script content before execution; test on samples first.

4. Fabricating Missing Data

Problem: Writing guessed values when extraction fails. Prevention: Report failures explicitly; use null/empty for missing values.

5. Premature Optimization

Problem: Immediately reprocessing when values look wrong without investigation. Prevention: First analyze OCR output and extraction logic issues without moving files.

6. PDF vs Image Handling

Problem: Using OCR on text-based PDFs or text extraction on scanned PDFs. Prevention: Check if PDF has extractable text before choosing extraction method.

python

def is_text_based_pdf(pdf_path):
    """Check if PDF contains extractable text."""
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        if page.extract_text().strip():
            return True
    return False

Verification Strategies

Pre-Processing Verification

Source files exist and are readable
Backup created successfully
Required tools installed (tesseract, pdftotext, pypdf)
Sample extraction produces reasonable values

Post-Processing Verification

Output record count matches input file count
No extraction errors occurred (or errors are documented)
Spot-checked values match source documents
Output format matches requirements (correct columns, types)
Files moved to correct destinations
Original backup preserved until final verification

Recovery Plan

If something goes wrong:

Stop immediately - do not continue processing
Restore from backup:
```
cp -r "$BACKUP_DIR"/* /source/
```
Investigate the failure before retrying
Fix extraction logic on samples before reprocessing all files

Tool Selection Guide

File Type	Primary Tool	Fallback
Text-based PDF	pypdf, pdftotext	-
Scanned PDF	tesseract (after pdf2image)	pypdf
JPG/PNG images	tesseract	-
Mixed PDF (text + scans)	pypdf first, tesseract for image pages	-

Install dependencies:

bash

# System packages
apt-get install tesseract-ocr poppler-utils

# Python packages
pip install pypdf pytesseract pdf2image pillow

financial-document-processor

NPX Install

Tags

SKILL.md Content

Financial Document Processor

Overview

Critical Data Safety Principles

Safe File Operation Pattern

Workflow

Step 1: Assess the Environment and Requirements

Step 2: Create Backup

Step 3: Test Extraction on Sample Files

Step 4: Handle Format Variations

Step 5: Process All Documents

Step 6: Verify Before File Operations

Step 7: Move Files (Only After Verification)

Common Pitfalls

1. Destructive Commands Without Backup

2. Command Order in Shell Pipelines

3. Incomplete Script Verification

4. Fabricating Missing Data

5. Premature Optimization

6. PDF vs Image Handling

Verification Strategies

Pre-Processing Verification

Post-Processing Verification

Recovery Plan

Tool Selection Guide