Use when working with "PDF", "Excel", "Word", "PowerPoint", "XLSX", "DOCX", "PPTX", "spreadsheets", "presentations", "extract text", "merge documents", "convert documents", or asking about "office document manipulation"
npx skill4agent add eyadsibai/ltk document-processing

| Format | Extension | Structure | Best For |
|---|---|---|---|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |
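
Because the three Office formats above are XML parts packed in a ZIP archive, Python's standard `zipfile` module can list what is inside one. The following is a minimal sketch: the helper name `list_parts` is hypothetical, and the in-memory sample archive is only a stand-in (it is not a valid Word document).

```python
import io
import zipfile


def list_parts(source):
    """Return the sorted names of the parts stored in an Office archive.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return sorted(archive.namelist())


# Build a tiny stand-in archive in memory (assumption: illustrative only,
# not a document that Word would open).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
sample.seek(0)

print(list_parts(sample))
```

Pointing `list_parts` at a real `.xlsx`, `.docx`, or `.pptx` path should show that file's actual parts.
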

| Task | Best Tool |
|---|---|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

| Operation | Approach |
|---|---|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call |
| Encrypt | Use writer's |
| OCR | Convert to images, run pytesseract |

| Task | Best Tool |
|---|---|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

| Approach | Result |
|---|---|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
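
To illustrate the difference between the two approaches, here is a small sketch using openpyxl. The function name `build_budget` and the sample rows are assumptions made for illustration. Note that openpyxl stores the formula text without evaluating it; the result appears once a spreadsheet application opens and calculates the file.

```python
from openpyxl import Workbook


def build_budget(rows):
    """Build a workbook whose total is an Excel formula, not a Python result."""
    wb = Workbook()
    ws = wb.active
    ws.append(["Item", "Amount"])
    for item, amount in rows:
        ws.append([item, amount])

    last_data_row = ws.max_row  # header is row 1, data ends here
    total_row = last_data_row + 1

    # Wrong: ws.cell(row=total_row, column=2, value=sum(a for _, a in rows))
    # would store a static number that goes stale when the data changes.

    # Right: store the formula text; the spreadsheet app evaluates it.
    ws.cell(row=total_row, column=1, value="Total")
    ws.cell(row=total_row, column=2, value=f"=SUM(B2:B{last_data_row})")
    return wb


wb = build_budget([("Rent", 1200), ("Utilities", 150), ("Internet", 60)])
print(wb.active["B5"].value)  # =SUM(B2:B4)
```

Calling `wb.save("budget.xlsx")` (hypothetical file name) would then write the workbook to disk.
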

| Convention | Meaning |
|---|---|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
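
One possible way to apply these conventions with openpyxl is sketched below. The hex color values, the helper name `style_cell`, and the heuristic that a formula containing `!` refers to another sheet are all assumptions made for illustration, not rules stated by this guide.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

# Assumed RGB values for the named colors.
INPUT_FONT = Font(color="0000FF")    # blue: hardcoded inputs
FORMULA_FONT = Font(color="000000")  # black: formulas
LINK_FONT = Font(color="008000")     # green: links to other sheets
ATTENTION_FILL = PatternFill(fill_type="solid", start_color="FFFF00")  # yellow


def style_cell(cell, needs_attention=False):
    """Color a cell by what it holds; text labels are left unchanged."""
    value = cell.value
    if isinstance(value, str) and value.startswith("="):
        # Heuristic (assumption): "Sheet!A1" style references contain "!".
        cell.font = LINK_FONT if "!" in value else FORMULA_FONT
    elif isinstance(value, (int, float)):
        cell.font = INPUT_FONT
    if needs_attention:
        cell.fill = ATTENTION_FILL


wb = Workbook()
ws = wb.active
ws["A1"] = 100              # hardcoded input
ws["A2"] = "=A1*2"          # formula
ws["A3"] = "=Inputs!B2"     # link to another sheet (hypothetical sheet name)
for cell in (ws["A1"], ws["A2"], ws["A3"]):
    style_cell(cell)
```
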

| Error | Cause |
|---|---|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
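
The table above can double as a lookup when scanning a sheet for problems. The sketch below works on plain rows of values; in practice these might come from `ws.iter_rows(values_only=True)` on a workbook opened with `openpyxl.load_workbook(path, data_only=True)` after a spreadsheet application has calculated it. The function name `find_errors` is hypothetical.

```python
ERROR_CAUSES = {
    "#REF!": "Invalid cell reference",
    "#DIV/0!": "Division by zero",
    "#VALUE!": "Wrong data type",
    "#NAME?": "Unknown function name",
}


def find_errors(rows):
    """Return (row, column, error, cause) for each error cell, 1-based."""
    found = []
    for row_number, row in enumerate(rows, start=1):
        for column_number, value in enumerate(row, start=1):
            if isinstance(value, str) and value in ERROR_CAUSES:
                found.append(
                    (row_number, column_number, value, ERROR_CAUSES[value])
                )
    return found


rows = [
    ["Item", "Ratio"],
    ["A", 0.5],
    ["B", "#DIV/0!"],
]
print(find_errors(rows))  # [(3, 2, '#DIV/0!', 'Division by zero')]
```
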

| Task | Best Tool |
|---|---|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

| File | Contains |
|---|---|
| Main content |
| Comments |
| Images |

| Element | XML Tag |
|---|---|
| Deletion | |
| Insertion | |

| Task | Best Tool |
|---|---|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

| Path | Contains |
|---|---|
| Slide content |
| Speaker notes |
| Master templates |
| Images |

| Principle | Guideline |
|---|---|
| Fonts | Use web-safe fonts: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

| Conversion | Tool |
|---|---|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Appropriate extractor |
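
The command-line conversions above could be driven from Python along these lines. This is a sketch under assumptions: the LibreOffice binary is taken to be on the PATH as `soffice` (some installs expose it as `libreoffice`), the 150 DPI default is arbitrary, and the helper names are hypothetical.

```python
import subprocess


def to_pdf_command(source, out_dir):
    """LibreOffice headless: convert a document to PDF inside out_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def pdf_to_png_command(pdf, prefix, dpi=150):
    """pdftoppm: render each PDF page to a PNG named after the prefix."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def docx_to_markdown_command(docx, markdown):
    """pandoc: convert a Word document to Markdown."""
    return ["pandoc", str(docx), "-t", "markdown", "-o", str(markdown)]


def run(command):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(command, check=True)


print(to_pdf_command("report.docx", "out"))
```

Building the argument list separately from running it keeps the commands easy to inspect and avoids shell quoting issues, since `subprocess.run` receives a list rather than a single string.
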

| Practice | Why |
|---|---|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
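
For the "test output opens correctly" practice, a cheap first check on `.xlsx`, `.docx`, and `.pptx` output is that the file is still a readable ZIP package. The sketch below uses only the standard library; the function name `check_office_file` is hypothetical, and passing these checks does not guarantee that the Office application will accept the file.

```python
import zipfile


def check_office_file(source):
    """Return a list of problems found; an empty list means the checks passed.

    `source` may be a file path or a seekable binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        return ["not a ZIP archive"]
    problems = []
    with zipfile.ZipFile(source) as archive:
        # testzip() returns the first part with a bad CRC, or None.
        bad_part = archive.testzip()
        if bad_part is not None:
            problems.append(f"corrupt part: {bad_part}")
        # Office Open XML packages carry a content-types manifest at the root.
        if "[Content_Types].xml" not in archive.namelist():
            problems.append("missing [Content_Types].xml")
    return problems
```

Opening the file in the target application, or converting it with LibreOffice headless, remains the more thorough test.
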

| Language | Packages |
|---|---|
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |