read-source
Extract text from source documents (PDF, DOCX, PPTX, HTML, Markdown) for spreadsheet workflows. Use to understand source material before populating workbooks.
Source: witanlabs/witan-cli
NPX Install
npx skill4agent add witanlabs/witan-cli read-source
When to Use
Use `witan read` to convert source documents into LLM-ready text. This is for source material: PDFs, Word docs, presentations, HTML pages, and Markdown files that contain data you need to extract.
- PDF → plain text
- Word (.doc, .docx) → markdown
- PowerPoint (.ppt, .pptx) → markdown
- HTML → markdown
- Markdown (.md) → outline support via `--outline`
This is not for reading spreadsheet data (.xlsx, .xls) — use spreadsheet-specific tools for that.
Setup
Files are cached server-side by content hash so repeated operations skip re-upload. If `WITAN_STATELESS=1` is set (or `--stateless` is passed), files are processed but not stored.
The CLI automatically applies per-attempt request timeouts and retries transient API failures (`408`, `429`, `500`, `502`, `503`, `504`, plus timeout/network errors). Non-retryable `4xx` responses fail immediately.
Quick Reference
```bash
# Get document structure first
witan read report.pdf --outline
witan read slides.pptx --outline

# Read specific sections
witan read report.pdf --pages 1-5
witan read slides.pptx --slides 1-3
witan read notes.docx --offset 50 --limit 100

# Read from URLs
witan read https://example.com/report.pdf --outline
witan read https://example.com/data.csv

# JSON output for automation
witan read report.pdf --json
witan read report.pdf --outline --json
```
Exit Codes
| Code | Meaning |
|---|---|
| `0` | Success |
| non-zero | Error (bad arguments, network failure, unsupported format) |
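A script that chains several reads can branch on the exit status from the table above. The sketch below is only an illustration: `run_step` is a hypothetical helper, not part of witan, and `true`/`false` stand in for real `witan read` invocations so the snippet runs without the CLI installed. It assumes the usual convention that success is exit status 0.

```bash
# run_step LABEL COMMAND [ARGS...]
# Hypothetical helper (not part of witan): runs COMMAND, reports whether it
# exited with status 0 or non-zero, and passes a failing status back.
run_step() {
  local label=$1 status
  shift
  if "$@"; then
    echo "$label: ok"
  else
    status=$?
    echo "$label: failed (exit $status)"
    return "$status"
  fi
}

# Stand-ins for a real invocation such as: witan read report.pdf --outline
run_step "outline" true
run_step "read" false || echo "stopping: fix the error before continuing"
```

In real use the stand-ins would be replaced by the command itself, for example `run_step outline witan read report.pdf --outline`.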
Navigation Strategy
Go directly with `--pages`, `--slides`, or `--offset`/`--limit` when you know where to look. Use `--outline` when you don't: it gives document structure to target the right section.
PDF workflow:
- `witan read report.pdf --outline` → see chapter/section structure with page ranges
- `witan read report.pdf --pages 12-15` → read the section you need
PPTX workflow:
- `witan read deck.pptx --outline` → see slide titles
- `witan read deck.pptx --slides 5-8` → read specific slides
Text/DOCX workflow:
- `witan read notes.docx --outline` → see heading structure with line offsets
- `witan read notes.docx --offset 120 --limit 50` → read a section
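The outline-then-read pattern in the PDF workflow can be scripted by pulling a section's page range out of the outline text. This is a minimal sketch that assumes outline lines look like `Title [pages N-M]`, as in the sample under Output Format; `section_pages` is a hypothetical helper name, and the real outline format may differ.

```bash
# section_pages TITLE
# Hypothetical helper: reads outline text on stdin and prints the page range
# of the first entry whose title equals TITLE ("Results [pages 3-8]" -> "3-8").
section_pages() {
  awk -v title="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)                  # drop outline indentation
      if (match(line, / \[pages [0-9]+-[0-9]+\]$/)) {
        name = substr(line, 1, RSTART - 1)      # text before " [pages ..."
        if (name == title) {
          # skip the 8 characters of " [pages " and the closing bracket
          print substr(line, RSTART + 8, RLENGTH - 9)
          exit
        }
      }
    }'
}

# Demo on a literal outline instead of live CLI output
printf '%s\n' 'Introduction [pages 1-2]' '  Background [pages 1-1]' \
  'Results [pages 3-8]' | section_pages "Results"
# prints: 3-8
```

With the CLI installed, the same function could be fed from `witan read report.pdf --outline`, and its result passed to `--pages`.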
Command Reference
`witan read <file-or-url> [flags]`
| Flag | Default | Description |
|---|---|---|
| `--pages` | — | PDF page range (e.g. `1-5`) |
| `--slides` | — | Presentation slide range (e.g. `1-3`) |
| `--offset` | | Start line (1-indexed) |
| `--limit` | `2000` | Maximum lines to return |
| `--outline` | | Show document structure instead of content |
| `--json` | | Output full JSON response |
Pagination Limits
| Constraint | Value |
|---|---|
| Max PDF pages per read | 10 |
| Max PPTX slides per read | 10 |
| Default line limit | 2000 |
| Max file size | 25 MB |
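Because a single read covers at most 10 PDF pages, a longer section has to be fetched in several calls. The sketch below shows one way to split a page range into compliant chunks; `page_chunks` is a hypothetical helper, and it assumes page ranges are inclusive, as the `--pages 1-5` examples suggest.

```bash
# page_chunks FIRST LAST [MAX]
# Hypothetical helper: prints one "start-end" range per line, each covering
# at most MAX pages (default 10, the per-read PDF limit in the table above).
page_chunks() {
  local first=$1 last=$2 max=${3:-10}
  local start=$first end
  while [ "$start" -le "$last" ]; do
    end=$((start + max - 1))
    if [ "$end" -gt "$last" ]; then
      end=$last                 # clamp the final chunk to the last page
    fi
    echo "$start-$end"
    start=$((end + 1))
  done
}

page_chunks 1 25
# prints:
# 1-10
# 11-20
# 21-25
```

Each printed range can then be passed to a separate `witan read report.pdf --pages <range>` call. The same function works for slides by reading the ranges into `--slides` instead.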
Pipeline: Source → Spreadsheet
The typical flow for reading source material and populating a spreadsheet:
- Explore: `witan read source.pdf --outline` to understand structure
- Read: `witan read source.pdf --pages 3-8` to get the data
- Parse: extract values from the text (LLM or regex)
- Write: `witan xlsx exec model.xlsx --input-json '...'` to populate the spreadsheet
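The Parse step can often be done with a small regex-style filter. The sketch below assumes content-mode lines of the form `<line number> Q1: $1,250,000`, matching the sample under Output Format; `extract_quarters` is a hypothetical helper, and the field layout of real documents will vary. The Write step is not shown because the `--input-json` payload format is not described here.

```bash
# extract_quarters
# Hypothetical helper: reads line-numbered text on stdin and prints
# "label,amount" for each quarterly line, with "$" and "," stripped.
extract_quarters() {
  awk '
    # fields: $1 = line number, $2 = label with colon, $3 = dollar amount
    NF == 3 && $2 ~ /^Q[1-4]:$/ && $3 ~ /^\$[0-9,]+$/ {
      label = substr($2, 1, length($2) - 1)    # strip the trailing colon
      amount = $3
      gsub(/[$,]/, "", amount)                 # keep digits only
      print label "," amount
    }'
}

# Demo on literal text instead of live CLI output
printf '%s\n' '1 Revenue Summary' '2' '3 Q1: $1,250,000' '4 Q2: $1,380,000' \
  | extract_quarters
# prints:
# Q1,1250000
# Q2,1380000
```

In a pipeline this would sit after the Read step, for example `witan read source.pdf --pages 3-8 | extract_quarters`.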
Output Format
Content mode (default): line-numbered text to stdout, metadata to stderr.
```
1 Revenue Summary
2
3 Q1: $1,250,000
4 Q2: $1,380,000

text/plain [15 pages, 10 read, 847 lines total, showing 1–847]
```
Outline mode (`--outline`): indented structure to stdout.
```
Introduction [pages 1-2]
  Background [pages 1-1]
  Methodology [pages 2-2]
Results [pages 3-8]
  Financial Summary [pages 3-5]
  Projections [pages 6-8]
Appendix [pages 9-15]
[15 pages]
```
Error Guide
| Error | Fix |
|---|---|
| | Check file path exists and is readable |
| | Check the URL is accessible |
| | File exceeds 25 MB limit |
| | Set Content-Type header (API only) |
| Empty outline | Document has no bookmarks/headings; use offset/limit to navigate |
| Truncated text | Use |
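When a document has no outline, paging through it with `--offset` and `--limit` is the fallback. The sketch below computes the offset/limit pairs needed to cover a known number of lines; `line_windows` is a hypothetical helper, and the total is assumed to come from the `lines total` figure in the stderr metadata.

```bash
# line_windows TOTAL [SIZE]
# Hypothetical helper: prints "offset limit" pairs (1-indexed offsets) that
# cover TOTAL lines in windows of SIZE lines (default 2000, the documented
# default line limit).
line_windows() {
  local total=$1 size=${2:-2000}
  local offset=1 remaining
  while [ "$offset" -le "$total" ]; do
    remaining=$((total - offset + 1))
    if [ "$remaining" -lt "$size" ]; then
      echo "$offset $remaining"   # final, shorter window
    else
      echo "$offset $size"
    fi
    offset=$((offset + size))
  done
}

line_windows 847 400
# prints:
# 1 400
# 401 400
# 801 47
```

Each pair maps onto one call of the form `witan read notes.docx --offset <offset> --limit <limit>`.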