read-source

Extract text from source documents (PDF, DOCX, PPTX, HTML, Markdown) for spreadsheet workflows. Use to understand source material before populating workbooks.

2 installs

Source: witanlabs/witan-cli

Added on 2026-03-04

NPX Install

npx skill4agent add witanlabs/witan-cli read-source

SKILL.md Content

When to Use

Use `witan read` to convert source documents into LLM-ready text. This is for source material: PDFs, Word docs, presentations, HTML pages, and Markdown files that contain data you need to extract.

PDF → plain text
Word (.doc, .docx) → markdown
PowerPoint (.ppt, .pptx) → markdown
HTML → markdown
Markdown (.md) → outline support via `--outline`

This is not for reading spreadsheet data (.xlsx, .xls) — use spreadsheet-specific tools for that.

Setup

Files are cached server-side by content hash, so repeated operations skip re-upload. If `WITAN_STATELESS=1` is set (or `--stateless` is passed), files are processed but not stored.

The CLI automatically applies per-attempt request timeouts and retries transient API failures, including timeout and network errors. Non-retryable `4xx` responses fail immediately.

Quick Reference

```bash
# Get document structure first
witan read report.pdf --outline
witan read slides.pptx --outline

# Read specific sections
witan read report.pdf --pages 1-5
witan read slides.pptx --slides 1-3
witan read notes.docx --offset 50 --limit 100

# Read from URLs
witan read https://example.com/report.pdf --outline
witan read https://example.com/data.csv

# JSON output for automation
witan read report.pdf --json
witan read report.pdf --outline --json
```

Exit Codes

| Code | Meaning |
| --- | --- |
| `0` | Success |
| `1` | Error (bad arguments, network failure, unsupported format) |
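
Because every failure maps to exit code `1`, a calling script can only branch on success versus failure and has to rely on stderr for the details. A minimal sketch of that pattern, assuming a hypothetical wrapper named `run_read` (not part of witan) that runs whatever command it is given:

```shell
# Hypothetical helper (not part of witan): run a command, report the
# outcome on stderr, and pass the command's exit code through unchanged.
run_read() {
  local status=0
  "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "read succeeded" >&2
  else
    echo "read failed (exit code $status)" >&2
  fi
  return "$status"
}

# Usage (assumes witan is installed and report.pdf exists):
# run_read witan read report.pdf --outline > outline.txt
```

Keeping the status message on stderr leaves stdout free for the document text, which matches how the CLI separates content from metadata.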

Navigation Strategy

Go directly with `--pages`, `--slides`, or `--offset`/`--limit` when you know where to look. Use `--outline` when you don't; it gives you the document structure so you can target the right section.

PDF workflow:

`witan read report.pdf --outline` → see chapter/section structure with page ranges
`witan read report.pdf --pages 12-15` → read the section you need

PPTX workflow:

`witan read deck.pptx --outline` → see slide titles
`witan read deck.pptx --slides 5-8` → read specific slides

Text/DOCX workflow:

`witan read notes.docx --outline` → see heading structure with line offsets
`witan read notes.docx --offset 120 --limit 50` → read a section

Command Reference

witan read <file-or-url> [flags]

| Flag | Default | Description |
| --- | --- | --- |
| `--pages` | (none) | PDF page range (e.g. `1-5`, `1,3,5`, `1-5,10-15`) |
| `--slides` | (none) | Presentation slide range (e.g. `1-3`) |
| `--offset` | `1` | Start line (1-indexed) |
| `--limit` | `2000` | Maximum lines to return |
| `--outline` | `false` | Show document structure instead of content |
| `--json` | `false` | Output full JSON response |
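
The `--pages` examples suggest a comma-separated list of page numbers and `start-end` ranges. As a rough sketch, a script could sanity-check a value before calling the CLI. `is_page_spec` is a hypothetical helper, and its pattern is inferred from the examples in the table, not taken from witan's own validation, which may differ.

```shell
# Hypothetical helper (not part of witan): succeed if $1 looks like a
# page spec such as "1-5", "1,3,5", or "1-5,10-15".
# The pattern is an assumption based on the documented examples.
is_page_spec() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$'
}

# Usage (assumes witan is installed and report.pdf exists):
# if is_page_spec "$pages"; then
#   witan read report.pdf --pages "$pages"
# else
#   echo "unexpected page spec: $pages" >&2
# fi
```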

Pagination Limits

| Constraint | Value |
| --- | --- |
| Max PDF pages per read | 10 |
| Max PPTX slides per read | 10 |
| Default line limit | 2000 |
| Max file size | 25 MB |
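
With a cap of 10 pages (or slides) per read, a longer section has to be fetched in several calls. One way to split a range, sketched with a hypothetical helper `page_chunks` that is not part of witan; the default chunk size of 10 mirrors the limit in the table above:

```shell
# Hypothetical helper (not part of witan): print consecutive ranges of at
# most `max` pages that together cover first..last.
page_chunks() {
  local first=$1 last=$2 max=${3:-10}
  local start=$first end
  while [ "$start" -le "$last" ]; do
    end=$((start + max - 1))
    if [ "$end" -gt "$last" ]; then
      end=$last
    fi
    echo "${start}-${end}"
    start=$((end + 1))
  done
}

# page_chunks 1 25 prints:
#   1-10
#   11-20
#   21-25

# Usage (assumes witan is installed and report.pdf has at least 25 pages):
# for range in $(page_chunks 1 25); do
#   witan read report.pdf --pages "$range"
# done
```

The same helper works for `--slides`, since both limits are 10.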

Pipeline: Source → Spreadsheet

The typical flow for reading source material and populating a spreadsheet:

Explore: `witan read source.pdf --outline` to understand structure
Read: `witan read source.pdf --pages 3-8` to get the data
Parse: extract values from the text (LLM or regex)
Write: `witan xlsx exec model.xlsx --input-json '...'` to populate the spreadsheet
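
For the Parse step, a regex pass can be enough when the source text is regular. The sketch below assumes the line-numbered content output described under Output Format and figures written as `Q1: $1,250,000`. `extract_quarters` is a hypothetical helper, and both the line-number prefix and the `Qn: $amount` pattern are assumptions drawn from that sample.

```shell
# Hypothetical helper (not part of witan): read line-numbered text on stdin
# and print "quarter,amount" rows for lines like "     3<TAB>Q1: $1,250,000".
# Lines that do not match the assumed pattern are skipped.
extract_quarters() {
  sed -nE 's/^[[:space:]]*[0-9]+[[:space:]]+(Q[1-4]): \$([0-9,]+)$/\1 \2/p' \
    | tr -d ',' \
    | tr ' ' ','
}

# Usage (assumes witan is installed and source.pdf exists):
# witan read source.pdf --pages 3-8 2>/dev/null | extract_quarters
```

For the sample output shown under Output Format, this yields `Q1,1250000` and `Q2,1380000`. Irregular layouts are better handled by the LLM route.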

Output Format

Content mode (default): line-numbered text to stdout, metadata to stderr.

```
     1	Revenue Summary
     2
     3	Q1: $1,250,000
     4	Q2: $1,380,000
text/plain  [15 pages, 10 read, 847 lines total, showing 1–847]
```
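
When the text is passed to another tool, the line-number prefix usually needs to come off first. A small sketch, assuming the prefix is optional spaces, a number, and then one whitespace character (or nothing on blank lines), as in the sample above. `strip_line_numbers` is a hypothetical helper, not a witan feature.

```shell
# Hypothetical helper (not part of witan): remove the leading line-number
# column from content-mode output read on stdin.
# Assumes each line starts with optional spaces, digits, and a single
# whitespace separator; blank lines carry only the number.
strip_line_numbers() {
  sed -E 's/^[[:space:]]*[0-9]+([[:space:]]|$)//'
}

# Usage (assumes witan is installed and report.pdf exists).
# Metadata goes to stderr, so it can be captured separately:
# witan read report.pdf --pages 1-5 2> meta.txt | strip_line_numbers > content.txt
```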

Outline mode (`--outline`): indented structure to stdout.

```
Introduction  [pages 1-2]
  Background  [pages 1-1]
  Methodology  [pages 2-2]
Results  [pages 3-8]
  Financial Summary  [pages 3-5]
  Projections  [pages 6-8]
Appendix  [pages 9-15]
[15 pages]
```
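
An outline in this shape can be turned into a `--pages` argument mechanically. The sketch below looks up one section by its exact title. `section_pages` is a hypothetical helper, and it assumes every entry follows the `Title  [pages N-M]` layout shown above, which may not hold for every document.

```shell
# Hypothetical helper (not part of witan): read outline text on stdin and
# print the page range of the first entry whose title equals $1.
# Assumes entries look like "Title  [pages N-M]" with optional indentation.
section_pages() {
  awk -v title="$1" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)                  # drop indentation
      if (index(line, title) == 1) {
        rest = substr(line, length(title) + 1)  # text after the title
        if (rest ~ /^[ \t]+\[pages [0-9]+-[0-9]+\]$/) {
          gsub(/[^0-9-]/, "", rest)             # keep only "N-M"
          print rest
          exit
        }
      }
    }'
}

# Usage (assumes witan is installed and report.pdf exists):
# pages=$(witan read report.pdf --outline | section_pages "Financial Summary")
# witan read report.pdf --pages "$pages"
```

Matching on the full title followed by the bracketed range keeps `Results` from matching a longer title that merely starts with the same word.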

Error Guide

| Error | Fix |
| --- | --- |
| `cannot access file` | Check that the file path exists and is readable |
| `downloading URL: HTTP 4xx/5xx` | Check that the URL is accessible |
| `payload_too_large` | File exceeds the 25 MB limit |
| `missing_content_type` | Set the Content-Type header (API only) |
| Empty outline | Document has no bookmarks or headings; use `--offset`/`--limit` to navigate |
| Truncated text | Use `--pages`, `--slides`, or increase `--limit` |