read-source

When to Use


Use `witan read` to convert source documents into LLM-ready text. It is meant for source material: PDFs, Word docs, presentations, HTML pages, and Markdown files that contain data you need to extract.

PDF → plain text
Word (.doc, .docx) → markdown
PowerPoint (.ppt, .pptx) → markdown
HTML → markdown
Markdown (.md) → outline support via `--outline`

This is not for reading spreadsheet data (.xlsx, .xls) — use spreadsheet-specific tools for that.
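
As a rough illustration of the mapping above, a caller could look up the expected output format before invoking the command. This is a minimal Python sketch: the function name `expected_output` and the `.html`/`.htm` extensions are assumptions made for illustration and are not part of the CLI.

```python
from pathlib import PurePosixPath

# Output per extension, copied from the list above.
# Assumption: ".html" and ".htm" stand in for "HTML"; the list names no extension.
OUTPUT_BY_EXTENSION = {
    ".pdf": "plain text",
    ".doc": "markdown",
    ".docx": "markdown",
    ".ppt": "markdown",
    ".pptx": "markdown",
    ".html": "markdown",
    ".htm": "markdown",
    ".md": "outline support via --outline",
}

# Spreadsheets are explicitly out of scope for `witan read`.
SPREADSHEET_EXTENSIONS = {".xlsx", ".xls"}


def expected_output(path: str) -> str:
    """Return the documented output format for a file path or URL path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in SPREADSHEET_EXTENSIONS:
        raise ValueError("spreadsheets need a spreadsheet-specific tool")
    if suffix not in OUTPUT_BY_EXTENSION:
        raise ValueError(f"unrecognized extension: {suffix!r}")
    return OUTPUT_BY_EXTENSION[suffix]
```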


Setup


Files are cached server-side by content hash so repeated operations skip re-upload. If `WITAN_STATELESS=1` is set (or `--stateless` is passed), files are processed but not stored.
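
The effect of content-hash caching can be sketched in a few lines of Python. This is an illustration only: the hash algorithm (SHA-256 here) and the `UploadCache` class are assumptions, since the text does not describe how the server computes or stores hashes.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Derive a cache key from file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class UploadCache:
    """Hypothetical in-memory stand-in for a server-side content cache."""

    def __init__(self) -> None:
        self._known_keys = set()
        self.uploads = 0  # counts how many times bytes were actually "uploaded"

    def ensure_uploaded(self, data: bytes) -> str:
        key = content_key(data)
        if key not in self._known_keys:
            # Only unseen content triggers an upload.
            self.uploads += 1
            self._known_keys.add(key)
        return key
```

Because the key depends only on the bytes, renaming a file would not cause a second upload in this sketch, while changing a single byte would.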

The CLI automatically applies per-attempt request timeouts and retries transient API failures (including timeout and network errors). Non-retryable `4xx` responses fail immediately.
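
The retry behavior can be pictured with a small Python sketch. It is not the CLI's implementation: the exception classes, the attempt count, and the exponential backoff are all assumptions, and the caller decides which failures count as transient because the text does not list the status codes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, network errors)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that must not be retried."""


def call_with_retries(
    attempt: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `attempt` until it succeeds, retrying only on TransientError."""
    for attempt_number in range(1, max_attempts + 1):
        try:
            return attempt()
        except TransientError:
            if attempt_number == max_attempts:
                raise  # out of attempts: surface the last transient failure
            # Assumed backoff schedule: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt_number - 1))
    raise ValueError("max_attempts must be at least 1")
```

A `PermanentError` is never caught, so it propagates on the first attempt, which mirrors the "fail immediately" rule for non-retryable responses.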


Quick Reference

```bash
# Get document structure first
witan read report.pdf --outline
witan read slides.pptx --outline

# Read specific sections
witan read report.pdf --pages 1-5
witan read slides.pptx --slides 1-3
witan read notes.docx --offset 50 --limit 100

# Read from URLs
witan read https://example.com/report.pdf --outline
witan read https://example.com/data.csv

# JSON output for automation
witan read report.pdf --json
witan read report.pdf --outline --json
```

Exit Codes


Code	Meaning
`0`	Success
`1`	Error (bad arguments, network failure, unsupported format)
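
In automation, the two exit codes are usually enough to decide whether to continue. The sketch below wraps any command with Python's `subprocess` module and raises on a non-zero exit; the names `run_cli` and `CliError` are hypothetical, and the example invocation in the docstring assumes `witan` is on `PATH`.

```python
import subprocess
from typing import Sequence


class CliError(RuntimeError):
    """Raised when the wrapped command exits with a non-zero code."""


def run_cli(argv: Sequence[str]) -> str:
    """Run a command and return its stdout, or raise CliError on failure.

    Example (assumes the CLI is installed):
        run_cli(["witan", "read", "report.pdf", "--outline"])
    """
    result = subprocess.run(list(argv), capture_output=True, text=True)
    if result.returncode != 0:
        # Include stderr in the message so the caller can see the reported error.
        raise CliError(f"exit code {result.returncode}: {result.stderr.strip()}")
    return result.stdout
```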


Navigation Strategy


Go directly with `--pages`, `--slides`, or `--offset`/`--limit` when you know where to look. Use `--outline` when you don't; it gives you the document structure so you can target the right section.

PDF workflow:

`witan read report.pdf --outline` → see chapter/section structure with page ranges
`witan read report.pdf --pages 12-15` → read the section you need

PPTX workflow:

`witan read deck.pptx --outline` → see slide titles
`witan read deck.pptx --slides 5-8` → read specific slides

Text/DOCX workflow:

`witan read notes.docx --outline` → see heading structure with line offsets
`witan read notes.docx --offset 120 --limit 50` → read a section


Command Reference


witan read <file-or-url> [flags]

Flag	Default	Description
`--pages`	—	PDF page range (e.g. `1-5` , `1,3,5` , `1-5,10-15` )
`--slides`	—	Presentation slide range (e.g. `1-3` )
`--offset`	`1`	Start line (1-indexed)
`--limit`	`2000`	Maximum lines to return
`--outline`	`false`	Show document structure instead of content
`--json`	`false`	Output full JSON response
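
When the command is driven from a script, it can help to assemble the argument list in one place. The following Python sketch uses only the flags from the table above; the function name `build_read_command` is hypothetical, and rejecting `--pages` together with `--slides` is an assumption about sensible usage, not documented CLI behavior.

```python
from typing import List, Optional


def build_read_command(
    source: str,
    *,
    pages: Optional[str] = None,
    slides: Optional[str] = None,
    offset: Optional[int] = None,
    limit: Optional[int] = None,
    outline: bool = False,
    json_output: bool = False,
) -> List[str]:
    """Build an argv list for `witan read` from the documented flags."""
    if pages is not None and slides is not None:
        # Assumption: a source is either a PDF or a presentation, never both.
        raise ValueError("pass either pages or slides, not both")
    argv = ["witan", "read", source]
    if pages is not None:
        argv += ["--pages", pages]
    if slides is not None:
        argv += ["--slides", slides]
    if offset is not None:
        argv += ["--offset", str(offset)]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if outline:
        argv.append("--outline")
    if json_output:
        argv.append("--json")
    return argv
```

Returning a list rather than a single string avoids shell quoting problems when the source is a URL or a path containing spaces.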


Pagination Limits


Constraint	Value
Max PDF pages per read	10
Max PPTX slides per read	10
Default line limit	2000
Max file size	25 MB
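
Because a single read covers at most 10 pages or slides, a longer section has to be fetched in several calls. A minimal Python sketch of the chunking arithmetic follows; the function name `chunk_pages` is hypothetical, and each returned string is meant to be passed as a `--pages` or `--slides` value.

```python
from typing import List


def chunk_pages(first: int, last: int, max_per_read: int = 10) -> List[str]:
    """Split an inclusive page range into ranges of at most `max_per_read` pages."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    if max_per_read < 1:
        raise ValueError("max_per_read must be at least 1")
    chunks = []
    start = first
    while start <= last:
        end = min(start + max_per_read - 1, last)
        chunks.append(f"{start}-{end}")
        start = end + 1
    return chunks
```

For example, `chunk_pages(1, 25)` yields `["1-10", "11-20", "21-25"]`, which corresponds to three separate reads.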


Pipeline: Source → Spreadsheet


The typical flow for reading source material and populating a spreadsheet:

Explore: run `witan read source.pdf --outline` to understand the structure
Read: run `witan read source.pdf --pages 3-8` to get the data
Parse: extract values from the text (LLM or regex)
Write: run `witan xlsx exec model.xlsx --input-json '...'` to populate the spreadsheet
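
The Parse step can be as simple as a regular expression over the line-numbered text. The Python sketch below assumes lines shaped like `Q1: $1,250,000` preceded by a line number and a tab; the function name `extract_amounts` is hypothetical, and real documents will likely need a different pattern. The shape of the `--input-json` payload is not shown in the text, so the sketch stops at a plain dictionary.

```python
import re
from decimal import Decimal
from typing import Dict

# Matches "<line number><TAB><label>: $<amount>", e.g. "     3\tQ1: $1,250,000".
AMOUNT_LINE = re.compile(
    r"^\s*\d+\t(?P<label>[^:]+):\s*\$(?P<amount>[\d,]+(?:\.\d+)?)\s*$"
)


def extract_amounts(text: str) -> Dict[str, Decimal]:
    """Collect labeled dollar amounts from line-numbered text."""
    amounts = {}
    for line in text.splitlines():
        match = AMOUNT_LINE.match(line)
        if match:
            # Strip thousands separators before converting to Decimal.
            amounts[match["label"].strip()] = Decimal(match["amount"].replace(",", ""))
    return amounts
```

`Decimal` is used instead of `float` so that currency values keep their exact digits on the way into the spreadsheet.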


Output Format


Content mode (default): line-numbered text to stdout, metadata to stderr.

     1	Revenue Summary
     2
     3	Q1: $1,250,000
     4	Q2: $1,380,000
text/plain  [15 pages, 10 read, 847 lines total, showing 1–847]
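
The metadata line is useful for deciding whether another read is needed. The Python sketch below parses it with a regular expression that is inferred from the single example above, so the exact format is an assumption and other document types may print different fields; `parse_summary` is a hypothetical helper name.

```python
import re
from typing import Dict, Union

# Pattern inferred from one sample line; accepts an en dash or a hyphen in the range.
SUMMARY = re.compile(
    r"^(?P<content_type>\S+)\s+\["
    r"(?P<pages>\d+) pages, (?P<pages_read>\d+) read, "
    r"(?P<total_lines>\d+) lines total, "
    r"showing (?P<first_line>\d+)[–-](?P<last_line>\d+)\]$"
)


def parse_summary(line: str) -> Dict[str, Union[str, int]]:
    """Parse the stderr metadata line into a dictionary of fields."""
    match = SUMMARY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    return {
        name: (value if name == "content_type" else int(value))
        for name, value in match.groupdict().items()
    }
```

In the sample, `pages_read` (10) is smaller than `pages` (15), which presumably means the remaining pages need a further read with `--pages`.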

Outline mode (`--outline`): indented structure to stdout.

Introduction  [pages 1-2]
  Background  [pages 1-1]
  Methodology  [pages 2-2]
Results  [pages 3-8]
  Financial Summary  [pages 3-5]
  Projections  [pages 6-8]
Appendix  [pages 9-15]
[15 pages]
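
An outline in this shape is easy to turn into page ranges for a follow-up read. The Python sketch below assumes two spaces of indentation per level and the `[pages A-B]` suffix seen in the sample; `OutlineEntry`, `parse_outline`, and `pages_for` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class OutlineEntry(NamedTuple):
    title: str
    depth: int       # 0 for top-level entries, 1 for their children, ...
    first_page: int
    last_page: int


ENTRY = re.compile(
    r"^(?P<indent> *)(?P<title>.+?)\s+\[pages (?P<first>\d+)-(?P<last>\d+)\]\s*$"
)


def parse_outline(text: str) -> List[OutlineEntry]:
    """Parse PDF outline text into entries; non-matching lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # blank lines and the trailing page-count line
        depth = len(match["indent"]) // 2  # assumes two spaces per level
        entries.append(
            OutlineEntry(match["title"], depth, int(match["first"]), int(match["last"]))
        )
    return entries


def pages_for(entries: List[OutlineEntry], title: str) -> Optional[str]:
    """Return a value suitable for --pages for the first entry with this title."""
    for entry in entries:
        if entry.title == title:
            return f"{entry.first_page}-{entry.last_page}"
    return None
```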


Error Guide


Error	Fix
`cannot access file`	Check file path exists and is readable
`downloading URL: HTTP 4xx/5xx`	Check the URL is accessible
`payload_too_large`	File exceeds 25 MB limit
`missing_content_type`	Set Content-Type header (API only)
Empty outline	Document has no bookmarks/headings; use offset/limit to navigate
Truncated text	Use `--pages` , `--slides` , or increase `--limit`
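
Two of these errors can be caught locally before any upload. The Python sketch below checks readability and size for a local file; `preflight` is a hypothetical helper, and treating "25 MB" as 25 × 1024 × 1024 bytes is an assumption because the text does not say which unit definition the server uses.

```python
import os
from typing import List

# Assumption: "25 MB" is interpreted as 25 MiB.
MAX_FILE_BYTES = 25 * 1024 * 1024


def preflight(path: str, max_bytes: int = MAX_FILE_BYTES) -> List[str]:
    """Return a list of likely problems for a local file; empty means none found."""
    if not os.path.isfile(path) or not os.access(path, os.R_OK):
        return ["cannot access file: check that the path exists and is readable"]
    problems = []
    if os.path.getsize(path) > max_bytes:
        problems.append("payload_too_large: file exceeds the size limit")
    return problems
```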
