LiteParse Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with:

I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:

- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.

Then wait for the user's input.

Step 0 — Install LiteParse (if needed)

If `liteparse` is not yet installed, install it globally:

```bash
npm i -g @llamaindex/liteparse
```

Verify installation:

```bash
lit --version
```

For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:

```bash
# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice
```

For image parsing, ImageMagick is required:

```bash
# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick
```

---

Step 1 — Produce the CLI Command or Script

Parse a Single File

```bash
# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
```
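When the command is generated from a script rather than typed by hand, one option is to assemble the argument list from a small options object. The sketch below uses only the `lit parse` flags shown in this section; the `ParseOptions` interface and `buildParseArgs` function are hypothetical names introduced for illustration and are not part of LiteParse.

```typescript
// Hypothetical option bag; the field names are illustrative, not a LiteParse API.
interface ParseOptions {
  format?: "json" | "text";
  output?: string;        // file path passed to -o
  targetPages?: string;   // e.g. "1-5,10,15-20"
  ocr?: boolean;          // false adds --no-ocr
  ocrServerUrl?: string;
  dpi?: number;
}

// Build the argument list for `lit parse` from the flags documented above.
function buildParseArgs(file: string, opts: ParseOptions = {}): string[] {
  const args: string[] = ["parse", file];
  if (opts.format) args.push("--format", opts.format);
  if (opts.output) args.push("-o", opts.output);
  if (opts.targetPages) args.push("--target-pages", opts.targetPages);
  if (opts.ocr === false) args.push("--no-ocr");
  if (opts.ocrServerUrl) args.push("--ocr-server-url", opts.ocrServerUrl);
  if (opts.dpi !== undefined) args.push("--dpi", String(opts.dpi));
  return args;
}

const parseArgs = buildParseArgs("document.pdf", { format: "json", output: "output.json" });
// parseArgs is ["parse", "document.pdf", "--format", "json", "-o", "output.json"]
```

Keeping the arguments as an array means they can be passed to Node's `execFile("lit", parseArgs)` from `node:child_process` without going through a shell, so file names containing spaces need no extra quoting.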

Batch Parse a Directory

```bash
lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
```
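A batch invocation can be assembled the same way. This is a minimal sketch that covers only the `--extension` and `--recursive` flags shown above; `buildBatchArgs` is a hypothetical helper, and normalizing `pdf` to `.pdf` is a convenience assumed here, not documented LiteParse behavior.

```typescript
// Hypothetical helper: assembles `lit batch-parse` arguments from the flags shown above.
function buildBatchArgs(
  inputDir: string,
  outputDir: string,
  extension?: string,
  recursive: boolean = false
): string[] {
  const args: string[] = ["batch-parse", inputDir, outputDir];
  if (extension !== undefined) {
    // Accept "pdf" or ".pdf" and always emit the dotted form used in the example.
    const ext = extension.charAt(0) === "." ? extension : "." + extension;
    args.push("--extension", ext);
  }
  if (recursive) args.push("--recursive");
  return args;
}

const batchArgs = buildBatchArgs("./input", "./output", "pdf", true);
// batchArgs is ["batch-parse", "./input", "./output", "--extension", ".pdf", "--recursive"]
```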

Generate Page Screenshots

Screenshots are useful for LLM agents that need to see the visual layout.

```bash
# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
```
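If the pages to capture come from a script (for example, pages where a search term was found), they need to be turned into a `--pages` value. The sketch below collapses a list of page numbers into comma-separated pages and ranges. `toPageSpec` and `formatRun` are hypothetical helpers, and the sketch assumes `--pages` accepts lists and ranges mixed in one value, which the examples above show only separately.

```typescript
// Format one run of consecutive pages: a single page or "first-last".
function formatRun(first: number, last: number): string {
  return first === last ? String(first) : `${first}-${last}`;
}

// Collapse page numbers into a compact spec such as "1-3,5".
function toPageSpec(pages: number[]): string {
  // Work on a de-duplicated, sorted copy so the caller's array is untouched.
  const sorted: number[] = [];
  for (const p of pages) {
    if (sorted.indexOf(p) === -1) sorted.push(p);
  }
  sorted.sort((a, b) => a - b);

  const parts: string[] = [];
  let start: number | null = null;
  let prev = 0;
  for (const p of sorted) {
    if (start === null) {
      start = p;
      prev = p;
    } else if (p === prev + 1) {
      prev = p; // still inside a consecutive run
    } else {
      parts.push(formatRun(start, prev));
      start = p;
      prev = p;
    }
  }
  if (start !== null) parts.push(formatRun(start, prev));
  return parts.join(",");
}

const screenshotArgs = ["screenshot", "document.pdf", "--pages", toPageSpec([5, 1, 2, 3]), "-o", "./screenshots"];
// The --pages value here is "1-3,5"
```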

---

Step 3 — Key Options Reference

OCR Options

| Option | Description |
| --- | --- |
| (default) | Tesseract.js: built in, zero setup |
| `--ocr-language fra` | Set OCR language (ISO code) |
| `--ocr-server-url <url>` | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| `--no-ocr` | Disable OCR entirely |

Output Options

| Option | Description |
| --- | --- |
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `-o <file>` | Save output to file |

Performance / Quality Options

| Option | Description |
| --- | --- |
| `--dpi <n>` | Rendering DPI (default: 150; use 300 for high quality) |
| `--max-pages <n>` | Limit pages parsed |
| `--target-pages <pages>` | Parse specific pages (e.g. `"1-5,10"`) |
| `--no-precise-bbox` | Disable precise bounding boxes (faster) |
| `--skip-diagonal-text` | Ignore rotated/diagonal text |
| `--preserve-small-text` | Keep very small text that would otherwise be dropped |
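Before passing a user-supplied value to `--target-pages`, it can help to check that it is well formed. The following sketch expands a spec such as `"1-5,10"` into page numbers. `expandPageSpec` is a hypothetical helper, and the sketch assumes pages are 1-based and ranges are inclusive; neither is confirmed by this reference.

```typescript
// Expand a page spec like "1-5,10" into [1, 2, 3, 4, 5, 10].
// Assumes 1-based page numbers and inclusive ranges.
function expandPageSpec(spec: string): number[] {
  const pages: number[] = [];
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    // Each part is either a single page ("10") or a range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (match === null) {
      throw new Error(`Invalid page spec part: "${part}"`);
    }
    const first = parseInt(match[1], 10);
    const last = match[2] !== undefined ? parseInt(match[2], 10) : first;
    if (first < 1 || last < first) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let p = first; p <= last; p++) pages.push(p);
  }
  return pages;
}

const expandedPages = expandPageSpec("1-5,10");
// expandedPages is [1, 2, 3, 4, 5, 10]
```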

Step 4 — Using a Config File

For repeated use with consistent options, generate a `liteparse.config.json` file:

```json
{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}
```

For an HTTP OCR server:

```json
{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}
```

Use with:

```bash
lit parse document.pdf --config liteparse.config.json
```
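When the config file is generated by a script, a typed object makes misspelled keys a compile-time error. The sketch below mirrors the keys from the example config above; the `LiteParseConfig` interface, `exampleConfig`, and `renderConfig` are illustrative names rather than types exported by LiteParse, and the positivity checks are assumptions about what counts as a sensible value.

```typescript
// Keys mirror the example config; the interface itself is an illustrative assumption.
interface LiteParseConfig {
  ocrLanguage: string;
  ocrEnabled: boolean;
  maxPages: number;
  dpi: number;
  outputFormat: "json" | "text";
  preciseBoundingBox: boolean;
  skipDiagonalText: boolean;
  preserveVerySmallText: boolean;
  ocrServerUrl?: string;
}

// Values copied from the example config above.
const exampleConfig: LiteParseConfig = {
  ocrLanguage: "en",
  ocrEnabled: true,
  maxPages: 1000,
  dpi: 150,
  outputFormat: "json",
  preciseBoundingBox: true,
  skipDiagonalText: false,
  preserveVerySmallText: false,
};

// Merge overrides into the example values and serialize to pretty-printed JSON.
function renderConfig(overrides: Partial<LiteParseConfig> = {}): string {
  const merged: LiteParseConfig = { ...exampleConfig, ...overrides };
  if (merged.dpi <= 0) throw new Error("dpi must be positive");
  if (merged.maxPages <= 0) throw new Error("maxPages must be positive");
  return JSON.stringify(merged, null, 2);
}

const highDpiConfig = renderConfig({ dpi: 300 });
```

The resulting string can be written to `liteparse.config.json` (for example with `writeFileSync` from `node:fs`) and then passed to `lit parse` with `--config`.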

Step 5 — HTTP OCR Server API (Advanced)

If the user wants to plug in a custom OCR backend, the server must implement:

- Endpoint: `POST /ocr`
- Accepts: `file` (multipart) and `language` (string) parameters
- Returns:

```json
{
  "results": [
    { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 }
  ]
}
```

Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
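When writing or testing a custom backend, it can be useful to check that its responses match the shape described above. The following is a minimal sketch of a runtime check: the `OcrResult` and `OcrResponse` types and both guard functions are hypothetical names, and the sketch assumes `bbox` always holds exactly four numbers, as in the example response.

```typescript
// Shape taken from the example response; type names are illustrative.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // x1, y1, x2, y2
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

// Check a single entry of the "results" array.
function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { text?: unknown; bbox?: unknown; confidence?: unknown };
  if (typeof v.text !== "string" || typeof v.confidence !== "number") return false;
  if (!Array.isArray(v.bbox) || v.bbox.length !== 4) return false;
  for (const n of v.bbox) {
    if (typeof n !== "number") return false;
  }
  return true;
}

// Check the whole response body after JSON parsing.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as { results?: unknown }).results;
  if (!Array.isArray(results)) return false;
  for (const r of results) {
    if (!isOcrResult(r)) return false;
  }
  return true;
}

const sampleBody: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[0,0,10,5],"confidence":0.98}]}'
);
const sampleIsValid = isOcrResponse(sampleBody);
```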

Supported Input Formats

| Category | Formats |
| --- | --- |
| PDF | `.pdf` |
| Word | `.doc`, `.docx`, `.docm`, `.odt`, `.rtf` |
| PowerPoint | `.ppt`, `.pptx`, `.pptm`, `.odp` |
| Spreadsheets | `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv` |
| Images | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg` |

Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
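A script can use the table above to tell the user which external tool a file will need before running `lit`. The sketch below is one way to do that. `requiredConverter` is a hypothetical helper, and it assumes that every Word, PowerPoint, and Spreadsheets extension (including `.csv` and `.tsv`) counts as an Office document and that extensions are matched case-insensitively.

```typescript
type Converter = "none" | "LibreOffice" | "ImageMagick";

// Extensions copied from the supported-formats table.
const officeExts: string[] = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];
const imageExts: string[] = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Return the external tool a file needs, based only on its extension.
function requiredConverter(fileName: string): Converter {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) throw new Error(`No file extension: ${fileName}`);
  const ext = fileName.slice(dot).toLowerCase();
  if (ext === ".pdf") return "none";
  if (officeExts.indexOf(ext) !== -1) return "LibreOffice";
  if (imageExts.indexOf(ext) !== -1) return "ImageMagick";
  throw new Error(`Unsupported extension: ${ext}`);
}

const reportConverter = requiredConverter("report.DOCX");
// reportConverter is "LibreOffice"
```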
