LiteParse Skill

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

Initial Setup

When this skill is invoked, respond with:
I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:

- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal

If both are set, please provide:

1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.

I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.
Then wait for the user's input.

Step 0 — Install LiteParse (if needed)

If `liteparse` is not yet installed, install it globally:

```bash
npm i -g @llamaindex/liteparse
```

Verify installation:

```bash
lit --version
```

For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:

```bash
# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice
```

For image parsing, ImageMagick is required:

```bash
# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick
```

---

Step 1 — Produce the CLI Command or Script

Parse a Single File

```bash
# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
```
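
The `--target-pages` example takes a comma-separated mix of single pages and ranges. As a rough sketch of how such a value can be read, the hypothetical helper `expandPages` below (not part of LiteParse) expands it into page numbers. It assumes 1-based, inclusive ranges, which the text here does not state, so treat it as an illustration rather than LiteParse's actual parsing.

```typescript
// Hypothetical helper: expands a page spec such as "1-5,10,15-20" into page numbers.
// Assumption: pages are 1-based and ranges are inclusive; LiteParse may differ.
function expandPages(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(",")) {
    const trimmed = part.trim();
    // Accept either a single page ("10") or a range ("15-20").
    const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
    if (!match) {
      throw new Error(`Invalid page spec segment: "${trimmed}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${trimmed}"`);
    }
    for (let page = start; page <= end; page++) {
      // Skip pages already listed by an earlier segment.
      if (pages.indexOf(page) === -1) {
        pages.push(page);
      }
    }
  }
  return pages.sort((a, b) => a - b);
}

console.log(expandPages("1-5,10,15-20").length); // 12
```

Checking a page spec this way before running `lit parse` can catch typos such as a reversed range early.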

Batch Parse a Directory

```bash
lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
```

Generate Page Screenshots

Screenshots are useful for LLM agents that need to see visual layout.
```bash
# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
```

---

Step 3 — Key Options Reference

OCR Options

| Option | Description |
|--------|-------------|
| (default) | Tesseract.js: zero setup, built-in |
| `--ocr-language fra` | Set OCR language (ISO code) |
| `--ocr-server-url <url>` | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| `--no-ocr` | Disable OCR entirely |

Output Options

| Option | Description |
|--------|-------------|
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `-o <file>` | Save output to file |

Performance / Quality Options

| Option | Description |
|--------|-------------|
| `--dpi <n>` | Rendering DPI (default: 150; use 300 for high quality) |
| `--max-pages <n>` | Limit pages parsed |
| `--target-pages <pages>` | Parse specific pages (e.g. `"1-5,10"`) |
| `--no-precise-bbox` | Disable precise bounding boxes (faster) |
| `--skip-diagonal-text` | Ignore rotated/diagonal text |
| `--preserve-small-text` | Keep very small text that would otherwise be dropped |

Step 4 — Using a Config File

For repeated use with consistent options, generate a `liteparse.config.json`:

```json
{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}
```

For an HTTP OCR server:

```json
{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}
```

Use with:

```bash
lit parse document.pdf --config liteparse.config.json
```
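
The config keys above look like camel-cased counterparts of the CLI flags in the options reference. The sketch below makes that correspondence explicit with a hypothetical helper, `configToArgs`, that turns a config object into `lit parse` arguments. The key-to-flag pairing is an assumption inferred from the matching names, not something this document confirms, and the `LiteParseConfig` interface is an illustrative type, not LiteParse's own.

```typescript
// Illustrative type: keys copied from the liteparse.config.json examples.
interface LiteParseConfig {
  ocrLanguage?: string;
  ocrEnabled?: boolean;
  ocrServerUrl?: string;
  maxPages?: number;
  dpi?: number;
  outputFormat?: "json" | "text";
  preciseBoundingBox?: boolean;
  skipDiagonalText?: boolean;
  preserveVerySmallText?: boolean;
}

// Hypothetical helper: builds the argument list for `lit parse`.
// Assumption: each config key maps to the similarly named CLI flag.
function configToArgs(file: string, config: LiteParseConfig): string[] {
  const args: string[] = ["parse", file];
  if (config.outputFormat !== undefined) args.push("--format", config.outputFormat);
  if (config.ocrEnabled === false) {
    // OCR is off, so language and server settings are irrelevant.
    args.push("--no-ocr");
  } else {
    if (config.ocrLanguage !== undefined) args.push("--ocr-language", config.ocrLanguage);
    if (config.ocrServerUrl !== undefined) args.push("--ocr-server-url", config.ocrServerUrl);
  }
  if (config.maxPages !== undefined) args.push("--max-pages", String(config.maxPages));
  if (config.dpi !== undefined) args.push("--dpi", String(config.dpi));
  if (config.preciseBoundingBox === false) args.push("--no-precise-bbox");
  if (config.skipDiagonalText === true) args.push("--skip-diagonal-text");
  if (config.preserveVerySmallText === true) args.push("--preserve-small-text");
  return args;
}

const exampleArgs = configToArgs("document.pdf", {
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
  outputFormat: "json",
});
console.log(["lit"].concat(exampleArgs).join(" "));
// lit parse document.pdf --format json --ocr-language en --ocr-server-url http://localhost:8828/ocr
```

Returning an array rather than a single string keeps each argument separate, which avoids shell-quoting problems if the list is later handed to a process-spawning API.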

Step 5 — HTTP OCR Server API (Advanced)

If the user wants to plug in a custom OCR backend, the server must implement:

- Endpoint: `POST /ocr`
- Accepts: `file` (multipart) and `language` (string) parameters
- Returns:

```json
{
  "results": [
    { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 }
  ]
}
```

Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
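
When writing a custom backend, it can help to check responses against the shape above before pointing LiteParse at the server. The following is a minimal sketch under stated assumptions: the type names `OcrResult` and `OcrResponse` and the validator `isOcrResponse` are hypothetical, and the only contract assumed is the one in the JSON example (a `results` array whose items carry `text`, a four-number `bbox`, and a numeric `confidence`).

```typescript
// Hypothetical types mirroring the JSON example; not exported by LiteParse.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isFiniteNumber(value: unknown): value is number {
  return typeof value === "number" && isFinite(value);
}

// Returns true only if `value` has the documented response shape.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as { results?: unknown }).results;
  if (!Array.isArray(results)) return false;
  return results.every((item: unknown) => {
    if (typeof item !== "object" || item === null) return false;
    const r = item as { text?: unknown; bbox?: unknown; confidence?: unknown };
    return (
      typeof r.text === "string" &&
      Array.isArray(r.bbox) &&
      r.bbox.length === 4 &&
      r.bbox.every(isFiniteNumber) &&
      isFiniteNumber(r.confidence)
    );
  });
}

const sample: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
console.log(isOcrResponse(sample)); // true
```

The sample uses concrete numbers for the bounding box because the `x1, y1, x2, y2` placeholders in the example are not valid JSON on their own.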

Supported Input Formats

| Category | Formats |
|----------|---------|
| PDF | `.pdf` |
| Word | `.doc`, `.docx`, `.docm`, `.odt`, `.rtf` |
| PowerPoint | `.ppt`, `.pptx`, `.pptm`, `.odp` |
| Spreadsheets | `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv` |
| Images | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg` |

Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
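
Since the external tool needed depends on the file type, a quick pre-flight check can tell the user what to install before parsing. The sketch below uses a hypothetical helper, `requiredDependency`, built only from the extension list above. It assumes every entry in the Word, PowerPoint, and Spreadsheets rows (including `.csv` and `.tsv`) goes through LibreOffice, which is an inference from the note above, not a confirmed detail.

```typescript
type Dependency = "none" | "LibreOffice" | "ImageMagick" | "unsupported";

// Extensions copied from the supported-formats list.
const OFFICE_EXTENSIONS: string[] = [
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
];
const IMAGE_EXTENSIONS: string[] = [
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
];

// Hypothetical helper: reports which external tool a file is assumed to need.
function requiredDependency(filename: string): Dependency {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return "unsupported";
  const ext = filename.slice(dot).toLowerCase();
  if (ext === ".pdf") return "none";
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "LibreOffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "ImageMagick";
  return "unsupported";
}

console.log(requiredDependency("slides.PPTX")); // LibreOffice
```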