liteparse
LiteParse Skill
Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.
Initial Setup
When this skill is invoked, respond with:
I'm ready to use LiteParse to parse files locally. Before we begin, please confirm that:
- `@llamaindex/liteparse` is installed globally (`npm i -g @llamaindex/liteparse`)
- The `lit` CLI command is available in your terminal
If both are set, please provide:
1. One or more files to parse (PDF, DOCX, PPTX, XLSX, images, etc.)
2. Any specific options: output format (json/text), page ranges, OCR preferences, DPI, etc.
3. What you'd like to do with the parsed content.
I will produce the appropriate `lit` CLI command or TypeScript script, and once approved, report the results.

Then wait for the user's input.
Step 0: Install LiteParse (if needed)
If `liteparse` is not yet installed, install it globally:
```bash
npm i -g @llamaindex/liteparse
```
Verify installation:
```bash
lit --version
```
For Office document support (DOCX, PPTX, XLSX), LibreOffice is required:
```bash
# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice
```
For image parsing, ImageMagick is required:
```bash
# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick
```
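If the verification step needs to run from a script rather than by hand, a preflight check along these lines may help. This is a minimal sketch for Node.js; `isCommandAvailable` is a hypothetical helper, not part of LiteParse, and it only assumes that `lit --version` exits with status 0 when the CLI is installed.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: returns true when `<command> --version` can be
// started and exits with status 0.
function isCommandAvailable(command: string): boolean {
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  return result.error === undefined && result.status === 0;
}

if (!isCommandAvailable("lit")) {
  console.error("lit not found; install it with: npm i -g @llamaindex/liteparse");
}
```

On Windows, globally installed npm commands are usually `.cmd` shims, so the check may need the `shell: true` option there.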
---
Step 1: Produce the CLI Command or Script
Parse a Single File
```bash
# Basic text extraction
lit parse document.pdf

# JSON output saved to a file
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Use an external HTTP OCR server for higher accuracy
lit parse document.pdf --ocr-server-url http://localhost:8828/ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
```
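When the request calls for a script rather than a one-off command, the same flags can be assembled programmatically. The sketch below is illustrative only: `ParseOptions` and `buildParseArgs` are hypothetical names, and the helper covers just the flags shown in the commands above. It does not use any LiteParse library API.

```typescript
// Hypothetical option bag; each field maps to one flag from the examples above.
interface ParseOptions {
  format?: "json" | "text"; // --format
  output?: string;          // -o
  targetPages?: string;     // --target-pages, e.g. "1-5,10,15-20"
  noOcr?: boolean;          // --no-ocr
  ocrServerUrl?: string;    // --ocr-server-url
  dpi?: number;             // --dpi
}

// Build the argument list for `lit parse` without running anything.
function buildParseArgs(file: string, opts: ParseOptions = {}): string[] {
  const args = ["parse", file];
  if (opts.format) args.push("--format", opts.format);
  if (opts.output) args.push("-o", opts.output);
  if (opts.targetPages) args.push("--target-pages", opts.targetPages);
  if (opts.noOcr) args.push("--no-ocr");
  if (opts.ocrServerUrl) args.push("--ocr-server-url", opts.ocrServerUrl);
  if (opts.dpi !== undefined) args.push("--dpi", String(opts.dpi));
  return args;
}

const args = buildParseArgs("document.pdf", { format: "json", output: "output.json" });
console.log(["lit", ...args].join(" "));
// prints: lit parse document.pdf --format json -o output.json
```

Returning an array instead of a single string keeps file names with spaces intact when the arguments are later passed to a process launcher.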
Batch Parse a Directory
```bash
lit batch-parse ./input-directory ./output-directory

# Only process PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
```
Generate Page Screenshots
Screenshots are useful for LLM agents that need to see visual layout.
```bash
# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

# Page range
lit screenshot document.pdf --pages "1-10" -o ./screenshots
```
---
Step 3: Key Options Reference
OCR Options
| Option | Description |
|---|---|
| (default) | Tesseract.js: zero setup, built-in |
| `ocrLanguage` (config key) | Set OCR language (ISO code) |
| `--ocr-server-url <url>` | Use external HTTP OCR server (EasyOCR, PaddleOCR, custom) |
| `--no-ocr` | Disable OCR entirely |
Output Options
| Option | Description |
|---|---|
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `-o <file>` | Save output to file |
Performance / Quality Options
| Option | Description |
|---|---|
| `--dpi <n>` | Rendering DPI (default: 150; use 300 for high quality) |
| `maxPages` (config key) | Limit pages parsed |
| `--target-pages <pages>` | Parse specific pages (e.g. `"1-5,10,15-20"`) |
| `preciseBoundingBox: false` (config key) | Disable precise bounding boxes (faster) |
| `skipDiagonalText: true` (config key) | Ignore rotated/diagonal text |
| `preserveVerySmallText: true` (config key) | Keep very small text that would otherwise be dropped |
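To make the page-selection syntax concrete, the sketch below expands a spec such as `"1-5,10,15-20"` into individual page numbers. It assumes 1-based, inclusive ranges separated by commas. `expandPageSpec` is a hypothetical helper for checking a spec before passing it to `lit`, not LiteParse's own parser.

```typescript
// Expand a page spec into sorted, unique page numbers.
// Assumption: pages are 1-based and ranges are inclusive.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    // Accept "N" or "N-M", with optional surrounding whitespace.
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) throw new Error(`Invalid page range: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) throw new Error(`Invalid page range: "${part}"`);
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-3,10,15-16"));
// → [1, 2, 3, 10, 15, 16]
```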
Step 4: Using a Config File
For repeated use with consistent options, generate a `liteparse.config.json`:
```json
{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "skipDiagonalText": false,
  "preserveVerySmallText": false
}
```
For an HTTP OCR server:
```json
{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}
```
Use with:
```bash
lit parse document.pdf --config liteparse.config.json
```
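When the config file is generated from a script, typing the keys catches misspellings before `lit` ever runs. The sketch below copies its key names from the two config examples above. Treating every key as optional, and the specific checks in `validateConfig`, are assumptions made for illustration rather than documented LiteParse behavior.

```typescript
// Keys taken from the config examples; optionality is an assumption.
interface LiteParseConfig {
  ocrLanguage?: string;
  ocrEnabled?: boolean;
  ocrServerUrl?: string;
  maxPages?: number;
  dpi?: number;
  outputFormat?: "json" | "text";
  preciseBoundingBox?: boolean;
  skipDiagonalText?: boolean;
  preserveVerySmallText?: boolean;
}

// Hypothetical sanity checks; returns a list of human-readable problems.
function validateConfig(config: LiteParseConfig): string[] {
  const problems: string[] = [];
  if (config.dpi !== undefined && (!Number.isInteger(config.dpi) || config.dpi <= 0)) {
    problems.push("dpi must be a positive integer");
  }
  if (config.maxPages !== undefined && (!Number.isInteger(config.maxPages) || config.maxPages <= 0)) {
    problems.push("maxPages must be a positive integer");
  }
  if (config.ocrEnabled === false && config.ocrServerUrl !== undefined) {
    problems.push("ocrServerUrl is set but ocrEnabled is false");
  }
  return problems;
}

const config: LiteParseConfig = {
  ocrServerUrl: "http://localhost:8828/ocr",
  ocrLanguage: "en",
  outputFormat: "json",
};
const problems = validateConfig(config);
if (problems.length === 0) {
  // Save this output as liteparse.config.json.
  console.log(JSON.stringify(config, null, 2));
} else {
  console.error(problems.join("\n"));
}
```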
Step 5: HTTP OCR Server API (Advanced)
If the user wants to plug in a custom OCR backend, the server must implement:
- Endpoint: `POST /ocr`
- Accepts: `file` (multipart) and `language` (string) parameters
- Returns:
```json
{
  "results": [
    { "text": "Hello", "bbox": [x1, y1, x2, y2], "confidence": 0.98 }
  ]
}
```
Ready-to-use wrappers exist for EasyOCR and PaddleOCR in the LiteParse repo.
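While developing a custom backend, it can help to check that responses match the shape described above before pointing `lit` at the server. The type guard below is a minimal sketch. The type names and the exact strictness (for example, requiring exactly four numeric `bbox` entries) are assumptions drawn from the example response, not a published schema.

```typescript
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

// Check one entry of the `results` array.
function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.text === "string" &&
    Array.isArray(r.bbox) &&
    r.bbox.length === 4 &&
    r.bbox.every((n) => typeof n === "number") &&
    typeof r.confidence === "number"
  );
}

// Check the whole response body after JSON parsing.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as Record<string, unknown>).results;
  return Array.isArray(results) && results.every((r) => isOcrResult(r));
}

// The bbox numbers here are made-up sample coordinates.
const sample: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
console.log(isOcrResponse(sample)); // prints: true
```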
Supported Input Formats
| Category | Formats |
|---|---|
| PDF | `.pdf` |
| Word | `.docx` |
| PowerPoint | `.pptx` |
| Spreadsheets | `.xlsx` |
| Images | |
Office documents require LibreOffice; images require ImageMagick. LiteParse auto-converts these formats to PDF before parsing.
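The dependency rule above can be turned into a quick pre-check that tells the user which external tool a file will need. This is a sketch under stated assumptions: `requiredConverter` is a hypothetical helper, the Office extensions are the ones named in this document, and the image extensions are illustrative guesses, not a list confirmed by LiteParse.

```typescript
type Converter = "none" | "LibreOffice" | "ImageMagick" | "unknown";

// Office formats named in this document.
const OFFICE_EXTENSIONS = [".docx", ".pptx", ".xlsx"];
// Assumed image extensions, for illustration only.
const IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg"];

// Map a file name to the external tool needed to convert it to PDF.
function requiredConverter(fileName: string): Converter {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  if (ext === ".pdf") return "none";
  if (OFFICE_EXTENSIONS.indexOf(ext) !== -1) return "LibreOffice";
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return "ImageMagick";
  return "unknown";
}

console.log(requiredConverter("report.DOCX")); // prints: LibreOffice
console.log(requiredConverter("scan.png"));    // prints: ImageMagick
```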