polaris-datainsight-doc-extract
Polaris AI DataInsight: Doc Extract Skill
Use the Polaris AI DataInsight Doc Extract API to extract text, images, tables, charts, shapes, equations, and more from Word, PowerPoint, Excel, HWP, and HWPX files, returning everything as structured `unifiedSchema` JSON. A single API call gives you the full document structure without any manual parsing.
When to Use This Skill
- The user wants to extract text, tables, charts, or images from DOCX, PPTX, XLSX, HWP, or HWPX files
- The user needs to understand a document's structure (page count, element types, position data, etc.)
- The extracted data will be used in a RAG pipeline, data analysis workflow, or automation task
- Table data needs to be converted to CSV, or chart data needs to be broken down into series and labels
- The user needs to parse special elements like headers, footers, equations, or shapes
What This Skill Does
- Authentication: Authenticates with the Polaris DataInsight API via the `x-po-di-apikey` header.
- Upload and extract: Sends the file as a multipart/form-data POST request and extracts the full document structure.
- Parse ZIP response: The API returns a ZIP file; extract it and load the `unifiedSchema` JSON inside.
- Deliver structured data: Returns JSON organized by page and element type (text, table, chart, image, shape, equation, etc.).
- Support multiple usage patterns: Handles full text extraction, table-to-CSV conversion, RAG chunk generation, and more.
How to Use
Prerequisites
Get an API Key: Sign up at https://datainsight.polarisoffice.com and generate your API key.
Authentication: Include the API key as a header on every request.
Header: `x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY`

Set the environment variable:

```bash
export POLARIS_DATAINSIGHT_API_KEY="your-api-key-here"
```

Limits
| Item | Limit |
|---|---|
| Supported formats | HWP, HWPX, DOCX, PPTX, XLSX |
| Max file size | 25 MB |
| Timeout | 10 minutes |
| Rate limit | 10 requests per minute |
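
The format and size limits above can be checked locally before uploading, so that a request is not spent on a file the API would probably reject. This is a minimal sketch, not part of the API: the name `check_upload_limits` is hypothetical, and it assumes that "25 MB" means 25 × 1024 × 1024 bytes and that the format can be judged from the file extension, neither of which the limits table states.

```python
from pathlib import Path

# Values taken from the limits table above.
# Assumption: "25 MB" is interpreted as 25 * 1024 * 1024 bytes.
SUPPORTED_EXTENSIONS = {".hwp", ".hwpx", ".docx", ".pptx", ".xlsx"}
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024


def check_upload_limits(file_path: str) -> None:
    """Raise ValueError if the file looks unsupported or too large (hypothetical helper)."""
    path = Path(file_path)
    extension = path.suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {extension or '(no extension)'}")
    size = path.stat().st_size
    if size > MAX_FILE_SIZE_BYTES:
        raise ValueError(f"File is {size} bytes; the limit is {MAX_FILE_SIZE_BYTES} bytes")
```

Calling `check_upload_limits("report.docx")` before the upload returns nothing on success and raises `ValueError` otherwise.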
Basic Usage
Endpoint: `POST https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract`

Extract a document with Python:

```python
import requests
import json
import zipfile
import io


def extract_document(file_path: str, api_key: str) -> dict:
    # Upload the file as multipart/form-data; the API key goes in the header.
    with open(file_path, "rb") as f:
        response = requests.post(
            "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract",
            headers={"x-po-di-apikey": api_key},
            files={"file": f}
        )
    if response.status_code != 200:
        raise Exception(f"API error: {response.status_code} - {response.text}")

    # Response is a ZIP file; read it from memory and load the first JSON entry.
    zip_buffer = io.BytesIO(response.content)
    with zipfile.ZipFile(zip_buffer) as z:
        json_files = [name for name in z.namelist() if name.endswith('.json')]
        if json_files:
            with z.open(json_files[0]) as jf:
                return json.load(jf)
    raise Exception("No JSON found in ZIP")
```
```python
# Example usage
import os

api_key = os.environ["POLARIS_DATAINSIGHT_API_KEY"]
schema = extract_document("report.docx", api_key)
print(f"Extracted {schema['totalPages']} pages")
```

**Extract with curl:**

```bash
curl -X POST "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract" \
  -H "x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY" \
  -F "file=@example.docx" \
  --output result.zip
unzip result.zip -d result/
cat result/*.json | python -m json.tool
```
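
If the response was saved to disk, as the curl command does with `result.zip`, the JSON can also be read in Python without unpacking the archive first. This is a minimal sketch: `load_schema_from_zip` is a hypothetical helper, and like `extract_document` it assumes the first `.json` entry in the archive is the one of interest.

```python
import json
import zipfile


def load_schema_from_zip(zip_path: str) -> dict:
    """Load the first .json entry from a saved response ZIP (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        json_names = [name for name in archive.namelist() if name.endswith(".json")]
        if not json_names:
            raise ValueError(f"No JSON found in {zip_path}")
        # Assumption: the first JSON entry is the extracted document structure.
        with archive.open(json_names[0]) as json_file:
            return json.load(json_file)
```

For example, `schema = load_schema_from_zip("result.zip")` yields the same kind of dictionary that `extract_document` returns.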
Advanced Usage
Response Structure (unifiedSchema)
Root:

```json
{
  "docName": "sample.docx",
  "totalPages": 3,
  "pages": [ ... ]
}
```

Page (`pages[]`):

```json
{
  "pageNum": 1,
  "pageWidth": 595.3,
  "pageHeight": 842.0,
  "extractionSummary": {
    "text": 5, "image": 2, "table": 1, "chart": 1
  },
  "elements": [ ... ]
}
```

Element types (`elements[].type`):

| type | Description |
|---|---|
| `text` | Text block |
| `image` | Image |
| `table` | Table |
| `chart` | Chart |
| | Shape |
| | Equation |
| | Header / Footer |
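
The per-page `extractionSummary` counts can be added up to get a document-level overview of what was extracted. This is a minimal sketch, assuming every page carries an `extractionSummary` mapping of element type to count, as in the page example; `summarize_elements` is a hypothetical name.

```python
from collections import Counter


def summarize_elements(schema: dict) -> Counter:
    """Sum the per-page extractionSummary counts across the whole document."""
    totals = Counter()
    for page in schema.get("pages", []):
        # Pages without a summary contribute nothing.
        totals.update(page.get("extractionSummary", {}))
    return totals
```

The result behaves like a dictionary, so `summarize_elements(schema)["table"]` gives the number of tables in the document, and types that never occur count as zero.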
Common element structure:

```json
{
  "type": "text",
  "id": "te1",
  "boundaryBox": { "left": 40, "top": 80, "right": 300, "bottom": 120 },
  "content": { "text": "Body content here" }
}
```

Table content:

```json
{
  "content": {
    "html": "<table>...</table>",
    "csv": "Header1,Header2\nValue1,Value2",
    "json": [
      {
        "metrics": { "rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 1 },
        "para": [{ "content": [{ "text": "Cell content" }] }]
      }
    ]
  }
}
```

Chart content:

```json
{
  "content": {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
    "csv": "Quarter,2023,2024\nQ1,100,120\nQ2,200,220"
  }
}
```
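
The chart fields can be combined into one mapping per series, which is often easier to work with than parallel lists. This is a minimal sketch, assuming `series_values[i]` belongs to `series_names[i]` and each inner list lines up with `x_axis_labels`, as the chart example suggests; `chart_to_series` is a hypothetical helper.

```python
def chart_to_series(content: dict) -> dict:
    """Map each series name to a {label: value} dict (hypothetical helper)."""
    labels = content.get("x_axis_labels", [])
    names = content.get("series_names", [])
    values = content.get("series_values", [])
    # zip() stops at the shortest input, so mismatched lengths are truncated silently.
    return {
        name: dict(zip(labels, series))
        for name, series in zip(names, values)
    }
```

With the chart example above, `chart_to_series(content)["2024"]["Q3"]` evaluates to `180`.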
Usage Patterns
Extract all text:

```python
def get_all_text(schema: dict) -> str:
    texts = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "text" and el.get("content", {}).get("text"):
                texts.append(el["content"]["text"])
    return "\n".join(texts)
```

Extract tables as CSV:

```python
def get_tables_as_csv(schema: dict) -> list:
    tables = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "table":
                csv_data = el.get("content", {}).get("csv", "")
                if csv_data:
                    tables.append(csv_data)
    return tables
```

Generate RAG chunks:

```python
def make_rag_chunks(schema: dict) -> list:
    chunks = []
    doc_name = schema.get("docName", "")
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            text = el.get("content", {}).get("text") or el.get("content", {}).get("csv") or ""
            if text.strip():
                chunks.append({
                    "source": doc_name,
                    "page": page["pageNum"],
                    "type": el["type"],
                    "text": text.strip()
                })
    return chunks
```
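
When merged cells matter, the cell-level `json` entries of a table carry more information than the `csv` string. The sketch below rebuilds a row-by-column grid from them. It assumes that `rowaddr` and `coladdr` are zero-based cell positions and that a cell's text is the concatenation of its `para[].content[].text` values, which the table example suggests but does not state. `cell_text` and `table_to_grid` are hypothetical helpers, and a spanned cell is written only at its top-left position.

```python
def cell_text(cell: dict) -> str:
    """Join all text runs of a cell, one line per paragraph."""
    lines = []
    for para in cell.get("para", []):
        runs = [run.get("text", "") for run in para.get("content", [])]
        lines.append("".join(runs))
    return "\n".join(lines)


def table_to_grid(cells: list) -> list:
    """Place each cell's text at (rowaddr, coladdr) in a rectangular grid."""
    if not cells:
        return []
    # Assumption: rowaddr/coladdr are zero-based; spans extend the grid size.
    n_rows = max(c["metrics"]["rowaddr"] + c["metrics"].get("rowspan", 1) for c in cells)
    n_cols = max(c["metrics"]["coladdr"] + c["metrics"].get("colspan", 1) for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        metrics = cell["metrics"]
        grid[metrics["rowaddr"]][metrics["coladdr"]] = cell_text(cell)
    return grid
```

For a table element `el`, the grid would be built with `table_to_grid(el["content"]["json"])`.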
Example
User: "Extract all table data from this DOCX report as CSV."
Output:
```python
import os

schema = extract_document("report.docx", os.environ["POLARIS_DATAINSIGHT_API_KEY"])
tables = get_tables_as_csv(schema)
for i, csv_data in enumerate(tables):
    print(f"=== Table {i+1} ===")
    print(csv_data)
```

```
=== Table 1 ===
Quarter,Revenue,Cost
Q1,1200,800
Q2,1500,900
=== Table 2 ===
Item,Amount
Labor,500
Operations,300
```

Inspired by: Polaris Office DataInsight API documentation and workflow.
Tips
- The response is always a ZIP file. Do not try to parse `response.content` directly as JSON; use `zipfile.ZipFile` to extract it first.
- `content.csv` is available for both `table` and `chart` elements, making it the most convenient format for data extraction.
- The rate limit is 10 requests per minute. When processing multiple files, add a delay (e.g., `time.sleep(6)`) between calls.
- Use `boundaryBox` to determine where each element sits on the page, which is useful for layout analysis.
- Always store the API key in an environment variable (`POLARIS_DATAINSIGHT_API_KEY`) and never hardcode it.
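
The delay suggested in the rate-limit tip can be wrapped in a small batch helper. This is a minimal sketch: `extract_all` is a hypothetical function, `extract` stands for any callable that takes a file path (for example a wrapper around `extract_document`), and the 6-second default follows from 60 seconds divided by 10 requests. The sleep function is passed in as a parameter so the pacing can be tested without actually waiting.

```python
import time
from typing import Callable, Iterable


def extract_all(
    file_paths: Iterable[str],
    extract: Callable[[str], dict],
    delay_seconds: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call extract() on each path, pausing between calls to respect the rate limit."""
    results = {}
    for index, file_path in enumerate(file_paths):
        if index > 0:
            sleep(delay_seconds)  # pause before every call except the first
        results[file_path] = extract(file_path)
    return results
```

A call might look like `extract_all(["a.docx", "b.pptx"], lambda p: extract_document(p, api_key))`. A fixed delay is the simplest option; it does not account for how long each request itself takes, so the effective rate stays below the limit.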
Common Use Cases
- Document search systems: Extract full text and store it in a vector database for semantic search
- Automated report analysis: Collect table and chart data from PPTX/DOCX reports for analysis
- HWP digitization: Convert HWP/HWPX documents into structured, machine-readable data
- RAG pipeline setup: Split documents into chunks for use in LLM-based Q&A systems
- Data migration: Move table and chart data from legacy Office documents into a database
License & Terms
- Skill Definition: This `SKILL.md` file is provided under the Apache 2.0 license.
- Service Access: Usage of the DataInsight API requires a valid subscription or license key.
- Restrictions: Unauthorized redistribution of the API endpoints or bypassing authentication is strictly prohibited.
- Support: For licensing inquiries, visit https://datainsight.polarisoffice.com.