polaris-datainsight-doc-extract


Polaris AI DataInsight — Doc Extract Skill

Use the Polaris AI DataInsight Doc Extract API to extract text, images, tables, charts, shapes, equations, and more from Word, PowerPoint, Excel, HWP, and HWPX files, returning everything as structured `unifiedSchema` JSON. A single API call gives you the full document structure without any manual parsing.

When to Use This Skill

  • The user wants to extract text, tables, charts, or images from DOCX, PPTX, XLSX, HWP, or HWPX files
  • The user needs to understand a document's structure (page count, element types, position data, etc.)
  • The extracted data will be used in a RAG pipeline, data analysis workflow, or automation task
  • Table data needs to be converted to CSV, or chart data needs to be broken down into series and labels
  • The user needs to parse special elements like headers, footers, equations, or shapes

What This Skill Does

  1. Authentication — Authenticates with the Polaris DataInsight API via the `x-po-di-apikey` header.
  2. Upload and extract — Sends the file as a multipart/form-data POST request and extracts the full document structure.
  3. Parse ZIP response — The API returns a ZIP file; extract it and load the `unifiedSchema` JSON inside.
  4. Deliver structured data — Returns JSON organized by page and element type (text, table, chart, image, shape, equation, etc.).
  5. Support multiple usage patterns — Handles full text extraction, table-to-CSV conversion, RAG chunk generation, and more.

How to Use

Prerequisites

Get an API key: Sign up at https://datainsight.polarisoffice.com and generate your API key.
Authentication: Include the API key as a header on every request.
Header: `x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY`
Set the environment variable:

```bash
export POLARIS_DATAINSIGHT_API_KEY="your-api-key-here"
```

Limits

| Item | Limit |
| --- | --- |
| Supported formats | HWP, HWPX, DOCX, PPTX, XLSX |
| Max file size | 25 MB |
| Timeout | 10 minutes |
| Rate limit | 10 requests per minute |
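These limits can be checked client-side before uploading, which avoids spending a rate-limited request on a file the API will reject. A minimal sketch — the helper name and constants are illustrative, not part of the API:

```python
import os

SUPPORTED_EXTENSIONS = {".hwp", ".hwpx", ".docx", ".pptx", ".xlsx"}
MAX_FILE_SIZE = 25 * 1024 * 1024  # 25 MB

def check_limits(file_path: str) -> None:
    """Raise ValueError if the file would be rejected by the API limits above."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(file_path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 25 MB limit")
```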

Basic Usage

Endpoint: `POST https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract`

Extract a document with Python:

```python
import requests
import json
import zipfile
import io

def extract_document(file_path: str, api_key: str) -> dict:
    with open(file_path, "rb") as f:
        response = requests.post(
            "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract",
            headers={"x-po-di-apikey": api_key},
            files={"file": f}
        )

    if response.status_code != 200:
        raise Exception(f"API error: {response.status_code} - {response.text}")

    # Response is a ZIP file
    zip_buffer = io.BytesIO(response.content)
    with zipfile.ZipFile(zip_buffer) as z:
        json_files = [name for name in z.namelist() if name.endswith('.json')]
        if json_files:
            with z.open(json_files[0]) as jf:
                return json.load(jf)

    raise Exception("No JSON found in ZIP")
```

Example usage:

```python
import os

api_key = os.environ["POLARIS_DATAINSIGHT_API_KEY"]
schema = extract_document("report.docx", api_key)
print(f"Extracted {schema['totalPages']} pages")
```

**Extract with curl:**

```bash
curl -X POST "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract" \
  -H "x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY" \
  -F "file=@example.docx" \
  --output result.zip

unzip result.zip -d result/
cat result/*.json | python -m json.tool
```

Advanced Usage

Response Structure (unifiedSchema)

Root:

```json
{
  "docName": "sample.docx",
  "totalPages": 3,
  "pages": [ ... ]
}
```

Page (`pages[]`):

```json
{
  "pageNum": 1,
  "pageWidth": 595.3,
  "pageHeight": 842.0,
  "extractionSummary": {
    "text": 5, "image": 2, "table": 1, "chart": 1
  },
  "elements": [ ... ]
}
```
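The per-page `extractionSummary` is handy for a quick inventory of a document before walking `elements`. A minimal sketch — the helper name `summarize_elements` is illustrative:

```python
from collections import Counter

def summarize_elements(schema: dict) -> Counter:
    """Aggregate element counts across all pages from each page's extractionSummary."""
    totals = Counter()
    for page in schema.get("pages", []):
        totals.update(page.get("extractionSummary", {}))
    return totals
```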
Element types (`elements[].type`):

| type | Description |
| --- | --- |
| `text` | Text block |
| `image` | Image |
| `table` | Table |
| `chart` | Chart |
| `shape` | Shape |
| `equation` | Equation |
| `header` / `footer` | Header / Footer |
Common element structure:

```json
{
  "type": "text",
  "id": "te1",
  "boundaryBox": { "left": 40, "top": 80, "right": 300, "bottom": 120 },
  "content": { "text": "Body content here" }
}
```
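Because `boundaryBox` gives page coordinates, elements can be put into reading order when needed. A sketch that sorts top-to-bottom, then left-to-right — the helper name is illustrative:

```python
def sort_by_position(elements: list) -> list:
    """Return elements sorted top-to-bottom, then left-to-right, by boundaryBox."""
    return sorted(
        elements,
        key=lambda el: (el["boundaryBox"]["top"], el["boundaryBox"]["left"]),
    )
```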
Table content:

```json
{
  "content": {
    "html": "<table>...</table>",
    "csv": "Header1,Header2\nValue1,Value2",
    "json": [
      {
        "metrics": { "rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 1 },
        "para": [{ "content": [{ "text": "Cell content" }] }]
      }
    ]
  }
}
```
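The cell-level `json` form can be rebuilt into a 2-D grid via `rowaddr`/`coladdr`. A minimal sketch that treats every cell as 1×1, ignoring `rowspan`/`colspan` merges — the helper name is illustrative:

```python
def table_to_grid(cells: list) -> list:
    """Rebuild content.json cells into a row-major 2-D list of cell text."""
    if not cells:
        return []
    rows = max(c["metrics"]["rowaddr"] for c in cells) + 1
    cols = max(c["metrics"]["coladdr"] for c in cells) + 1
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for c in cells:
        # Join the text runs inside each paragraph of the cell
        text = " ".join(
            run.get("text", "")
            for para in c.get("para", [])
            for run in para.get("content", [])
        )
        grid[c["metrics"]["rowaddr"]][c["metrics"]["coladdr"]] = text
    return grid
```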
Chart content:

```json
{
  "content": {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
    "csv": "Quarter,2023,2024\nQ1,100,120\nQ2,200,220"
  }
}
```
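The parallel `x_axis_labels`, `series_names`, and `series_values` arrays zip naturally into long-format rows when the bundled `csv` isn't the shape you need. A sketch — the helper name is illustrative:

```python
def chart_to_rows(content: dict) -> list:
    """Flatten a chart element's content into (series, label, value) tuples."""
    rows = []
    for name, values in zip(content["series_names"], content["series_values"]):
        for label, value in zip(content["x_axis_labels"], values):
            rows.append((name, label, value))
    return rows
```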

Usage Patterns

Extract all text:

```python
def get_all_text(schema: dict) -> str:
    texts = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "text" and el.get("content", {}).get("text"):
                texts.append(el["content"]["text"])
    return "\n".join(texts)
```

Extract tables as CSV:

```python
def get_tables_as_csv(schema: dict) -> list:
    tables = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "table":
                csv_data = el.get("content", {}).get("csv", "")
                if csv_data:
                    tables.append(csv_data)
    return tables
```

Generate RAG chunks:

```python
def make_rag_chunks(schema: dict) -> list:
    chunks = []
    doc_name = schema.get("docName", "")
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            text = el.get("content", {}).get("text") or el.get("content", {}).get("csv") or ""
            if text.strip():
                chunks.append({
                    "source": doc_name,
                    "page": page["pageNum"],
                    "type": el["type"],
                    "text": text.strip()
                })
    return chunks
```


Example

User: "Extract all table data from this DOCX report as CSV."

Output:

```python
import os

schema = extract_document("report.docx", os.environ["POLARIS_DATAINSIGHT_API_KEY"])
tables = get_tables_as_csv(schema)
for i, csv_data in enumerate(tables):
    print(f"=== Table {i+1} ===")
    print(csv_data)
```

```
=== Table 1 ===
Quarter,Revenue,Cost
Q1,1200,800
Q2,1500,900

=== Table 2 ===
Item,Amount
Labor,500
Operations,300
```

Inspired by: Polaris Office DataInsight API documentation and workflow.


Tips

  • The response is always a ZIP file. Do not try to parse `response.content` directly as JSON — use `zipfile.ZipFile` to extract it first.
  • `content.csv` is available for both `table` and `chart` elements, making it the most convenient format for data extraction.
  • The rate limit is 10 requests per minute. When processing multiple files, add a delay (e.g., `time.sleep(6)`) between calls.
  • Use `boundaryBox` to determine where each element sits on the page — useful for layout analysis.
  • Always store the API key in an environment variable (`POLARIS_DATAINSIGHT_API_KEY`) and never hardcode it.
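The rate-limit tip can be folded into a small batch helper. A sketch where `extract` is any callable with the `extract_document(path, api_key)` signature from Basic Usage — the helper name is illustrative:

```python
import time

def extract_batch(file_paths: list, extract, api_key: str, delay: float = 6.0) -> dict:
    """Run `extract` over several files, pausing between calls to respect the rate limit."""
    results = {}
    for i, path in enumerate(file_paths):
        if i > 0:
            time.sleep(delay)  # a 6 s gap stays at or under 10 requests per minute
        results[path] = extract(path, api_key)
    return results
```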


Common Use Cases

  • Document search systems: Extract full text and store it in a vector database for semantic search
  • Automated report analysis: Collect table and chart data from PPTX/DOCX reports for analysis
  • HWP digitization: Convert HWP/HWPX documents into structured, machine-readable data
  • RAG pipeline setup: Split documents into chunks for use in LLM-based Q&A systems
  • Data migration: Move table and chart data from legacy Office documents into a database

License & Terms

  • Skill Definition: This `SKILL.md` file is provided under the Apache 2.0 license.
  • Service Access: Usage of the DataInsight API requires a valid subscription or license key.
  • Restrictions: Unauthorized redistribution of the API endpoints or bypassing authentication is strictly prohibited.
  • Support: For licensing inquiries, visit https://datainsight.polarisoffice.com.