polaris-datainsight-doc-extract


Polaris AI DataInsight — Doc Extract Skill

Use the Polaris AI DataInsight Doc Extract API to extract text, images, tables, charts, shapes, equations, and more from Word, PowerPoint, Excel, HWP, and HWPX files, returning everything as structured `unifiedSchema` JSON. A single API call gives you the full document structure without any manual parsing.

When to Use This Skill

  • The user wants to extract text, tables, charts, or images from DOCX, PPTX, XLSX, HWP, or HWPX files
  • The user needs to understand a document's structure (page count, element types, position data, etc.)
  • The extracted data will be used in a RAG pipeline, data analysis workflow, or automation task
  • Table data needs to be converted to CSV, or chart data needs to be broken down into series and labels
  • The user needs to parse special elements like headers, footers, equations, or shapes

What This Skill Does

  1. Authentication — Authenticates with the Polaris DataInsight API via the `x-po-di-apikey` header.
  2. Upload and extract — Sends the file as a multipart/form-data POST request and extracts the full document structure.
  3. Parse ZIP response — The API returns a ZIP file; extract it and load the `unifiedSchema` JSON inside.
  4. Deliver structured data — Returns JSON organized by page and element type (text, table, chart, image, shape, equation, etc.).
  5. Support multiple usage patterns — Handles full text extraction, table-to-CSV conversion, RAG chunk generation, and more.

How to Use

Prerequisites

Get an API key: Sign up at https://datainsight.polarisoffice.com and generate your API key.
Authentication: Include the API key as a header on every request.
Header: `x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY`
Set the environment variable:

```bash
export POLARIS_DATAINSIGHT_API_KEY="your-api-key-here"
```

Limits

| Item | Limit |
| --- | --- |
| Supported formats | HWP, HWPX, DOCX, PPTX, XLSX |
| Max file size | 25 MB |
| Timeout | 10 minutes |
| Rate limit | 10 requests per minute |
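These limits can be checked client-side before uploading, which avoids spending a rate-limited request on a file the API will reject. A minimal sketch — the helper name and constants are illustrative, not part of the API:

```python
import os

SUPPORTED_EXTENSIONS = {".hwp", ".hwpx", ".docx", ".pptx", ".xlsx"}
MAX_FILE_SIZE = 25 * 1024 * 1024  # 25 MB

def check_limits(file_path: str) -> None:
    """Raise ValueError if the file would be rejected by the API limits above."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(file_path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 25 MB limit")
```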

Basic Usage

Endpoint: `POST https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract`

Extract a document with Python:

```python
import requests
import json
import zipfile
import io

def extract_document(file_path: str, api_key: str) -> dict:
    with open(file_path, "rb") as f:
        response = requests.post(
            "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract",
            headers={"x-po-di-apikey": api_key},
            files={"file": f}
        )

    if response.status_code != 200:
        raise Exception(f"API error: {response.status_code} - {response.text}")

    # Response is a ZIP file
    zip_buffer = io.BytesIO(response.content)
    with zipfile.ZipFile(zip_buffer) as z:
        json_files = [name for name in z.namelist() if name.endswith('.json')]
        if json_files:
            with z.open(json_files[0]) as jf:
                return json.load(jf)

    raise Exception("No JSON found in ZIP")
```

Example usage:

```python
import os

api_key = os.environ["POLARIS_DATAINSIGHT_API_KEY"]
schema = extract_document("report.docx", api_key)
print(f"Extracted {schema['totalPages']} pages")
```

**Extract with curl:**

```bash
curl -X POST "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract" \
  -H "x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY" \
  -F "file=@example.docx" \
  --output result.zip

unzip result.zip -d result/
cat result/*.json | python -m json.tool
```

Advanced Usage

Response Structure (unifiedSchema)

Root:

```json
{
  "docName": "sample.docx",
  "totalPages": 3,
  "pages": [ ... ]
}
```

Page (`pages[]`):

```json
{
  "pageNum": 1,
  "pageWidth": 595.3,
  "pageHeight": 842.0,
  "extractionSummary": {
    "text": 5, "image": 2, "table": 1, "chart": 1
  },
  "elements": [ ... ]
}
```
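The per-page `extractionSummary` is handy for a quick inventory of a document before walking `elements`. A minimal sketch — the helper name `summarize_elements` is illustrative:

```python
from collections import Counter

def summarize_elements(schema: dict) -> Counter:
    """Aggregate element counts across all pages from each page's extractionSummary."""
    totals = Counter()
    for page in schema.get("pages", []):
        totals.update(page.get("extractionSummary", {}))
    return totals
```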
Element types (`elements[].type`):

| type | Description |
| --- | --- |
| `text` | Text block |
| `image` | Image |
| `table` | Table |
| `chart` | Chart |
| `shape` | Shape |
| `equation` | Equation |
| `header` / `footer` | Header / Footer |
Common element structure:

```json
{
  "type": "text",
  "id": "te1",
  "boundaryBox": { "left": 40, "top": 80, "right": 300, "bottom": 120 },
  "content": { "text": "Body content here" }
}
```
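Because `boundaryBox` gives page coordinates, elements can be put into reading order when needed. A sketch that sorts top-to-bottom, then left-to-right — the helper name is illustrative:

```python
def sort_by_position(elements: list) -> list:
    """Return elements sorted top-to-bottom, then left-to-right, by boundaryBox."""
    return sorted(
        elements,
        key=lambda el: (el["boundaryBox"]["top"], el["boundaryBox"]["left"]),
    )
```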
Table content:

```json
{
  "content": {
    "html": "<table>...</table>",
    "csv": "Header1,Header2\nValue1,Value2",
    "json": [
      {
        "metrics": { "rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 1 },
        "para": [{ "content": [{ "text": "Cell content" }] }]
      }
    ]
  }
}
```
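The cell-level `json` form can be rebuilt into a 2-D grid via `rowaddr`/`coladdr`. A minimal sketch that treats every cell as 1×1, ignoring `rowspan`/`colspan` merges — the helper name is illustrative:

```python
def table_to_grid(cells: list) -> list:
    """Rebuild content.json cells into a row-major 2-D list of cell text."""
    if not cells:
        return []
    rows = max(c["metrics"]["rowaddr"] for c in cells) + 1
    cols = max(c["metrics"]["coladdr"] for c in cells) + 1
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for c in cells:
        # Join the text runs inside each paragraph of the cell
        text = " ".join(
            run.get("text", "")
            for para in c.get("para", [])
            for run in para.get("content", [])
        )
        grid[c["metrics"]["rowaddr"]][c["metrics"]["coladdr"]] = text
    return grid
```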
Chart content:

```json
{
  "content": {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
    "csv": "Quarter,2023,2024\nQ1,100,120\nQ2,200,220"
  }
}
```
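The parallel `x_axis_labels`, `series_names`, and `series_values` arrays zip naturally into long-format rows when the bundled `csv` isn't the shape you need. A sketch — the helper name is illustrative:

```python
def chart_to_rows(content: dict) -> list:
    """Flatten a chart element's content into (series, label, value) tuples."""
    rows = []
    for name, values in zip(content["series_names"], content["series_values"]):
        for label, value in zip(content["x_axis_labels"], values):
            rows.append((name, label, value))
    return rows
```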

Usage Patterns

Extract all text:

```python
def get_all_text(schema: dict) -> str:
    texts = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "text" and el.get("content", {}).get("text"):
                texts.append(el["content"]["text"])
    return "\n".join(texts)
```

Extract tables as CSV:

```python
def get_tables_as_csv(schema: dict) -> list:
    tables = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el["type"] == "table":
                csv_data = el.get("content", {}).get("csv", "")
                if csv_data:
                    tables.append(csv_data)
    return tables
```

Generate RAG chunks:

```python
def make_rag_chunks(schema: dict) -> list:
    chunks = []
    doc_name = schema.get("docName", "")
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            text = el.get("content", {}).get("text") or el.get("content", {}).get("csv") or ""
            if text.strip():
                chunks.append({
                    "source": doc_name,
                    "page": page["pageNum"],
                    "type": el["type"],
                    "text": text.strip()
                })
    return chunks
```


Example

User: "Extract all table data from this DOCX report as CSV."

Output:

```python
import os

schema = extract_document("report.docx", os.environ["POLARIS_DATAINSIGHT_API_KEY"])
tables = get_tables_as_csv(schema)
for i, csv_data in enumerate(tables):
    print(f"=== Table {i+1} ===")
    print(csv_data)
```

```
=== Table 1 ===
Quarter,Revenue,Cost
Q1,1200,800
Q2,1500,900

=== Table 2 ===
Item,Amount
Labor,500
Operations,300
```

Inspired by: Polaris Office DataInsight API documentation and workflow.


Tips

  • The response is always a ZIP file. Do not try to parse `response.content` directly as JSON — use `zipfile.ZipFile` to extract it first.
  • `content.csv` is available for both `table` and `chart` elements, making it the most convenient format for data extraction.
  • The rate limit is 10 requests per minute. When processing multiple files, add a delay (e.g., `time.sleep(6)`) between calls.
  • Use `boundaryBox` to determine where each element sits on the page — useful for layout analysis.
  • Always store the API key in an environment variable (`POLARIS_DATAINSIGHT_API_KEY`) and never hardcode it.
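The rate-limit tip can be folded into a small batch helper. A sketch where `extract` is any callable with the `extract_document(path, api_key)` signature from Basic Usage — the helper name is illustrative:

```python
import time

def extract_batch(file_paths: list, extract, api_key: str, delay: float = 6.0) -> dict:
    """Run `extract` over several files, pausing between calls to respect the rate limit."""
    results = {}
    for i, path in enumerate(file_paths):
        if i > 0:
            time.sleep(delay)  # a 6 s gap stays at or under 10 requests per minute
        results[path] = extract(path, api_key)
    return results
```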


Common Use Cases

  • Document search systems: Extract full text and store it in a vector database for semantic search
  • Automated report analysis: Collect table and chart data from PPTX/DOCX reports for analysis
  • HWP digitization: Convert HWP/HWPX documents into structured, machine-readable data
  • RAG pipeline setup: Split documents into chunks for use in LLM-based Q&A systems
  • Data migration: Move table and chart data from legacy Office documents into a database

License & Terms

  • Skill Definition: This `SKILL.md` file is provided under the Apache 2.0 license.
  • Service Access: Usage of the DataInsight API requires a valid subscription or license key.
  • Restrictions: Unauthorized redistribution of the API endpoints or bypassing authentication is strictly prohibited.
  • Support: For licensing inquiries, visit https://datainsight.polarisoffice.com.