doc-pipeline


Doc Pipeline Skill


Overview


This skill builds document processing pipelines: it chains multiple operations (extract, transform, convert) into reusable workflows, with the output of each stage passed as input to the next.

How to Use


  1. Describe what you want to accomplish
  2. Provide any required input data or files
  3. I'll execute the appropriate operations
Example prompts:
  • "PDF → Extract Text → Translate → Generate DOCX"
  • "Image → OCR → Summarize → Create Report"
  • "Excel → Analyze → Generate Charts → Create PPT"
  • "Multiple inputs → Merge → Format → Output"
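One way to read the arrow-separated prompts above is as an ordered list of stage names. The sketch below is an illustration only: `parse_prompt` is a hypothetical helper, not part of the skill, and it assumes stages are separated by the `→` character, with `->` accepted as an ASCII fallback.

```python
import re


def parse_prompt(prompt: str) -> list[str]:
    """Split an arrow-separated prompt into an ordered list of stage names.

    Hypothetical helper for illustration; the skill itself may interpret
    prompts differently.
    """
    # Accept both the Unicode arrow and the ASCII "->" as separators.
    parts = re.split(r"\s*(?:→|->)\s*", prompt.strip())
    # Drop empty fragments caused by leading or trailing arrows.
    return [part for part in parts if part]


stages = parse_prompt("PDF → Extract Text → Translate → Generate DOCX")
print(stages)  # ['PDF', 'Extract Text', 'Translate', 'Generate DOCX']
```

Each name in the resulting list would still have to be mapped to a concrete operation; that mapping is not specified here.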

Domain Knowledge


Pipeline Architecture


  Stage 1       Stage 2       Stage 3       Stage 4
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Extract │ → │Transform│ → │   AI    │ → │ Output  │
│   PDF   │   │  Data   │   │ Analyze │   │  DOCX   │
└─────────┘   └─────────┘   └─────────┘   └─────────┘
     │             │             │             │
     └─────────────┴─────────────┴─────────────┘
                      Data Flow

Pipeline DSL (Domain Specific Language)


yaml
# pipeline.yaml
name: contract-review-pipeline
description: Extract, analyze, and report on contracts

stages:
  - name: extract
    operation: pdf-extraction
    input: $input_file
    output: $extracted_text

  - name: analyze
    operation: ai-analyze
    input: $extracted_text
    prompt: "Review this contract for risks..."
    output: $analysis

  - name: report
    operation: docx-generation
    input: $analysis
    template: templates/review_report.docx
    output: $output_file
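The pipeline definition above passes data between stages through `$`-prefixed variables. A minimal sketch of how such a definition might be executed follows. Everything in it is an assumption made for illustration: the `OPERATIONS` registry, the `resolve` and `run_config` helpers, and the two stub operations are hypothetical, and the rule that a `$name` string is replaced by the value stored under `name` is one plausible reading of the DSL, not a documented one. The config is written as a Python dict (the shape a YAML or JSON loader would return) so the example needs no parser.

```python
from typing import Any, Callable


# Hypothetical stand-ins for real operations. Each receives the resolved
# stage settings as keyword arguments and returns the stage output.
def pdf_extraction(input: str, **_: Any) -> str:
    return f"text extracted from {input}"


def ai_analyze(input: str, prompt: str = "", **_: Any) -> str:
    return f"analysis of [{input}]"


# Hypothetical registry mapping operation names in the config to callables.
OPERATIONS: dict[str, Callable[..., Any]] = {
    "pdf-extraction": pdf_extraction,
    "ai-analyze": ai_analyze,
}


def resolve(value: Any, variables: dict[str, Any]) -> Any:
    """Replace a "$name" string with the current value of that variable."""
    if isinstance(value, str) and value.startswith("$"):
        return variables[value[1:]]
    return value


def run_config(config: dict[str, Any], variables: dict[str, Any]) -> dict[str, Any]:
    """Run each stage in order, storing its result under its output variable."""
    for stage in config["stages"]:
        operation = OPERATIONS[stage["operation"]]
        # Everything except the bookkeeping keys is passed to the operation.
        kwargs = {
            key: resolve(val, variables)
            for key, val in stage.items()
            if key not in ("name", "operation", "output")
        }
        variables[stage["output"].lstrip("$")] = operation(**kwargs)
    return variables


# A two-stage version of the definition, as a loader would return it.
config = {
    "name": "contract-review-pipeline",
    "stages": [
        {"name": "extract", "operation": "pdf-extraction",
         "input": "$input_file", "output": "$extracted_text"},
        {"name": "analyze", "operation": "ai-analyze",
         "input": "$extracted_text",
         "prompt": "Review this contract for risks...",
         "output": "$analysis"},
    ],
}

result = run_config(config, {"input_file": "contract.pdf"})
print(result["analysis"])  # analysis of [text extracted from contract.pdf]
```

Keeping all variables in one dict means every intermediate value stays available after the run, which is convenient for inspecting what each stage produced.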

Python Implementation


python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    """A named processing step that maps input data to output data."""
    name: str
    operation: Callable[[Any], Any]


class Pipeline:
    """Runs stages in order, feeding each stage's output to the next."""

    def __init__(self, name: str):
        self.name = name
        self.stages: list[Stage] = []

    def add_stage(self, name: str, operation: Callable[[Any], Any]) -> "Pipeline":
        self.stages.append(Stage(name, operation))
        return self  # Fluent API: allows chained add_stage calls

    def run(self, input_data: Any) -> Any:
        data = input_data
        for stage in self.stages:
            print(f"Running stage: {stage.name}")
            data = stage.operation(data)
        return data

# Example usage
pipeline = Pipeline("contract-review")
pipeline.add_stage("extract", extract_pdf_text)
pipeline.add_stage("analyze", analyze_with_ai)
pipeline.add_stage("generate", create_docx_report)

result = pipeline.run("/path/to/contract.pdf")

Advanced: Conditional Pipelines


python
class ConditionalPipeline(Pipeline):
    def add_conditional_stage(self, name: str, condition: Callable, 
                               if_true: Callable, if_false: Callable):
        def conditional_op(data):
            if condition(data):
                return if_true(data)
            return if_false(data)
        return self.add_stage(name, conditional_op)

# Usage (pipeline must be a ConditionalPipeline instance)
pipeline.add_conditional_stage(
    "ocr_if_needed",
    condition=lambda d: d.get("has_images"),
    if_true=run_ocr,
    if_false=lambda d: d,
)
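To see the conditional stage run end to end, the sketch below repeats the classes in condensed form so it is self-contained. The `run_ocr` stub and the dict-shaped document (with `text` and `has_images` keys) are assumptions for illustration; a real OCR step would replace the stub.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    name: str
    operation: Callable[[Any], Any]


class Pipeline:
    def __init__(self, name: str):
        self.name = name
        self.stages: list[Stage] = []

    def add_stage(self, name: str, operation: Callable[[Any], Any]) -> "Pipeline":
        self.stages.append(Stage(name, operation))
        return self

    def run(self, input_data: Any) -> Any:
        data = input_data
        for stage in self.stages:
            data = stage.operation(data)
        return data


class ConditionalPipeline(Pipeline):
    def add_conditional_stage(self, name, condition, if_true, if_false):
        def conditional_op(data):
            return if_true(data) if condition(data) else if_false(data)
        return self.add_stage(name, conditional_op)


# Hypothetical stand-in for a real OCR step: appends placeholder text and
# clears the flag so the document is not processed twice.
def run_ocr(doc: dict) -> dict:
    return {**doc, "text": doc["text"] + " [ocr text]", "has_images": False}


pipeline = ConditionalPipeline("scan-aware")
pipeline.add_conditional_stage(
    "ocr_if_needed",
    condition=lambda d: d.get("has_images"),
    if_true=run_ocr,
    if_false=lambda d: d,
)

scanned = pipeline.run({"text": "page 1", "has_images": True})
plain = pipeline.run({"text": "page 1", "has_images": False})
print(scanned["text"])  # page 1 [ocr text]
print(plain["text"])    # page 1
```

Because the branch is wrapped in an ordinary stage, the rest of the pipeline does not need to know that a condition was evaluated.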

Best Practices


  1. Keep stages focused (single responsibility)
  2. Use intermediate outputs for debugging
  3. Implement stage-level error handling
  4. Make pipelines configurable via YAML/JSON
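Practices 2 and 3 above can be combined in a small runner. The following is a sketch under stated assumptions, not the skill's own implementation: `StageError` and `run_stages` are hypothetical names, and stages are represented as plain `(name, callable)` pairs.

```python
from typing import Any, Callable


class StageError(Exception):
    """Raised when a stage fails; records which stage it was."""

    def __init__(self, stage_name: str, cause: Exception):
        super().__init__(f"stage '{stage_name}' failed: {cause}")
        self.stage_name = stage_name
        self.cause = cause


def run_stages(stages: list[tuple[str, Callable[[Any], Any]]],
               input_data: Any) -> tuple[Any, dict[str, Any]]:
    """Run stages in order, keeping every intermediate output by stage name."""
    intermediates: dict[str, Any] = {}
    data = input_data
    for name, operation in stages:
        try:
            data = operation(data)
        except Exception as exc:
            # Wrap the failure so the caller knows which stage broke.
            raise StageError(name, exc) from exc
        intermediates[name] = data
    return data, intermediates


result, trace = run_stages(
    [("strip", str.strip), ("upper", str.upper)],
    "  draft contract  ",
)
print(result)  # DRAFT CONTRACT
print(trace)   # {'strip': 'draft contract', 'upper': 'DRAFT CONTRACT'}
```

The returned `trace` dict gives one entry per completed stage, so a wrong final result can be traced back to the first stage whose output looks off.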

Installation


bash
# Install required dependencies
pip install python-docx openpyxl python-pptx reportlab jinja2

Resources
