document-extraction

Document Extraction Skill

Systematically extract requirements from existing documentation sources.

When to Use This Skill

Keywords: extract requirements, document mining, PDF requirements, transcript analysis, parse document, existing documentation, legacy requirements, competitive analysis

Invoke this skill when:

Mining requirements from existing documents
Processing meeting transcripts for requirements
Extracting requirements from competitor products
Analyzing regulatory documents for compliance requirements
Converting legacy documentation to structured requirements

Supported Document Types

Type	Extension	Extraction Method
Markdown	.md	Direct Read
Text	.txt	Direct Read
PDF	.pdf	Read tool (PDF support)
Word	.docx	Read tool
Web Page	URL	WebFetch tool
Meeting Notes	.md, .txt	Transcript patterns
Specification	.md, .docx	Requirement patterns

Extraction Workflow

Step 1: Document Assessment

Analyze the document to determine extraction strategy:

yaml

document_assessment:
  path: "{file path or URL}"
  type: "{detected document type}"
  size: "{approximate size}"
  structure:
    has_sections: true|false
    has_lists: true|false
    has_tables: true|false
  quality:
    formal_language: true|false
    clear_requirements: true|false
    needs_interpretation: true|false
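
The structural fields of this template can be filled in mechanically for plain-text or Markdown input. The following is a minimal sketch under that assumption; `assess_document` and its heuristics (Markdown `#` headings, bullet or numbered list markers, pipe-delimited table rows) are hypothetical and not part of the skill. The `quality` fields call for judgment and are left out.

```python
import re


def assess_document(path: str, text: str) -> dict:
    """Fill the structural part of the document_assessment template.

    Heuristics are illustrative assumptions for Markdown-like text only.
    """
    lines = text.splitlines()
    # A Markdown heading: one to six '#' characters followed by text.
    has_sections = any(re.match(r"#{1,6}\s+\S", ln) for ln in lines)
    # A bullet ('-', '*', '+') or a numbered item ('1.' or '1)').
    has_lists = any(re.match(r"\s*(?:[-*+]|\d+[.)])\s+\S", ln) for ln in lines)
    # A pipe-delimited table row has at least two '|' characters.
    has_tables = any(ln.count("|") >= 2 for ln in lines)
    return {
        "path": path,
        "size": f"{len(text)} characters",
        "structure": {
            "has_sections": has_sections,
            "has_lists": has_lists,
            "has_tables": has_tables,
        },
    }
```

PDF and Word sources would need their text extracted first (the table above assigns that to the Read tool) before a check like this applies.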

Step 2: Pattern Matching

Apply requirement detection patterns:

Explicit Requirement Markers:

text

- "The system shall..."
- "The system must..."
- "Users should be able to..."
- "REQ-XXX:"
- Numbered requirements (1.1, 1.2, etc.)

EARS Patterns:

text

- "When [trigger], the [system] shall [response]"
- "While [state], the [system] shall [behavior]"
- "Where [feature], the [system] shall [behavior]"
- "If [condition], then the [system] shall [response]"

Implicit Requirement Indicators:

text

- "It is important that..."
- "We need..."
- "The goal is to..."
- "Users expect..."
- "Performance should..."
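
One way to apply the three pattern families above is with regular expressions, checked from most to least specific. This is a rough sketch, not the skill's actual matcher: the regexes and the `classify_sentence` name are assumptions, and real documents will need broader patterns.

```python
import re

# EARS: a leading trigger word, a comma, then "the <system> shall".
EARS = re.compile(
    r"^\s*(?:when|while|where|if)\b.+?,\s*(?:then\s+)?the\s+.+?\bshall\b",
    re.IGNORECASE,
)
# Explicit markers: shall/must, "should be able to", REQ-XXX:, or 1.1-style numbering.
EXPLICIT = re.compile(
    r"\b(?:shall|must)\b|\bshould be able to\b|^\s*REQ-\w+:|^\s*\d+(?:\.\d+)+\s+\S",
    re.IGNORECASE,
)
# Implicit indicators, taken from the list above.
IMPLICIT = re.compile(
    r"\b(?:it is important that|we need|the goal is to|users expect|performance should)\b",
    re.IGNORECASE,
)


def classify_sentence(sentence: str):
    """Return 'ears', 'explicit', 'implicit', or None for a single sentence."""
    if EARS.search(sentence):
        return "ears"
    if EXPLICIT.search(sentence):
        return "explicit"
    if IMPLICIT.search(sentence):
        return "implicit"
    return None
```

EARS is tested first because every EARS sentence also contains "shall" and would otherwise be reported as a plain explicit marker.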

Step 3: Requirement Extraction

For each identified requirement:

yaml

extracted_requirement:
  id: REQ-{sequence}
  text: "{cleaned requirement statement}"
  source: document
  source_file: "{file path}"
  source_location: "{section/page/line}"
  original_text: "{exact text from document}"
  type: functional|non-functional|constraint|assumption
  confidence: high|medium|low
  extraction_method: explicit|pattern|inferred
  needs_review: true|false
  review_notes: "{why review needed}"
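
If the records are built in code rather than written by hand, the template maps naturally onto a small data class that rejects values outside the enumerations. The sketch below is one possible shape; `ExtractedRequirement`, `make_id`, and the three-digit zero padding of the sequence number are assumptions (the padding mirrors the `REQ-001` style used in the deduplication example).

```python
from dataclasses import asdict, dataclass

# Allowed values, copied from the template above.
ALLOWED = {
    "type": {"functional", "non-functional", "constraint", "assumption"},
    "confidence": {"high", "medium", "low"},
    "extraction_method": {"explicit", "pattern", "inferred"},
}


def make_id(sequence: int) -> str:
    """Format a sequence number as REQ-001, REQ-002, ... (padding is an assumption)."""
    return f"REQ-{sequence:03d}"


@dataclass
class ExtractedRequirement:
    id: str
    text: str
    source_file: str
    source_location: str
    original_text: str
    type: str
    confidence: str
    extraction_method: str
    needs_review: bool = False
    review_notes: str = ""
    source: str = "document"

    def __post_init__(self) -> None:
        # Fail early on values the template does not list.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )
```

`asdict(record)` then yields a plain dictionary that can be serialized to YAML with whatever library the project already uses.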

Step 4: Categorization

Categorize extracted requirements:

yaml

categories:
  functional:
    - features
    - behaviors
    - interactions
  non_functional:
    - performance
    - security
    - usability
    - reliability
    - scalability
  constraints:
    - technical
    - business
    - regulatory
  assumptions:
    - environmental
    - user_behavior
    - dependencies

Step 5: Deduplication

Identify and merge duplicate requirements:

yaml

deduplication:
  strategy: semantic_similarity
  threshold: 0.8
  action: merge|flag_for_review
  merged_requirements:
    - id: REQ-merged-001
      sources: [REQ-001, REQ-015]
      text: "{consolidated requirement}"
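
The grouping logic behind this configuration can be illustrated with a greedy pass over the requirements. Note the substitution: the sketch below uses character-level similarity from Python's `difflib` as a stand-in, which is not semantic similarity; a real implementation would likely compare embeddings or use a similar meaning-aware measure. The function names are hypothetical, and only the 0.8 threshold comes from the configuration above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(requirements: dict, threshold: float = 0.8) -> list:
    """Greedily group requirement texts whose similarity meets the threshold.

    `requirements` maps requirement id -> text. Each group keeps the text of
    its first member and records every contributing id under 'sources'.
    """
    groups = []
    for req_id, text in requirements.items():
        for group in groups:
            if similarity(group["text"], text) >= threshold:
                group["sources"].append(req_id)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append({"sources": [req_id], "text": text})
    return groups
```

Groups with more than one source correspond to `merged_requirements` entries; pairs that score near the threshold are natural candidates for the `flag_for_review` action instead of an automatic merge.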

Document-Specific Strategies

Meeting Transcripts

yaml

transcript_extraction:
  focus_on:
    - Action items
    - Decisions made
    - Requirements discussed
    - Concerns raised
  patterns:
    - "We decided that..."
    - "The requirement is..."
    - "Action item:"
    - "TODO:"
    - "Need to..."
  speaker_context:
    - Note who said what
    - Weight by speaker role
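
As an illustration of combining the patterns with speaker context, the sketch below scans a transcript whose lines follow a `Speaker: utterance` layout. That layout, the `scan_transcript` name, and the role weights are all assumptions made for the example; the skill does not prescribe numeric weights.

```python
# Lowercased versions of the transcript patterns listed above.
PATTERNS = ["we decided that", "the requirement is", "action item:", "todo:", "need to"]

# Hypothetical weights per speaker role; tune or replace for a real project.
ROLE_WEIGHTS = {"product owner": 1.0, "engineer": 0.7}
DEFAULT_WEIGHT = 0.5


def scan_transcript(transcript: str, roles: dict) -> list:
    """Return pattern hits with speaker, line number, and a role-based weight.

    `roles` maps speaker name -> role. Lines without a 'Speaker:' prefix
    are skipped in this simplified version.
    """
    hits = []
    for line_no, line in enumerate(transcript.splitlines(), start=1):
        speaker, sep, utterance = line.partition(":")
        if not sep:
            continue
        lowered = utterance.lower()
        pattern = next((p for p in PATTERNS if p in lowered), None)
        if pattern is None:
            continue
        role = roles.get(speaker.strip(), "")
        hits.append({
            "line": line_no,
            "speaker": speaker.strip(),
            "pattern": pattern,
            "text": utterance.strip(),
            "weight": ROLE_WEIGHTS.get(role, DEFAULT_WEIGHT),
        })
    return hits
```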

Regulatory Documents

yaml

regulatory_extraction:
  focus_on:
    - Mandatory requirements ("shall", "must")
    - Prohibited actions ("shall not", "must not")
    - Conditional requirements ("if...then")
  compliance_mapping:
    - Reference section numbers
    - Note effective dates
    - Track version/revision
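
The three clause kinds can be told apart with a few keyword checks, provided prohibitions are tested before plain obligations (otherwise "shall not" would be read as "shall"). A minimal sketch, with a hypothetical `classify_clause` helper and deliberately simple regexes:

```python
import re

PROHIBITED = re.compile(r"\b(?:shall|must)\s+not\b", re.IGNORECASE)
MANDATORY = re.compile(r"\b(?:shall|must)\b", re.IGNORECASE)
CONDITIONAL = re.compile(r"\bif\b.+\bthen\b", re.IGNORECASE)


def classify_clause(section: str, text: str) -> dict:
    """Tag a regulatory clause and keep its section number for compliance mapping."""
    kinds = []
    # Check the prohibition first: "shall not" also contains "shall".
    if PROHIBITED.search(text):
        kinds.append("prohibited")
    elif MANDATORY.search(text):
        kinds.append("mandatory")
    # A clause can be conditional in addition to mandatory or prohibited.
    if CONDITIONAL.search(text):
        kinds.append("conditional")
    return {"section": section, "kinds": kinds, "text": text}
```

Effective dates and document revisions are not detected here; they would be recorded alongside the section reference from the document's front matter.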

Competitor Analysis

yaml

competitor_extraction:
  focus_on:
    - Feature descriptions
    - User capabilities
    - Unique selling points
  output:
    - Feature requirements
    - Differentiation opportunities
    - Gap identification
  confidence: low  # Based on external observation

Legacy Specifications

yaml

legacy_extraction:
  focus_on:
    - Existing requirements
    - System behaviors
    - Integration points
  modernization:
    - Update terminology
    - Convert to EARS format
    - Flag deprecated requirements

Output Format

Per-Document Output

yaml

extraction_result:
  source:
    file: "{path or URL}"
    type: "{document type}"
    extraction_date: "{ISO-8601}"
    confidence: high|medium|low

  statistics:
    total_candidates: {number}
    extracted: {number}
    filtered: {number}
    needs_review: {number}

  requirements:
    - id: REQ-{number}
      text: "{requirement}"
      type: functional|non-functional|constraint|assumption
      source_location: "{section/page}"
      confidence: high|medium|low
      original_text: "{exact source text}"

  review_items:
    - requirement_id: REQ-{number}
      reason: "{why review needed}"
      suggestion: "{proposed action}"

  metadata:
    sections_processed: {number}
    extraction_patterns_used: ["{pattern names}"]

Autonomy Levels

Guided Mode

yaml

guided_behavior:
  document_selection: Human selects
  extraction_strategy: AI suggests, human approves
  each_requirement: AI highlights, human confirms
  categorization: AI suggests, human validates

Semi-Autonomous Mode

yaml

semi_auto_behavior:
  document_selection: AI suggests priority, human approves list
  extraction_strategy: AI chooses autonomously
  requirements: AI extracts all, human reviews in batches
  categorization: AI categorizes, human spot-checks

Fully Autonomous Mode

yaml

full_auto_behavior:
  document_selection: AI processes all relevant
  extraction_strategy: AI optimizes per document
  requirements: AI extracts, deduplicates, categorizes
  output: Full extraction report for final review

Quality Indicators

High Confidence Extraction

Explicit requirement markers ("shall", "must")
EARS-pattern matches
Numbered requirement lists
Clear imperative statements

Medium Confidence Extraction

Implicit indicators ("should", "needs to")
Context-dependent interpretation
Partial pattern matches
Requires domain knowledge

Low Confidence Extraction

Inferred from descriptions
Narrative text interpretation
Competitive analysis
Assumptions based on context

Delegation

For related tasks, delegate to:

gap-analysis: Check extracted requirements for completeness
domain-research: Research unfamiliar terms or concepts
elicitation-methodology: Route back for technique selection

Output Location

Save extraction results to:

text

.requirements/{domain}/documents/DOC-{filename}-{timestamp}.yaml
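
A path following this template could be built as sketched below. Two details are assumptions, since the template does not fix them: `{filename}` is taken to be the source file's name without its extension, and `{timestamp}` is rendered as a compact UTC stamp (`YYYYMMDDTHHMMSSZ`). The `output_path` helper is hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def output_path(domain: str, source_file: str, when=None) -> Path:
    """Build .requirements/{domain}/documents/DOC-{filename}-{timestamp}.yaml."""
    # Default to the current time; always express the stamp in UTC.
    moment = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    timestamp = moment.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    stem = Path(source_file).stem  # assumed: file name without extension
    return Path(".requirements") / domain / "documents" / f"DOC-{stem}-{timestamp}.yaml"
```

For example, `output_path("billing", "specs/legacy-api.docx")` yields a path under `.requirements/billing/documents/` whose file name starts with `DOC-legacy-api-`.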
