# Excel Parser

## Overview

Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).

## Core Philosophy: Scout Pattern

Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

1. **Metadata Scanning** - Use `openpyxl` to scan the file structure without loading data
2. **Complexity Scoring** - Calculate a score based on merged cells, row count, and layout
3. **Path Selection** - Choose between Pandas (fast) and HTML (accurate) processing
4. **Optimized Execution** - Execute with the most appropriate tool for the file type

**Key Principle:** "LLM handles metadata decisions, Pandas/HTML processes bulk data"

## When to Use This Skill

Use excel-parser when:

- Processing Excel/CSV files with unknown structure or varying complexity
- Handling files ranging from simple data tables to complex financial reports
- Needing to balance processing speed against extraction accuracy
- Working with files that may contain merged cells, multi-level headers, or irregular layouts

Skip this skill when:

- The file structure is already known and documented
- Processing simple, well-structured tables with a confirmed format
- Using predefined scripts for specific file formats

## Processing Workflow

### Step 1: Analyze File Complexity

Use `scripts/complexity_analyzer.py` to scan file metadata:

```bash
python scripts/complexity_analyzer.py <file_path> [sheet_name]
```

What it analyzes (without loading data):

- Merged cell distribution (shallow vs. deep in the table)
- Row count and data continuity
- Empty row interruptions (indicating multi-table layouts)

Output (JSON format):

```json
{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}
```
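The scan itself can be sketched with openpyxl. `scan_worksheet` below is a hypothetical helper, not the actual contents of `scripts/complexity_analyzer.py`; it collects the same three stats from worksheet metadata rather than bulk cell data:

```python
from openpyxl import Workbook

def scan_worksheet(ws, header_zone=5):
    """Collect structural stats without loading cell data in bulk."""
    # Merged ranges that start below the header zone suggest a complex body.
    deep_merges = sum(1 for rng in ws.merged_cells.ranges if rng.min_row > header_zone)
    total_rows = ws.max_row
    # Count runs of fully empty rows (possible sub-table boundaries).
    empty_interruptions, prev_empty = 0, False
    for row in ws.iter_rows(min_row=1, max_row=total_rows, values_only=True):
        is_empty = all(v is None for v in row)
        if is_empty and not prev_empty:
            empty_interruptions += 1
        prev_empty = is_empty
    return {"total_rows": total_rows, "deep_merges": deep_merges,
            "empty_interruptions": empty_interruptions}

# Demo on a small in-memory workbook with one merge deep in the data region:
wb = Workbook()
ws = wb.active
for i in range(1, 11):
    ws.cell(row=i, column=1, value=i)
ws.merge_cells("B7:B9")
stats = scan_worksheet(ws)
print(stats)
```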

### Step 2: Route to Optimal Strategy

Based on the complexity analysis:

- `is_complex = false` → Use Path A (Pandas Standard Mode)
- `is_complex = true` → Use Path B (HTML Semantic Mode)

### Step 3: Execute Processing

Follow the selected path's workflow to extract data.

## Complexity Scoring Rules

### Rule 1: Deep Merged Cells

- **Condition:** Merged cells appear beyond row 5
- **Interpretation:** Complex table structure (not just header formatting)
- **Decision:** Mark as complex if more than 2 deep merges are detected
- **Example:** Financial reports with merged category labels in the data region

### Rule 2: Empty Row Interruptions

- **Condition:** Multiple empty rows within the table
- **Interpretation:** Multiple sub-tables in a single sheet
- **Decision:** Mark as complex if more than 2 empty row interruptions are found
- **Example:** A summary table plus a detail table in one sheet

### Rule 3: Row Count Override

- **Condition:** Total rows > 1000
- **Interpretation:** Too large for HTML processing (token explosion)
- **Decision:** Force Pandas mode regardless of complexity
- **Rationale:** HTML conversion would exceed token limits

### Rule 4: Default (Standard Table)

- **Condition:** No deep merges, continuous data, moderate size
- **Interpretation:** Standard data table
- **Decision:** Use Pandas for optimal speed
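Taken together, the four rules can be sketched as one pure function. The thresholds mirror the rules above, and the output shape follows the analyzer's sample JSON; this is an illustration, not the exact logic of `scripts/complexity_analyzer.py`:

```python
def classify(total_rows, deep_merges, empty_interruptions,
             max_html_rows=1000, merge_limit=2, gap_limit=2):
    """Apply Rules 1-4; thresholds are taken from the rules above."""
    # Rule 3: large files always go to Pandas, whatever their layout.
    if total_rows > max_html_rows:
        return {"is_complex": False, "recommended_strategy": "pandas",
                "reasons": [f"Row count exceeds {max_html_rows}, forcing Pandas mode"]}
    reasons = []
    if deep_merges > merge_limit:              # Rule 1
        reasons.append("Deep merged cells in the data region")
    if empty_interruptions > gap_limit:        # Rule 2
        reasons.append("Empty-row interruptions suggest multiple sub-tables")
    if reasons:
        return {"is_complex": True, "recommended_strategy": "html", "reasons": reasons}
    # Rule 4: default, a standard table.
    return {"is_complex": False, "recommended_strategy": "pandas",
            "reasons": ["Standard table layout"]}

print(classify(5000, 0, 0)["recommended_strategy"])  # pandas (Rule 3 override)
print(classify(200, 4, 0)["recommended_strategy"])   # html (Rule 1)
```

Note that Rule 3 is checked first, so a 5000-row sheet routes to Pandas even if it also contains deep merges.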

## Path A: Pandas Standard Mode

**When:** Simple or large tables (the most common case)

**Strategy:** The agent analyzes only the first 20 rows to determine the header position, then uses Pandas to read the full data at native speed.

**Workflow:**

1. **Sample the First 20 Rows**
   - Read only the first 20 rows using `pd.read_excel(..., nrows=20)`
   - Convert them to CSV format for analysis
2. **Determine the Header Position**
   - Examine the sampled rows to identify which row contains the actual column headers
   - Common patterns: row 0 (standard), rows 1-2 (when title rows exist), or the first row with distinct column names
3. **Read the Full Data**
   - Use `pd.read_excel(..., header=<detected_row>)` to load the complete data
   - The `header` parameter ensures proper column naming

**Token Cost:** ~500 tokens (only 20 rows analyzed)
**Processing Speed:** Very fast (Pandas native speed)

For implementation details, see `references/smart_excel_router.py`.
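The three steps above can be sketched as follows. In the real skill the header row is chosen by the agent from the 20-row sample; `detect_header_row` is a hypothetical heuristic standing in for that judgment:

```python
import pandas as pd

def detect_header_row(sample):
    """Pick the first sampled row whose cells are all non-null and unique,
    which usually marks the real column headers beneath any title rows."""
    for i, row in sample.iterrows():
        values = row.tolist()
        if all(pd.notna(v) and str(v).strip() for v in values) \
                and len(set(values)) == len(values):
            return i
    return 0  # fall back to the standard layout

def read_with_detected_header(path, sheet_name=0):
    # Step 1: sample only the first 20 rows, raw (no header assumed).
    sample = pd.read_excel(path, sheet_name=sheet_name, header=None, nrows=20)
    # Step 2: decide where the headers are.
    header_row = detect_header_row(sample)
    # Step 3: re-read the full file at native speed with the right header.
    return pd.read_excel(path, sheet_name=sheet_name, header=header_row)

# The heuristic alone, on a sample whose title row sits above the headers:
sample = pd.DataFrame([["Quarterly Report", None, None],
                       ["Name", "Qty", "Price"],
                       ["apple", 3, 1.5]])
print(detect_header_row(sample))  # 1
```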

## Path B: HTML Semantic Mode

**When:** Complex or irregular tables (merged cells, multi-level headers)

**Strategy:** Convert to semantic HTML that preserves structure (rowspan/colspan), then extract data with an understanding of the visual layout.

**Workflow:**

1. **Convert to Semantic HTML**
   - Load the workbook with `openpyxl`
   - Build an HTML table preserving merged cell spans
   - Use `rowspan` and `colspan` attributes to maintain the structure
2. **Extract Structured Data**
   - Analyze the HTML table structure
   - Identify hierarchical headers from merged cells
   - Extract data while preserving semantic relationships

**Token Cost:** Higher (the full HTML structure is analyzed)
**Processing Speed:** Slower (semantic extraction)
**Use Case:** Only for small (<1000 rows), complex files where Pandas would fail

For implementation details, see `references/smart_excel_router.py`.
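A minimal version of the HTML conversion step might look like this (a sketch only; production code should also HTML-escape cell values):

```python
from openpyxl import Workbook

def worksheet_to_html(ws):
    """Render a worksheet as an HTML table, emitting rowspan/colspan for
    merged ranges and skipping the cells those ranges shadow."""
    merged, shadowed = {}, set()
    for rng in ws.merged_cells.ranges:
        merged[(rng.min_row, rng.min_col)] = (rng.max_row - rng.min_row + 1,
                                              rng.max_col - rng.min_col + 1)
        for r in range(rng.min_row, rng.max_row + 1):
            for c in range(rng.min_col, rng.max_col + 1):
                if (r, c) != (rng.min_row, rng.min_col):
                    shadowed.add((r, c))
    lines = ["<table>"]
    for r in range(1, ws.max_row + 1):
        cells = []
        for c in range(1, ws.max_column + 1):
            if (r, c) in shadowed:
                continue  # covered by a merged range's anchor cell
            attrs = ""
            if (r, c) in merged:
                rs, cs = merged[(r, c)]
                if rs > 1:
                    attrs += f' rowspan="{rs}"'
                if cs > 1:
                    attrs += f' colspan="{cs}"'
            value = ws.cell(row=r, column=c).value
            cells.append(f"<td{attrs}>{'' if value is None else value}</td>")
        lines.append("<tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)

# Demo: a two-row header where "Region" spans both rows.
wb = Workbook()
ws = wb.active
ws["A1"] = "Region"
ws.merge_cells("A1:A2")
ws["B1"] = "Q1"
ws["B2"] = "Q2"
html = worksheet_to_html(ws)
```

Because the spans survive the conversion, a downstream reader can tell that "Region" governs both rows, which is exactly the relationship a flat CSV dump would lose.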

## Best Practices

### 1. Trust the Scout

Always run the complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on the wrong approach.

### 2. Respect the Row Count Rule

Never attempt HTML mode on files with more than 1000 rows; token limits will cause failures.

### 3. Pandas First for Unknown Files

When in doubt, try Pandas mode first. It fails fast and clearly when the structure is incompatible.

### 4. Cache Analysis Results

If processing multiple sheets from the same file, run the analysis once and cache the results.
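A light way to do this is `functools.lru_cache`; `analyze_sheet` here is a stand-in for the real analyzer call, and the `calls` list exists only to show the cache working:

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=32)
def analyze_sheet(path, sheet_name):
    """Stand-in for running scripts/complexity_analyzer.py on one sheet."""
    calls.append((path, sheet_name))   # track real invocations only
    return ("pandas", False)           # (strategy, is_complex)

analyze_sheet("report.xlsx", "Summary")
analyze_sheet("report.xlsx", "Summary")   # served from the cache
analyze_sheet("report.xlsx", "Detail")
print(len(calls))  # 2 — the repeated call never reran the analysis
```

One caveat: the cache key is the path string, so a file modified on disk will not invalidate the entry; include the file's mtime in the key if that matters.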

### 5. Preserve Original Files

Never modify the original Excel file during analysis or processing.

## Troubleshooting

### File Cannot Be Opened

- **Symptom:** `FileNotFoundError` or permission errors
- **Causes:** Invalid path, file locked by another process, insufficient permissions
- **Solutions:**
  - Verify the file path is correct and the file exists
  - Close the file if it is open in Excel or another application
  - Check read permissions on the file

### Corrupted File Errors

- **Symptom:** `BadZipFile` or `InvalidFileException`
- **Causes:** Incomplete download, file corruption, wrong file extension
- **Solutions:**
  - Re-download or obtain a fresh copy of the file
  - Verify the file is actually in Excel format (not a CSV with an .xlsx extension)
  - Try opening it in Excel to confirm file integrity
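A quick programmatic integrity check exploits the fact that real `.xlsx` files are ZIP containers, so a CSV renamed to `.xlsx` fails it immediately:

```python
import io
import zipfile

def looks_like_xlsx(path_or_buffer):
    """True if the target is a ZIP container, as every real .xlsx must be."""
    return zipfile.is_zipfile(path_or_buffer)

# CSV bytes masquerading under an .xlsx name are caught here, before
# openpyxl ever raises BadZipFile:
print(looks_like_xlsx(io.BytesIO(b"name,qty\napple,3\n")))  # False
```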

### Memory Issues with Large Files

- **Symptom:** `MemoryError` or system slowdown
- **Causes:** File too large for available RAM
- **Solutions:**
  - Use `read_only=True` mode in openpyxl
  - Process the file in chunks with the `chunksize` parameter (supported by `pd.read_csv`; `pd.read_excel` does not offer it)
  - Increase system memory or use a machine with more RAM
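For CSV inputs, `chunksize` keeps only a slice of the file in memory at a time. A small aggregation sketch (the tiny in-memory demo stands in for a large file on disk):

```python
import io
import pandas as pd

def sum_column_in_chunks(csv_source, column, chunksize=10_000):
    """Aggregate one column without materializing the whole file."""
    total = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
    return total

demo = io.StringIO("qty\n1\n2\n3\n4\n5\n")
print(sum_column_in_chunks(demo, "qty", chunksize=2))  # 15
```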

### Encoding Problems

- **Symptom:** Garbled text or `UnicodeDecodeError`
- **Causes:** Non-UTF-8 encoding in the source data
- **Solutions:**
  - Specify the encoding when reading CSV: `pd.read_csv(..., encoding='gbk')`
  - For Excel, data is usually UTF-8; check how the source data was generated
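When the encoding is unknown, a fallback chain is a common pattern. Note that `latin-1` never raises a decode error, so it must come last and may still produce mojibake; the encoding list here is an illustrative assumption:

```python
import io
import pandas as pd

def read_csv_any_encoding(raw_bytes, encodings=("utf-8", "gbk", "latin-1")):
    """Try candidate encodings in order until one decodes cleanly."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# GBK-encoded CSV, common for files exported from Chinese-locale Excel:
raw = "名称,数量\n苹果,3\n".encode("gbk")
df, used = read_csv_any_encoding(raw)
print(used)  # gbk
```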

### HTML Mode Token Overflow

- **Symptom:** Truncated output or API errors
- **Causes:** A complex file exceeds token limits despite the row count check
- **Solutions:**
  - Force Pandas mode even for complex files
  - Split the sheet into smaller ranges and process them separately
  - Extract only essential columns before HTML conversion

### Incorrect Header Detection

- **Symptom:** Wrong columns or shifted data
- **Causes:** Unusual header patterns not caught by sampling
- **Solutions:**
  - Manually specify the header row if it is known
  - Increase the sample size beyond 20 rows
  - Use HTML mode for better structural understanding
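The manual override is a one-liner: both `pd.read_csv` and `pd.read_excel` accept a `header` argument (including a list of row indices for MultiIndex headers). Shown here on an in-memory CSV with a stray title row:

```python
import io
import pandas as pd

# The real headers are on line 1 (0-indexed); skip the title row entirely.
raw = io.StringIO("Quarterly Report,,\nName,Qty,Price\napple,3,1.5\n")
df = pd.read_csv(raw, header=1)
print(list(df.columns))  # ['Name', 'Qty', 'Price']
```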

## Dependencies

Required Python packages:

- `openpyxl` - Metadata scanning and Excel file manipulation
- `pandas` - High-speed data reading and manipulation

## Resources

This skill includes:

- `scripts/complexity_analyzer.py` - Standalone executable for complexity analysis
- `references/smart_excel_router.py` - Complete implementation reference covering both processing paths