csv-excel-merger

CSV/Excel Merger

Intelligently merge multiple CSV or Excel files with automatic column matching and data deduplication.

Instructions

When a user needs to merge CSV or Excel files:
  1. Analyze Input Files:
    • How many files need to be merged?
    • What format (CSV, Excel, TSV)?
    • Are the files provided directly, or do they need to be read from disk?
    • Do columns have the same names across files?
    • What is the primary key (unique identifier)?
  2. Inspect File Structures:
    • Read headers from each file
    • Identify column names and data types
    • Detect encoding (UTF-8, Latin-1, etc.)
    • Check for missing columns
    • Look for duplicate column names
  3. Create Merge Strategy:
    Column Matching:
    • Exact name match: "email" = "email"
    • Case-insensitive: "Email" = "email"
    • Fuzzy match: "E-mail" ≈ "email"
    • Common patterns:
      • "first_name", "firstname", "First Name" → "first_name"
      • "phone", "phone_number", "tel" → "phone"
      • "email", "e-mail", "email_address" → "email"
    Conflict Resolution (when the same record appears in multiple files):
    • Keep first: Use value from first file
    • Keep last: Use value from last file (most recent)
    • Keep longest: Use most complete value
    • Manual review: Flag conflicts for user review
    • Merge: Combine non-conflicting fields
    Deduplication:
    • Identify duplicate rows based on primary key
    • Options: keep first, keep last, keep all, merge values
    • Track source file for each row
  4. Perform Merge:
    ```python
    # Example merge logic
    import pandas as pd

    # Read files
    df1 = pd.read_csv('file1.csv')
    df2 = pd.read_csv('file2.csv')

    # Normalize column names
    df1.columns = df1.columns.str.lower().str.strip()
    df2.columns = df2.columns.str.lower().str.strip()

    # Map similar columns onto the unified schema
    column_mapping = {
        'firstname': 'first_name',
        'e_mail': 'email',
        # ...
    }
    df2 = df2.rename(columns=column_mapping)

    # Merge
    merged = pd.concat([df1, df2], ignore_index=True)

    # Deduplicate on the primary key, keeping the most recent value
    merged = merged.drop_duplicates(subset=['email'], keep='last')

    # Save
    merged.to_csv('merged_output.csv', index=False)
    ```
  5. Format Output:
    📊 CSV/EXCEL MERGER REPORT
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    📁 INPUT FILES
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    File 1: contacts_jan.csv
      Rows: 1,245
      Columns: 8 (name, email, phone, company, ...)
    
    File 2: contacts_feb.csv
      Rows: 987
      Columns: 9 (firstname, lastname, email, mobile, ...)
    
    File 3: leads_export.xlsx
      Rows: 2,103
      Columns: 12 (full_name, email_address, phone, ...)
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔄 COLUMN MAPPING
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Unified Schema:
    • first_name ← [firstname, first name, fname]
    • last_name ← [lastname, last name, lname]
    • email ← [email, e-mail, email_address]
    • phone ← [phone, mobile, phone_number, tel]
    • company ← [company, organization, org]
    • title ← [title, job_title, position]
    • source ← [file origin tracking]
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔍 MERGE ANALYSIS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Total rows before merge: 4,335
    Duplicate records found: 892
    Conflicts detected: 47
    
    Deduplication Strategy: Keep most recent (by source file date)
    Primary Key: email
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ⚠️ CONFLICTS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Record: john.doe@example.com
      File 1 phone: (555) 123-4567
      File 2 phone: (555) 987-6543
      Resolution: Kept most recent (File 2)
    
    [List top 10 conflicts]
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ✅ MERGE RESULTS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Output File: merged_contacts.csv
    Total Rows: 3,443
    Columns: 7
    Duplicates Removed: 892
    
    Breakdown by Source:
    • contacts_jan.csv: 1,245 rows (898 unique)
    • contacts_feb.csv: 987 rows (521 unique)
    • leads_export.xlsx: 2,103 rows (2,024 unique)
    
    Data Quality:
    • Email completeness: 98.2%
    • Phone completeness: 87.5%
    • Company completeness: 91.3%
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    💡 RECOMMENDATIONS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    • Review 47 conflict records manually
    • Standardize phone number format
    • Fill missing company names (8.7% incomplete)
    • Export conflicts to: conflicts_review.csv
  6. Handle Special Cases:
    Multiple Primary Keys:
    • Use compound keys: (email + company)
    • Offer options when ambiguous
    Different Data Types:
    • Convert dates to standard format
    • Normalize phone numbers
    • Standardize country codes
    • Clean whitespace and casing
    Missing Columns:
    • Fill with empty values
    • Flag missing data
    • Offer to create new columns
    Large Files:
    • Use chunking for files > 100MB
    • Show progress indicator
    • Estimate memory usage
  7. Generate Code: Provide Python/pandas script that:
    • Reads all files
    • Performs intelligent column matching
    • Deduplicates based on strategy
    • Resolves conflicts
    • Saves merged output
    • Generates detailed report
  8. Export Options:
    • CSV (UTF-8)
    • Excel (.xlsx)
    • JSON
    • SQL INSERT statements
    • Parquet (for large datasets)
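
The column-matching rules in step 3 can be sketched as a small normalizer with a fuzzy fallback. The canonical schema and synonym table below are illustrative assumptions, not a fixed specification:

```python
import difflib
import re

# Hypothetical unified schema and synonym table for illustration
CANONICAL = ["first_name", "last_name", "email", "phone", "company", "title"]
SYNONYMS = {
    "firstname": "first_name", "fname": "first_name",
    "lastname": "last_name", "lname": "last_name",
    "e_mail": "email", "email_address": "email",
    "phone_number": "phone", "mobile": "phone", "tel": "phone",
    "organization": "company", "org": "company",
    "job_title": "title", "position": "title",
}

def match_column(raw):
    """Map a raw header to a canonical name, or None if no match."""
    # Normalize: lowercase, strip, collapse spaces/hyphens to underscores
    key = re.sub(r"[\s\-]+", "_", raw.strip().lower())
    if key in CANONICAL:
        return key
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fuzzy fallback for spelling variants close to a canonical name
    close = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=0.8)
    return close[0] if close else None

print(match_column("E-mail"))  # email
```

Exact and synonym matches resolve first, so fuzzy matching only has to absorb spelling variants; anything below the cutoff returns None and can be flagged for manual review.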
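For the large-file case in step 6, a chunked variant avoids loading everything at once. This is a minimal sketch with a keep-first policy; the 50,000-row chunk size and function name are illustrative:

```python
import pandas as pd

def merge_large(paths, output, key="email", chunksize=50_000):
    """Stream each input CSV in chunks, keeping the first row seen per key."""
    seen = set()
    first_chunk = True
    for path in paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            # Normalize headers the same way as the in-memory version
            chunk.columns = chunk.columns.str.lower().str.strip()
            # Drop rows whose key was already written (keep-first policy),
            # then dedupe within the chunk itself
            chunk = chunk[~chunk[key].isin(seen)]
            chunk = chunk.drop_duplicates(subset=[key], keep="first")
            seen.update(chunk[key].dropna())
            chunk.to_csv(output, mode="w" if first_chunk else "a",
                         header=first_chunk, index=False)
            first_chunk = False
```

A keep-last policy is harder to stream, since a later file can override rows already written; one workaround is to pass the files newest-first so the first value seen is the most recent.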

Example Triggers

  • "Merge these three CSV files"
  • "Combine multiple Excel sheets into one file"
  • "Deduplicate and merge customer data"
  • "Join spreadsheets with different column names"
  • "Consolidate contact lists from different sources"

Best Practices

Column Matching:
  • Use fuzzy matching for similar names
  • Maintain original column name mapping report
  • Allow manual override of auto-matching
Data Quality:
  • Trim whitespace
  • Standardize formats (phone, email, dates)
  • Detect and flag invalid data
  • Preserve data types
Performance:
  • Use chunking for large files
  • Process in batches
  • Show progress for long operations
  • Optimize memory usage
Transparency:
  • Log all merge decisions
  • Track source file for each row
  • Report conflicts and resolutions
  • Generate detailed merge report
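
The whitespace and format-standardization practices above might look like this in code. The 10-digit (US-style) phone formatting is an illustrative assumption; adapt the rules to the target locale:

```python
import re

def clean_whitespace(value):
    """Trim and collapse internal runs of whitespace."""
    return re.sub(r"\s+", " ", value).strip()

def normalize_phone(raw):
    """Keep digits only; format 10-digit numbers as (XXX) XXX-XXXX."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return digits  # leave other lengths as a bare digit string for review

print(normalize_phone("555.123.4567"))  # (555) 123-4567
```

Applying such normalizers before deduplication matters: "(555) 123-4567" and "555.123.4567" would otherwise count as distinct values and trigger a spurious conflict.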

Output Quality

Ensure every merge operation:
  • Intelligently match columns
  • Handle different schemas
  • Deduplicate properly
  • Preserve data integrity
  • Flag conflicts for review
  • Generate comprehensive report
  • Maintain data quality
  • Track data lineage (source)
  • Handle edge cases gracefully
  • Provide validation statistics
Generate clean, deduplicated merged files with full transparency and data quality checks.