csv-excel-merger

CSV/Excel Merger

Intelligently merge multiple CSV or Excel files with automatic column matching and data deduplication.

Instructions

When a user needs to merge CSV or Excel files:
  1. Analyze Input Files:
    • How many files need to be merged?
    • What format (CSV, Excel, TSV)?
    • Are the files provided directly, or do they need to be read from disk?
    • Do columns have the same names across files?
    • What is the primary key (unique identifier)?
  2. Inspect File Structures:
    • Read headers from each file
    • Identify column names and data types
    • Detect encoding (UTF-8, Latin-1, etc.)
    • Check for missing columns
    • Look for duplicate column names
  3. Create Merge Strategy:
    Column Matching:
    • Exact name match: "email" = "email"
    • Case-insensitive: "Email" = "email"
    • Fuzzy match: "E-mail" ≈ "email"
    • Common patterns:
      • "first_name", "firstname", "First Name" → "first_name"
      • "phone", "phone_number", "tel" → "phone"
      • "email", "e-mail", "email_address" → "email"
    Conflict Resolution (when the same record appears in multiple files):
    • Keep first: Use value from first file
    • Keep last: Use value from last file (most recent)
    • Keep longest: Use most complete value
    • Manual review: Flag conflicts for user review
    • Merge: Combine non-conflicting fields
    Deduplication:
    • Identify duplicate rows based on primary key
    • Options: keep first, keep last, keep all, merge values
    • Track source file for each row
  4. Perform Merge:
    ```python
    # Example merge logic
    import pandas as pd

    # Read files
    df1 = pd.read_csv('file1.csv')
    df2 = pd.read_csv('file2.csv')

    # Normalize column names
    df1.columns = df1.columns.str.lower().str.strip()
    df2.columns = df2.columns.str.lower().str.strip()

    # Map similar columns onto the unified schema
    column_mapping = {
        'firstname': 'first_name',
        'e_mail': 'email',
        # ...
    }
    df2 = df2.rename(columns=column_mapping)

    # Merge
    merged = pd.concat([df1, df2], ignore_index=True)

    # Deduplicate on the primary key, keeping the most recent value
    merged = merged.drop_duplicates(subset=['email'], keep='last')

    # Save
    merged.to_csv('merged_output.csv', index=False)
    ```
  5. Format Output:
    📊 CSV/EXCEL MERGER REPORT
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    📁 INPUT FILES
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    File 1: contacts_jan.csv
      Rows: 1,245
      Columns: 8 (name, email, phone, company, ...)
    
    File 2: contacts_feb.csv
      Rows: 987
      Columns: 9 (firstname, lastname, email, mobile, ...)
    
    File 3: leads_export.xlsx
      Rows: 2,103
      Columns: 12 (full_name, email_address, phone, ...)
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔄 COLUMN MAPPING
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Unified Schema:
    • first_name ← [firstname, first name, fname]
    • last_name ← [lastname, last name, lname]
    • email ← [email, e-mail, email_address]
    • phone ← [phone, mobile, phone_number, tel]
    • company ← [company, organization, org]
    • title ← [title, job_title, position]
    • source ← [file origin tracking]
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔍 MERGE ANALYSIS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Total rows before merge: 4,335
    Duplicate records found: 892
    Conflicts detected: 47
    
    Deduplication Strategy: Keep most recent (by source file date)
    Primary Key: email
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ⚠️ CONFLICTS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Record: john.doe@example.com
      File 1 phone: (555) 123-4567
      File 2 phone: (555) 987-6543
      Resolution: Kept most recent (File 2)
    
    [List top 10 conflicts]
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ✅ MERGE RESULTS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Output File: merged_contacts.csv
    Total Rows: 3,443
    Columns: 7
    Duplicates Removed: 892
    
    Breakdown by Source:
    • contacts_jan.csv: 1,245 rows (898 unique)
    • contacts_feb.csv: 987 rows (521 unique)
    • leads_export.xlsx: 2,103 rows (2,024 unique)
    
    Data Quality:
    • Email completeness: 98.2%
    • Phone completeness: 87.5%
    • Company completeness: 91.3%
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    💡 RECOMMENDATIONS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    • Review 47 conflict records manually
    • Standardize phone number format
    • Fill missing company names (8.7% incomplete)
    • Export conflicts to: conflicts_review.csv
  6. Handle Special Cases:
    Multiple Primary Keys:
    • Use compound keys: (email + company)
    • Offer options when ambiguous
    Different Data Types:
    • Convert dates to standard format
    • Normalize phone numbers
    • Standardize country codes
    • Clean whitespace and casing
    Missing Columns:
    • Fill with empty values
    • Flag missing data
    • Offer to create new columns
    Large Files:
    • Use chunking for files > 100MB
    • Show progress indicator
    • Estimate memory usage
  7. Generate Code: Provide Python/pandas script that:
    • Reads all files
    • Performs intelligent column matching
    • Deduplicates based on strategy
    • Resolves conflicts
    • Saves merged output
    • Generates detailed report
  8. Export Options:
    • CSV (UTF-8)
    • Excel (.xlsx)
    • JSON
    • SQL INSERT statements
    • Parquet (for large datasets)
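
The column-matching rules in step 3 can be sketched as a small normalizer with a fuzzy fallback. The canonical schema and synonym table below are illustrative assumptions, not a fixed specification:

```python
import difflib
import re

# Hypothetical unified schema and synonym table for illustration
CANONICAL = ["first_name", "last_name", "email", "phone", "company", "title"]
SYNONYMS = {
    "firstname": "first_name", "fname": "first_name",
    "lastname": "last_name", "lname": "last_name",
    "e_mail": "email", "email_address": "email",
    "phone_number": "phone", "mobile": "phone", "tel": "phone",
    "organization": "company", "org": "company",
    "job_title": "title", "position": "title",
}

def match_column(raw):
    """Map a raw header to a canonical name, or None if no match."""
    # Normalize: lowercase, strip, collapse spaces/hyphens to underscores
    key = re.sub(r"[\s\-]+", "_", raw.strip().lower())
    if key in CANONICAL:
        return key
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fuzzy fallback for spelling variants close to a canonical name
    close = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=0.8)
    return close[0] if close else None

print(match_column("E-mail"))  # email
```

Exact and synonym matches resolve first, so fuzzy matching only has to absorb spelling variants; anything below the cutoff returns None and can be flagged for manual review.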
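For the large-file case in step 6, a chunked variant avoids loading everything at once. This is a minimal sketch with a keep-first policy; the 50,000-row chunk size and function name are illustrative:

```python
import pandas as pd

def merge_large(paths, output, key="email", chunksize=50_000):
    """Stream each input CSV in chunks, keeping the first row seen per key."""
    seen = set()
    first_chunk = True
    for path in paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            # Normalize headers the same way as the in-memory version
            chunk.columns = chunk.columns.str.lower().str.strip()
            # Drop rows whose key was already written (keep-first policy),
            # then dedupe within the chunk itself
            chunk = chunk[~chunk[key].isin(seen)]
            chunk = chunk.drop_duplicates(subset=[key], keep="first")
            seen.update(chunk[key].dropna())
            chunk.to_csv(output, mode="w" if first_chunk else "a",
                         header=first_chunk, index=False)
            first_chunk = False
```

A keep-last policy is harder to stream, since a later file can override rows already written; one workaround is to pass the files newest-first so the first value seen is the most recent.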

Example Triggers

  • "Merge these three CSV files"
  • "Combine multiple Excel sheets into one file"
  • "Deduplicate and merge customer data"
  • "Join spreadsheets with different column names"
  • "Consolidate contact lists from different sources"

Best Practices

Column Matching:
  • Use fuzzy matching for similar names
  • Maintain original column name mapping report
  • Allow manual override of auto-matching
Data Quality:
  • Trim whitespace
  • Standardize formats (phone, email, dates)
  • Detect and flag invalid data
  • Preserve data types
Performance:
  • Use chunking for large files
  • Process in batches
  • Show progress for long operations
  • Optimize memory usage
Transparency:
  • Log all merge decisions
  • Track source file for each row
  • Report conflicts and resolutions
  • Generate detailed merge report
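
The whitespace and format-standardization practices above might look like this in code. The 10-digit (US-style) phone formatting is an illustrative assumption; adapt the rules to the target locale:

```python
import re

def clean_whitespace(value):
    """Trim and collapse internal runs of whitespace."""
    return re.sub(r"\s+", " ", value).strip()

def normalize_phone(raw):
    """Keep digits only; format 10-digit numbers as (XXX) XXX-XXXX."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return digits  # leave other lengths as a bare digit string for review

print(normalize_phone("555.123.4567"))  # (555) 123-4567
```

Applying such normalizers before deduplication matters: "(555) 123-4567" and "555.123.4567" would otherwise count as distinct values and trigger a spurious conflict.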

Output Quality

Ensure every merge operation:
  • Intelligently match columns
  • Handle different schemas
  • Deduplicate properly
  • Preserve data integrity
  • Flag conflicts for review
  • Generate comprehensive report
  • Maintain data quality
  • Track data lineage (source)
  • Handle edge cases gracefully
  • Provide validation statistics
Generate clean, deduplicated merged files with full transparency and data quality checks.