csv-excel-merger
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseCSV/Excel Merger
CSV/Excel 文件合并工具
Intelligently merge multiple CSV or Excel files with automatic column matching and data deduplication.
可通过自动列匹配和数据去重功能,智能合并多个CSV或Excel文件。
Instructions
操作说明
When a user needs to merge CSV or Excel files:
-
Analyze Input Files:
- How many files need to be merged?
- What format (CSV, Excel, TSV)?
- Are the files provided or need to be read from disk?
- Do columns have the same names across files?
- What is the primary key (unique identifier)?
-
Inspect File Structures:
- Read headers from each file
- Identify column names and data types
- Detect encoding (UTF-8, Latin-1, etc.)
- Check for missing columns
- Look for duplicate column names
-
Create Merge Strategy:Column Matching:
- Exact name match: "email" = "email"
- Case-insensitive: "Email" = "email"
- Fuzzy match: "E-mail" ≈ "email"
- Common patterns:
- "first_name", "firstname", "First Name" → "first_name"
- "phone", "phone_number", "tel" → "phone"
- "email", "e-mail", "email_address" → "email"
Conflict Resolution (when same record appears in multiple files):- Keep first: Use value from first file
- Keep last: Use value from last file (most recent)
- Keep longest: Use most complete value
- Manual review: Flag conflicts for user review
- Merge: Combine non-conflicting fields
Deduplication:- Identify duplicate rows based on primary key
- Options: keep first, keep last, keep all, merge values
- Track source file for each row
-
Perform Merge:python
# Example merge logic import pandas as pd # Read files df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') # Normalize column names df1.columns = df1.columns.str.lower().str.strip() df2.columns = df2.columns.str.lower().str.strip() # Map similar columns column_mapping = { 'firstname': 'first_name', 'e_mail': 'email', # ... } df2 = df2.rename(columns=column_mapping) # Merge merged = pd.concat([df1, df2], ignore_index=True) # Deduplicate merged = merged.drop_duplicates(subset=['email'], keep='last') # Save merged.to_csv('merged_output.csv', index=False) -
Format Output:
📊 CSV/EXCEL MERGER REPORT ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 📁 INPUT FILES ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ File 1: contacts_jan.csv Rows: 1,245 Columns: 8 (name, email, phone, company, ...) File 2: contacts_feb.csv Rows: 987 Columns: 9 (firstname, lastname, email, mobile, ...) File 3: leads_export.xlsx Rows: 2,103 Columns: 12 (full_name, email_address, phone, ...) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔄 COLUMN MAPPING ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Unified Schema: • first_name ← [firstname, first name, fname] • last_name ← [lastname, last name, lname] • email ← [email, e-mail, email_address] • phone ← [phone, mobile, phone_number, tel] • company ← [company, organization, org] • title ← [title, job_title, position] • source ← [file origin tracking] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔍 MERGE ANALYSIS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Total rows before merge: 4,335 Duplicate records found: 892 Conflicts detected: 47 Deduplication Strategy: Keep most recent (by source file date) Primary Key: email ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⚠️ CONFLICTS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Record: john.doe@example.com File 1 phone: (555) 123-4567 File 2 phone: (555) 987-6543 Resolution: Kept most recent (File 2) [List top 10 conflicts] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✅ MERGE RESULTS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Output File: merged_contacts.csv Total Rows: 3,443 Columns: 7 Duplicates Removed: 892 Breakdown by Source: • contacts_jan.csv: 1,245 rows (398 unique) • contacts_feb.csv: 987 rows (521 unique) • leads_export.xlsx: 2,103 rows (2,524 unique) Data Quality: • Email completeness: 98.2% • Phone completeness: 87.5% • Company completeness: 91.3% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 💡 RECOMMENDATIONS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • Review 47 conflict records manually • Standardize phone number format • Fill missing company names (8.7% incomplete) • Export conflicts to: conflicts_review.csv -
Handle Special Cases:Multiple Primary Keys:
- Use compound keys: (email + company)
- Offer options when ambiguous
Different Data Types:- Convert dates to standard format
- Normalize phone numbers
- Standardize country codes
- Clean whitespace and casing
Missing Columns:- Fill with empty values
- Flag missing data
- Offer to create new columns
Large Files:- Use chunking for files > 100MB
- Show progress indicator
- Estimate memory usage
-
Generate Code: Provide Python/pandas script that:
- Reads all files
- Performs intelligent column matching
- Deduplicates based on strategy
- Resolves conflicts
- Saves merged output
- Generates detailed report
-
Export Options:
- CSV (UTF-8)
- Excel (.xlsx)
- JSON
- SQL INSERT statements
- Parquet (for large datasets)
当用户需要合并CSV或Excel文件时:
-
分析输入文件:
- 需要合并多少个文件?
- 文件格式是什么(CSV、Excel、TSV)?
- 文件是已提供还是需要从磁盘读取?
- 不同文件中的列名是否一致?
- 主键(唯一标识符)是什么?
-
检查文件结构:
- 读取每个文件的表头
- 识别列名和数据类型
- 检测编码格式(UTF-8、Latin-1等)
- 检查是否存在缺失列
- 查找重复的列名
-
制定合并策略:列匹配:
- 精确名称匹配:"email" = "email"
- 大小写不敏感匹配:"Email" = "email"
- 模糊匹配:"E-mail" ≈ "email"
- 常见匹配规则:
- "first_name", "firstname", "First Name" → "first_name"
- "phone", "phone_number", "tel" → "phone"
- "email", "e-mail", "email_address" → "email"
冲突解决(同一记录出现在多个文件中时):- 保留第一条:使用第一个文件中的值
- 保留最后一条:使用最后一个文件中的值(最新数据)
- 保留最完整值:使用信息最完整的条目
- 人工审核:标记冲突供用户查看
- 合并字段:整合无冲突的字段
数据去重:- 基于主键识别重复行
- 可选策略:保留第一条、保留最后一条、全部保留、合并值
- 跟踪每一行的来源文件
-
执行合并:python
# Example merge logic import pandas as pd # Read files df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') # Normalize column names df1.columns = df1.columns.str.lower().str.strip() df2.columns = df2.columns.str.lower().str.strip() # Map similar columns column_mapping = { 'firstname': 'first_name', 'e_mail': 'email', # ... } df2 = df2.rename(columns=column_mapping) # Merge merged = pd.concat([df1, df2], ignore_index=True) # Deduplicate merged = merged.drop_duplicates(subset=['email'], keep='last') # Save merged.to_csv('merged_output.csv', index=False) -
生成输出报告:
📊 CSV/EXCEL 合并报告 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 📁 输入文件 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 文件1: contacts_jan.csv 行数: 1,245 列数: 8(name, email, phone, company, ...) 文件2: contacts_feb.csv 行数: 987 列数: 9(firstname, lastname, email, mobile, ...) 文件3: leads_export.xlsx 行数: 2,103 列数: 12(full_name, email_address, phone, ...) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔄 列映射 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 统一架构: • first_name ← [firstname, first name, fname] • last_name ← [lastname, last name, lname] • email ← [email, e-mail, email_address] • phone ← [phone, mobile, phone_number, tel] • company ← [company, organization, org] • title ← [title, job_title, position] • source ← [文件来源跟踪] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔍 合并分析 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 合并前行数总计: 4,335 发现重复记录: 892 检测到冲突: 47 去重策略: 保留最新记录(按来源文件日期) 主键: email ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⚠️ 冲突记录 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 记录: john.doe@example.com 文件1电话: (555) 123-4567 文件2电话: (555) 987-6543 解决方案: 保留最新记录(文件2) [显示前10条冲突记录] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✅ 合并结果 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 输出文件: merged_contacts.csv 总行数: 3,443 列数: 7 移除重复数: 892 来源分布: • contacts_jan.csv: 1,245行(398条唯一记录) • contacts_feb.csv: 987行(521条唯一记录) • leads_export.xlsx: 2,103行(2,524条唯一记录) 数据质量: • Email完整性: 98.2% • 电话完整性: 87.5% • 公司信息完整性: 91.3% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 💡 建议 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • 人工审核47条冲突记录 • 标准化电话号码格式 • 补充缺失的公司名称(8.7%不完整) • 将冲突记录导出至: conflicts_review.csv -
处理特殊情况:多主键:
- 使用复合键:(email + company)
- 存在歧义时提供可选方案
不同数据类型:- 将日期转换为标准格式
- 标准化电话号码
- 统一国家代码
- 清理空格和大小写
缺失列:- 用空值填充
- 标记缺失数据
- 提供创建新列的选项
大文件:- 对大于100MB的文件使用分块处理
- 显示进度指示器
- 预估内存使用量
-
生成代码: 提供Python/pandas脚本,实现:
- 读取所有文件
- 执行智能列匹配
- 根据策略去重
- 解决冲突
- 保存合并后的输出文件
- 生成详细报告
-
导出选项:
- CSV (UTF-8)
- Excel (.xlsx)
- JSON
- SQL INSERT语句
- Parquet(适用于大型数据集)
Example Triggers
触发示例
- "Merge these three CSV files"
- "Combine multiple Excel sheets into one file"
- "Deduplicate and merge customer data"
- "Join spreadsheets with different column names"
- "Consolidate contact lists from different sources"
- "合并这三个CSV文件"
- "将多个Excel工作表合并为一个文件"
- "去重并合并客户数据"
- "合并列名不同的电子表格"
- "整合来自不同来源的联系人列表"
Best Practices
最佳实践
Column Matching:
- Use fuzzy matching for similar names
- Maintain original column name mapping report
- Allow manual override of auto-matching
Data Quality:
- Trim whitespace
- Standardize formats (phone, email, dates)
- Detect and flag invalid data
- Preserve data types
Performance:
- Use chunking for large files
- Process in batches
- Show progress for long operations
- Optimize memory usage
Transparency:
- Log all merge decisions
- Track source file for each row
- Report conflicts and resolutions
- Generate detailed merge report
列匹配:
- 对相似名称使用模糊匹配
- 保留原始列名映射报告
- 允许手动覆盖自动匹配结果
数据质量:
- 去除空格
- 标准化格式(电话、邮箱、日期)
- 检测并标记无效数据
- 保留数据类型
性能:
- 对大文件使用分块处理
- 批量处理
- 为长时间操作显示进度
- 优化内存使用
透明度:
- 记录所有合并决策
- 跟踪每一行的来源文件
- 报告冲突及解决方案
- 生成详细的合并报告
Output Quality
输出质量
Ensure merges:
- Intelligently match columns
- Handle different schemas
- Deduplicate properly
- Preserve data integrity
- Flag conflicts for review
- Generate comprehensive report
- Maintain data quality
- Track data lineage (source)
- Handle edge cases gracefully
- Provide validation statistics
Generate clean, deduplicated merged files with full transparency and data quality checks.
确保合并操作:
- 智能匹配列
- 处理不同架构
- 正确去重
- 保留数据完整性
- 标记冲突供审核
- 生成全面报告
- 维持数据质量
- 跟踪数据血缘(来源)
- 优雅处理边缘情况
- 提供验证统计数据
生成干净、去重的合并文件,具备完全的透明度和数据质量检查。