multi-source-data-merger
Multi Source Data Merger
Overview
This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.
Workflow
Step 1: Analyze Requirements and Source Files
Before writing any code, thoroughly understand the task:
- Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
- Determine the merge key (e.g., `user_id`, `record_id`) that links records across sources
- Review field mapping requirements - source fields may have different names that map to common output fields
- Understand conflict resolution rules - typically based on source priority ordering
- Identify expected output formats and structure
Important: Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.
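One way to keep format handling explicit is a dispatch table keyed by file extension, so binary formats always go through their proper library. This is a sketch; the extension set and reader choices are assumptions to adapt per task:

```python
from pathlib import Path

import pandas as pd

# Map each extension to a reader; binary formats get their proper engine.
READERS = {
    ".json": pd.read_json,
    ".csv": pd.read_csv,
    ".parquet": lambda p: pd.read_parquet(p, engine="pyarrow"),
    ".xlsx": lambda p: pd.read_excel(p, engine="openpyxl"),
}

def read_source(path):
    """Read a source file using the reader registered for its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)
```

Unknown extensions fail loudly rather than being silently read as text, which guards against the binary-format pitfall below.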
Step 2: Set Up Environment
- Create a Python virtual environment using `uv` or `venv`
- Install required dependencies based on source formats:
  - `pandas` - Core data manipulation
  - `pyarrow` - Parquet file support
  - `openpyxl` - Excel file support
  - `lxml` - XML parsing (if needed)
- Verify installations before proceeding

Example environment setup:

```bash
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
```
Step 3: Write the Merge Script
Structure the script with clear separation of concerns:
- Data reading functions - One per format type
- Field mapping function - Apply column renames
- Data normalization - Handle date formats, type conversions
- Merge logic - Combine records using the merge key
- Conflict resolution - Apply priority rules
- Output generation - Write merged data and conflict reports
Script quality practices:
- Validate syntax before execution: `python -m py_compile script.py`
- Use try-except blocks with informative error messages
- Document assumptions about data formats
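The separation of concerns above can be sketched as a minimal in-memory pipeline. This is an illustration, not a complete implementation: `normalize` is a placeholder hook, and the source names in the usage note are hypothetical:

```python
def normalize(record):
    """Placeholder normalization hook: coerce types, reformat dates, etc."""
    return record

def merge_records(records_by_source, key, priority_order):
    """Merge records across sources by `key`; higher-priority sources win per field."""
    merged = {}
    # Apply lowest-priority sources first so higher-priority values overwrite them.
    for source in reversed(priority_order):
        for record in records_by_source.get(source, []):
            row = merged.setdefault(record[key], {})
            row.update({k: v for k, v in normalize(record).items() if v is not None})
    return list(merged.values())
```

For example, with `priority_order=["crm", "billing"]`, a `crm` value overwrites the `billing` value for the same `user_id`, while fields present only in `billing` are kept.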
Step 4: Execute and Verify
Run a comprehensive verification process:
- Check output file existence at expected locations
- Validate merged data contains expected values
- Verify conflict report structure and content
- Run any provided test suites
Common Pitfalls
Binary File Handling
- Mistake: Attempting to read Parquet/Excel files as text
- Solution: Always use pandas with the appropriate engine (`pyarrow` for Parquet, `openpyxl` for Excel)
Syntax Errors in Scripts
- Mistake: Writing long scripts without validation, leading to indentation or syntax errors
- Solution: Run `python -m py_compile script.py` before execution
Date Format Normalization
- Mistake: Assuming consistent date formats across sources
- Solution: Implement flexible date parsing that handles multiple formats:
  - ISO format: `2024-01-15`
  - US format: `01/15/2024`
  - European format: `15-01-2024`
  - Datetime: `2024-01-15T10:30:00`
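A flexible parser for the formats above can be sketched by trying candidate patterns in order; the format list is an assumption to extend per dataset:

```python
from datetime import datetime

DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # datetime: 2024-01-15T10:30:00
    "%Y-%m-%d",           # ISO: 2024-01-15
    "%m/%d/%Y",           # US: 01/15/2024
    "%d-%m-%Y",           # European: 15-01-2024
]

def parse_date(value):
    """Return an ISO date string, trying each known format in order."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

Note that a value like `01-02-2024` is inherently ambiguous between US and European conventions; the order of `DATE_FORMATS` decides, so confirm each source's convention before trusting it.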
Incomplete Script Output
- Mistake: Writing very long scripts that may get truncated
- Solution: Break into modular functions, verify complete code visibility
Environment Path Issues
- Mistake: Repeating PATH exports in every command
- Solution: Set PATH once in a setup step or use absolute paths to executables
Verification Strategies
Output Validation Checklist
- File existence check:

  ```python
  import os
  assert os.path.exists("output/merged_data.json")
  assert os.path.exists("output/conflict_report.json")
  ```

- Data completeness check:

  ```python
  import json

  with open("output/merged_data.json") as f:
      data = json.load(f)

  # Verify expected record count
  assert len(data) == expected_count
  ```

- Conflict report validation:

  ```python
  with open("output/conflict_report.json") as f:
      conflicts = json.load(f)

  # Verify conflict structure has required fields
  for conflict in conflicts:
      assert "field" in conflict
      assert "selected" in conflict
      assert "sources" in conflict
  ```

- Sample value verification:

  ```python
  # Spot-check specific merged records
  record = next(r for r in data if r["user_id"] == "expected_id")
  assert record["field_name"] == "expected_value"
  ```
Consolidate Verification
Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
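A sketch of such a consolidated script, folding the checklist above into one function; the paths and `expected_count` are hypothetical placeholders:

```python
import json
import os

def verify_outputs(merged_path, report_path, expected_count):
    """Run all output checks in one pass; raise AssertionError on the first failure."""
    assert os.path.exists(merged_path), f"missing {merged_path}"
    assert os.path.exists(report_path), f"missing {report_path}"

    with open(merged_path) as f:
        data = json.load(f)
    assert len(data) == expected_count, f"expected {expected_count} records, got {len(data)}"

    with open(report_path) as f:
        conflicts = json.load(f)
    for conflict in conflicts:
        for field in ("field", "selected", "sources"):
            assert field in conflict, f"conflict entry missing {field!r}"

    print("All checks passed")
```

Running this once replaces several ad hoc shell invocations and gives a single pass/fail signal.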
Edge Cases to Consider
- Empty source files - Handle gracefully with appropriate warnings
- Missing merge keys - Decide whether to skip or error
- Type mismatches - Convert consistently (e.g., user_id as string vs integer)
- Null/None values - Determine handling in conflict resolution
- Unicode/encoding - Specify encoding when reading text-based formats
- Records in some sources but not others - Include partial records or require complete matches
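Two of these cases, missing merge keys and `user_id` type mismatches, can be handled up front before merging. This is a sketch; whether to skip or raise on a missing key is a per-task decision:

```python
def clean_records(records, key="user_id"):
    """Drop records lacking the merge key and coerce the key to string."""
    cleaned, skipped = [], 0
    for record in records:
        if record.get(key) is None:
            skipped += 1  # alternatively: raise ValueError for strict datasets
            continue
        record = dict(record)
        record[key] = str(record[key])  # normalize int vs string keys
        cleaned.append(record)
    return cleaned, skipped
```

Returning the skip count lets the merge script emit the "appropriate warnings" mentioned above instead of failing silently.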
Field Mapping Example
When sources have different field names for the same concept:
```python
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email"
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email"
    }
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
```
Conflict Resolution Pattern
When the same field has different values across sources:
```python
def resolve_conflict(values_by_source, priority_order):
    """
    Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)
    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority"
        }
    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info
    return None, conflict_info
```