multi-source-data-merger

Multi Source Data Merger

Overview

This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.

Workflow

Step 1: Analyze Requirements and Source Files

Before writing any code, thoroughly understand the task:
  1. Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
  2. Determine the merge key (e.g., `user_id`, `record_id`) that links records across sources
  3. Review field mapping requirements - source fields may have different names that map to common output fields
  4. Understand conflict resolution rules - typically based on source priority ordering
  5. Identify expected output formats and structure
Important: Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.
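To keep format handling in one place, an extension-based dispatcher is one way to route each source to the right reader. This is a sketch assuming pandas covers every format in play; the `READERS` table and `read_source` name are illustrative:

```python
from pathlib import Path

import pandas as pd

# Extension-to-reader dispatch; extend for the formats you actually have.
READERS = {
    ".json": pd.read_json,
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,  # requires pyarrow
    ".xlsx": pd.read_excel,       # requires openpyxl
}

def read_source(path):
    """Read a source file with the reader matching its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)
```

Binary formats never pass through a text reader this way, which directly enforces the warning above.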

Step 2: Set Up Environment

  1. Create a Python virtual environment using `uv` or `venv`
  2. Install required dependencies based on source formats:
    • `pandas` - Core data manipulation
    • `pyarrow` - Parquet file support
    • `openpyxl` - Excel file support
    • `lxml` - XML parsing (if needed)
  3. Verify installations before proceeding
Example environment setup:
```bash
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
```

Step 3: Write the Merge Script

Structure the script with clear separation of concerns:
  1. Data reading functions - One per format type
  2. Field mapping function - Apply column renames
  3. Data normalization - Handle date formats, type conversions
  4. Merge logic - Combine records using the merge key
  5. Conflict resolution - Apply priority rules
  6. Output generation - Write merged data and conflict reports
Script quality practices:
  • Validate syntax before execution: `python -m py_compile script.py`
  • Use try-except blocks with informative error messages
  • Document assumptions about data formats
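As an illustration of the merge logic in item 4, a last-writer-wins pass over plain record dicts might look like the sketch below. The function name `merge_records` and its arguments are illustrative, and a real script would layer the reading, mapping, normalization, and reporting steps around it:

```python
from collections import defaultdict

def merge_records(records_by_source, merge_key, priority_order):
    """Combine records from multiple sources keyed on merge_key.

    records_by_source: dict mapping source name -> list of record dicts.
    priority_order: source names from highest to lowest priority.
    Fields from higher-priority sources overwrite lower-priority ones.
    """
    merged = defaultdict(dict)
    # Apply sources from lowest to highest priority, so the last (highest) write wins.
    for source in reversed(priority_order):
        for record in records_by_source.get(source, []):
            key = record[merge_key]
            for field, value in record.items():
                if value is not None:  # never let a null overwrite a real value
                    merged[key][field] = value
    return dict(merged)
```

Keeping the merge pure (dicts in, dict out) makes it easy to unit-test independently of file I/O.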

Step 4: Execute and Verify

Run a comprehensive verification process:
  1. Check output file existence at expected locations
  2. Validate merged data contains expected values
  3. Verify conflict report structure and content
  4. Run any provided test suites

Common Pitfalls

Binary File Handling

  • Mistake: Attempting to read Parquet/Excel files as text
  • Solution: Always use pandas with the appropriate engine (`pyarrow` for Parquet, `openpyxl` for Excel)

Syntax Errors in Scripts

  • Mistake: Writing long scripts without validation, leading to indentation or syntax errors
  • Solution: Run `python -m py_compile script.py` before execution

Date Format Normalization

  • Mistake: Assuming consistent date formats across sources
  • Solution: Implement flexible date parsing that handles multiple formats:
    • ISO format: `2024-01-15`
    • US format: `01/15/2024`
    • European format: `15-01-2024`
    • Datetime: `2024-01-15T10:30:00`
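A minimal sketch of such a parser, trying each known format in order. The format list is illustrative; note that the separators used here keep the US and European styles unambiguous:

```python
from datetime import datetime

# Candidate formats, tried in order; extend as new source formats appear.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # datetime: 2024-01-15T10:30:00
    "%Y-%m-%d",           # ISO: 2024-01-15
    "%m/%d/%Y",           # US: 01/15/2024
    "%d-%m-%Y",           # European: 15-01-2024
]

def normalize_date(value):
    """Parse a date string in any known format; return ISO 'YYYY-MM-DD'."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

If your sources mix `01/02/2024`-style dates with both day-first and month-first conventions, format order alone cannot disambiguate them; resolve that per source instead.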

Incomplete Script Output

  • Mistake: Writing very long scripts that may get truncated
  • Solution: Break into modular functions, verify complete code visibility

Environment Path Issues

  • Mistake: Repeating PATH exports in every command
  • Solution: Set PATH once in a setup step or use absolute paths to executables

Verification Strategies

Output Validation Checklist

  1. File existence check:
    ```python
    import os
    assert os.path.exists("output/merged_data.json")
    assert os.path.exists("output/conflict_report.json")
    ```
  2. Data completeness check:
    ```python
    import json
    with open("output/merged_data.json") as f:
        data = json.load(f)
    # Verify expected record count
    assert len(data) == expected_count
    ```
  3. Conflict report validation:
    ```python
    with open("output/conflict_report.json") as f:
        conflicts = json.load(f)
    # Verify conflict structure has required fields
    for conflict in conflicts:
        assert "field" in conflict
        assert "selected" in conflict
        assert "sources" in conflict
    ```
  4. Sample value verification:
    ```python
    # Spot-check specific merged records
    record = next(r for r in data if r["user_id"] == "expected_id")
    assert record["field_name"] == "expected_value"
    ```

Consolidate Verification

Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
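One way to fold the checklist into a single script is sketched below; the file names follow the earlier examples and should be adapted to your actual output layout:

```python
import json
import os

def verify_output(output_dir, expected_count):
    """Run all output checks in one pass; raises AssertionError on the first failure."""
    merged_path = os.path.join(output_dir, "merged_data.json")
    report_path = os.path.join(output_dir, "conflict_report.json")

    # 1. File existence
    assert os.path.exists(merged_path), f"missing {merged_path}"
    assert os.path.exists(report_path), f"missing {report_path}"

    # 2. Data completeness
    with open(merged_path) as f:
        data = json.load(f)
    assert len(data) == expected_count, f"expected {expected_count} records, got {len(data)}"

    # 3. Conflict report structure
    with open(report_path) as f:
        conflicts = json.load(f)
    for conflict in conflicts:
        for required in ("field", "selected", "sources"):
            assert required in conflict, f"conflict entry missing {required!r}"

    print("All verification checks passed.")
```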

Edge Cases to Consider

  • Empty source files - Handle gracefully with appropriate warnings
  • Missing merge keys - Decide whether to skip or error
  • Type mismatches - Convert consistently (e.g., user_id as string vs integer)
  • Null/None values - Determine handling in conflict resolution
  • Unicode/encoding - Specify encoding when reading text-based formats
  • Records in some sources but not others - Include partial records or require complete matches
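For the type-mismatch and null bullets, a small key normalizer can keep `user_id` comparable across sources. This sketch assumes pandas-style missing values, where integer keys with gaps often arrive as floats or NaN; `normalize_key` is an illustrative name:

```python
def normalize_key(value):
    """Coerce a merge key to a canonical string, or None if missing."""
    if value is None or (isinstance(value, float) and value != value):  # NaN check
        return None
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # 42.0 -> "42" (pandas upcasts int columns with NaNs)
    return str(value).strip()
```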

Field Mapping Example

When sources have different field names for the same concept:
```python
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email"
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email"
    }
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
```

Conflict Resolution Pattern

When the same field has different values across sources:
```python
def resolve_conflict(values_by_source, priority_order):
    """
    Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)

    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority"
        }

    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info

    return None, conflict_info
```