multi-source-data-merger

Multi Source Data Merger

Overview

This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.

Workflow

Step 1: Analyze Requirements and Source Files

Before writing any code, thoroughly understand the task:
  1. Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
  2. Determine the merge key (e.g., `user_id`, `record_id`) that links records across sources
  3. Review field mapping requirements - source fields may have different names that map to common output fields
  4. Understand conflict resolution rules - typically based on source priority ordering
  5. Identify expected output formats and structure
Important: Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.
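To keep format handling in one place, an extension-based dispatcher is one way to route each source to the right reader. This is a sketch assuming pandas covers every format in play; the `READERS` table and `read_source` name are illustrative:

```python
from pathlib import Path

import pandas as pd

# Extension-to-reader dispatch; extend for the formats you actually have.
READERS = {
    ".json": pd.read_json,
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,  # requires pyarrow
    ".xlsx": pd.read_excel,       # requires openpyxl
}

def read_source(path):
    """Read a source file with the reader matching its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)
```

Binary formats never pass through a text reader this way, which directly enforces the warning above.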

Step 2: Set Up Environment

  1. Create a Python virtual environment using `uv` or `venv`
  2. Install required dependencies based on source formats:
    • `pandas` - Core data manipulation
    • `pyarrow` - Parquet file support
    • `openpyxl` - Excel file support
    • `lxml` - XML parsing (if needed)
  3. Verify installations before proceeding
Example environment setup:
```bash
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
```

Step 3: Write the Merge Script

Structure the script with clear separation of concerns:
  1. Data reading functions - One per format type
  2. Field mapping function - Apply column renames
  3. Data normalization - Handle date formats, type conversions
  4. Merge logic - Combine records using the merge key
  5. Conflict resolution - Apply priority rules
  6. Output generation - Write merged data and conflict reports
Script quality practices:
  • Validate syntax before execution: `python -m py_compile script.py`
  • Use try-except blocks with informative error messages
  • Document assumptions about data formats
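As an illustration of the merge logic in item 4, a last-writer-wins pass over plain record dicts might look like the sketch below. The function name `merge_records` and its arguments are illustrative, and a real script would layer the reading, mapping, normalization, and reporting steps around it:

```python
from collections import defaultdict

def merge_records(records_by_source, merge_key, priority_order):
    """Combine records from multiple sources keyed on merge_key.

    records_by_source: dict mapping source name -> list of record dicts.
    priority_order: source names from highest to lowest priority.
    Fields from higher-priority sources overwrite lower-priority ones.
    """
    merged = defaultdict(dict)
    # Apply sources from lowest to highest priority, so the last (highest) write wins.
    for source in reversed(priority_order):
        for record in records_by_source.get(source, []):
            key = record[merge_key]
            for field, value in record.items():
                if value is not None:  # never let a null overwrite a real value
                    merged[key][field] = value
    return dict(merged)
```

Keeping the merge pure (dicts in, dict out) makes it easy to unit-test independently of file I/O.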

Step 4: Execute and Verify

Run a comprehensive verification process:
  1. Check output file existence at expected locations
  2. Validate merged data contains expected values
  3. Verify conflict report structure and content
  4. Run any provided test suites

Common Pitfalls

Binary File Handling

  • Mistake: Attempting to read Parquet/Excel files as text
  • Solution: Always use pandas with the appropriate engine (`pyarrow` for Parquet, `openpyxl` for Excel)

Syntax Errors in Scripts

  • Mistake: Writing long scripts without validation, leading to indentation or syntax errors
  • Solution: Run `python -m py_compile script.py` before execution

Date Format Normalization

  • Mistake: Assuming consistent date formats across sources
  • Solution: Implement flexible date parsing that handles multiple formats:
    • ISO format: `2024-01-15`
    • US format: `01/15/2024`
    • European format: `15-01-2024`
    • Datetime: `2024-01-15T10:30:00`
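A minimal sketch of such a parser, trying each known format in order. The format list is illustrative; note that the separators used here keep the US and European styles unambiguous:

```python
from datetime import datetime

# Candidate formats, tried in order; extend as new source formats appear.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # datetime: 2024-01-15T10:30:00
    "%Y-%m-%d",           # ISO: 2024-01-15
    "%m/%d/%Y",           # US: 01/15/2024
    "%d-%m-%Y",           # European: 15-01-2024
]

def normalize_date(value):
    """Parse a date string in any known format; return ISO 'YYYY-MM-DD'."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

If your sources mix `01/02/2024`-style dates with both day-first and month-first conventions, format order alone cannot disambiguate them; resolve that per source instead.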

Incomplete Script Output

  • Mistake: Writing very long scripts that may get truncated
  • Solution: Break into modular functions, verify complete code visibility

Environment Path Issues

  • Mistake: Repeating PATH exports in every command
  • Solution: Set PATH once in a setup step or use absolute paths to executables

Verification Strategies

Output Validation Checklist

  1. File existence check:
    ```python
    import os
    assert os.path.exists("output/merged_data.json")
    assert os.path.exists("output/conflict_report.json")
    ```
  2. Data completeness check:
    ```python
    import json
    with open("output/merged_data.json") as f:
        data = json.load(f)
    # Verify expected record count
    assert len(data) == expected_count
    ```
  3. Conflict report validation:
    ```python
    with open("output/conflict_report.json") as f:
        conflicts = json.load(f)
    # Verify conflict structure has required fields
    for conflict in conflicts:
        assert "field" in conflict
        assert "selected" in conflict
        assert "sources" in conflict
    ```
  4. Sample value verification:
    ```python
    # Spot-check specific merged records
    record = next(r for r in data if r["user_id"] == "expected_id")
    assert record["field_name"] == "expected_value"
    ```

Consolidate Verification

Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
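One way to fold the checklist into a single script is sketched below; the file names follow the earlier examples and should be adapted to your actual output layout:

```python
import json
import os

def verify_output(output_dir, expected_count):
    """Run all output checks in one pass; raises AssertionError on the first failure."""
    merged_path = os.path.join(output_dir, "merged_data.json")
    report_path = os.path.join(output_dir, "conflict_report.json")

    # 1. File existence
    assert os.path.exists(merged_path), f"missing {merged_path}"
    assert os.path.exists(report_path), f"missing {report_path}"

    # 2. Data completeness
    with open(merged_path) as f:
        data = json.load(f)
    assert len(data) == expected_count, f"expected {expected_count} records, got {len(data)}"

    # 3. Conflict report structure
    with open(report_path) as f:
        conflicts = json.load(f)
    for conflict in conflicts:
        for required in ("field", "selected", "sources"):
            assert required in conflict, f"conflict entry missing {required!r}"

    print("All verification checks passed.")
```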

Edge Cases to Consider

  • Empty source files - Handle gracefully with appropriate warnings
  • Missing merge keys - Decide whether to skip or error
  • Type mismatches - Convert consistently (e.g., user_id as string vs integer)
  • Null/None values - Determine handling in conflict resolution
  • Unicode/encoding - Specify encoding when reading text-based formats
  • Records in some sources but not others - Include partial records or require complete matches
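For the type-mismatch and null bullets, a small key normalizer can keep `user_id` comparable across sources. This sketch assumes pandas-style missing values, where integer keys with gaps often arrive as floats or NaN; `normalize_key` is an illustrative name:

```python
def normalize_key(value):
    """Coerce a merge key to a canonical string, or None if missing."""
    if value is None or (isinstance(value, float) and value != value):  # NaN check
        return None
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # 42.0 -> "42" (pandas upcasts int columns with NaNs)
    return str(value).strip()
```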

Field Mapping Example

When sources have different field names for the same concept:
```python
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email"
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email"
    }
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
```

Conflict Resolution Pattern

When the same field has different values across sources:
```python
def resolve_conflict(values_by_source, priority_order):
    """
    Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)

    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority"
        }

    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info

    return None, conflict_info
```