dataset-comparer

Dataset Comparer

Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.

Features

  • Row Comparison: Find added, removed, and matching rows
  • Value Changes: Detect changed values in matching rows
  • Column Comparison: Identify schema differences
  • Statistics: Summary of differences
  • Diff Reports: HTML, CSV, and JSON output
  • Flexible Matching: Compare by key columns or row position

Quick Start

python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")

# Compare by key column
diff = comparer.compare(key_columns=["id"])

# Counts are nested under the "summary" key of the result (see Output Format)
summary = diff["summary"]
print(f"Added rows: {summary['added_count']}")
print(f"Removed rows: {summary['removed_count']}")
print(f"Changed rows: {summary['changed_count']}")

# Generate report
comparer.generate_report("diff_report.html")

CLI Usage

bash
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv

# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id

# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date

# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html

# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv

# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json

# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date

# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
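
The flags used in the commands above could be parsed with `argparse` roughly as follows. This is only a sketch: the real parser in `dataset_comparer.py` is not shown in this document, so `build_parser`, `split_csv`, and the help texts are hypothetical names chosen for illustration.

```python
import argparse


def split_csv(value: str) -> list:
    """Split a comma-separated flag value such as "id,date" into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the flags shown in the CLI examples.
    parser = argparse.ArgumentParser(description="Compare two datasets")
    parser.add_argument("--old", required=True, help="Path to the old dataset")
    parser.add_argument("--new", required=True, help="Path to the new dataset")
    parser.add_argument("--key", type=split_csv, default=None,
                        help="Key column(s), comma-separated")
    parser.add_argument("--ignore", type=split_csv, default=None,
                        help="Columns to skip, comma-separated")
    parser.add_argument("--columns", type=split_csv, default=None,
                        help="Only compare these columns, comma-separated")
    parser.add_argument("--report", help="Write an HTML report to this path")
    parser.add_argument("--output", help="Write differences to this CSV path")
    parser.add_argument("--json", action="store_true", help="Print JSON output")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is runnable.
args = build_parser().parse_args(
    ["--old", "old.csv", "--new", "new.csv", "--key", "id,date"]
)
print(args.key)
```

Splitting on commas inside a `type=` callable keeps `--key id,date` on the command line while handing the rest of the program a plain list.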

API Reference

DatasetComparer Class

python
from typing import Optional

import pandas as pd


# Signature summary only; method bodies are omitted.
class DatasetComparer:
    def __init__(self) -> None: ...

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer': ...
    def load_dataframes(self, old_df: pd.DataFrame,
                        new_df: pd.DataFrame) -> 'DatasetComparer': ...

    # Comparison
    def compare(self, key_columns: Optional[list] = None,
                ignore_columns: Optional[list] = None,
                compare_columns: Optional[list] = None) -> dict: ...

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame: ...
    def get_removed_rows(self) -> pd.DataFrame: ...
    def get_changed_rows(self) -> pd.DataFrame: ...
    def get_unchanged_rows(self) -> pd.DataFrame: ...

    # Schema comparison
    def compare_schema(self) -> dict: ...

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str: ...
    def to_dataframe(self) -> pd.DataFrame: ...
    def summary(self) -> str: ...

Comparison Methods

Key-Based Comparison

Compare rows by matching key columns (like primary keys):
python
diff = comparer.compare(key_columns=["customer_id"])

# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
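
One common way to do key-based matching in pandas is an outer merge with `indicator=True`. The sketch below is not taken from the `DatasetComparer` source, which is not shown here; the sample frames and variable names are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: id 1 is removed, id 4 is added, id 3 changes its name.
old = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
new = pd.DataFrame({"id": [2, 3, 4], "name": ["Bob", "Cyd", "Di"]})

key_columns = ["id"]

# An outer merge on the key with indicator=True labels each row as
# "left_only" (only in old), "right_only" (only in new), or "both".
merged = old.merge(new, on=key_columns, how="outer",
                   suffixes=("_old", "_new"), indicator=True)

removed_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
added_ids = merged.loc[merged["_merge"] == "right_only", "id"].tolist()

# Among matched rows, a differing non-key value marks the row as changed.
matched = merged[merged["_merge"] == "both"]
changed_ids = matched.loc[matched["name_old"] != matched["name_new"], "id"].tolist()

print(removed_ids, added_ids, changed_ids)
```

For a composite key, the same merge works with several names in `key_columns`. The sketch assumes keys are unique within each dataset; duplicate keys would multiply rows in the merge.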

Position-Based Comparison

Compare rows by their position (row number):
python
diff = comparer.compare()  # No keys = positional comparison
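
To make the idea of positional comparison concrete, here is a minimal pandas sketch under the assumption that both datasets share the same columns. It is not the library's implementation; the frames and variable names are made up for illustration.

```python
import pandas as pd

old = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "score": [1, 2, 3]})
new = pd.DataFrame({"name": ["Ann", "Rob", "Cy", "Di"], "score": [1, 2, 5, 4]})

# Only positions that exist in both frames can be compared cell by cell.
overlap = min(len(old), len(new))
old_part = old.iloc[:overlap].reset_index(drop=True)
new_part = new.iloc[:overlap].reset_index(drop=True)

# A row counts as changed if any cell differs at that position.
# Note: NaN != NaN evaluates to True, so missing values need separate handling.
changed_mask = (old_part != new_part).any(axis=1)
changed_positions = changed_mask[changed_mask].index.tolist()

# Rows beyond the overlap exist in only one of the datasets.
added_count = max(len(new) - len(old), 0)
removed_count = max(len(old) - len(new), 0)

print(changed_positions, added_count, removed_count)
```

Positional comparison is sensitive to row order: a single inserted row near the top shifts every later row, so key-based comparison is usually the better fit when a stable identifier exists.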

Output Format

Comparison Result

python
{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 925,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}

Changed Row Details

python
changes = comparer.get_changed_rows()

# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value
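
A table with the `_key`, `_column`, `_old_value`, and `_new_value` columns described above could be built along these lines. This is a sketch under the assumption that both frames contain the same keys (matched rows only); it is not the library's own code, and the sample data is invented.

```python
import pandas as pd

old = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"],
                    "status": ["active", "active"]})
new = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Rob"],
                    "status": ["active", "closed"]})

# Index both frames by the key so rows line up label by label.
old_idx = old.set_index("id").sort_index()
new_idx = new.set_index("id").sort_index()

records = []
for column in old_idx.columns:
    differs = old_idx[column] != new_idx[column]
    # Emit one record per (key, column) pair whose value changed.
    for key in differs[differs].index:
        records.append({
            "_key": key,
            "_column": column,
            "_old_value": old_idx.at[key, column],
            "_new_value": new_idx.at[key, column],
        })

changes = pd.DataFrame(
    records, columns=["_key", "_column", "_old_value", "_new_value"]
)
print(changes)
```

The long format (one row per changed cell) keeps the result compact even when the datasets are wide, because unchanged cells never appear.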

Schema Comparison

Compare column structure:
python
schema = comparer.compare_schema()

# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}
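
The column and type fields in a result like this can be derived directly from two DataFrames. The helper below, `schema_diff`, is a hypothetical stand-in written for illustration and covers only part of the structure shown above; it is not the library's `compare_schema`.

```python
import pandas as pd


def schema_diff(old_df: pd.DataFrame, new_df: pd.DataFrame) -> dict:
    """Hypothetical helper: compare column names and dtypes of two frames."""
    old_cols, new_cols = list(old_df.columns), list(new_df.columns)
    common = [c for c in old_cols if c in new_cols]
    return {
        "common_columns": common,
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
        # A type change is a shared column whose dtype differs between frames.
        "type_changes": [
            {"column": c,
             "old_type": str(old_df[c].dtype),
             "new_type": str(new_df[c].dtype)}
            for c in common
            if old_df[c].dtype != new_df[c].dtype
        ],
    }


old = pd.DataFrame({"id": [1, 2], "price": [10, 20], "old_field": ["a", "b"]})
new = pd.DataFrame({"id": [1, 2], "price": [10.5, 20.0], "new_field": ["x", "y"]})

result = schema_diff(old, new)
print(result)
```

Keep in mind that dtypes inferred from CSV files depend on the data: a single missing value turns an integer column into `float64`, which would show up here as a type change.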

Filtering Options

Ignore Columns

Skip certain columns during comparison:
python
diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)

Compare Specific Columns

Only compare selected columns:
python
diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)
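
How the two filters might combine into the final list of compared columns can be sketched in plain Python. The precedence used here (start from `compare_columns` if given, then drop key and ignored columns) is an assumption, and `select_columns` is a hypothetical helper, not part of the documented API.

```python
def select_columns(all_columns, key_columns=None,
                   ignore_columns=None, compare_columns=None):
    """Hypothetical helper: decide which columns take part in value comparison."""
    keys = set(key_columns or [])
    ignored = set(ignore_columns or [])
    # Start from the explicit selection if one was given, else from all columns.
    candidates = compare_columns if compare_columns else all_columns
    # Assumed rule: key columns identify rows and are not compared as values;
    # ignored columns are dropped even if they were explicitly selected.
    return [c for c in candidates
            if c in all_columns and c not in keys and c not in ignored]


columns = ["id", "name", "email", "status", "updated_at"]

print(select_columns(columns, key_columns=["id"], ignore_columns=["updated_at"]))
print(select_columns(columns, key_columns=["id"], compare_columns=["name", "email"]))
```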

Report Formats

HTML Report

python
comparer.generate_report("diff_report.html", format="html")
Features:
  • Summary statistics
  • Interactive tables
  • Color-coded changes (green=added, red=removed, yellow=changed)
  • Schema comparison section

CSV Export

python
comparer.generate_report("diff_report.csv", format="csv")
Includes all differences in tabular format.

JSON Output

python
comparer.generate_report("diff_report.json", format="json")
Complete diff data in JSON format.
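
When serializing pandas-derived results to JSON yourself, NumPy scalar types such as `np.int64` are a frequent stumbling block because the standard `json` module rejects them. The sketch below shows one way to handle this; the `diff` dictionary is invented sample data, and `to_jsonable` is a hypothetical helper, not something the library is known to expose.

```python
import json

import numpy as np


def to_jsonable(value):
    """Fallback for json.dumps: convert NumPy scalars to plain Python values."""
    if isinstance(value, np.generic):
        return value.item()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Sample data shaped like the comparison result, with NumPy integer counts.
diff = {
    "summary": {"added_count": np.int64(75), "removed_count": np.int64(25)},
    "key_columns": ["id"],
}

text = json.dumps(diff, default=to_jsonable, indent=2)
print(text)
```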

Example Workflows

Data Migration Validation

python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")

diff = comparer.compare(key_columns=["id"])

if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")

ETL Pipeline Verification

python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")

diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)

# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)

Incremental Update Detection

python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")

diff = comparer.compare(key_columns=["customer_id"])

# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")

# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")

Schema Change Detection

python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")

schema = comparer.compare_schema()

if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")

if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")

if schema["type_changes"]:
    for change in schema["type_changes"]:
        print(f"Type change: {change['column']} "
              f"({change['old_type']} -> {change['new_type']})")

Large Dataset Tips

For very large datasets:
python
from dataset_comparer import DatasetComparer

# Compare in chunks
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")

# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])

# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")
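
This document does not show whether `DatasetComparer` itself reads files in chunks. If memory is the bottleneck, one option outside the library is to read only the key column with plain pandas, in chunks, to find added and removed keys before running a full comparison. The helper `read_keys` and the inline sample data below are assumptions made for illustration.

```python
import io

import pandas as pd


def read_keys(source, key_column: str, chunksize: int = 2) -> set:
    """Hypothetical helper: collect the key values of a CSV chunk by chunk."""
    keys = set()
    # usecols restricts parsing to the key column; chunksize makes read_csv
    # yield DataFrames of at most `chunksize` rows instead of one large frame.
    for chunk in pd.read_csv(source, usecols=[key_column], chunksize=chunksize):
        keys.update(chunk[key_column].tolist())
    return keys


# In-memory stand-ins for large_old.csv and large_new.csv.
old_csv = io.StringIO("id,name\n1,Ann\n2,Bob\n3,Cy\n")
new_csv = io.StringIO("id,name\n2,Bob\n3,Cyd\n4,Di\n")

old_keys = read_keys(old_csv, "id")
new_keys = read_keys(new_csv, "id")

added_keys = sorted(new_keys - old_keys)
removed_keys = sorted(old_keys - new_keys)
print(added_keys, removed_keys)
```

A realistic `chunksize` would be in the tens or hundreds of thousands of rows; the value of 2 here only exercises the chunking on tiny sample data.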

Dependencies

  • pandas>=2.0.0
  • numpy>=1.24.0