Dataset Comparer
Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.
Features
- Row Comparison: Find added, removed, and matching rows
- Value Changes: Detect changed values in matching rows
- Column Comparison: Identify schema differences
- Statistics: Summary of differences
- Diff Reports: HTML, CSV, and JSON output
- Flexible Matching: Compare by key columns or row position
Quick Start
```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")

# Compare by key column
diff = comparer.compare(key_columns=["id"])
print(f"Added rows: {diff['summary']['added_count']}")
print(f"Removed rows: {diff['summary']['removed_count']}")
print(f"Changed rows: {diff['summary']['changed_count']}")

# Generate report
comparer.generate_report("diff_report.html")
```
CLI Usage
```bash
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv

# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id

# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date

# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html

# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv

# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json

# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date

# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
```
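The flags above suggest a fairly conventional argument parser. The sketch below is not the tool's actual source: `build_parser` and `comma_list` are hypothetical names, and only the flag names are taken from the examples above. It shows one way comma-separated values such as `--key id,date` could be turned into lists with `argparse`.

```python
import argparse


def comma_list(value: str) -> list[str]:
    """Split 'id,date' into ['id', 'date'], dropping empty pieces."""
    return [part.strip() for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names mirror the CLI examples above;
    # help text and defaults are assumptions made for this sketch.
    parser = argparse.ArgumentParser(prog="dataset_comparer.py")
    parser.add_argument("--old", required=True, help="path to the old dataset")
    parser.add_argument("--new", required=True, help="path to the new dataset")
    parser.add_argument("--key", type=comma_list, default=None)
    parser.add_argument("--ignore", type=comma_list, default=None)
    parser.add_argument("--columns", type=comma_list, default=None)
    parser.add_argument("--report", default=None)
    parser.add_argument("--output", default=None)
    parser.add_argument("--json", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--old", "old.csv", "--new", "new.csv", "--key", "id,date", "--json"]
)
print(args.key)   # ['id', 'date']
print(args.json)  # True
```

Parsing the lists inside the `type` callable keeps the rest of the program free of string splitting; flags that are not passed stay `None`.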
API Reference
DatasetComparer Class
```python
import pandas as pd


class DatasetComparer:
    def __init__(self): ...

    # Data loading
    def load(self, old_path: str, new_path: str) -> 'DatasetComparer': ...
    def load_dataframes(self, old_df: pd.DataFrame,
                        new_df: pd.DataFrame) -> 'DatasetComparer': ...

    # Comparison
    def compare(self, key_columns: list = None,
                ignore_columns: list = None,
                compare_columns: list = None) -> dict: ...

    # Detailed results
    def get_added_rows(self) -> pd.DataFrame: ...
    def get_removed_rows(self) -> pd.DataFrame: ...
    def get_changed_rows(self) -> pd.DataFrame: ...
    def get_unchanged_rows(self) -> pd.DataFrame: ...

    # Schema comparison
    def compare_schema(self) -> dict: ...

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str: ...
    def to_dataframe(self) -> pd.DataFrame: ...
    def summary(self) -> str: ...
```
Comparison Methods
Key-Based Comparison
Compare rows by matching key columns (like primary keys):
```python
diff = comparer.compare(key_columns=["customer_id"])

# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
```
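To make key-based matching concrete, here is a minimal sketch of how it can be done with plain pandas. It is an illustration under assumptions, not the internals of `DatasetComparer`: the sample frames and the `qty` column are made up, and missing-value handling is left out.

```python
import pandas as pd

old = pd.DataFrame({"order_id": [1, 1, 2],
                    "product_id": ["A", "B", "A"],
                    "qty": [5, 1, 2]})
new = pd.DataFrame({"order_id": [1, 2, 2],
                    "product_id": ["A", "A", "B"],
                    "qty": [5, 3, 7]})
keys = ["order_id", "product_id"]

# An outer merge on the keys, with indicator=True, labels every key as
# present in the old frame only, the new frame only, or both.
merged = old.merge(new, on=keys, how="outer",
                   suffixes=("_old", "_new"), indicator=True)

removed = merged[merged["_merge"] == "left_only"]
added = merged[merged["_merge"] == "right_only"]
both = merged[merged["_merge"] == "both"]

# Among keys present on both sides, a row counts as changed when a
# non-key column differs (NaN handling is omitted in this sketch).
changed = both[both["qty_old"] != both["qty_new"]]

print(removed[keys])  # the (1, "B") row exists only in old
print(added[keys])    # the (2, "B") row exists only in new
print(changed[keys])  # the (2, "A") row changed qty from 2 to 3
```

This pairing is only unambiguous when the key combination is unique within each dataset; duplicate keys would produce a cross product in the merge.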
Position-Based Comparison
Compare rows by their position (row number):
```python
diff = comparer.compare()  # No keys = positional comparison
```
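Positional comparison can be pictured as pairing row 0 with row 0, row 1 with row 1, and so on. The following is a rough sketch with pandas, assuming both frames share the same columns; the sample data is invented and this is not necessarily how the library treats trailing rows.

```python
import pandas as pd

old = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "score": [1, 2, 3]})
new = pd.DataFrame({"name": ["Ann", "Bea", "Cy", "Dee"], "score": [1, 2, 4, 5]})

# Rows are paired by position, so only the overlapping range is compared.
n = min(len(old), len(new))
old_head = old.iloc[:n].reset_index(drop=True)
new_head = new.iloc[:n].reset_index(drop=True)

# A row is changed when any cell differs from the cell at the same position.
changed_mask = (old_head != new_head).any(axis=1)
changed_positions = changed_mask[changed_mask].index.tolist()

# Rows beyond the shorter frame have no partner: extra rows in `new`
# are treated as additions, extra rows in `old` as removals.
added = new.iloc[n:]
removed = old.iloc[n:]

print(changed_positions)         # [1, 2]
print(len(added), len(removed))  # 1 0
```

Because the pairing depends entirely on row order, inserting one row near the top of a file shifts every later row and marks them all as changed; key-based comparison avoids that.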
Output Format
Comparison Result
```python
{
    "summary": {
        "old_rows": 1000,
        "new_rows": 1050,
        "added_count": 75,
        "removed_count": 25,
        "changed_count": 50,
        "unchanged_count": 925,
        "total_differences": 150
    },
    "schema_changes": {
        "added_columns": ["new_field"],
        "removed_columns": ["old_field"],
        "type_changes": [
            {"column": "amount", "old_type": "int64", "new_type": "float64"}
        ]
    },
    "key_columns": ["id"],
    "compared_columns": ["name", "email", "status"],
    "ignored_columns": ["updated_at"]
}
```
Changed Row Details
```python
changes = comparer.get_changed_rows()

# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value
```
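The four-column layout described above is a "long" format: one output row per changed cell. As a sketch of how such a table could be built with pandas (the sample data is invented, and the loop is an assumption about the approach, not the library's code), assuming both frames contain the same keys and columns:

```python
import pandas as pd

old = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"], "status": ["new", "new"]})
new = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Rob"], "status": ["new", "done"]})

# Align both frames on the key so that rows line up one to one.
old_k = old.set_index("id").sort_index()
new_k = new.set_index("id").sort_index()

records = []
for column in old_k.columns:
    differs = old_k[column] != new_k[column]
    # Emit one record per (key, column) pair whose value changed.
    for key in differs[differs].index:
        records.append({
            "_key": key,
            "_column": column,
            "_old_value": old_k.at[key, column],
            "_new_value": new_k.at[key, column],
        })

changes = pd.DataFrame(
    records, columns=["_key", "_column", "_old_value", "_new_value"]
)
print(changes)
```

A row with two changed columns therefore appears twice, once per column, which keeps the table easy to filter by `_column`.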
Schema Comparison
Compare column structure:
```python
schema = comparer.compare_schema()

# Returns:
{
    "old_columns": ["id", "name", "old_field"],
    "new_columns": ["id", "name", "new_field"],
    "common_columns": ["id", "name"],
    "added_columns": ["new_field"],
    "removed_columns": ["old_field"],
    "type_changes": [
        {"column": "price", "old_type": "int64", "new_type": "float64"}
    ],
    "old_row_count": 1000,
    "new_row_count": 1050
}
```
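For readers who want to see what goes into such a result, a schema diff of this shape can be derived from column names and dtypes alone. The helper below is a hypothetical `schema_diff`, written for illustration with made-up data; it is not the library's `compare_schema()`.

```python
import pandas as pd

old = pd.DataFrame({"id": [1, 2], "name": ["a", "b"],
                    "price": [10, 20], "old_field": [0, 0]})
new = pd.DataFrame({"id": [1, 2], "name": ["a", "b"],
                    "price": [10.5, 20.0], "new_field": [1, 1]})


def schema_diff(old_df: pd.DataFrame, new_df: pd.DataFrame) -> dict:
    """Hypothetical helper: compare column names and dtypes of two frames."""
    old_cols, new_cols = list(old_df.columns), list(new_df.columns)
    common = [c for c in old_cols if c in new_cols]
    return {
        "old_columns": old_cols,
        "new_columns": new_cols,
        "common_columns": common,
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
        # A type change is a shared column whose dtype differs.
        "type_changes": [
            {"column": c,
             "old_type": str(old_df[c].dtype),
             "new_type": str(new_df[c].dtype)}
            for c in common
            if old_df[c].dtype != new_df[c].dtype
        ],
        "old_row_count": len(old_df),
        "new_row_count": len(new_df),
    }


result = schema_diff(old, new)
print(result["added_columns"], result["removed_columns"])
print(result["type_changes"])
```

Note that dtypes inferred from CSV files can shift for reasons unrelated to the schema, for example an integer column becoming `float64` once it contains a missing value.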
Filtering Options
Ignore Columns
Skip certain columns during comparison:
```python
diff = comparer.compare(
    key_columns=["id"],
    ignore_columns=["updated_at", "modified_by", "timestamp"]
)
```
Compare Specific Columns
Only compare selected columns:
```python
diff = comparer.compare(
    key_columns=["id"],
    compare_columns=["name", "email", "status"]  # Only these columns
)
```
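The two filtering options can be thought of as narrowing the set of value columns before any cell is compared. The documentation does not say what happens when both are given, so the sketch below makes an assumption: `compare_columns` selects the candidates and `ignore_columns` is then subtracted. `resolve_columns` is a hypothetical helper name.

```python
def resolve_columns(all_columns, key_columns=None,
                    ignore_columns=None, compare_columns=None):
    """Hypothetical helper: decide which value columns get compared."""
    keys = set(key_columns or [])
    ignored = set(ignore_columns or [])
    if compare_columns is not None:
        # Keep the requested order, skipping names the data does not have.
        candidates = [c for c in compare_columns if c in all_columns]
    else:
        candidates = list(all_columns)
    # Key columns identify rows, so they are never compared as values.
    # Assumption of this sketch: an ignored column is dropped even if it
    # was also requested through compare_columns.
    return [c for c in candidates if c not in keys and c not in ignored]


columns = ["id", "name", "email", "status", "updated_at"]
print(resolve_columns(columns, key_columns=["id"],
                      ignore_columns=["updated_at"]))
# ['name', 'email', 'status']
print(resolve_columns(columns, key_columns=["id"],
                      compare_columns=["name", "email"]))
# ['name', 'email']
```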
Report Formats
HTML Report
```python
comparer.generate_report("diff_report.html", format="html")
```
Features:
- Summary statistics
- Interactive tables
- Color-coded changes (green=added, red=removed, yellow=changed)
- Schema comparison section
CSV Export
```python
comparer.generate_report("diff_report.csv", format="csv")
```
Includes all differences in tabular format.
JSON Output
```python
comparer.generate_report("diff_report.json", format="json")
```
Complete diff data in JSON format.
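If you build JSON from a diff dictionary yourself rather than through `generate_report`, one practical detail is that counts computed with pandas or NumPy are often NumPy scalars, which the standard `json` module rejects. The sketch below shows a common workaround; the `diff` dictionary is a made-up stand-in and `to_builtin` is a hypothetical helper.

```python
import json

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars and arrays into plain Python values for json."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")


# Stand-in for a comparison result; np.int64 mimics a pandas-derived count.
diff = {"summary": {"added_count": np.int64(75), "removed_count": 25}}

text = json.dumps(diff, default=to_builtin, indent=2)
print(text)
```

`json.dumps` only calls the `default` hook for objects it cannot encode on its own, so plain Python values pass through untouched.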
Example Workflows
Data Migration Validation
```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")
diff = comparer.compare(key_columns=["id"])

if diff["summary"]["total_differences"] == 0:
    print("Migration successful - no differences!")
else:
    print(f"Found {diff['summary']['total_differences']} differences")
    comparer.generate_report("migration_issues.html")
```
ETL Pipeline Verification
```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")
diff = comparer.compare(
    key_columns=["transaction_id"],
    ignore_columns=["processing_timestamp"]
)

# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)
```
Incremental Update Detection
```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")
diff = comparer.compare(key_columns=["customer_id"])

# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")

# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")
```
Schema Change Detection
```python
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")
schema = comparer.compare_schema()

if schema["added_columns"]:
    print(f"New columns: {schema['added_columns']}")
if schema["removed_columns"]:
    print(f"Removed columns: {schema['removed_columns']}")
if schema["type_changes"]:
    for change in schema["type_changes"]:
        print(f"Type change: {change['column']} "
              f"({change['old_type']} -> {change['new_type']})")
```
Large Dataset Tips
For very large datasets:
```python
# Compare in chunks
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")

# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])

# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")
```
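The API shown in this document exposes no chunking option, so comparing in chunks would have to happen outside `DatasetComparer`. One possible approach, sketched here with plain pandas as an assumption rather than a documented feature, is to stream each file with `read_csv(chunksize=...)` and keep only a hash per key instead of the full rows. The helper `key_hashes` and the in-memory CSV text are made up for illustration.

```python
import io

import pandas as pd
from pandas.util import hash_pandas_object


def key_hashes(source, key: str, chunksize: int = 2) -> dict:
    """Map each key to a hash of its remaining columns, reading in chunks."""
    hashes = {}
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Hash every row's non-key values; index=False keeps the hash
        # independent of where the row sits in the file.
        row_hash = hash_pandas_object(chunk.drop(columns=[key]), index=False)
        hashes.update(zip(chunk[key].tolist(), row_hash.tolist()))
    return hashes


# In-memory stand-ins for two large CSV files.
old_csv = io.StringIO("id,name\n1,Ann\n2,Bob\n3,Cy\n")
new_csv = io.StringIO("id,name\n2,Bob\n3,Cyd\n4,Dee\n")

old_h = key_hashes(old_csv, "id")
new_h = key_hashes(new_csv, "id")

added = sorted(new_h.keys() - old_h.keys())
removed = sorted(old_h.keys() - new_h.keys())
changed = sorted(k for k in old_h.keys() & new_h.keys() if old_h[k] != new_h[k])
print(added, removed, changed)  # [4] [1] [3]
```

Memory use then grows with the number of keys rather than the width of the rows. One caveat: a column whose inferred dtype differs between chunks can hash differently for the same value, so passing an explicit `dtype` to `read_csv` is advisable for real data.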
Dependencies
- pandas>=2.0.0
- numpy>=1.24.0