csv-data-auditor


CSV Data Auditor

Instructions

When auditing CSV data, perform these validation checks systematically:

1. File Structure Validation

  • Verify the file exists and is readable
  • Check if the file is a valid CSV format
  • Ensure consistent delimiter usage
  • Verify proper quoting and escaping
  • Check for balanced quotes
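
The checks above can be sketched with the standard library's csv.Sniffer; the function name, return shape, and the odd-quote-count heuristic are illustrative, not a definitive implementation:

```python
import csv
from pathlib import Path

def check_structure(file_path, sample_size=64 * 1024):
    """Check existence, detect the delimiter, and look for unbalanced quotes."""
    path = Path(file_path)
    if not path.is_file():
        return {"ok": False, "error": "file not found or not readable"}
    sample = path.read_text(encoding="utf-8", errors="replace")[:sample_size]
    try:
        # Sniffer infers the dialect; failure suggests inconsistent delimiters
        dialect = csv.Sniffer().sniff(sample)
    except csv.Error as e:
        return {"ok": False, "error": f"could not infer dialect: {e}"}
    # An odd number of quote characters in the sample hints at an unbalanced quote
    balanced = sample.count(dialect.quotechar) % 2 == 0
    return {"ok": balanced, "delimiter": dialect.delimiter,
            "quotechar": dialect.quotechar}
```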

2. Header Analysis

  • Confirm headers exist (unless headerless CSV)
  • Check for duplicate header names
  • Validate header naming conventions
  • Ensure headers are descriptive and consistent
  • Check for special characters in headers
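
A minimal header check along these lines; the naming rule (letters, digits, underscores) is an illustrative convention, not a standard:

```python
from collections import Counter

def check_headers(headers):
    """Flag duplicate, untrimmed, or special-character header names."""
    issues = []
    duplicates = [name for name, n in Counter(headers).items() if n > 1]
    if duplicates:
        issues.append({"type": "duplicate_headers", "names": duplicates})
    for name in headers:
        if name != name.strip():
            issues.append({"type": "untrimmed_header", "name": name})
        # Allow letters, digits, and underscores; flag anything else
        if not name.strip().replace("_", "").isalnum():
            issues.append({"type": "special_characters", "name": name})
    return issues
```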

3. Row Consistency

  • Count total rows and columns
  • Verify all rows have the same number of columns
  • Check for empty rows or rows with only whitespace
  • Identify truncated rows
  • Detect corrupted or malformed rows
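
A sketch of the row checks using the csv module; note the reported numbers count CSV records, not physical file lines, so they drift if quoted fields span multiple lines:

```python
import csv

def check_rows(file_path, delimiter=","):
    """Report blank and ragged rows, numbered from 1 with the header as row 1."""
    issues = []
    with open(file_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        try:
            expected = len(next(reader))  # header row sets the column count
        except StopIteration:
            return [{"type": "empty_file"}]
        for lineno, row in enumerate(reader, start=2):
            if not any(cell.strip() for cell in row):
                issues.append({"type": "blank_row", "line": lineno})
            elif len(row) != expected:
                issues.append({"type": "ragged_row", "line": lineno,
                               "columns": len(row), "expected": expected})
    return issues
```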

4. Data Type Validation

For each column, validate the expected data type:

Numeric Fields

  • Check for non-numeric values
  • Validate decimal point usage
  • Look for negative values where inappropriate
  • Check for extremely large/small values
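
One way to sketch these numeric checks with pandas (the helper name and report keys are illustrative):

```python
import pandas as pd

def check_numeric(series, min_value=None, max_value=None):
    """Coerce a column to numeric; count parse failures and out-of-range values."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Values present in the input but NaN after coercion failed to parse
    report = {"non_numeric": int((coerced.isna() & series.notna()).sum())}
    if min_value is not None:
        report["below_min"] = int((coerced < min_value).sum())
    if max_value is not None:
        report["above_max"] = int((coerced > max_value).sum())
    return report
```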

Date/Time Fields

  • Verify date format consistency (ISO 8601, US, EU, etc.)
  • Check for invalid dates (e.g., February 30th)
  • Validate time zone handling
  • Look for future dates where inappropriate
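
These date checks can be approximated by parsing against one expected format (ISO 8601 assumed here); anything off-format or impossible, such as February 30th, fails to parse:

```python
import pandas as pd

def check_dates(series, fmt="%Y-%m-%d"):
    """Count values that fail to parse under one expected format, plus future dates."""
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    # Present but unparseable: either invalid dates or a different format
    invalid = int((parsed.isna() & series.notna()).sum())
    future = int((parsed > pd.Timestamp.now()).sum())
    return {"invalid_or_off_format": invalid, "future_dates": future}
```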

Text Fields

  • Check for encoding issues
  • Look for unexpected special characters
  • Verify string length constraints
  • Check for leading/trailing whitespace
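
A small helper for the whitespace and length checks (names and report keys are illustrative):

```python
import pandas as pd

def check_text(series, max_length=None):
    """Flag untrimmed whitespace and over-length strings in a text column."""
    s = series.dropna().astype(str)
    report = {"untrimmed": int((s != s.str.strip()).sum())}
    if max_length is not None:
        report["too_long"] = int((s.str.len() > max_length).sum())
    return report
```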

Boolean Fields

  • Validate true/false representations
  • Check for 1/0, Y/N, Yes/No consistency
  • Look for ambiguous values
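
A sketch of the boolean check; the accepted vocabulary below is an assumption to adapt to your data:

```python
import pandas as pd

ACCEPTED = {"true", "false", "1", "0", "y", "n", "yes", "no"}  # assumed vocabulary

def check_booleans(series):
    """List boolean spellings in use and any values outside the accepted set."""
    values = series.dropna().astype(str).str.strip().str.lower()
    return {
        "representations": sorted(values.unique()),
        "ambiguous": sorted(values[~values.isin(ACCEPTED)].unique()),
    }
```

Even when every value is accepted, a long "representations" list (say, both 1/0 and Yes/No) signals the consistency problem described above.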

5. Data Quality Checks

Missing Values

  • Count NULL/empty values per column
  • Calculate missing value percentages
  • Identify patterns in missing data
  • Check for placeholders like "N/A", "null", "-"
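
A placeholder-aware missing-value count along these lines; the placeholder list is illustrative and should match your data source:

```python
import pandas as pd

PLACEHOLDERS = {"n/a", "na", "null", "none", "-", ""}  # illustrative list

def missing_report(df):
    """Per-column count of true NaNs plus common placeholder strings."""
    report = {}
    for col in df.columns:
        text = df[col].astype(str).str.strip().str.lower()
        # Only count placeholders on cells that are not already NaN
        placeholders = (text.isin(PLACEHOLDERS) & df[col].notna()).sum()
        total = int(df[col].isna().sum() + placeholders)
        if total:
            report[col] = {"missing": total,
                           "pct": round(100 * total / len(df), 2)}
    return report
```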

Duplicates

  • Find exact duplicate rows
  • Check for duplicate IDs or keys
  • Identify near-duplicates (typos, variations)
  • Validate uniqueness constraints
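
Exact duplicates are one pandas call; near-duplicates need fuzzy matching. The sketch below uses difflib with an assumed 0.9 similarity threshold, and the pairwise comparison is O(n²), so sample the key column first on large files:

```python
import difflib
import pandas as pd

def duplicate_report(df, key, threshold=0.9):
    """Exact duplicate rows and keys, plus near-duplicate key pairs."""
    report = {
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }
    values = df[key].dropna().astype(str).unique().tolist()
    # Case-insensitive fuzzy comparison of every distinct key pair
    report["near_duplicate_keys"] = [
        (a, b) for i, a in enumerate(values) for b in values[i + 1:]
        if difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    ]
    return report
```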

Outliers

  • Identify statistical outliers in numeric columns
  • Check for values outside expected ranges
  • Look for anomalies in categorical data
  • Validate business logic constraints
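
One common statistical definition is the Tukey (IQR) fence, sketched here; k=1.5 is the conventional multiplier, not a universal rule:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Values beyond the fences Q1 - k*IQR and Q3 + k*IQR."""
    s = pd.to_numeric(series, errors="coerce").dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return s[(s < low) | (s > high)].tolist()
```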

Consistency Checks

  • Cross-validate related columns
  • Check referential integrity
  • Validate business rules
  • Look for logical contradictions
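
Cross-column business rules can be expressed as row-level predicates; the (name, predicate) shape below is one possible design, chosen so rules stay data-driven and easy to extend:

```python
import pandas as pd

def cross_check(df, rules):
    """Apply row-level rules; each rule is a (name, predicate) pair over a row."""
    violations = {}
    for name, predicate in rules:
        # Collect the index labels of rows where the predicate does not hold
        bad_index = df.index[~df.apply(predicate, axis=1)].tolist()
        if bad_index:
            violations[name] = bad_index
    return violations
```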

6. Validation Script (Python)

Create a Python script to perform automated checks:

```python
import pandas as pd

def audit_csv(file_path, delimiter=','):
    """Perform a comprehensive CSV audit and return a structured report."""

    # Load CSV
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {e}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": int(count),
                "percentage": round(percentage, 2)
            })

    # Check for exact duplicate rows
    duplicate_rows = int(df.duplicated().sum())
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": duplicate_rows
        })

    # Data type validation: flag object columns whose values mix Python types
    for col in df.columns:
        if df[col].dtype == 'object':
            if df[col].dropna().map(type).nunique() > 1:
                audit_report["issues"].append({
                    "type": "mixed_types",
                    "column": col
                })

    return audit_report
```

7. Report Format

Generate a structured audit report:

CSV Audit Report

File Information

  • File: data.csv
  • Size: 15.2 MB
  • Rows: 50,000
  • Columns: 12
  • Headers: id, name, email, age, join_date, salary, department, ...

Issues Found

Critical Issues

  1. Missing Values:
    • email: 2,500 missing (5.0%)
    • salary: 150 missing (0.3%)

Warnings

  1. Inconsistent Date Format:
    • join_date: Mix of ISO and US formats detected
    • Examples: 2023-01-15, 01/15/2023, 15-Jan-2023
  2. Potential Outliers:
    • age: Values 0 and 150 detected
    • salary: Extremely high values > $1M

Recommendations

  1. Clean up email field - contact data source
  2. Standardize date format to ISO 8601
  3. Validate age and salary ranges
  4. Remove or investigate duplicate rows

Summary

  • Overall Quality: Good
  • Issues to Fix: 3 critical, 5 warnings
  • Estimated Fix Time: 2-3 hours

8. Common Issues and Solutions

Encoding Problems

  • Try different encodings (UTF-8, Latin-1, Windows-1252)
  • Use the chardet library to detect encoding
  • Handle the BOM (Byte Order Mark) if present
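
If you prefer to avoid a third-party dependency like chardet, a stdlib-only fallback is to try candidate encodings in order; note latin-1 decodes any byte sequence, so it acts as a last resort rather than a real detection:

```python
import codecs

def sniff_encoding(file_path, candidates=("utf-8", "cp1252", "latin-1")):
    """Guess an encoding: honor a UTF-8 BOM, else try candidates in order."""
    with open(file_path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # reading with this encoding consumes the BOM
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None
```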

Large Files

  • Process in chunks for memory efficiency
  • Use Dask for out-of-core processing
  • Sample data for quick validation
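
Chunked processing with pandas looks like this sketch, which aggregates missing-value counts without holding the whole file in memory (the chunk size is a tunable assumption):

```python
import pandas as pd

def chunked_missing_counts(file_path, chunksize=100_000):
    """Aggregate per-column missing counts across chunks of the file."""
    totals = None
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        counts = chunk.isna().sum()
        # Accumulate counts chunk by chunk instead of loading everything
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return {} if totals is None else totals.astype(int).to_dict()
```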

Complex CSVs

  • Handle quoted fields with embedded delimiters
  • Process multi-line records carefully
  • Validate escape character usage
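
The csv module already handles these cases correctly, which is a reason to prefer it over naive line splitting; this small example demonstrates embedded delimiters, doubled-quote escapes, and multi-line records:

```python
import csv
import io

# A quoted field may contain the delimiter and even embedded newlines;
# inside quotes, "" is an escaped quote character.
raw = 'id,note\n1,"hello, world"\n2,"she said ""hi""\nand left"\n'
rows = list(csv.reader(io.StringIO(raw)))
```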

Usage

  1. Provide the CSV file path
  2. Specify delimiter if not comma
  3. Configure validation rules based on your data
  4. Review the generated audit report
  5. Address issues systematically

Tips

  • Always keep a backup of original data
  • Document any data transformations
  • Create validation rules based on business requirements
  • Consider using schema validation tools like Great Expectations
  • Automate regular audits for production data