csv-data-auditor


CSV Data Auditor

Instructions

When auditing CSV data, perform these validation checks systematically:

1. File Structure Validation

  • Verify the file exists and is readable
  • Check if the file is a valid CSV format
  • Ensure consistent delimiter usage
  • Verify proper quoting and escaping
  • Check for balanced quotes
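
The checks above can be sketched with the standard library's csv.Sniffer; the function name, return shape, and the odd-quote-count heuristic are illustrative, not a definitive implementation:

```python
import csv
from pathlib import Path

def check_structure(file_path, sample_size=64 * 1024):
    """Check existence, detect the delimiter, and look for unbalanced quotes."""
    path = Path(file_path)
    if not path.is_file():
        return {"ok": False, "error": "file not found or not readable"}
    sample = path.read_text(encoding="utf-8", errors="replace")[:sample_size]
    try:
        # Sniffer infers the dialect; failure suggests inconsistent delimiters
        dialect = csv.Sniffer().sniff(sample)
    except csv.Error as e:
        return {"ok": False, "error": f"could not infer dialect: {e}"}
    # An odd number of quote characters in the sample hints at an unbalanced quote
    balanced = sample.count(dialect.quotechar) % 2 == 0
    return {"ok": balanced, "delimiter": dialect.delimiter,
            "quotechar": dialect.quotechar}
```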

2. Header Analysis

  • Confirm headers exist (unless headerless CSV)
  • Check for duplicate header names
  • Validate header naming conventions
  • Ensure headers are descriptive and consistent
  • Check for special characters in headers
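
A minimal header check along these lines; the naming rule (letters, digits, underscores) is an illustrative convention, not a standard:

```python
from collections import Counter

def check_headers(headers):
    """Flag duplicate, untrimmed, or special-character header names."""
    issues = []
    duplicates = [name for name, n in Counter(headers).items() if n > 1]
    if duplicates:
        issues.append({"type": "duplicate_headers", "names": duplicates})
    for name in headers:
        if name != name.strip():
            issues.append({"type": "untrimmed_header", "name": name})
        # Allow letters, digits, and underscores; flag anything else
        if not name.strip().replace("_", "").isalnum():
            issues.append({"type": "special_characters", "name": name})
    return issues
```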

3. Row Consistency

  • Count total rows and columns
  • Verify all rows have the same number of columns
  • Check for empty rows or rows with only whitespace
  • Identify truncated rows
  • Detect corrupted or malformed rows
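
A sketch of the row checks using the csv module; note the reported numbers count CSV records, not physical file lines, so they drift if quoted fields span multiple lines:

```python
import csv

def check_rows(file_path, delimiter=","):
    """Report blank and ragged rows, numbered from 1 with the header as row 1."""
    issues = []
    with open(file_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        try:
            expected = len(next(reader))  # header row sets the column count
        except StopIteration:
            return [{"type": "empty_file"}]
        for lineno, row in enumerate(reader, start=2):
            if not any(cell.strip() for cell in row):
                issues.append({"type": "blank_row", "line": lineno})
            elif len(row) != expected:
                issues.append({"type": "ragged_row", "line": lineno,
                               "columns": len(row), "expected": expected})
    return issues
```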

4. Data Type Validation

For each column, validate the expected data type:

Numeric Fields

  • Check for non-numeric values
  • Validate decimal point usage
  • Look for negative values where inappropriate
  • Check for extremely large/small values
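
One way to sketch these numeric checks with pandas (the helper name and report keys are illustrative):

```python
import pandas as pd

def check_numeric(series, min_value=None, max_value=None):
    """Coerce a column to numeric; count parse failures and out-of-range values."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Values present in the input but NaN after coercion failed to parse
    report = {"non_numeric": int((coerced.isna() & series.notna()).sum())}
    if min_value is not None:
        report["below_min"] = int((coerced < min_value).sum())
    if max_value is not None:
        report["above_max"] = int((coerced > max_value).sum())
    return report
```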

Date/Time Fields

  • Verify date format consistency (ISO 8601, US, EU, etc.)
  • Check for invalid dates (e.g., February 30th)
  • Validate time zone handling
  • Look for future dates where inappropriate
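
These date checks can be approximated by parsing against one expected format (ISO 8601 assumed here); anything off-format or impossible, such as February 30th, fails to parse:

```python
import pandas as pd

def check_dates(series, fmt="%Y-%m-%d"):
    """Count values that fail to parse under one expected format, plus future dates."""
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    # Present but unparseable: either invalid dates or a different format
    invalid = int((parsed.isna() & series.notna()).sum())
    future = int((parsed > pd.Timestamp.now()).sum())
    return {"invalid_or_off_format": invalid, "future_dates": future}
```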

Text Fields

  • Check for encoding issues
  • Look for unexpected special characters
  • Verify string length constraints
  • Check for leading/trailing whitespace
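
A small helper for the whitespace and length checks (names and report keys are illustrative):

```python
import pandas as pd

def check_text(series, max_length=None):
    """Flag untrimmed whitespace and over-length strings in a text column."""
    s = series.dropna().astype(str)
    report = {"untrimmed": int((s != s.str.strip()).sum())}
    if max_length is not None:
        report["too_long"] = int((s.str.len() > max_length).sum())
    return report
```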

Boolean Fields

  • Validate true/false representations
  • Check for 1/0, Y/N, Yes/No consistency
  • Look for ambiguous values
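
A sketch of the boolean check; the accepted vocabulary below is an assumption to adapt to your data:

```python
import pandas as pd

ACCEPTED = {"true", "false", "1", "0", "y", "n", "yes", "no"}  # assumed vocabulary

def check_booleans(series):
    """List boolean spellings in use and any values outside the accepted set."""
    values = series.dropna().astype(str).str.strip().str.lower()
    return {
        "representations": sorted(values.unique()),
        "ambiguous": sorted(values[~values.isin(ACCEPTED)].unique()),
    }
```

Even when every value is accepted, a long "representations" list (say, both 1/0 and Yes/No) signals the consistency problem described above.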

5. Data Quality Checks

Missing Values

  • Count NULL/empty values per column
  • Calculate missing value percentages
  • Identify patterns in missing data
  • Check for placeholders like "N/A", "null", "-"
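
A placeholder-aware missing-value count along these lines; the placeholder list is illustrative and should match your data source:

```python
import pandas as pd

PLACEHOLDERS = {"n/a", "na", "null", "none", "-", ""}  # illustrative list

def missing_report(df):
    """Per-column count of true NaNs plus common placeholder strings."""
    report = {}
    for col in df.columns:
        text = df[col].astype(str).str.strip().str.lower()
        # Only count placeholders on cells that are not already NaN
        placeholders = (text.isin(PLACEHOLDERS) & df[col].notna()).sum()
        total = int(df[col].isna().sum() + placeholders)
        if total:
            report[col] = {"missing": total,
                           "pct": round(100 * total / len(df), 2)}
    return report
```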

Duplicates

  • Find exact duplicate rows
  • Check for duplicate IDs or keys
  • Identify near-duplicates (typos, variations)
  • Validate uniqueness constraints
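
Exact duplicates are one pandas call; near-duplicates need fuzzy matching. The sketch below uses difflib with an assumed 0.9 similarity threshold, and the pairwise comparison is O(n²), so sample the key column first on large files:

```python
import difflib
import pandas as pd

def duplicate_report(df, key, threshold=0.9):
    """Exact duplicate rows and keys, plus near-duplicate key pairs."""
    report = {
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }
    values = df[key].dropna().astype(str).unique().tolist()
    # Case-insensitive fuzzy comparison of every distinct key pair
    report["near_duplicate_keys"] = [
        (a, b) for i, a in enumerate(values) for b in values[i + 1:]
        if difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    ]
    return report
```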

Outliers

  • Identify statistical outliers in numeric columns
  • Check for values outside expected ranges
  • Look for anomalies in categorical data
  • Validate business logic constraints
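
One common statistical definition is the Tukey (IQR) fence, sketched here; k=1.5 is the conventional multiplier, not a universal rule:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Values beyond the fences Q1 - k*IQR and Q3 + k*IQR."""
    s = pd.to_numeric(series, errors="coerce").dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return s[(s < low) | (s > high)].tolist()
```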

Consistency Checks

  • Cross-validate related columns
  • Check referential integrity
  • Validate business rules
  • Look for logical contradictions
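
Cross-column business rules can be expressed as row-level predicates; the (name, predicate) shape below is one possible design, chosen so rules stay data-driven and easy to extend:

```python
import pandas as pd

def cross_check(df, rules):
    """Apply row-level rules; each rule is a (name, predicate) pair over a row."""
    violations = {}
    for name, predicate in rules:
        # Collect the index labels of rows where the predicate does not hold
        bad_index = df.index[~df.apply(predicate, axis=1)].tolist()
        if bad_index:
            violations[name] = bad_index
    return violations
```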

6. Validation Script (Python)

Create a Python script to perform automated checks:

```python
import pandas as pd

def audit_csv(file_path, delimiter=','):
    """Perform a comprehensive CSV audit and return a structured report."""

    # Load CSV
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {e}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": int(count),
                "percentage": round(percentage, 2)
            })

    # Check for exact duplicate rows
    duplicate_rows = int(df.duplicated().sum())
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": duplicate_rows
        })

    # Data type validation: flag object columns whose values mix Python types
    for col in df.columns:
        if df[col].dtype == 'object':
            if df[col].dropna().map(type).nunique() > 1:
                audit_report["issues"].append({
                    "type": "mixed_types",
                    "column": col
                })

    return audit_report
```

7. Report Format

Generate a structured audit report:

CSV Audit Report

File Information

  • File: data.csv
  • Size: 15.2 MB
  • Rows: 50,000
  • Columns: 12
  • Headers: id, name, email, age, join_date, salary, department, ...

Issues Found

Critical Issues

  1. Missing Values:
    • email: 2,500 missing (5.0%)
    • salary: 150 missing (0.3%)

Warnings

  1. Inconsistent Date Format:
    • join_date: Mix of ISO and US formats detected
    • Examples: 2023-01-15, 01/15/2023, 15-Jan-2023
  2. Potential Outliers:
    • age: Values 0 and 150 detected
    • salary: Extremely high values > $1M

Recommendations

  1. Clean up email field - contact data source
  2. Standardize date format to ISO 8601
  3. Validate age and salary ranges
  4. Remove or investigate duplicate rows

Summary

  • Overall Quality: Good
  • Issues to Fix: 3 critical, 5 warnings
  • Estimated Fix Time: 2-3 hours

8. Common Issues and Solutions

Encoding Problems

  • Try different encodings (UTF-8, Latin-1, Windows-1252)
  • Use the chardet library to detect encoding
  • Handle the BOM (Byte Order Mark) if present
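
If you prefer to avoid a third-party dependency like chardet, a stdlib-only fallback is to try candidate encodings in order; note latin-1 decodes any byte sequence, so it acts as a last resort rather than a real detection:

```python
import codecs

def sniff_encoding(file_path, candidates=("utf-8", "cp1252", "latin-1")):
    """Guess an encoding: honor a UTF-8 BOM, else try candidates in order."""
    with open(file_path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # reading with this encoding consumes the BOM
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None
```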

Large Files

  • Process in chunks for memory efficiency
  • Use Dask for out-of-core processing
  • Sample data for quick validation
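
Chunked processing with pandas looks like this sketch, which aggregates missing-value counts without holding the whole file in memory (the chunk size is a tunable assumption):

```python
import pandas as pd

def chunked_missing_counts(file_path, chunksize=100_000):
    """Aggregate per-column missing counts across chunks of the file."""
    totals = None
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        counts = chunk.isna().sum()
        # Accumulate counts chunk by chunk instead of loading everything
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return {} if totals is None else totals.astype(int).to_dict()
```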

Complex CSVs

  • Handle quoted fields with embedded delimiters
  • Process multi-line records carefully
  • Validate escape character usage
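
The csv module already handles these cases correctly, which is a reason to prefer it over naive line splitting; this small example demonstrates embedded delimiters, doubled-quote escapes, and multi-line records:

```python
import csv
import io

# A quoted field may contain the delimiter and even embedded newlines;
# inside quotes, "" is an escaped quote character.
raw = 'id,note\n1,"hello, world"\n2,"she said ""hi""\nand left"\n'
rows = list(csv.reader(io.StringIO(raw)))
```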

Usage

  1. Provide the CSV file path
  2. Specify delimiter if not comma
  3. Configure validation rules based on your data
  4. Review the generated audit report
  5. Address issues systematically

Tips

  • Always keep a backup of original data
  • Document any data transformations
  • Create validation rules based on business requirements
  • Consider using schema validation tools like Great Expectations
  • Automate regular audits for production data