csv-data-auditor
CSV Data Auditor
Instructions
When auditing CSV data, perform these validation checks systematically:
1. File Structure Validation
- Verify the file exists and is readable
- Check if the file is a valid CSV format
- Ensure consistent delimiter usage
- Verify proper quoting and escaping
- Check for balanced quotes
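As a first-pass sketch, the structural checks above can be scripted with Python's standard `csv` module; the helper name `check_structure` is illustrative, not from any library:

```python
import csv

def check_structure(text, delimiter=','):
    """Run basic structural checks on raw CSV text.

    Returns a list of issue strings; an empty list means no problems found.
    """
    issues = []
    # An odd number of double quotes usually means a broken quoted field.
    if text.count('"') % 2 != 0:
        issues.append("unbalanced quotes")
    # csv.Sniffer confirms which delimiter the sample actually uses.
    try:
        dialect = csv.Sniffer().sniff(text, delimiters=',;\t|')
        if dialect.delimiter != delimiter:
            issues.append(
                f"expected delimiter {delimiter!r}, found {dialect.delimiter!r}")
    except csv.Error:
        issues.append("could not detect a consistent CSV dialect")
    return issues
```

`csv.Sniffer` needs a few lines of sample text to work reliably, so pass it the head of the file rather than a single row.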
2. Header Analysis
- Confirm headers exist (unless headerless CSV)
- Check for duplicate header names
- Validate header naming conventions
- Ensure headers are descriptive and consistent
- Check for special characters in headers
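A minimal header audit might look like the following; the letters-digits-underscores convention is one assumption, so adjust it to your own naming rules:

```python
def audit_headers(headers):
    """Flag duplicate, empty, and special-character header names."""
    issues = []
    seen = set()
    for name in headers:
        # Duplicate names make column references ambiguous downstream.
        if name in seen:
            issues.append(f"duplicate header: {name!r}")
        seen.add(name)
        if not name.strip():
            issues.append("empty header name")
        # Letters, digits, and underscores are a common safe convention.
        elif not name.replace('_', '').isalnum():
            issues.append(f"special characters in header: {name!r}")
    return issues
```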
3. Row Consistency
- Count total rows and columns
- Verify all rows have the same number of columns
- Check for empty rows or rows with only whitespace
- Identify truncated rows
- Detect corrupted or malformed rows
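One way to sketch the row-consistency pass, again with the standard `csv` module (`audit_rows` is a hypothetical name):

```python
import csv
import io

def audit_rows(text, delimiter=','):
    """Check that every data row matches the header's column count."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    if not rows:
        return {"rows": 0, "issues": ["file is empty"]}
    expected = len(rows[0])
    issues = []
    # Row numbers start at 2 because row 1 is the header.
    for i, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            issues.append(f"row {i}: empty or whitespace-only")
        elif len(row) != expected:
            issues.append(f"row {i}: {len(row)} columns, expected {expected}")
    return {"rows": len(rows), "columns": expected, "issues": issues}
```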
4. Data Type Validation
For each column, validate the expected data type:
Numeric Fields
- Check for non-numeric values
- Validate decimal point usage
- Look for negative values where inappropriate
- Check for extremely large/small values
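A sketch of a numeric-column check; the optional range arguments stand in for whatever business limits apply to your data:

```python
def audit_numeric(values, min_val=None, max_val=None):
    """Flag non-numeric, below-minimum, and above-maximum values."""
    issues = []
    for v in values:
        try:
            n = float(v)
        except (TypeError, ValueError):
            issues.append(f"non-numeric: {v!r}")
            continue
        # Range checks catch inappropriate negatives and extreme values.
        if min_val is not None and n < min_val:
            issues.append(f"below minimum {min_val}: {v!r}")
        if max_val is not None and n > max_val:
            issues.append(f"above maximum {max_val}: {v!r}")
    return issues
```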
Date/Time Fields
- Verify date format consistency (ISO 8601, US, EU, etc.)
- Check for invalid dates (e.g., February 30th)
- Validate time zone handling
- Look for future dates where inappropriate
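The format-consistency check can be sketched with `datetime.strptime`, which also rejects impossible dates like February 30th; the format list here is just an example set:

```python
from datetime import datetime

def audit_dates(values, formats=('%Y-%m-%d', '%m/%d/%Y')):
    """Report which format each value matches; invalid dates match none.

    A column where formats_seen lists more than one entry mixes formats
    and should be standardized.
    """
    matched = {}
    invalid = []
    for v in values:
        for fmt in formats:
            try:
                datetime.strptime(v, fmt)
                matched.setdefault(fmt, []).append(v)
                break
            except ValueError:
                continue
        else:
            # No format parsed this value, so it is invalid.
            invalid.append(v)
    return {"formats_seen": sorted(matched), "invalid": invalid}
```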
Text Fields
- Check for encoding issues
- Look for unexpected special characters
- Verify string length constraints
- Check for leading/trailing whitespace
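A small sketch of the whitespace and length checks for text columns:

```python
def audit_text(values, max_length=None):
    """Flag whitespace padding and over-length strings."""
    issues = []
    for v in values:
        # Padding often breaks joins and equality comparisons silently.
        if v != v.strip():
            issues.append(f"leading/trailing whitespace: {v!r}")
        if max_length is not None and len(v) > max_length:
            issues.append(f"exceeds {max_length} chars: {v!r}")
    return issues
```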
Boolean Fields
- Validate true/false representations
- Check for 1/0, Y/N, Yes/No consistency
- Look for ambiguous values
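Boolean spellings can be audited by normalizing against accepted sets; the sets below are one plausible convention:

```python
TRUE_SET = {'true', '1', 'y', 'yes'}
FALSE_SET = {'false', '0', 'n', 'no'}

def audit_booleans(values):
    """Collect the boolean spellings in use and report ambiguous values.

    More than one spelling in the result signals an inconsistent column.
    """
    ambiguous = []
    spellings = set()
    for v in values:
        lowered = str(v).strip().lower()
        if lowered in TRUE_SET or lowered in FALSE_SET:
            spellings.add(lowered)
        else:
            ambiguous.append(v)
    return {"spellings": sorted(spellings), "ambiguous": ambiguous}
```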
5. Data Quality Checks
Missing Values
- Count NULL/empty values per column
- Calculate missing value percentages
- Identify patterns in missing data
- Check for placeholders like "N/A", "null", "-"
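Counting blanks together with common placeholder strings might look like this (the placeholder set is an assumption; extend it for your data):

```python
PLACEHOLDERS = {'', 'n/a', 'na', 'null', 'none', '-'}

def missing_stats(values):
    """Count true blanks plus common placeholder strings in a column."""
    missing = sum(1 for v in values
                  if str(v).strip().lower() in PLACEHOLDERS)
    return {
        "missing": missing,
        # Guard against division by zero on an empty column.
        "percentage": round(100 * missing / len(values), 2) if values else 0.0,
    }
```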
Duplicates
- Find exact duplicate rows
- Check for duplicate IDs or keys
- Identify near-duplicates (typos, variations)
- Validate uniqueness constraints
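Near-duplicates can be approximated with the standard library's `difflib`; this quadratic scan is only a sketch, so sample large columns first:

```python
from difflib import SequenceMatcher

def near_duplicates(values, threshold=0.9):
    """Find pairs of distinct values whose similarity meets the threshold.

    Catches typos and formatting variants that exact matching misses.
    O(n^2) comparisons, so this is for small samples, not full columns.
    """
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if a != b and SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```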
Outliers
- Identify statistical outliers in numeric columns
- Check for values outside expected ranges
- Look for anomalies in categorical data
- Validate business logic constraints
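Statistical outliers in a numeric column can be flagged with Tukey's IQR fences, sketched here with the standard `statistics` module:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```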
Consistency Checks
- Cross-validate related columns
- Check referential integrity
- Validate business rules
- Look for logical contradictions
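Cross-column rules can be expressed as named predicates; `check_rules` is an illustrative helper, not a library function:

```python
def check_rules(rows, rules):
    """Apply cross-column business rules to each row.

    Each rule is a (name, predicate) pair, where the predicate takes a
    row dict and returns True when the rule holds.
    """
    violations = []
    for i, row in enumerate(rows, start=1):
        for name, predicate in rules:
            if not predicate(row):
                violations.append(f"row {i}: {name}")
    return violations
```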
6. Validation Script (Python)
Create a Python script to perform automated checks:
```python
import pandas as pd

def audit_csv(file_path, delimiter=','):
    """Perform a comprehensive CSV audit and return a report dict."""
    # Load CSV
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {e}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": int(count),
                "percentage": round(percentage, 2)
            })

    # Check for duplicates
    duplicate_rows = df.duplicated().sum()
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": int(duplicate_rows)
        })

    # Data type validation: as one example check, flag object columns
    # that mix numeric and non-numeric values
    for col in df.columns:
        if df[col].dtype == 'object':
            sample_values = df[col].dropna().head(100)
            numeric_mask = pd.to_numeric(sample_values, errors='coerce').notna()
            if 0 < numeric_mask.sum() < len(sample_values):
                audit_report["issues"].append({
                    "type": "mixed_types",
                    "column": col
                })

    return audit_report
```

7. Report Format
Generate a structured audit report:

```markdown
# CSV Audit Report

## File Information
- File: data.csv
- Size: 15.2 MB
- Rows: 50,000
- Columns: 12
- Headers: id, name, email, age, join_date, salary, department, ...

## Issues Found

### Critical Issues
- Missing Values:
  - `email`: 2,500 missing (5.0%)
  - `salary`: 150 missing (0.3%)

### Warnings
- Inconsistent Date Format:
  - `join_date`: Mix of ISO and US formats detected
  - Examples: 2023-01-15, 01/15/2023, 15-Jan-2023
- Potential Outliers:
  - `age`: Values 0 and 150 detected
  - `salary`: Extremely high values > $1M

## Recommendations
- Clean up email field - contact data source
- Standardize date format to ISO 8601
- Validate age and salary ranges
- Remove or investigate duplicate rows

## Summary
- Overall Quality: Good
- Issues to Fix: 3 critical, 5 warnings
- Estimated Fix Time: 2-3 hours
```

8. Common Issues and Solutions
Encoding Problems
- Try different encodings (UTF-8, Latin-1, Windows-1252)
- Use the `chardet` library to detect encoding
- Handle BOM (Byte Order Mark) if present
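If adding `chardet` as a dependency is not an option, a trial-decode fallback over candidate encodings covers similar ground; the candidate order below is an assumption:

```python
def detect_encoding(raw_bytes,
                    candidates=('utf-8-sig', 'utf-8', 'windows-1252', 'latin-1')):
    """Return the first candidate encoding that decodes the bytes cleanly.

    'utf-8-sig' comes first so a BOM is consumed rather than leaking into
    the first header name; latin-1 never fails, so it acts as the fallback.
    """
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```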
Large Files
- Process in chunks for memory efficiency
- Use Dask for out-of-core processing
- Sample data for quick validation
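Chunked processing with `pandas.read_csv(chunksize=...)` keeps memory flat; this sketch accumulates only row counts and per-column missing totals:

```python
import pandas as pd

def audit_in_chunks(file_path, chunksize=100_000):
    """Audit a large CSV chunk by chunk to avoid loading it all at once."""
    total_rows = 0
    missing = None
    for chunk in pd.read_csv(file_path, chunksize=chunksize):
        total_rows += len(chunk)
        counts = chunk.isnull().sum()
        # Accumulate per-column missing counts across chunks.
        missing = counts if missing is None else missing.add(counts, fill_value=0)
    return {
        "rows": total_rows,
        "missing": missing.to_dict() if missing is not None else {},
    }
```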
Complex CSVs
- Handle quoted fields with embedded delimiters
- Process multi-line records carefully
- Validate escape character usage
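Note that Python's `csv.reader` already handles embedded delimiters and multi-line quoted records, provided the quoting is balanced:

```python
import csv
import io

def parse_complex(text, delimiter=','):
    """Parse CSV text where quoted fields may contain delimiters or newlines."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```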
Usage
- Provide the CSV file path
- Specify delimiter if not comma
- Configure validation rules based on your data
- Review the generated audit report
- Address issues systematically
Tips
- Always keep a backup of original data
- Document any data transformations
- Create validation rules based on business requirements
- Consider using schema validation tools like Great Expectations
- Automate regular audits for production data