data-scrubbing

Data Scrubbing

When to Use

  • Profile a table or file and define data-quality rules before analysis or modeling
  • Clean, standardize, dedupe, or link records in CSV, Parquet, SQL extracts, or notebook pipelines
  • Treat missing values, duplicates, outliers, types, encodings, and column naming consistently
  • Document a reproducible scrub pipeline with validation checks and sign-off criteria
  • Scrub actuarial/insurance fields (policy keys, claims triangles, exposure bases) for downstream reserving or pricing prep
  • Flag or redact PII at a technical level before sharing extracts (coordinate with compliance for legal requirements)

When NOT to Use

  • Star/snowflake modeling, warehouse ETL/ELT, CDC, or platform ingestion design → data-warehouse-engineer
  • Predictive modeling, A/B tests, causal inference, feature engineering for ML, or MLOps → data-scientist
  • Loss development, IBNR, pricing models, or appointed-actuary sign-off → actuary
  • Assumption sets, governance memos, or model assumption workshops → assumption-setting
  • SOC 2 / ISO control mapping, audit evidence automation, or privacy legal program → compliance-engineer
  • Cloud cost allocation, FinOps dashboards, or unit economics only → finops-analyst
  • Spreadsheet formula integrity or cell-level model audit without a scrub pipeline → audit-xls (if available)

Related skills

Need | Skill
Dimensional modeling, ETL/ELT, warehouse SQL performance | data-warehouse-engineer
ML modeling, experiments, production model monitoring | data-scientist
Reserving, triangles, IBNR, pricing actuarial methods | actuary
Assumption documentation and governance | assumption-setting
Technical compliance controls and audit evidence | compliance-engineer
Cloud spend attribution and cost optimization | finops-analyst
Enterprise data governance and catalog design | data-architect
Analytics engineering (dbt layers, mart tests) | analytics-data-engineer

Core Workflows

1. Intake and scope

  1. Identify source(s), grain, primary keys, and downstream consumer (report, model, regulatory filing)
  2. Record business definitions for critical fields and acceptable quality thresholds
  3. Choose deliverables: scrubbed dataset, rule catalog, pipeline code, validation report, sign-off checklist
  4. Confirm what must not change (audit trail, raw landing zone immutability)
See references/data_scrubbing_scope_and_workflow.md.

2. Profile and define quality rules

  1. Run structural profile: row/column counts, types, null rates, cardinality, min/max, patterns
  2. Classify columns: identifier, measure, dimension, date, free text, PII-sensitive
  3. Draft rules: uniqueness, referential checks, range/domain, regex, cross-field logic, volume gates
  4. Prioritize rules by severity (blocker vs warning) and tie each to a remediation action
See references/profiling_and_quality_rules.md.
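The profiling and rule-drafting steps above can be sketched in plain Python. This is a minimal illustration, not the skill's actual implementation: the helper names `profile_column` and `check_unique`, the sample rows, and the column names are all hypothetical, and empty strings are assumed to count as missing.

```python
from collections import Counter


def profile_column(rows, column):
    """Structural profile for one column of dict-shaped rows."""
    values = [row.get(column) for row in rows]
    # Assumption: None and empty string both count as missing.
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "null_rate": 1 - len(present) / len(values) if values else 0.0,
        "cardinality": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def check_unique(rows, column):
    """Uniqueness rule: return the key values that appear more than once."""
    counts = Counter(row.get(column) for row in rows)
    return sorted(v for v, n in counts.items() if n > 1 and v is not None)


# Hypothetical sample extract.
rows = [
    {"policy_id": "P1", "premium": 100.0},
    {"policy_id": "P2", "premium": None},
    {"policy_id": "P2", "premium": 250.0},
    {"policy_id": "P3", "premium": 80.0},
]

premium_profile = profile_column(rows, "premium")
duplicate_keys = check_unique(rows, "policy_id")
```

A non-empty `duplicate_keys` list would typically be treated as a blocker, while a null rate above the agreed threshold might only raise a warning; the severity mapping belongs in the rule catalog rather than in the check itself.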

3. Remediate missing values, duplicates, outliers

  1. Apply documented strategies per column (impute, flag, drop, split, business rule)
  2. Deduplicate at correct grain; preserve lineage for merged records
  3. Treat outliers with an explicit policy (cap, winsorize, exclude, investigate); never delete them silently
  4. Re-run profile deltas after each major remediation pass
See references/missing_duplicates_and_outliers.md.
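As a rough sketch of the deduplication and outlier steps above, the following keeps the latest row per key while recording which source rows were merged, then caps a measure and flags every capped row. The function names, the `row_id` lineage column, the sample claims, and the cap of 1,000,000 are illustrative assumptions; real thresholds come from the documented outlier policy.

```python
def dedupe_keep_latest(rows, key, order_by):
    """Keep the latest row per key; record merged source row ids as lineage."""
    survivors = {}
    for row in rows:
        k = row[key]
        current = survivors.get(k)
        if current is None:
            survivors[k] = {**row, "merged_from": [row["row_id"]]}
        else:
            merged = current["merged_from"] + [row["row_id"]]
            winner = row if row[order_by] > current[order_by] else current
            survivors[k] = {**winner, "merged_from": merged}
    return list(survivors.values())


def cap_outliers(rows, column, lower, upper):
    """Cap values to [lower, upper] and flag capped rows instead of deleting them."""
    out = []
    for row in rows:
        value = row[column]
        capped = min(max(value, lower), upper)
        out.append({**row, column: capped, f"{column}_capped": capped != value})
    return out


# Hypothetical claims extract; ISO date strings sort correctly as text.
claims = [
    {"row_id": 1, "claim_id": "C1", "updated": "2024-01-05", "paid": 1200.0},
    {"row_id": 2, "claim_id": "C1", "updated": "2024-02-01", "paid": 1500.0},
    {"row_id": 3, "claim_id": "C2", "updated": "2024-01-20", "paid": 9_000_000.0},
]

deduped = dedupe_keep_latest(claims, key="claim_id", order_by="updated")
cleaned = cap_outliers(deduped, "paid", lower=0.0, upper=1_000_000.0)
```

Keeping `merged_from` and the `paid_capped` flag on the output means a reviewer can trace every change back to the raw rows, which is the point of the "never silent" rule.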

4. Standardize and coerce types

  1. Normalize names, units, currencies, time zones, and categorical vocabularies
  2. Coerce types with explicit parse rules and quarantine rows that fail
  3. Fix encoding (UTF-8), delimiters, locale-specific decimals, and boolean sentinels
  4. Version mapping tables (code → label) alongside the pipeline
See references/standardization_and_type_coercion.md.
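One way to implement "explicit parse rules plus quarantine" is a per-column parser table where any parse failure routes the whole row to a quarantine list with a reason. The sketch below assumes European-style decimals (`1.234,56`), ISO dates, and a small set of boolean sentinels; the column names, sentinel sets, and helper names are hypothetical.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed boolean sentinels; extend from the documented vocabulary.
TRUE_VALUES = {"y", "yes", "true", "1"}
FALSE_VALUES = {"n", "no", "false", "0"}


def parse_bool(text):
    token = text.strip().lower()
    if token in TRUE_VALUES:
        return True
    if token in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized boolean sentinel: {text!r}")


def parse_decimal_eu(text):
    """Parse a decimal written with '.' thousands and ',' decimal separators."""
    try:
        return Decimal(text.strip().replace(".", "").replace(",", "."))
    except InvalidOperation as exc:
        raise ValueError(f"unparseable decimal: {text!r}") from exc


def parse_date(text):
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()


PARSERS = {"active": parse_bool, "premium": parse_decimal_eu, "effective_date": parse_date}


def coerce_rows(rows):
    """Return (clean rows, quarantined rows with failure reasons)."""
    clean, quarantine = [], []
    for row in rows:
        try:
            clean.append({col: PARSERS[col](row[col]) for col in PARSERS})
        except ValueError as exc:
            quarantine.append({"row": row, "reason": str(exc)})
    return clean, quarantine


raw = [
    {"active": "Y", "premium": "1.234,56", "effective_date": "2024-03-01"},
    {"active": "maybe", "premium": "10,00", "effective_date": "2024-03-02"},
    {"active": "no", "premium": "12,5", "effective_date": "03/01/2024"},
]
clean, quarantine = coerce_rows(raw)
```

Quarantined rows keep their original string values, so they can be re-parsed once the rule is fixed instead of being lost to a silent `None`.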

5. PII and governance (technical, not legal advice)

  1. Inventory sensitive columns; classify using organizational taxonomy when provided
  2. Apply minimization: drop, hash/tokenize, mask, or aggregate per approved pattern
  3. Log scrub actions; restrict outputs; never commit secrets or production PII to public repos
  4. Escalate legal basis, retention, and cross-border rules to compliance-engineer / counsel
See references/pii_redaction_and_governance.md.
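The minimization and logging steps above might look like the following sketch, which applies a per-column plan (drop, tokenize, mask) and returns an action log. It is a technical illustration only: the plan, column names, and sample record are hypothetical, the key shown is a placeholder (a real key would come from a secret store, never from source control), and whether keyed hashing is an approved pattern is a decision for the organization.

```python
import hashlib
import hmac


def tokenize(value, secret_key):
    """Keyed hash (HMAC-SHA256): stable for joins, not reversible without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def scrub(records, plan, secret_key):
    """Apply the per-column plan and return (scrubbed records, action log)."""
    actions = {"tokenize": lambda v: tokenize(v, secret_key), "mask": mask_email}
    scrubbed = []
    for record in records:
        out = {}
        for column, value in record.items():
            action = plan.get(column, "keep")
            if action == "drop":
                continue
            out[column] = actions[action](value) if action in actions else value
        scrubbed.append(out)
    log = [{"column": c, "action": a, "rows": len(records)} for c, a in plan.items()]
    return scrubbed, log


# Hypothetical plan and obviously fake sample data.
PLAN = {"ssn": "tokenize", "email": "mask", "notes": "drop"}
KEY = b"placeholder-key-load-from-secret-store"
records = [
    {"policy_id": "P1", "ssn": "000-00-0000", "email": "alice@example.com", "notes": "called"},
]
scrubbed, log = scrub(records, PLAN, KEY)
```

Columns not named in the plan pass through unchanged, so the plan should be reviewed against the sensitive-column inventory before any extract is shared.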

6. Actuarial / insurance scrubbing

  1. Validate policy/claim keys, effective/accident dates, and triangle orientation
  2. Align exposure bases and earned premium logic with documented definitions
  3. Scrub large losses, sublimits, and reinsurance fields without distorting triangle structure
  4. Hand off reserving/pricing math to actuary after data is signed off for modeling
See references/actuarial_insurance_data_scrubbing.md.
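A minimal sketch of the key, date, and orientation checks above: each claim is validated against its policy term, and a triangle is built from long-format records with accident years as rows and development lag as columns, so a swapped accident/valuation field fails loudly. Field names, rule names, and the annual-period assumption are all hypothetical; no reserving math is performed here.

```python
from datetime import date


def validate_claim(claim, policies):
    """Return the list of rule failures for one claim (empty list means pass)."""
    policy = policies.get(claim["policy_id"])
    if policy is None:
        return ["orphan_policy_key"]
    failures = []
    if not policy["effective"] <= claim["accident_date"] <= policy["expiry"]:
        failures.append("accident_date_outside_policy_term")
    if claim["report_date"] < claim["accident_date"]:
        failures.append("reported_before_accident")
    return failures


def build_triangle(records):
    """Pivot long records into {accident_year: {development_lag: amount}}."""
    triangle = {}
    for rec in records:
        lag = rec["valuation_year"] - rec["accident_year"]
        if lag < 0:
            # A negative lag usually means the two year fields were swapped.
            raise ValueError(f"valuation precedes accident year: {rec}")
        triangle.setdefault(rec["accident_year"], {})[lag] = rec["cumulative_paid"]
    return triangle


policies = {"P1": {"effective": date(2023, 1, 1), "expiry": date(2023, 12, 31)}}
claims = [
    {"claim_id": "C1", "policy_id": "P1",
     "accident_date": date(2023, 6, 1), "report_date": date(2023, 6, 10)},
    {"claim_id": "C2", "policy_id": "P1",
     "accident_date": date(2024, 2, 1), "report_date": date(2024, 2, 3)},
    {"claim_id": "C3", "policy_id": "P9",
     "accident_date": date(2023, 6, 1), "report_date": date(2023, 6, 2)},
]
results = {c["claim_id"]: validate_claim(c, policies) for c in claims}

long_records = [
    {"accident_year": 2022, "valuation_year": 2022, "cumulative_paid": 100.0},
    {"accident_year": 2022, "valuation_year": 2023, "cumulative_paid": 150.0},
    {"accident_year": 2023, "valuation_year": 2023, "cumulative_paid": 110.0},
]
triangle = build_triangle(long_records)
```

Deriving the lag from the two year fields, rather than trusting a pre-pivoted grid, matters because a transposed triangle has the same shape as a correct one and would pass a purely structural check.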

7. Validate, document, sign off

  1. Execute rule suite on scrubbed output; compare to thresholds and prior period if applicable
  2. Produce validation report: pass/fail counts, quarantine volume, top failure reasons
  3. Package reproducible pipeline (script/SQL/notebook), config, and rule catalog with version hash
  4. Obtain owner sign-off before promoting to modeling or reporting consumers
See references/data_scrubbing_scope_and_workflow.md (sign-off section).
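The validation report and version hash described above could be produced along these lines. This is a sketch under stated assumptions: the rule names, the catalog structure, the sample rows, and the 10% failure threshold are hypothetical, and the hash is simply a SHA-256 over a canonical JSON dump of the catalog.

```python
import hashlib
import json
from collections import Counter


def rule_catalog_hash(catalog):
    """Stable SHA-256 over the rule catalog (sorted keys, compact separators)."""
    canonical = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validation_report(rows, rules, catalog, quarantine_count, max_fail_rate):
    """Run every rule on every row and summarize results for sign-off."""
    failures = Counter()
    failed_rows = 0
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        failures.update(failed)
        failed_rows += bool(failed)
    fail_rate = failed_rows / len(rows) if rows else 0.0
    return {
        "rows": len(rows),
        "passed": len(rows) - failed_rows,
        "failed": failed_rows,
        "quarantined": quarantine_count,
        "top_failure_reasons": failures.most_common(3),
        "catalog_hash": rule_catalog_hash(catalog),
        "ready_for_sign_off": fail_rate <= max_fail_rate,
    }


# Hypothetical catalog, executable rules, and scrubbed output.
catalog = {
    "premium_non_negative": {"severity": "blocker"},
    "state_in_domain": {"severity": "warning"},
}
rules = {
    "premium_non_negative": lambda r: r["premium"] >= 0,
    "state_in_domain": lambda r: r["state"] in {"CA", "NY"},
}
rows = [
    {"premium": 100.0, "state": "CA"},
    {"premium": -5.0, "state": "NY"},
    {"premium": 20.0, "state": "ZZ"},
    {"premium": 30.0, "state": "NY"},
]
report = validation_report(rows, rules, catalog, quarantine_count=1, max_fail_rate=0.1)
```

Storing `catalog_hash` in the report ties each sign-off to the exact rule set that was run, so a later change to the catalog is visible as a different hash.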

When to load references

Topic | Reference
Scope, workflow, sign-off | references/data_scrubbing_scope_and_workflow.md
Profiling and quality rules | references/profiling_and_quality_rules.md
Missing, duplicates, outliers | references/missing_duplicates_and_outliers.md
Standardization and types | references/standardization_and_type_coercion.md
PII and governance | references/pii_redaction_and_governance.md
Actuarial / insurance data | references/actuarial_insurance_data_scrubbing.md