data-scrubbing

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Data Scrubbing

数据清洗（Data Scrubbing）

When to Use

适用场景

Profile a table or file and define data-quality rules before analysis or modeling
Clean, standardize, dedupe, or link records in CSV, Parquet, SQL extracts, or notebook pipelines
Treat missing values, duplicates, outliers, types, encodings, and column naming consistently
Document a reproducible scrub pipeline with validation checks and sign-off criteria
Scrub actuarial/insurance fields (policy keys, claims triangles, exposure bases) for downstream reserving or pricing prep
Flag or redact PII at a technical level before sharing extracts (coordinate with compliance for legal requirements)

在分析或建模前对表格或文件进行数据探查并定义数据质量规则
清理、标准化、去重或关联CSV、Parquet、SQL导出数据或笔记本管道中的记录
统一处理缺失值、重复项、异常值、数据类型、编码及列命名
记录包含验证检查和签字确认标准的可复现清洗管道
清洗精算/保险字段（保单密钥、理赔三角表、风险暴露基数），为后续准备金计提或定价准备数据
在共享导出数据前从技术层面标记或编辑PII（需与合规部门协调法律要求）

When NOT to Use

不适用场景

Star/snowflake modeling, warehouse ETL/ELT, CDC, or platform ingestion design →
```
data-warehouse-engineer
```
Predictive modeling, A/B tests, causal inference, feature engineering for ML, or MLOps →
```
data-scientist
```
Loss development, IBNR, pricing models, or appointed-actuary sign-off →
```
actuary
```
Assumption sets, governance memos, or model assumption workshops →
```
assumption-setting
```
SOC 2 / ISO control mapping, audit evidence automation, or privacy legal program →
```
compliance-engineer
```
Cloud cost allocation, FinOps dashboards, or unit economics only →
```
finops-analyst
```
Spreadsheet formula integrity or cell-level model audit without a scrub pipeline →
```
audit-xls
```
(if available)

星型/雪花模型建模、数据仓库ETL/ELT、CDC或平台 ingestion 设计 → 请使用
```
data-warehouse-engineer
```
预测建模、A/B测试、因果推断、机器学习特征工程或MLOps → 请使用
```
data-scientist
```
损失发展、IBNR、定价模型或指定精算师签字确认 → 请使用
```
actuary
```
假设集、治理备忘录或模型假设研讨会 → 请使用
```
assumption-setting
```
SOC 2 / ISO控制映射、审计证据自动化或隐私法律项目 → 请使用
```
compliance-engineer
```
云成本分配、FinOps仪表板或仅单位经济分析 → 请使用
```
finops-analyst
```
仅检查电子表格公式完整性或单元格级模型审计，无清洗管道 → 请使用
```
audit-xls
```
（若可用）

Related skills

Need	Skill
Dimensional modeling, ETL/ELT, warehouse SQL performance	`data-warehouse-engineer`
ML modeling, experiments, production model monitoring	`data-scientist`
Reserving, triangles, IBNR, pricing actuarial methods	`actuary`
Assumption documentation and governance	`assumption-setting`
Technical compliance controls and audit evidence	`compliance-engineer`
Cloud spend attribution and cost optimization	`finops-analyst`
Enterprise data governance and catalog design	`data-architect`
Analytics engineering (dbt layers, mart tests)	`analytics-data-engineer`

需求	技能
维度建模、ETL/ELT、数据仓库SQL性能优化	`data-warehouse-engineer`
机器学习建模、实验、生产模型监控	`data-scientist`
准备金计提、三角表、IBNR、精算定价方法	`actuary`
假设文档与治理	`assumption-setting`
技术合规控制与审计证据	`compliance-engineer`
云支出归因与成本优化	`finops-analyst`
企业数据治理与数据目录设计	`data-architect`
分析工程（dbt层、数据集市测试）	`analytics-data-engineer`

Core Workflows

核心工作流

1. Intake and scope

1. 接收与范围界定

Identify source(s), grain, primary keys, and downstream consumer (report, model, regulatory filing)
Record business definitions for critical fields and acceptable quality thresholds
Choose deliverables: scrubbed dataset, rule catalog, pipeline code, validation report, sign-off checklist
Confirm what must not change (audit trail, raw landing zone immutability)

See
references/data_scrubbing_scope_and_workflow.md
.

识别数据源、数据粒度、主键及下游使用者（报告、模型、监管申报）
记录关键字段的业务定义及可接受的质量阈值
确定交付物：清洗后的数据集、规则目录、管道代码、验证报告、签字确认清单
确认不可修改的内容（审计追踪、原始落地区的不可变性）

参考文档：
references/data_scrubbing_scope_and_workflow.md

2. Profile and define quality rules

2. 数据探查与质量规则定义

Run structural profile: row/column counts, types, null rates, cardinality, min/max, patterns
Classify columns: identifier, measure, dimension, date, free text, PII-sensitive
Draft rules: uniqueness, referential checks, range/domain, regex, cross-field logic, volume gates
Prioritize rules by severity (blocker vs warning) and tie each to a remediation action

See
references/profiling_and_quality_rules.md
.

执行结构探查：行/列计数、数据类型、空值率、基数、极值、模式
列分类：标识符、度量值、维度、日期、自由文本、PII敏感字段
起草规则：唯一性、引用检查、范围/域、正则表达式、跨字段逻辑、数量阈值
按严重程度（阻塞项 vs 警告）优先排序规则，并为每条规则关联整改措施

参考文档：
references/profiling_and_quality_rules.md

3. Remediate missing values, duplicates, outliers

3. 缺失值、重复项、异常值整改

Apply documented strategies per column (impute, flag, drop, split, business rule)
Deduplicate at correct grain; preserve lineage for merged records
Treat outliers with explicit policy (cap, winsorize, exclude, investigate)—never silent deletion
Re-run profile deltas after each major remediation pass

See
references/missing_duplicates_and_outliers.md
.

针对每列应用已记录的策略（填充、标记、删除、拆分、业务规则）
在正确粒度上去重；保留合并记录的 lineage
采用明确策略处理异常值（截断、缩尾、排除、调查）——绝不静默删除
每次重大整改后重新运行探查并对比差异

参考文档：
references/missing_duplicates_and_outliers.md

4. Standardize and coerce types

4. 标准化与类型转换

Normalize names, units, currencies, time zones, and categorical vocabularies
Coerce types with explicit parse rules and quarantine rows that fail
Fix encoding (UTF-8), delimiters, locale-specific decimals, and boolean sentinels
Version mapping tables (code → label) alongside the pipeline

See
references/standardization_and_type_coercion.md
.

标准化名称、单位、货币、时区及分类词汇
使用明确的解析规则转换数据类型，并隔离转换失败的行
修复编码（UTF-8）、分隔符、区域特定小数格式及布尔标记
为管道配套版本化映射表（代码→标签）

参考文档：
references/standardization_and_type_coercion.md

5. PII and governance (technical, not legal advice)

5. PII与治理（技术层面，非法律建议）

Inventory sensitive columns; classify using organizational taxonomy when provided
Apply minimization: drop, hash/tokenize, mask, or aggregate per approved pattern
Log scrub actions; restrict outputs; never commit secrets or production PII to public repos
Escalate legal basis, retention, and cross-border rules to
```
compliance-engineer
```
/ counsel

See
references/pii_redaction_and_governance.md
.

盘点敏感列；若提供组织分类体系则按其分类
应用最小化原则：按批准的模式删除、哈希/令牌化、掩码或聚合
记录清洗操作；限制输出；绝不将机密或生产环境PII提交至公共仓库
将法律依据、保留期限及跨境规则 escalate 至
```
compliance-engineer
```
/法律顾问

参考文档：
references/pii_redaction_and_governance.md

6. Actuarial / insurance scrubbing

6. 精算/保险数据清洗

Validate policy/claim keys, effective/accident dates, and triangle orientation
Align exposure bases and earned premium logic with documented definitions
Scrub large losses, sublimits, and reinsurance fields without distorting triangle structure
Hand off reserving/pricing math to
```
actuary
```
after data is signed off for modeling

See
references/actuarial_insurance_data_scrubbing.md
.

验证保单/理赔密钥、生效/事故日期及三角表方向
对齐风险暴露基数及已赚保费逻辑与已记录的定义
清洗大额损失、分项限额及再保险字段，同时不破坏三角表结构
数据签字确认后将准备金计提/定价计算移交
```
actuary
```

参考文档：
references/actuarial_insurance_data_scrubbing.md

7. Validate, document, sign off

7. 验证、文档记录与签字确认

Execute rule suite on scrubbed output; compare to thresholds and prior period if applicable
Produce validation report: pass/fail counts, quarantine volume, top failure reasons
Package reproducible pipeline (script/SQL/notebook), config, and rule catalog with version hash
Obtain owner sign-off before promoting to modeling or reporting consumers

See
references/data_scrubbing_scope_and_workflow.md
(sign-off section).

在清洗后的输出上执行规则套件；若适用则与阈值及往期数据对比
生成验证报告：通过/失败计数、隔离数据量、主要失败原因
打包可复现管道（脚本/SQL/笔记本）、配置文件及带版本哈希的规则目录
在提交给建模或报告使用者前获取所有者签字确认

参考文档：
references/data_scrubbing_scope_and_workflow.md
（签字确认章节）

When to load references

何时加载参考文档

Topic	Reference
Scope, workflow, sign-off	`references/data_scrubbing_scope_and_workflow.md`
Profiling and quality rules	`references/profiling_and_quality_rules.md`
Missing, duplicates, outliers	`references/missing_duplicates_and_outliers.md`
Standardization and types	`references/standardization_and_type_coercion.md`
PII and governance	`references/pii_redaction_and_governance.md`
Actuarial / insurance data	`references/actuarial_insurance_data_scrubbing.md`

主题	参考文档
范围、工作流、签字确认	`references/data_scrubbing_scope_and_workflow.md`
数据探查与质量规则	`references/profiling_and_quality_rules.md`
缺失值、重复项、异常值	`references/missing_duplicates_and_outliers.md`
标准化与数据类型	`references/standardization_and_type_coercion.md`
PII与治理	`references/pii_redaction_and_governance.md`
精算/保险数据	`references/actuarial_insurance_data_scrubbing.md`