data-scrubbing
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseData Scrubbing
数据清洗(Data Scrubbing)
When to Use
适用场景
- Profile a table or file and define data-quality rules before analysis or modeling
- Clean, standardize, dedupe, or link records in CSV, Parquet, SQL extracts, or notebook pipelines
- Treat missing values, duplicates, outliers, types, encodings, and column naming consistently
- Document a reproducible scrub pipeline with validation checks and sign-off criteria
- Scrub actuarial/insurance fields (policy keys, claims triangles, exposure bases) for downstream reserving or pricing prep
- Flag or redact PII at a technical level before sharing extracts (coordinate with compliance for legal requirements)
- 在分析或建模前对表格或文件进行数据探查并定义数据质量规则
- 清理、标准化、去重或关联CSV、Parquet、SQL导出数据或笔记本管道中的记录
- 统一处理缺失值、重复项、异常值、数据类型、编码及列命名
- 记录包含验证检查和签字确认标准的可复现清洗管道
- 清洗精算/保险字段(保单密钥、理赔三角表、风险暴露基数),为后续准备金计提或定价准备数据
- 在共享导出数据前从技术层面标记或编辑PII(需与合规部门协调法律要求)
When NOT to Use
不适用场景
- Star/snowflake modeling, warehouse ETL/ELT, CDC, or platform ingestion design →
data-warehouse-engineer - Predictive modeling, A/B tests, causal inference, feature engineering for ML, or MLOps →
data-scientist - Loss development, IBNR, pricing models, or appointed-actuary sign-off →
actuary - Assumption sets, governance memos, or model assumption workshops →
assumption-setting - SOC 2 / ISO control mapping, audit evidence automation, or privacy legal program →
compliance-engineer - Cloud cost allocation, FinOps dashboards, or unit economics only →
finops-analyst - Spreadsheet formula integrity or cell-level model audit without a scrub pipeline → (if available)
audit-xls
- 星型/雪花模型建模、数据仓库ETL/ELT、CDC或平台 ingestion 设计 → 请使用
data-warehouse-engineer - 预测建模、A/B测试、因果推断、机器学习特征工程或MLOps → 请使用
data-scientist - 损失发展、IBNR、定价模型或指定精算师签字确认 → 请使用
actuary - 假设集、治理备忘录或模型假设研讨会 → 请使用
assumption-setting - SOC 2 / ISO控制映射、审计证据自动化或隐私法律项目 → 请使用
compliance-engineer - 云成本分配、FinOps仪表板或仅单位经济分析 → 请使用
finops-analyst - 仅检查电子表格公式完整性或单元格级模型审计,无清洗管道 → 请使用(若可用)
audit-xls
Related skills
相关技能
| Need | Skill |
|---|---|
| Dimensional modeling, ETL/ELT, warehouse SQL performance | |
| ML modeling, experiments, production model monitoring | |
| Reserving, triangles, IBNR, pricing actuarial methods | |
| Assumption documentation and governance | |
| Technical compliance controls and audit evidence | |
| Cloud spend attribution and cost optimization | |
| Enterprise data governance and catalog design | |
| Analytics engineering (dbt layers, mart tests) | |
| 需求 | 技能 |
|---|---|
| 维度建模、ETL/ELT、数据仓库SQL性能优化 | |
| 机器学习建模、实验、生产模型监控 | |
| 准备金计提、三角表、IBNR、精算定价方法 | |
| 假设文档与治理 | |
| 技术合规控制与审计证据 | |
| 云支出归因与成本优化 | |
| 企业数据治理与数据目录设计 | |
| 分析工程(dbt层、数据集市测试) | |
Core Workflows
核心工作流
1. Intake and scope
1. 接收与范围界定
- Identify source(s), grain, primary keys, and downstream consumer (report, model, regulatory filing)
- Record business definitions for critical fields and acceptable quality thresholds
- Choose deliverables: scrubbed dataset, rule catalog, pipeline code, validation report, sign-off checklist
- Confirm what must not change (audit trail, raw landing zone immutability)
See .
references/data_scrubbing_scope_and_workflow.md- 识别数据源、数据粒度、主键及下游使用者(报告、模型、监管申报)
- 记录关键字段的业务定义及可接受的质量阈值
- 确定交付物:清洗后的数据集、规则目录、管道代码、验证报告、签字确认清单
- 确认不可修改的内容(审计追踪、原始落地区的不可变性)
参考文档:
references/data_scrubbing_scope_and_workflow.md2. Profile and define quality rules
2. 数据探查与质量规则定义
- Run structural profile: row/column counts, types, null rates, cardinality, min/max, patterns
- Classify columns: identifier, measure, dimension, date, free text, PII-sensitive
- Draft rules: uniqueness, referential checks, range/domain, regex, cross-field logic, volume gates
- Prioritize rules by severity (blocker vs warning) and tie each to a remediation action
See .
references/profiling_and_quality_rules.md- 执行结构探查:行/列计数、数据类型、空值率、基数、极值、模式
- 列分类:标识符、度量值、维度、日期、自由文本、PII敏感字段
- 起草规则:唯一性、引用检查、范围/域、正则表达式、跨字段逻辑、数量阈值
- 按严重程度(阻塞项 vs 警告)优先排序规则,并为每条规则关联整改措施
参考文档:
references/profiling_and_quality_rules.md3. Remediate missing values, duplicates, outliers
3. 缺失值、重复项、异常值整改
- Apply documented strategies per column (impute, flag, drop, split, business rule)
- Deduplicate at correct grain; preserve lineage for merged records
- Treat outliers with explicit policy (cap, winsorize, exclude, investigate)—never silent deletion
- Re-run profile deltas after each major remediation pass
See .
references/missing_duplicates_and_outliers.md- 针对每列应用已记录的策略(填充、标记、删除、拆分、业务规则)
- 在正确粒度上去重;保留合并记录的 lineage
- 采用明确策略处理异常值(截断、缩尾、排除、调查)——绝不静默删除
- 每次重大整改后重新运行探查并对比差异
参考文档:
references/missing_duplicates_and_outliers.md4. Standardize and coerce types
4. 标准化与类型转换
- Normalize names, units, currencies, time zones, and categorical vocabularies
- Coerce types with explicit parse rules and quarantine rows that fail
- Fix encoding (UTF-8), delimiters, locale-specific decimals, and boolean sentinels
- Version mapping tables (code → label) alongside the pipeline
See .
references/standardization_and_type_coercion.md- 标准化名称、单位、货币、时区及分类词汇
- 使用明确的解析规则转换数据类型,并隔离转换失败的行
- 修复编码(UTF-8)、分隔符、区域特定小数格式及布尔标记
- 为管道配套版本化映射表(代码→标签)
参考文档:
references/standardization_and_type_coercion.md5. PII and governance (technical, not legal advice)
5. PII与治理(技术层面,非法律建议)
- Inventory sensitive columns; classify using organizational taxonomy when provided
- Apply minimization: drop, hash/tokenize, mask, or aggregate per approved pattern
- Log scrub actions; restrict outputs; never commit secrets or production PII to public repos
- Escalate legal basis, retention, and cross-border rules to / counsel
compliance-engineer
See .
references/pii_redaction_and_governance.md- 盘点敏感列;若提供组织分类体系则按其分类
- 应用最小化原则:按批准的模式删除、哈希/令牌化、掩码或聚合
- 记录清洗操作;限制输出;绝不将机密或生产环境PII提交至公共仓库
- 将法律依据、保留期限及跨境规则 escalate 至/法律顾问
compliance-engineer
参考文档:
references/pii_redaction_and_governance.md6. Actuarial / insurance scrubbing
6. 精算/保险数据清洗
- Validate policy/claim keys, effective/accident dates, and triangle orientation
- Align exposure bases and earned premium logic with documented definitions
- Scrub large losses, sublimits, and reinsurance fields without distorting triangle structure
- Hand off reserving/pricing math to after data is signed off for modeling
actuary
See .
references/actuarial_insurance_data_scrubbing.md- 验证保单/理赔密钥、生效/事故日期及三角表方向
- 对齐风险暴露基数及已赚保费逻辑与已记录的定义
- 清洗大额损失、分项限额及再保险字段,同时不破坏三角表结构
- 数据签字确认后将准备金计提/定价计算移交
actuary
参考文档:
references/actuarial_insurance_data_scrubbing.md7. Validate, document, sign off
7. 验证、文档记录与签字确认
- Execute rule suite on scrubbed output; compare to thresholds and prior period if applicable
- Produce validation report: pass/fail counts, quarantine volume, top failure reasons
- Package reproducible pipeline (script/SQL/notebook), config, and rule catalog with version hash
- Obtain owner sign-off before promoting to modeling or reporting consumers
See (sign-off section).
references/data_scrubbing_scope_and_workflow.md- 在清洗后的输出上执行规则套件;若适用则与阈值及往期数据对比
- 生成验证报告:通过/失败计数、隔离数据量、主要失败原因
- 打包可复现管道(脚本/SQL/笔记本)、配置文件及带版本哈希的规则目录
- 在提交给建模或报告使用者前获取所有者签字确认
参考文档:(签字确认章节)
references/data_scrubbing_scope_and_workflow.mdWhen to load references
何时加载参考文档
| Topic | Reference |
|---|---|
| Scope, workflow, sign-off | |
| Profiling and quality rules | |
| Missing, duplicates, outliers | |
| Standardization and types | |
| PII and governance | |
| Actuarial / insurance data | |
| 主题 | 参考文档 |
|---|---|
| 范围、工作流、签字确认 | |
| 数据探查与质量规则 | |
| 缺失值、重复项、异常值 | |
| 标准化与数据类型 | |
| PII与治理 | |
| 精算/保险数据 | |