
Data Validation Skill


Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.

Pre-Delivery QA Checklist


Run through this checklist before sharing any analysis with stakeholders.

Data Quality Checks


  • Source verification: Confirmed which tables/data sources were used. Are they the right ones for this question?
  • Freshness: Data is current enough for the analysis. Noted the "as of" date.
  • Completeness: No unexpected gaps in time series or missing segments.
  • Null handling: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
  • Deduplication: Confirmed no double-counting from bad joins or duplicate source records.
  • Filter verification: All WHERE clauses and filters are correct. No unintended exclusions.
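The null-handling and deduplication checks above can be scripted. A minimal stdlib sketch over a list-of-dicts table (table and column names are hypothetical):

```python
from collections import Counter

def null_rate(rows, column):
    """Fraction of rows where `column` is None or missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def duplicate_keys(rows, key):
    """Key values that appear more than once (possible double-counting)."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical sample data
orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},   # null in a key column
    {"order_id": 2, "amount": 95.0},   # duplicate source record
]

print(null_rate(orders, "amount"))         # 0.333...
print(duplicate_keys(orders, "order_id"))  # {2: 2}
```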

Calculation Checks


  • Aggregation logic: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
  • Denominator correctness: Rate and percentage calculations use the right denominator. Denominators are non-zero.
  • Date alignment: Comparisons use the same time period length. Partial periods are excluded or noted.
  • Join correctness: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts.
  • Metric definitions: Metrics match how stakeholders define them. Any deviations are noted.
  • Subtotals sum: Parts add up to the whole where expected. If they don't, explain why (e.g., overlap).
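The denominator check above can be enforced in code rather than by eyeballing results. A minimal sketch (function and data names are hypothetical):

```python
def safe_rate(numerator, denominator):
    """Rate calculation that refuses to divide by zero instead of crashing
    or silently producing a misleading value."""
    if denominator == 0:
        return None  # flag for the analyst rather than a fake 0% or inf
    return numerator / denominator

# Hypothetical (conversions, visits) per channel
signups_by_channel = {"ads": (25, 500), "referral": (0, 0)}
rates = {ch: safe_rate(c, v) for ch, (c, v) in signups_by_channel.items()}
print(rates)  # {'ads': 0.05, 'referral': None}
```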

Reasonableness Checks


  • Magnitude: Numbers are in a plausible range. Revenue isn't negative. Percentages fall between 0% and 100%.
  • Trend continuity: No unexplained jumps or drops in time series.
  • Cross-reference: Key numbers match other known sources (dashboards, previous reports, finance data).
  • Order of magnitude: Total revenue is in the right ballpark vs. known ARR. User counts match known figures.
  • Edge cases: What happens at the boundaries? Empty segments, zero-activity periods, new entities.

Presentation Checks


  • Chart accuracy: Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
  • Number formatting: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
  • Title clarity: Titles state the insight, not just the metric. Date ranges are specified.
  • Caveat transparency: Known limitations and assumptions are stated explicitly.
  • Reproducibility: Someone else could recreate this analysis from the documentation provided.

Common Data Analysis Pitfalls


Join Explosion


The problem: A many-to-many join silently multiplies rows, inflating counts and sums.

How to detect:

```sql
-- Check row count before and after join
SELECT COUNT(*) FROM table_a;  -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id;  -- 3,500 (uh oh)
```

How to prevent:
  • Always check row counts after joins
  • If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
  • Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins
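The same inflation can be simulated in plain Python with hypothetical tables, which also shows why counting distinct keys (the COUNT(DISTINCT a.id) pattern) recovers the true entity count:

```python
table_a = [{"id": i} for i in range(1, 4)]               # 3 entities
table_b = [{"a_id": 1}, {"a_id": 1}, {"a_id": 2},
           {"a_id": 2}, {"a_id": 2}, {"a_id": 3}]        # 1:many child rows

# Equivalent of: SELECT ... FROM table_a a JOIN table_b b ON a.id = b.a_id
joined = [(a, b) for a in table_a for b in table_b if a["id"] == b["a_id"]]

print(len(table_a))                       # 3 rows before the join
print(len(joined))                        # 6 rows after -- inflated
print(len({a["id"] for a, _ in joined}))  # 3 -- distinct-entity count survives
```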

Survivorship Bias


The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
  • Analyzing user behavior of "current users" misses churned users
  • Looking at "companies using our product" ignores those who evaluated and left
  • Studying properties of "successful" outcomes without "unsuccessful" ones
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.

Incomplete Period Comparison


The problem: Comparing a partial period to a full period.
Examples:
  • "January revenue is $500K vs. December's $800K" -- but January isn't over yet
  • "This week's signups are down" -- checked on Wednesday, comparing to a full prior week
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
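A stdlib sketch of the same-number-of-days comparison (helper names are hypothetical; the day is clamped for shorter prior months):

```python
import calendar
from datetime import date

def month_to_date_window(d: date):
    """Start of month through `d`, for comparing equal-length partial periods."""
    return date(d.year, d.month, 1), d

def same_window_prior_month(d: date):
    """The matching day-of-month span in the previous month."""
    prev_year, prev_month = (d.year - 1, 12) if d.month == 1 else (d.year, d.month - 1)
    last_day = calendar.monthrange(prev_year, prev_month)[1]
    return date(prev_year, prev_month, 1), date(prev_year, prev_month, min(d.day, last_day))

today = date(2024, 1, 15)
print(month_to_date_window(today))     # (2024-01-01, 2024-01-15)
print(same_window_prior_month(today))  # (2023-12-01, 2023-12-15), not all of December
```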

Denominator Shifting


The problem: The denominator changes between periods, making rates incomparable.
Examples:
  • Conversion rate improves because you changed how you count "eligible" users
  • Churn rate changes because the definition of "active" was updated
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.

Average of Averages


The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
  • Group A: 100 users, average revenue $50
  • Group B: 10 users, average revenue $200
  • Wrong: Average of averages = ($50 + $200) / 2 = $125
  • Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
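The arithmetic above, runnable:

```python
groups = [
    {"name": "A", "users": 100, "avg_revenue": 50.0},
    {"name": "B", "users": 10,  "avg_revenue": 200.0},
]

# Wrong: averaging pre-computed averages ignores group sizes
naive = sum(g["avg_revenue"] for g in groups) / len(groups)

# Right: weight each average by its group size (equivalent to aggregating raw data)
total_users = sum(g["users"] for g in groups)
weighted = sum(g["users"] * g["avg_revenue"] for g in groups) / total_users

print(naive)     # 125.0
print(weighted)  # 63.63...
```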

Timezone Mismatches


The problem: Different data sources use different timezones, causing misalignment.
Examples:
  • Event timestamps in UTC vs. user-facing dates in local time
  • Daily rollups that use different cutoff times
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
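The date-shift risk is easy to reproduce with the stdlib. This sketch uses a fixed UTC-5 offset for illustration (real pipelines should use named zones, e.g. via zoneinfo):

```python
from datetime import datetime, timezone, timedelta

# Hypothetical event logged in a local, fixed-offset timezone (UTC-5)
local_tz = timezone(timedelta(hours=-5))
event_local = datetime(2024, 3, 1, 22, 30, tzinfo=local_tz)

# Normalize to UTC before any daily rollup -- note the calendar date changes
event_utc = event_local.astimezone(timezone.utc)

print(event_local.date())  # 2024-03-01 in local time
print(event_utc.date())    # 2024-03-02 in UTC: same event, different "day"
```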

Selection Bias in Segmentation


The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
  • "Users who completed onboarding have higher retention" -- obviously, they self-selected
  • "Power users generate more revenue" -- they became power users BY generating revenue
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.

Result Sanity Checking


Magnitude Checks


For any key number in your analysis, verify it passes the "smell test":
| Metric Type | Sanity Check |
| --- | --- |
| User counts | Does this match known MAU/DAU figures? |
| Revenue | Is this in the right order of magnitude vs. known ARR? |
| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? |
| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? |
| Averages | Is the average reasonable given what you know about the distribution? |
| Percentages | Do segment percentages sum to ~100%? |
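A couple of the table's checks expressed as reusable assertions (function names and the 2% tolerance are illustrative):

```python
def check_percentages_sum(segment_shares, tolerance=0.02):
    """Segment shares should sum to ~100% (within tolerance)."""
    return abs(sum(segment_shares.values()) - 1.0) <= tolerance

def check_rate_bounds(rate):
    """Conversion-style rates must lie in [0, 1]."""
    return 0.0 <= rate <= 1.0

shares = {"free": 0.72, "pro": 0.21, "enterprise": 0.08}  # sums to 1.01
print(check_percentages_sum(shares))  # True: within the 2% tolerance
print(check_rate_bounds(1.3))         # False: wrong denominator somewhere
```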

Cross-Validation Techniques


  1. Calculate the same metric two different ways and verify they match
  2. Spot-check individual records -- pick a few specific entities and trace their data manually
  3. Compare to known benchmarks -- match against published dashboards, finance reports, or prior analyses
  4. Reverse engineer -- if total revenue is X, does per-user revenue times user count approximately equal X?
  5. Boundary checks -- what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?
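Techniques 1 and 4 in miniature, with made-up order data: compute the same total two independent ways and assert agreement:

```python
orders = [
    {"user": "u1", "amount": 30.0},
    {"user": "u1", "amount": 70.0},
    {"user": "u2", "amount": 50.0},
]

# Way 1: straight sum over order rows
total_direct = sum(o["amount"] for o in orders)

# Way 2: roll up to per-user totals first, then sum those
per_user = {}
for o in orders:
    per_user[o["user"]] = per_user.get(o["user"], 0.0) + o["amount"]
total_from_users = sum(per_user.values())

assert abs(total_direct - total_from_users) < 1e-6, "metric mismatch -- investigate"
print(total_direct)  # 150.0
```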

Red Flags That Warrant Investigation


  • Any metric that changed by more than 50% period-over-period without an obvious cause
  • Counts or sums that are exact round numbers (suggests a filter or default value issue)
  • Rates exactly at 0% or 100% (may indicate incomplete data)
  • Results that perfectly confirm the hypothesis (reality is usually messier)
  • Identical values across time periods or segments (suggests the query is ignoring a dimension)
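The swing and flat-series flags above are easy to automate. A sketch with an illustrative 50% threshold:

```python
def flag_large_swing(current, previous, threshold=0.5):
    """Flag any period-over-period change larger than `threshold` (default 50%)."""
    if previous == 0:
        return True  # a metric appearing from zero deserves a look too
    return abs(current - previous) / abs(previous) > threshold

def flag_constant_series(values):
    """Identical values across periods suggest a query ignoring a dimension."""
    return len(values) > 1 and len(set(values)) == 1

print(flag_large_swing(1600, 1000))        # True: +60% period-over-period
print(flag_constant_series([42, 42, 42]))  # True: suspiciously flat
```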

Documentation Standards for Reproducibility


Analysis Documentation Template


Every non-trivial analysis should include:
```markdown
# Analysis: [Title]

## Question

[The specific question being answered]

## Data Sources

- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])

## Definitions

- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]

## Methodology

1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]

## Assumptions and Limitations

- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]

## Key Findings

1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]

## SQL Queries

[All queries used, with comments]

## Caveats

- [Things the reader should know before acting on this]
```

Code Documentation


For any code (SQL, Python) that may be reused:
```python
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%

Purpose:
    Calculate monthly user retention cohorts based on first activity date.

Assumptions:
    - "Active" means at least one event in the month
    - Excludes test/internal accounts (user_type != 'internal')
    - Uses UTC dates throughout

Output:
    Cohort retention matrix with cohort_month rows and months_since_signup columns.
    Values are retention rates (0-100%).
"""
```

Version Control for Analyses


  • Save queries and code in version control (git) or a shared docs system
  • Note the date of the data snapshot used
  • If an analysis is re-run with updated data, document what changed and why
  • Link to prior versions of recurring analyses for trend comparison