
Data Validation Skill


Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.

Pre-Delivery QA Checklist


Run through this checklist before sharing any analysis with stakeholders.

Data Quality Checks


  • Source verification: Confirmed which tables/data sources were used. Are they the right ones for this question?
  • Freshness: Data is current enough for the analysis. Noted the "as of" date.
  • Completeness: No unexpected gaps in time series or missing segments.
  • Null handling: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
  • Deduplication: Confirmed no double-counting from bad joins or duplicate source records.
  • Filter verification: All WHERE clauses and filters are correct. No unintended exclusions.
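The null-handling and deduplication checks above can be scripted. A minimal stdlib sketch over a list-of-dicts table (table and column names are hypothetical):

```python
from collections import Counter

def null_rate(rows, column):
    """Fraction of rows where `column` is None or missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def duplicate_keys(rows, key):
    """Key values that appear more than once (possible double-counting)."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical sample data
orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},   # null in a key column
    {"order_id": 2, "amount": 95.0},   # duplicate source record
]

print(null_rate(orders, "amount"))         # 0.333...
print(duplicate_keys(orders, "order_id"))  # {2: 2}
```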

Calculation Checks


  • Aggregation logic: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
  • Denominator correctness: Rate and percentage calculations use the right denominator. Denominators are non-zero.
  • Date alignment: Comparisons use the same time period length. Partial periods are excluded or noted.
  • Join correctness: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts.
  • Metric definitions: Metrics match how stakeholders define them. Any deviations are noted.
  • Subtotals sum: Parts add up to the whole where expected. If they don't, explain why (e.g., overlap).
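The denominator check above can be enforced in code rather than by eyeballing results. A minimal sketch (function and data names are hypothetical):

```python
def safe_rate(numerator, denominator):
    """Rate calculation that refuses to divide by zero instead of crashing
    or silently producing a misleading value."""
    if denominator == 0:
        return None  # flag for the analyst rather than a fake 0% or inf
    return numerator / denominator

# Hypothetical (conversions, visits) per channel
signups_by_channel = {"ads": (25, 500), "referral": (0, 0)}
rates = {ch: safe_rate(c, v) for ch, (c, v) in signups_by_channel.items()}
print(rates)  # {'ads': 0.05, 'referral': None}
```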

Reasonableness Checks


  • Magnitude: Numbers are in a plausible range. Revenue isn't negative. Percentages fall between 0% and 100%.
  • Trend continuity: No unexplained jumps or drops in time series.
  • Cross-reference: Key numbers match other known sources (dashboards, previous reports, finance data).
  • Order of magnitude: Total revenue is in the right ballpark vs. known ARR. User counts match known figures.
  • Edge cases: What happens at the boundaries? Empty segments, zero-activity periods, new entities.

Presentation Checks


  • Chart accuracy: Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
  • Number formatting: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
  • Title clarity: Titles state the insight, not just the metric. Date ranges are specified.
  • Caveat transparency: Known limitations and assumptions are stated explicitly.
  • Reproducibility: Someone else could recreate this analysis from the documentation provided.

Common Data Analysis Pitfalls


Join Explosion


The problem: A many-to-many join silently multiplies rows, inflating counts and sums.

How to detect:

```sql
-- Check row count before and after join
SELECT COUNT(*) FROM table_a;  -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id;  -- 3,500 (uh oh)
```

How to prevent:
  • Always check row counts after joins
  • If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
  • Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins
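The same inflation can be simulated in plain Python with hypothetical tables, which also shows why counting distinct keys (the COUNT(DISTINCT a.id) pattern) recovers the true entity count:

```python
table_a = [{"id": i} for i in range(1, 4)]               # 3 entities
table_b = [{"a_id": 1}, {"a_id": 1}, {"a_id": 2},
           {"a_id": 2}, {"a_id": 2}, {"a_id": 3}]        # 1:many child rows

# Equivalent of: SELECT ... FROM table_a a JOIN table_b b ON a.id = b.a_id
joined = [(a, b) for a in table_a for b in table_b if a["id"] == b["a_id"]]

print(len(table_a))                       # 3 rows before the join
print(len(joined))                        # 6 rows after -- inflated
print(len({a["id"] for a, _ in joined}))  # 3 -- distinct-entity count survives
```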

Survivorship Bias


The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
  • Analyzing user behavior of "current users" misses churned users
  • Looking at "companies using our product" ignores those who evaluated and left
  • Studying properties of "successful" outcomes without "unsuccessful" ones
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.

Incomplete Period Comparison


The problem: Comparing a partial period to a full period.
Examples:
  • "January revenue is $500K vs. December's $800K" -- but January isn't over yet
  • "This week's signups are down" -- checked on Wednesday, comparing to a full prior week
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
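A stdlib sketch of the same-number-of-days comparison (helper names are hypothetical; the day is clamped for shorter prior months):

```python
import calendar
from datetime import date

def month_to_date_window(d: date):
    """Start of month through `d`, for comparing equal-length partial periods."""
    return date(d.year, d.month, 1), d

def same_window_prior_month(d: date):
    """The matching day-of-month span in the previous month."""
    prev_year, prev_month = (d.year - 1, 12) if d.month == 1 else (d.year, d.month - 1)
    last_day = calendar.monthrange(prev_year, prev_month)[1]
    return date(prev_year, prev_month, 1), date(prev_year, prev_month, min(d.day, last_day))

today = date(2024, 1, 15)
print(month_to_date_window(today))     # (2024-01-01, 2024-01-15)
print(same_window_prior_month(today))  # (2023-12-01, 2023-12-15), not all of December
```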

Denominator Shifting


The problem: The denominator changes between periods, making rates incomparable.
Examples:
  • Conversion rate improves because you changed how you count "eligible" users
  • Churn rate changes because the definition of "active" was updated
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.

Average of Averages


The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
  • Group A: 100 users, average revenue $50
  • Group B: 10 users, average revenue $200
  • Wrong: Average of averages = ($50 + $200) / 2 = $125
  • Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
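The arithmetic above, runnable:

```python
groups = [
    {"name": "A", "users": 100, "avg_revenue": 50.0},
    {"name": "B", "users": 10,  "avg_revenue": 200.0},
]

# Wrong: averaging pre-computed averages ignores group sizes
naive = sum(g["avg_revenue"] for g in groups) / len(groups)

# Right: weight each average by its group size (equivalent to aggregating raw data)
total_users = sum(g["users"] for g in groups)
weighted = sum(g["users"] * g["avg_revenue"] for g in groups) / total_users

print(naive)     # 125.0
print(weighted)  # 63.63...
```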

Timezone Mismatches


The problem: Different data sources use different timezones, causing misalignment.
Examples:
  • Event timestamps in UTC vs. user-facing dates in local time
  • Daily rollups that use different cutoff times
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
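The date-shift risk is easy to reproduce with the stdlib. This sketch uses a fixed UTC-5 offset for illustration (real pipelines should use named zones, e.g. via zoneinfo):

```python
from datetime import datetime, timezone, timedelta

# Hypothetical event logged in a local, fixed-offset timezone (UTC-5)
local_tz = timezone(timedelta(hours=-5))
event_local = datetime(2024, 3, 1, 22, 30, tzinfo=local_tz)

# Normalize to UTC before any daily rollup -- note the calendar date changes
event_utc = event_local.astimezone(timezone.utc)

print(event_local.date())  # 2024-03-01 in local time
print(event_utc.date())    # 2024-03-02 in UTC: same event, different "day"
```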

Selection Bias in Segmentation


The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
  • "Users who completed onboarding have higher retention" -- obviously, they self-selected
  • "Power users generate more revenue" -- they became power users BY generating revenue
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.

Result Sanity Checking


Magnitude Checks


For any key number in your analysis, verify it passes the "smell test":
| Metric Type | Sanity Check |
| --- | --- |
| User counts | Does this match known MAU/DAU figures? |
| Revenue | Is this in the right order of magnitude vs. known ARR? |
| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? |
| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? |
| Averages | Is the average reasonable given what you know about the distribution? |
| Percentages | Do segment percentages sum to ~100%? |
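A couple of the table's checks expressed as reusable assertions (function names and the 2% tolerance are illustrative):

```python
def check_percentages_sum(segment_shares, tolerance=0.02):
    """Segment shares should sum to ~100% (within tolerance)."""
    return abs(sum(segment_shares.values()) - 1.0) <= tolerance

def check_rate_bounds(rate):
    """Conversion-style rates must lie in [0, 1]."""
    return 0.0 <= rate <= 1.0

shares = {"free": 0.72, "pro": 0.21, "enterprise": 0.08}  # sums to 1.01
print(check_percentages_sum(shares))  # True: within the 2% tolerance
print(check_rate_bounds(1.3))         # False: wrong denominator somewhere
```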

Cross-Validation Techniques


  1. Calculate the same metric two different ways and verify they match
  2. Spot-check individual records -- pick a few specific entities and trace their data manually
  3. Compare to known benchmarks -- match against published dashboards, finance reports, or prior analyses
  4. Reverse engineer -- if total revenue is X, does per-user revenue times user count approximately equal X?
  5. Boundary checks -- what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?
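Techniques 1 and 4 in miniature, with made-up order data: compute the same total two independent ways and assert agreement:

```python
orders = [
    {"user": "u1", "amount": 30.0},
    {"user": "u1", "amount": 70.0},
    {"user": "u2", "amount": 50.0},
]

# Way 1: straight sum over order rows
total_direct = sum(o["amount"] for o in orders)

# Way 2: roll up to per-user totals first, then sum those
per_user = {}
for o in orders:
    per_user[o["user"]] = per_user.get(o["user"], 0.0) + o["amount"]
total_from_users = sum(per_user.values())

assert abs(total_direct - total_from_users) < 1e-6, "metric mismatch -- investigate"
print(total_direct)  # 150.0
```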

Red Flags That Warrant Investigation


  • Any metric that changed by more than 50% period-over-period without an obvious cause
  • Counts or sums that are exact round numbers (suggests a filter or default value issue)
  • Rates exactly at 0% or 100% (may indicate incomplete data)
  • Results that perfectly confirm the hypothesis (reality is usually messier)
  • Identical values across time periods or segments (suggests the query is ignoring a dimension)
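The swing and flat-series flags above are easy to automate. A sketch with an illustrative 50% threshold:

```python
def flag_large_swing(current, previous, threshold=0.5):
    """Flag any period-over-period change larger than `threshold` (default 50%)."""
    if previous == 0:
        return True  # a metric appearing from zero deserves a look too
    return abs(current - previous) / abs(previous) > threshold

def flag_constant_series(values):
    """Identical values across periods suggest a query ignoring a dimension."""
    return len(values) > 1 and len(set(values)) == 1

print(flag_large_swing(1600, 1000))        # True: +60% period-over-period
print(flag_constant_series([42, 42, 42]))  # True: suspiciously flat
```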

Documentation Standards for Reproducibility


Analysis Documentation Template


Every non-trivial analysis should include:
```markdown
# Analysis: [Title]

## Question

[The specific question being answered]

## Data Sources

- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])

## Definitions

- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]

## Methodology

1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]

## Assumptions and Limitations

- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]

## Key Findings

1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]

## SQL Queries

[All queries used, with comments]

## Caveats

- [Things the reader should know before acting on this]
```

Code Documentation


For any code (SQL, Python) that may be reused:
```python
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%

Purpose:
    Calculate monthly user retention cohorts based on first activity date.

Assumptions:
    - "Active" means at least one event in the month
    - Excludes test/internal accounts (user_type != 'internal')
    - Uses UTC dates throughout

Output:
    Cohort retention matrix with cohort_month rows and months_since_signup columns.
    Values are retention rates (0-100%).
"""
```

Version Control for Analyses


  • Save queries and code in version control (git) or a shared docs system
  • Note the date of the data snapshot used
  • If an analysis is re-run with updated data, document what changed and why
  • Link to prior versions of recurring analyses for trend comparison