data-validation
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseData Validation Skill
数据验证技能
Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.
交付前质量检查清单、常见数据分析陷阱、结果合理性检查,以及为可复现性制定的文档标准。
Pre-Delivery QA Checklist
交付前质量检查清单
Run through this checklist before sharing any analysis with stakeholders.
在与利益相关者分享任何分析结果前,请完成以下清单检查。
Data Quality Checks
数据质量检查
- Source verification: Confirmed which tables/data sources were used. Are they the right ones for this question?
- Freshness: Data is current enough for the analysis. Noted the "as of" date.
- Completeness: No unexpected gaps in time series or missing segments.
- Null handling: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
- Deduplication: Confirmed no double-counting from bad joins or duplicate source records.
- Filter verification: All WHERE clauses and filters are correct. No unintended exclusions.
- 数据源验证:确认所使用的表/数据源是否为该问题的合适选择。
- 时效性:数据的时效性符合分析需求,并标注了“截至”日期。
- 完整性:时间序列或细分数据中无意外缺失的部分。
- 空值处理:检查关键列的空值占比,且空值已得到适当处理(排除、填充或标记)。
- 去重验证:确认不存在因错误连接或源数据重复导致的重复统计。
- 过滤条件验证:所有WHERE子句和过滤条件均正确,无意外排除的数据。
Calculation Checks
计算逻辑检查
- Aggregation logic: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
- Denominator correctness: Rate and percentage calculations use the right denominator. Denominators are non-zero.
- Date alignment: Comparisons use the same time period length. Partial periods are excluded or noted.
- Join correctness: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts.
- Metric definitions: Metrics match how stakeholders define them. Any deviations are noted.
- Subtotals sum: Parts add up to the whole where expected. If they don't, explain why (e.g., overlap).
- 聚合逻辑:GROUP BY子句包含所有非聚合列,聚合粒度与分析需求匹配。
- 分母正确性:比率和百分比计算使用正确的分母,且分母不为零。
- 日期对齐:对比使用的时间周期长度一致,部分周期已被排除或标注说明。
- 连接正确性:JOIN类型选择恰当(INNER vs LEFT),未因多对多连接导致统计数据膨胀。
- 指标定义:指标与利益相关者的定义一致,如有偏差已标注说明。
- 小计求和验证:各部分数据之和符合预期的总计值,若不符合需解释原因(如存在重叠)。
Reasonableness Checks
合理性检查
- Magnitude: Numbers are in a plausible range. Revenue isn't negative. Percentages are between 0-100%.
- Trend continuity: No unexplained jumps or drops in time series.
- Cross-reference: Key numbers match other known sources (dashboards, previous reports, finance data).
- Order of magnitude: Total revenue is in the right ballpark. User counts match known figures.
- Edge cases: What happens at the boundaries? Empty segments, zero-activity periods, new entities.
- 数值量级:数据处于合理范围内,收入不为负数,百分比在0-100%之间。
- 趋势连续性:时间序列中无无法解释的突变或骤降。
- 交叉验证:关键数据与其他已知来源(仪表盘、往期报告、财务数据)匹配。
- 量级匹配:总收入与已知的ARR量级相符,用户数与已知数据一致。
- 边界情况:验证边界场景下的结果,如空细分群体、零活动周期、新实体等。
Presentation Checks
呈现方式检查
- Chart accuracy: Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
- Number formatting: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
- Title clarity: Titles state the insight, not just the metric. Date ranges are specified.
- Caveat transparency: Known limitations and assumptions are stated explicitly.
- Reproducibility: Someone else could recreate this analysis from the documentation provided.
- 图表准确性:柱状图从0开始,坐标轴已标注,多面板图表的刻度一致。
- 数字格式:精度恰当,货币/百分比格式统一,按需添加千位分隔符。
- 标题清晰度:标题明确说明洞察内容,而非仅标注指标,并指定时间范围。
- 局限性透明度:明确说明已知的限制条件和假设前提。
- 可复现性:他人可根据提供的文档重现该分析过程。
Common Data Analysis Pitfalls
常见数据分析陷阱
Join Explosion
连接膨胀
The problem: A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:
sql
-- Check row count before and after join
SELECT COUNT(*) FROM table_a; -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh)How to prevent:
- Always check row counts after joins
- If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
- Use instead of
COUNT(DISTINCT a.id)when counting entities through joinsCOUNT(*)
问题:多对多连接会悄无声息地增加行数,导致统计数据膨胀。
检测方法:
sql
-- Check row count before and after join
SELECT COUNT(*) FROM table_a; -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh)预防措施:
- 连接后始终检查行数
- 若行数增加,调查连接关系(是否真的是1:1或1:多?)
- 通过连接统计实体时,使用替代
COUNT(DISTINCT a.id)COUNT(*)
Survivorship Bias
生存偏差
The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
- Analyzing user behavior of "current users" misses churned users
- Looking at "companies using our product" ignores those who evaluated and left
- Studying properties of "successful" outcomes without "unsuccessful" ones
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
问题:仅分析当前存在的实体,忽略已删除、流失或失败的实体。
示例:
- 分析“当前用户”的行为时,遗漏了流失用户
- 研究“使用我们产品的公司”时,忽略了曾评估但最终放弃的公司
- 仅研究“成功”结果的特征,未纳入“失败”结果
预防措施:在得出结论前,先问自己“哪些实体不在这个数据集中?”
Incomplete Period Comparison
不完整周期对比
The problem: Comparing a partial period to a full period.
Examples:
- "January revenue is $500K vs. December's $800K" -- but January isn't over yet
- "This week's signups are down" -- checked on Wednesday, comparing to a full prior week
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
问题:将部分周期与完整周期进行对比。
示例:
- “1月收入为50万美元,而12月为80万美元”——但1月尚未结束
- “本周注册量下降”——在周三进行检查,将部分周与完整周对比
预防措施:始终过滤为完整周期,或对比相同天数/同月同日的数据。
Denominator Shifting
分母偏移
The problem: The denominator changes between periods, making rates incomparable.
Examples:
- Conversion rate improves because you changed how you count "eligible" users
- Churn rate changes because the definition of "active" was updated
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.
问题:不同周期的分母定义发生变化,导致比率无法对比。
示例:
- 转化率提升是因为更改了“符合条件的用户”的统计方式
- 流失率变化是因为“活跃用户”的定义被更新
预防措施:在所有对比周期中使用一致的定义,若有定义变更需标注说明。
Average of Averages
平均值的平均值
The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
- Group A: 100 users, average revenue $50
- Group B: 10 users, average revenue $200
- Wrong: Average of averages = ($50 + $200) / 2 = $125
- Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
问题:当组规模不同时,对预先计算的平均值再取平均会得到错误结果。
示例:
- A组:100名用户,平均收入50美元
- B组:10名用户,平均收入200美元
- 错误计算:平均值的平均值 = (50 + 200) / 2 = 125美元
- 正确计算:加权平均值 = (10050 + 10200) / 110 = 63.64美元
预防措施:始终从原始数据进行聚合,切勿对预先聚合的平均值再取平均。
Timezone Mismatches
时区不匹配
The problem: Different data sources use different timezones, causing misalignment.
Examples:
- Event timestamps in UTC vs. user-facing dates in local time
- Daily rollups that use different cutoff times
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
问题:不同数据源使用不同时区,导致数据对齐错误。
示例:
- 事件时间戳为UTC,而用户可见日期为当地时间
- 每日汇总使用不同的截止时间
预防措施:分析前将所有时间戳统一为单个时区(推荐使用UTC),并记录所使用的时区。
Selection Bias in Segmentation
细分选择偏差
The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
- "Users who completed onboarding have higher retention" -- obviously, they self-selected
- "Power users generate more revenue" -- they became power users BY generating revenue
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.
问题:根据要衡量的结果来定义细分群体,形成循环逻辑。
示例:
- “完成入职流程的用户留存率更高”——显然这是用户自我选择的结果
- “核心用户产生更多收入”——核心用户的定义本身就是基于收入多少
预防措施:根据预处理特征而非结果来定义细分群体。
Result Sanity Checking
结果合理性检查
Magnitude Checks
量级检查
For any key number in your analysis, verify it passes the "smell test":
| Metric Type | Sanity Check |
|---|---|
| User counts | Does this match known MAU/DAU figures? |
| Revenue | Is this in the right order of magnitude vs. known ARR? |
| Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? |
| Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? |
| Averages | Is the average reasonable given what you know about the distribution? |
| Percentages | Do segment percentages sum to ~100%? |
对于分析中的任何关键数据,需验证其是否通过“合理性测试”:
| 指标类型 | 合理性检查 |
|---|---|
| 用户数 | 是否与已知的MAU/DAU数据匹配? |
| 收入 | 与已知的ARR量级是否相符? |
| 转化率 | 是否在0%-100%之间?与仪表盘数据是否匹配? |
| 增长率 | 月环比增长50%以上是否合理,还是存在数据问题? |
| 平均值 | 结合已知的分布情况,该平均值是否合理? |
| 百分比 | 细分群体的百分比之和是否约为100%? |
Cross-Validation Techniques
交叉验证技巧
- Calculate the same metric two different ways and verify they match
- Spot-check individual records -- pick a few specific entities and trace their data manually
- Compare to known benchmarks -- match against published dashboards, finance reports, or prior analyses
- Reverse engineer -- if total revenue is X, does per-user revenue times user count approximately equal X?
- Boundary checks -- what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?
- 用两种不同方法计算同一指标,验证结果是否一致
- 抽查个别记录——选取几个特定实体,手动追踪其数据
- 与已知基准对比——与已发布的仪表盘、财务报告或往期分析结果匹配
- 反向推导——若总收入为X,用户平均收入乘以用户数是否约等于X?
- 边界检查——过滤到单日、单个用户或单个分类时,结果是否合理?
Red Flags That Warrant Investigation
需要深入调查的危险信号
- Any metric that changed by more than 50% period-over-period without an obvious cause
- Counts or sums that are exact round numbers (suggests a filter or default value issue)
- Rates exactly at 0% or 100% (may indicate incomplete data)
- Results that perfectly confirm the hypothesis (reality is usually messier)
- Identical values across time periods or segments (suggests the query is ignoring a dimension)
- 任何指标环比变化超过50%且无明显原因
- 统计数或求和数为整数(可能表明过滤条件或默认值存在问题)
- 比率恰好为0%或100%(可能表明数据不完整)
- 结果完全符合假设(现实通常更复杂)
- 不同时间段或细分群体的结果完全相同(可能表明查询忽略了某个维度)
Documentation Standards for Reproducibility
可复现性文档标准
Analysis Documentation Template
分析文档模板
Every non-trivial analysis should include:
markdown
undefined所有非简单分析都应包含以下内容:
markdown
undefinedAnalysis: [Title]
Analysis: [Title]
Question
Question
[The specific question being answered]
[The specific question being answered]
Data Sources
Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])
Definitions
Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]
Methodology
Methodology
- [Step 1 of the analysis approach]
- [Step 2]
- [Step 3]
- [Step 1 of the analysis approach]
- [Step 2]
- [Step 3]
Assumptions and Limitations
Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]
Key Findings
Key Findings
- [Finding 1 with supporting evidence]
- [Finding 2 with supporting evidence]
- [Finding 1 with supporting evidence]
- [Finding 2 with supporting evidence]
SQL Queries
SQL Queries
[All queries used, with comments]
[All queries used, with comments]
Caveats
Caveats
- [Things the reader should know before acting on this]
undefined- [Things the reader should know before acting on this]
undefinedCode Documentation
代码文档
For any code (SQL, Python) that may be reused:
python
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%
Purpose:
Calculate monthly user retention cohorts based on first activity date.
Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout
Output:
Cohort retention matrix with cohort_month rows and months_since_signup columns.
Values are retention rates (0-100%).
"""对于任何可能被复用的代码(SQL、Python):
python
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%
Purpose:
Calculate monthly user retention cohorts based on first activity date.
Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout
Output:
Cohort retention matrix with cohort_month rows and months_since_signup columns.
Values are retention rates (0-100%).
"""Version Control for Analyses
分析版本控制
- Save queries and code in version control (git) or a shared docs system
- Note the date of the data snapshot used
- If an analysis is re-run with updated data, document what changed and why
- Link to prior versions of recurring analyses for trend comparison
- 将查询语句和代码保存到版本控制系统(git)或共享文档系统中
- 记录所使用的数据快照日期
- 若使用更新的数据重新运行分析,需记录变更内容及原因
- 为定期分析的往期版本添加链接,以便进行趋势对比