statistical-analysis

Statistical Analysis Skill

Descriptive statistics, trend analysis, outlier detection, hypothesis testing, and guidance on when to be cautious about statistical claims.

Descriptive Statistics Methodology

Central Tendency

Choose the right measure of center based on the data:
| Situation | Use | Why |
|---|---|---|
| Symmetric distribution, no outliers | Mean | Most efficient estimator |
| Skewed distribution | Median | Robust to outliers |
| Categorical or ordinal data | Mode | Only measure defined for unordered categories; ordinal data also has a median |
| Highly skewed with outliers (e.g., revenue per user) | Median + mean | Report both; the gap shows skew |
Always report mean and median together for business metrics. If they diverge significantly, the data is skewed and the mean alone is misleading.
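As a minimal sketch of why both numbers belong in the report, the snippet below computes the mean and median of a small, made-up revenue-per-user sample. The data and the variable names are illustrative assumptions, not values from any real dataset.

```python
from statistics import mean, median

# Hypothetical revenue-per-user sample: most users spend little, one spends a lot.
revenue_per_user = [5, 8, 10, 12, 15, 18, 20, 25, 30, 900]

avg = mean(revenue_per_user)
mid = median(revenue_per_user)

# A mean far above the median is a quick signal of right skew.
gap_ratio = avg / mid

print(f"mean={avg:.1f}, median={mid:.1f}, mean/median={gap_ratio:.1f}")
```

Here a single large value pulls the mean to several times the median, so quoting the mean alone would overstate what a typical user spends.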
根据数据类型选择合适的中心度量指标:
场景适用指标原因
对称分布、无异常值Mean(均值)最有效的估计量
偏态分布Median(中位数)对异常值鲁棒
分类或有序数据Mode(众数)非数值数据的唯一选择
高度偏态且含异常值(如单用户收入)Median + Mean(中位数+均值)同时报告两者,差值可体现偏态程度
针对业务指标,务必同时报告均值和中位数。若两者差异显著,说明数据存在偏态,仅用均值会产生误导。

Spread and Variability

  • Standard deviation: How far values typically fall from the mean. Use with normally distributed data.
  • Interquartile range (IQR): Distance from p25 to p75. Robust to outliers. Use with skewed data.
  • Coefficient of variation (CV): StdDev / Mean. Use to compare variability across metrics with different scales.
  • Range: Max minus min. Sensitive to outliers but gives a quick sense of data extent.
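The four measures above can be computed with the Python standard library alone. This is a sketch on invented data; `method="inclusive"` is one of several quartile conventions, chosen here as an assumption, and other tools may use a different default.

```python
from statistics import mean, quantiles, stdev

# Hypothetical metric with one extreme value (95).
values = [12, 15, 14, 10, 18, 20, 11, 13, 95]

sd = stdev(values)                                     # sample standard deviation
q1, q2, q3 = quantiles(values, n=4, method="inclusive")  # quartile cut points
iqr = q3 - q1                                          # spread of the middle 50%
cv = sd / mean(values)                                 # unitless; assumes a positive mean
value_range = max(values) - min(values)

print(f"sd={sd:.1f}, IQR={iqr:.1f}, CV={cv:.2f}, range={value_range}")
```

The single extreme value inflates the standard deviation and the range, while the IQR stays close to the spread of the bulk of the data.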

Percentiles for Business Context

Report key percentiles to tell a richer story than mean alone:
p1:   Bottom 1% (floor / minimum typical value)
p5:   Low end of normal range
p25:  First quartile
p50:  Median (typical user)
p75:  Third quartile
p90:  Top 10% / power users
p95:  High end of normal range
p99:  Top 1% / extreme users
Example narrative: "The median session duration is 4.2 minutes, but the top 10% of users spend over 22 minutes per session, pulling the mean up to 7.8 minutes."
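A percentile report like the one above might be assembled as follows. The session durations are invented for illustration, and with only 20 points the extreme percentiles (p1, p99) would not be meaningful, so the sketch reports a subset.

```python
from statistics import mean, quantiles

# Hypothetical session durations in minutes (right-skewed).
durations = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 15, 22, 40]

# n=100 returns the 99 cut points p1..p99; index k-1 holds the k-th percentile.
cuts = quantiles(durations, n=100, method="inclusive")
report = {f"p{k}": cuts[k - 1] for k in (25, 50, 75, 90, 95)}
report["mean"] = mean(durations)

for name, value in report.items():
    print(f"{name}: {value:.1f}")
```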

Describing Distributions

Characterize every numeric distribution you analyze:
  • Shape: Normal, right-skewed, left-skewed, bimodal, uniform, heavy-tailed
  • Center: Mean and median (and the gap between them)
  • Spread: Standard deviation or IQR
  • Outliers: How many and how extreme
  • Bounds: Is there a natural floor (zero) or ceiling (100%)?

Trend Analysis and Forecasting

Identifying Trends

Moving averages to smooth noise:
```python
# 7-day moving average (good for daily data with weekly seasonality)
df['ma_7d'] = df['metric'].rolling(window=7, min_periods=1).mean()

# 28-day moving average (smooths weekly AND monthly patterns)
df['ma_28d'] = df['metric'].rolling(window=28, min_periods=1).mean()
```

Period-over-period comparison:
  • Week-over-week (WoW): Compare to same day last week
  • Month-over-month (MoM): Compare to the same period in the prior month
  • Year-over-year (YoY): Gold standard for seasonal businesses
  • Same-day-last-year: Compare specific calendar day

Growth rates:
Simple growth: (current - previous) / previous
CAGR: (ending / beginning) ^ (1 / years) - 1
Log growth: ln(current / previous) -- better for volatile series
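The three growth formulas translate directly into small functions. The function names and the example figures below are illustrative assumptions.

```python
import math

def simple_growth(current, previous):
    """(current - previous) / previous"""
    return (current - previous) / previous

def cagr(ending, beginning, years):
    """Compound annual growth rate over `years` years."""
    return (ending / beginning) ** (1 / years) - 1

def log_growth(current, previous):
    """Natural-log growth; log growth rates add up across consecutive periods."""
    return math.log(current / previous)

# Hypothetical: a metric grows from 100 to 110 to 121 over two years.
total = simple_growth(121, 100)
annual = cagr(121, 100, years=2)
log_total = log_growth(110, 100) + log_growth(121, 110)

print(f"total={total:.2%}, CAGR={annual:.2%}, summed log growth={log_total:.4f}")
```

One reason log growth suits volatile series is visible here: the two yearly log growth rates sum exactly to the two-year log growth, which is not true of simple growth rates.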

Seasonality Detection

Check for periodic patterns:
  1. Plot the raw time series -- visual inspection first
  2. Compute day-of-week averages: is there a clear weekly pattern?
  3. Compute month-of-year averages: is there an annual cycle?
  4. When comparing periods, always use YoY or same-period comparisons to avoid conflating trend with seasonality
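Step 2 above (day-of-week averages) can be sketched without any third-party library. The series is synthetic, built with a deliberate weekend dip, and the "seasonal index" (weekday average divided by overall average) is one common way to express the pattern, used here as an assumption.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Synthetic daily metric: four weeks starting on a Monday, with weekend dips.
start = date(2024, 1, 1)  # a Monday
series = {}
for i in range(28):
    day = start + timedelta(days=i)
    series[day] = 60 if day.weekday() >= 5 else 100

# Group values by weekday number (0 = Monday ... 6 = Sunday).
by_weekday = defaultdict(list)
for day, value in series.items():
    by_weekday[day.weekday()].append(value)

weekday_avg = {wd: mean(vals) for wd, vals in sorted(by_weekday.items())}
overall = mean(series.values())
seasonal_index = {wd: avg / overall for wd, avg in weekday_avg.items()}

for wd, idx in seasonal_index.items():
    print(f"weekday {wd}: avg={weekday_avg[wd]}, index={idx:.2f}")
```

An index well below 1 on the same weekdays every week points to a weekly cycle; the same grouping by `day.month` gives the month-of-year check in step 3.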

Forecasting (Simple Methods)

For business analysts (not data scientists), use straightforward methods:
  • Naive forecast: Tomorrow = today. Use as a baseline.
  • Seasonal naive: Tomorrow = same day last week/year.
  • Linear trend: Fit a line to historical data. Only for clearly linear trends.
  • Moving average forecast: Use trailing average as the forecast.
Always communicate uncertainty. Provide a range, not a point estimate:
  • "We expect 10K-12K signups next month based on the 3-month trend"
  • NOT "We will get exactly 11,234 signups next month"
When to escalate to a data scientist: Non-linear trends, multiple seasonalities, external factors (marketing spend, holidays), or when forecast accuracy matters for resource allocation.
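The simple methods above fit in a few lines. The history is invented, and the range shown is a crude heuristic (naive forecast plus or minus two standard deviations of past day-over-day changes), not a formal prediction interval.

```python
from statistics import mean, stdev

# Hypothetical daily values for the last 14 days (oldest first).
history = [100, 104, 98, 110, 107, 103, 111, 109, 105, 112, 108, 115, 110, 113]

naive = history[-1]               # tomorrow = today
seasonal_naive = history[-7]      # tomorrow = same weekday last week
ma_forecast = mean(history[-7:])  # trailing 7-day average

# Past errors of the naive method are just the day-over-day changes.
naive_errors = [b - a for a, b in zip(history, history[1:])]
spread = 2 * stdev(naive_errors)
low, high = naive - spread, naive + spread

print(f"naive={naive}, seasonal naive={seasonal_naive}, MA={ma_forecast:.1f}")
print(f"rough range for tomorrow: {low:.0f} to {high:.0f}")
```

Reporting the range rather than a single number follows the guidance above; comparing any fancier method against `naive` shows whether the extra complexity earns its keep.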

Outlier and Anomaly Detection

Statistical Methods

Z-score method (for normally distributed data):
```python
z_scores = (df['value'] - df['value'].mean()) / df['value'].std()
outliers = df[abs(z_scores) > 3]  # More than 3 standard deviations
```
IQR method (robust to non-normal distributions):
```python
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
```
Percentile method (simplest):
```python
outliers = df[(df['value'] < df['value'].quantile(0.01)) |
              (df['value'] > df['value'].quantile(0.99))]
```

Handling Outliers

Do NOT automatically remove outliers. Instead:
  1. Investigate: Is this a data error, a genuine extreme value, or a different population?
  2. Data errors: Fix or remove (e.g., negative ages, timestamps in year 1970)
  3. Genuine extremes: Keep them but consider using robust statistics (median instead of mean)
  4. Different population: Segment them out for separate analysis (e.g., enterprise vs. SMB customers)
Report what you did: "We excluded 47 records (0.3%) with transaction amounts >$50K, which represent bulk enterprise orders analyzed separately."

Time Series Anomaly Detection

For detecting unusual values in a time series:
  1. Compute expected value (moving average or same-period-last-year)
  2. Compute deviation from expected
  3. Flag deviations beyond a threshold (typically 2-3 standard deviations of the residuals)
  4. Distinguish between point anomalies (single unusual value) and change points (sustained shift)
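Steps 1 to 3 above can be sketched as follows, using a trailing moving average as the expected value. The series, the 7-point window, and the 2-standard-deviation threshold are illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical daily series; index 9 is a deliberate spike.
series = [100, 102, 99, 101, 103, 100, 98, 102, 101, 160, 100, 99, 103, 101]
window = 7

# Steps 1-2: expected value = mean of the previous `window` points,
# residual = actual minus expected.
residuals = []
for i in range(window, len(series)):
    expected = mean(series[i - window:i])
    residuals.append((i, series[i] - expected))

# Step 3: flag residuals beyond 2 standard deviations of all residuals.
threshold = 2 * stdev([r for _, r in residuals])
anomalies = [i for i, r in residuals if abs(r) > threshold]

print(f"threshold={threshold:.1f}, anomalous indexes={anomalies}")
```

Note that a large spike inflates the residual standard deviation itself, so in a short series a stricter cutoff can hide the very point you are looking for. A single flagged index followed by a return to normal is a point anomaly; a run of consecutive residuals with the same sign would suggest a change point instead.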

Hypothesis Testing Basics

When to Use

Use hypothesis testing when you need to determine whether an observed difference is likely real or could be due to random chance. Common scenarios:
  • A/B test results: Is variant B actually better than A?
  • Before/after comparison: Did the product change actually move the metric?
  • Segment comparison: Do enterprise customers really have higher retention?

The Framework

  1. Null hypothesis (H0): There is no difference (the default assumption)
  2. Alternative hypothesis (H1): There is a difference
  3. Choose significance level (alpha): Typically 0.05 (a 5% chance of a false positive when H0 is true)
  4. Compute test statistic and p-value
  5. Interpret: If p < alpha, reject H0 (evidence of a real difference); otherwise fail to reject H0, which is not proof that no difference exists
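To make the five steps concrete, here is a sketch of a two-sided z-test for two conversion rates, written with the standard library only. The function name and the A/B counts are hypothetical; the pooled standard error is the textbook form for this test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under H0: no difference
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical A/B test: 500 of 10,000 convert on A, 580 of 10,000 on B.
alpha = 0.05
z, p_value = two_proportion_z_test(500, 10_000, 580, 10_000)
reject_h0 = p_value < alpha

print(f"z={z:.2f}, p={p_value:.4f}, reject H0: {reject_h0}")
```

The normal approximation used here assumes reasonably large counts in each group; with very few conversions an exact test would be the safer choice.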

Common Tests

| Scenario | Test | When to Use |
|---|---|---|
| Compare two group means | t-test (independent) | Normal data, two groups |
| Compare two group proportions | z-test for proportions | Conversion rates, binary outcomes |
| Compare paired measurements | Paired t-test | Before/after on same entities |
| Compare 3+ group means | ANOVA | Multiple segments or variants |
| Non-normal data, two groups | Mann-Whitney U test | Skewed metrics, ordinal data |
| Association between categories | Chi-squared test | Two categorical variables |

Practical Significance vs. Statistical Significance

Statistical significance means a difference this large would be unlikely if there were no true effect.
Practical significance means the difference is large enough to matter for business decisions.
A difference can be statistically significant but practically meaningless (common with large samples). Always report:
  • Effect size: How big is the difference? (e.g., "Variant B improved conversion by 0.3 percentage points")
  • Confidence interval: What's the range of plausible true effects?
  • Business impact: What does this translate to in revenue, users, or other business terms?
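The three items above can be reported together. The sketch below computes the effect size and a Wald confidence interval for a difference in conversion rates, then converts the interval into revenue. All counts, the traffic figure, and the revenue per conversion are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_in_proportions_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Effect size (rate_b - rate_a) with a Wald confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se

# Hypothetical large test: 5.0% vs 5.3% conversion, 1,000,000 users per arm.
diff, low, high = diff_in_proportions_ci(50_000, 1_000_000, 53_000, 1_000_000)

# Hypothetical business inputs for translating the effect into revenue.
annual_visitors = 2_000_000
revenue_per_conversion = 40.0
revenue_low = low * annual_visitors * revenue_per_conversion
revenue_high = high * annual_visitors * revenue_per_conversion

print(f"lift: {diff * 100:.2f} pp (95% CI {low * 100:.2f} to {high * 100:.2f} pp)")
print(f"annual revenue impact: ${revenue_low:,.0f} to ${revenue_high:,.0f}")
```

With a sample this large the interval excludes zero, so the result is statistically significant; whether the resulting revenue range justifies shipping the change is the practical-significance question.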

Sample Size Considerations

  • Small samples produce unreliable results, even with significant p-values
  • Rule of thumb for proportions: Need at least 30 events per group for basic reliability
  • For detecting small effects (e.g., 1% conversion rate change), you may need thousands of observations per group
  • If your sample is small, say so: "With only 200 observations per group, we have limited power to detect effects smaller than X%"
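To see why small effects need thousands of observations, the sketch below applies a standard normal-approximation formula for comparing two proportions. The function name, the 80% power default, and the baseline rates are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_baseline, p_variant, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    effect = p_variant - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical: detect a lift from 5% to 6% conversion, then from 5% to 5.5%.
n_one_point = sample_size_per_group(0.05, 0.06)
n_half_point = sample_size_per_group(0.05, 0.055)

print(f"5% -> 6%:   about {n_one_point:,} users per group")
print(f"5% -> 5.5%: about {n_half_point:,} users per group")
```

Because the required sample grows with the inverse square of the effect, halving the detectable lift roughly quadruples the users needed per group.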

When to Be Cautious About Statistical Claims

Correlation Is Not Causation

When you find a correlation, explicitly consider:
  • Reverse causation: Maybe B causes A, not A causes B
  • Confounding variables: Maybe C causes both A and B
  • Coincidence: With enough variables, spurious correlations are inevitable
What you can say: "Users who use feature X have 30% higher retention"
What you cannot say without more evidence: "Feature X causes 30% higher retention"

Multiple Comparisons Problem

When you test many hypotheses, some will be "significant" by chance:
  • Testing 20 metrics at alpha = 0.05 means about 1 will be falsely significant on average, even when no real effects exist
  • If you looked at many segments before finding one that's different, note that
  • Adjust for multiple comparisons with Bonferroni correction (divide alpha by number of tests) or report how many tests were run
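The arithmetic behind these points is short enough to check directly. The p-values below are made up, and the "at least one false positive" figure assumes the tests are independent.

```python
alpha = 0.05
num_tests = 20

# Expected number of false positives if no metric has a real effect.
expected_false_positives = alpha * num_tests

# Chance of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** num_tests

# Bonferroni correction: each test must clear alpha / num_tests.
bonferroni_alpha = alpha / num_tests

# Hypothetical p-values for 5 of the 20 metrics (the rest are larger).
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]
naive_hits = [p for p in p_values if p < alpha]
corrected_hits = [p for p in p_values if p < bonferroni_alpha]

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.2f}")
print(f"significant before correction: {naive_hits}, after: {corrected_hits}")
```

Bonferroni is conservative, so some real effects may fail the corrected threshold; reporting the number of tests run lets readers judge either way.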

Simpson's Paradox

A trend in aggregated data can reverse when data is segmented:
  • Always check whether the conclusion holds across key segments
  • Example: Overall conversion goes up, but conversion goes down in every segment -- because the mix shifted toward a higher-converting segment
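A small numeric illustration of the example above, using invented counts for two segments. Each segment's conversion rate falls between the two periods, yet the overall rate rises because traffic shifts toward the higher-converting segment.

```python
# Hypothetical (conversions, visitors) per segment in each period.
before = {"enterprise": (20, 100), "smb": (45, 900)}
after = {"enterprise": (90, 500), "smb": (20, 500)}

def rate(conversions, visitors):
    return conversions / visitors

def overall_rate(period):
    total_conversions = sum(c for c, _ in period.values())
    total_visitors = sum(v for _, v in period.values())
    return total_conversions / total_visitors

for segment in before:
    print(f"{segment}: {rate(*before[segment]):.1%} -> {rate(*after[segment]):.1%}")
print(f"overall: {overall_rate(before):.1%} -> {overall_rate(after):.1%}")
```

Checking the per-segment rates alongside the segment mix is what reveals that the overall improvement comes from the mix shift rather than from better conversion.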

Survivorship Bias

You can only analyze entities that "survived" to be in your dataset:
  • Analyzing active users ignores those who churned
  • Analyzing successful companies ignores those that failed
  • Always ask: "Who is missing from this dataset, and would their inclusion change the conclusion?"

Ecological Fallacy

Aggregate trends may not apply to individuals:
  • "Countries with higher X have higher Y" does NOT mean "individuals with higher X have higher Y"
  • Be careful about applying group-level findings to individual cases

Anchoring on Specific Numbers

Be wary of false precision:
  • "Churn will be 4.73% next quarter" implies more certainty than is warranted
  • Prefer ranges: "We expect churn between 4-6% based on historical patterns"
  • Round appropriately: "About 5%" is often more honest than "4.73%"