statistical-analysis
Statistical Analysis
Apply statistical methods to understand data and validate findings.
Quick Start
```python
from scipy import stats
import numpy as np

# Descriptive statistics
data = np.array([1, 2, 3, 4, 5])
print(f"Mean: {np.mean(data)}")
print(f"Std: {np.std(data)}")  # population std (ddof=0); pass ddof=1 for the sample std

# Hypothesis testing
group1 = [23, 25, 27, 29, 31]
group2 = [20, 22, 24, 26, 28]
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"P-value: {p_value}")
```
Core Tests
T-Test (Compare Means)
```python
# Uses stats, data, group1, and group2 as defined in Quick Start

# One-sample: Compare to population mean
stats.ttest_1samp(data, 100)

# Two-sample: Compare two groups
stats.ttest_ind(group1, group2)

# Paired: Before/after comparison
# before and after are equal-length measurements taken on the same subjects
stats.ttest_rel(before, after)
```
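To make the paired comparison above concrete, here is a dependency-free sketch of the statistic a paired t-test is built on: a one-sample t statistic computed on the per-subject differences. The `before` and `after` values and the helper name `paired_t_statistic` are illustrative assumptions, not part of the original guide, and the sketch stops at the statistic rather than computing a p-value.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Differences follow the before - after convention
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # sample std (n - 1 denominator)
    return mean_diff / se, n - 1  # statistic and degrees of freedom

# Hypothetical measurements on the same five subjects
before = [10, 12, 14, 16, 18]
after = [11, 14, 17, 18, 20]
t_stat, dof = paired_t_statistic(before, after)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Because the test works on differences, between-subject variation cancels out, which is why a paired design can detect a shift that a two-sample test on the same numbers would miss.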
Chi-Square (Categorical Data)
```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows and columns are the categories; each cell is an observed count
observed = np.array([[10, 20], [15, 25]])
chi2, p_value, dof, expected = chi2_contingency(observed)
```
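The `expected` table returned above is what the test compares the observed counts against. A minimal standard-library sketch of that calculation follows; the helper names are hypothetical. It computes the uncorrected Pearson statistic, whereas `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so the library's statistic for this 2x2 table will be smaller.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Pearson chi-square statistic (no continuity correction) and degrees of freedom."""
    expected = expected_counts(observed)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof, expected

observed = [[10, 20], [15, 25]]
chi2, dof, expected = chi_square_statistic(observed)
print(f"chi2 = {chi2:.4f}, dof = {dof}")
```

A common rule of thumb is to check that every expected count is at least 5 before trusting the chi-square approximation.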
ANOVA (Multiple Groups)
```python
# One-way ANOVA: tests whether the means of three or more groups are all equal
f_stat, p_value = stats.f_oneway(group1, group2, group3)
```
Confidence Intervals
```python
import numpy as np
from scipy import stats

confidence_level = 0.95
mean = np.mean(data)
se = stats.sem(data)  # standard error of the mean (uses ddof=1)
# t-based interval with n - 1 degrees of freedom
ci = stats.t.interval(confidence_level, len(data) - 1, loc=mean, scale=se)
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```
Correlation
```python
# Pearson (linear)
r, p_value = stats.pearsonr(x, y)

# Spearman (rank-based)
rho, p_value = stats.spearmanr(x, y)
```
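The difference between the two coefficients is easiest to see on data that is monotonic but not linear. The sketch below uses only the standard library and hypothetical helper names. It treats Spearman's rho as Pearson's r computed on ranks, and its simple ranking assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Illustrative data: y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Here Spearman reports a perfect monotonic relationship while Pearson falls short of 1, because the relationship is not a straight line.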
Distributions
```python
# Normal
x = np.linspace(-3, 3, 100)
pdf = stats.norm.pdf(x, loc=0, scale=1)

# Sampling
samples = np.random.normal(0, 1, 1000)

# Test normality
stat, p_value = stats.shapiro(data)
```
A/B Testing Framework
```python
import numpy as np
from scipy import stats

def ab_test(control, treatment, alpha=0.05):
    """
    Run an A/B test with a two-sample t-test.

    Returns a dict with 'significant' (bool), 'p_value' (float),
    and 'improvement' (relative change in means, formatted as a percentage).
    """
    t_stat, p_value = stats.ttest_ind(control, treatment)
    significant = p_value < alpha
    improvement = (np.mean(treatment) - np.mean(control)) / np.mean(control) * 100
    return {
        'significant': significant,
        'p_value': p_value,
        'improvement': f"{improvement:.2f}%"
    }
```
Interpretation
P-value < 0.05: Reject null hypothesis (statistically significant)
P-value >= 0.05: Fail to reject null (not significant)
Common Pitfalls
- Multiple testing without correction
- Small sample sizes
- Ignoring assumptions (normality, independence)
- Confusing correlation with causation
- p-hacking (searching for significance)
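The first pitfall is easy to quantify. If each test has a 5% false-positive rate and the tests are independent, the chance of at least one false positive grows quickly with the number of tests. The following sketch shows the arithmetic; the function name is hypothetical and the independence assumption is a simplification.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20):
    print(f"{m:>2} tests: {family_wise_error_rate(0.05, m):.3f}")
```

At 20 uncorrected tests the family-wise rate is roughly 64%, which is why a correction is needed whenever several hypotheses are tested together.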
Troubleshooting
Common Issues
**Problem: Non-normal data for t-test**
```python
# Check normality of each group first
_, p1 = stats.shapiro(group1)
_, p2 = stats.shapiro(group2)
if p1 < 0.05 or p2 < 0.05:
    # Use non-parametric alternative
    stat, p = stats.mannwhitneyu(group1, group2)  # Instead of ttest_ind
```
**Problem: Multiple comparisons inflating false positives**
```python
from statsmodels.stats.multitest import multipletests

# Apply Bonferroni correction
p_values = [0.01, 0.03, 0.04, 0.02, 0.06]
rejected, p_adjusted, _, _ = multipletests(p_values, method='bonferroni')
```
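For intuition about what the Bonferroni method does, here is a small standard-library sketch: each p-value is multiplied by the number of tests and capped at 1, then compared against the original alpha. The function name and the p-values are hypothetical, chosen to avoid values that sit exactly on the threshold.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni adjustment: scale each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    rejected = [p_adj <= alpha for p_adj in adjusted]
    return rejected, adjusted

# Hypothetical raw p-values from five tests
raw = [0.004, 0.03, 0.04, 0.02, 0.6]
rejected, adjusted = bonferroni(raw)
for p, p_adj, rej in zip(raw, adjusted, rejected):
    print(f"raw={p:<6} adjusted={p_adj:.3f} reject={rej}")
```

Bonferroni is conservative. With many tests, less strict procedures such as Holm or Benjamini-Hochberg (both available through the `method` argument of `multipletests`) usually retain more power.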
**Problem: Underpowered study (sample too small)**
```python
from statsmodels.stats.power import TTestIndPower

# Calculate required sample size
power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(
    effect_size=0.5,  # Medium effect (Cohen's d)
    power=0.8,        # 80% power
    alpha=0.05        # 5% significance
)
print(f"Required n per group: {sample_size:.0f}")
```
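As a sanity check on the power calculation, the required sample size can be approximated by hand with the normal distribution. This sketch is an approximation under stated assumptions (two-sided test, equal group sizes, normal rather than t quantiles), and the function name is hypothetical. The t-based result from `solve_power` comes out slightly larger because it accounts for estimating the variance.

```python
import math
from statistics import NormalDist

def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

n = approx_n_per_group(0.5)
print(f"Approximate n per group: {math.ceil(n)}")
```

Since the required n scales with 1 / effect_size squared, halving the expected effect size roughly quadruples the sample needed.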
**Problem: Heterogeneous variances**
```python
# Check with Levene's test
stat, p = stats.levene(group1, group2)
if p < 0.05:
    # Use Welch's t-test. scipy's ttest_ind defaults to equal_var=True
    # (Student's t-test), so Welch's must be requested explicitly.
    t, p = stats.ttest_ind(group1, group2, equal_var=False)
```
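To show what changes when `equal_var=False` is used, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the standard library. The function name is hypothetical, and the data reuses the Quick Start groups. The sketch stops short of a p-value, which needs the t distribution.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    se1, se2 = v1 / n1, v2 / n2  # squared standard error contributed by each group
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se1 + se2)
    dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, dof

group1 = [23, 25, 27, 29, 31]
group2 = [20, 22, 24, 26, 28]
t, dof = welch_t(group1, group2)
print(f"t = {t:.3f}, df = {dof:.2f}")
```

When both groups have the same size and variance, as here, the degrees of freedom match the pooled test (n1 + n2 - 2). As the variances diverge the degrees of freedom shrink, which makes the test more conservative.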
**Problem: Outliers affecting results**
```python
from scipy.stats import zscore

# Detect outliers (|z| > 3)
z_scores = np.abs(zscore(data))
clean_data = data[z_scores < 3]

# Or use robust statistics
median = np.median(data)
mad = np.median(np.abs(data - median))  # Median Absolute Deviation
```
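One caveat with the |z| > 3 rule is that it cannot fire on very small samples: with a population standard deviation, no point can lie more than sqrt(n - 1) standard deviations from the mean. The sketch below, with hypothetical data and helper names, contrasts it with the IQR rule using only the standard library. Quartile definitions vary between libraries, so fences may differ slightly from other tools.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def max_abs_z(values):
    """Largest absolute z-score, using the population std (ddof=0)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return max(abs(v - mean) / sd for v in values)

# Hypothetical readings with one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 95]
print("IQR rule flags:", iqr_outliers(data))
print(f"Largest |z|: {max_abs_z(data):.2f}")
```

With eight points the extreme value inflates the standard deviation enough that its z-score stays below 3, while the IQR rule still flags it.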
Debug Checklist
- Check sample size adequacy (power analysis)
- Test normality assumption (Shapiro-Wilk)
- Test homogeneity of variance (Levene's)
- Check for outliers (z-scores, IQR)
- Apply multiple testing correction if needed
- Report effect sizes, not just p-values
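For the last checklist item, one widely used effect size for two groups is Cohen's d: the difference in means divided by the pooled standard deviation. A minimal standard-library sketch follows, using the Quick Start groups; the function name is hypothetical.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

group1 = [23, 25, 27, 29, 31]
group2 = [20, 22, 24, 26, 28]
d = cohens_d(group1, group2)
print(f"Cohen's d: {d:.2f}")
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large) this is a large effect, yet five observations per group may still be too few for a t-test to detect it reliably, which is why effect size and p-value are worth reporting together.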