statistics-math
Statistics & Mathematics
Mathematical foundations for data science, machine learning, and statistical analysis.
Quick Start
```python
import numpy as np
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
```
Descriptive Statistics
```python
data = np.array([23, 45, 67, 32, 45, 67, 89, 12, 34, 56])
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Std Dev: {np.std(data, ddof=1):.2f}")
print(f"IQR: {np.percentile(data, 75) - np.percentile(data, 25):.2f}")
```
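A point estimate is more useful with an interval around it. As a complement to the summary statistics above, a minimal sketch of a 95% confidence interval for the mean using `scipy.stats.t.interval` (same sample data):

```python
import numpy as np
import scipy.stats as stats

data = np.array([23, 45, 67, 32, 45, 67, 89, 12, 34, 56])

# 95% CI for the mean: t-distribution with n-1 degrees of freedom
n = len(data)
se = stats.sem(data)  # standard error = s / sqrt(n)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=np.mean(data), scale=se)
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```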
Hypothesis Testing
```python
sample_a = [23, 45, 67, 32, 45]
sample_b = [56, 78, 45, 67, 89]
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"T-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject null hypothesis: significant difference")
else:
    print("Fail to reject null hypothesis")
```
Core Concepts
1. Probability Distributions
```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
```
Normal Distribution
```python
mu, sigma = 100, 15
normal_dist = stats.norm(loc=mu, scale=sigma)
x = np.linspace(50, 150, 100)

# PDF, CDF calculations
print(f"P(X < 85): {normal_dist.cdf(85):.4f}")
print(f"P(X > 115): {1 - normal_dist.cdf(115):.4f}")
print(f"95th percentile: {normal_dist.ppf(0.95):.2f}")
```
Binomial Distribution (discrete)

```python
n, p = 100, 0.3
binom_dist = stats.binom(n=n, p=p)
print(f"P(X = 30): {binom_dist.pmf(30):.4f}")
print(f"P(X <= 30): {binom_dist.cdf(30):.4f}")
```
Poisson Distribution (events per time)

```python
lambda_param = 5
poisson_dist = stats.poisson(mu=lambda_param)
print(f"P(X = 3): {poisson_dist.pmf(3):.4f}")
```
Central Limit Theorem demonstration

```python
population = np.random.exponential(scale=10, size=100000)
sample_means = [np.mean(np.random.choice(population, 30)) for _ in range(1000)]
print(f"Sample means are approximately normal: mean={np.mean(sample_means):.2f}")
```
2. Hypothesis Testing Framework
```python
from scipy import stats
import numpy as np


class HypothesisTest:
    """Framework for statistical hypothesis testing."""

    @staticmethod
    def two_sample_ttest(group_a, group_b, alpha=0.05):
        """Independent samples t-test."""
        t_stat, p_value = stats.ttest_ind(group_a, group_b)
        effect_size = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(
            (np.var(group_a) + np.var(group_b)) / 2
        )
        return {
            "t_statistic": t_stat,
            "p_value": p_value,
            "significant": p_value < alpha,
            "effect_size_cohens_d": effect_size
        }

    @staticmethod
    def chi_square_test(observed, expected=None, alpha=0.05):
        """Chi-square test for categorical data."""
        if expected is None:
            chi2, p_value, dof, expected = stats.chi2_contingency(observed)
        else:
            chi2, p_value = stats.chisquare(observed, expected)
            dof = len(observed) - 1
        return {
            "chi2_statistic": chi2,
            "p_value": p_value,
            "degrees_of_freedom": dof,
            "significant": p_value < alpha
        }

    @staticmethod
    def ab_test_proportion(conversions_a, total_a, conversions_b, total_b, alpha=0.05):
        """Two-proportion z-test for A/B testing."""
        p_a = conversions_a / total_a
        p_b = conversions_b / total_b
        p_pooled = (conversions_a + conversions_b) / (total_a + total_b)
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / total_a + 1 / total_b))
        z_stat = (p_a - p_b) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
        return {
            "conversion_a": p_a,
            "conversion_b": p_b,
            "lift": (p_b - p_a) / p_a * 100,
            "z_statistic": z_stat,
            "p_value": p_value,
            "significant": p_value < alpha
        }
```
Usage
```python
result = HypothesisTest.ab_test_proportion(
    conversions_a=120, total_a=1000,
    conversions_b=150, total_b=1000
)
print(f"Lift: {result['lift']:.1f}%, p-value: {result['p_value']:.4f}")
```
3. Linear Algebra Essentials
```python
import numpy as np
```

Matrix operations
```python
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Basic operations
print("Matrix multiplication:", A @ B)
print("Element-wise:", A * B)
print("Transpose:", A.T)
print("Inverse:", np.linalg.inv(A))
print("Determinant:", np.linalg.det(A))
```
Eigenvalues and eigenvectors (PCA foundation)

```python
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
```
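The label above calls eigendecomposition the foundation of PCA; here is a minimal sketch of that connection on synthetic 2-D data (all names illustrative): principal components are the eigenvectors of the covariance matrix, ranked by eigenvalue.

```python
import numpy as np

# Synthetic data with correlated features (illustrative)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])

# PCA: eigendecompose the covariance matrix of centered data
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# Rank components by explained variance, largest first
order = np.argsort(eigenvalues)[::-1]
explained = eigenvalues[order] / eigenvalues.sum()
print(f"Explained variance ratios: {explained}")

# Project onto the leading principal component
projected = centered @ eigenvectors[:, order[:1]]
print(f"Projected shape: {projected.shape}")
```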
Singular Value Decomposition (dimensionality reduction)

```python
U, S, Vt = np.linalg.svd(A)
print(f"Singular values: {S}")
```
Solving linear systems: Ax = b

```python
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(f"Solution: {x}")
```
Cosine similarity (NLP, recommendations)

```python
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

vec1 = np.array([1, 2, 3])
vec2 = np.array([4, 5, 6])
print(f"Cosine similarity: {cosine_similarity(vec1, vec2):.4f}")
```
4. Regression Analysis
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm
```

Multiple Linear Regression with statsmodels
```python
X = np.random.randn(100, 3)
y = 2 * X[:, 0] + 3 * X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()
print(model.summary())
print(f"R-squared: {model.rsquared:.4f}")
print(f"Coefficients: {model.params}")
print(f"P-values: {model.pvalues}")
```
Regularization comparison

```python
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1)
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: R²={r2_score(y_test, y_pred):.4f}, RMSE={np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
```
Tools & Technologies
| Tool | Purpose | Version (2025) |
|---|---|---|
| NumPy | Numerical computing | 1.26+ |
| SciPy | Scientific computing | 1.12+ |
| pandas | Data manipulation | 2.2+ |
| statsmodels | Statistical models | 0.14+ |
| scikit-learn | ML algorithms | 1.4+ |
Troubleshooting Guide
| Issue | Symptoms | Root Cause | Fix |
|---|---|---|---|
| Low p-value, small effect | Significant but meaningless | Large sample size | Check effect size |
| High variance | Unstable estimates | Small sample, outliers | More data, robust methods |
| Multicollinearity | Inflated coefficients | Correlated features | VIF check, remove features |
| Heteroscedasticity | Invalid inference | Non-constant variance | Weighted least squares |
Best Practices
```python
# ✅ DO: Check assumptions before testing
from scipy.stats import shapiro

stat, p = shapiro(data)
if p > 0.05:
    print("Data is approximately normal")

# ✅ DO: Use effect sizes, not just p-values
# ✅ DO: Correct for multiple comparisons (Bonferroni)
# ✅ DO: Report confidence intervals
# ❌ DON'T: p-hack by trying many tests
# ❌ DON'T: Confuse correlation with causation
# ❌ DON'T: Ignore sample size requirements
```
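The Bonferroni correction recommended above can be sketched with statsmodels' `multipletests` (the raw p-values are illustrative):

```python
from statsmodels.stats.multitest import multipletests

# Raw p-values from several tests run on the same data
p_values = [0.010, 0.020, 0.030, 0.400]

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f} -> adjusted p={p_adj:.3f}, significant: {sig}")
```

Note how only the smallest p-value survives correction at alpha = 0.05: after adjustment the threshold is effectively 0.05 / 4 = 0.0125 per test.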
Resources
- Khan Academy Statistics
- StatQuest with Josh Starmer
- "Introduction to Statistical Learning" (ISLR)
Skill Certification Checklist:
- Can calculate descriptive statistics
- Can perform hypothesis tests (t-test, chi-square)
- Can implement A/B testing
- Can perform regression analysis
- Can use matrix operations for ML