Statistical Analyzer

Guided statistical analysis with hypothesis testing, regression, ANOVA, and plain-English results.

Features

  • Hypothesis Testing: t-tests, chi-square, proportion tests
  • Regression Analysis: Linear, polynomial, multiple regression
  • ANOVA: One-way, two-way ANOVA with post-hoc tests
  • Distribution Analysis: Normality tests, Q-Q plots
  • Correlation Analysis: Pearson, Spearman with significance
  • Plain-English Results: Interpret statistical outputs
  • Visualizations: Regression plots, residual analysis, box plots
  • Report Generation: PDF/HTML reports with interpretations

Quick Start

```python
from statistical_analyzer import StatisticalAnalyzer

analyzer = StatisticalAnalyzer()

# T-test
analyzer.load_data(df, group_col='treatment', value_col='score')
results = analyzer.t_test(group1='control', group2='experimental')
print(results['interpretation'])

# Regression
analyzer.load_data(df)
results = analyzer.linear_regression(x='age', y='income')
print(f"R²: {results['r_squared']}")
analyzer.plot_regression('regression.png')
```
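
The Quick Start snippets assume a pandas DataFrame named `df` already exists. As a minimal sketch, one with the column names used there could be built like this; the values are made up purely for illustration.

```python
import pandas as pd

# Illustrative data only: eight rows, two treatment groups.
df = pd.DataFrame({
    "treatment": ["control"] * 4 + ["experimental"] * 4,
    "score": [71, 68, 74, 70, 78, 75, 80, 77],
    "age": [25, 32, 41, 29, 36, 45, 52, 38],
    "income": [38000, 45000, 61000, 42000, 52000, 66000, 71000, 55000],
})

# Quick sanity check of the group means before running any test.
group_means = df.groupby("treatment")["score"].mean()
print(group_means)
```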

CLI Usage

```bash
# T-test
python statistical_analyzer.py --data data.csv --test t-test --group treatment --value score --output results.html

# ANOVA
python statistical_analyzer.py --data data.csv --test anova --group category --value score --output results.pdf

# Regression
python statistical_analyzer.py --data data.csv --test regression --x age --y income --output report.pdf

# Correlation matrix
python statistical_analyzer.py --data data.csv --test correlation --output correlation.png
```

API Reference

StatisticalAnalyzer Class

```python
class StatisticalAnalyzer:
    def __init__(self)

    # Data Loading
    def load_data(self, data, **kwargs) -> 'StatisticalAnalyzer'
    def load_csv(self, filepath, **kwargs) -> 'StatisticalAnalyzer'

    # Hypothesis Tests
    def t_test(self, group1, group2, paired=False, alternative='two-sided') -> Dict
    def one_sample_t_test(self, column, expected_mean, alternative='two-sided') -> Dict
    def anova(self, groups, value_col) -> Dict
    def chi_square(self, observed, expected=None) -> Dict
    def proportion_test(self, successes, total, expected_prop=0.5) -> Dict

    # Regression
    def linear_regression(self, x, y) -> Dict
    def polynomial_regression(self, x, y, degree=2) -> Dict
    def multiple_regression(self, predictors: List[str], target: str) -> Dict

    # Correlation
    def correlation(self, method='pearson') -> pd.DataFrame  # Correlation matrix
    def correlation_test(self, var1, var2, method='pearson') -> Dict

    # Distribution Tests
    def normality_test(self, column, method='shapiro') -> Dict
    def qq_plot(self, column, output=None) -> str

    # Visualization
    def plot_regression(self, output, x=None, y=None) -> str
    def plot_residuals(self, output) -> str
    def plot_distribution(self, column, output) -> str
    def plot_boxplot(self, groups, value_col, output) -> str

    # Reporting
    def generate_report(self, output, format='pdf') -> str
    def summary(self) -> str
```

Tests

T-Test

Compare means between two groups:
```python
analyzer.load_csv('data.csv')

# Independent samples
results = analyzer.t_test(
    group1='control',
    group2='treatment',
    paired=False
)

# Results
print(results)
# {
#     'statistic': -2.45,
#     'p_value': 0.018,
#     'mean_diff': -5.2,
#     'ci': (-9.5, -0.9),
#     'interpretation': 'The difference is statistically significant (p=0.018)...'
# }

# Paired samples (before/after)
results = analyzer.t_test(
    group1='before',
    group2='after',
    paired=True
)
```
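
The analyzer's internals are not shown in this document, so the following is only a sketch of how a comparable result dictionary could be assembled with SciPy. The helper name `independent_t_test`, the synthetic data, and the use of Welch's correction are assumptions made for illustration, not the library's documented behavior.

```python
import numpy as np
from scipy import stats


def independent_t_test(a, b, alpha=0.05):
    """Welch's t-test on two samples, returned as a plain dict (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # equal_var=False selects Welch's test, which does not assume equal variances.
    res = stats.ttest_ind(a, b, equal_var=False)
    if res.pvalue < alpha:
        verdict = "statistically significant"
    else:
        verdict = "not statistically significant"
    return {
        "statistic": float(res.statistic),
        "p_value": float(res.pvalue),
        "mean_diff": float(a.mean() - b.mean()),
        "interpretation": f"The difference is {verdict} (p={res.pvalue:.3f})",
    }


# Synthetic example data; the group means are chosen arbitrarily.
rng = np.random.default_rng(0)
control = rng.normal(loc=70, scale=10, size=30)
treatment = rng.normal(loc=75, scale=10, size=30)

results = independent_t_test(control, treatment)
print(results["interpretation"])
```

For before/after measurements on the same subjects, `scipy.stats.ttest_rel` is the paired counterpart of `ttest_ind`.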

ANOVA

Compare means across multiple groups:
```python
results = analyzer.anova(
    groups=['control', 'treatment_a', 'treatment_b'],
    value_col='score'
)

# Results include post-hoc tests
print(results['interpretation'])
# "There is a statistically significant difference between groups (p<0.001).
#  Post-hoc tests show treatment_a differs from control (p=0.003)..."
```
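
To make the omnibus test and the post-hoc step concrete, here is a hedged sketch using SciPy directly. Which post-hoc procedure the analyzer applies is not stated in this document; Tukey's HSD is assumed here, and the data are synthetic.

```python
import numpy as np
from scipy import stats

# Synthetic scores for three groups (illustrative values only).
rng = np.random.default_rng(1)
groups = {
    "control": rng.normal(50, 5, 25),
    "treatment_a": rng.normal(55, 5, 25),
    "treatment_b": rng.normal(51, 5, 25),
}

# Omnibus one-way ANOVA: tests whether all group means are equal.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey's HSD reports an adjusted p-value for every pair of groups.
tukey = stats.tukey_hsd(*groups.values())
names = list(groups)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: p = {tukey.pvalue[i, j]:.4f}")
```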

Regression Analysis

```python
# Simple linear regression
results = analyzer.linear_regression(x='hours_studied', y='exam_score')
print(f"R² = {results['r_squared']:.3f}")
print(f"Equation: y = {results['slope']:.2f}x + {results['intercept']:.2f}")
print(f"p-value: {results['p_value']:.4f}")

# Polynomial regression
results = analyzer.polynomial_regression(x='age', y='salary', degree=2)

# Multiple regression
results = analyzer.multiple_regression(
    predictors=['age', 'experience', 'education'],
    target='salary'
)
```
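
The keys printed above (`r_squared`, `slope`, `intercept`, `p_value`) correspond to quantities that `scipy.stats.linregress` returns. The sketch below shows that mapping on synthetic data. It is an assumption that the analyzer computes them this way, and `np.polyfit` stands in for the polynomial case.

```python
import numpy as np
from scipy import stats

# Synthetic data with a known linear trend plus noise (illustrative only).
rng = np.random.default_rng(2)
hours_studied = rng.uniform(0, 10, 40)
exam_score = 50 + 4 * hours_studied + rng.normal(0, 5, 40)

# Ordinary least squares fit of exam_score on hours_studied.
fit = stats.linregress(hours_studied, exam_score)
results = {
    "slope": float(fit.slope),
    "intercept": float(fit.intercept),
    "r_squared": float(fit.rvalue ** 2),  # R² is the squared correlation
    "p_value": float(fit.pvalue),         # tests the null hypothesis slope == 0
}
print(f"R² = {results['r_squared']:.3f}")
print(f"Equation: y = {results['slope']:.2f}x + {results['intercept']:.2f}")

# Degree-2 polynomial fit; coefficients are ordered highest power first.
coeffs = np.polyfit(hours_studied, exam_score, deg=2)
print(coeffs)
```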

Correlation Analysis

```python
# Full correlation matrix
corr_matrix = analyzer.correlation(method='pearson')
print(corr_matrix)

# Test specific correlation
results = analyzer.correlation_test('height', 'weight', method='pearson')
print(results['interpretation'])
# "There is a strong positive correlation (r=0.82, p<0.001)"
```
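
As a rough illustration of the two calls above, a correlation matrix and a single-pair significance test can be reproduced with pandas and SciPy. The data are synthetic, and it is an assumption that the analyzer wraps these particular functions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, positively related measurements (illustrative only).
rng = np.random.default_rng(3)
height = rng.normal(170, 10, 50)
weight = 0.9 * height - 85 + rng.normal(0, 5, 50)
df = pd.DataFrame({"height": height, "weight": weight})

# Pairwise correlation matrix over all numeric columns.
corr_matrix = df.corr(method="pearson")
print(corr_matrix)

# Correlation coefficient and two-sided p-value for one pair of variables.
r, p = stats.pearsonr(df["height"], df["weight"])
print(f"r = {r:.2f}, p = {p:.4g}")
```

Passing `method="spearman"` to `DataFrame.corr`, or calling `scipy.stats.spearmanr`, gives the rank-based equivalent.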

Distribution Tests

```python
# Test normality
results = analyzer.normality_test('scores', method='shapiro')
# Returns: {'statistic': 0.98, 'p_value': 0.35,
#           'interpretation': 'Data appears normally distributed (p=0.35)'}

# Q-Q plot
analyzer.qq_plot('scores', output='qq_plot.png')
```
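
For readers who want to see what sits behind a Shapiro-Wilk result and a Q-Q plot, here is a small sketch with SciPy on synthetic scores. The 0.05 cutoff and the wording of the verdict are assumptions, and no figure is drawn; `probplot` only returns the points a Q-Q plot would show.

```python
import numpy as np
from scipy import stats

# Synthetic, roughly normal scores (illustrative only).
rng = np.random.default_rng(4)
scores = rng.normal(75, 8, 60)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
statistic, p_value = stats.shapiro(scores)
if p_value > 0.05:
    verdict = "appears normally distributed"
else:
    verdict = "deviates from normality"
print(f"Data {verdict} (W={statistic:.3f}, p={p_value:.3f})")

# Q-Q plot coordinates: theoretical normal quantiles vs. the ordered sample values.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(scores, dist="norm")
print(f"Q-Q line fit: r = {r:.3f}")
```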

Interpretation Guide

The analyzer provides plain-English interpretations:

Significance Levels

  • p < 0.001: "Highly significant"
  • p < 0.01: "Very significant"
  • p < 0.05: "Statistically significant"
  • p ≥ 0.05: "Not statistically significant"
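
The thresholds above amount to a simple lookup. A minimal sketch follows; the function name `significance_label` is hypothetical and not part of the documented API.

```python
def significance_label(p_value: float) -> str:
    """Map a p-value to the wording used in the list above (hypothetical helper)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Check the strictest threshold first so each p-value gets one label.
    if p_value < 0.001:
        return "Highly significant"
    if p_value < 0.01:
        return "Very significant"
    if p_value < 0.05:
        return "Statistically significant"
    return "Not statistically significant"


for p in (0.0005, 0.005, 0.018, 0.35):
    print(p, significance_label(p))
```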

Effect Sizes

  • Cohen's d: Small (0.2), Medium (0.5), Large (0.8)
  • R²: Weak (<0.3), Moderate (0.3-0.7), Strong (>0.7)
  • Correlation: Weak (<0.3), Moderate (0.3-0.7), Strong (>0.7)
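
Cohen's d is the difference in group means divided by the pooled standard deviation. The sketch below computes it with the standard library and applies the benchmarks listed above. Treating 0.2, 0.5, and 0.8 as lower bounds, and the extra "Negligible" label below 0.2, are assumptions; both helper names are hypothetical.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def effect_size_label(d):
    """Label |d| against the 0.2 / 0.5 / 0.8 benchmarks (assumed to be lower bounds)."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "Large"
    if magnitude >= 0.5:
        return "Medium"
    if magnitude >= 0.2:
        return "Small"
    return "Negligible"  # label not in the list above; added for completeness


group_a = [2, 4, 6, 8, 10]
group_b = [1, 3, 5, 7, 9]
d = cohens_d(group_a, group_b)
print(f"d = {d:.3f} ({effect_size_label(d)})")
```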

Visualizations

Regression Plot

```python
analyzer.linear_regression(x='age', y='income')
analyzer.plot_regression('regression.png')
# Creates scatter plot with regression line and confidence interval
```

Residual Plot

```python
analyzer.plot_residuals('residuals.png')
# Checks regression assumptions (homoscedasticity)
```

Box Plot

```python
analyzer.plot_boxplot(
    groups=['control', 'treatment_a', 'treatment_b'],
    value_col='score',
    output='boxplot.png'
)
```

Distribution Plot

```python
analyzer.plot_distribution('scores', 'distribution.png')
# Histogram with normal curve overlay
```

Reports

Generate comprehensive reports:
```python
analyzer.load_csv('data.csv')
analyzer.t_test(group1='control', group2='treatment')
analyzer.linear_regression(x='hours', y='score')

# PDF report with all analyses
analyzer.generate_report('analysis_report.pdf', format='pdf')

# HTML report
analyzer.generate_report('analysis_report.html', format='html')
```

Reports include:

  • Summary statistics
  • Test results with interpretations
  • Visualizations
  • Assumption checks
  • Recommendations

Assumptions Checking

Automatic assumptions validation:
```python
# T-test checks:
# - Normality (Shapiro-Wilk)
# - Equal variances (Levene's test)
# Warnings if assumptions violated

# ANOVA checks:
# - Normality per group
# - Homogeneity of variances
# Suggests non-parametric alternatives

# Regression checks:
# - Linearity
# - Homoscedasticity
# - Normality of residuals
# - Independence (Durbin-Watson)
```
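
How the analyzer performs these checks internally is not documented here. As a hedged sketch, the named tests can be run by hand with SciPy and NumPy as shown below. The 0.05 cutoff, the message wording, and the synthetic data are assumptions, and the Durbin-Watson statistic is computed from its textbook definition (statsmodels also ships a `durbin_watson` function).

```python
import numpy as np
from scipy import stats

# Synthetic two-group data (illustrative only).
rng = np.random.default_rng(5)
control = rng.normal(70, 10, 30)
treatment = rng.normal(75, 10, 30)

issues = []

# Normality per group (Shapiro-Wilk).
for name, sample in (("control", control), ("treatment", treatment)):
    _, p_norm = stats.shapiro(sample)
    if p_norm < 0.05:
        issues.append(f"{name}: normality questionable (p={p_norm:.3f})")

# Equal variances (Levene's test).
_, p_var = stats.levene(control, treatment)
if p_var < 0.05:
    issues.append(f"unequal variances (p={p_var:.3f}); consider Welch's t-test")

# Independence of regression residuals (Durbin-Watson).
x = np.arange(30, dtype=float)
y = 2.0 * x + rng.normal(0, 1, 30)
fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)
# DW = sum of squared successive differences / sum of squared residuals.
dw = float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))

print(issues or "no assumption warnings")
print(f"Durbin-Watson = {dw:.2f}")
```

A Durbin-Watson value near 2 suggests little first-order autocorrelation; values toward 0 or 4 point to positive or negative autocorrelation respectively.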

Dependencies

  • scipy>=1.10.0
  • statsmodels>=0.14.0
  • pandas>=2.0.0
  • numpy>=1.24.0
  • matplotlib>=3.7.0
  • seaborn>=0.12.0
  • reportlab>=4.0.0