data-scientist

Data Scientist

Expert-level data science for business impact.

Core Competencies

  • Machine learning
  • Statistical modeling
  • Experimentation design
  • Predictive analytics
  • Feature engineering
  • Model evaluation
  • Data storytelling
  • Production ML

Machine Learning Workflow

PROBLEM DEFINITION → DATA → FEATURES → MODEL → EVALUATION → DEPLOYMENT

1. Problem Definition
   - Business objective
   - Success metrics
   - Constraints

2. Data Collection
   - Data sources
   - Data quality
   - Sample size

3. Feature Engineering
   - Feature creation
   - Feature selection
   - Transformation

4. Model Development
   - Algorithm selection
   - Training
   - Tuning

5. Evaluation
   - Metrics
   - Validation
   - Business impact

6. Deployment
   - Production pipeline
   - Monitoring
   - Iteration
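The six steps above can be sketched end to end. The following is a minimal illustration using scikit-learn on synthetic data; the dataset, scaler, and model are placeholders standing in for real problem-specific choices, not a prescribed stack:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1-2. Problem definition + data collection (synthetic stand-in for real data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3-4. Features + model: preprocessing folded into one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 5. Evaluation on held-out data
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")

# 6. Deployment: persist the fitted pipeline for a serving job, e.g.
# joblib.dump(model, "model.joblib")
```

Keeping preprocessing inside the pipeline means the exact same transforms run at training and serving time, which is the main point of step 6's "production pipeline".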

Model Selection

Algorithm Comparison

| Algorithm | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| Linear Regression | Continuous prediction | Interpretable, fast | Linear relationships only |
| Logistic Regression | Binary classification | Interpretable, probabilistic | Linear boundaries |
| Random Forest | Classification/Regression | Handles non-linearity | Less interpretable |
| XGBoost | Classification/Regression | High accuracy | Overfitting risk |
| Neural Networks | Complex patterns | Flexible | Requires lots of data |

Model Selection Framework

```python
def select_model(problem_type, data_size, interpretability_need, accuracy_need):
    """
    problem_type: 'classification' or 'regression'
    data_size: 'small' (<10K), 'medium' (10K-1M), 'large' (>1M)
    interpretability_need: 'high', 'medium', 'low'
    accuracy_need: 'high', 'medium', 'low'
    """
    if interpretability_need == 'high':
        if problem_type == 'classification':
            return 'Logistic Regression'
        else:
            return 'Linear Regression'

    if data_size == 'small':
        return 'Random Forest'

    if accuracy_need == 'high':
        if data_size == 'large':
            return 'Neural Network'
        else:
            return 'XGBoost'

    return 'Random Forest'
```

Feature Engineering

Feature Types

Numerical Features

```python
import numpy as np
import pandas as pd

def engineer_numerical(df, col):
    features = {
        f'{col}_log': np.log1p(df[col]),
        f'{col}_sqrt': np.sqrt(df[col]),
        f'{col}_squared': df[col] ** 2,
        f'{col}_binned': pd.cut(df[col], bins=5, labels=False),
    }
    return pd.DataFrame(features)
```
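As a quick illustration of these transforms on a toy column (the values below are made up for demonstration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [20_000, 45_000, 60_000, 120_000, 300_000]})

features = pd.DataFrame({
    'income_log': np.log1p(df['income']),      # compresses the right tail
    'income_sqrt': np.sqrt(df['income']),
    'income_squared': df['income'] ** 2,
    'income_binned': pd.cut(df['income'], bins=5, labels=False),  # equal-width bins
})

print(features['income_binned'].tolist())  # [0, 0, 0, 1, 4]
```

Note that `pd.cut` uses equal-width bins, so a single outlier (300,000 here) pulls most values into the lowest bins; `pd.qcut` gives equal-frequency bins instead.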

Categorical Features

```python
import pandas as pd

def engineer_categorical(df, col):
    # One-hot encoding
    dummies = pd.get_dummies(df[col], prefix=col)

    # Target encoding
    target_mean = df.groupby(col)['target'].mean()
    target_encoded = df[col].map(target_mean)

    # Frequency encoding
    freq = df[col].value_counts(normalize=True)
    freq_encoded = df[col].map(freq)

    return dummies, target_encoded, freq_encoded
```
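On a small toy frame (values invented for illustration), the three encodings look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'SF', 'NY', 'LA', 'SF', 'NY'],
    'target': [1, 0, 1, 0, 1, 0],
})

# One-hot: one indicator column per category
dummies = pd.get_dummies(df['city'], prefix='city')
print(dummies.columns.tolist())  # ['city_LA', 'city_NY', 'city_SF']

# Target encoding: each category replaced by its mean target
# (in practice, compute the mapping on training folds only to avoid leakage)
target_encoded = df['city'].map(df.groupby('city')['target'].mean())
print(target_encoded.tolist())   # NY -> 2/3, SF -> 0.5, LA -> 0.0

# Frequency encoding: each category replaced by its share of rows
freq_encoded = df['city'].map(df['city'].value_counts(normalize=True))
print(freq_encoded.iloc[0])      # NY appears in 3 of 6 rows -> 0.5
```

The leakage caveat matters: target encoding computed on the full dataset leaks the label into the feature and inflates validation scores.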

Time Features

```python
import numpy as np
import pandas as pd

def engineer_time(df, col):
    df[col] = pd.to_datetime(df[col])
    features = {
        f'{col}_hour': df[col].dt.hour,
        f'{col}_day': df[col].dt.day,
        f'{col}_dayofweek': df[col].dt.dayofweek,
        f'{col}_month': df[col].dt.month,
        f'{col}_is_weekend': df[col].dt.dayofweek.isin([5, 6]).astype(int),
        # Cyclical encoding so 23:00 and 00:00 end up close together
        f'{col}_hour_sin': np.sin(2 * np.pi * df[col].dt.hour / 24),
        f'{col}_hour_cos': np.cos(2 * np.pi * df[col].dt.hour / 24),
    }
    return pd.DataFrame(features)
```
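A minimal check of the calendar-derived features on two hand-picked timestamps:

```python
import numpy as np
import pandas as pd

ts = pd.Series(pd.to_datetime(['2024-01-06 09:00', '2024-01-08 18:00']))

features = pd.DataFrame({
    'hour': ts.dt.hour,
    'dayofweek': ts.dt.dayofweek,                        # Monday=0 ... Sunday=6
    'is_weekend': ts.dt.dayofweek.isin([5, 6]).astype(int),
    'hour_sin': np.sin(2 * np.pi * ts.dt.hour / 24),
})

print(features['is_weekend'].tolist())  # [1, 0]  (Jan 6 2024 is a Saturday)
```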

Feature Selection

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif

def select_features(X, y, method='importance', n_features=20):
    if method == 'importance':
        model = RandomForestClassifier(n_estimators=100)
        model.fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
        return importance.nlargest(n_features).index.tolist()

    elif method == 'mutual_info':
        mi_scores = mutual_info_classif(X, y)
        mi_series = pd.Series(mi_scores, index=X.columns)
        return mi_series.nlargest(n_features).index.tolist()

    elif method == 'rfe':
        model = RandomForestClassifier(n_estimators=100)
        rfe = RFE(model, n_features_to_select=n_features)
        rfe.fit(X, y)
        return X.columns[rfe.support_].tolist()

    raise ValueError(f"Unknown method: {method}")
```
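The importance-based path can be exercised on synthetic data where only a few features carry signal (the feature names `f0`..`f7` are invented for the example):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 8 features, only 3 informative
X_arr, y = make_classification(n_samples=300, n_features=8,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
top = importance.nlargest(3).index.tolist()
print(top)  # the 3 highest-importance feature names (which ones depends on the seed)
```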

Model Evaluation

Classification Metrics

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
    classification_report
)

def evaluate_classification(y_true, y_pred, y_proba=None):
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }

    if y_proba is not None:
        metrics['auc_roc'] = roc_auc_score(y_true, y_proba)

    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

    return metrics
```
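A tiny worked example of how these metrics relate to each other (values chosen by hand so the arithmetic is visible):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]   # one positive missed (index 2)

acc = accuracy_score(y_true, y_pred)    # 0.8   (4 of 5 correct)
prec = precision_score(y_true, y_pred)  # 1.0   (both positive predictions correct)
rec = recall_score(y_true, y_pred)      # 2/3   (2 of 3 actual positives found)
f1 = f1_score(y_true, y_pred)           # 0.8   (harmonic mean of precision and recall)
```

Note that `precision_score` and `recall_score` default to binary classification; multi-class problems need an explicit `average=` argument.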

Regression Metrics

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error
)

def evaluate_regression(y_true, y_pred):
    metrics = {
        'mae': mean_absolute_error(y_true, y_pred),
        'mse': mean_squared_error(y_true, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
        'r2': r2_score(y_true, y_pred),
        'mape': mean_absolute_percentage_error(y_true, y_pred),
    }
    return metrics
```
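A hand-checkable example (values invented): errors of 0.5, 0, 0.5, 0 give MAE 0.25 and MSE 0.125, and with a total sum of squares of 20 around the mean, R² is 1 - 0.5/20 = 0.975.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mae = mean_absolute_error(y_true, y_pred)         # 0.25
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(0.125) ≈ 0.354
r2 = r2_score(y_true, y_pred)                     # 0.975
```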

Experimentation

A/B Test Design

```python
import numpy as np
from scipy import stats

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """
    Calculate required sample size per variant.

    baseline_rate: current conversion rate (e.g., 0.05)
    mde: minimum detectable effect, relative (e.g., 0.1 for a 10% lift)
    alpha: significance level
    power: statistical power
    """
    effect_size = baseline_rate * mde  # absolute lift to detect
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    p = baseline_rate
    n = (2 * p * (1 - p) * (z_alpha + z_beta) ** 2) / (effect_size ** 2)

    return int(np.ceil(n))


def analyze_ab_test(control, treatment, alpha=0.05):
    """
    Analyze A/B test results.

    control: array of 0/1 outcomes for control
    treatment: array of 0/1 outcomes for treatment
    """
    n_control = len(control)
    n_treatment = len(treatment)

    p_control = control.mean()
    p_treatment = treatment.mean()

    # Pooled proportion
    p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment)

    # Standard error
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))

    # Z-statistic
    z = (p_treatment - p_control) / se

    # P-value (two-tailed)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval at the 1 - alpha level
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    ci_low = (p_treatment - p_control) - z_alpha * se
    ci_high = (p_treatment - p_control) + z_alpha * se

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'lift': (p_treatment - p_control) / p_control,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha,
        'confidence_interval': (ci_low, ci_high),
    }
```
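Plugging representative numbers into the sample-size formula above (the rates here are illustrative, not benchmarks): detecting a 10% relative lift on a 5% baseline means looking for a 0.5pp absolute change, which needs roughly 30K users per variant.

```python
import numpy as np
from scipy import stats

baseline_rate, mde = 0.05, 0.10   # 5% baseline, 10% relative lift
alpha, power = 0.05, 0.8

effect = baseline_rate * mde                 # absolute effect: 0.005
z_a = stats.norm.ppf(1 - alpha / 2)          # ≈ 1.96
z_b = stats.norm.ppf(power)                  # ≈ 0.84
n = (2 * baseline_rate * (1 - baseline_rate) * (z_a + z_b) ** 2) / effect ** 2
n_per_variant = int(np.ceil(n))
print(n_per_variant)  # 29826
```

This is why small baseline rates and small lifts drive experiments to very large sample sizes: the required n scales with 1/effect².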

Statistical Analysis

Hypothesis Testing

```python
from scipy import stats
```

T-test

```python
import numpy as np

def compare_means(group1, group2):
    stat, p_value = stats.ttest_ind(group1, group2)
    # Cohen's d using a pooled standard deviation
    effect_size = (group1.mean() - group2.mean()) / np.sqrt(
        (group1.std() ** 2 + group2.std() ** 2) / 2
    )
    return {'t_statistic': stat, 'p_value': p_value, 'cohens_d': effect_size}
```

Chi-square

```python
def test_independence(contingency_table):
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    return {'chi2': chi2, 'p_value': p_value, 'degrees_of_freedom': dof}
```

Correlation

```python
def analyze_correlation(x, y):
    pearson_r, pearson_p = stats.pearsonr(x, y)
    spearman_r, spearman_p = stats.spearmanr(x, y)
    return {
        'pearson': {'r': pearson_r, 'p_value': pearson_p},
        'spearman': {'r': spearman_r, 'p_value': spearman_p},
    }
```
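A quick self-contained t-test run on two hand-written samples with clearly different means (the numbers are invented for the example):

```python
import numpy as np
from scipy import stats

group1 = np.array([9.1, 10.2, 9.8, 10.5, 9.9, 10.1])
group2 = np.array([11.0, 11.4, 10.8, 11.6, 11.2, 10.9])

t_stat, p_value = stats.ttest_ind(group1, group2)
# Means differ by ~1.2 with small within-group spread,
# so the difference is strongly significant (p well below 0.01)
print(t_stat < 0, p_value < 0.01)
```

The sign of the t-statistic follows the argument order: group1's mean is lower, so the statistic is negative.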

Project Template


Data Science Project: [Name]

Business Objective

[What business problem are we solving?]

Success Metrics

  • Primary: [Metric]
  • Secondary: [Metric]

Data

  • Sources: [List]
  • Size: [Rows/Features]
  • Time period: [Dates]

Methodology

  1. [Step 1]
  2. [Step 2]

Results

Model Performance

| Metric | Value |
| --- | --- |
| [Metric] | [Value] |

Business Impact

  • [Impact 1]
  • [Impact 2]

Recommendations

  1. [Recommendation]

Next Steps

  • [Next step]

Appendix

[Technical details]

Reference Materials

  • references/ml_algorithms.md
    - Algorithm deep dives
  • references/feature_engineering.md
    - Feature engineering patterns
  • references/experimentation.md
    - A/B testing guide
  • references/statistics.md
    - Statistical methods

Scripts


Model trainer

```bash
python scripts/train_model.py --config model_config.yaml
```

Feature importance analyzer

```bash
python scripts/feature_importance.py --model model.pkl --data test.csv
```

A/B test analyzer

```bash
python scripts/ab_analyzer.py --control control.csv --treatment treatment.csv
```

Model evaluator

```bash
python scripts/evaluate_model.py --model model.pkl --test test.csv
```