data-scientist

Data Scientist

Expert-level data science for business impact.

Core Competencies

  • Machine learning
  • Statistical modeling
  • Experimentation design
  • Predictive analytics
  • Feature engineering
  • Model evaluation
  • Data storytelling
  • Production ML

Machine Learning Workflow

PROBLEM DEFINITION → DATA → FEATURES → MODEL → EVALUATION → DEPLOYMENT

1. Problem Definition
   - Business objective
   - Success metrics
   - Constraints

2. Data Collection
   - Data sources
   - Data quality
   - Sample size

3. Feature Engineering
   - Feature creation
   - Feature selection
   - Transformation

4. Model Development
   - Algorithm selection
   - Training
   - Tuning

5. Evaluation
   - Metrics
   - Validation
   - Business impact

6. Deployment
   - Production pipeline
   - Monitoring
   - Iteration
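The six steps above can be sketched end to end. The following is a minimal illustration using scikit-learn on synthetic data; the dataset, scaler, and model are placeholders standing in for real problem-specific choices, not a prescribed stack:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1-2. Problem definition + data collection (synthetic stand-in for real data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3-4. Features + model: preprocessing folded into one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 5. Evaluation on held-out data
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")

# 6. Deployment: persist the fitted pipeline for a serving job, e.g.
# joblib.dump(model, "model.joblib")
```

Keeping preprocessing inside the pipeline means the exact same transforms run at training and serving time, which is the main point of step 6's "production pipeline".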

Model Selection

Algorithm Comparison

| Algorithm | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| Linear Regression | Continuous prediction | Interpretable, fast | Linear relationships only |
| Logistic Regression | Binary classification | Interpretable, probabilistic | Linear boundaries |
| Random Forest | Classification/Regression | Handles non-linearity | Less interpretable |
| XGBoost | Classification/Regression | High accuracy | Overfitting risk |
| Neural Networks | Complex patterns | Flexible | Requires lots of data |

Model Selection Framework

```python
def select_model(problem_type, data_size, interpretability_need, accuracy_need):
    """
    problem_type: 'classification' or 'regression'
    data_size: 'small' (<10K), 'medium' (10K-1M), 'large' (>1M)
    interpretability_need: 'high', 'medium', 'low'
    accuracy_need: 'high', 'medium', 'low'
    """
    if interpretability_need == 'high':
        if problem_type == 'classification':
            return 'Logistic Regression'
        else:
            return 'Linear Regression'

    if data_size == 'small':
        return 'Random Forest'

    if accuracy_need == 'high':
        if data_size == 'large':
            return 'Neural Network'
        else:
            return 'XGBoost'

    return 'Random Forest'
```

Feature Engineering

Feature Types

Numerical Features

```python
import numpy as np
import pandas as pd

def engineer_numerical(df, col):
    features = {
        f'{col}_log': np.log1p(df[col]),
        f'{col}_sqrt': np.sqrt(df[col]),
        f'{col}_squared': df[col] ** 2,
        f'{col}_binned': pd.cut(df[col], bins=5, labels=False),
    }
    return pd.DataFrame(features)
```
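As a quick illustration of these transforms on a toy column (the values below are made up for demonstration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [20_000, 45_000, 60_000, 120_000, 300_000]})

features = pd.DataFrame({
    'income_log': np.log1p(df['income']),      # compresses the right tail
    'income_sqrt': np.sqrt(df['income']),
    'income_squared': df['income'] ** 2,
    'income_binned': pd.cut(df['income'], bins=5, labels=False),  # equal-width bins
})

print(features['income_binned'].tolist())  # [0, 0, 0, 1, 4]
```

Note that `pd.cut` uses equal-width bins, so a single outlier (300,000 here) pulls most values into the lowest bins; `pd.qcut` gives equal-frequency bins instead.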

Categorical Features

```python
import pandas as pd

def engineer_categorical(df, col):
    # One-hot encoding
    dummies = pd.get_dummies(df[col], prefix=col)

    # Target encoding
    target_mean = df.groupby(col)['target'].mean()
    target_encoded = df[col].map(target_mean)

    # Frequency encoding
    freq = df[col].value_counts(normalize=True)
    freq_encoded = df[col].map(freq)

    return dummies, target_encoded, freq_encoded
```
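On a small toy frame (values invented for illustration), the three encodings look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'SF', 'NY', 'LA', 'SF', 'NY'],
    'target': [1, 0, 1, 0, 1, 0],
})

# One-hot: one indicator column per category
dummies = pd.get_dummies(df['city'], prefix='city')
print(dummies.columns.tolist())  # ['city_LA', 'city_NY', 'city_SF']

# Target encoding: each category replaced by its mean target
# (in practice, compute the mapping on training folds only to avoid leakage)
target_encoded = df['city'].map(df.groupby('city')['target'].mean())
print(target_encoded.tolist())   # NY -> 2/3, SF -> 0.5, LA -> 0.0

# Frequency encoding: each category replaced by its share of rows
freq_encoded = df['city'].map(df['city'].value_counts(normalize=True))
print(freq_encoded.iloc[0])      # NY appears in 3 of 6 rows -> 0.5
```

The leakage caveat matters: target encoding computed on the full dataset leaks the label into the feature and inflates validation scores.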

Time Features

```python
import numpy as np
import pandas as pd

def engineer_time(df, col):
    df[col] = pd.to_datetime(df[col])
    features = {
        f'{col}_hour': df[col].dt.hour,
        f'{col}_day': df[col].dt.day,
        f'{col}_dayofweek': df[col].dt.dayofweek,
        f'{col}_month': df[col].dt.month,
        f'{col}_is_weekend': df[col].dt.dayofweek.isin([5, 6]).astype(int),
        # Cyclical encoding so 23:00 and 00:00 end up close together
        f'{col}_hour_sin': np.sin(2 * np.pi * df[col].dt.hour / 24),
        f'{col}_hour_cos': np.cos(2 * np.pi * df[col].dt.hour / 24),
    }
    return pd.DataFrame(features)
```
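A minimal check of the calendar-derived features on two hand-picked timestamps:

```python
import numpy as np
import pandas as pd

ts = pd.Series(pd.to_datetime(['2024-01-06 09:00', '2024-01-08 18:00']))

features = pd.DataFrame({
    'hour': ts.dt.hour,
    'dayofweek': ts.dt.dayofweek,                        # Monday=0 ... Sunday=6
    'is_weekend': ts.dt.dayofweek.isin([5, 6]).astype(int),
    'hour_sin': np.sin(2 * np.pi * ts.dt.hour / 24),
})

print(features['is_weekend'].tolist())  # [1, 0]  (Jan 6 2024 is a Saturday)
```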

Feature Selection

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif

def select_features(X, y, method='importance', n_features=20):
    if method == 'importance':
        model = RandomForestClassifier(n_estimators=100)
        model.fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
        return importance.nlargest(n_features).index.tolist()

    elif method == 'mutual_info':
        mi_scores = mutual_info_classif(X, y)
        mi_series = pd.Series(mi_scores, index=X.columns)
        return mi_series.nlargest(n_features).index.tolist()

    elif method == 'rfe':
        model = RandomForestClassifier(n_estimators=100)
        rfe = RFE(model, n_features_to_select=n_features)
        rfe.fit(X, y)
        return X.columns[rfe.support_].tolist()

    raise ValueError(f"Unknown method: {method}")
```
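The importance-based path can be exercised on synthetic data where only a few features carry signal (the feature names `f0`..`f7` are invented for the example):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 8 features, only 3 informative
X_arr, y = make_classification(n_samples=300, n_features=8,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
top = importance.nlargest(3).index.tolist()
print(top)  # the 3 highest-importance feature names (which ones depends on the seed)
```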

Model Evaluation

Classification Metrics

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
    classification_report
)

def evaluate_classification(y_true, y_pred, y_proba=None):
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }

    if y_proba is not None:
        metrics['auc_roc'] = roc_auc_score(y_true, y_proba)

    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

    return metrics
```
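A tiny worked example of how these metrics relate to each other (values chosen by hand so the arithmetic is visible):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]   # one positive missed (index 2)

acc = accuracy_score(y_true, y_pred)    # 0.8   (4 of 5 correct)
prec = precision_score(y_true, y_pred)  # 1.0   (both positive predictions correct)
rec = recall_score(y_true, y_pred)      # 2/3   (2 of 3 actual positives found)
f1 = f1_score(y_true, y_pred)           # 0.8   (harmonic mean of precision and recall)
```

Note that `precision_score` and `recall_score` default to binary classification; multi-class problems need an explicit `average=` argument.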

Regression Metrics

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    r2_score, mean_absolute_percentage_error
)

def evaluate_regression(y_true, y_pred):
    metrics = {
        'mae': mean_absolute_error(y_true, y_pred),
        'mse': mean_squared_error(y_true, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
        'r2': r2_score(y_true, y_pred),
        'mape': mean_absolute_percentage_error(y_true, y_pred),
    }
    return metrics
```
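A hand-checkable example (values invented): errors of 0.5, 0, 0.5, 0 give MAE 0.25 and MSE 0.125, and with a total sum of squares of 20 around the mean, R² is 1 - 0.5/20 = 0.975.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mae = mean_absolute_error(y_true, y_pred)         # 0.25
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(0.125) ≈ 0.354
r2 = r2_score(y_true, y_pred)                     # 0.975
```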

Experimentation

A/B Test Design

```python
import numpy as np
from scipy import stats

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """
    Calculate required sample size per variant.

    baseline_rate: current conversion rate (e.g., 0.05)
    mde: minimum detectable effect, relative (e.g., 0.1 for a 10% lift)
    alpha: significance level
    power: statistical power
    """
    effect_size = baseline_rate * mde  # absolute lift to detect
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    p = baseline_rate
    n = (2 * p * (1 - p) * (z_alpha + z_beta) ** 2) / (effect_size ** 2)

    return int(np.ceil(n))


def analyze_ab_test(control, treatment, alpha=0.05):
    """
    Analyze A/B test results.

    control: array of 0/1 outcomes for control
    treatment: array of 0/1 outcomes for treatment
    """
    n_control = len(control)
    n_treatment = len(treatment)

    p_control = control.mean()
    p_treatment = treatment.mean()

    # Pooled proportion
    p_pool = (control.sum() + treatment.sum()) / (n_control + n_treatment)

    # Standard error
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))

    # Z-statistic
    z = (p_treatment - p_control) / se

    # P-value (two-tailed)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval at the 1 - alpha level
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    ci_low = (p_treatment - p_control) - z_alpha * se
    ci_high = (p_treatment - p_control) + z_alpha * se

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'lift': (p_treatment - p_control) / p_control,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha,
        'confidence_interval': (ci_low, ci_high),
    }
```
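Plugging representative numbers into the sample-size formula above (the rates here are illustrative, not benchmarks): detecting a 10% relative lift on a 5% baseline means looking for a 0.5pp absolute change, which needs roughly 30K users per variant.

```python
import numpy as np
from scipy import stats

baseline_rate, mde = 0.05, 0.10   # 5% baseline, 10% relative lift
alpha, power = 0.05, 0.8

effect = baseline_rate * mde                 # absolute effect: 0.005
z_a = stats.norm.ppf(1 - alpha / 2)          # ≈ 1.96
z_b = stats.norm.ppf(power)                  # ≈ 0.84
n = (2 * baseline_rate * (1 - baseline_rate) * (z_a + z_b) ** 2) / effect ** 2
n_per_variant = int(np.ceil(n))
print(n_per_variant)  # 29826
```

This is why small baseline rates and small lifts drive experiments to very large sample sizes: the required n scales with 1/effect².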

Statistical Analysis

Hypothesis Testing

```python
from scipy import stats
```

T-test

```python
import numpy as np

def compare_means(group1, group2):
    stat, p_value = stats.ttest_ind(group1, group2)
    # Cohen's d using a pooled standard deviation
    effect_size = (group1.mean() - group2.mean()) / np.sqrt(
        (group1.std() ** 2 + group2.std() ** 2) / 2
    )
    return {'t_statistic': stat, 'p_value': p_value, 'cohens_d': effect_size}
```

Chi-square

```python
def test_independence(contingency_table):
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    return {'chi2': chi2, 'p_value': p_value, 'degrees_of_freedom': dof}
```

Correlation

```python
def analyze_correlation(x, y):
    pearson_r, pearson_p = stats.pearsonr(x, y)
    spearman_r, spearman_p = stats.spearmanr(x, y)
    return {
        'pearson': {'r': pearson_r, 'p_value': pearson_p},
        'spearman': {'r': spearman_r, 'p_value': spearman_p},
    }
```
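A quick self-contained t-test run on two hand-written samples with clearly different means (the numbers are invented for the example):

```python
import numpy as np
from scipy import stats

group1 = np.array([9.1, 10.2, 9.8, 10.5, 9.9, 10.1])
group2 = np.array([11.0, 11.4, 10.8, 11.6, 11.2, 10.9])

t_stat, p_value = stats.ttest_ind(group1, group2)
# Means differ by ~1.2 with small within-group spread,
# so the difference is strongly significant (p well below 0.01)
print(t_stat < 0, p_value < 0.01)
```

The sign of the t-statistic follows the argument order: group1's mean is lower, so the statistic is negative.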

Project Template


Data Science Project: [Name]

Business Objective

[What business problem are we solving?]

Success Metrics

  • Primary: [Metric]
  • Secondary: [Metric]

Data

  • Sources: [List]
  • Size: [Rows/Features]
  • Time period: [Dates]

Methodology

  1. [Step 1]
  2. [Step 2]

Results

Model Performance

| Metric | Value |
| --- | --- |
| [Metric] | [Value] |

Business Impact

  • [Impact 1]
  • [Impact 2]

Recommendations

  1. [Recommendation]

Next Steps

  • [Next step]

Appendix

[Technical details]

Reference Materials

  • references/ml_algorithms.md
    - Algorithm deep dives
  • references/feature_engineering.md
    - Feature engineering patterns
  • references/experimentation.md
    - A/B testing guide
  • references/statistics.md
    - Statistical methods

Scripts


Model trainer

```bash
python scripts/train_model.py --config model_config.yaml
```

Feature importance analyzer

```bash
python scripts/feature_importance.py --model model.pkl --data test.csv
```

A/B test analyzer

```bash
python scripts/ab_analyzer.py --control control.csv --treatment treatment.csv
```

Model evaluator

```bash
python scripts/evaluate_model.py --model model.pkl --test test.csv
```