data-analyst

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Data Analyst

数据分析师

Expert-level data analysis for business insights.
提供专业级别的数据分析,助力挖掘商业洞察。

Core Competencies

核心能力

  • SQL and database querying
  • Data visualization
  • Statistical analysis
  • Business intelligence
  • Data storytelling
  • Dashboard development
  • Reporting automation
  • Stakeholder communication
  • SQL与数据库查询
  • 数据可视化
  • 统计分析
  • 商业智能
  • 数据叙事
  • 仪表盘开发
  • 报告自动化
  • 利益相关者沟通

SQL Mastery

SQL精通

Query Patterns

查询模式

Aggregation:
sql
SELECT
    date_trunc('month', created_at) as month,
    COUNT(*) as total_orders,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;
Window Functions:
sql
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;
CTEs for Clarity:
sql
WITH monthly_metrics AS (
    SELECT
        date_trunc('month', created_at) as month,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth_calc AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue
    FROM monthly_metrics
)
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;
Cohort Analysis:
sql
WITH first_orders AS (
    SELECT
        customer_id,
        date_trunc('month', MIN(created_at)) as cohort_month
    FROM orders
    GROUP BY 1
),
cohort_data AS (
    SELECT
        f.cohort_month,
        date_trunc('month', o.created_at) as order_month,
        COUNT(DISTINCT o.customer_id) as customers
    FROM orders o
    JOIN first_orders f ON o.customer_id = f.customer_id
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    order_month,
    EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
    customers
FROM cohort_data
ORDER BY 1, 2;
聚合查询:
sql
SELECT
    date_trunc('month', created_at) as month,
    COUNT(*) as total_orders,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;
窗口函数:
sql
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;
使用CTE提升可读性:
sql
WITH monthly_metrics AS (
    SELECT
        date_trunc('month', created_at) as month,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth_calc AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue
    FROM monthly_metrics
)
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;
同期群分析:
sql
WITH first_orders AS (
    SELECT
        customer_id,
        date_trunc('month', MIN(created_at)) as cohort_month
    FROM orders
    GROUP BY 1
),
cohort_data AS (
    SELECT
        f.cohort_month,
        date_trunc('month', o.created_at) as order_month,
        COUNT(DISTINCT o.customer_id) as customers
    FROM orders o
    JOIN first_orders f ON o.customer_id = f.customer_id
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    order_month,
    EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
    customers
FROM cohort_data
ORDER BY 1, 2;

Query Optimization

查询优化

Use EXPLAIN:
sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
Best Practices:
  • Use indexes on filtered columns
  • Avoid SELECT * in production
  • Use LIMIT for exploratory queries
  • Filter early, aggregate late
  • Use appropriate data types
使用EXPLAIN分析:
sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
最佳实践:
  • 为过滤字段创建索引
  • 生产环境避免使用SELECT *
  • 探索性查询使用LIMIT
  • 先过滤,后聚合
  • 使用合适的数据类型

Data Visualization

数据可视化

Chart Selection Guide

图表选择指南

Data TypeBest ChartAlternative
Trend over timeLine chartArea chart
Part of wholePie/DonutStacked bar
ComparisonBar chartColumn chart
DistributionHistogramBox plot
CorrelationScatter plotHeatmap
GeographicMapChoropleth
数据类型最佳图表替代选项
随时间变化的趋势折线图面积图
占比关系饼图/环形图堆叠条形图
对比分析条形图柱状图
分布情况直方图箱线图
相关性散点图热力图
地理数据地图分级统计图

Visualization Best Practices

可视化最佳实践

Do:
  • Start Y-axis at zero (for bars)
  • Use consistent colors
  • Label axes clearly
  • Include context (benchmarks, targets)
  • Order categories meaningfully
Don't:
  • Use 3D charts
  • Use more than 5-7 colors
  • Truncate axes misleadingly
  • Clutter with gridlines
  • Use pie charts for many categories
建议:
  • 条形图Y轴从0开始
  • 使用统一的配色
  • 清晰标注坐标轴
  • 包含上下文信息(基准值、目标值)
  • 按有意义的顺序排列类别
避免:
  • 使用3D图表
  • 使用超过5-7种颜色
  • 误导性截断坐标轴
  • 用网格线造成视觉混乱
  • 类别过多时使用饼图

Dashboard Layout

仪表盘布局

┌─────────────────────────────────────────────────────────────┐
│  EXECUTIVE SUMMARY                                          │
│  [KPI 1: $X]  [KPI 2: X%]  [KPI 3: X]  [KPI 4: X%]         │
├─────────────────────────────────────────────────────────────┤
│  TRENDS                          │  BREAKDOWN               │
│  [Line Chart - Primary Metric]   │  [Bar Chart - Segments]  │
│                                  │                          │
├──────────────────────────────────┼──────────────────────────┤
│  COMPARISON                      │  DETAIL TABLE            │
│  [Bar Chart - vs Target/LY]      │  [Top N with metrics]    │
│                                  │                          │
└──────────────────────────────────┴──────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  执行摘要                                                  │
│  [关键指标1: $X]  [关键指标2: X%]  [关键指标3: X]  [关键指标4: X%]         │
├─────────────────────────────────────────────────────────────┤
│  趋势分析                          │  细分拆解               │
│  [折线图 - 核心指标]   │  [条形图 - 细分维度]  │
│                                  │                          │
├──────────────────────────────────┼──────────────────────────┤
│  对比分析                      │  详情表格            │
│  [条形图 - 与目标/去年同期对比]      │  [Top N 明细数据]    │
│                                  │                          │
└──────────────────────────────────┴──────────────────────────┘

Statistical Analysis

统计分析

Descriptive Statistics

描述性统计

python
import pandas as pd
import numpy as np

def describe_data(df, column):
    stats = {
        'count': df[column].count(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
        'min': df[column].min(),
        'max': df[column].max(),
        'q25': df[column].quantile(0.25),
        'q75': df[column].quantile(0.75),
        'skewness': df[column].skew(),
        'kurtosis': df[column].kurtosis()
    }
    return stats
python
import pandas as pd
import numpy as np

def describe_data(df, column):
    stats = {
        'count': df[column].count(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
        'min': df[column].min(),
        'max': df[column].max(),
        'q25': df[column].quantile(0.25),
        'q75': df[column].quantile(0.75),
        'skewness': df[column].skew(),
        'kurtosis': df[column].kurtosis()
    }
    return stats

Hypothesis Testing

假设检验

python
from scipy import stats
python
from scipy import stats

T-test: Compare two groups

T检验:对比两组数据

def compare_groups(group_a, group_b, alpha=0.05): stat, p_value = stats.ttest_ind(group_a, group_b)
result = {
    't_statistic': stat,
    'p_value': p_value,
    'significant': p_value < alpha,
    'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
        (group_a.std()**2 + group_b.std()**2) / 2
    )
}
return result
def compare_groups(group_a, group_b, alpha=0.05): stat, p_value = stats.ttest_ind(group_a, group_b)
result = {
    't_statistic': stat,
    'p_value': p_value,
    'significant': p_value < alpha,
    'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
        (group_a.std()**2 + group_b.std()**2) / 2
    )
}
return result

Chi-square: Test independence

卡方检验:检验独立性

def test_independence(observed, alpha=0.05): stat, p_value, dof, expected = stats.chi2_contingency(observed)
return {
    'chi2': stat,
    'p_value': p_value,
    'degrees_of_freedom': dof,
    'significant': p_value < alpha
}
undefined
def test_independence(observed, alpha=0.05): stat, p_value, dof, expected = stats.chi2_contingency(observed)
return {
    'chi2': stat,
    'p_value': p_value,
    'degrees_of_freedom': dof,
    'significant': p_value < alpha
}
undefined

Regression Analysis

回归分析

python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

def simple_regression(X, y):
    model = LinearRegression()
    model.fit(X.reshape(-1, 1), y)

    predictions = model.predict(X.reshape(-1, 1))

    return {
        'coefficient': model.coef_[0],
        'intercept': model.intercept_,
        'r_squared': r2_score(y, predictions),
        'mae': mean_absolute_error(y, predictions)
    }
python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

def simple_regression(X, y):
    model = LinearRegression()
    model.fit(X.reshape(-1, 1), y)

    predictions = model.predict(X.reshape(-1, 1))

    return {
        'coefficient': model.coef_[0],
        'intercept': model.intercept_,
        'r_squared': r2_score(y, predictions),
        'mae': mean_absolute_error(y, predictions)
    }

Business Analysis

商业分析

Analysis Framework

分析框架

markdown
undefined
markdown
undefined

Analysis: [Topic]

分析:[主题]

Business Question

业务问题

[What are we trying to answer?]
[我们需要解答什么问题?]

Hypothesis

假设

[What do we expect to find?]
[我们预期会发现什么?]

Data Sources

数据来源

Methodology

分析方法

  1. [Step 1]
  2. [Step 2]
  3. [Step 3]
  1. [步骤1]
  2. [步骤2]
  3. [步骤3]

Findings

分析结果

Finding 1: [Title]

发现1: [标题]

[Description with supporting data]
[带支撑数据的描述]

Finding 2: [Title]

发现2: [标题]

[Description with supporting data]
[带支撑数据的描述]

Recommendations

建议

  1. [Recommendation]: [Expected impact]
  2. [Recommendation]: [Expected impact]

Limitations

局限性

  • [Limitation 1]
  • [Limitation 2]
  • [局限性1]
  • [局限性2]

Next Steps

后续行动

  • [Action item]
undefined
  • [行动项]
undefined

Key Business Metrics

核心业务指标

Acquisition:
  • Customer Acquisition Cost (CAC)
  • Cost per Lead (CPL)
  • Conversion Rate
Engagement:
  • Daily/Monthly Active Users
  • Session Duration
  • Feature Adoption
Retention:
  • Churn Rate
  • Retention Rate
  • Net Revenue Retention
Revenue:
  • Monthly Recurring Revenue (MRR)
  • Average Revenue Per User (ARPU)
  • Lifetime Value (LTV)
获客类:
  • 客户获取成本 (CAC)
  • 线索获取成本 (CPL)
  • 转化率
参与度类:
  • 日活/月活用户数
  • 会话时长
  • 功能渗透率
留存类:
  • 流失率
  • 留存率
  • 净收入留存率
收入类:
  • 月度经常性收入 (MRR)
  • 每用户平均收入 (ARPU)
  • 用户生命周期价值 (LTV)

Data Storytelling

数据叙事

Presentation Structure

汇报结构

1. CONTEXT
   - Why does this matter?
   - What question are we answering?

2. KEY FINDING
   - Lead with the insight
   - Make it memorable

3. EVIDENCE
   - Show the data
   - Use effective visuals

4. IMPLICATIONS
   - What does this mean?
   - So what?

5. RECOMMENDATIONS
   - What should we do?
   - Clear next steps
1. 背景
   - 为什么这件事重要?
   - 我们要解答什么问题?

2. 核心发现
   - 先抛出核心洞察
   - 让结论令人印象深刻

3. 证据支撑
   - 展示数据
   - 使用有效的可视化图表

4. 业务影响
   - 这个发现意味着什么?
   - 对业务有什么价值?

5. 行动建议
   - 我们应该做什么?
   - 明确的后续步骤

Insight Template

洞察模板

markdown
undefined
markdown
undefined

[Headline: Action-oriented finding]

[标题:面向行动的发现]

What: [One sentence description of the finding]
So What: [Why this matters to the business]
Now What: [Recommended action]
Evidence: [Chart or data supporting the finding]
Confidence: [High/Medium/Low]
undefined
是什么: [一句话描述发现内容]
为什么重要: [这个发现对业务的意义]
应该怎么做: [推荐的行动方案]
支撑证据: [图表或数据]
置信度: [高/中/低]
undefined

Reference Materials

参考资料

  • references/sql_patterns.md
    - Advanced SQL queries
  • references/visualization.md
    - Chart selection guide
  • references/statistics.md
    - Statistical methods
  • references/storytelling.md
    - Presentation best practices
  • references/sql_patterns.md
    - 高级SQL查询
  • references/visualization.md
    - 图表选择指南
  • references/statistics.md
    - 统计方法
  • references/storytelling.md
    - 汇报最佳实践

Scripts

脚本工具

bash
undefined
bash
undefined

Data profiler

数据探查工具

python scripts/data_profiler.py --table orders --output profile.html
python scripts/data_profiler.py --table orders --output profile.html

SQL query analyzer

SQL查询分析工具

python scripts/query_analyzer.py --query query.sql --explain
python scripts/query_analyzer.py --query query.sql --explain

Dashboard generator

仪表盘生成工具

python scripts/dashboard_gen.py --config dashboard.yaml
python scripts/dashboard_gen.py --config dashboard.yaml

Report automation

报告自动化工具

python scripts/report_gen.py --template monthly --output report.pdf
undefined
python scripts/report_gen.py --template monthly --output report.pdf
undefined