data-analyst

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Data Analyst

数据分析师

Expert-level data analysis for business insights.

提供专业级别的数据分析，助力挖掘商业洞察。

Core Competencies

核心能力

SQL and database querying
Data visualization
Statistical analysis
Business intelligence
Data storytelling
Dashboard development
Reporting automation
Stakeholder communication

SQL与数据库查询
数据可视化
统计分析
商业智能
数据叙事
仪表盘开发
报告自动化
利益相关者沟通

SQL Mastery

SQL精通

Query Patterns

查询模式

Aggregation:

sql

SELECT
    date_trunc('month', created_at) as month,
    COUNT(*) as total_orders,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;

Window Functions:

sql

SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;

CTEs for Clarity:

sql

WITH monthly_metrics AS (
    SELECT
        date_trunc('month', created_at) as month,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth_calc AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue
    FROM monthly_metrics
)
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;

Cohort Analysis:

sql

WITH first_orders AS (
    SELECT
        customer_id,
        date_trunc('month', MIN(created_at)) as cohort_month
    FROM orders
    GROUP BY 1
),
cohort_data AS (
    SELECT
        f.cohort_month,
        date_trunc('month', o.created_at) as order_month,
        COUNT(DISTINCT o.customer_id) as customers
    FROM orders o
    JOIN first_orders f ON o.customer_id = f.customer_id
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    order_month,
    EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
    customers
FROM cohort_data
ORDER BY 1, 2;

聚合查询：

sql

SELECT
    date_trunc('month', created_at) as month,
    COUNT(*) as total_orders,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;

窗口函数：

sql

SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;

使用CTE提升可读性：

sql

WITH monthly_metrics AS (
    SELECT
        date_trunc('month', created_at) as month,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth_calc AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue
    FROM monthly_metrics
)
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;

同期群分析：

sql

WITH first_orders AS (
    SELECT
        customer_id,
        date_trunc('month', MIN(created_at)) as cohort_month
    FROM orders
    GROUP BY 1
),
cohort_data AS (
    SELECT
        f.cohort_month,
        date_trunc('month', o.created_at) as order_month,
        COUNT(DISTINCT o.customer_id) as customers
    FROM orders o
    JOIN first_orders f ON o.customer_id = f.customer_id
    GROUP BY 1, 2
)
SELECT
    cohort_month,
    order_month,
    EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
    customers
FROM cohort_data
ORDER BY 1, 2;

Query Optimization

查询优化

Use EXPLAIN:

sql

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;

Best Practices:

Use indexes on filtered columns
Avoid SELECT * in production
Use LIMIT for exploratory queries
Filter early, aggregate late
Use appropriate data types

使用EXPLAIN分析：

sql

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;

最佳实践：

为过滤字段创建索引
生产环境避免使用SELECT *
探索性查询使用LIMIT
先过滤，后聚合
使用合适的数据类型

Data Visualization

数据可视化

Chart Selection Guide

图表选择指南

Data Type	Best Chart	Alternative
Trend over time	Line chart	Area chart
Part of whole	Pie/Donut	Stacked bar
Comparison	Bar chart	Column chart
Distribution	Histogram	Box plot
Correlation	Scatter plot	Heatmap
Geographic	Map	Choropleth

数据类型	最佳图表	替代选项
随时间变化的趋势	折线图	面积图
占比关系	饼图/环形图	堆叠条形图
对比分析	条形图	柱状图
分布情况	直方图	箱线图
相关性	散点图	热力图
地理数据	地图	分级统计图

Visualization Best Practices

可视化最佳实践

Do:

Start Y-axis at zero (for bars)
Use consistent colors
Label axes clearly
Include context (benchmarks, targets)
Order categories meaningfully

Don't:

Use 3D charts
Use more than 5-7 colors
Truncate axes misleadingly
Clutter with gridlines
Use pie charts for many categories

建议：

条形图Y轴从0开始
使用统一的配色
清晰标注坐标轴
包含上下文信息（基准值、目标值）
按有意义的顺序排列类别

避免：

使用3D图表
使用超过5-7种颜色
误导性截断坐标轴
用网格线造成视觉混乱
类别过多时使用饼图

Dashboard Layout

仪表盘布局

┌─────────────────────────────────────────────────────────────┐
│  EXECUTIVE SUMMARY                                          │
│  [KPI 1: $X]  [KPI 2: X%]  [KPI 3: X]  [KPI 4: X%]         │
├─────────────────────────────────────────────────────────────┤
│  TRENDS                          │  BREAKDOWN               │
│  [Line Chart - Primary Metric]   │  [Bar Chart - Segments]  │
│                                  │                          │
├──────────────────────────────────┼──────────────────────────┤
│  COMPARISON                      │  DETAIL TABLE            │
│  [Bar Chart - vs Target/LY]      │  [Top N with metrics]    │
│                                  │                          │
└──────────────────────────────────┴──────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  执行摘要                                                  │
│  [关键指标1: $X]  [关键指标2: X%]  [关键指标3: X]  [关键指标4: X%]         │
├─────────────────────────────────────────────────────────────┤
│  趋势分析                          │  细分拆解               │
│  [折线图 - 核心指标]   │  [条形图 - 细分维度]  │
│                                  │                          │
├──────────────────────────────────┼──────────────────────────┤
│  对比分析                      │  详情表格            │
│  [条形图 - 与目标/去年同期对比]      │  [Top N 明细数据]    │
│                                  │                          │
└──────────────────────────────────┴──────────────────────────┘

Statistical Analysis

统计分析

Descriptive Statistics

描述性统计

python

import pandas as pd
import numpy as np

def describe_data(df, column):
    stats = {
        'count': df[column].count(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
        'min': df[column].min(),
        'max': df[column].max(),
        'q25': df[column].quantile(0.25),
        'q75': df[column].quantile(0.75),
        'skewness': df[column].skew(),
        'kurtosis': df[column].kurtosis()
    }
    return stats

python

import pandas as pd
import numpy as np

def describe_data(df, column):
    stats = {
        'count': df[column].count(),
        'mean': df[column].mean(),
        'median': df[column].median(),
        'std': df[column].std(),
        'min': df[column].min(),
        'max': df[column].max(),
        'q25': df[column].quantile(0.25),
        'q75': df[column].quantile(0.75),
        'skewness': df[column].skew(),
        'kurtosis': df[column].kurtosis()
    }
    return stats

Hypothesis Testing

假设检验

python

from scipy import stats

python

from scipy import stats

T-test: Compare two groups

T检验：对比两组数据

def compare_groups(group_a, group_b, alpha=0.05): stat, p_value = stats.ttest_ind(group_a, group_b)

result = {
    't_statistic': stat,
    'p_value': p_value,
    'significant': p_value < alpha,
    'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
        (group_a.std()**2 + group_b.std()**2) / 2
    )
}
return result

def compare_groups(group_a, group_b, alpha=0.05): stat, p_value = stats.ttest_ind(group_a, group_b)

result = {
    't_statistic': stat,
    'p_value': p_value,
    'significant': p_value < alpha,
    'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
        (group_a.std()**2 + group_b.std()**2) / 2
    )
}
return result

Chi-square: Test independence

卡方检验：检验独立性

def test_independence(observed, alpha=0.05): stat, p_value, dof, expected = stats.chi2_contingency(observed)

return {
    'chi2': stat,
    'p_value': p_value,
    'degrees_of_freedom': dof,
    'significant': p_value < alpha
}

undefined

def test_independence(observed, alpha=0.05): stat, p_value, dof, expected = stats.chi2_contingency(observed)

return {
    'chi2': stat,
    'p_value': p_value,
    'degrees_of_freedom': dof,
    'significant': p_value < alpha
}

undefined

Regression Analysis

回归分析

python

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

def simple_regression(X, y):
    model = LinearRegression()
    model.fit(X.reshape(-1, 1), y)

    predictions = model.predict(X.reshape(-1, 1))

    return {
        'coefficient': model.coef_[0],
        'intercept': model.intercept_,
        'r_squared': r2_score(y, predictions),
        'mae': mean_absolute_error(y, predictions)
    }

python

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

def simple_regression(X, y):
    model = LinearRegression()
    model.fit(X.reshape(-1, 1), y)

    predictions = model.predict(X.reshape(-1, 1))

    return {
        'coefficient': model.coef_[0],
        'intercept': model.intercept_,
        'r_squared': r2_score(y, predictions),
        'mae': mean_absolute_error(y, predictions)
    }

Business Analysis

商业分析

Analysis Framework

分析框架

markdown

undefined

markdown

undefined

Analysis: [Topic]

分析：[主题]

Business Question

业务问题

[What are we trying to answer?]

[我们需要解答什么问题？]

Hypothesis

假设

[What do we expect to find?]

[我们预期会发现什么？]

Data Sources

数据来源

Methodology

分析方法

[Step 1]
[Step 2]
[Step 3]

[步骤1]
[步骤2]
[步骤3]

Findings

分析结果

Finding 1: [Title]

发现1: [标题]

[Description with supporting data]

[带支撑数据的描述]

Finding 2: [Title]

发现2: [标题]

[Description with supporting data]

[带支撑数据的描述]

Recommendations

建议

[Recommendation]: [Expected impact]
[Recommendation]: [Expected impact]

Limitations

局限性

[Limitation 1]
[Limitation 2]

[局限性1]
[局限性2]

Next Steps

后续行动

[Action item]

undefined

[行动项]

undefined

Key Business Metrics

核心业务指标

Acquisition:

Customer Acquisition Cost (CAC)
Cost per Lead (CPL)
Conversion Rate

Engagement:

Daily/Monthly Active Users
Session Duration
Feature Adoption

Retention:

Churn Rate
Retention Rate
Net Revenue Retention

Revenue:

Monthly Recurring Revenue (MRR)
Average Revenue Per User (ARPU)
Lifetime Value (LTV)

获客类：

客户获取成本 (CAC)
线索获取成本 (CPL)
转化率

参与度类：

日活/月活用户数
会话时长
功能渗透率

留存类：

流失率
留存率
净收入留存率

收入类：

月度经常性收入 (MRR)
每用户平均收入 (ARPU)
用户生命周期价值 (LTV)

Data Storytelling

数据叙事

Presentation Structure

汇报结构

1. CONTEXT
   - Why does this matter?
   - What question are we answering?

2. KEY FINDING
   - Lead with the insight
   - Make it memorable

3. EVIDENCE
   - Show the data
   - Use effective visuals

4. IMPLICATIONS
   - What does this mean?
   - So what?

5. RECOMMENDATIONS
   - What should we do?
   - Clear next steps

1. 背景
   - 为什么这件事重要？
   - 我们要解答什么问题？

2. 核心发现
   - 先抛出核心洞察
   - 让结论令人印象深刻

3. 证据支撑
   - 展示数据
   - 使用有效的可视化图表

4. 业务影响
   - 这个发现意味着什么？
   - 对业务有什么价值？

5. 行动建议
   - 我们应该做什么？
   - 明确的后续步骤

Insight Template

洞察模板

markdown

undefined

markdown

undefined

[Headline: Action-oriented finding]

[标题：面向行动的发现]

What: [One sentence description of the finding]

So What: [Why this matters to the business]

Now What: [Recommended action]

Evidence: [Chart or data supporting the finding]

Confidence: [High/Medium/Low]

undefined

是什么： [一句话描述发现内容]

为什么重要： [这个发现对业务的意义]

应该怎么做： [推荐的行动方案]

支撑证据： [图表或数据]

置信度： [高/中/低]

undefined

Reference Materials

参考资料

```
references/sql_patterns.md
```
- Advanced SQL queries
```
references/visualization.md
```
- Chart selection guide
```
references/statistics.md
```
- Statistical methods
```
references/storytelling.md
```
- Presentation best practices

```
references/sql_patterns.md
```
- 高级SQL查询
```
references/visualization.md
```
- 图表选择指南
```
references/statistics.md
```
- 统计方法
```
references/storytelling.md
```
- 汇报最佳实践

Scripts

脚本工具

bash

undefined

bash

undefined

Data profiler

数据探查工具

python scripts/data_profiler.py --table orders --output profile.html

SQL query analyzer

SQL查询分析工具

python scripts/query_analyzer.py --query query.sql --explain

Dashboard generator

仪表盘生成工具

python scripts/dashboard_gen.py --config dashboard.yaml

Report automation

报告自动化工具

python scripts/report_gen.py --template monthly --output report.pdf

undefined

python scripts/report_gen.py --template monthly --output report.pdf

undefined