data-analyst
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseData Analyst
数据分析师
Expert-level data analysis for business insights.
提供专业级别的数据分析,助力挖掘商业洞察。
Core Competencies
核心能力
- SQL and database querying
- Data visualization
- Statistical analysis
- Business intelligence
- Data storytelling
- Dashboard development
- Reporting automation
- Stakeholder communication
- SQL与数据库查询
- 数据可视化
- 统计分析
- 商业智能
- 数据叙事
- 仪表盘开发
- 报告自动化
- 利益相关者沟通
SQL Mastery
SQL精通
Query Patterns
查询模式
Aggregation:
sql
SELECT
date_trunc('month', created_at) as month,
COUNT(*) as total_orders,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;Window Functions:
sql
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;CTEs for Clarity:
sql
WITH monthly_metrics AS (
SELECT
date_trunc('month', created_at) as month,
SUM(amount) as revenue
FROM orders
GROUP BY 1
),
growth_calc AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_revenue
FROM monthly_metrics
)
SELECT
month,
revenue,
ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;Cohort Analysis:
sql
WITH first_orders AS (
SELECT
customer_id,
date_trunc('month', MIN(created_at)) as cohort_month
FROM orders
GROUP BY 1
),
cohort_data AS (
SELECT
f.cohort_month,
date_trunc('month', o.created_at) as order_month,
COUNT(DISTINCT o.customer_id) as customers
FROM orders o
JOIN first_orders f ON o.customer_id = f.customer_id
GROUP BY 1, 2
)
SELECT
cohort_month,
order_month,
EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
customers
FROM cohort_data
ORDER BY 1, 2;聚合查询:
sql
SELECT
date_trunc('month', created_at) as month,
COUNT(*) as total_orders,
COUNT(DISTINCT customer_id) as unique_customers,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
ORDER BY 1;窗口函数:
sql
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) as previous_order
FROM orders;使用CTE提升可读性:
sql
WITH monthly_metrics AS (
SELECT
date_trunc('month', created_at) as month,
SUM(amount) as revenue
FROM orders
GROUP BY 1
),
growth_calc AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_revenue
FROM monthly_metrics
)
SELECT
month,
revenue,
ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) as growth_pct
FROM growth_calc;同期群分析:
sql
WITH first_orders AS (
SELECT
customer_id,
date_trunc('month', MIN(created_at)) as cohort_month
FROM orders
GROUP BY 1
),
cohort_data AS (
SELECT
f.cohort_month,
date_trunc('month', o.created_at) as order_month,
COUNT(DISTINCT o.customer_id) as customers
FROM orders o
JOIN first_orders f ON o.customer_id = f.customer_id
GROUP BY 1, 2
)
SELECT
cohort_month,
order_month,
EXTRACT(MONTH FROM AGE(order_month, cohort_month)) as months_since_cohort,
customers
FROM cohort_data
ORDER BY 1, 2;Query Optimization
查询优化
Use EXPLAIN:
sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;Best Practices:
- Use indexes on filtered columns
- Avoid SELECT * in production
- Use LIMIT for exploratory queries
- Filter early, aggregate late
- Use appropriate data types
使用EXPLAIN分析:
sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;最佳实践:
- 为过滤字段创建索引
- 生产环境避免使用SELECT *
- 探索性查询使用LIMIT
- 先过滤,后聚合
- 使用合适的数据类型
Data Visualization
数据可视化
Chart Selection Guide
图表选择指南
| Data Type | Best Chart | Alternative |
|---|---|---|
| Trend over time | Line chart | Area chart |
| Part of whole | Pie/Donut | Stacked bar |
| Comparison | Bar chart | Column chart |
| Distribution | Histogram | Box plot |
| Correlation | Scatter plot | Heatmap |
| Geographic | Map | Choropleth |
| 数据类型 | 最佳图表 | 替代选项 |
|---|---|---|
| 随时间变化的趋势 | 折线图 | 面积图 |
| 占比关系 | 饼图/环形图 | 堆叠条形图 |
| 对比分析 | 条形图 | 柱状图 |
| 分布情况 | 直方图 | 箱线图 |
| 相关性 | 散点图 | 热力图 |
| 地理数据 | 地图 | 分级统计图 |
Visualization Best Practices
可视化最佳实践
Do:
- Start Y-axis at zero (for bars)
- Use consistent colors
- Label axes clearly
- Include context (benchmarks, targets)
- Order categories meaningfully
Don't:
- Use 3D charts
- Use more than 5-7 colors
- Truncate axes misleadingly
- Clutter with gridlines
- Use pie charts for many categories
建议:
- 条形图Y轴从0开始
- 使用统一的配色
- 清晰标注坐标轴
- 包含上下文信息(基准值、目标值)
- 按有意义的顺序排列类别
避免:
- 使用3D图表
- 使用超过5-7种颜色
- 误导性截断坐标轴
- 用网格线造成视觉混乱
- 类别过多时使用饼图
Dashboard Layout
仪表盘布局
┌─────────────────────────────────────────────────────────────┐
│ EXECUTIVE SUMMARY │
│ [KPI 1: $X] [KPI 2: X%] [KPI 3: X] [KPI 4: X%] │
├─────────────────────────────────────────────────────────────┤
│ TRENDS │ BREAKDOWN │
│ [Line Chart - Primary Metric] │ [Bar Chart - Segments] │
│ │ │
├──────────────────────────────────┼──────────────────────────┤
│ COMPARISON │ DETAIL TABLE │
│ [Bar Chart - vs Target/LY] │ [Top N with metrics] │
│ │ │
└──────────────────────────────────┴──────────────────────────┘┌─────────────────────────────────────────────────────────────┐
│ 执行摘要 │
│ [关键指标1: $X] [关键指标2: X%] [关键指标3: X] [关键指标4: X%] │
├─────────────────────────────────────────────────────────────┤
│ 趋势分析 │ 细分拆解 │
│ [折线图 - 核心指标] │ [条形图 - 细分维度] │
│ │ │
├──────────────────────────────────┼──────────────────────────┤
│ 对比分析 │ 详情表格 │
│ [条形图 - 与目标/去年同期对比] │ [Top N 明细数据] │
│ │ │
└──────────────────────────────────┴──────────────────────────┘Statistical Analysis
统计分析
Descriptive Statistics
描述性统计
python
import pandas as pd
import numpy as np
def describe_data(df, column):
stats = {
'count': df[column].count(),
'mean': df[column].mean(),
'median': df[column].median(),
'std': df[column].std(),
'min': df[column].min(),
'max': df[column].max(),
'q25': df[column].quantile(0.25),
'q75': df[column].quantile(0.75),
'skewness': df[column].skew(),
'kurtosis': df[column].kurtosis()
}
return statspython
import pandas as pd
import numpy as np
def describe_data(df, column):
stats = {
'count': df[column].count(),
'mean': df[column].mean(),
'median': df[column].median(),
'std': df[column].std(),
'min': df[column].min(),
'max': df[column].max(),
'q25': df[column].quantile(0.25),
'q75': df[column].quantile(0.75),
'skewness': df[column].skew(),
'kurtosis': df[column].kurtosis()
}
return statsHypothesis Testing
假设检验
python
from scipy import statspython
from scipy import statsT-test: Compare two groups
T检验:对比两组数据
def compare_groups(group_a, group_b, alpha=0.05):
stat, p_value = stats.ttest_ind(group_a, group_b)
result = {
't_statistic': stat,
'p_value': p_value,
'significant': p_value < alpha,
'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
(group_a.std()**2 + group_b.std()**2) / 2
)
}
return resultdef compare_groups(group_a, group_b, alpha=0.05):
stat, p_value = stats.ttest_ind(group_a, group_b)
result = {
't_statistic': stat,
'p_value': p_value,
'significant': p_value < alpha,
'effect_size': (group_a.mean() - group_b.mean()) / np.sqrt(
(group_a.std()**2 + group_b.std()**2) / 2
)
}
return resultChi-square: Test independence
卡方检验:检验独立性
def test_independence(observed, alpha=0.05):
stat, p_value, dof, expected = stats.chi2_contingency(observed)
return {
'chi2': stat,
'p_value': p_value,
'degrees_of_freedom': dof,
'significant': p_value < alpha
}undefineddef test_independence(observed, alpha=0.05):
stat, p_value, dof, expected = stats.chi2_contingency(observed)
return {
'chi2': stat,
'p_value': p_value,
'degrees_of_freedom': dof,
'significant': p_value < alpha
}undefinedRegression Analysis
回归分析
python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
def simple_regression(X, y):
model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
predictions = model.predict(X.reshape(-1, 1))
return {
'coefficient': model.coef_[0],
'intercept': model.intercept_,
'r_squared': r2_score(y, predictions),
'mae': mean_absolute_error(y, predictions)
}python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
def simple_regression(X, y):
model = LinearRegression()
model.fit(X.reshape(-1, 1), y)
predictions = model.predict(X.reshape(-1, 1))
return {
'coefficient': model.coef_[0],
'intercept': model.intercept_,
'r_squared': r2_score(y, predictions),
'mae': mean_absolute_error(y, predictions)
}Business Analysis
商业分析
Analysis Framework
分析框架
markdown
undefinedmarkdown
undefinedAnalysis: [Topic]
分析:[主题]
Business Question
业务问题
[What are we trying to answer?]
[我们需要解答什么问题?]
Hypothesis
假设
[What do we expect to find?]
[我们预期会发现什么?]
Data Sources
数据来源
Methodology
分析方法
- [Step 1]
- [Step 2]
- [Step 3]
- [步骤1]
- [步骤2]
- [步骤3]
Findings
分析结果
Finding 1: [Title]
发现1: [标题]
[Description with supporting data]
[带支撑数据的描述]
Finding 2: [Title]
发现2: [标题]
[Description with supporting data]
[带支撑数据的描述]
Recommendations
建议
- [Recommendation]: [Expected impact]
- [Recommendation]: [Expected impact]
Limitations
局限性
- [Limitation 1]
- [Limitation 2]
- [局限性1]
- [局限性2]
Next Steps
后续行动
- [Action item]
undefined- [行动项]
undefinedKey Business Metrics
核心业务指标
Acquisition:
- Customer Acquisition Cost (CAC)
- Cost per Lead (CPL)
- Conversion Rate
Engagement:
- Daily/Monthly Active Users
- Session Duration
- Feature Adoption
Retention:
- Churn Rate
- Retention Rate
- Net Revenue Retention
Revenue:
- Monthly Recurring Revenue (MRR)
- Average Revenue Per User (ARPU)
- Lifetime Value (LTV)
获客类:
- 客户获取成本 (CAC)
- 线索获取成本 (CPL)
- 转化率
参与度类:
- 日活/月活用户数
- 会话时长
- 功能渗透率
留存类:
- 流失率
- 留存率
- 净收入留存率
收入类:
- 月度经常性收入 (MRR)
- 每用户平均收入 (ARPU)
- 用户生命周期价值 (LTV)
Data Storytelling
数据叙事
Presentation Structure
汇报结构
1. CONTEXT
- Why does this matter?
- What question are we answering?
2. KEY FINDING
- Lead with the insight
- Make it memorable
3. EVIDENCE
- Show the data
- Use effective visuals
4. IMPLICATIONS
- What does this mean?
- So what?
5. RECOMMENDATIONS
- What should we do?
- Clear next steps1. 背景
- 为什么这件事重要?
- 我们要解答什么问题?
2. 核心发现
- 先抛出核心洞察
- 让结论令人印象深刻
3. 证据支撑
- 展示数据
- 使用有效的可视化图表
4. 业务影响
- 这个发现意味着什么?
- 对业务有什么价值?
5. 行动建议
- 我们应该做什么?
- 明确的后续步骤Insight Template
洞察模板
markdown
undefinedmarkdown
undefined[Headline: Action-oriented finding]
[标题:面向行动的发现]
What: [One sentence description of the finding]
So What: [Why this matters to the business]
Now What: [Recommended action]
Evidence:
[Chart or data supporting the finding]
Confidence: [High/Medium/Low]
undefined是什么: [一句话描述发现内容]
为什么重要: [这个发现对业务的意义]
应该怎么做: [推荐的行动方案]
支撑证据:
[图表或数据]
置信度: [高/中/低]
undefinedReference Materials
参考资料
- - Advanced SQL queries
references/sql_patterns.md - - Chart selection guide
references/visualization.md - - Statistical methods
references/statistics.md - - Presentation best practices
references/storytelling.md
- - 高级SQL查询
references/sql_patterns.md - - 图表选择指南
references/visualization.md - - 统计方法
references/statistics.md - - 汇报最佳实践
references/storytelling.md
Scripts
脚本工具
bash
undefinedbash
undefinedData profiler
数据探查工具
python scripts/data_profiler.py --table orders --output profile.html
python scripts/data_profiler.py --table orders --output profile.html
SQL query analyzer
SQL查询分析工具
python scripts/query_analyzer.py --query query.sql --explain
python scripts/query_analyzer.py --query query.sql --explain
Dashboard generator
仪表盘生成工具
python scripts/dashboard_gen.py --config dashboard.yaml
python scripts/dashboard_gen.py --config dashboard.yaml
Report automation
报告自动化工具
python scripts/report_gen.py --template monthly --output report.pdf
undefinedpython scripts/report_gen.py --template monthly --output report.pdf
undefined