data-science

Data Science

Data analysis, SQL, and insights generation.

When to Use

Writing SQL queries
Data analysis and exploration
Creating visualizations
Statistical analysis
ETL and data pipelines

SQL Patterns

Common Queries

sql

-- Aggregation with window functions
SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) as recency_rank
FROM orders;

-- CTEs for readability
WITH monthly_stats AS (
    SELECT
        DATE_TRUNC('month', created_at) as month,
        COUNT(*) as total_orders,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue,
        (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) as growth_rate
    FROM monthly_stats
)
SELECT * FROM growth;
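
To see what the window-function query above returns, it can be run against a small in-memory SQLite table. This is only a sketch: the sample rows and the `run_window_query` helper are made up for illustration, and it assumes the SQLite build bundled with Python supports window functions (SQLite 3.25 or later). The `DATE_TRUNC` query is left out because SQLite does not provide that function.

```python
import sqlite3

# Hypothetical sample rows: (user_id, order_date, amount).
SAMPLE_ORDERS = [
    (1, "2024-01-05", 20.0),
    (1, "2024-01-20", 30.0),
    (1, "2024-02-02", 50.0),
    (2, "2024-01-10", 15.0),
    (2, "2024-03-01", 25.0),
]

# Same window functions as the query above, with an ORDER BY added
# so the result order is deterministic.
WINDOW_QUERY = """
SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) AS recency_rank
FROM orders
ORDER BY user_id, order_date;
"""


def run_window_query(rows):
    """Load rows into an in-memory SQLite table and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (user_id INTEGER, order_date TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(WINDOW_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_window_query(SAMPLE_ORDERS):
        print(row)
```

Each user's `running_total` accumulates in date order, while `recency_rank` is 1 for that user's most recent order, which makes `WHERE recency_rank = 1` a convenient filter for "latest order per user".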

BigQuery Specifics

sql

-- Partitioned table query
SELECT *
FROM `project.dataset.events`
WHERE DATE(_PARTITIONTIME) BETWEEN '2024-01-01' AND '2024-01-31';

-- UNNEST for arrays
SELECT
    user_id,
    item
FROM `project.dataset.orders`,
UNNEST(items) as item;

-- Approximate counts for large data
SELECT APPROX_COUNT_DISTINCT(user_id) as unique_users
FROM `project.dataset.events`;
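
The `UNNEST` query above can be read as a flattening step: the comma join pairs every order row with each element of its own `items` array, so a row whose array is empty contributes no output rows. A rough Python equivalent follows, with made-up rows and a hypothetical `unnest_items` helper; it mirrors the semantics only and does not call BigQuery.

```python
# Hypothetical rows shaped like `project.dataset.orders`: one row per order,
# with `items` holding an array of item names.
ORDERS = [
    {"user_id": "u1", "items": ["apple", "pear"]},
    {"user_id": "u2", "items": []},  # empty array: produces no output rows
    {"user_id": "u3", "items": ["melon"]},
]


def unnest_items(rows):
    """Return one (user_id, item) pair per array element.

    Mirrors `FROM orders, UNNEST(items) AS item`, which behaves like a
    cross join between each row and the elements of its own array.
    """
    return [(row["user_id"], item) for row in rows for item in row["items"]]


if __name__ == "__main__":
    for pair in unnest_items(ORDERS):
        print(pair)
```

If users with empty arrays must stay in the result, `LEFT JOIN UNNEST(items) AS item` keeps them with a NULL `item`.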

Python Analysis

python

import pandas as pd
import numpy as np

# Load and explore
df = pd.read_csv('data.csv')
df.info()
df.describe()

# Clean and transform
df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['required_field'])
df['category'] = df['category'].fillna('Unknown')

# Aggregate
summary = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'date': ['min', 'max']
}).round(2)

# Visualize
import matplotlib.pyplot as plt
df.groupby('date')['value'].sum().plot(figsize=(12, 6))
plt.title('Daily Values')
plt.savefig('chart.png', dpi=150, bbox_inches='tight')

Statistical Analysis

python

from scipy import stats

# Hypothesis testing
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Correlation
correlation = df['x'].corr(df['y'])

# Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
print(f"R² = {model.score(X, y):.3f}")
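
To make the correlation and R² numbers above less of a black box, here is a dependency-free sketch of what they measure for a single predictor. The helper names and sample values are illustrative only, and this is not how pandas or scikit-learn implement the calculations internally. It computes the Pearson coefficient (the default method of `Series.corr`), fits a least-squares line, and evaluates R² as `1 - SS_res / SS_tot`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical sample data, roughly y = 2x.
    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    slope, intercept = fit_line(xs, ys)
    print(f"r = {pearson_r(xs, ys):.3f}")
    print(f"R² = {r_squared(xs, ys, slope, intercept):.3f}")
```

With one predictor and an intercept, R² equals the square of the Pearson coefficient, which is a quick sanity check when the two numbers come from different tools.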

Output Format

markdown

Analysis Summary

Question: [What we're trying to answer]
Data Source: [Tables/files used]
Date Range: [Time period]

Key Findings

[Finding with supporting metric]
[Finding with supporting metric]

Visualization

[Chart description or embedded image]

Recommendations

[Actionable insight]

Examples

Input: "Analyze user retention"
Action: Query cohort data, calculate retention rates, visualize trends

Input: "Find top customers"
Action: Write SQL for RFM analysis, segment users, summarize findings
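
For the retention example, the "calculate retention rates" step could look roughly like the sketch below. It assumes the cohort query has already produced one `(user_id, month)` pair per active user-month. The sample data and the `cohort_retention` helper are hypothetical, and a cohort is defined here as the first month in which a user appears.

```python
from collections import defaultdict

# Hypothetical activity log: (user_id, month of activity as "YYYY-MM").
ACTIVITY = [
    ("u1", "2024-01"), ("u1", "2024-02"), ("u1", "2024-03"),
    ("u2", "2024-01"), ("u2", "2024-03"),
    ("u3", "2024-02"), ("u3", "2024-03"),
]


def month_index(month):
    """Convert "YYYY-MM" to a running month count so months can be subtracted."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon)


def cohort_retention(activity):
    """Return {cohort_month: {months_since_first: retention_rate}}."""
    # A user's cohort is the first month in which they appear.
    first_month = {}
    for user, month in activity:
        if user not in first_month or month < first_month[user]:
            first_month[user] = month

    # Collect the distinct users active at each offset from their cohort month.
    active = defaultdict(lambda: defaultdict(set))
    for user, month in activity:
        cohort = first_month[user]
        offset = month_index(month) - month_index(cohort)
        active[cohort][offset].add(user)

    # Retention = users active at an offset / users in the cohort (offset 0).
    return {
        cohort: {
            offset: len(users) / len(by_offset[0])
            for offset, users in sorted(by_offset.items())
        }
        for cohort, by_offset in sorted(active.items())
    }


if __name__ == "__main__":
    for cohort, rates in cohort_retention(ACTIVITY).items():
        print(cohort, rates)
```

The nested dictionary maps directly onto a cohort-by-offset table, which is the usual input for a retention heatmap.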
