data-science

Data Science

Data analysis, SQL, and insights generation.

When to Use

  • Writing SQL queries
  • Data analysis and exploration
  • Creating visualizations
  • Statistical analysis
  • ETL and data pipelines

SQL Patterns

Common Queries

sql
-- Aggregation with window functions
SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) as recency_rank
FROM orders;

-- CTEs for readability
WITH monthly_stats AS (
    SELECT
        DATE_TRUNC('month', created_at) as month,
        COUNT(*) as total_orders,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue,
        (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) as growth_rate
    FROM monthly_stats
)
SELECT * FROM growth;
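
When the same data is already loaded in pandas, the window functions and CTEs above have rough equivalents. The sketch below is illustrative only: the `orders` DataFrame is a hypothetical sample, and it uses `order_date` for the monthly rollup where the SQL uses `created_at`.

```python
import pandas as pd

# Hypothetical sample standing in for the `orders` table.
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-01-10", "2024-02-15"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})
orders = orders.sort_values(["user_id", "order_date"]).reset_index(drop=True)

# SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date)
orders["running_total"] = orders.groupby("user_id")["amount"].cumsum()

# ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC)
orders["recency_rank"] = (
    orders.sort_values("order_date", ascending=False)
    .groupby("user_id")
    .cumcount()
    + 1
)

# Monthly revenue and month-over-month growth, mirroring the two CTEs.
monthly = (
    orders.groupby(orders["order_date"].dt.to_period("M"))["amount"]
    .sum()
    .rename("revenue")
    .to_frame()
)
monthly["prev_revenue"] = monthly["revenue"].shift(1)  # LAG(revenue)
monthly["growth_rate"] = monthly["revenue"].pct_change()

print(orders)
print(monthly)
```

Two differences from the SQL are worth keeping in mind: `pct_change` yields `inf` rather than NULL when the previous month's revenue is 0, and `cumsum` treats rows with identical dates one at a time, whereas the SQL default window frame gives tied rows the same running total.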

BigQuery Specifics

sql
-- Partitioned table query
SELECT *
FROM `project.dataset.events`
WHERE DATE(_PARTITIONTIME) BETWEEN '2024-01-01' AND '2024-01-31';

-- UNNEST for arrays
SELECT
    user_id,
    item
FROM `project.dataset.orders`,
UNNEST(items) as item;

-- Approximate counts for large data
SELECT APPROX_COUNT_DISTINCT(user_id) as unique_users
FROM `project.dataset.events`;

Python Analysis

python
import pandas as pd
import numpy as np

# Load and explore
df = pd.read_csv('data.csv')
df.info()
df.describe()

# Clean and transform
df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['required_field'])
df['category'] = df['category'].fillna('Unknown')

# Aggregate
summary = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'date': ['min', 'max']
}).round(2)

# Visualize
import matplotlib.pyplot as plt
df.groupby('date')['value'].sum().plot(figsize=(12, 6))
plt.title('Daily Values')
plt.savefig('chart.png', dpi=150, bbox_inches='tight')
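
Passing a dict of lists to `agg`, as in the aggregation step, produces a two-level column index such as `('value', 'mean')`, which can be awkward to export or join. One possible alternative is pandas named aggregation, sketched here on a hypothetical in-memory sample. The output column names are illustrative choices, not part of the original.

```python
import pandas as pd

# Hypothetical sample; column names follow the analysis snippet.
df = pd.DataFrame({
    "category": ["a", "a", "b", None],
    "value": [1.0, 3.0, 10.0, 4.0],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-03"]
    ),
})
df["category"] = df["category"].fillna("Unknown")

# Named aggregation: each keyword becomes a flat output column.
summary = (
    df.groupby("category")
    .agg(
        value_mean=("value", "mean"),
        value_sum=("value", "sum"),
        value_count=("value", "count"),
        first_date=("date", "min"),
        last_date=("date", "max"),
    )
    .round({"value_mean": 2})  # round only the numeric mean column
    .reset_index()
)

print(summary)
```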

Statistical Analysis

python
from scipy import stats

# Hypothesis testing
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Correlation
correlation = df['x'].corr(df['y'])

# Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
print(f"R² = {model.score(X, y):.3f}")
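
The value that `model.score` reports for a linear regression is the coefficient of determination, R² = 1 - SS_res / SS_tot. To show what that number measures, here is a minimal sketch that computes it by hand with NumPy. The data is synthetic and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Ordinary least squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# With one predictor and an intercept, R^2 equals the squared Pearson correlation.
pearson_r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R² = {r_squared:.3f}")
print(f"Pearson r squared = {pearson_r ** 2:.3f}")
```

Because this R² is computed on the same data used for fitting, it describes in-sample fit only. Scoring on held-out data is the usual way to judge predictive quality.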

Output Format

markdown
## Analysis Summary

Question: [What we're trying to answer]
Data Source: [Tables/files used]
Date Range: [Time period]

### Key Findings

1. [Finding with supporting metric]
2. [Finding with supporting metric]

### Visualization

[Chart description or embedded image]

### Recommendations

- [Actionable insight]

Examples

Input: "Analyze user retention"
Action: Query cohort data, calculate retention rates, visualize trends

Input: "Find top customers"
Action: Write SQL for RFM analysis, segment users, summarize findings
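
As one way to picture the "calculate retention rates" step in the retention example, the sketch below builds a small cohort table in pandas. The `events` DataFrame and its column names are hypothetical, and monthly cohorts keyed on first activity are only one of several reasonable definitions.

```python
import pandas as pd

# Hypothetical activity log: one row per user per active day.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event_date": pd.to_datetime([
        "2024-01-03", "2024-02-10", "2024-03-01",
        "2024-01-15", "2024-03-20",
        "2024-02-05",
    ]),
})

# Cohort = month of each user's first activity.
first_seen = events.groupby("user_id")["event_date"].transform("min")
events["cohort"] = first_seen.dt.to_period("M").astype(str)

# Whole calendar months elapsed since the cohort month.
events["months_since"] = (
    (events["event_date"].dt.year - first_seen.dt.year) * 12
    + (events["event_date"].dt.month - first_seen.dt.month)
)

# Distinct active users per cohort and month offset.
active = (
    events.groupby(["cohort", "months_since"])["user_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Divide each row by the cohort size (month 0) to get retention rates.
retention = active.div(active[0], axis=0).round(2)

print(retention)
```

Cells for months that have not happened yet show up as 0 here. In a real analysis they would normally be masked so they are not mistaken for churn.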