Data Science
Data analysis, SQL, and insights generation.
When to Use
- Writing SQL queries
- Data analysis and exploration
- Creating visualizations
- Statistical analysis
- ETL and data pipelines
SQL Patterns
Common Queries
```sql
-- Aggregation with window functions
SELECT
  user_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) as running_total,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) as recency_rank
FROM orders;

-- CTEs for readability
WITH monthly_stats AS (
  SELECT
    DATE_TRUNC('month', created_at) as month,
    COUNT(*) as total_orders,
    SUM(amount) as revenue
  FROM orders
  GROUP BY 1
),
growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) as prev_revenue,
    (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) as growth_rate
  FROM monthly_stats
)
SELECT * FROM growth;
```
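When the same calculation has to be checked or repeated in pandas, the window functions above map fairly directly onto groupby operations. The sketch below is only an illustration: the `orders` frame and its values are made up, and it uses `order_date` for the monthly grouping where the SQL uses `created_at`. It also assumes no two orders of one user share a date, since SQL's default window frame treats ties differently from `cumsum`.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the SQL above.
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-01-10", "2024-02-15"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})
orders = orders.sort_values(["user_id", "order_date"]).reset_index(drop=True)

# SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date)
orders["running_total"] = orders.groupby("user_id")["amount"].cumsum()

# ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC):
# number the rows of each user after sorting newest first.
orders["recency_rank"] = (
    orders.sort_values("order_date", ascending=False)
    .groupby("user_id")
    .cumcount()
    + 1
)

# Monthly revenue, then month-over-month growth (the two CTEs).
monthly = (
    orders.groupby(orders["order_date"].dt.to_period("M"))["amount"]
    .sum()
    .rename("revenue")
    .to_frame()
)
monthly["prev_revenue"] = monthly["revenue"].shift(1)  # LAG(revenue)
# Replacing 0 with NaN plays the role of NULLIF(..., 0).
safe_prev = monthly["prev_revenue"].replace(0, float("nan"))
monthly["growth_rate"] = (monthly["revenue"] - monthly["prev_revenue"]) / safe_prev

print(orders)
print(monthly)
```

The first month has no previous value, so its `growth_rate` is NaN, which matches the NULL the SQL version returns.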
BigQuery Specifics
```sql
-- Partitioned table query
-- (_PARTITIONTIME is available on ingestion-time partitioned tables)
SELECT *
FROM `project.dataset.events`
WHERE DATE(_PARTITIONTIME) BETWEEN '2024-01-01' AND '2024-01-31';

-- UNNEST for arrays
SELECT
  user_id,
  item
FROM `project.dataset.orders`,
  UNNEST(items) as item;

-- Approximate counts for large data
SELECT APPROX_COUNT_DISTINCT(user_id) as unique_users
FROM `project.dataset.events`;
```
Python Analysis
```python
import pandas as pd
import numpy as np

# Load and explore
df = pd.read_csv('data.csv')
df.info()
df.describe()

# Clean and transform
df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['required_field'])
df['category'] = df['category'].fillna('Unknown')

# Aggregate
summary = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'date': ['min', 'max']
}).round(2)

# Visualize
import matplotlib.pyplot as plt
df.groupby('date')['value'].sum().plot(figsize=(12, 6))
plt.title('Daily Values')
plt.savefig('chart.png', dpi=150, bbox_inches='tight')
```
Statistical Analysis
```python
from scipy import stats

# Hypothesis testing
# group_a and group_b are placeholders for two independent samples (array-like)
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Correlation (Pearson by default)
correlation = df['x'].corr(df['y'])

# Regression
# X is a placeholder for a 2-D feature matrix, y for the target values
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
print(f"R² = {model.score(X, y):.3f}")
```
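The snippet above leaves `group_a`, `group_b`, `X`, and `y` undefined. The following is a minimal runnable sketch that fills them with synthetic data, purely for illustration; the distributions, sample sizes, and seed are assumptions, not part of the original. It also passes `equal_var=False` to `ttest_ind`, which switches to Welch's t-test and is often a safer default when the two groups may have different variances.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Synthetic samples, for illustration only.
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=11.0, scale=2.0, size=200)

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Synthetic linear relationship: y = 3x + 2 plus noise.
df = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
df["y"] = 3.0 * df["x"] + 2.0 + rng.normal(scale=1.0, size=200)

correlation = df["x"].corr(df["y"])  # Pearson by default

X = df[["x"]]  # scikit-learn expects a 2-D feature matrix
y = df["y"]
model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"r = {correlation:.3f}, R² = {r_squared:.3f}, slope = {model.coef_[0]:.2f}")
```

With a single feature and an intercept, R² equals the square of the Pearson correlation, which gives a quick consistency check between the two numbers.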
Output Format
Analysis Summary
Question: [What we're trying to answer]
Data Source: [Tables/files used]
Date Range: [Time period]
Key Findings
- [Finding with supporting metric]
- [Finding with supporting metric]
Visualization
[Chart description or embedded image]
Recommendations
- [Actionable insight]
Examples
Input: "Analyze user retention"
Action: Query cohort data, calculate retention rates, visualize trends
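One way the retention steps could look in pandas is sketched below. The `events` frame is hypothetical (one row per user per active day), and monthly cohorts based on first activity are an assumption; a real analysis would start from the queried cohort data.

```python
import pandas as pd

# Hypothetical activity log: one row per user per active day.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 4],
    "event_date": pd.to_datetime([
        "2024-01-03", "2024-02-10", "2024-03-05",
        "2024-01-15", "2024-02-20",
        "2024-02-01", "2024-03-12",
        "2024-02-25",
    ]),
})

# Cohort = month of each user's first recorded activity.
first_seen = events.groupby("user_id")["event_date"].transform("min")
events["cohort"] = first_seen.dt.strftime("%Y-%m")

# Whole calendar months between the cohort month and the activity month.
events["period"] = (
    (events["event_date"].dt.year - first_seen.dt.year) * 12
    + (events["event_date"].dt.month - first_seen.dt.month)
)

# Distinct active users per cohort and period, one column per period.
active = (
    events.groupby(["cohort", "period"])["user_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Divide each row by its period-0 size to get retention rates.
retention = active.div(active[0], axis=0)
print(retention.round(2))
```

Each row of `retention` is a cohort and each column a number of months since signup, so the table can be passed to a heatmap or line plot for the trend view.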
Input: "Find top customers"
Action: Write SQL for RFM analysis, segment users, summarize findings
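The RFM step can also be prototyped in pandas before it is written as SQL. The sketch below rests on several assumptions: the `orders` data is invented, `snapshot` is an arbitrary analysis date, scores are simple percentile ranks averaged with equal weight, and the segment thresholds (0.4 and 0.8) are placeholders to tune.

```python
import pandas as pd

# Hypothetical orders; in practice these rows would come from the SQL query.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3, 4],
    "order_date": pd.to_datetime([
        "2024-01-10", "2024-03-20",
        "2023-11-05",
        "2024-02-01", "2024-03-01", "2024-03-28",
        "2024-03-15",
    ]),
    "amount": [50.0, 70.0, 30.0, 200.0, 150.0, 250.0, 40.0],
})
snapshot = pd.Timestamp("2024-04-01")  # assumed analysis date

# Recency (days since last order), Frequency (order count), Monetary (total spend).
rfm = orders.groupby("user_id").agg(
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Percentile-rank each dimension so that a higher score is always better.
rfm["r_score"] = rfm["recency_days"].rank(ascending=False, pct=True)
rfm["f_score"] = rfm["frequency"].rank(pct=True)
rfm["m_score"] = rfm["monetary"].rank(pct=True)
rfm["rfm_score"] = rfm[["r_score", "f_score", "m_score"]].mean(axis=1)

# Illustrative segments; the cut points are arbitrary.
rfm["segment"] = pd.cut(
    rfm["rfm_score"], bins=[0, 0.4, 0.8, 1.0], labels=["low", "mid", "high"]
)

top = rfm.sort_values("rfm_score", ascending=False)
print(top[["recency_days", "frequency", "monetary", "rfm_score", "segment"]])
```

Percentile ranks are used here instead of `pd.qcut` because quantile bins can fail on small or heavily tied data; with more customers, quintile scores from 1 to 5 are a common alternative.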