data-engineering

Data Engineering & Analytics Skill

Quick Start - SQL Data Pipeline

sql
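-- Rebuild the staging table with events since the start of yesterday,
-- limited to the event types used for reporting.
-- Dropping first lets the script be re-run without a "table exists" error.
DROP TABLE IF EXISTS staging_events;
CREATE TABLE staging_events AS
SELECT
  event_id,
  user_id,
  event_type,
  event_time,
  properties
FROM raw_events
WHERE event_time >= CURRENT_DATE - INTERVAL '1 day'
  AND event_type IN ('click', 'purchase', 'view');

-- Aggregate daily metrics per user
SELECT
  DATE(event_time) AS event_date,  -- avoid using the keyword "date" as an alias
  user_id,
  COUNT(*) AS event_count,
  COUNT(DISTINCT event_type) AS unique_events
FROM staging_events
GROUP BY 1, 2
ORDER BY event_date DESC, event_count DESC;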
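
To try the quick-start pipeline without a warehouse, the sketch below runs an adapted version of it against an in-memory SQLite database from Python. It is only an illustration under stated assumptions: the `raw_events` schema, the sample rows, and the helper names `make_demo_db` and `run_pipeline` are hypothetical. SQLite has no `INTERVAL` syntax, so the cutoff date is computed in Python and passed as a bound parameter, and the staging table is filled with `INSERT ... SELECT` rather than `CREATE TABLE ... AS`.

```python
import sqlite3
from datetime import date, timedelta


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory raw_events table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_events ("
        "event_id INTEGER, user_id INTEGER, event_type TEXT, "
        "event_time TEXT, properties TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?, ?, ?, ?)",
        [
            (1, 10, "click", "2024-01-03 09:00:00", "{}"),
            (2, 10, "purchase", "2024-01-03 09:05:00", "{}"),
            (3, 11, "view", "2024-01-02 12:00:00", "{}"),
            (4, 11, "scroll", "2024-01-03 08:00:00", "{}"),  # filtered: event type
            (5, 12, "click", "2023-12-31 23:00:00", "{}"),   # filtered: too old
            (6, 10, "click", "2024-01-03 10:00:00", "{}"),
        ],
    )
    return conn


def run_pipeline(conn: sqlite3.Connection, today: date) -> list:
    """Rebuild staging_events, then return daily per-user metrics."""
    # Equivalent of CURRENT_DATE - INTERVAL '1 day': the start of yesterday.
    cutoff = (today - timedelta(days=1)).isoformat()
    conn.execute("DROP TABLE IF EXISTS staging_events")
    conn.execute(
        "CREATE TABLE staging_events ("
        "event_id INTEGER, user_id INTEGER, event_type TEXT, "
        "event_time TEXT, properties TEXT)"
    )
    conn.execute(
        """
        INSERT INTO staging_events
        SELECT event_id, user_id, event_type, event_time, properties
        FROM raw_events
        WHERE event_time >= ?
          AND event_type IN ('click', 'purchase', 'view')
        """,
        (cutoff,),
    )
    return conn.execute(
        """
        SELECT
          DATE(event_time) AS event_date,
          user_id,
          COUNT(*) AS event_count,
          COUNT(DISTINCT event_type) AS unique_events
        FROM staging_events
        GROUP BY event_date, user_id
        ORDER BY event_date DESC, event_count DESC
        """
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline(make_demo_db(), date(2024, 1, 3)):
        print(row)
```

Passing `today` in explicitly, instead of reading the clock inside the query, keeps the run repeatable, which also makes the pipeline easier to test.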

Core Technologies

Data Processing

  • Apache Spark
  • Apache Flink
  • Pandas / Polars
  • dbt (data transformation)

Data Warehousing

  • Snowflake
  • BigQuery (GCP)
  • Redshift (AWS)
  • Azure Synapse

ETL/ELT Tools

  • dbt
  • Airflow
  • Talend
  • Informatica

Streaming

  • Apache Kafka
  • AWS Kinesis
  • Apache Pulsar

ML & Analytics

  • scikit-learn
  • TensorFlow
  • Tableau / Power BI

Best Practices

  1. Data Quality - Validate data at each pipeline stage
  2. Documentation - Keep metadata clear and up to date
  3. Performance - Optimize queries
  4. Governance - Secure data and control access
  5. Monitoring - Alert on pipeline failures
  6. Scalability - Design for growth
  7. Version Control - Keep code and configs in Git
  8. Testing - Test both the data and the pipeline code
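
As one way to make the data quality and testing items above concrete, here is a minimal validation sketch in plain Python. Everything in it is an assumption for illustration: the function name `validate_events`, the required fields, and the allowed event types (borrowed from the quick-start filter) are hypothetical, and a real pipeline might rely on dbt tests or a dedicated validation library instead.

```python
from collections import Counter

# Assumed rules, mirroring the columns and event types in the quick-start SQL.
ALLOWED_EVENT_TYPES = {"click", "purchase", "view"}
REQUIRED_FIELDS = ("event_id", "user_id", "event_type", "event_time")


def validate_events(rows: list) -> list:
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        # Required fields must be present and non-null.
        for field in REQUIRED_FIELDS:
            if row.get(field) is None:
                problems.append(f"row {i}: missing {field}")
        # Event types must come from the known set.
        event_type = row.get("event_type")
        if event_type is not None and event_type not in ALLOWED_EVENT_TYPES:
            problems.append(f"row {i}: unexpected event_type {event_type!r}")
    # event_id is expected to be unique within a batch.
    counts = Counter(
        row["event_id"] for row in rows if row.get("event_id") is not None
    )
    for event_id, n in counts.items():
        if n > 1:
            problems.append(f"event_id {event_id} appears {n} times")
    return problems


if __name__ == "__main__":
    batch = [
        {"event_id": 1, "user_id": 10, "event_type": "click",
         "event_time": "2024-01-03 09:00:00"},
        {"event_id": 1, "user_id": None, "event_type": "scroll",
         "event_time": "2024-01-03 09:01:00"},
    ]
    for problem in validate_events(batch):
        print(problem)
```

Returning the full list of problems, rather than raising on the first one, lets a monitoring step decide whether to alert, quarantine the batch, or fail the run.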

Resources
