DuckDB

Overview

DuckDB is a high-performance, in-process analytical database management system (often called "SQLite for analytics"). Execute complex SQL queries directly on CSV, Parquet, JSON files, and Python DataFrames (pandas, Polars) without importing data or running a separate database server.

When to Use This Skill


Activate when the user:
  • Wants to run SQL queries on data files (CSV, Parquet, JSON)
  • Needs to perform complex analytical queries (aggregations, joins, window functions)
  • Asks to query pandas or Polars DataFrames using SQL
  • Wants to explore or analyze data without loading it into memory
  • Needs fast analytical performance on medium to large datasets
  • Mentions DuckDB explicitly or wants OLAP-style analytics

Installation

安装步骤

Check if DuckDB is installed:

```bash
python3 -c "import duckdb; print(duckdb.__version__)"
```

If not installed:

```bash
pip3 install duckdb
```

For Polars integration:

```bash
pip3 install duckdb 'polars[pyarrow]'
```

Core Capabilities


1. Querying Data Files Directly


DuckDB can query files without loading them into memory:

```python
import duckdb

# Query a CSV file
result = duckdb.sql("SELECT * FROM 'data.csv' WHERE age > 25")
print(result.df())  # Convert to pandas DataFrame

# Query a Parquet file
result = duckdb.sql("""
    SELECT category, SUM(amount) as total
    FROM 'sales.parquet'
    GROUP BY category
    ORDER BY total DESC
""")

# Query a JSON file
result = duckdb.sql("SELECT * FROM 'users.json' LIMIT 10")

# Query multiple files with wildcards
result = duckdb.sql("SELECT * FROM 'data/*.parquet'")
```

2. Working with Pandas DataFrames


DuckDB can directly query pandas DataFrames:

```python
import duckdb
import pandas as pd

# Create or load a DataFrame
df = pd.read_csv('data.csv')

# Query the DataFrame using SQL
result = duckdb.sql("""
    SELECT category, AVG(price) as avg_price, COUNT(*) as count
    FROM df
    WHERE price > 100
    GROUP BY category
    HAVING count > 5
""")

# Convert the result to a pandas DataFrame
result_df = result.df()
print(result_df)
```

3. Working with Polars DataFrames


DuckDB integrates seamlessly with Polars using Apache Arrow:

```python
import duckdb
import polars as pl

# Create or load a Polars DataFrame
df = pl.read_csv('data.csv')

# Query the Polars DataFrame with DuckDB
result = duckdb.sql("""
    SELECT date_trunc('month', date) as month,
           SUM(revenue) as monthly_revenue
    FROM df
    GROUP BY month
    ORDER BY month
""")

# Convert the result to a Polars DataFrame
result_df = result.pl()

# For lazy evaluation, use lazy=True
lazy_result = result.pl(lazy=True)
```

4. Creating Persistent Databases


Create database files for persistent storage:

```python
import duckdb

# Connect to a persistent database (creates the file if it doesn't exist)
con = duckdb.connect('my_database.duckdb')

# Create a table and insert data
con.execute("""
    CREATE TABLE users AS
    SELECT * FROM 'users.csv'
""")

# Query the database
result = con.execute("SELECT * FROM users WHERE age > 30").fetchdf()

# Close the connection
con.close()

5. Complex Analytical Queries


DuckDB excels at analytical queries:

```python
import duckdb

# Window functions
result = duckdb.sql("""
    SELECT name, department, salary,
           AVG(salary) OVER (PARTITION BY department) as dept_avg,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
    FROM 'employees.csv'
""")

# CTEs and subqueries
result = duckdb.sql("""
    WITH monthly_sales AS (
        SELECT date_trunc('month', sale_date) as month,
               product_id,
               SUM(amount) as total_sales
        FROM 'sales.parquet'
        GROUP BY month, product_id
    )
    SELECT m.month,
           p.product_name,
           m.total_sales,
           LAG(m.total_sales) OVER (
               PARTITION BY m.product_id ORDER BY m.month
           ) as prev_month_sales
    FROM monthly_sales m
    JOIN 'products.csv' p ON m.product_id = p.id
    ORDER BY m.month DESC, m.total_sales DESC
""")
```

6. Joins Across Different Data Sources


Join data from multiple files and DataFrames:

```python
import duckdb
import pandas as pd

# Load a DataFrame
customers_df = pd.read_csv('customers.csv')

# Join the DataFrame with a Parquet file
result = duckdb.sql("""
    SELECT c.customer_name, c.email, o.order_date, o.total_amount
    FROM customers_df c
    JOIN 'orders.parquet' o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    ORDER BY o.order_date DESC
""")
```

Common Patterns


Pattern 1: Quick Data Exploration


```python
import duckdb

# Get the table schema
duckdb.sql("DESCRIBE SELECT * FROM 'data.parquet'").show()

# Quick statistics
duckdb.sql("""
    SELECT COUNT(*) as rows,
           COUNT(DISTINCT user_id) as unique_users,
           MIN(created_at) as earliest_date,
           MAX(created_at) as latest_date
    FROM 'data.csv'
""").show()

# Sample data
duckdb.sql("SELECT * FROM 'large_file.parquet' USING SAMPLE 1000").show()
```

Pattern 2: Data Transformation Pipeline


```python
import duckdb

# ETL pipeline using DuckDB
con = duckdb.connect('analytics.duckdb')

# Extract and transform
con.execute("""
    CREATE TABLE clean_sales AS
    SELECT date_trunc('day', timestamp) as sale_date,
           UPPER(TRIM(product_name)) as product_name,
           quantity,
           price,
           quantity * price as total_amount,
           CASE WHEN quantity > 10 THEN 'bulk' ELSE 'retail' END as sale_type
    FROM 'raw_sales.csv'
    WHERE price > 0 AND quantity > 0
""")

# Create an aggregated view
con.execute("""
    CREATE VIEW daily_summary AS
    SELECT sale_date, sale_type,
           COUNT(*) as num_sales,
           SUM(total_amount) as revenue
    FROM clean_sales
    GROUP BY sale_date, sale_type
""")

result = con.execute("SELECT * FROM daily_summary ORDER BY sale_date DESC").fetchdf()
con.close()
```

Pattern 3: Combining DuckDB + Polars for Optimal Performance


```python
import duckdb
import polars as pl

# Read multiple Parquet files with Polars
df = pl.read_parquet('data/*.parquet')

# Use DuckDB for complex SQL analytics
result = duckdb.sql("""
    SELECT customer_segment,
           product_category,
           COUNT(DISTINCT customer_id) as customers,
           SUM(revenue) as total_revenue,
           AVG(revenue) as avg_revenue,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue
    FROM df
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY customer_segment, product_category
    HAVING total_revenue > 10000
    ORDER BY total_revenue DESC
""").pl()  # Return as a Polars DataFrame

# Continue processing with Polars
final_result = result.with_columns([
    (pl.col('total_revenue') / pl.col('customers')).alias('revenue_per_customer')
])
```

Pattern 4: Export Query Results


```python
import duckdb

# Export to CSV
duckdb.sql("""
    COPY (
        SELECT * FROM 'input.parquet' WHERE status = 'active'
    ) TO 'output.csv' (HEADER, DELIMITER ',')
""")

# Export to Parquet
duckdb.sql("""
    COPY (
        SELECT date, category, SUM(amount) as total
        FROM 'sales.csv'
        GROUP BY date, category
    ) TO 'summary.parquet' (FORMAT PARQUET)
""")

# Export to JSON
duckdb.sql("""
    COPY (SELECT * FROM users WHERE age > 21)
    TO 'filtered_users.json' (FORMAT JSON)
""")
```

Performance Tips


  1. Use Parquet for large datasets: Parquet is columnar and compressed, ideal for analytical queries
  2. Filter early: Push filters down to file reads when possible
  3. Partition large datasets: Store large datasets as Hive-partitioned Parquet directories; DuckDB detects the partitioning and skips partitions excluded by your filters
  4. Use projections: Only select columns you need
  5. Leverage indexes: For persistent databases, create indexes on frequently queried columns
```python
import duckdb

# Good: filter and project early
duckdb.sql("SELECT name, age FROM 'users.parquet' WHERE age > 25")

# Less efficient: select everything, then filter in pandas
duckdb.sql("SELECT * FROM 'users.parquet'").df()[lambda x: x['age'] > 25]
```

Integration with Polars


DuckDB and Polars work together seamlessly via Apache Arrow:

```python
import duckdb
import polars as pl

# Polars for data loading and transformation
df = (
    pl.scan_parquet('data/*.parquet')
    .filter(pl.col('date') >= '2024-01-01')
    .collect()
)

# DuckDB for complex SQL analytics
result = duckdb.sql("""
    SELECT user_id,
           COUNT(*) as sessions,
           SUM(duration) as total_duration,
           AVG(duration) as avg_duration,
           MAX(duration) as max_duration
    FROM df
    GROUP BY user_id
    HAVING sessions > 5
""").pl()

# Back to Polars for final processing
top_users = result.top_k(10, by='total_duration')
```

See the `polars` skill for more Polars-specific operations and the references/integration.md file for detailed integration examples.

Error Handling


Common issues and solutions:

```python
import duckdb

try:
    result = duckdb.sql("SELECT * FROM 'data.csv'")
except duckdb.Error as e:
    print(f"DuckDB error: {e}")
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Unexpected error: {e}")
```

Resources
