# DuckDB
## Overview
DuckDB is a high-performance, in-process analytical database management system (often called "SQLite for analytics"). Execute complex SQL queries directly on CSV, Parquet, JSON files, and Python DataFrames (pandas, Polars) without importing data or running a separate database server.
## When to Use This Skill
Activate when the user:
- Wants to run SQL queries on data files (CSV, Parquet, JSON)
- Needs to perform complex analytical queries (aggregations, joins, window functions)
- Asks to query pandas or Polars DataFrames using SQL
- Wants to explore or analyze data without loading it into memory
- Needs fast analytical performance on medium to large datasets
- Mentions DuckDB explicitly or wants OLAP-style analytics
## Installation
Check if DuckDB is installed:

```bash
python3 -c "import duckdb; print(duckdb.__version__)"
```

If not installed:

```bash
pip3 install duckdb
```

For Polars integration:

```bash
pip3 install duckdb 'polars[pyarrow]'
```

## Core Capabilities
### 1. Querying Data Files Directly
DuckDB can query files without loading them into memory:

```python
import duckdb

# Query a CSV file
result = duckdb.sql("SELECT * FROM 'data.csv' WHERE age > 25")
print(result.df())  # Convert to pandas DataFrame

# Query a Parquet file
result = duckdb.sql("""
    SELECT category, SUM(amount) as total
    FROM 'sales.parquet'
    GROUP BY category
    ORDER BY total DESC
""")

# Query a JSON file
result = duckdb.sql("SELECT * FROM 'users.json' LIMIT 10")

# Query multiple files with wildcards
result = duckdb.sql("SELECT * FROM 'data/*.parquet'")
```
### 2. Working with Pandas DataFrames
DuckDB can directly query pandas DataFrames:

```python
import duckdb
import pandas as pd

# Create or load a DataFrame
df = pd.read_csv('data.csv')

# Query the DataFrame using SQL
result = duckdb.sql("""
    SELECT
        category,
        AVG(price) as avg_price,
        COUNT(*) as count
    FROM df
    WHERE price > 100
    GROUP BY category
    HAVING count > 5
""")

# Convert the result to a pandas DataFrame
result_df = result.df()
print(result_df)
```
### 3. Working with Polars DataFrames
DuckDB integrates seamlessly with Polars using Apache Arrow:

```python
import duckdb
import polars as pl

# Create or load a Polars DataFrame
df = pl.read_csv('data.csv')

# Query the Polars DataFrame with DuckDB
result = duckdb.sql("""
    SELECT
        date_trunc('month', date) as month,
        SUM(revenue) as monthly_revenue
    FROM df
    GROUP BY month
    ORDER BY month
""")

# Convert the result to a Polars DataFrame
result_df = result.pl()

# For lazy evaluation, use lazy=True
lazy_result = result.pl(lazy=True)
```
### 4. Creating Persistent Databases
Create database files for persistent storage:

```python
import duckdb

# Connect to a persistent database (creates the file if it doesn't exist)
con = duckdb.connect('my_database.duckdb')

# Create a table and insert data
con.execute("""
    CREATE TABLE users AS
    SELECT * FROM 'users.csv'
""")

# Query the database
result = con.execute("SELECT * FROM users WHERE age > 30").fetchdf()

# Close the connection
con.close()
```
### 5. Complex Analytical Queries
DuckDB excels at analytical queries:

```python
import duckdb

# Window functions
result = duckdb.sql("""
    SELECT
        name,
        department,
        salary,
        AVG(salary) OVER (PARTITION BY department) as dept_avg,
        RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
    FROM 'employees.csv'
""")

# CTEs and subqueries
result = duckdb.sql("""
    WITH monthly_sales AS (
        SELECT
            date_trunc('month', sale_date) as month,
            product_id,
            SUM(amount) as total_sales
        FROM 'sales.parquet'
        GROUP BY month, product_id
    )
    SELECT
        m.month,
        p.product_name,
        m.total_sales,
        LAG(m.total_sales) OVER (
            PARTITION BY m.product_id
            ORDER BY m.month
        ) as prev_month_sales
    FROM monthly_sales m
    JOIN 'products.csv' p ON m.product_id = p.id
    ORDER BY m.month DESC, m.total_sales DESC
""")
```
### 6. Joins Across Different Data Sources
Join data from multiple files and DataFrames:

```python
import duckdb
import pandas as pd

# Load a DataFrame
customers_df = pd.read_csv('customers.csv')

# Join the DataFrame with a Parquet file
result = duckdb.sql("""
    SELECT
        c.customer_name,
        c.email,
        o.order_date,
        o.total_amount
    FROM customers_df c
    JOIN 'orders.parquet' o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    ORDER BY o.order_date DESC
""")
```
## Common Patterns
### Pattern 1: Quick Data Exploration
```python
import duckdb

# Get the table schema
duckdb.sql("DESCRIBE SELECT * FROM 'data.parquet'").show()

# Quick statistics
duckdb.sql("""
    SELECT
        COUNT(*) as rows,
        COUNT(DISTINCT user_id) as unique_users,
        MIN(created_at) as earliest_date,
        MAX(created_at) as latest_date
    FROM 'data.csv'
""").show()

# Sample data
duckdb.sql("SELECT * FROM 'large_file.parquet' USING SAMPLE 1000").show()
```
### Pattern 2: Data Transformation Pipeline
```python
import duckdb

# ETL pipeline using DuckDB
con = duckdb.connect('analytics.duckdb')

# Extract and transform
con.execute("""
    CREATE TABLE clean_sales AS
    SELECT
        date_trunc('day', timestamp) as sale_date,
        UPPER(TRIM(product_name)) as product_name,
        quantity,
        price,
        quantity * price as total_amount,
        CASE
            WHEN quantity > 10 THEN 'bulk'
            ELSE 'retail'
        END as sale_type
    FROM 'raw_sales.csv'
    WHERE price > 0 AND quantity > 0
""")

# Create an aggregated view
con.execute("""
    CREATE VIEW daily_summary AS
    SELECT
        sale_date,
        sale_type,
        COUNT(*) as num_sales,
        SUM(total_amount) as revenue
    FROM clean_sales
    GROUP BY sale_date, sale_type
""")

result = con.execute("SELECT * FROM daily_summary ORDER BY sale_date DESC").fetchdf()
con.close()
```
### Pattern 3: Combining DuckDB + Polars for Optimal Performance
```python
import duckdb
import polars as pl

# Read multiple Parquet files with Polars
df = pl.read_parquet('data/*.parquet')

# Use DuckDB for complex SQL analytics
result = duckdb.sql("""
    SELECT
        customer_segment,
        product_category,
        COUNT(DISTINCT customer_id) as customers,
        SUM(revenue) as total_revenue,
        AVG(revenue) as avg_revenue,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) as median_revenue
    FROM df
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY customer_segment, product_category
    HAVING total_revenue > 10000
    ORDER BY total_revenue DESC
""").pl()  # Return as a Polars DataFrame

# Continue processing with Polars
final_result = result.with_columns([
    (pl.col('total_revenue') / pl.col('customers')).alias('revenue_per_customer')
])
```
### Pattern 4: Export Query Results
```python
import duckdb

# Export to CSV
duckdb.sql("""
    COPY (
        SELECT * FROM 'input.parquet' WHERE status = 'active'
    ) TO 'output.csv' (HEADER, DELIMITER ',')
""")

# Export to Parquet
duckdb.sql("""
    COPY (
        SELECT date, category, SUM(amount) as total
        FROM 'sales.csv'
        GROUP BY date, category
    ) TO 'summary.parquet' (FORMAT PARQUET)
""")

# Export to JSON
duckdb.sql("""
    COPY (SELECT * FROM users WHERE age > 21)
    TO 'filtered_users.json' (FORMAT JSON)
""")
```
## Performance Tips
- Use Parquet for large datasets: Parquet is columnar and compressed, ideal for analytical queries
- Filter early: push filters down to the file read whenever possible
- Partition large files: use DuckDB's automatic partitioning for large datasets
- Use projections: select only the columns you need
- Leverage indexes: for persistent databases, create indexes on frequently queried columns

```python
import duckdb

# Good: filter and project early
duckdb.sql("SELECT name, age FROM 'users.parquet' WHERE age > 25")

# Less efficient: select everything, then filter in pandas
duckdb.sql("SELECT * FROM 'users.parquet'").df()[lambda x: x['age'] > 25]
```
## Integration with Polars
DuckDB and Polars work together seamlessly via Apache Arrow:

```python
import duckdb
import polars as pl

# Polars for data loading and transformation
df = (
    pl.scan_parquet('data/*.parquet')
    .filter(pl.col('date') >= '2024-01-01')
    .collect()
)

# DuckDB for complex SQL analytics
result = duckdb.sql("""
    SELECT
        user_id,
        COUNT(*) as sessions,
        SUM(duration) as total_duration,
        AVG(duration) as avg_duration,
        MAX(duration) as max_duration
    FROM df
    GROUP BY user_id
    HAVING sessions > 5
""").pl()

# Back to Polars for final processing
top_users = result.top_k(10, by='total_duration')
```

See the `polars` skill for more Polars-specific operations and the references/integration.md file for detailed integration examples.
## Error Handling
Common issues and solutions:

```python
import duckdb

try:
    result = duckdb.sql("SELECT * FROM 'data.csv'")
except duckdb.Error as e:
    print(f"DuckDB error: {e}")
except FileNotFoundError:
    print("File not found")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Resources
- references/integration.md: Detailed examples of DuckDB + Polars integration patterns
- Official docs: https://duckdb.org/docs/
- Python API: https://duckdb.org/docs/api/python/overview