# DuckDB - The SQL Engine for Scientific Data
DuckDB brings the power of professional SQL to the Python data science stack. It is optimized for "Online Analytical Processing" (OLAP), meaning it excels at large-scale aggregations, joins, and complex queries on datasets that are larger than memory.
## When to Use

- Performing complex SQL queries (JOINs, window functions) on Pandas or Polars data.
- Querying large Parquet or CSV files directly without loading them into memory.
- Efficiently joining data from different sources (e.g., a CSV file and a Pandas DataFrame).
- Building analytical pipelines where SQL is more concise or faster than DataFrame code.
- Managing local datasets that are too big for Excel but don't need a full PostgreSQL server.
- Intermediate data storage and feature engineering for machine learning.
## Reference Documentation

- Official docs: https://duckdb.org/docs/
- Python API: https://duckdb.org/docs/api/python/overview
- Search patterns: `duckdb.sql`, `duckdb.query`, `duckdb.read_parquet`, `duckdb.from_df`

## Core Principles
### In-Process Execution

DuckDB runs inside your Python process. There is no server to start or manage. The data can be stored in a file (.db) or kept entirely in memory.
### Columnar Engine

Like Polars, DuckDB uses columnar storage and a vectorized execution engine, making it orders of magnitude faster than row-based databases (like SQLite) for analytical tasks.
### Seamless Interoperability

DuckDB can "see" your Python variables. You can run a SQL query directly against a Pandas DataFrame variable as if it were a table in the database.
## Quick Reference

### Installation

```bash
pip install duckdb
```

### Standard Imports

```python
import duckdb
import pandas as pd
import numpy as np
```

### Basic Pattern - Querying Python Data
```python
import duckdb
import pandas as pd

# 1. Create a sample DataFrame
df = pd.DataFrame({"id": [1, 2, 3], "val": [10.5, 20.0, 15.2]})

# 2. Query the DataFrame directly via SQL
# DuckDB automatically finds the 'df' variable in the local scope
result_df = duckdb.sql("SELECT id, val * 2 AS doubled FROM df WHERE val > 12").df()
print(result_df)
```

## Critical Rules
### ✅ DO

- **Query files directly** - Use `SELECT * FROM 'data.parquet'` instead of loading the file first. DuckDB will only read the required columns and rows.
- **Use the `.df()`, `.pl()`, `.arrow()` methods** - Efficiently convert query results to your preferred format (Pandas, Polars, or Arrow).
- **Use persistent storage for large data** - Use `duckdb.connect('my_data.db')` if you want your data to persist between script runs.
- **Leverage Parquet** - DuckDB is a best-in-class engine for Parquet files; use them for maximum speed.
- **Use wildcards** - Query thousands of files at once using `FROM 'data/*.parquet'`.
- **Use EXPLAIN** - Prefix your query with `EXPLAIN ANALYZE` to see how DuckDB executes the query and find bottlenecks.
### ❌ DON'T

- **Use it for high-frequency writes (OLTP)** - DuckDB is for analysis. If you need to insert rows one by one thousands of times per second, use SQLite or PostgreSQL.
- **Ignore the connection** - If you are using a file-based database, ensure you close the connection or use a context manager to avoid file locking.
- **Manually load CSVs if not needed** - Don't do `pd.read_csv()` and then query the DataFrame. Query the file path directly for better performance.
## Anti-Patterns (NEVER)

```python
import duckdb
import pandas as pd

# ❌ BAD: Loading everything into Pandas just to do a simple filter
df = pd.read_csv("massive.csv")
result = df[df['val'] > 100]

# ✅ GOOD: Let DuckDB filter while reading (saves RAM)
result = duckdb.sql("SELECT * FROM 'massive.csv' WHERE val > 100").df()

# ❌ BAD: Manual string formatting for SQL queries (SQL injection risk)
duckdb.sql(f"SELECT * FROM data WHERE name = '{user_input}'")

# ✅ GOOD: Use prepared statements or parameters
duckdb.execute("SELECT * FROM data WHERE name = ?", [user_input]).df()

# ❌ BAD: Re-reading the same file in a loop
for i in range(10):
    res = duckdb.sql("SELECT mean(val) FROM 'large.parquet'").df()

# ✅ GOOD: Create a VIEW or TABLE first
duckdb.sql("CREATE VIEW data_view AS SELECT * FROM 'large.parquet'")
for i in range(10):
    res = duckdb.sql("SELECT mean(val) FROM data_view").df()
```
## SQL Features and Operations
### Querying Different Sources

```python
# Query a CSV
res_csv = duckdb.sql("SELECT * FROM read_csv_auto('data.csv')").df()

# Query a Parquet file
res_pq = duckdb.sql("SELECT * FROM 'data.parquet' WHERE price > 50").df()

# Query multiple Parquet files and join with a Pandas DF
df_metadata = pd.DataFrame(...)
query = """
    SELECT p.*, m.category
    FROM 'raw_data/*.parquet' p
    JOIN df_metadata m ON p.id = m.id
    LIMIT 10
"""
res = duckdb.sql(query).df()
```

### Relational API (Programmatic SQL)
If you prefer a Pythonic method-chaining style over raw SQL:

```python
rel = duckdb.from_df(df)
res = rel.filter("val > 15").project("id, val * 2").order("val").limit(5)
print(res.df())
```

### Advanced SQL: Window Functions and Aggregations
DuckDB supports full modern SQL, which is often easier for complex statistics than Pandas.

```python
query = """
    SELECT
        date,
        station_id,
        temp,
        AVG(temp) OVER (PARTITION BY station_id ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS rolling_7d_avg,
        temp - LAG(temp) OVER (PARTITION BY station_id ORDER BY date) AS daily_change
    FROM 'weather_data.parquet'
"""
df_stats = duckdb.sql(query).df()
```

### Working with Persistent Databases
```python
# Create or open a database file
con = duckdb.connect('scientific_project.db')

# Create a table from a dataframe
con.execute("CREATE TABLE experiment_results AS SELECT * FROM df")

# Check tables
print(con.execute("SHOW TABLES").df())

# Close connection
con.close()
```
## Performance Optimization

### 1. External Aggregation (Disk Spilling)
If a query exceeds your RAM, DuckDB can "spill to disk" to finish the calculation.

```python
# Enable a temp directory for large queries
duckdb.sql("SET temp_directory='/tmp/duckdb_temp/'")
duckdb.sql("SET max_memory='4GB'")  # Limit RAM usage
```
### 2. Parallel Processing

DuckDB is parallel by default. You can control the number of threads.

```python
duckdb.sql("SET threads TO 8")
```

### 3. Sampling for Exploratory Analysis
Querying a sample of a massive file is near-instantaneous.

```python
# Random 10% sample
df_sample = duckdb.sql("SELECT * FROM 'huge.parquet' USING SAMPLE 10%").df()
```
## Practical Workflows

### 1. The "Big Data" Cleaning Pipeline
```python
def process_experiment_logs(glob_pattern):
    """Clean and aggregate TBs of log data across many files."""
    query = f"""
        WITH clean_data AS (
            SELECT
                timestamp::TIMESTAMP AS ts,
                sensor_id,
                value
            FROM read_csv_auto('{glob_pattern}')
            WHERE value IS NOT NULL AND value != -999
        )
        SELECT
            time_bucket(INTERVAL '1 hour', ts) AS hour,
            sensor_id,
            AVG(value) AS avg_val
        FROM clean_data
        GROUP BY 1, 2
        ORDER BY 1, 2
    """
    return duckdb.sql(query).df()
```

### 2. Fast Feature Engineering for ML
```python
def create_features(df_train):
    # Use SQL to create complex lag and moving-average features
    return duckdb.sql("""
        SELECT *,
            AVG(price) OVER (PARTITION BY item ORDER BY date ROWS 3 PRECEDING) AS ma3,
            COUNT(*) OVER (PARTITION BY user) AS user_activity_count
        FROM df_train
    """).df()
```

### 3. Interop: DuckDB to PyTorch/TensorFlow
```python
# Query data and convert to Arrow for zero-copy transfer to deep learning
arrow_table = duckdb.sql("SELECT * FROM 'data.parquet'").arrow()

# Then hand off in PyTorch (via torch.from_numpy or torch.utils.dlpack),
# or just use the fast Arrow-to-NumPy/tensor path
```
## Common Pitfalls and Solutions

### The "Variable Not Found" Error

DuckDB looks for variables in the scope where `duckdb.sql()` is called.
undefined✅ Solution: If calling inside a function, ensure the variable is local or passed
✅ Solution: If calling inside a function, ensure the variable is local or passed
def my_query(my_df):
return duckdb.sql("SELECT * FROM my_df").df()
undefineddef my_query(my_df):
return duckdb.sql("SELECT * FROM my_df").df()
undefinedLocked Database File
数据库文件锁定
If two processes try to open the same .db file for writing, the second open will fail.

```python
# ✅ Solution: open as read-only for secondary processes
con = duckdb.connect('data.db', read_only=True)
```
undefinedDate/Time Parsing
日期/时间解析
CSVs often have weird date formats.
python
undefinedCSV文件通常包含不规范的日期格式。
python
undefined✅ Solution: Use strptime or let DuckDB auto-detect
✅ Solution: Use strptime or let DuckDB auto-detect
SELECT strptime(date_col, '%d/%m/%Y') FROM 'data.csv'
SELECT strptime(date_col, '%d/%m/%Y') FROM 'data.csv'
## Best Practices
- **Query files directly** - Use `SELECT * FROM 'file.parquet'` instead of loading into memory first.
- **Use the appropriate output format** - Use `.df()` for Pandas, `.pl()` for Polars, `.arrow()` for Arrow.
- **Create views for repeated queries** - Avoid re-reading the same file multiple times.
- **Use prepared statements** - Prevent SQL injection and improve performance for repeated queries.
- **Leverage the Parquet format** - DuckDB excels at Parquet; use it for maximum performance.
- **Use wildcards for multiple files** - Query many files at once with `FROM 'data/*.parquet'`.
- **Enable disk spilling for large queries** - Set `temp_directory` and `max_memory` for out-of-memory operations.
- **Use EXPLAIN ANALYZE** - Understand query execution and identify bottlenecks.
- **Close connections properly** - Use context managers or explicitly close file-based connections.
- **Prefer SQL for complex aggregations** - Window functions and complex joins are often clearer in SQL than DataFrame code.

DuckDB is the bridge between the analytical power of SQL and the flexibility of Python. It eliminates the "data loading tax" and lets scientists focus on asking complex questions of their data at lightning speed.