
DuckDB - The SQL Engine for Scientific Data


DuckDB brings the power of professional SQL to the Python data science stack. It is optimized for "Online Analytical Processing" (OLAP), meaning it excels at large-scale aggregations, joins, and complex queries on datasets that are larger than memory.

When to Use


  • Performing complex SQL queries (JOINs, Window functions) on Pandas or Polars data.
  • Querying large Parquet or CSV files directly without loading them into memory.
  • Efficiently joining data from different sources (e.g., a CSV file and a Pandas DataFrame).
  • Building analytical pipelines where SQL is more concise or faster than DataFrame code.
  • Managing local datasets that are too big for Excel but don't need a full PostgreSQL server.
  • Intermediate data storage and feature engineering for Machine Learning.

Reference Documentation


Official docs: https://duckdb.org/docs/
Python API: https://duckdb.org/docs/api/python/overview
Search patterns: `duckdb.sql`, `duckdb.query`, `duckdb.read_parquet`, `duckdb.from_df`

Core Principles


In-Process Execution


DuckDB runs inside your Python process. There is no server to start or manage. The data can be stored in a file (.db) or kept entirely in memory.

Columnar Engine


Like Polars, DuckDB uses a columnar storage and vectorized execution engine, making it orders of magnitude faster than row-based databases (like SQLite) for analytical tasks.

Seamless Interoperability


DuckDB can "see" your Python variables. You can run a SQL query directly against a Pandas DataFrame variable as if it were a table in the database.

Quick Reference


Installation


```bash
pip install duckdb
```

Standard Imports


```python
import duckdb
import pandas as pd
import numpy as np
```

Basic Pattern - Querying Python Data


```python
import duckdb
import pandas as pd

# 1. Create a sample DataFrame
df = pd.DataFrame({"id": [1, 2, 3], "val": [10.5, 20.0, 15.2]})

# 2. Query the DataFrame directly via SQL
# DuckDB automatically finds the 'df' variable in the local scope
result_df = duckdb.sql("SELECT id, val * 2 AS doubled FROM df WHERE val > 12").df()
print(result_df)
```

Critical Rules


✅ DO


  • Query Files Directly - Use `SELECT * FROM 'data.parquet'` instead of loading the file first. DuckDB will only read the required columns and rows.
  • Use the .df(), .pl(), .arrow() methods - Efficiently convert query results to your preferred format (Pandas, Polars, or Arrow).
  • Use Persistent Storage for Large Data - Use `duckdb.connect('my_data.db')` if you want your data to persist between script runs.
  • Leverage Parquet - DuckDB is a best-in-class engine for Parquet files; use them for maximum speed.
  • Use Wildcards - Query thousands of files at once using `FROM 'data/*.parquet'`.
  • Use EXPLAIN - Prefix your query with `EXPLAIN ANALYZE` to see how DuckDB executes the query and find bottlenecks.

❌ DON'T


  • Use for High-Frequency Writes (OLTP) - DuckDB is for analysis. If you need to insert rows one by one thousands of times per second, use SQLite or PostgreSQL.
  • Ignore the Connection - If you are using a file-based database, ensure you close the connection or use a context manager to avoid file locking.
  • Manually Load CSVs if not needed - Don't run `pd.read_csv()` and then query the result. Query the file path directly for better performance.

Anti-Patterns (NEVER)


```python
import duckdb
import pandas as pd

# ❌ BAD: Loading everything into Pandas just to do a simple filter
df = pd.read_csv("massive.csv")
result = df[df['val'] > 100]

# ✅ GOOD: Let DuckDB filter while reading (saves RAM)
result = duckdb.sql("SELECT * FROM 'massive.csv' WHERE val > 100").df()

# ❌ BAD: Manual string formatting for SQL queries (SQL injection risk)
duckdb.sql(f"SELECT * FROM data WHERE name = '{user_input}'")

# ✅ GOOD: Use prepared statements or parameters
duckdb.execute("SELECT * FROM data WHERE name = ?", [user_input]).df()

# ❌ BAD: Re-reading the same file in a loop
for i in range(10):
    res = duckdb.sql("SELECT mean(val) FROM 'large.parquet'").df()

# ✅ GOOD: Create a VIEW or TABLE first
duckdb.sql("CREATE VIEW data_view AS SELECT * FROM 'large.parquet'")
for i in range(10):
    res = duckdb.sql("SELECT mean(val) FROM data_view").df()
```

SQL Features and Operations


Querying Different Sources


```python
# Query a CSV
res_csv = duckdb.sql("SELECT * FROM read_csv_auto('data.csv')").df()

# Query a Parquet file
res_pq = duckdb.sql("SELECT * FROM 'data.parquet' WHERE price > 50").df()

# Query multiple Parquet files and join with a Pandas DF
df_metadata = pd.DataFrame(...)
query = """
    SELECT p.*, m.category
    FROM 'raw_data/*.parquet' p
    JOIN df_metadata m ON p.id = m.id
    LIMIT 10
"""
res = duckdb.sql(query).df()
```

Relational API (Programmatic SQL)


If you prefer a Pythonic method-chaining style over raw SQL:

```python
rel = duckdb.from_df(df)
res = rel.filter("val > 15").project("id, val * 2").order("val").limit(5)
print(res.df())
```

Advanced SQL: Window Functions and Aggregations


DuckDB supports full modern SQL, which is often easier for complex statistics than Pandas.

```python
query = """
    SELECT
        date,
        station_id,
        temp,
        AVG(temp) OVER (PARTITION BY station_id ORDER BY date ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS rolling_7d_avg,
        temp - LAG(temp) OVER (PARTITION BY station_id ORDER BY date) AS daily_change
    FROM 'weather_data.parquet'
"""
df_stats = duckdb.sql(query).df()
```

Working with Persistent Databases


```python
# Create or open a database file
con = duckdb.connect('scientific_project.db')

# Create a table from a dataframe
con.execute("CREATE TABLE experiment_results AS SELECT * FROM df")

# Check tables
print(con.execute("SHOW TABLES").df())

# Close the connection
con.close()
```

Performance Optimization


1. External Aggregation (Disk Spilling)


If a query exceeds your available RAM, DuckDB can "spill to disk" to finish the calculation.

```python
# Enable a temp directory for large queries
duckdb.sql("SET temp_directory='/tmp/duckdb_temp/'")
duckdb.sql("SET max_memory='4GB'")  # Limit RAM usage
```

2. Parallel Processing


DuckDB is parallel by default. You can control the number of threads.

```python
duckdb.sql("SET threads TO 8")
```

3. Sampling for Exploratory Analysis


Querying a sample of a massive file is nearly instantaneous.

```python
# Random 10% sample
df_sample = duckdb.sql("SELECT * FROM 'huge.parquet' USING SAMPLE 10%").df()
```

Practical Workflows


1. The "Big Data" Cleaning Pipeline


```python
def process_experiment_logs(glob_pattern):
    """Clean and aggregate TBs of log data across many files."""
    query = f"""
    WITH clean_data AS (
        SELECT
            timestamp::TIMESTAMP as ts,
            sensor_id,
            value
        FROM read_csv_auto('{glob_pattern}')
        WHERE value IS NOT NULL AND value != -999
    )
    SELECT
        time_bucket(INTERVAL '1 hour', ts) as hour,
        sensor_id,
        AVG(value) as avg_val
    FROM clean_data
    GROUP BY 1, 2
    ORDER BY 1, 2
    """
    return duckdb.sql(query).df()
```

2. Fast Feature Engineering for ML


```python
def create_features(df_train):
    # Use SQL to create complex lag and moving-average features
    return duckdb.sql("""
        SELECT *,
               AVG(price) OVER (PARTITION BY item ORDER BY date ROWS 3 PRECEDING) as ma3,
               COUNT(*) OVER (PARTITION BY user) as user_activity_count
        FROM df_train
    """).df()
```

3. Interop: DuckDB to PyTorch/TensorFlow


```python
# Query data and convert to Arrow for zero-copy transfer to Deep Learning
arrow_table = duckdb.sql("SELECT * FROM 'data.parquet'").arrow()

# Then in PyTorch (requires torch.utils.dlpack or similar),
# or just use the fast arrow-to-numpy/tensor path
```

Common Pitfalls and Solutions


The "Variable Not Found" Error


DuckDB looks for variables in the scope where `duckdb.sql()` is called.

```python
# ✅ Solution: If calling inside a function, ensure the variable is local or passed in
def my_query(my_df):
    return duckdb.sql("SELECT * FROM my_df").df()
```

Locked Database File


If two processes try to open the same .db file for writing, the second open will fail.

```python
# ✅ Solution: Open as read-only for secondary processes
con = duckdb.connect('data.db', read_only=True)
```

Date/Time Parsing


CSVs often have inconsistent date formats.

```sql
-- ✅ Solution: Use strptime or let DuckDB auto-detect
SELECT strptime(date_col, '%d/%m/%Y') FROM 'data.csv'
```

Best Practices


  1. Query files directly - Use `SELECT * FROM 'file.parquet'` instead of loading into memory first.
  2. Use the appropriate output format - `.df()` for Pandas, `.pl()` for Polars, `.arrow()` for Arrow.
  3. Create views for repeated queries - Avoid re-reading the same file multiple times.
  4. Use prepared statements - Prevent SQL injection and improve performance for repeated queries.
  5. Leverage the Parquet format - DuckDB excels at Parquet; use it for maximum performance.
  6. Use wildcards for multiple files - Query many files at once with `FROM 'data/*.parquet'`.
  7. Enable disk spilling for large queries - Set `temp_directory` and `max_memory` for out-of-memory operations.
  8. Use EXPLAIN ANALYZE - Understand query execution and identify bottlenecks.
  9. Close connections properly - Use context managers or explicitly close file-based connections.
  10. Prefer SQL for complex aggregations - Window functions and complex joins are often clearer in SQL than DataFrame code.
DuckDB is the bridge between the analytical power of SQL and the flexibility of Python. It eliminates the "data loading tax" and allows scientists to focus on asking complex questions of their data at lightning speeds.