data-engineering-storage-remote-access-integrations-duckdb


DuckDB Remote Storage Integration


DuckDB provides multiple ways to access cloud storage (S3, GCS, Azure) from within the database.

HTTPFS Extension (Native)


The HTTPFS extension enables direct queries on remote files.
python
import duckdb
from contextlib import contextmanager

@contextmanager
def get_duckdb_connection():
    """Context manager ensures connection cleanup."""
    con = duckdb.connect()
    try:
        con.execute("INSTALL httpfs; LOAD httpfs;")
        yield con
    finally:
        con.close()

Configure and query


with get_duckdb_connection() as con:
    # S3 configuration
    con.execute("""
        SET s3_region='us-east-1';
        SET s3_access_key_id='AKIA...';
        SET s3_secret_access_key='...';
        -- For temporary creds: SET s3_session_token='...'
        -- For S3-compatible: SET s3_endpoint='http://minio:9000';
    """)

    # Query Parquet directly
    df = con.sql("""
        SELECT category, SUM(value) as total
        FROM read_parquet('s3://bucket/data/*.parquet')
        WHERE date >= '2024-01-01'
        GROUP BY category
    """).pl()

    # Read from GCS (configure via environment or default credentials)
    df = con.sql("SELECT * FROM read_csv('gs://bucket/data.csv')").pl()

Configuration via Environment Variables


Instead of hardcoding credentials, use environment variables:
python
import os
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_REGION'] = 'us-east-1'

DuckDB HTTPFS reads these automatically on first use


import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()

Via fsspec


Register fsspec filesystems for protocols DuckDB doesn't natively support:
python
import fsspec
import duckdb

Register GCS (or any fsspec protocol)


duckdb.register_filesystem(fsspec.filesystem('gcs'))

Now use gcs:// URIs natively


df = duckdb.sql("""
    SELECT * FROM read_parquet('gcs://bucket/data.parquet')
""").pl()

Copy Operations


Copy data between DuckDB tables and cloud storage:
python
import duckdb

with duckdb.connect() as con:
    # Export table to S3
    con.sql("""
        COPY (SELECT * FROM my_table)
        TO 's3://bucket/output.parquet'
        (FORMAT PARQUET)
    """)

    # Import from S3
    con.sql("""
        CREATE TABLE imported AS
        SELECT * FROM read_parquet('s3://bucket/input.parquet')
    """)

Delta Lake Integration


Read Delta tables from cloud storage:
python
import duckdb

with duckdb.connect() as con:
    con.execute("INSTALL delta; LOAD delta;")

    # Query Delta table
    df = con.sql("""
        SELECT * FROM delta_scan('s3://bucket/delta-table/')
        WHERE date >= '2024-01-01'
    """).pl()

    # Time travel (read specific version)
    df = con.sql("""
        SELECT * FROM delta_scan('s3://bucket/delta-table/', version => 5)
    """).pl()

Connection Management



✅ DO: Use context manager


with duckdb.connect("analytics.db") as con:
    con.sql("CREATE TABLE ...")

❌ DON'T: Leak connections


con = duckdb.connect("analytics.db")
con.sql("...")  # Never closed → leak

✅ DO: If you must, manually close


con = duckdb.connect("analytics.db")
try:
    con.sql("...")
finally:
    con.close()

Authentication


See @data-engineering-storage-authentication for S3, GCS, Azure patterns. DuckDB HTTPFS reads standard environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, etc.) automatically.

Performance Tips


  • Predicate pushdown: Filter in SQL query, not after loading
  • Column pruning: Only select needed columns
  • Parquet format: Use Parquet (not CSV) for remote queries
  • Partitioning: Store data partitioned (e.g., by date) for efficient queries
  • ⚠️ Row group filtering: Parquet row groups enable scanning only the needed subsets
  • ⚠️ Use EXPLAIN to verify pushdown: con.sql("EXPLAIN SELECT ...").pl()


References
