data-engineering-storage-remote-access-integrations-duckdb
DuckDB Remote Storage Integration
DuckDB provides multiple ways to access cloud storage (S3, GCS, Azure) from within the database.
HTTPFS Extension (Native)
The HTTPFS extension enables direct queries on remote files.

```python
import duckdb
from contextlib import contextmanager

@contextmanager
def get_duckdb_connection():
    """Context manager ensures connection cleanup."""
    con = duckdb.connect()
    try:
        con.execute("INSTALL httpfs; LOAD httpfs;")
        yield con
    finally:
        con.close()
```

Configure and query:

```python
with get_duckdb_connection() as con:
    # S3 configuration
    con.execute("""
        SET s3_region='us-east-1';
        SET s3_access_key_id='AKIA...';
        SET s3_secret_access_key='...';
        -- For temporary creds: SET s3_session_token='...'
        -- For S3-compatible stores (note: no scheme in the endpoint):
        --   SET s3_endpoint='minio:9000'; SET s3_use_ssl=false;
    """)

    # Query Parquet directly
    df = con.sql("""
        SELECT category, SUM(value) AS total
        FROM read_parquet('s3://bucket/data/*.parquet')
        WHERE date >= '2024-01-01'
        GROUP BY category
    """).pl()

    # Read from GCS (configure via environment or default credentials)
    df = con.sql("SELECT * FROM read_csv('gs://bucket/data.csv')").pl()
```
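On recent DuckDB versions (0.10+), the Secrets Manager is an alternative to session-level `SET` commands; secrets are named and can be scoped to specific URL prefixes. A minimal sketch, assuming placeholder credentials and bucket names:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Named credentials instead of global SET commands
# (placeholder key values; requires DuckDB >= 0.10)
con.execute("""
    CREATE SECRET my_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    );
""")

df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()
con.close()
```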
Configuration via Environment Variables
Instead of hardcoding credentials, use environment variables:

```python
import os

os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_REGION'] = 'us-east-1'

# DuckDB HTTPFS reads these automatically on first use
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()
```
Via fsspec
Register fsspec filesystems for protocols DuckDB doesn't natively support:

```python
import fsspec
import duckdb

# Register GCS (or any fsspec protocol)
duckdb.register_filesystem(fsspec.filesystem('gcs'))

# Now use gcs:// URIs natively
df = duckdb.sql("""
    SELECT * FROM read_parquet('gcs://bucket/data.parquet')
""").pl()
```
Copy Operations
Copy data between DuckDB tables and cloud storage:

```python
import duckdb

with duckdb.connect() as con:
    # Export table to S3
    con.sql("""
        COPY (SELECT * FROM my_table)
        TO 's3://bucket/output.parquet'
        (FORMAT PARQUET)
    """)

    # Import from S3
    con.sql("""
        CREATE TABLE imported AS
        SELECT * FROM read_parquet('s3://bucket/input.parquet')
    """)
```
Delta Lake Integration
Read Delta tables from cloud storage:

```python
import duckdb

with duckdb.connect() as con:
    con.execute("INSTALL delta; LOAD delta;")

    # Query Delta table
    df = con.sql("""
        SELECT * FROM delta_scan('s3://bucket/delta-table/')
        WHERE date >= '2024-01-01'
    """).pl()

    # Time travel (read a specific version)
    df = con.sql("""
        SELECT * FROM delta_scan('s3://bucket/delta-table/', version => 5)
    """).pl()
```
Connection Management
```python
import duckdb

# ✅ DO: Use a context manager
with duckdb.connect("analytics.db") as con:
    con.sql("CREATE TABLE ...")

# ❌ DON'T: Leak connections
con = duckdb.connect("analytics.db")
con.sql("...")  # Never closed → leak

# ✅ DO: If you must manage the connection manually, close it in a finally block
con = duckdb.connect("analytics.db")
try:
    con.sql("...")
finally:
    con.close()
```
Authentication
See @data-engineering-storage-authentication for S3, GCS, and Azure patterns. DuckDB HTTPFS reads standard environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, etc.) automatically.
Performance Tips
- ✅ Predicate pushdown: Filter in SQL query, not after loading
- ✅ Column pruning: Only select needed columns
- ✅ Parquet format: Use Parquet (not CSV) for remote queries
- ✅ Partitioning: Store data partitioned (e.g., by date) for efficient queries
- ⚠️ Row group filtering: Parquet row groups enable scanning subsets
- ⚠️ Use `EXPLAIN` to verify pushdown: `con.sql("EXPLAIN SELECT ...").pl()`
References
- DuckDB HTTPFS Documentation
- DuckDB Delta Lake Extension
- @data-engineering-core - DuckDB basics