data-engineering-storage-remote-access-integrations-delta-lake


Delta Lake integration with cloud storage (S3, GCS, Azure). Covers storage_options, PyArrow filesystem, time travel, and partitioned writes.


NPX Install

npx skill4agent add legout/data-platform-agent-skills data-engineering-storage-remote-access-integrations-delta-lake


Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the pure-Python deltalake package.

Installation

bash
pip install deltalake pyarrow

Configuration Patterns

Method 1: storage_options (Recommended)

The simplest approach using dictionary-based configuration:
python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1"
}
# Alternatively, use environment variables (preferred for production)
# os.environ['AWS_ACCESS_KEY_ID'], etc.

# Sample data to write (illustrative)
pa_table = pa.table({"date": ["2024-01-01"], "value": [1]})

# Write Delta table
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"]
)

# Read Delta table
dt = DeltaTable(
    "s3://bucket/delta-table",
    storage_options=storage_options
)
df = dt.to_pandas()
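As the comments above note, production jobs usually rely on environment variables or an attached IAM role rather than hardcoded keys. A minimal sketch of that pattern, assuming credentials are already present in the process environment:
python
import os
from deltalake import DeltaTable

# Credentials are resolved from the environment (or an attached IAM role);
# nothing is hardcoded and no storage_options dict is passed.
print("region:", os.environ.get("AWS_REGION", "<unset>"))

dt = DeltaTable("s3://bucket/delta-table")
df = dt.to_pandas()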
GCS configuration:
python
storage_options = {
    "GOOGLE_SERVICE_ACCOUNT_KEY_JSON": "/path/to/key.json"
    # Or use env var GOOGLE_APPLICATION_CREDENTIALS
}
Azure configuration:
python
storage_options = {
    "AZURE_STORAGE_CONNECTION_STRING": "...",
    # OR: "AZURE_STORAGE_ACCOUNT_NAME" + "AZURE_STORAGE_ACCOUNT_KEY"
}

Method 2: PyArrow Filesystem (Advanced)

Use PyArrow filesystem objects for more control:
python
import pyarrow.fs as fs
from deltalake import write_deltalake, DeltaTable

# Create a filesystem rooted at the table path
raw_fs, normalized_path = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

# Write (the pyarrow-based writer accepts the filesystem alongside the table URI)
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    filesystem=filesystem,
    mode="append"
)

# Read: open the table, then pass the filesystem to the read call
dt = DeltaTable("s3://bucket/delta-table")
table = dt.to_pyarrow_table(filesystem=filesystem)

Time Travel

python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Load specific version
dt.load_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp
dt.load_with_datetime("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history (history() returns a list of commit-info dicts)
import pandas as pd

history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])

Maintenance Operations

python
# Vacuum old files (retention in hours). dry_run defaults to True, and
# retentions below the default 168h require enforce_retention_duration=False.
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Compact small files
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # List of Parquet files in the table

# Get metadata and schema
print(dt.metadata())
print(dt.schema())

Incremental Processing

For change data capture (CDC) patterns:
python
from deltalake import DeltaTable
from datetime import datetime

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since last checkpoint
last_version = get_checkpoint()  # Your checkpoint tracking

# Inspect commits newer than the checkpoint (history() returns a list of dicts)
changes = [
    entry for entry in dt.history()
    if entry["version"] > last_version
]

# Or read full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())

Best Practices

  1. Use environment variables for credentials in production (never hardcode)
  2. Partition tables by date/region for efficient querying
  3. Vacuum regularly to clean up old files, but retain enough history for your time travel needs (see the maintenance sketch after this list)
  4. Optimize periodically to compact small files
  5. Track versions for incremental processing using dt.version() and dt.history()
  6. ⚠️ Don't skip vacuum entirely - unreferenced files will bloat storage
  7. ⚠️ Don't vacuum too aggressively - you'll lose time travel capability
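
A minimal sketch of a periodic maintenance job covering practices 3 and 4; the 7-day retention and table URI are assumptions, not requirements:
python
from deltalake import DeltaTable

def maintain(table_uri, storage_options=None):
    dt = DeltaTable(table_uri, storage_options=storage_options)

    # Compact small files; the superseded files become eligible for vacuum
    # once they fall outside the retention window
    dt.optimize.compact()

    # Keep 7 days (168h) of history for time travel, delete anything older
    dt.vacuum(retention_hours=168, dry_run=False)

maintain("s3://bucket/delta-table")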

Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.
For S3:
  • Environment: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
  • IAM roles (EC2, ECS, Lambda) are picked up automatically when no explicit credentials are set
  • For S3-compatible stores (MinIO): set AWS_ENDPOINT_URL in the environment or in storage_options (see the sketch below)
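
A minimal sketch of S3-compatible (MinIO) access through storage_options; the endpoint, credentials, and AWS_ALLOW_HTTP flag below are placeholder assumptions for a local, plain-HTTP MinIO instance:
python
from deltalake import DeltaTable

storage_options = {
    "AWS_ENDPOINT_URL": "http://localhost:9000",  # MinIO endpoint (placeholder)
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_REGION": "us-east-1",
    "AWS_ALLOW_HTTP": "true",  # required for non-TLS endpoints
}

dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
print(dt.version())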

Related

  • @data-engineering-storage-lakehouse/delta-lake - Delta Lake concepts and API
  • @data-engineering-core - Using Delta with DuckDB
  • @data-engineering-storage-lakehouse - Comparisons with Iceberg, Hudi
