data-engineering-storage-remote-access-integrations-delta-lake


Delta Lake integration with cloud storage (S3, GCS, Azure). Covers storage_options, PyArrow filesystem, time travel, and partitioned writes.


NPX Install

npx skill4agent add legout/data-platform-agent-skills data-engineering-storage-remote-access-integrations-delta-lake


Delta Lake on Cloud Storage

Integrating Delta Lake tables with cloud storage (S3, GCS, Azure) using the pure-Python deltalake package.

Installation

bash
pip install deltalake pyarrow

Configuration Patterns

Method 1: storage_options (Recommended)

The simplest approach using dictionary-based configuration:
python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# S3 configuration
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIA...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "AWS_REGION": "us-east-1"
}
# Alternatively, use environment variables (preferred for production)
# os.environ['AWS_ACCESS_KEY_ID'], etc.

# Sample data to write (illustrative)
pa_table = pa.table({"date": ["2024-01-01"], "value": [1]})

# Write Delta table
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    storage_options=storage_options,
    mode="overwrite",
    partition_by=["date"]
)

# Read Delta table
dt = DeltaTable(
    "s3://bucket/delta-table",
    storage_options=storage_options
)
df = dt.to_pandas()
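As the comments above note, production jobs usually rely on environment variables or an attached IAM role rather than hardcoded keys. A minimal sketch of that pattern, assuming credentials are already present in the process environment:
python
import os
from deltalake import DeltaTable

# Credentials are resolved from the environment (or an attached IAM role);
# nothing is hardcoded and no storage_options dict is passed.
print("region:", os.environ.get("AWS_REGION", "<unset>"))

dt = DeltaTable("s3://bucket/delta-table")
df = dt.to_pandas()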
GCS configuration:
python
storage_options = {
    "GOOGLE_SERVICE_ACCOUNT_KEY_JSON": "/path/to/key.json"
    # Or use env var GOOGLE_APPLICATION_CREDENTIALS
}
Azure configuration:
python
storage_options = {
    "AZURE_STORAGE_CONNECTION_STRING": "...",
    # OR: "AZURE_STORAGE_ACCOUNT_NAME" + "AZURE_STORAGE_ACCOUNT_KEY"
}

Method 2: PyArrow Filesystem (Advanced)

Use PyArrow filesystem objects for more control:
python
import pyarrow.fs as fs
from deltalake import write_deltalake, DeltaTable

# Create a filesystem rooted at the table path
raw_fs, normalized_path = fs.FileSystem.from_uri("s3://bucket/delta-table")
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

# Write (the pyarrow-based writer accepts the filesystem alongside the table URI)
write_deltalake(
    "s3://bucket/delta-table",
    data=pa_table,
    filesystem=filesystem,
    mode="append"
)

# Read: open the table, then pass the filesystem to the read call
dt = DeltaTable("s3://bucket/delta-table")
table = dt.to_pyarrow_table(filesystem=filesystem)

Time Travel

python
from deltalake import DeltaTable

dt = DeltaTable("s3://bucket/delta-table")

# Load specific version
dt.load_version(5)
df_v5 = dt.to_pandas()

# Load by timestamp
dt.load_with_datetime("2024-01-01T12:00:00Z")
df_ts = dt.to_pandas()

# Get history (history() returns a list of commit-info dicts)
import pandas as pd

history = pd.DataFrame(dt.history())
print(history[["version", "timestamp", "operation"]])

Maintenance Operations

python
# Vacuum old files (retention in hours). dry_run defaults to True, and
# retentions below the default 168h require enforce_retention_duration=False.
dt.vacuum(retention_hours=24, dry_run=False, enforce_retention_duration=False)

# Compact small files
dt.optimize.compact()

# Get file list
files = dt.files()
print(files)  # List of Parquet files in the table

# Get metadata and schema
print(dt.metadata())
print(dt.schema())

Incremental Processing

For change data capture (CDC) patterns:
python
from deltalake import DeltaTable
from datetime import datetime

dt = DeltaTable("s3://bucket/delta-table")

# Get changes since last checkpoint
last_version = get_checkpoint()  # Your checkpoint tracking

# Inspect commits newer than the checkpoint (history() returns a list of dicts)
changes = [
    entry for entry in dt.history()
    if entry["version"] > last_version
]

# Or read full snapshot and compare
df = dt.to_pandas()
# ... compare with previous snapshot ...

# Update checkpoint
save_checkpoint(dt.version())

Best Practices

  1. Use environment variables for credentials in production (never hardcode)
  2. Partition tables by date/region for efficient querying
  3. Vacuum regularly to clean up old files, but retain enough history for your time travel needs (see the maintenance sketch after this list)
  4. Optimize periodically to compact small files
  5. Track versions for incremental processing using dt.version() and dt.history()
  6. ⚠️ Don't skip vacuum entirely - unreferenced files will bloat storage
  7. ⚠️ Don't vacuum too aggressively - you'll lose time travel capability
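
A minimal sketch of a periodic maintenance job covering practices 3 and 4; the 7-day retention and table URI are assumptions, not requirements:
python
from deltalake import DeltaTable

def maintain(table_uri, storage_options=None):
    dt = DeltaTable(table_uri, storage_options=storage_options)

    # Compact small files; the superseded files become eligible for vacuum
    # once they fall outside the retention window
    dt.optimize.compact()

    # Keep 7 days (168h) of history for time travel, delete anything older
    dt.vacuum(retention_hours=168, dry_run=False)

maintain("s3://bucket/delta-table")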

Authentication

See @data-engineering-storage-authentication for detailed cloud auth patterns.
For S3:
  • Environment: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
  • IAM roles (EC2, ECS, Lambda) are picked up automatically when no explicit credentials are set
  • For S3-compatible stores (MinIO): set AWS_ENDPOINT_URL in the environment or in storage_options (see the sketch below)
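
A minimal sketch of S3-compatible (MinIO) access through storage_options; the endpoint, credentials, and AWS_ALLOW_HTTP flag below are placeholder assumptions for a local, plain-HTTP MinIO instance:
python
from deltalake import DeltaTable

storage_options = {
    "AWS_ENDPOINT_URL": "http://localhost:9000",  # MinIO endpoint (placeholder)
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_REGION": "us-east-1",
    "AWS_ALLOW_HTTP": "true",  # required for non-TLS endpoints
}

dt = DeltaTable("s3://bucket/delta-table", storage_options=storage_options)
print(dt.version())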

Related

  • @data-engineering-storage-lakehouse/delta-lake - Delta Lake concepts and API
  • @data-engineering-core - Using Delta with DuckDB
  • @data-engineering-storage-lakehouse - Comparisons with Iceberg, Hudi
