data-engineering-storage-remote-access-integrations-pandas

Original：🇺🇸 English

Translated

Reading and writing data with Pandas from/to cloud storage (S3, GCS, Azure) using fsspec and PyArrow filesystems.

2installs

Sourcelegout/data-platform-agent-skills

Added on2026-02-21

NPX Install

npx skill4agent add legout/data-platform-agent-skills data-engineering-storage-remote-access-integrations-pandas

SKILL.md Content

View Translation Comparison →

Pandas Integration with Remote Storage

Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.

Auto-Detection (Simplest)

Pandas automatically uses fsspec for cloud URIs:

python

import pandas as pd

# Read CSV/Parquet directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed

Note: Auto-detection uses default credentials. For explicit auth, see below.

Explicit Filesystem (More Control)

python

import fsspec
import pandas as pd

# Create fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses default credentials chain

# Open file through filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],  # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")]  # Row group filtering
)

PyArrow Filesystem Backend

For better Arrow integration and zero-copy transfers:

python

import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"]
)

# Write to cloud storage
df.to_parquet(
    "s3://bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"]  # Partitioned write
)

Partitioned Writes

Write partitioned datasets efficiently:

python

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0]
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs
)
# Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet

Authentication

Auto-detection: Uses default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)
Explicit: Pass
```
key=
```
,
```
secret=
```
to
```
fsspec.filesystem()
```
constructor

For S3-compatible (MinIO, Ceph):

python

fs = fsspec.filesystem("s3", client_kwargs={
    "endpoint_url": "http://minio.local:9000"
})

See

@data-engineering-storage-authentication

for detailed patterns.

Performance Tips

Column pruning:
```
pd.read_parquet(columns=[...])
```
only reads needed columns
Row group filtering: Use
```
filters=
```
parameter for partitioned data

Cache results: Wrap filesystem with

simplecache::

or

filecache::

python

cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
df = pd.read_parquet("simplecache::s3://bucket/data.parquet", filesystem=cached_fs)

Use Parquet, not CSV: Parquet supports pushdown, compression, and typed storage
For large datasets: Consider PySpark or Dask instead of pandas (pandas loads everything into memory)

Limitations

pandas loads entire DataFrame into memory - not suitable for datasets larger than RAM
For lazy evaluation and better performance with large files, use
```
@data-engineering-core
```
(Polars)
Multi-file reads require manual iteration (use
```
fs.glob()
```
+ list comprehension)

Alternatives

Polars (
```
@data-engineering-core
```
): Faster, memory-mapped, lazy evaluation
Dask: Parallel pandas for out-of-core computation
PySpark: Distributed processing for big data

References

pandas I/O documentation
fsspec documentation

@data-engineering-storage-remote-access/libraries/fsspec

data-engineering-storage-remote-access-integrations-pandas

NPX Install

Tags

SKILL.md Content

Pandas Integration with Remote Storage

Auto-Detection (Simplest)

Explicit Filesystem (More Control)

PyArrow Filesystem Backend

Partitioned Writes

Authentication

Performance Tips

Limitations

Alternatives

References