data-engineering-storage-remote-access-integrations-pandas

Pandas Integration with Remote Storage


Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.

Auto-Detection (Simplest)


Pandas automatically uses fsspec for cloud URIs:

```python
import pandas as pd

# Read CSV/Parquet/JSON directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```

**Note:** Auto-detection uses default credentials. For explicit auth, see below.
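As a runnable illustration of the same routing, the sketch below swaps the S3 bucket for fsspec's built-in in-memory protocol (`memory://`, a stand-in chosen so the example needs no cloud credentials); pandas hands the URI to fsspec exactly as it would an `s3://` path.

```python
import fsspec
import pandas as pd

# Seed a small CSV in fsspec's in-memory filesystem (stand-in for a bucket)
with fsspec.open("memory://demo/data.csv", "w") as f:
    f.write("id,value\n1,100\n2,200\n")

# pandas routes any non-local URI through fsspec, so this reads like s3://
df = pd.read_csv("memory://demo/data.csv")
print(len(df))  # 2
```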

Explicit Filesystem (More Control)


```python
import fsspec
import pandas as pd

# Create an fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses the default credential chain

# Open a file through the filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],                 # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")],  # Row group filtering
)
```

PyArrow Filesystem Backend


For better Arrow integration and zero-copy transfers:

```python
import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"],
)

# Write to cloud storage
df.to_parquet(
    "bucket/output/",  # Again, no s3:// prefix
    filesystem=s3_fs,
    partition_cols=["year", "month"],  # Partitioned write
)
```

Partitioned Writes


Write partitioned datasets efficiently:

```python
import fsspec
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0]
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs,
)

# Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet
```

Authentication


  • Auto-detection: uses the default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)
  • Explicit: pass key= and secret= to the fsspec.filesystem() constructor
  • For S3-compatible stores (MinIO, Ceph):

    ```python
    fs = fsspec.filesystem("s3", client_kwargs={
        "endpoint_url": "http://minio.local:9000"
    })
    ```

See @data-engineering-storage-authentication for detailed patterns.

Performance Tips


  1. Column pruning: pd.read_parquet(columns=[...]) only reads the needed columns
  2. Row group filtering: use the filters= parameter for partitioned data
  3. Cache results: wrap the filesystem with simplecache:: or filecache::

     ```python
     cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
     df = pd.read_parquet("simplecache::s3://bucket/data.parquet", filesystem=cached_fs)
     ```
  4. Use Parquet, not CSV: Parquet supports predicate pushdown, compression, and typed storage
  5. For large datasets: consider PySpark or Dask instead of pandas (pandas loads everything into memory)

Limitations


  • pandas loads the entire DataFrame into memory, so it is not suitable for datasets larger than RAM
  • For lazy evaluation and better performance with large files, use @data-engineering-core (Polars)
  • Multi-file reads require manual iteration (use fs.glob() + a list comprehension)
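The multi-file pattern from the last bullet looks like this in practice; the sketch again uses fsspec's in-memory protocol as a stand-in bucket so it runs without credentials.

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem("memory")

# Seed three CSV shards in the stand-in bucket
for i in range(3):
    with fs.open(f"memory://shards/part-{i}.csv", "w") as f:
        f.write(f"id,value\n{i},{i * 10}\n")

# glob + concat: pandas has no native multi-file reader
paths = sorted(fs.glob("memory://shards/part-*.csv"))
df = pd.concat(
    [pd.read_csv(fs.open(p)) for p in paths],
    ignore_index=True,
)
print(len(df))  # 3
```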

Alternatives


  • Polars (@data-engineering-core): faster, memory-mapped, lazy evaluation
  • Dask: parallel pandas for out-of-core computation
  • PySpark: distributed processing for big data

References
