data-engineering-storage-remote-access-integrations-pandas

Pandas Integration with Remote Storage


Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.

Auto-Detection (Simplest)


Pandas automatically uses fsspec for cloud URIs:

```python
import pandas as pd

# Read CSV/Parquet/JSON directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```

**Note:** Auto-detection uses default credentials. For explicit auth, see below.
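As a runnable illustration of the same routing, the sketch below swaps the S3 bucket for fsspec's built-in in-memory protocol (`memory://`, a stand-in chosen so the example needs no cloud credentials); pandas hands the URI to fsspec exactly as it would an `s3://` path.

```python
import fsspec
import pandas as pd

# Seed a small CSV in fsspec's in-memory filesystem (stand-in for a bucket)
with fsspec.open("memory://demo/data.csv", "w") as f:
    f.write("id,value\n1,100\n2,200\n")

# pandas routes any non-local URI through fsspec, so this reads like s3://
df = pd.read_csv("memory://demo/data.csv")
print(len(df))  # 2
```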

Explicit Filesystem (More Control)


```python
import fsspec
import pandas as pd

# Create an fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses the default credential chain

# Open a file through the filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],                 # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")],  # Row group filtering
)
```

PyArrow Filesystem Backend


For better Arrow integration and zero-copy transfers:

```python
import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"],
)

# Write to cloud storage
df.to_parquet(
    "bucket/output/",  # Again, no s3:// prefix
    filesystem=s3_fs,
    partition_cols=["year", "month"],  # Partitioned write
)
```

Partitioned Writes


Write partitioned datasets efficiently:

```python
import fsspec
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0]
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs,
)

# Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet
```

Authentication


  • Auto-detection: uses the default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)
  • Explicit: pass key= and secret= to the fsspec.filesystem() constructor
  • For S3-compatible stores (MinIO, Ceph):

    ```python
    fs = fsspec.filesystem("s3", client_kwargs={
        "endpoint_url": "http://minio.local:9000"
    })
    ```

See @data-engineering-storage-authentication for detailed patterns.

Performance Tips


  1. Column pruning: pd.read_parquet(columns=[...]) only reads the needed columns
  2. Row group filtering: use the filters= parameter for partitioned data
  3. Cache results: wrap the filesystem with simplecache:: or filecache::

     ```python
     cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
     df = pd.read_parquet("simplecache::s3://bucket/data.parquet", filesystem=cached_fs)
     ```
  4. Use Parquet, not CSV: Parquet supports predicate pushdown, compression, and typed storage
  5. For large datasets: consider PySpark or Dask instead of pandas (pandas loads everything into memory)

Limitations


  • pandas loads the entire DataFrame into memory, so it is not suitable for datasets larger than RAM
  • For lazy evaluation and better performance with large files, use @data-engineering-core (Polars)
  • Multi-file reads require manual iteration (use fs.glob() + a list comprehension)
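The multi-file pattern from the last bullet looks like this in practice; the sketch again uses fsspec's in-memory protocol as a stand-in bucket so it runs without credentials.

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem("memory")

# Seed three CSV shards in the stand-in bucket
for i in range(3):
    with fs.open(f"memory://shards/part-{i}.csv", "w") as f:
        f.write(f"id,value\n{i},{i * 10}\n")

# glob + concat: pandas has no native multi-file reader
paths = sorted(fs.glob("memory://shards/part-*.csv"))
df = pd.concat(
    [pd.read_csv(fs.open(p)) for p in paths],
    ignore_index=True,
)
print(len(df))  # 3
```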

Alternatives


  • Polars (@data-engineering-core): faster, memory-mapped, lazy evaluation
  • Dask: parallel pandas for out-of-core computation
  • PySpark: distributed processing for big data

References
