data-engineering-storage-remote-access-integrations-pandas
# Pandas Integration with Remote Storage

Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.
## Auto-Detection (Simplest)
Pandas automatically uses fsspec for cloud URIs:

```python
import pandas as pd

# Read CSV/Parquet directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```

**Note:** Auto-detection uses default credentials. For explicit auth, see below.
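This routing can be exercised locally without credentials or a network: `memory://` is a built-in fsspec protocol, used below purely as a stand-in for a real `s3://` path (the bucket name and file are illustrative).

```python
import pandas as pd

# "memory://" is fsspec's built-in in-memory filesystem; it routes through
# the same fsspec machinery as s3:// or gs://, but needs no credentials.
df = pd.DataFrame({"id": [1, 2], "value": [10.5, 20.5]})
df.to_csv("memory://bucket/data.csv", index=False)

# Reading back goes through fsspec auto-detection exactly as with a cloud URI
out = pd.read_csv("memory://bucket/data.csv")
```

Swapping `memory://` for `s3://` is the only change needed once credentials are in place.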
## Explicit Filesystem (More Control)
```python
import fsspec
import pandas as pd

# Create an fsspec filesystem with configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses default credentials chain

# Open a file through the filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],                # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")]  # Row group filtering
)
```
## PyArrow Filesystem Backend
For better Arrow integration and zero-copy transfers:

```python
import pandas as pd
import pyarrow.fs as fs

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
df = pd.read_parquet(
    "bucket/data.parquet",  # Note: no s3:// prefix when using an explicit filesystem
    filesystem=s3_fs,
    columns=["id", "name", "value"]
)

# Write to cloud storage (again, no s3:// prefix)
df.to_parquet(
    "bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"]  # Partitioned write
)
```
## Partitioned Writes
Write partitioned datasets efficiently:

```python
import fsspec
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0]
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs
)
```

Output structure: `s3://bucket/output/year=2024/month=1/part-0.parquet`
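The hive-style `year=/month=` layout is the same whether the target is a local directory or object storage, so the write can be rehearsed locally (a sketch; the temp directory stands in for the bucket):

```python
import tempfile
import pandas as pd

out_dir = tempfile.mkdtemp()
df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0],
})
# Produces out_dir/year=2024/month=1/... etc., exactly as on s3://
df.to_parquet(out_dir, partition_cols=["year", "month"])

# Reading the directory back restores the partition columns from the paths
back = pd.read_parquet(out_dir)
</imports>```

Note that partition columns round-trip through the directory names, so they come back (typically as categoricals) even though they are not stored inside the data files.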
## Authentication
- Auto-detection: Uses the default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)
- Explicit: Pass `key=` and `secret=` to the `fsspec.filesystem()` constructor
- For S3-compatible storage (MinIO, Ceph), set the endpoint:

```python
fs = fsspec.filesystem("s3", client_kwargs={"endpoint_url": "http://minio.local:9000"})
```

See @data-engineering-storage-authentication for detailed patterns.
## Performance Tips
- Column pruning: `pd.read_parquet(columns=[...])` only reads the needed columns
- Row group filtering: Use the `filters=` parameter for partitioned data
- Cache results: Wrap the filesystem with `simplecache::` or `filecache::`:

```python
cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
df = pd.read_parquet("simplecache::s3://bucket/data.parquet", filesystem=cached_fs)
```

- Use Parquet, not CSV: Parquet supports pushdown, compression, and typed storage
- For large datasets: Consider PySpark or Dask instead of pandas (pandas loads everything into memory)
## Limitations
- pandas loads the entire DataFrame into memory, so it is not suitable for datasets larger than RAM
- For lazy evaluation and better performance with large files, use Polars (@data-engineering-core)
- Multi-file reads require manual iteration (use `fs.glob()` + a list comprehension)
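The manual multi-file pattern looks like the sketch below, again using the in-memory filesystem so it runs offline (swap `"memory"` for `"s3"` plus credentials against a real bucket; file names are illustrative):

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem("memory")

# Seed three part files, standing in for objects in a bucket
for i in range(3):
    with fs.open(f"bucket/part-{i}.csv", "w") as f:
        f.write("id,value\n1,10\n")

# fs.glob() finds the parts; iterating stitches them into one DataFrame
paths = sorted(fs.glob("bucket/part-*.csv"))
frames = []
for p in paths:
    with fs.open(p) as f:
        frames.append(pd.read_csv(f))
combined = pd.concat(frames, ignore_index=True)
```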
## Alternatives
- Polars (@data-engineering-core): Faster, memory-mapped, lazy evaluation
- Dask: Parallel pandas for out-of-core computation
- PySpark: Distributed processing for big data
## References
- pandas I/O documentation
- fsspec documentation (@data-engineering-storage-remote-access/libraries/fsspec)