Zarr Python

Overview
Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.
Quick Start
Installation
```bash
uv pip install zarr
```

Requires Python 3.11+. For cloud storage support, install additional packages:

```bash
uv pip install s3fs   # For S3
uv pip install gcsfs  # For Google Cloud Storage
```

Basic Array Creation
```python
import zarr
import numpy as np

# Create a 2D array with chunking and compression
z = zarr.create_array(
    store="data/my_array.zarr",
    shape=(10000, 10000),
    chunks=(1000, 1000),
    dtype="f4"
)

# Write data using NumPy-style indexing
z[:, :] = np.random.random((10000, 10000))

# Read data
data = z[0:100, 0:100]  # Returns a NumPy array
```

Core Operations
Creating Arrays
Zarr provides multiple convenience functions for array creation:

```python
# Create an empty array
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype='f4',
               store='data.zarr')

# Create filled arrays
z = zarr.ones((5000, 5000), chunks=(500, 500))
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100))

# Create from existing data
data = np.arange(10000).reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store='data.zarr')

# Create like another array
z2 = zarr.zeros_like(z)  # Matches the shape, chunks, and dtype of z
```
Opening Existing Arrays

```python
# Open an array in read/write mode
z = zarr.open_array('data.zarr', mode='r+')

# Read-only mode
z = zarr.open_array('data.zarr', mode='r')

# The open() function auto-detects arrays vs. groups
z = zarr.open('data.zarr')  # Returns an Array or a Group
```
Reading and Writing Data

Zarr arrays support NumPy-like indexing:

```python
# Write the entire array
z[:] = 42

# Write slices
z[0, :] = np.arange(100)
z[10:20, 50:60] = np.random.random((10, 10))

# Read data (returns a NumPy array)
data = z[0:100, 0:100]
row = z[5, :]

# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]]  # Coordinate indexing
z.oindex[0:10, [5, 10, 15]]       # Orthogonal indexing
z.blocks[0, 0]                    # Block/chunk indexing
```
Resizing and Appending

```python
# Resize the array
z.resize(15000, 15000)  # Expands or shrinks dimensions

# Append data along an axis
z.append(np.random.random((1000, 10000)), axis=0)  # Adds rows
```
Chunking Strategies
Chunking is critical for performance. Choose chunk sizes and shapes based on access patterns.
Chunk Size Guidelines
- Minimum chunk size: at least 1 MB is recommended for good performance
- Balance: larger chunks mean fewer metadata operations; smaller chunks allow finer-grained parallel access
- Memory consideration: entire chunks must fit in memory during compression

```python
# Configure chunk size (aim for ~1 MB per chunk)
# For float32 data: 1 MB = 262,144 elements = a 512×512 array
z = zarr.zeros(
    shape=(10000, 10000),
    chunks=(512, 512),  # ~1 MB chunks
    dtype='f4'
)
```
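The 512×512 figure above can be derived programmatically. A minimal sketch (the helper name is ours, not part of the Zarr API):

```python
import math
import numpy as np

def square_chunk_edge(dtype, target_mb=1.0):
    """Edge length of a square 2-D chunk close to target_mb mebibytes."""
    itemsize = np.dtype(dtype).itemsize          # bytes per element
    target_elements = target_mb * 1024 * 1024 / itemsize
    return int(math.sqrt(target_elements))       # edge of a square chunk

print(square_chunk_edge('f4'))  # 512 for float32 at 1 MiB
print(square_chunk_edge('f8'))  # 362 for float64 at 1 MiB
```

The same arithmetic extends to N dimensions by taking the N-th root instead of the square root.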
Aligning Chunks with Access Patterns

Critical: chunk shape dramatically affects performance depending on how data is accessed.

```python
# If accessing rows frequently (first dimension)
z = zarr.zeros((10000, 10000), chunks=(10, 10000))  # Each chunk spans all columns

# If accessing columns frequently (second dimension)
z = zarr.zeros((10000, 10000), chunks=(10000, 10))  # Each chunk spans all rows

# For mixed access patterns (balanced approach)
z = zarr.zeros((10000, 10000), chunks=(1000, 1000))  # Square chunks
```

**Performance example**: for a (200, 200, 200) array, reading along the first dimension:
- Using chunks (1, 200, 200): ~107 ms
- Using chunks (200, 200, 1): ~1.65 ms (65× faster!)
Sharding for Large-Scale Storage
When arrays have millions of small chunks, use sharding to group chunks into larger storage objects:

```python
# Create an array with sharding
z = zarr.create_array(
    store='data.zarr',
    shape=(100000, 100000),
    chunks=(100, 100),     # Small chunks for fine-grained access
    shards=(1000, 1000),   # Groups 100 chunks per shard
    dtype='f4'
)
```

**Benefits**:
- Reduces file system overhead from millions of small files
- Improves cloud storage performance (fewer object requests)
- Prevents filesystem block-size waste

**Important**: entire shards must fit in memory before writing.

Compression
Zarr applies compression per chunk to reduce storage while maintaining fast access.
Configuring Compression
```python
from zarr.codecs import BytesCodec, GzipCodec
from zarr.codecs.blosc import BloscCodec

# Default: Blosc with Zstandard
z = zarr.zeros((1000, 1000), chunks=(100, 100))  # Uses the default compression

# Configure the Blosc codec
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')]
)

# Available Blosc compressors: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'

# Use Gzip compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[GzipCodec(level=6)]
)

# Disable compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[BytesCodec()]  # No compression
)
```
Compression Performance Tips
- Blosc (default): fast compression/decompression, good for interactive workloads
- Zstandard: better compression ratios, slightly slower than LZ4
- Gzip: maximum compression, slower performance
- LZ4: fastest compression, lower ratios
- Shuffle: enable the shuffle filter for better compression of numeric data

```python
# Optimal for numeric scientific data
codecs=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')]

# Optimal for speed
codecs=[BloscCodec(cname='lz4', clevel=1)]

# Optimal for compression ratio
codecs=[GzipCodec(level=9)]
```
Storage Backends
Zarr supports multiple storage backends through a flexible storage interface.
Local Filesystem (Default)
```python
from zarr.storage import LocalStore

# Explicit store creation
store = LocalStore('data/my_array.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))

# Or use a string path (a LocalStore is created automatically)
z = zarr.open_array('data/my_array.zarr', mode='w', shape=(1000, 1000),
                    chunks=(100, 100))
```

In-Memory Storage
```python
from zarr.storage import MemoryStore

# Create an in-memory store
store = MemoryStore()
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))

# Data exists only in memory and is not persisted
```

ZIP File Storage
```python
from zarr.storage import ZipStore

# Write to a ZIP file
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = np.random.random((1000, 1000))
store.close()  # IMPORTANT: the ZipStore must be closed

# Read from a ZIP file
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
```

Cloud Storage (S3, GCS)
```python
import gcsfs
import s3fs
import zarr

# S3 storage
s3 = s3fs.S3FileSystem(anon=False)  # Use credentials
store = s3fs.S3Map(root='my-bucket/path/to/array.zarr', s3=s3)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = data

# Google Cloud Storage
gcs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.GCSMap(root='my-bucket/path/to/array.zarr', gcs=gcs)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

**Cloud storage best practices**:
- Use consolidated metadata to reduce latency: `zarr.consolidate_metadata(store)`
- Align chunk sizes with cloud object sizing (typically 5-100 MB is optimal)
- Enable parallel writes using Dask for large-scale data
- Consider sharding to reduce the number of objects

Groups and Hierarchies
Groups organize multiple arrays hierarchically, similar to directories or HDF5 groups.
Creating and Using Groups
```python
# Create the root group
root = zarr.group(store='data/hierarchy.zarr')

# Create sub-groups
temperature = root.create_group('temperature')
precipitation = root.create_group('precipitation')

# Create arrays within groups
temp_array = temperature.create_array(
    name='t2m',
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype='f4'
)
precip_array = precipitation.create_array(
    name='prcp',
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype='f4'
)

# Access arrays using paths
array = root['temperature/t2m']

# Visualize the hierarchy
print(root.tree())
# Output:
# /
# ├── temperature
# │   └── t2m (365, 720, 1440) f4
# └── precipitation
#     └── prcp (365, 720, 1440) f4
```

H5py-Compatible API
Zarr provides an h5py-compatible interface for users familiar with HDF5:

```python
# Create a group with h5py-style methods
root = zarr.group('data.zarr')
dataset = root.create_dataset('my_data', shape=(1000, 1000), chunks=(100, 100),
                              dtype='f4')

# Access like h5py
grp = root.require_group('subgroup')
arr = grp.require_dataset('array', shape=(500, 500), chunks=(50, 50), dtype='i4')
```

Attributes and Metadata
Attach custom metadata to arrays and groups using attributes:

```python
# Add attributes to an array
z = zarr.zeros((1000, 1000), chunks=(100, 100))
z.attrs['description'] = 'Temperature data in Kelvin'
z.attrs['units'] = 'K'
z.attrs['created'] = '2024-01-15'
z.attrs['processing_version'] = 2.1

# Attributes are stored as JSON
print(z.attrs['units'])  # Output: K

# Add attributes to groups
root = zarr.group('data.zarr')
root.attrs['project'] = 'Climate Analysis'
root.attrs['institution'] = 'Research Institute'

# Attributes persist with the array/group
root2 = zarr.open('data.zarr')
print(root2.attrs['project'])  # Output: Climate Analysis
```

**Important**: attributes must be JSON-serializable (strings, numbers, lists, dicts, booleans, null).

Integration with NumPy, Dask, and Xarray
NumPy Integration
Zarr arrays implement the NumPy array interface:

```python
import numpy as np
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Use NumPy functions directly
result = np.sum(z, axis=0)  # NumPy operates on the Zarr array
mean = np.mean(z[:100, :100])

# Convert to a NumPy array
numpy_array = z[:]  # Loads the entire array into memory
```

Dask Integration
Dask provides lazy, parallel computation on Zarr arrays:

```python
import dask.array as da
import zarr

# Create a large Zarr array
z = zarr.open('data.zarr', mode='w', shape=(100000, 100000),
              chunks=(1000, 1000), dtype='f4')

# Load as a Dask array (lazy, no data loaded)
dask_array = da.from_zarr('data.zarr')

# Perform computations (parallel, out-of-core)
result = dask_array.mean(axis=0).compute()

# Write a Dask array to Zarr
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
```

**Benefits**:
- Process datasets larger than memory
- Automatic parallel computation across chunks
- Efficient I/O with chunked storage

Xarray Integration
Xarray provides labeled, multidimensional arrays with a Zarr backend:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Open a Zarr store as an Xarray Dataset (lazy loading)
ds = xr.open_zarr('data.zarr')

# The Dataset includes coordinates and metadata
print(ds)

# Access variables
temperature = ds['temperature']

# Perform labeled operations
subset = ds.sel(time='2024-01', lat=slice(30, 60))

# Write an Xarray Dataset to Zarr
ds.to_zarr('output.zarr')

# Create a Dataset from scratch with coordinates
# (data and data2 are NumPy arrays with matching time/lat/lon shapes)
ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], data),
        'precipitation': (['time', 'lat', 'lon'], data2)
    },
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1)
    }
)
ds.to_zarr('climate_data.zarr')
```

**Benefits**:
- Named dimensions and coordinates
- Label-based indexing and selection
- Integration with pandas for time series
- A NetCDF-like interface familiar to climate/geospatial scientists

Parallel Computing and Synchronization
Thread-Safe Operations
```python
from zarr import ThreadSynchronizer
import zarr

# For multi-threaded writes
synchronizer = ThreadSynchronizer()
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple threads
# (when writes don't span chunk boundaries)
```

Process-Safe Operations
```python
from zarr import ProcessSynchronizer
import zarr

# For multi-process writes
synchronizer = ProcessSynchronizer('sync_data.sync')
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple processes
```

**Note**:
- Concurrent reads require no synchronization
- Synchronization is only needed for writes that may span chunk boundaries
- Processes/threads writing to entirely separate chunks need no synchronization

Consolidated Metadata
For hierarchical stores with many arrays, consolidate metadata into a single file to reduce I/O operations:

```python
import zarr

# After creating arrays/groups
root = zarr.group('data.zarr')
# ... create multiple arrays/groups ...

# Consolidate metadata
zarr.consolidate_metadata('data.zarr')

# Open with consolidated metadata (faster, especially on cloud storage)
root = zarr.open_consolidated('data.zarr')
```

**Benefits**:
- Reduces metadata read operations from N (one per array) to 1
- Critical for cloud storage (reduces latency)
- Speeds up `tree()` operations and group traversal

**Cautions**:
- Metadata can become stale if arrays are updated without re-consolidating
- Not suitable for frequently updated datasets
- Multi-writer scenarios may see inconsistent reads

Performance Optimization
Checklist for Optimal Performance
- **Chunk size**: aim for 1-10 MB per chunk
  ```python
  # For float32: 1 MB = 262,144 elements
  chunks = (512, 512)  # 512 × 512 × 4 bytes = ~1 MB
  ```
- **Chunk shape**: align with access patterns
  ```python
  # Row-wise access → chunk spans columns: (small, large)
  # Column-wise access → chunk spans rows: (large, small)
  # Random access → balanced: (medium, medium)
  ```
- **Compression**: choose based on workload
  ```python
  # Interactive/fast:    BloscCodec(cname='lz4')
  # Balanced:            BloscCodec(cname='zstd', clevel=5)
  # Maximum compression: GzipCodec(level=9)
  ```
- **Storage backend**: match the environment
  ```python
  # Local:     LocalStore (the default)
  # Cloud:     S3Map / GCSMap with consolidated metadata
  # Temporary: MemoryStore
  ```
- **Sharding**: use for large-scale datasets
  ```python
  # When you have millions of small chunks
  shards = (10 * chunk_size, 10 * chunk_size)
  ```
- **Parallel I/O**: use Dask for large operations
  ```python
  import dask.array as da
  dask_array = da.from_zarr('data.zarr')
  result = dask_array.compute(scheduler='threads', num_workers=8)
  ```
Profiling and Debugging
```python
# Print detailed array information
print(z.info)
# Output includes:
# - Type, shape, chunks, dtype
# - Compression codec and level
# - Storage size (compressed vs. uncompressed)
# - Storage location

# Check storage size
print(f"Compressed size: {z.nbytes_stored / 1e6:.2f} MB")
print(f"Uncompressed size: {z.nbytes / 1e6:.2f} MB")
print(f"Compression ratio: {z.nbytes / z.nbytes_stored:.2f}x")
```

Common Patterns and Best Practices
Pattern: Time Series Data
```python
# Store time series with time as the first dimension;
# this allows efficient appending of new time steps
z = zarr.open('timeseries.zarr', mode='a',
              shape=(0, 720, 1440),   # Start with 0 time steps
              chunks=(1, 720, 1440),  # One time step per chunk
              dtype='f4')

# Append new time steps
new_data = np.random.random((1, 720, 1440))
z.append(new_data, axis=0)
```

Pattern: Large Matrix Operations
```python
import dask.array as da

# Create a large matrix in Zarr
z = zarr.open('matrix.zarr', mode='w',
              shape=(100000, 100000),
              chunks=(1000, 1000),
              dtype='f8')

# Use Dask for parallel computation
dask_z = da.from_zarr('matrix.zarr')
result = (dask_z @ dask_z.T).compute()  # Parallel matrix multiply
```

Pattern: Cloud-Native Workflow
```python
import s3fs
import zarr

# Write to S3
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)

# Create an array with chunking appropriate for the cloud
z = zarr.open_array(store=store, mode='w',
                    shape=(10000, 10000),
                    chunks=(500, 500),  # ~1 MB chunks
                    dtype='f4')
z[:] = data

# Consolidate metadata for faster reads
zarr.consolidate_metadata(store)

# Read from S3 (anywhere, anytime)
store_read = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)
z_read = zarr.open_consolidated(store_read)
subset = z_read[0:100, 0:100]
```

Pattern: Format Conversion
```python
# HDF5 to Zarr
import h5py
import zarr

with h5py.File('data.h5', 'r') as h5:
    dataset = h5['dataset_name']
    z = zarr.array(dataset[:],
                   chunks=(1000, 1000),
                   store='data.zarr')

# NumPy to Zarr
import numpy as np
data = np.load('data.npy')
z = zarr.array(data, chunks='auto', store='data.zarr')

# Zarr to NetCDF (via Xarray)
import xarray as xr
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
```

Common Issues and Solutions
Issue: Slow Performance
Diagnosis: check chunk size and alignment.

```python
print(z.chunks)  # Are chunks an appropriate size?
print(z.info)    # Check the compression ratio
```

Solutions:
- Increase chunk size to 1-10 MB
- Align chunks with the access pattern
- Try different compression codecs
- Use Dask for parallel operations
Issue: High Memory Usage
Cause: loading the entire array or large chunks into memory.

Solutions:

```python
# Don't load the entire array
# Bad:  data = z[:]

# Good: process in chunks
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    process(chunk)

# Or use Dask for automatic chunking
import dask.array as da
dask_z = da.from_zarr('data.zarr')
result = dask_z.mean().compute()  # Processes chunk by chunk
```
Issue: Cloud Storage Latency
Solutions:

```python
# 1. Consolidate metadata
zarr.consolidate_metadata(store)
z = zarr.open_consolidated(store)

# 2. Use appropriate chunk sizes (5-100 MB for cloud storage)
chunks = (2000, 2000)  # Larger chunks for the cloud

# 3. Enable sharding
shards = (10000, 10000)  # Groups many chunks per object
```

Issue: Concurrent Write Conflicts
Solution: use synchronizers, or ensure writes do not overlap.

```python
from zarr import ProcessSynchronizer

sync = ProcessSynchronizer('sync.sync')
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)

# Or design the workflow so each process writes to separate chunks
```

Additional Resources
For detailed API documentation, advanced usage, and the latest updates:
- Official Documentation: https://zarr.readthedocs.io/
- Zarr Specifications: https://zarr-specs.readthedocs.io/
- GitHub Repository: https://github.com/zarr-developers/zarr-python
- Community Chat: https://gitter.im/zarr-developers/community
Related Libraries:
- Xarray: https://docs.xarray.dev/ (labeled arrays)
- Dask: https://docs.dask.org/ (parallel computing)
- NumCodecs: https://numcodecs.readthedocs.io/ (compression codecs)