zarr-python

Overview

Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.

Quick Start

Installation

```bash
uv pip install zarr
```

Requires Python 3.11+. For cloud storage support, install additional packages:

```bash
uv pip install s3fs   # for S3
uv pip install gcsfs  # for Google Cloud Storage
```

Basic Array Creation

```python
import zarr
import numpy as np

# Create a 2D array with chunking and compression
z = zarr.create_array(
    store="data/my_array.zarr",
    shape=(10000, 10000),
    chunks=(1000, 1000),
    dtype="f4",
)

# Write data using NumPy-style indexing
z[:, :] = np.random.random((10000, 10000))

# Read data
data = z[0:100, 0:100]  # Returns NumPy array
```

Core Operations

Creating Arrays

Zarr provides multiple convenience functions for array creation:

```python
# Create empty array
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype='f4', store='data.zarr')

# Create filled arrays
z = zarr.ones((5000, 5000), chunks=(500, 500))
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100))

# Create from existing data
data = np.arange(10000).reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store='data.zarr')

# Create like another array
z2 = zarr.zeros_like(z)  # Matches shape, chunks, dtype of z
```

Opening Existing Arrays

```python
# Open array (read/write mode by default)
z = zarr.open_array('data.zarr', mode='r+')

# Read-only mode
z = zarr.open_array('data.zarr', mode='r')

# The open() function auto-detects arrays vs groups
z = zarr.open('data.zarr')  # Returns Array or Group
```

Reading and Writing Data

Zarr arrays support NumPy-like indexing:

```python
# Write entire array
z[:] = 42

# Write slices
z[0, :] = np.arange(100)
z[10:20, 50:60] = np.random.random((10, 10))

# Read data (returns NumPy array)
data = z[0:100, 0:100]
row = z[5, :]

# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]]  # Coordinate indexing
z.oindex[0:10, [5, 10, 15]]       # Orthogonal indexing
z.blocks[0, 0]                    # Block/chunk indexing
```

Resizing and Appending

```python
# Resize array
z.resize(15000, 15000)  # Expands or shrinks dimensions

# Append data along an axis
z.append(np.random.random((1000, 10000)), axis=0)  # Adds rows
```

Chunking Strategies

Chunking is critical for performance. Choose chunk sizes and shapes based on access patterns.

Chunk Size Guidelines

  • Minimum chunk size: 1 MB recommended for optimal performance
  • Balance: Larger chunks = fewer metadata operations; smaller chunks = better parallel access
  • Memory consideration: Entire chunks must fit in memory during compression

```python
# Configure chunk size (aim for ~1 MB per chunk)
# For float32 data: 1 MB = 262,144 elements = 512×512 array
z = zarr.zeros(
    shape=(10000, 10000),
    chunks=(512, 512),  # ~1 MB chunks
    dtype='f4',
)
```
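
The 512×512 rule above generalizes: given a target chunk size and an element size, the edge of a square 2-D chunk follows from simple arithmetic. A minimal sketch (`suggest_chunk_edge` is a hypothetical helper, not part of the Zarr API):

```python
import math

def suggest_chunk_edge(target_mb: float, itemsize: int) -> int:
    """Edge length of a square 2-D chunk totalling ~target_mb megabytes."""
    elements = target_mb * 1024 * 1024 / itemsize  # elements per chunk
    return int(math.sqrt(elements))                # edge of a square chunk

edge = suggest_chunk_edge(1, 4)  # float32, ~1 MB target
print(edge)  # 512, the rule of thumb above
```

For float64 (`itemsize=8`) the same 1 MB target gives a ~362×362 chunk.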

Aligning Chunks with Access Patterns

Critical: Chunk shape dramatically affects performance based on how data is accessed.
```python
# If accessing rows frequently (first dimension)
z = zarr.zeros((10000, 10000), chunks=(10, 10000))  # Chunk spans columns

# If accessing columns frequently (second dimension)
z = zarr.zeros((10000, 10000), chunks=(10000, 10))  # Chunk spans rows

# For mixed access patterns (balanced approach)
z = zarr.zeros((10000, 10000), chunks=(1000, 1000))  # Square chunks
```

**Performance example**: For a (200, 200, 200) array, reading along the first dimension:
- Using chunks (1, 200, 200): ~107 ms
- Using chunks (200, 200, 1): ~1.65 ms (65× faster!)
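The gap follows directly from how many chunks each read has to touch. A small counter (a hypothetical helper, plain ceiling arithmetic under the shapes stated above) makes the difference concrete:

```python
import math

def chunks_touched(read_shape, chunk_shape):
    """Number of chunks a contiguous, origin-anchored read intersects."""
    count = 1
    for extent, chunk in zip(read_shape, chunk_shape):
        count *= math.ceil(extent / chunk)
    return count

# Reading z[:, 0, 0] from a (200, 200, 200) array -> read shape (200, 1, 1)
print(chunks_touched((200, 1, 1), (1, 200, 200)))  # 200 chunks decompressed
print(chunks_touched((200, 1, 1), (200, 200, 1)))  # 1 chunk decompressed
```

Every touched chunk must be fetched and decompressed in full, so the chunk count is a good first-order proxy for read latency.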

Sharding for Large-Scale Storage

When arrays have millions of small chunks, use sharding to group chunks into larger storage objects:
```python
# Create array with sharding
z = zarr.create_array(
    store='data.zarr',
    shape=(100000, 100000),
    chunks=(100, 100),    # Small chunks for access
    shards=(1000, 1000),  # Groups 100 chunks per shard
    dtype='f4',
)
```

**Benefits**:
- Reduces file system overhead from millions of small files
- Improves cloud storage performance (fewer object requests)
- Prevents filesystem block size waste

**Important**: Entire shards must fit in memory before writing.
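
The object-count savings can be estimated before writing anything. A hedged sketch (plain ceiling-division arithmetic, not a Zarr API):

```python
def storage_objects(shape, unit):
    """Number of storage objects when the array is split into unit-sized pieces."""
    count = 1
    for extent, edge in zip(shape, unit):
        count *= -(-extent // edge)  # ceiling division
    return count

shape = (100_000, 100_000)
print(storage_objects(shape, (100, 100)))    # chunks only: 1,000,000 objects
print(storage_objects(shape, (1000, 1000)))  # with shards:  10,000 objects
```

For the example above, sharding cuts a million 40 KB objects down to ten thousand ~4 MB ones.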

Compression

Zarr applies compression per chunk to reduce storage while maintaining fast access.

Configuring Compression

```python
from zarr.codecs import BloscCodec, BytesCodec, GzipCodec, ZstdCodec

# Default: Blosc with Zstandard
z = zarr.zeros((1000, 1000), chunks=(100, 100))  # Uses default compression

# Configure Blosc codec
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')],
)
# Available Blosc compressors: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'

# Use Gzip compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[GzipCodec(level=6)],
)

# Disable compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[BytesCodec()],  # No compression
)
```

Compression Performance Tips

  • Blosc (default): Fast compression/decompression, good for interactive workloads
  • Zstandard: Better compression ratios, slightly slower than LZ4
  • Gzip: Maximum compression, slower performance
  • LZ4: Fastest compression, lower ratios
  • Shuffle: Enable the shuffle filter for better compression on numeric data

```python
# Optimal for numeric scientific data
codecs=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')]

# Optimal for speed
codecs=[BloscCodec(cname='lz4', clevel=1)]

# Optimal for compression ratio
codecs=[GzipCodec(level=9)]
```
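
Why the shuffle filter helps can be sketched with nothing but the standard library: grouping the same-significance bytes of consecutive values creates long runs that a generic compressor exploits. This imitates the idea behind Blosc's shuffle; it is not the Blosc implementation:

```python
import struct
import zlib

# A slowly varying float64 series, packed little-endian
values = [i * 0.001 for i in range(4096)]
raw = struct.pack('<4096d', *values)

# Byte shuffle: all 1st bytes, then all 2nd bytes, ... of each 8-byte value
shuffled = b''.join(raw[i::8] for i in range(8))

plain = len(zlib.compress(raw, 6))
with_shuffle = len(zlib.compress(shuffled, 6))
print(with_shuffle < plain)  # shuffled bytes compress smaller here
```

The high-order exponent bytes barely change between neighbors, so after shuffling they form near-constant runs, while the interleaved original hides them among noisy mantissa bytes.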

Storage Backends

Zarr supports multiple storage backends through a flexible storage interface.

Local Filesystem (Default)

```python
from zarr.storage import LocalStore

# Explicit store creation
store = LocalStore('data/my_array.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))

# Or use a string path (creates a LocalStore automatically)
z = zarr.open_array('data/my_array.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100))
```

In-Memory Storage

```python
from zarr.storage import MemoryStore

# Create in-memory store
store = MemoryStore()
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
# Data exists only in memory, not persisted
```

ZIP File Storage

```python
from zarr.storage import ZipStore

# Write to ZIP file
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = np.random.random((1000, 1000))
store.close()  # IMPORTANT: Must close ZipStore

# Read from ZIP file
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
```

Cloud Storage (S3, GCS)

```python
import s3fs
import zarr

# S3 storage
s3 = s3fs.S3FileSystem(anon=False)  # Use credentials
store = s3fs.S3Map(root='my-bucket/path/to/array.zarr', s3=s3)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = data

# Google Cloud Storage
import gcsfs
gcs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.GCSMap(root='my-bucket/path/to/array.zarr', gcs=gcs)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

**Cloud Storage Best Practices**:
- Use consolidated metadata to reduce latency: `zarr.consolidate_metadata(store)`
- Align chunk sizes with cloud object sizing (typically 5-100 MB optimal)
- Enable parallel writes using Dask for large-scale data
- Consider sharding to reduce the number of objects

Groups and Hierarchies

Groups organize multiple arrays hierarchically, similar to directories or HDF5 groups.

Creating and Using Groups

```python
# Create root group
root = zarr.group(store='data/hierarchy.zarr')

# Create sub-groups
temperature = root.create_group('temperature')
precipitation = root.create_group('precipitation')

# Create arrays within groups
temp_array = temperature.create_array(
    name='t2m', shape=(365, 720, 1440), chunks=(1, 720, 1440), dtype='f4'
)
precip_array = precipitation.create_array(
    name='prcp', shape=(365, 720, 1440), chunks=(1, 720, 1440), dtype='f4'
)

# Access using paths
array = root['temperature/t2m']

# Visualize hierarchy
print(root.tree())
```

Output:

```
/
├── temperature
│   └── t2m (365, 720, 1440) f4
└── precipitation
    └── prcp (365, 720, 1440) f4
```

H5py-Compatible API

Zarr provides an h5py-compatible interface for users familiar with HDF5:

```python
# Create group with h5py-style methods
root = zarr.group('data.zarr')
dataset = root.create_dataset('my_data', shape=(1000, 1000), chunks=(100, 100), dtype='f4')

# Access like h5py
grp = root.require_group('subgroup')
arr = grp.require_dataset('array', shape=(500, 500), chunks=(50, 50), dtype='i4')
```

Attributes and Metadata

Attach custom metadata to arrays and groups using attributes:
```python
# Add attributes to array
z = zarr.zeros((1000, 1000), chunks=(100, 100))
z.attrs['description'] = 'Temperature data in Kelvin'
z.attrs['units'] = 'K'
z.attrs['created'] = '2024-01-15'
z.attrs['processing_version'] = 2.1

# Attributes are stored as JSON
print(z.attrs['units'])  # Output: K

# Add attributes to groups
root = zarr.group('data.zarr')
root.attrs['project'] = 'Climate Analysis'
root.attrs['institution'] = 'Research Institute'

# Attributes persist with the array/group
root2 = zarr.open('data.zarr')
print(root2.attrs['project'])  # Set on the root group above
```

**Important**: Attributes must be JSON-serializable (strings, numbers, lists, dicts, booleans, null).
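
Because attributes round-trip through JSON, a quick pre-check avoids a failed assignment; `is_json_attr` below is a hypothetical convenience, not part of the Zarr API:

```python
import json

def is_json_attr(value) -> bool:
    """True if value survives a JSON round-trip, i.e. is safe as an attribute."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False

print(is_json_attr({'units': 'K', 'version': 2.1}))  # True
print(is_json_attr({1, 2, 3}))                       # False: sets are not JSON
```

NumPy integer scalars and arrays fail this check; convert them with `.item()` or `.tolist()` before assigning.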

Integration with NumPy, Dask, and Xarray

NumPy Integration

Zarr arrays implement the NumPy array interface:

```python
import numpy as np
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Use NumPy functions directly
result = np.sum(z, axis=0)  # NumPy operates on the Zarr array
mean = np.mean(z[:100, :100])

# Convert to NumPy array
numpy_array = z[:]  # Loads entire array into memory
```

Dask Integration

Dask provides lazy, parallel computation on Zarr arrays:

```python
import dask.array as da
import zarr

# Create large Zarr array
z = zarr.open('data.zarr', mode='w', shape=(100000, 100000),
              chunks=(1000, 1000), dtype='f4')

# Load as Dask array (lazy, no data loaded)
dask_array = da.from_zarr('data.zarr')

# Perform computations (parallel, out-of-core)
result = dask_array.mean(axis=0).compute()

# Write Dask array to Zarr
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
```

**Benefits**:
- Process datasets larger than memory
- Automatic parallel computation across chunks
- Efficient I/O with chunked storage

Xarray Integration

Xarray provides labeled, multidimensional arrays with a Zarr backend:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Open Zarr store as an Xarray Dataset (lazy loading)
ds = xr.open_zarr('data.zarr')

# Dataset includes coordinates and metadata
print(ds)

# Access variables
temperature = ds['temperature']

# Perform labeled operations
subset = ds.sel(time='2024-01', lat=slice(30, 60))

# Write Xarray Dataset to Zarr
ds.to_zarr('output.zarr')

# Create from scratch with coordinates
ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], data),
        'precipitation': (['time', 'lat', 'lon'], data2),
    },
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1),
    },
)
ds.to_zarr('climate_data.zarr')
```

**Benefits**:
- Named dimensions and coordinates
- Label-based indexing and selection
- Integration with pandas for time series
- NetCDF-like interface familiar to climate/geospatial scientists

Parallel Computing and Synchronization

Thread-Safe Operations

```python
import zarr
from zarr import ThreadSynchronizer

# For multi-threaded writes
synchronizer = ThreadSynchronizer()
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple threads
# (required when writes span chunk boundaries)
```

Process-Safe Operations

```python
import zarr
from zarr import ProcessSynchronizer

# For multi-process writes
synchronizer = ProcessSynchronizer('sync_data.sync')
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple processes
```

**Note**:
- Concurrent reads require no synchronization
- Synchronization is only needed for writes that may span chunk boundaries
- Processes or threads writing to separate chunks need no synchronization
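
One way to avoid synchronizers entirely is to hand each worker a chunk-aligned slice, so no two workers ever touch the same chunk. A sketch of the partitioning arithmetic (`chunk_aligned_ranges` is a hypothetical helper, not a Zarr API):

```python
def chunk_aligned_ranges(n_rows: int, chunk_rows: int, n_workers: int):
    """Split [0, n_rows) into per-worker ranges that start on chunk boundaries."""
    n_chunks = -(-n_rows // chunk_rows)     # ceiling division
    per_worker = -(-n_chunks // n_workers)  # chunks assigned per worker
    ranges = []
    for w in range(n_workers):
        start = w * per_worker * chunk_rows
        stop = min((w + 1) * per_worker * chunk_rows, n_rows)
        if start < stop:
            ranges.append((start, stop))
    return ranges

print(chunk_aligned_ranges(10_000, 1_000, 4))
# [(0, 3000), (3000, 6000), (6000, 9000), (9000, 10000)]
```

Each worker then writes `z[start:stop, :]` for its own range; since every range begins and ends on a chunk boundary, the writes never contend for a chunk.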

Consolidated Metadata

For hierarchical stores with many arrays, consolidate metadata into a single file to reduce I/O operations:
```python
import zarr

# After creating arrays/groups
root = zarr.group('data.zarr')
# ... create multiple arrays/groups ...

# Consolidate metadata
zarr.consolidate_metadata('data.zarr')

# Open with consolidated metadata (faster, especially on cloud storage)
root = zarr.open_consolidated('data.zarr')
```

**Benefits**:
- Reduces metadata read operations from N (one per array) to 1
- Critical for cloud storage (reduces latency)
- Speeds up `tree()` operations and group traversal

**Cautions**:
- Metadata can become stale if arrays are updated without re-consolidation
- Not suitable for frequently updated datasets
- Multi-writer scenarios may produce inconsistent reads

Performance Optimization

Checklist for Optimal Performance

  1. Chunk Size: Aim for 1-10 MB per chunk
    ```python
    # For float32: 1 MB = 262,144 elements
    chunks = (512, 512)  # 512×512×4 bytes = ~1 MB
    ```
  2. Chunk Shape: Align with access patterns
    ```python
    # Row-wise access → chunk spans columns: (small, large)
    # Column-wise access → chunk spans rows: (large, small)
    # Random access → balanced: (medium, medium)
    ```
  3. Compression: Choose based on workload
    ```python
    # Interactive/fast: BloscCodec(cname='lz4')
    # Balanced: BloscCodec(cname='zstd', clevel=5)
    # Maximum compression: GzipCodec(level=9)
    ```
  4. Storage Backend: Match to environment
    ```python
    # Local: LocalStore (default)
    # Cloud: S3Map/GCSMap with consolidated metadata
    # Temporary: MemoryStore
    ```
  5. Sharding: Use for large-scale datasets
    ```python
    # When you have millions of small chunks
    shards = (10 * chunk_size, 10 * chunk_size)
    ```
  6. Parallel I/O: Use Dask for large operations
    ```python
    import dask.array as da
    dask_array = da.from_zarr('data.zarr')
    result = dask_array.compute(scheduler='threads', num_workers=8)
    ```

Profiling and Debugging

```python
# Print detailed array information
print(z.info)
# Output includes:
# - Type, shape, chunks, dtype
# - Compression codec and level
# - Storage size (compressed vs uncompressed)
# - Storage location

# Check storage size
print(f"Compressed size: {z.nbytes_stored / 1e6:.2f} MB")
print(f"Uncompressed size: {z.nbytes / 1e6:.2f} MB")
print(f"Compression ratio: {z.nbytes / z.nbytes_stored:.2f}x")
```

Common Patterns and Best Practices

Pattern: Time Series Data

```python
# Store time series with time as the first dimension;
# this allows efficient appending of new time steps
z = zarr.open('timeseries.zarr', mode='a',
              shape=(0, 720, 1440),   # Start with 0 time steps
              chunks=(1, 720, 1440),  # One time step per chunk
              dtype='f4')

# Append new time steps
new_data = np.random.random((1, 720, 1440))
z.append(new_data, axis=0)
```

Pattern: Large Matrix Operations

```python
import dask.array as da

# Create large matrix in Zarr
z = zarr.open('matrix.zarr', mode='w', shape=(100000, 100000),
              chunks=(1000, 1000), dtype='f8')

# Use Dask for parallel computation
dask_z = da.from_zarr('matrix.zarr')
result = (dask_z @ dask_z.T).compute()  # Parallel matrix multiply
```

Pattern: Cloud-Native Workflow

```python
import s3fs
import zarr

# Write to S3
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)

# Create array with appropriate chunking for cloud
z = zarr.open_array(store=store, mode='w', shape=(10000, 10000),
                    chunks=(500, 500),  # ~1 MB chunks
                    dtype='f4')
z[:] = data

# Consolidate metadata for faster reads
zarr.consolidate_metadata(store)

# Read from S3 (anywhere, anytime)
store_read = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)
z_read = zarr.open_consolidated(store_read)
subset = z_read[0:100, 0:100]
```

Pattern: Format Conversion

```python
# HDF5 to Zarr
import h5py
import zarr

with h5py.File('data.h5', 'r') as h5:
    dataset = h5['dataset_name']
    z = zarr.array(dataset[:], chunks=(1000, 1000), store='data.zarr')

# NumPy to Zarr
import numpy as np
data = np.load('data.npy')
z = zarr.array(data, chunks='auto', store='data.zarr')

# Zarr to NetCDF (via Xarray)
import xarray as xr
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
```

Common Issues and Solutions

Issue: Slow Performance

Diagnosis: Check chunk size and alignment

```python
print(z.chunks)  # Are chunks an appropriate size?
print(z.info)    # Check compression ratio
```

Solutions:
  • Increase chunk size to 1-10 MB
  • Align chunks with the access pattern
  • Try different compression codecs
  • Use Dask for parallel operations

Issue: High Memory Usage

Cause: Loading the entire array or large chunks into memory

Solutions:

```python
# Don't load the entire array
# Bad:  data = z[:]

# Good: process in chunks
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    process(chunk)

# Or use Dask for automatic chunking
import dask.array as da
dask_z = da.from_zarr('data.zarr')
result = dask_z.mean().compute()  # Processes in chunks
```
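
The block loop above generalizes to a tiny generator (a hypothetical helper) that yields bounded row windows, so peak memory is capped by the window height regardless of array size:

```python
def iter_row_blocks(shape, block_rows):
    """Yield (start, stop) row windows covering shape[0] in order."""
    n_rows = shape[0]
    for start in range(0, n_rows, block_rows):
        yield start, min(start + block_rows, n_rows)

# Each window is at most block_rows tall, so memory use stays bounded
blocks = list(iter_row_blocks((2500, 1000), 1000))
print(blocks)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

For best performance, pick `block_rows` as a multiple of the array's row chunk size so each window reads whole chunks.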

Issue: Cloud Storage Latency

Solutions:

```python
# 1. Consolidate metadata
zarr.consolidate_metadata(store)
z = zarr.open_consolidated(store)

# 2. Use appropriate chunk sizes (5-100 MB for cloud)
chunks = (2000, 2000)  # Larger chunks for cloud

# 3. Enable sharding
shards = (10000, 10000)  # Groups many chunks
```

Issue: Concurrent Write Conflicts

Solution: Use synchronizers or ensure non-overlapping writes

```python
import zarr
from zarr import ProcessSynchronizer

sync = ProcessSynchronizer('sync.sync')
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)

# Or design the workflow so each process writes to separate chunks
```

Additional Resources

For detailed API documentation, advanced usage, and the latest updates:
Related Libraries: