zarr-python

Overview

Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.

Quick Start

Installation

```bash
uv pip install zarr
```

Requires Python 3.11+. For cloud storage support, install additional packages:

```bash
uv pip install s3fs   # for S3
uv pip install gcsfs  # for Google Cloud Storage
```

Basic Array Creation

```python
import zarr
import numpy as np

# Create a 2D array with chunking and compression
z = zarr.create_array(
    store="data/my_array.zarr",
    shape=(10000, 10000),
    chunks=(1000, 1000),
    dtype="f4",
)

# Write data using NumPy-style indexing
z[:, :] = np.random.random((10000, 10000))

# Read data
data = z[0:100, 0:100]  # Returns NumPy array
```

Core Operations

Creating Arrays

Zarr provides multiple convenience functions for array creation:

```python
# Create empty array
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype='f4', store='data.zarr')

# Create filled arrays
z = zarr.ones((5000, 5000), chunks=(500, 500))
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100))

# Create from existing data
data = np.arange(10000).reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store='data.zarr')

# Create like another array
z2 = zarr.zeros_like(z)  # Matches shape, chunks, dtype of z
```

Opening Existing Arrays

```python
# Open array (read/write mode by default)
z = zarr.open_array('data.zarr', mode='r+')

# Read-only mode
z = zarr.open_array('data.zarr', mode='r')

# The open() function auto-detects arrays vs groups
z = zarr.open('data.zarr')  # Returns Array or Group
```

Reading and Writing Data

Zarr arrays support NumPy-like indexing:

```python
# Write entire array
z[:] = 42

# Write slices
z[0, :] = np.arange(100)
z[10:20, 50:60] = np.random.random((10, 10))

# Read data (returns NumPy array)
data = z[0:100, 0:100]
row = z[5, :]

# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]]  # Coordinate indexing
z.oindex[0:10, [5, 10, 15]]       # Orthogonal indexing
z.blocks[0, 0]                    # Block/chunk indexing
```

Resizing and Appending

```python
# Resize array
z.resize(15000, 15000)  # Expands or shrinks dimensions

# Append data along an axis
z.append(np.random.random((1000, 10000)), axis=0)  # Adds rows
```

Chunking Strategies

Chunking is critical for performance. Choose chunk sizes and shapes based on access patterns.

Chunk Size Guidelines

  • Minimum chunk size: 1 MB recommended for optimal performance
  • Balance: Larger chunks = fewer metadata operations; smaller chunks = better parallel access
  • Memory consideration: Entire chunks must fit in memory during compression

```python
# Configure chunk size (aim for ~1 MB per chunk)
# For float32 data: 1 MB = 262,144 elements = 512×512 array
z = zarr.zeros(
    shape=(10000, 10000),
    chunks=(512, 512),  # ~1 MB chunks
    dtype='f4',
)
```
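
The 512×512 rule above generalizes: given a target chunk size and an element size, the edge of a square 2-D chunk follows from simple arithmetic. A minimal sketch (`suggest_chunk_edge` is a hypothetical helper, not part of the Zarr API):

```python
import math

def suggest_chunk_edge(target_mb: float, itemsize: int) -> int:
    """Edge length of a square 2-D chunk totalling ~target_mb megabytes."""
    elements = target_mb * 1024 * 1024 / itemsize  # elements per chunk
    return int(math.sqrt(elements))                # edge of a square chunk

edge = suggest_chunk_edge(1, 4)  # float32, ~1 MB target
print(edge)  # 512, the rule of thumb above
```

For float64 (`itemsize=8`) the same 1 MB target gives a ~362×362 chunk.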

Aligning Chunks with Access Patterns

Critical: Chunk shape dramatically affects performance based on how data is accessed.
```python
# If accessing rows frequently (first dimension)
z = zarr.zeros((10000, 10000), chunks=(10, 10000))  # Chunk spans columns

# If accessing columns frequently (second dimension)
z = zarr.zeros((10000, 10000), chunks=(10000, 10))  # Chunk spans rows

# For mixed access patterns (balanced approach)
z = zarr.zeros((10000, 10000), chunks=(1000, 1000))  # Square chunks
```

**Performance example**: For a (200, 200, 200) array, reading along the first dimension:
- Using chunks (1, 200, 200): ~107 ms
- Using chunks (200, 200, 1): ~1.65 ms (65× faster!)
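The gap follows directly from how many chunks each read has to touch. A small counter (a hypothetical helper, plain ceiling arithmetic under the shapes stated above) makes the difference concrete:

```python
import math

def chunks_touched(read_shape, chunk_shape):
    """Number of chunks a contiguous, origin-anchored read intersects."""
    count = 1
    for extent, chunk in zip(read_shape, chunk_shape):
        count *= math.ceil(extent / chunk)
    return count

# Reading z[:, 0, 0] from a (200, 200, 200) array -> read shape (200, 1, 1)
print(chunks_touched((200, 1, 1), (1, 200, 200)))  # 200 chunks decompressed
print(chunks_touched((200, 1, 1), (200, 200, 1)))  # 1 chunk decompressed
```

Every touched chunk must be fetched and decompressed in full, so the chunk count is a good first-order proxy for read latency.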

Sharding for Large-Scale Storage

When arrays have millions of small chunks, use sharding to group chunks into larger storage objects:
```python
# Create array with sharding
z = zarr.create_array(
    store='data.zarr',
    shape=(100000, 100000),
    chunks=(100, 100),    # Small chunks for access
    shards=(1000, 1000),  # Groups 100 chunks per shard
    dtype='f4',
)
```

**Benefits**:
- Reduces file system overhead from millions of small files
- Improves cloud storage performance (fewer object requests)
- Prevents filesystem block size waste

**Important**: Entire shards must fit in memory before writing.
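
The object-count savings can be estimated before writing anything. A hedged sketch (plain ceiling-division arithmetic, not a Zarr API):

```python
def storage_objects(shape, unit):
    """Number of storage objects when the array is split into unit-sized pieces."""
    count = 1
    for extent, edge in zip(shape, unit):
        count *= -(-extent // edge)  # ceiling division
    return count

shape = (100_000, 100_000)
print(storage_objects(shape, (100, 100)))    # chunks only: 1,000,000 objects
print(storage_objects(shape, (1000, 1000)))  # with shards:  10,000 objects
```

For the example above, sharding cuts a million 40 KB objects down to ten thousand ~4 MB ones.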

Compression

Zarr applies compression per chunk to reduce storage while maintaining fast access.

Configuring Compression

```python
from zarr.codecs import BloscCodec, BytesCodec, GzipCodec, ZstdCodec

# Default: Blosc with Zstandard
z = zarr.zeros((1000, 1000), chunks=(100, 100))  # Uses default compression

# Configure Blosc codec
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')],
)
# Available Blosc compressors: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'

# Use Gzip compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[GzipCodec(level=6)],
)

# Disable compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    codecs=[BytesCodec()],  # No compression
)
```

Compression Performance Tips

  • Blosc (default): Fast compression/decompression, good for interactive workloads
  • Zstandard: Better compression ratios, slightly slower than LZ4
  • Gzip: Maximum compression, slower performance
  • LZ4: Fastest compression, lower ratios
  • Shuffle: Enable the shuffle filter for better compression on numeric data

```python
# Optimal for numeric scientific data
codecs=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')]

# Optimal for speed
codecs=[BloscCodec(cname='lz4', clevel=1)]

# Optimal for compression ratio
codecs=[GzipCodec(level=9)]
```
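
Why the shuffle filter helps can be sketched with nothing but the standard library: grouping the same-significance bytes of consecutive values creates long runs that a generic compressor exploits. This imitates the idea behind Blosc's shuffle; it is not the Blosc implementation:

```python
import struct
import zlib

# A slowly varying float64 series, packed little-endian
values = [i * 0.001 for i in range(4096)]
raw = struct.pack('<4096d', *values)

# Byte shuffle: all 1st bytes, then all 2nd bytes, ... of each 8-byte value
shuffled = b''.join(raw[i::8] for i in range(8))

plain = len(zlib.compress(raw, 6))
with_shuffle = len(zlib.compress(shuffled, 6))
print(with_shuffle < plain)  # shuffled bytes compress smaller here
```

The high-order exponent bytes barely change between neighbors, so after shuffling they form near-constant runs, while the interleaved original hides them among noisy mantissa bytes.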

Storage Backends

Zarr supports multiple storage backends through a flexible storage interface.

Local Filesystem (Default)

```python
from zarr.storage import LocalStore

# Explicit store creation
store = LocalStore('data/my_array.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))

# Or use a string path (creates a LocalStore automatically)
z = zarr.open_array('data/my_array.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100))
```

In-Memory Storage

```python
from zarr.storage import MemoryStore

# Create in-memory store
store = MemoryStore()
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
# Data exists only in memory, not persisted
```

ZIP File Storage

```python
from zarr.storage import ZipStore

# Write to ZIP file
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = np.random.random((1000, 1000))
store.close()  # IMPORTANT: Must close ZipStore

# Read from ZIP file
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
```

Cloud Storage (S3, GCS)

```python
import s3fs
import zarr

# S3 storage
s3 = s3fs.S3FileSystem(anon=False)  # Use credentials
store = s3fs.S3Map(root='my-bucket/path/to/array.zarr', s3=s3)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = data

# Google Cloud Storage
import gcsfs
gcs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.GCSMap(root='my-bucket/path/to/array.zarr', gcs=gcs)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

**Cloud Storage Best Practices**:
- Use consolidated metadata to reduce latency: `zarr.consolidate_metadata(store)`
- Align chunk sizes with cloud object sizing (typically 5-100 MB optimal)
- Enable parallel writes using Dask for large-scale data
- Consider sharding to reduce the number of objects

Groups and Hierarchies

Groups organize multiple arrays hierarchically, similar to directories or HDF5 groups.

Creating and Using Groups

```python
# Create root group
root = zarr.group(store='data/hierarchy.zarr')

# Create sub-groups
temperature = root.create_group('temperature')
precipitation = root.create_group('precipitation')

# Create arrays within groups
temp_array = temperature.create_array(
    name='t2m', shape=(365, 720, 1440), chunks=(1, 720, 1440), dtype='f4'
)
precip_array = precipitation.create_array(
    name='prcp', shape=(365, 720, 1440), chunks=(1, 720, 1440), dtype='f4'
)

# Access using paths
array = root['temperature/t2m']

# Visualize hierarchy
print(root.tree())
```

Output:

```
/
├── temperature
│   └── t2m (365, 720, 1440) f4
└── precipitation
    └── prcp (365, 720, 1440) f4
```

H5py-Compatible API

Zarr provides an h5py-compatible interface for users familiar with HDF5:

```python
# Create group with h5py-style methods
root = zarr.group('data.zarr')
dataset = root.create_dataset('my_data', shape=(1000, 1000), chunks=(100, 100), dtype='f4')

# Access like h5py
grp = root.require_group('subgroup')
arr = grp.require_dataset('array', shape=(500, 500), chunks=(50, 50), dtype='i4')
```

Attributes and Metadata

Attach custom metadata to arrays and groups using attributes:
```python
# Add attributes to array
z = zarr.zeros((1000, 1000), chunks=(100, 100))
z.attrs['description'] = 'Temperature data in Kelvin'
z.attrs['units'] = 'K'
z.attrs['created'] = '2024-01-15'
z.attrs['processing_version'] = 2.1

# Attributes are stored as JSON
print(z.attrs['units'])  # Output: K

# Add attributes to groups
root = zarr.group('data.zarr')
root.attrs['project'] = 'Climate Analysis'
root.attrs['institution'] = 'Research Institute'

# Attributes persist with the array/group
root2 = zarr.open('data.zarr')
print(root2.attrs['project'])  # Set on the root group above
```

**Important**: Attributes must be JSON-serializable (strings, numbers, lists, dicts, booleans, null).
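
Because attributes round-trip through JSON, a quick pre-check avoids a failed assignment; `is_json_attr` below is a hypothetical convenience, not part of the Zarr API:

```python
import json

def is_json_attr(value) -> bool:
    """True if value survives a JSON round-trip, i.e. is safe as an attribute."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False

print(is_json_attr({'units': 'K', 'version': 2.1}))  # True
print(is_json_attr({1, 2, 3}))                       # False: sets are not JSON
```

NumPy integer scalars and arrays fail this check; convert them with `.item()` or `.tolist()` before assigning.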

Integration with NumPy, Dask, and Xarray

NumPy Integration

Zarr arrays implement the NumPy array interface:

```python
import numpy as np
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Use NumPy functions directly
result = np.sum(z, axis=0)  # NumPy operates on the Zarr array
mean = np.mean(z[:100, :100])

# Convert to NumPy array
numpy_array = z[:]  # Loads entire array into memory
```

Dask Integration

Dask provides lazy, parallel computation on Zarr arrays:

```python
import dask.array as da
import zarr

# Create large Zarr array
z = zarr.open('data.zarr', mode='w', shape=(100000, 100000),
              chunks=(1000, 1000), dtype='f4')

# Load as Dask array (lazy, no data loaded)
dask_array = da.from_zarr('data.zarr')

# Perform computations (parallel, out-of-core)
result = dask_array.mean(axis=0).compute()

# Write Dask array to Zarr
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
```

**Benefits**:
- Process datasets larger than memory
- Automatic parallel computation across chunks
- Efficient I/O with chunked storage

Xarray Integration

Xarray provides labeled, multidimensional arrays with a Zarr backend:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Open Zarr store as an Xarray Dataset (lazy loading)
ds = xr.open_zarr('data.zarr')

# Dataset includes coordinates and metadata
print(ds)

# Access variables
temperature = ds['temperature']

# Perform labeled operations
subset = ds.sel(time='2024-01', lat=slice(30, 60))

# Write Xarray Dataset to Zarr
ds.to_zarr('output.zarr')

# Create from scratch with coordinates
ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], data),
        'precipitation': (['time', 'lat', 'lon'], data2),
    },
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1),
    },
)
ds.to_zarr('climate_data.zarr')
```

**Benefits**:
- Named dimensions and coordinates
- Label-based indexing and selection
- Integration with pandas for time series
- NetCDF-like interface familiar to climate/geospatial scientists

Parallel Computing and Synchronization

Thread-Safe Operations

```python
import zarr
from zarr import ThreadSynchronizer

# For multi-threaded writes
synchronizer = ThreadSynchronizer()
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple threads
# (required when writes span chunk boundaries)
```

Process-Safe Operations

```python
import zarr
from zarr import ProcessSynchronizer

# For multi-process writes
synchronizer = ProcessSynchronizer('sync_data.sync')
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple processes
```

**Note**:
- Concurrent reads require no synchronization
- Synchronization is only needed for writes that may span chunk boundaries
- Processes or threads writing to separate chunks need no synchronization
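
One way to avoid synchronizers entirely is to hand each worker a chunk-aligned slice, so no two workers ever touch the same chunk. A sketch of the partitioning arithmetic (`chunk_aligned_ranges` is a hypothetical helper, not a Zarr API):

```python
def chunk_aligned_ranges(n_rows: int, chunk_rows: int, n_workers: int):
    """Split [0, n_rows) into per-worker ranges that start on chunk boundaries."""
    n_chunks = -(-n_rows // chunk_rows)     # ceiling division
    per_worker = -(-n_chunks // n_workers)  # chunks assigned per worker
    ranges = []
    for w in range(n_workers):
        start = w * per_worker * chunk_rows
        stop = min((w + 1) * per_worker * chunk_rows, n_rows)
        if start < stop:
            ranges.append((start, stop))
    return ranges

print(chunk_aligned_ranges(10_000, 1_000, 4))
# [(0, 3000), (3000, 6000), (6000, 9000), (9000, 10000)]
```

Each worker then writes `z[start:stop, :]` for its own range; since every range begins and ends on a chunk boundary, the writes never contend for a chunk.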

Consolidated Metadata

For hierarchical stores with many arrays, consolidate metadata into a single file to reduce I/O operations:
```python
import zarr

# After creating arrays/groups
root = zarr.group('data.zarr')
# ... create multiple arrays/groups ...

# Consolidate metadata
zarr.consolidate_metadata('data.zarr')

# Open with consolidated metadata (faster, especially on cloud storage)
root = zarr.open_consolidated('data.zarr')
```

**Benefits**:
- Reduces metadata read operations from N (one per array) to 1
- Critical for cloud storage (reduces latency)
- Speeds up `tree()` operations and group traversal

**Cautions**:
- Metadata can become stale if arrays are updated without re-consolidation
- Not suitable for frequently updated datasets
- Multi-writer scenarios may produce inconsistent reads

Performance Optimization

Checklist for Optimal Performance

  1. Chunk Size: Aim for 1-10 MB per chunk
    ```python
    # For float32: 1 MB = 262,144 elements
    chunks = (512, 512)  # 512×512×4 bytes = ~1 MB
    ```
  2. Chunk Shape: Align with access patterns
    ```python
    # Row-wise access → chunk spans columns: (small, large)
    # Column-wise access → chunk spans rows: (large, small)
    # Random access → balanced: (medium, medium)
    ```
  3. Compression: Choose based on workload
    ```python
    # Interactive/fast: BloscCodec(cname='lz4')
    # Balanced: BloscCodec(cname='zstd', clevel=5)
    # Maximum compression: GzipCodec(level=9)
    ```
  4. Storage Backend: Match to environment
    ```python
    # Local: LocalStore (default)
    # Cloud: S3Map/GCSMap with consolidated metadata
    # Temporary: MemoryStore
    ```
  5. Sharding: Use for large-scale datasets
    ```python
    # When you have millions of small chunks
    shards = (10 * chunk_size, 10 * chunk_size)
    ```
  6. Parallel I/O: Use Dask for large operations
    ```python
    import dask.array as da
    dask_array = da.from_zarr('data.zarr')
    result = dask_array.compute(scheduler='threads', num_workers=8)
    ```

Profiling and Debugging

```python
# Print detailed array information
print(z.info)
# Output includes:
# - Type, shape, chunks, dtype
# - Compression codec and level
# - Storage size (compressed vs uncompressed)
# - Storage location

# Check storage size
print(f"Compressed size: {z.nbytes_stored / 1e6:.2f} MB")
print(f"Uncompressed size: {z.nbytes / 1e6:.2f} MB")
print(f"Compression ratio: {z.nbytes / z.nbytes_stored:.2f}x")
```

Common Patterns and Best Practices

Pattern: Time Series Data

```python
# Store time series with time as the first dimension;
# this allows efficient appending of new time steps
z = zarr.open('timeseries.zarr', mode='a',
              shape=(0, 720, 1440),   # Start with 0 time steps
              chunks=(1, 720, 1440),  # One time step per chunk
              dtype='f4')

# Append new time steps
new_data = np.random.random((1, 720, 1440))
z.append(new_data, axis=0)
```

Pattern: Large Matrix Operations

```python
import dask.array as da

# Create large matrix in Zarr
z = zarr.open('matrix.zarr', mode='w', shape=(100000, 100000),
              chunks=(1000, 1000), dtype='f8')

# Use Dask for parallel computation
dask_z = da.from_zarr('matrix.zarr')
result = (dask_z @ dask_z.T).compute()  # Parallel matrix multiply
```

Pattern: Cloud-Native Workflow

```python
import s3fs
import zarr

# Write to S3
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)

# Create array with appropriate chunking for cloud
z = zarr.open_array(store=store, mode='w', shape=(10000, 10000),
                    chunks=(500, 500),  # ~1 MB chunks
                    dtype='f4')
z[:] = data

# Consolidate metadata for faster reads
zarr.consolidate_metadata(store)

# Read from S3 (anywhere, anytime)
store_read = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)
z_read = zarr.open_consolidated(store_read)
subset = z_read[0:100, 0:100]
```

Pattern: Format Conversion

```python
# HDF5 to Zarr
import h5py
import zarr

with h5py.File('data.h5', 'r') as h5:
    dataset = h5['dataset_name']
    z = zarr.array(dataset[:], chunks=(1000, 1000), store='data.zarr')

# NumPy to Zarr
import numpy as np
data = np.load('data.npy')
z = zarr.array(data, chunks='auto', store='data.zarr')

# Zarr to NetCDF (via Xarray)
import xarray as xr
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
```

Common Issues and Solutions

Issue: Slow Performance

Diagnosis: Check chunk size and alignment

```python
print(z.chunks)  # Are chunks an appropriate size?
print(z.info)    # Check compression ratio
```

Solutions:
  • Increase chunk size to 1-10 MB
  • Align chunks with the access pattern
  • Try different compression codecs
  • Use Dask for parallel operations

Issue: High Memory Usage

Cause: Loading the entire array or large chunks into memory

Solutions:

```python
# Don't load the entire array
# Bad:  data = z[:]

# Good: process in chunks
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    process(chunk)

# Or use Dask for automatic chunking
import dask.array as da
dask_z = da.from_zarr('data.zarr')
result = dask_z.mean().compute()  # Processes in chunks
```
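
The block loop above generalizes to a tiny generator (a hypothetical helper) that yields bounded row windows, so peak memory is capped by the window height regardless of array size:

```python
def iter_row_blocks(shape, block_rows):
    """Yield (start, stop) row windows covering shape[0] in order."""
    n_rows = shape[0]
    for start in range(0, n_rows, block_rows):
        yield start, min(start + block_rows, n_rows)

# Each window is at most block_rows tall, so memory use stays bounded
blocks = list(iter_row_blocks((2500, 1000), 1000))
print(blocks)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

For best performance, pick `block_rows` as a multiple of the array's row chunk size so each window reads whole chunks.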

Issue: Cloud Storage Latency

Solutions:

```python
# 1. Consolidate metadata
zarr.consolidate_metadata(store)
z = zarr.open_consolidated(store)

# 2. Use appropriate chunk sizes (5-100 MB for cloud)
chunks = (2000, 2000)  # Larger chunks for cloud

# 3. Enable sharding
shards = (10000, 10000)  # Groups many chunks
```

Issue: Concurrent Write Conflicts

Solution: Use synchronizers or ensure non-overlapping writes

```python
import zarr
from zarr import ProcessSynchronizer

sync = ProcessSynchronizer('sync.sync')
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)

# Or design the workflow so each process writes to separate chunks
```

Additional Resources

For detailed API documentation, advanced usage, and the latest updates:
Related Libraries: