
PufferLib - High-Performance Reinforcement Learning


Overview


PufferLib is a high-performance reinforcement learning library designed for fast parallel environment simulation and training. It achieves training at millions of steps per second through optimized vectorization, native multi-agent support, and efficient PPO implementation (PuffeRL). The library provides the Ocean suite of 20+ environments and seamless integration with Gymnasium, PettingZoo, and specialized RL frameworks.

When to Use This Skill


Use this skill when:
  • Training RL agents with PPO on any environment (single or multi-agent)
  • Creating custom environments using the PufferEnv API
  • Optimizing performance for parallel environment simulation (vectorization)
  • Integrating existing environments from Gymnasium, PettingZoo, Atari, Procgen, etc.
  • Developing policies with CNN, LSTM, or custom architectures
  • Scaling RL to millions of steps per second for faster experimentation
  • Multi-agent RL with native multi-agent environment support

Core Capabilities


1. High-Performance Training (PuffeRL)


PuffeRL is PufferLib's optimized PPO+LSTM training algorithm achieving 1M-4M steps/second.
**Quick start training:**
```bash
# CLI training
puffer train procgen-coinrun --train.device cuda --train.learning-rate 3e-4

# Distributed training
torchrun --nproc_per_node=4 train.py
```

**Python training loop:**
```python
import pufferlib
from pufferlib import PuffeRL

# Create vectorized environment
env = pufferlib.make('procgen-coinrun', num_envs=256)

# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768
)

# Training loop
for iteration in range(num_iterations):
    trainer.evaluate()      # Collect rollouts
    trainer.train()         # Train on batch
    trainer.mean_and_log()  # Log results
```

**For comprehensive training guidance**, read `references/training.md` for:
- Complete training workflow and CLI options
- Hyperparameter tuning with Protein
- Distributed multi-GPU/multi-node training
- Logger integration (Weights & Biases, Neptune)
- Checkpointing and resume training
- Performance optimization tips
- Curriculum learning patterns

2. Environment Development (PufferEnv)


Create custom high-performance environments with the PufferEnv API.
**Basic environment structure:**
```python
import numpy as np
from pufferlib import PufferEnv

class MyEnvironment(PufferEnv):
    def __init__(self, buf=None):
        super().__init__(buf)

        # Define spaces
        self.observation_space = self.make_space((4,))
        self.action_space = self.make_discrete(4)

        self.reset()

    def reset(self):
        # Reset state and return initial observation
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        # Execute action, compute reward, check done
        obs = self._get_observation()
        reward = self._compute_reward()
        done = self._is_done()
        info = {}

        return obs, reward, done, info
```
**Use the template script:** `scripts/env_template.py` provides complete single-agent and multi-agent environment templates with examples of:
  • Different observation space types (vector, image, dict)
  • Action space variations (discrete, continuous, multi-discrete)
  • Multi-agent environment structure
  • Testing utilities
**For complete environment development**, read `references/environments.md` for:
  • PufferEnv API details and in-place operation patterns
  • Observation and action space definitions
  • Multi-agent environment creation
  • Ocean suite (20+ pre-built environments)
  • Performance optimization (Python to C workflow)
  • Environment wrappers and best practices
  • Debugging and validation techniques

3. Vectorization and Performance


Achieve maximum throughput with optimized parallel simulation.
**Vectorization setup:**
```python
import pufferlib

# Automatic vectorization
env = pufferlib.make('environment_name', num_envs=256, num_workers=8)
```

**Performance benchmarks:**
- Pure Python envs: 100k-500k SPS
- C-based envs: 100M+ SPS
- With training: 400k-4M total SPS


**Key optimizations:**
- Shared memory buffers for zero-copy observation passing
- Busy-wait flags instead of pipes/queues
- Surplus environments for async returns
- Multiple environments per worker
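The shared-memory idea can be sketched with the standard library and numpy. This is a simplified illustration of the mechanism, not PufferLib's actual buffer code: a "worker" view and a "trainer" view alias the same bytes, so observations cross process boundaries without serialization.

```python
import numpy as np
from multiprocessing import shared_memory

# One buffer holds observations for all envs: (num_envs, obs_dim) float32
num_envs, obs_dim = 4, 3
shm = shared_memory.SharedMemory(create=True, size=num_envs * obs_dim * 4)

# Worker view and trainer view alias the same memory
worker_obs = np.ndarray((num_envs, obs_dim), dtype=np.float32, buffer=shm.buf)
trainer_obs = np.ndarray((num_envs, obs_dim), dtype=np.float32, buffer=shm.buf)

worker_obs[2] = [1.0, 2.0, 3.0]     # worker writes an observation in place
seen = trainer_obs[2].copy()        # trainer reads it: no pipe, no pickle

shm.close()
shm.unlink()
```

In a real vectorizer the two views live in different processes that each attach to the segment by name; the data path is identical.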

**For vectorization optimization**, read `references/vectorization.md` for:
- Architecture and performance characteristics
- Worker and batch size configuration
- Serial vs multiprocessing vs async modes
- Shared memory and zero-copy patterns
- Hierarchical vectorization for large scale
- Multi-agent vectorization strategies
- Performance profiling and troubleshooting


4. Policy Development


Build policies as standard PyTorch modules with optional utilities.
**Basic policy structure:**
```python
import torch.nn as nn
from pufferlib.pytorch import layer_init

class Policy(nn.Module):
    def __init__(self, observation_space, action_space):
        super().__init__()

        obs_dim = observation_space.shape[0]
        num_actions = action_space.n

        # Encoder
        self.encoder = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)),
            nn.ReLU(),
            layer_init(nn.Linear(256, 256)),
            nn.ReLU()
        )

        # Actor and critic heads
        self.actor = layer_init(nn.Linear(256, num_actions), std=0.01)
        self.critic = layer_init(nn.Linear(256, 1), std=1.0)

    def forward(self, observations):
        features = self.encoder(observations)
        return self.actor(features), self.critic(features)
```
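As a quick shape check, the same forward pass can be mocked in numpy (illustrative only; the real policy is the torch module above, with `layer_init` replaced here by plain random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, num_actions, batch = 4, 256, 6, 32

# Random weights standing in for the initialized torch layers
W1 = rng.normal(size=(obs_dim, hidden)) * 0.1
W2 = rng.normal(size=(hidden, hidden)) * 0.1
Wa = rng.normal(size=(hidden, num_actions)) * 0.01   # actor head, small std
Wc = rng.normal(size=(hidden, 1))                    # critic head

def forward(observations):
    h = np.maximum(observations @ W1, 0.0)   # Linear + ReLU
    h = np.maximum(h @ W2, 0.0)
    return h @ Wa, h @ Wc                    # action logits, value estimate

logits, value = forward(rng.normal(size=(batch, obs_dim)))
# logits: (batch, num_actions), value: (batch, 1)
```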
**For complete policy development**, read `references/policies.md` for:
  • CNN policies for image observations
  • Recurrent policies with optimized LSTM (3x faster inference)
  • Multi-input policies for complex observations
  • Continuous action policies
  • Multi-agent policies (shared vs independent parameters)
  • Advanced architectures (attention, residual)
  • Observation normalization and gradient clipping
  • Policy debugging and testing

5. Environment Integration


Seamlessly integrate environments from popular RL frameworks.
**Gymnasium integration:**
```python
import gymnasium as gym
import pufferlib

# Wrap Gymnasium environment
gym_env = gym.make('CartPole-v1')
env = pufferlib.emulate(gym_env, num_envs=256)

# Or use make directly
env = pufferlib.make('gym-CartPole-v1', num_envs=256)
```

**PettingZoo multi-agent:**
```python
# Multi-agent environment
env = pufferlib.make('pettingzoo-knights-archers-zombies', num_envs=128)
```

**Supported frameworks:**
- Gymnasium / OpenAI Gym
- PettingZoo (parallel and AEC)
- Atari (ALE)
- Procgen
- NetHack / MiniHack
- Minigrid
- Neural MMO
- Crafter
- GPUDrive
- MicroRTS
- Griddly
- And more...

**For integration details**, read `references/integration.md` for:
- Complete integration examples for each framework
- Custom wrappers (observation, reward, frame stacking, action repeat)
- Space flattening and unflattening
- Environment registration
- Compatibility patterns
- Performance considerations
- Integration debugging
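Space flattening and unflattening, listed above, can be sketched with numpy alone. The `flatten`/`unflatten` helpers here are hypothetical illustrations, not PufferLib's actual functions: a dict observation is packed into one vector in a fixed key order and reconstructed from remembered shapes.

```python
import numpy as np

# A dict observation with an image and a vector component
obs = {
    'image': np.arange(12, dtype=np.float32).reshape(2, 2, 3),
    'vector': np.array([9.0, 8.0], dtype=np.float32),
}
shapes = {k: v.shape for k, v in obs.items()}  # remembered for unflattening

def flatten(obs):
    # Fixed key order so every env packs components identically
    return np.concatenate([obs[k].ravel() for k in sorted(obs)])

def unflatten(flat, shapes):
    out, i = {}, 0
    for k in sorted(shapes):
        n = int(np.prod(shapes[k]))
        out[k] = flat[i:i + n].reshape(shapes[k])
        i += n
    return out

flat = flatten(obs)            # shape (14,): 12 image values + 2 vector values
restored = unflatten(flat, shapes)
```

Flattening lets heterogeneous observations share one contiguous buffer; the policy (or a multi-input head) unflattens on the other side.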

Quick Start Workflow


For Training Existing Environments


  1. Choose an environment from the Ocean suite or a compatible framework
  2. Use `scripts/train_template.py` as a starting point
  3. Configure hyperparameters for your task
  4. Run training with the CLI or a Python script
  5. Monitor with Weights & Biases or Neptune
  6. Refer to `references/training.md` for optimization

For Creating Custom Environments


  1. Start with `scripts/env_template.py`
  2. Define observation and action spaces
  3. Implement `reset()` and `step()` methods
  4. Test the environment locally
  5. Vectorize with `pufferlib.emulate()` or `make()`
  6. Refer to `references/environments.md` for advanced patterns
  7. Optimize with `references/vectorization.md` if needed

For Policy Development


  1. Choose architecture based on observations:
    • Vector observations → MLP policy
    • Image observations → CNN policy
    • Sequential tasks → LSTM policy
    • Complex observations → Multi-input policy
  2. Use `layer_init` for proper weight initialization
  3. Follow the patterns in `references/policies.md`
  4. Test with the environment before full training

For Performance Optimization


  1. Profile current throughput (steps per second)
  2. Check vectorization configuration (num_envs, num_workers)
  3. Optimize environment code (in-place ops, numpy vectorization)
  4. Consider C implementation for critical paths
  5. Use `references/vectorization.md` for systematic optimization
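Step 1, profiling throughput, needs only the standard library. A minimal sketch against a trivial stand-in environment (the `DummyEnv` here is illustrative; substitute your own env object):

```python
import time

class DummyEnv:
    """Stand-in with a near-free step(), just to exercise the timer."""
    def step(self, action):
        return 0, 0.0, False, {}

def measure_sps(env, num_steps=100_000):
    # Steps per second over a fixed number of steps
    start = time.perf_counter()
    for _ in range(num_steps):
        env.step(0)
    elapsed = time.perf_counter() - start
    return num_steps / elapsed

sps = measure_sps(DummyEnv())
print(f'{sps:,.0f} steps/second')
```

Measure before and after each change (num_envs, num_workers, env code) so you know which knob actually moved the number.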

Resources


scripts/


train_template.py - Complete training script template with:
  • Environment creation and configuration
  • Policy initialization
  • Logger integration (WandB, Neptune)
  • Training loop with checkpointing
  • Command-line argument parsing
  • Multi-GPU distributed training setup
env_template.py - Environment implementation templates:
  • Single-agent PufferEnv example (grid world)
  • Multi-agent PufferEnv example (cooperative navigation)
  • Multiple observation/action space patterns
  • Testing utilities

references/


training.md - Comprehensive training guide:
  • Training workflow and CLI options
  • Hyperparameter configuration
  • Distributed training (multi-GPU, multi-node)
  • Monitoring and logging
  • Checkpointing
  • Protein hyperparameter tuning
  • Performance optimization
  • Common training patterns
  • Troubleshooting
environments.md - Environment development guide:
  • PufferEnv API and characteristics
  • Observation and action spaces
  • Multi-agent environments
  • Ocean suite environments
  • Custom environment development workflow
  • Python to C optimization path
  • Third-party environment integration
  • Wrappers and best practices
  • Debugging
vectorization.md - Vectorization optimization:
  • Architecture and key optimizations
  • Vectorization modes (serial, multiprocessing, async)
  • Worker and batch configuration
  • Shared memory and zero-copy patterns
  • Advanced vectorization (hierarchical, custom)
  • Multi-agent vectorization
  • Performance monitoring and profiling
  • Troubleshooting and best practices
policies.md - Policy architecture guide:
  • Basic policy structure
  • CNN policies for images
  • LSTM policies with optimization
  • Multi-input policies
  • Continuous action policies
  • Multi-agent policies
  • Advanced architectures (attention, residual)
  • Observation processing and unflattening
  • Initialization and normalization
  • Debugging and testing
integration.md - Framework integration guide:
  • Gymnasium integration
  • PettingZoo integration (parallel and AEC)
  • Third-party environments (Procgen, NetHack, Minigrid, etc.)
  • Custom wrappers (observation, reward, frame stacking, etc.)
  • Space conversion and unflattening
  • Environment registration
  • Compatibility patterns
  • Performance considerations
  • Debugging integration

Tips for Success


  1. Start simple: Begin with Ocean environments or Gymnasium integration before creating custom environments
  2. Profile early: Measure steps per second from the start to identify bottlenecks
  3. Use templates: `scripts/train_template.py` and `scripts/env_template.py` provide solid starting points
  4. Read references as needed: Each reference file is self-contained and focused on a specific capability
  5. Optimize progressively: Start with Python, profile, then optimize critical paths with C if needed
  6. Leverage vectorization: PufferLib's vectorization is key to achieving high throughput
  7. Monitor training: Use WandB or Neptune to track experiments and identify issues early
  8. Test environments: Validate environment logic before scaling up training
  9. Check existing environments: Ocean suite provides 20+ pre-built environments
  10. Use proper initialization: Always use `layer_init` from `pufferlib.pytorch` for policies

Common Use Cases


Training on Standard Benchmarks


```python
# Atari
env = pufferlib.make('atari-pong', num_envs=256)

# Procgen
env = pufferlib.make('procgen-coinrun', num_envs=256)

# Minigrid
env = pufferlib.make('minigrid-empty-8x8', num_envs=256)
```

Multi-Agent Learning


```python
# PettingZoo
env = pufferlib.make('pettingzoo-pistonball', num_envs=128)

# Shared policy for all agents
policy = create_policy(env.observation_space, env.action_space)
trainer = PuffeRL(env=env, policy=policy)
```
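With a shared policy, per-agent observations are typically stacked into one batch so a single forward pass serves every agent. A hypothetical numpy sketch of that stacking (the agent names and the policy stand-in are illustrative, not PufferLib internals):

```python
import numpy as np

# Per-agent observations, as a PettingZoo-style dict
observations = {
    'archer_0': np.ones(4, dtype=np.float32),
    'archer_1': np.zeros(4, dtype=np.float32),
    'knight_0': np.full(4, 2.0, dtype=np.float32),
}

agents = sorted(observations)                        # fixed agent ordering
batch = np.stack([observations[a] for a in agents])  # (num_agents, obs_dim)

def shared_policy(batch):
    # Stand-in for one forward pass over all agents at once
    return batch.sum(axis=1).astype(np.int64) % 3    # one action per agent

# Route the batched actions back to the agents that produced each row
actions = dict(zip(agents, shared_policy(batch)))
```

The fixed ordering matters: the same index must map to the same agent on the way in (observations) and the way out (actions).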

Custom Task Development


```python
# Create custom environment
class MyTask(PufferEnv):
    ...  # implement environment

# Vectorize and train
env = pufferlib.emulate(MyTask, num_envs=256)
trainer = PuffeRL(env=env, policy=my_policy)
```

High-Performance Optimization


```python
# Maximize throughput
env = pufferlib.make(
    'my-env',
    num_envs=1024,       # Large batch
    num_workers=16,      # Many workers
    envs_per_worker=64   # Multiple envs per worker
)
```

Installation


```bash
uv pip install pufferlib
```

Documentation
