
PufferLib - High-Performance Reinforcement Learning


Overview


PufferLib is a high-performance reinforcement learning library designed for fast parallel environment simulation and training. It achieves training at millions of steps per second through optimized vectorization, native multi-agent support, and efficient PPO implementation (PuffeRL). The library provides the Ocean suite of 20+ environments and seamless integration with Gymnasium, PettingZoo, and specialized RL frameworks.

When to Use This Skill


Use this skill when:
  • Training RL agents with PPO on any environment (single or multi-agent)
  • Creating custom environments using the PufferEnv API
  • Optimizing performance for parallel environment simulation (vectorization)
  • Integrating existing environments from Gymnasium, PettingZoo, Atari, Procgen, etc.
  • Developing policies with CNN, LSTM, or custom architectures
  • Scaling RL to millions of steps per second for faster experimentation
  • Multi-agent RL with native multi-agent environment support

Core Capabilities


1. High-Performance Training (PuffeRL)


PuffeRL is PufferLib's optimized PPO+LSTM training algorithm achieving 1M-4M steps/second.
**Quick start training:**
```bash
# CLI training
puffer train procgen-coinrun --train.device cuda --train.learning-rate 3e-4

# Distributed training
torchrun --nproc_per_node=4 train.py
```

**Python training loop:**
```python
import pufferlib
from pufferlib import PuffeRL

# Create vectorized environment
env = pufferlib.make('procgen-coinrun', num_envs=256)

# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768
)

# Training loop
for iteration in range(num_iterations):
    trainer.evaluate()      # Collect rollouts
    trainer.train()         # Train on batch
    trainer.mean_and_log()  # Log results
```

**For comprehensive training guidance**, read `references/training.md` for:
- Complete training workflow and CLI options
- Hyperparameter tuning with Protein
- Distributed multi-GPU/multi-node training
- Logger integration (Weights & Biases, Neptune)
- Checkpointing and resume training
- Performance optimization tips
- Curriculum learning patterns

2. Environment Development (PufferEnv)


Create custom high-performance environments with the PufferEnv API.
**Basic environment structure:**
```python
import numpy as np
from pufferlib import PufferEnv

class MyEnvironment(PufferEnv):
    def __init__(self, buf=None):
        super().__init__(buf)

        # Define spaces
        self.observation_space = self.make_space((4,))
        self.action_space = self.make_discrete(4)

        self.reset()

    def reset(self):
        # Reset state and return initial observation
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        # Execute action, compute reward, check done
        obs = self._get_observation()
        reward = self._compute_reward()
        done = self._is_done()
        info = {}

        return obs, reward, done, info
```
**Use the template script:** `scripts/env_template.py` provides complete single-agent and multi-agent environment templates with examples of:
  • Different observation space types (vector, image, dict)
  • Action space variations (discrete, continuous, multi-discrete)
  • Multi-agent environment structure
  • Testing utilities
**For complete environment development**, read `references/environments.md` for:
  • PufferEnv API details and in-place operation patterns
  • Observation and action space definitions
  • Multi-agent environment creation
  • Ocean suite (20+ pre-built environments)
  • Performance optimization (Python to C workflow)
  • Environment wrappers and best practices
  • Debugging and validation techniques

3. Vectorization and Performance


Achieve maximum throughput with optimized parallel simulation.
**Vectorization setup:**
```python
import pufferlib

# Automatic vectorization
env = pufferlib.make('environment_name', num_envs=256, num_workers=8)
```

**Performance benchmarks:**
- Pure Python envs: 100k-500k SPS
- C-based envs: 100M+ SPS
- With training: 400k-4M total SPS


**Key optimizations:**
- Shared memory buffers for zero-copy observation passing
- Busy-wait flags instead of pipes/queues
- Surplus environments for async returns
- Multiple environments per worker
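The shared-memory idea can be sketched with the standard library and numpy. This is a simplified illustration of the mechanism, not PufferLib's actual buffer code: a "worker" view and a "trainer" view alias the same bytes, so observations cross process boundaries without serialization.

```python
import numpy as np
from multiprocessing import shared_memory

# One buffer holds observations for all envs: (num_envs, obs_dim) float32
num_envs, obs_dim = 4, 3
shm = shared_memory.SharedMemory(create=True, size=num_envs * obs_dim * 4)

# Worker view and trainer view alias the same memory
worker_obs = np.ndarray((num_envs, obs_dim), dtype=np.float32, buffer=shm.buf)
trainer_obs = np.ndarray((num_envs, obs_dim), dtype=np.float32, buffer=shm.buf)

worker_obs[2] = [1.0, 2.0, 3.0]     # worker writes an observation in place
seen = trainer_obs[2].copy()        # trainer reads it: no pipe, no pickle

shm.close()
shm.unlink()
```

In a real vectorizer the two views live in different processes that each attach to the segment by name; the data path is identical.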

**For vectorization optimization**, read `references/vectorization.md` for:
- Architecture and performance characteristics
- Worker and batch size configuration
- Serial vs multiprocessing vs async modes
- Shared memory and zero-copy patterns
- Hierarchical vectorization for large scale
- Multi-agent vectorization strategies
- Performance profiling and troubleshooting


4. Policy Development


Build policies as standard PyTorch modules with optional utilities.
**Basic policy structure:**
```python
import torch.nn as nn
from pufferlib.pytorch import layer_init

class Policy(nn.Module):
    def __init__(self, observation_space, action_space):
        super().__init__()

        obs_dim = observation_space.shape[0]
        num_actions = action_space.n

        # Encoder
        self.encoder = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)),
            nn.ReLU(),
            layer_init(nn.Linear(256, 256)),
            nn.ReLU()
        )

        # Actor and critic heads
        self.actor = layer_init(nn.Linear(256, num_actions), std=0.01)
        self.critic = layer_init(nn.Linear(256, 1), std=1.0)

    def forward(self, observations):
        features = self.encoder(observations)
        return self.actor(features), self.critic(features)
```
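As a quick shape check, the same forward pass can be mocked in numpy (illustrative only; the real policy is the torch module above, with `layer_init` replaced here by plain random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, num_actions, batch = 4, 256, 6, 32

# Random weights standing in for the initialized torch layers
W1 = rng.normal(size=(obs_dim, hidden)) * 0.1
W2 = rng.normal(size=(hidden, hidden)) * 0.1
Wa = rng.normal(size=(hidden, num_actions)) * 0.01   # actor head, small std
Wc = rng.normal(size=(hidden, 1))                    # critic head

def forward(observations):
    h = np.maximum(observations @ W1, 0.0)   # Linear + ReLU
    h = np.maximum(h @ W2, 0.0)
    return h @ Wa, h @ Wc                    # action logits, value estimate

logits, value = forward(rng.normal(size=(batch, obs_dim)))
# logits: (batch, num_actions), value: (batch, 1)
```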
**For complete policy development**, read `references/policies.md` for:
  • CNN policies for image observations
  • Recurrent policies with optimized LSTM (3x faster inference)
  • Multi-input policies for complex observations
  • Continuous action policies
  • Multi-agent policies (shared vs independent parameters)
  • Advanced architectures (attention, residual)
  • Observation normalization and gradient clipping
  • Policy debugging and testing

5. Environment Integration


Seamlessly integrate environments from popular RL frameworks.
**Gymnasium integration:**
```python
import gymnasium as gym
import pufferlib

# Wrap Gymnasium environment
gym_env = gym.make('CartPole-v1')
env = pufferlib.emulate(gym_env, num_envs=256)

# Or use make directly
env = pufferlib.make('gym-CartPole-v1', num_envs=256)
```

**PettingZoo multi-agent:**
```python
# Multi-agent environment
env = pufferlib.make('pettingzoo-knights-archers-zombies', num_envs=128)
```

**Supported frameworks:**
- Gymnasium / OpenAI Gym
- PettingZoo (parallel and AEC)
- Atari (ALE)
- Procgen
- NetHack / MiniHack
- Minigrid
- Neural MMO
- Crafter
- GPUDrive
- MicroRTS
- Griddly
- And more...

**For integration details**, read `references/integration.md` for:
- Complete integration examples for each framework
- Custom wrappers (observation, reward, frame stacking, action repeat)
- Space flattening and unflattening
- Environment registration
- Compatibility patterns
- Performance considerations
- Integration debugging
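Space flattening and unflattening, listed above, can be sketched with numpy alone. The `flatten`/`unflatten` helpers here are hypothetical illustrations, not PufferLib's actual functions: a dict observation is packed into one vector in a fixed key order and reconstructed from remembered shapes.

```python
import numpy as np

# A dict observation with an image and a vector component
obs = {
    'image': np.arange(12, dtype=np.float32).reshape(2, 2, 3),
    'vector': np.array([9.0, 8.0], dtype=np.float32),
}
shapes = {k: v.shape for k, v in obs.items()}  # remembered for unflattening

def flatten(obs):
    # Fixed key order so every env packs components identically
    return np.concatenate([obs[k].ravel() for k in sorted(obs)])

def unflatten(flat, shapes):
    out, i = {}, 0
    for k in sorted(shapes):
        n = int(np.prod(shapes[k]))
        out[k] = flat[i:i + n].reshape(shapes[k])
        i += n
    return out

flat = flatten(obs)            # shape (14,): 12 image values + 2 vector values
restored = unflatten(flat, shapes)
```

Flattening lets heterogeneous observations share one contiguous buffer; the policy (or a multi-input head) unflattens on the other side.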

Quick Start Workflow


For Training Existing Environments


  1. Choose an environment from the Ocean suite or a compatible framework
  2. Use `scripts/train_template.py` as a starting point
  3. Configure hyperparameters for your task
  4. Run training with the CLI or a Python script
  5. Monitor with Weights & Biases or Neptune
  6. Refer to `references/training.md` for optimization

For Creating Custom Environments


  1. Start with `scripts/env_template.py`
  2. Define observation and action spaces
  3. Implement `reset()` and `step()` methods
  4. Test the environment locally
  5. Vectorize with `pufferlib.emulate()` or `make()`
  6. Refer to `references/environments.md` for advanced patterns
  7. Optimize with `references/vectorization.md` if needed

For Policy Development


  1. Choose architecture based on observations:
    • Vector observations → MLP policy
    • Image observations → CNN policy
    • Sequential tasks → LSTM policy
    • Complex observations → Multi-input policy
  2. Use `layer_init` for proper weight initialization
  3. Follow the patterns in `references/policies.md`
  4. Test with the environment before full training

For Performance Optimization


  1. Profile current throughput (steps per second)
  2. Check vectorization configuration (num_envs, num_workers)
  3. Optimize environment code (in-place ops, numpy vectorization)
  4. Consider C implementation for critical paths
  5. Use `references/vectorization.md` for systematic optimization
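Step 1, profiling throughput, needs only the standard library. A minimal sketch against a trivial stand-in environment (the `DummyEnv` here is illustrative; substitute your own env object):

```python
import time

class DummyEnv:
    """Stand-in with a near-free step(), just to exercise the timer."""
    def step(self, action):
        return 0, 0.0, False, {}

def measure_sps(env, num_steps=100_000):
    # Steps per second over a fixed number of steps
    start = time.perf_counter()
    for _ in range(num_steps):
        env.step(0)
    elapsed = time.perf_counter() - start
    return num_steps / elapsed

sps = measure_sps(DummyEnv())
print(f'{sps:,.0f} steps/second')
```

Measure before and after each change (num_envs, num_workers, env code) so you know which knob actually moved the number.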

Resources


scripts/


train_template.py - Complete training script template with:
  • Environment creation and configuration
  • Policy initialization
  • Logger integration (WandB, Neptune)
  • Training loop with checkpointing
  • Command-line argument parsing
  • Multi-GPU distributed training setup
env_template.py - Environment implementation templates:
  • Single-agent PufferEnv example (grid world)
  • Multi-agent PufferEnv example (cooperative navigation)
  • Multiple observation/action space patterns
  • Testing utilities

references/


training.md - Comprehensive training guide:
  • Training workflow and CLI options
  • Hyperparameter configuration
  • Distributed training (multi-GPU, multi-node)
  • Monitoring and logging
  • Checkpointing
  • Protein hyperparameter tuning
  • Performance optimization
  • Common training patterns
  • Troubleshooting
environments.md - Environment development guide:
  • PufferEnv API and characteristics
  • Observation and action spaces
  • Multi-agent environments
  • Ocean suite environments
  • Custom environment development workflow
  • Python to C optimization path
  • Third-party environment integration
  • Wrappers and best practices
  • Debugging
vectorization.md - Vectorization optimization:
  • Architecture and key optimizations
  • Vectorization modes (serial, multiprocessing, async)
  • Worker and batch configuration
  • Shared memory and zero-copy patterns
  • Advanced vectorization (hierarchical, custom)
  • Multi-agent vectorization
  • Performance monitoring and profiling
  • Troubleshooting and best practices
policies.md - Policy architecture guide:
  • Basic policy structure
  • CNN policies for images
  • LSTM policies with optimization
  • Multi-input policies
  • Continuous action policies
  • Multi-agent policies
  • Advanced architectures (attention, residual)
  • Observation processing and unflattening
  • Initialization and normalization
  • Debugging and testing
integration.md - Framework integration guide:
  • Gymnasium integration
  • PettingZoo integration (parallel and AEC)
  • Third-party environments (Procgen, NetHack, Minigrid, etc.)
  • Custom wrappers (observation, reward, frame stacking, etc.)
  • Space conversion and unflattening
  • Environment registration
  • Compatibility patterns
  • Performance considerations
  • Debugging integration

Tips for Success


  1. Start simple: Begin with Ocean environments or Gymnasium integration before creating custom environments
  2. Profile early: Measure steps per second from the start to identify bottlenecks
  3. Use templates: `scripts/train_template.py` and `scripts/env_template.py` provide solid starting points
  4. Read references as needed: Each reference file is self-contained and focused on a specific capability
  5. Optimize progressively: Start with Python, profile, then optimize critical paths with C if needed
  6. Leverage vectorization: PufferLib's vectorization is key to achieving high throughput
  7. Monitor training: Use WandB or Neptune to track experiments and identify issues early
  8. Test environments: Validate environment logic before scaling up training
  9. Check existing environments: Ocean suite provides 20+ pre-built environments
  10. Use proper initialization: Always use `layer_init` from `pufferlib.pytorch` for policies

Common Use Cases


Training on Standard Benchmarks


```python
# Atari
env = pufferlib.make('atari-pong', num_envs=256)

# Procgen
env = pufferlib.make('procgen-coinrun', num_envs=256)

# Minigrid
env = pufferlib.make('minigrid-empty-8x8', num_envs=256)
```

Multi-Agent Learning


```python
# PettingZoo
env = pufferlib.make('pettingzoo-pistonball', num_envs=128)

# Shared policy for all agents
policy = create_policy(env.observation_space, env.action_space)
trainer = PuffeRL(env=env, policy=policy)
```
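With a shared policy, per-agent observations are typically stacked into one batch so a single forward pass serves every agent. A hypothetical numpy sketch of that stacking (the agent names and the policy stand-in are illustrative, not PufferLib internals):

```python
import numpy as np

# Per-agent observations, as a PettingZoo-style dict
observations = {
    'archer_0': np.ones(4, dtype=np.float32),
    'archer_1': np.zeros(4, dtype=np.float32),
    'knight_0': np.full(4, 2.0, dtype=np.float32),
}

agents = sorted(observations)                        # fixed agent ordering
batch = np.stack([observations[a] for a in agents])  # (num_agents, obs_dim)

def shared_policy(batch):
    # Stand-in for one forward pass over all agents at once
    return batch.sum(axis=1).astype(np.int64) % 3    # one action per agent

# Route the batched actions back to the agents that produced each row
actions = dict(zip(agents, shared_policy(batch)))
```

The fixed ordering matters: the same index must map to the same agent on the way in (observations) and the way out (actions).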

Custom Task Development


```python
# Create custom environment
class MyTask(PufferEnv):
    ...  # implement environment

# Vectorize and train
env = pufferlib.emulate(MyTask, num_envs=256)
trainer = PuffeRL(env=env, policy=my_policy)
```

High-Performance Optimization


```python
# Maximize throughput
env = pufferlib.make(
    'my-env',
    num_envs=1024,       # Large batch
    num_workers=16,      # Many workers
    envs_per_worker=64   # Multiple envs per worker
)
```

Installation


```bash
uv pip install pufferlib
```

Documentation
