# PufferLib - High-Performance Reinforcement Learning
## Overview
PufferLib is a high-performance reinforcement learning library designed for fast parallel environment simulation and training. It achieves training at millions of steps per second through optimized vectorization, native multi-agent support, and efficient PPO implementation (PuffeRL). The library provides the Ocean suite of 20+ environments and seamless integration with Gymnasium, PettingZoo, and specialized RL frameworks.
## When to Use This Skill
Use this skill when:
- Training RL agents with PPO on any environment (single or multi-agent)
- Creating custom environments using the PufferEnv API
- Optimizing performance for parallel environment simulation (vectorization)
- Integrating existing environments from Gymnasium, PettingZoo, Atari, Procgen, etc.
- Developing policies with CNN, LSTM, or custom architectures
- Scaling RL to millions of steps per second for faster experimentation
- Multi-agent RL with native multi-agent environment support
## Core Capabilities
### 1. High-Performance Training (PuffeRL)
PuffeRL is PufferLib's optimized PPO+LSTM training algorithm achieving 1M-4M steps/second.
**Quick start training:**
```bash
# CLI training
puffer train procgen-coinrun --train.device cuda --train.learning-rate 3e-4

# Distributed training
torchrun --nproc_per_node=4 train.py
```
**Python training loop:**
```python
import pufferlib
from pufferlib import PuffeRL

# Create vectorized environment
env = pufferlib.make('procgen-coinrun', num_envs=256)

# Create trainer
trainer = PuffeRL(
    env=env,
    policy=my_policy,
    device='cuda',
    learning_rate=3e-4,
    batch_size=32768
)

# Training loop
for iteration in range(num_iterations):
    trainer.evaluate()      # Collect rollouts
    trainer.train()         # Train on batch
    trainer.mean_and_log()  # Log results
```
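The `trainer.train()` call above performs PPO updates on the collected batch. As a reference for what that update minimizes, here is a small numpy sketch of PPO's clipped surrogate loss (generic PPO math, not PuffeRL's internal implementation; the sample values are made up):

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO, returned as a loss to minimize."""
    ratio = np.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the smaller (worse) objective per sample
    return -float(np.mean(np.minimum(unclipped, clipped)))

# Toy batch: one advantaged and one disadvantaged action
advantages = np.array([1.0, -1.0])
lp_old = np.array([0.0, 0.0])
lp_new = np.array([0.5, 0.5])   # ratio ~1.65, outside the clip range
loss = ppo_clip_loss(lp_new, lp_old, advantages)
```

The clipping keeps each policy update close to the data-collecting policy, which is what makes large-batch, high-throughput training stable.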
**For comprehensive training guidance**, read `references/training.md` for:
- Complete training workflow and CLI options
- Hyperparameter tuning with Protein
- Distributed multi-GPU/multi-node training
- Logger integration (Weights & Biases, Neptune)
- Checkpointing and resume training
- Performance optimization tips
- Curriculum learning patterns

### 2. Environment Development (PufferEnv)
Create custom high-performance environments with the PufferEnv API.
**Basic environment structure:**
```python
import numpy as np
from pufferlib import PufferEnv

class MyEnvironment(PufferEnv):
    def __init__(self, buf=None):
        super().__init__(buf)
        # Define spaces
        self.observation_space = self.make_space((4,))
        self.action_space = self.make_discrete(4)
        self.reset()

    def reset(self):
        # Reset state and return the initial observation
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        # Execute action, compute reward, check done
        obs = self._get_observation()
        reward = self._compute_reward()
        done = self._is_done()
        info = {}
        return obs, reward, done, info
```

**Use the template script:** `scripts/env_template.py` provides complete single-agent and multi-agent environment templates with examples of:
- Different observation space types (vector, image, dict)
- Action space variations (discrete, continuous, multi-discrete)
- Multi-agent environment structure
- Testing utilities
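A minimal version of such a testing utility is a random-rollout smoke test: step the environment with random actions and assert that the reset/step contract holds. The sketch below uses a hypothetical stand-in environment so it runs without PufferLib installed; swap in your own PufferEnv subclass:

```python
import numpy as np

class DummyEnv:
    """Stand-in with the reset/step contract shown above (illustrative)."""
    def __init__(self, episode_len=10):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.full(4, float(action), dtype=np.float32)
        reward = 1.0
        done = self.t >= self.episode_len
        return obs, reward, done, {}

def smoke_test(env, num_actions=4, episodes=3, seed=0):
    """Run random rollouts and validate shapes, dtypes, and termination."""
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        obs = env.reset()
        assert obs.shape == (4,) and obs.dtype == np.float32
        done, steps = False, 0
        while not done:
            obs, reward, done, info = env.step(int(rng.integers(num_actions)))
            assert obs.shape == (4,) and isinstance(info, dict)
            steps += 1
            assert steps <= 1000, "episode never terminated"
    return True

ok = smoke_test(DummyEnv())
```

Running a test like this before vectorizing catches shape and dtype mismatches that are much harder to debug once 256 copies of the environment are running in parallel.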
**For complete environment development**, read `references/environments.md` for:
- PufferEnv API details and in-place operation patterns
- Observation and action space definitions
- Multi-agent environment creation
- Ocean suite (20+ pre-built environments)
- Performance optimization (Python to C workflow)
- Environment wrappers and best practices
- Debugging and validation techniques
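The "in-place operation patterns" above are central to PufferEnv performance: observations are written into a persistent buffer rather than allocated fresh each step. A generic sketch of the idea (`BufferedEnv` is illustrative, not PufferLib's exact internals):

```python
import numpy as np

class BufferedEnv:
    def __init__(self, num_agents=1, obs_dim=4):
        # Persistent observation buffer, allocated once; in PufferLib this
        # storage can be shared with the vectorization layer via `buf`
        self.observations = np.zeros((num_agents, obs_dim), dtype=np.float32)

    def step(self, action):
        # In-place writes: no per-step allocation or copying
        self.observations[0, 0] = float(action)
        self.observations[0, 1:] *= 0.99
        return self.observations[0]

env = BufferedEnv()
obs = env.step(3)
```

Because the returned observation is a view into the persistent buffer, the vectorization layer can read it without a copy.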
### 3. Vectorization and Performance
Achieve maximum throughput with optimized parallel simulation.
**Vectorization setup:**
```python
import pufferlib

# Automatic vectorization
env = pufferlib.make('environment_name', num_envs=256, num_workers=8)
```
**Performance benchmarks:**
- Pure Python envs: 100k-500k SPS
- C-based envs: 100M+ SPS
- With training: 400k-4M total SPS
**Key optimizations:**
- Shared memory buffers for zero-copy observation passing
- Busy-wait flags instead of pipes/queues
- Surplus environments for async returns
- Multiple environments per worker
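To make the zero-copy idea concrete, here is a small stdlib sketch of how a worker and a trainer can view one shared observation buffer without serialization (an illustration of the pattern, not PufferLib's actual buffer code; sizes are arbitrary):

```python
import numpy as np
from multiprocessing import shared_memory

num_envs, obs_dim = 8, 4
shm = shared_memory.SharedMemory(create=True, size=num_envs * obs_dim * 4)

# Worker-side view: environments write observations directly into the buffer
worker_obs = np.ndarray((num_envs, obs_dim), dtype=np.float32, buffer=shm.buf)
worker_obs[:] = 1.0

# Trainer-side view: same bytes, no pickling or copying
trainer_obs = np.ndarray((num_envs, obs_dim), dtype=np.float32, buffer=shm.buf)
total = float(trainer_obs.sum())   # writes are visible immediately

# Release views before freeing the segment
del worker_obs, trainer_obs
shm.close()
shm.unlink()
```

In a real setup the two views live in different processes; because both map the same memory, passing a batch of observations costs nothing regardless of batch size.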
**For vectorization optimization**, read `references/vectorization.md` for:
- Architecture and performance characteristics
- Worker and batch size configuration
- Serial vs multiprocessing vs async modes
- Shared memory and zero-copy patterns
- Hierarchical vectorization for large scale
- Multi-agent vectorization strategies
- Performance profiling and troubleshooting
### 4. Policy Development
Build policies as standard PyTorch modules with optional utilities.
**Basic policy structure:**
```python
import torch.nn as nn
from pufferlib.pytorch import layer_init

class Policy(nn.Module):
    def __init__(self, observation_space, action_space):
        super().__init__()
        # Infer sizes from gym-style spaces
        obs_dim = observation_space.shape[0]
        num_actions = action_space.n
        # Encoder
        self.encoder = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 256)),
            nn.ReLU(),
            layer_init(nn.Linear(256, 256)),
            nn.ReLU()
        )
        # Actor and critic heads
        self.actor = layer_init(nn.Linear(256, num_actions), std=0.01)
        self.critic = layer_init(nn.Linear(256, 1), std=1.0)

    def forward(self, observations):
        features = self.encoder(observations)
        return self.actor(features), self.critic(features)
```

**For complete policy development**, read `references/policies.md` for:
- CNN policies for image observations
- Recurrent policies with optimized LSTM (3x faster inference)
- Multi-input policies for complex observations
- Continuous action policies
- Multi-agent policies (shared vs independent parameters)
- Advanced architectures (attention, residual)
- Observation normalization and gradient clipping
- Policy debugging and testing
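In CleanRL-style PPO code, `layer_init` typically applies orthogonal weight initialization scaled by `std`, with zero biases. A numpy sketch of that scheme (illustrative of the math; PufferLib's actual `layer_init` operates on torch layers):

```python
import numpy as np

def orthogonal_weights(n_out, n_in, std=np.sqrt(2.0), seed=0):
    """Orthogonal init: an orthonormal basis scaled by std (CleanRL-style)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(max(n_out, n_in), min(n_out, n_in)))
    q, _ = np.linalg.qr(a)           # q has orthonormal columns
    w = q if n_out >= n_in else q.T
    return std * w[:n_out, :n_in]

W = orthogonal_weights(256, 256)
```

The small `std=0.01` on the actor head keeps the initial policy near-uniform, while `std=1.0` on the critic leaves value estimates unconstrained.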
### 5. Environment Integration
Seamlessly integrate environments from popular RL frameworks.
**Gymnasium integration:**
```python
import gymnasium as gym
import pufferlib

# Wrap a Gymnasium environment
gym_env = gym.make('CartPole-v1')
env = pufferlib.emulate(gym_env, num_envs=256)

# Or use make directly
env = pufferlib.make('gym-CartPole-v1', num_envs=256)
```

**PettingZoo multi-agent:**
```python
# Multi-agent environment
env = pufferlib.make('pettingzoo-knights-archers-zombies', num_envs=128)
```
**Supported frameworks:**
- Gymnasium / OpenAI Gym
- PettingZoo (parallel and AEC)
- Atari (ALE)
- Procgen
- NetHack / MiniHack
- Minigrid
- Neural MMO
- Crafter
- GPUDrive
- MicroRTS
- Griddly
- And more...
**For integration details**, read `references/integration.md` for:
- Complete integration examples for each framework
- Custom wrappers (observation, reward, frame stacking, action repeat)
- Space flattening and unflattening
- Environment registration
- Compatibility patterns
- Performance considerations
- Integration debugging
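"Space flattening and unflattening" refers to packing structured (e.g. dict) observations into a single flat vector for the network and recovering the structure afterwards. A generic numpy sketch of the round trip (the key names are made up):

```python
import numpy as np

def flatten_obs(obs):
    """Concatenate leaves in sorted key order so the layout is deterministic."""
    return np.concatenate([obs[k].ravel() for k in sorted(obs)])

def unflatten_obs(flat, template):
    """Invert flatten_obs using a template dict with the original shapes."""
    out, i = {}, 0
    for k in sorted(template):
        n = template[k].size
        out[k] = flat[i:i + n].reshape(template[k].shape)
        i += n
    return out

obs = {
    "position": np.array([1.0, 2.0], dtype=np.float32),
    "health": np.array([0.5], dtype=np.float32),
}
flat = flatten_obs(obs)          # shape (3,); 'health' comes first (sorted)
restored = unflatten_obs(flat, obs)
```

A fixed key ordering is the important detail: both sides of the pipeline must agree on the layout or observations will be silently scrambled.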
## Quick Start Workflow
### For Training Existing Environments
- Choose an environment from the Ocean suite or a compatible framework
- Use `scripts/train_template.py` as a starting point
- Configure hyperparameters for your task
- Run training with the CLI or a Python script
- Monitor with Weights & Biases or Neptune
- Refer to `references/training.md` for optimization
### For Creating Custom Environments
- Start with `scripts/env_template.py`
- Define observation and action spaces
- Implement `reset()` and `step()` methods
- Test the environment locally
- Vectorize with `pufferlib.emulate()` or `make()`
- Refer to `references/environments.md` for advanced patterns
- Optimize with `references/vectorization.md` if needed
### For Policy Development
- Choose an architecture based on observations:
  - Vector observations → MLP policy
  - Image observations → CNN policy
  - Sequential tasks → LSTM policy
  - Complex observations → Multi-input policy
- Use `layer_init` for proper weight initialization
- Follow the patterns in `references/policies.md`
- Test with the environment before full training
### For Performance Optimization
- Profile current throughput (steps per second)
- Check vectorization configuration (num_envs, num_workers)
- Optimize environment code (in-place ops, numpy vectorization)
- Consider a C implementation for critical paths
- Use `references/vectorization.md` for systematic optimization
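The first step above, profiling throughput, can be as simple as timing a batched step function. A tiny harness sketch (the toy step function is a stand-in for a real vectorized environment):

```python
import time
import numpy as np

def measure_sps(step_fn, num_envs, num_steps=1000):
    """Agent steps per second for a batched step function."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_envs * num_steps / elapsed

# Toy batched step: a vectorized numpy update standing in for 256 envs
state = np.zeros((256, 4), dtype=np.float32)

def toy_step():
    state[:] += 0.01   # in-place, no reallocation

sps = measure_sps(toy_step, num_envs=256)
print(f"{sps:,.0f} steps/second")
```

Measuring before and after each change (num_workers, in-place ops, C rewrite) tells you whether the optimization actually moved the bottleneck.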
## Resources

### scripts/
`train_template.py` - Complete training script template with:
- Environment creation and configuration
- Policy initialization
- Logger integration (WandB, Neptune)
- Training loop with checkpointing
- Command-line argument parsing
- Multi-GPU distributed training setup
`env_template.py` - Environment implementation templates:
- Single-agent PufferEnv example (grid world)
- Multi-agent PufferEnv example (cooperative navigation)
- Multiple observation/action space patterns
- Testing utilities
### references/
`training.md` - Comprehensive training guide:
- Training workflow and CLI options
- Hyperparameter configuration
- Distributed training (multi-GPU, multi-node)
- Monitoring and logging
- Checkpointing
- Protein hyperparameter tuning
- Performance optimization
- Common training patterns
- Troubleshooting
`environments.md` - Environment development guide:
- PufferEnv API and characteristics
- Observation and action spaces
- Multi-agent environments
- Ocean suite environments
- Custom environment development workflow
- Python to C optimization path
- Third-party environment integration
- Wrappers and best practices
- Debugging
`vectorization.md` - Vectorization optimization:
- Architecture and key optimizations
- Vectorization modes (serial, multiprocessing, async)
- Worker and batch configuration
- Shared memory and zero-copy patterns
- Advanced vectorization (hierarchical, custom)
- Multi-agent vectorization
- Performance monitoring and profiling
- Troubleshooting and best practices
`policies.md` - Policy architecture guide:
- Basic policy structure
- CNN policies for images
- LSTM policies with optimization
- Multi-input policies
- Continuous action policies
- Multi-agent policies
- Advanced architectures (attention, residual)
- Observation processing and unflattening
- Initialization and normalization
- Debugging and testing
`integration.md` - Framework integration guide:
- Gymnasium integration
- PettingZoo integration (parallel and AEC)
- Third-party environments (Procgen, NetHack, Minigrid, etc.)
- Custom wrappers (observation, reward, frame stacking, etc.)
- Space conversion and unflattening
- Environment registration
- Compatibility patterns
- Performance considerations
- Debugging integration
## Tips for Success
- **Start simple**: Begin with Ocean environments or Gymnasium integration before creating custom environments
- **Profile early**: Measure steps per second from the start to identify bottlenecks
- **Use templates**: `scripts/train_template.py` and `scripts/env_template.py` provide solid starting points
- **Read references as needed**: Each reference file is self-contained and focused on a specific capability
- **Optimize progressively**: Start with Python, profile, then optimize critical paths with C if needed
- **Leverage vectorization**: PufferLib's vectorization is key to achieving high throughput
- **Monitor training**: Use WandB or Neptune to track experiments and identify issues early
- **Test environments**: Validate environment logic before scaling up training
- **Check existing environments**: The Ocean suite provides 20+ pre-built environments
- **Use proper initialization**: Always use `layer_init` from `pufferlib.pytorch` for policies
## Common Use Cases
### Training on Standard Benchmarks
```python
# Atari
env = pufferlib.make('atari-pong', num_envs=256)

# Procgen
env = pufferlib.make('procgen-coinrun', num_envs=256)

# Minigrid
env = pufferlib.make('minigrid-empty-8x8', num_envs=256)
```
### Multi-Agent Learning
```python
# PettingZoo
env = pufferlib.make('pettingzoo-pistonball', num_envs=128)

# Shared policy for all agents
policy = create_policy(env.observation_space, env.action_space)
trainer = PuffeRL(env=env, policy=policy)
```
### Custom Task Development
```python
# Create a custom environment
class MyTask(PufferEnv):
    ...  # implement reset() and step()

# Vectorize and train
env = pufferlib.emulate(MyTask, num_envs=256)
trainer = PuffeRL(env=env, policy=my_policy)
```
### High-Performance Optimization
```python
# Maximize throughput
env = pufferlib.make(
    'my-env',
    num_envs=1024,      # Large batch
    num_workers=16,     # Many workers
    envs_per_worker=64  # Environments per worker
)
```
## Installation
```bash
uv pip install pufferlib
```

## Documentation
- Official docs: https://puffer.ai/docs.html
- GitHub: https://github.com/PufferAI/PufferLib
- Discord: Community support available