proteinmpnn

ProteinMPNN Sequence Design

Prerequisites

Requirement	Minimum	Recommended
Python	3.8+	3.10
CUDA	11.0+	11.7+
GPU VRAM	8GB	16GB (T4)
RAM	8GB	16GB

How to run

First time? See Installation Guide to set up Modal and biomodals.

Option 1: Local installation (recommended)

bash

git clone https://github.com/dauparas/ProteinMPNN.git
cd ProteinMPNN

python protein_mpnn_run.py \
  --pdb_path backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1"

GPU: T4 (16GB) sufficient | Time: ~50-100 sequences/minute
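
When many backbones need the same settings, the command above can be assembled programmatically. The following is a minimal sketch, not part of ProteinMPNN itself: `build_mpnn_command` is a hypothetical helper, and it assumes the script is run from inside the cloned repository with the flags exactly as shown above.

```python
import shlex


def build_mpnn_command(pdb_path, out_folder, num_seq=16, sampling_temp="0.1"):
    """Return the protein_mpnn_run.py invocation as a list of string arguments."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", str(pdb_path),
        "--out_folder", str(out_folder),
        "--num_seq_per_target", str(num_seq),
        # Always converted to a string before it reaches the command line.
        "--sampling_temp", str(sampling_temp),
    ]


cmd = build_mpnn_command("backbone.pdb", "output/")
print(shlex.join(cmd))  # shell-ready form of the same command

# To execute it (from inside the ProteinMPNN directory):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues entirely, because each value reaches the script as one string.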

Option 2: Modal (via LigandMPNN wrapper)

bash

cd biomodals
modal run modal_ligandmpnn.py \
  --pdb-path backbone.pdb \
  --num-seq-per-target 16

Note: LigandMPNN includes ProteinMPNN functionality.

Config Schema

Core Parameters

Parameter	Default	Range	Description
`--pdb_path`	required	path	Single PDB input
`--pdb_path_chains`	all	A,B	Chains to design (comma-sep)
`--out_folder`	required	path	Output directory
`--num_seq_per_target`	1	1-1000	Sequences per structure
`--sampling_temp`	"0.1"	"0.0001-1.0"	Temperature (string!)
`--seed`	0	int	Random seed
`--batch_size`	1	1-32	Batch size

Temperature Guide

0.1  -> Low diversity, high recovery (production; default)
0.2  -> Moderate diversity
0.3  -> Higher diversity (exploration)
0.5+ -> Very diverse, lower quality

IMPORTANT: Temperature must be passed as a string, not float.
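
A small helper can turn numeric temperatures into the string form before they are handed to the script. This is a sketch under two assumptions: the 0.0001-1.0 range comes from the Core Parameters table, and several temperatures are joined with spaces into one string (this matches how the upstream script is understood to split the value, but check it against your version). `format_sampling_temp` is a hypothetical name.

```python
def format_sampling_temp(temps):
    """Return one string holding one or more temperatures, space-separated."""
    if isinstance(temps, (int, float, str)):
        temps = [temps]  # allow a single value as well as a list
    values = [float(t) for t in temps]
    for value in values:
        # Range taken from the Core Parameters table.
        if not 0.0001 <= value <= 1.0:
            raise ValueError(f"temperature {value} is outside 0.0001-1.0")
    return " ".join(f"{value:g}" for value in values)


print(format_sampling_temp(0.1))
print(format_sampling_temp([0.1, 0.2, 0.3]))
```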

Common mistakes

Temperature Parameter

✅ Correct:

bash

--sampling_temp "0.1"    # String with quotes

❌ Wrong:

bash

--sampling_temp 0.1      # Float without quotes - may cause errors
--sampling_temp 0.1,0.2  # Comma-separated temps are not parsed; pass one space-separated string, e.g. "0.1 0.2"

Fixed Positions JSONL

✅ Correct:

json

{"A": [1, 2, 3, 10, 11], "B": [5, 6]}

❌ Wrong:

json

{"A": "1,2,3,10,11"}     # String instead of list
{A: [1, 2, 3]}           # Missing quotes on key
{"A": [1,2,3,]}          # Trailing comma
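
The three mistakes above can be caught before a run with a short check. The sketch below assumes the flat chain-to-list layout shown in the correct example; `parse_fixed_positions` is a hypothetical helper, not a ProteinMPNN function.

```python
import json


def parse_fixed_positions(line):
    """Parse one JSON line and check it maps chain IDs to lists of integers."""
    # json.loads raises json.JSONDecodeError for unquoted keys and trailing commas.
    data = json.loads(line)
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object keyed by chain ID")
    for chain, positions in data.items():
        if not isinstance(positions, list):
            raise ValueError(f"chain {chain!r}: positions must be a list, not a string")
        if not all(isinstance(p, int) and not isinstance(p, bool) for p in positions):
            raise ValueError(f"chain {chain!r}: every position must be an integer")
    return data


print(parse_fixed_positions('{"A": [1, 2, 3, 10, 11], "B": [5, 6]}'))
```

`json.JSONDecodeError` is a subclass of `ValueError`, so a single `except ValueError` covers both syntax errors and the structural checks.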

Chain Selection

✅ Correct:

bash

--pdb_path_chains A,B    # No spaces

❌ Wrong:

bash

--pdb_path_chains A, B   # Space after comma
--pdb_path_chains "A,B"  # Quotes may cause issues

Amino Acid Biases

bash

# Bias toward certain AAs (positive = favor)
--bias_AA_jsonl '{"A": {"A": 1.5, "W": -2.0}}'

# Omit specific AAs globally
--omit_AAs "CM"  # No cysteine or methionine

# Per-position omission
--omit_AA_jsonl '{"A": {"1": "C", "2": "CM"}}'
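
Depending on the script version, the `*_jsonl` flags may expect a path to a file rather than an inline JSON string. As a hedged sketch, the dictionaries above can be written to disk with one JSON object per line; `write_jsonl` and the file names are hypothetical, and a temporary directory is used here only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write each record as one JSON object per line and return the path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    bias_file = write_jsonl(Path(tmp) / "bias.jsonl", [{"A": {"A": 1.5, "W": -2.0}}])
    omit_file = write_jsonl(Path(tmp) / "omit.jsonl", [{"A": {"1": "C", "2": "CM"}}])
    print(bias_file.read_text(encoding="utf-8"), end="")
    print(omit_file.read_text(encoding="utf-8"), end="")
```

Generating the files with `json.dumps` guarantees quoted keys and no trailing commas.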

Multi-Chain Design

bash

# Design chains A and B together
--pdb_path_chains A,B

# Tie chains (same sequence)
--tied_positions_jsonl tied.jsonl

Variants Comparison

Variant	Use Case	Key Difference
ProteinMPNN	General	Original model
SolubleMPNN	Expression	Trained on soluble proteins
LigandMPNN	Small molecules	Ligand-aware context

Output format

output/
├── seqs/
│   └── backbone.fa          # FASTA sequences
└── backbone_pdb/
    └── backbone_0001.pdb    # PDBs with designed sequence

FASTA Header Format

>backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85
MKTAYIAKQRQISFVKSHFSRQLE...
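
For downstream filtering it helps to turn a header of this shape into structured data. The following is a minimal sketch that assumes the comma-separated `key=value` layout shown above; real headers may carry extra fields, which this hypothetical `parse_header` keeps as strings when they are not numeric.

```python
def parse_header(header):
    """Split '>name, key=value, ...' into a name and a dict of fields."""
    parts = [part.strip() for part in header.lstrip(">").split(",")]
    name, fields = parts[0], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if not sep:
            continue  # skip anything that is not key=value
        try:
            fields[key] = float(value)
        except ValueError:
            fields[key] = value  # keep non-numeric values as text
    return name, fields


name, fields = parse_header(
    ">backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85"
)
print(name, fields)
```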

Common workflows

Binder Sequence Design

bash

python protein_mpnn_run.py \
  --pdb_path binder_backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1" \
  --pdb_path_chains B  # Design binder chain only

Interface Redesign

bash

# Fix core, design interface
python protein_mpnn_run.py \
  --pdb_path complex.pdb \
  --out_folder output/ \
  --fixed_positions_jsonl core_positions.jsonl \
  --num_seq_per_target 32

Multi-State Design

bash

# Design for multiple conformations
python protein_mpnn_run.py \
  --pdb_path_multi state1.pdb,state2.pdb \
  --out_folder output/ \
  --num_seq_per_target 16

Sample output

Successful run

$ python protein_mpnn_run.py --pdb_path backbone.pdb --out_folder output/ --num_seq_per_target 8
Loading model weights...
Designing sequences for backbone.pdb
Generated 8 sequences in 2.3 seconds

output/seqs/backbone.fa:
>backbone_0001, score=1.234, global_score=1.189, seq_recovery=0.82
MKTAYIAKQRQISFVKSHFSRQLEERGLTKE...
>backbone_0002, score=1.198, global_score=1.156, seq_recovery=0.79
MKTAYIAKQRQISFVKSQFSRQLDERGLTKE...

What good output looks like:

Score: 1.0-2.0 (lower = more confident)
Seq recovery: 0.3-0.6 for de novo, 0.7-0.9 for redesign
Diverse sequences (not all identical) when temp > 0.1
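
These criteria can be applied automatically once sequences and scores have been read from the FASTA file. The sketch below is illustrative only: `summarize_designs`, the record layout, and the 2.0 cutoff (the upper end of the score range quoted above) are assumptions, not part of ProteinMPNN.

```python
def summarize_designs(designs, max_score=2.0):
    """Rank designs by score (lower = more confident) and report diversity.

    designs: list of dicts with 'seq' (str) and 'score' (float).
    """
    kept = sorted(
        (d for d in designs if d["score"] <= max_score),
        key=lambda d: d["score"],
    )
    n_unique = len({d["seq"] for d in designs})
    return {
        "kept": kept,                # passing designs, best score first
        "n_unique": n_unique,        # number of distinct sequences
        "all_identical": len(designs) > 1 and n_unique == 1,
    }


designs = [
    {"seq": "MKTAYIAKQRQISFVKSHFSRQLEERGLTKE", "score": 1.234},
    {"seq": "MKTAYIAKQRQISFVKSQFSRQLDERGLTKE", "score": 1.198},
]
summary = summarize_designs(designs)
print(summary["n_unique"], summary["kept"][0]["score"])
```

If `all_identical` is true at a temperature above 0.1, that is a hint to raise `--sampling_temp`.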

Decision tree

Should I use ProteinMPNN?
│
├─ Have a backbone structure?
│  ├─ Yes → Continue below
│  └─ No → Use RFdiffusion first
│
├─ What's in the binding site?
│  ├─ Nothing / protein only → ProteinMPNN ✓
│  ├─ Small molecule / ligand → Use LigandMPNN
│  └─ Metal / cofactor → Use LigandMPNN
│
├─ Priority?
│  ├─ Solubility/expression → Consider SolubleMPNN
│  ├─ Speed → ProteinMPNN ✓
│  └─ AF2 optimization → Consider ColabDesign
│
└─ Need fixed positions?
   ├─ Yes → Use --fixed_positions_jsonl
   └─ No → ProteinMPNN ✓ (design all)
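
The same tree can be written as a small function, which is handy when tool selection is part of a scripted pipeline. This is only a sketch of the logic above; `choose_tool` and its string labels ("ligand", "metal", "solubility", "af2") are hypothetical names chosen for illustration.

```python
def choose_tool(has_backbone, binding_site="protein", priority="speed"):
    """Return the tool suggested by the decision tree for the given situation."""
    if not has_backbone:
        return "RFdiffusion"      # generate a backbone first
    if binding_site in {"ligand", "metal"}:
        return "LigandMPNN"       # ligand-aware context needed
    if priority == "solubility":
        return "SolubleMPNN"
    if priority == "af2":
        return "ColabDesign"
    return "ProteinMPNN"          # protein-only site, speed or no special priority


print(choose_tool(True))
print(choose_tool(True, binding_site="ligand"))
print(choose_tool(False))
```

Fixed positions do not change the choice of tool; they only add `--fixed_positions_jsonl` to the ProteinMPNN command.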

Typical performance

Campaign Size	Time (T4)	Cost (Modal)	Notes
100 backbones × 8 seq	15-20 min	~$2	Standard
500 backbones × 8 seq	1-1.5h	~$8	Large campaign
1000 backbones × 16 seq	3-4h	~$18	Comprehensive

Throughput: ~50-100 sequences/minute on T4 GPU.
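
The quoted throughput gives a quick way to size a campaign before launching it. The sketch below simply divides the total sequence count by the 50-100 sequences/minute range; `estimate_minutes` is a hypothetical helper, and the result ignores fixed overheads such as model loading, so the table's times for small campaigns come out somewhat higher.

```python
def estimate_minutes(n_backbones, seqs_per_backbone, rate_low=50, rate_high=100):
    """Return (total_sequences, fastest_minutes, slowest_minutes)."""
    total = n_backbones * seqs_per_backbone
    return total, total / rate_high, total / rate_low


for backbones, seqs in [(100, 8), (500, 8), (1000, 16)]:
    total, fastest, slowest = estimate_minutes(backbones, seqs)
    print(f"{backbones} x {seqs} = {total} sequences: {fastest:.0f}-{slowest:.0f} min")
```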

Verify

bash

cat output/seqs/*.fa | grep -c "^>"  # Total should match backbone_count × num_seq_per_target
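
The same check can be scripted so that a pipeline fails early when sequences are missing. This is a minimal sketch; `count_fasta_records` is a hypothetical helper, and the expected count of 100 × 8 is an example value, not something read from the run.

```python
import glob


def count_fasta_records(pattern):
    """Count header lines (starting with '>') across all files matching pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            total += sum(1 for line in fh if line.startswith(">"))
    return total


expected = 100 * 8  # backbone_count x num_seq_per_target (example values)
found = count_fasta_records("output/seqs/*.fa")
print(f"{found} of {expected} expected sequences found")
```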

Troubleshooting

Low sequence diversity: Increase sampling_temp to 0.2-0.3
Poor recovery: Decrease sampling_temp to 0.1
OOM errors: Reduce batch_size
Unwanted cysteines: Use --omit_AAs "C"

Error interpretation

Error	Cause	Fix
`RuntimeError: CUDA out of memory`	Long protein or large batch	Reduce batch_size or use larger GPU
`KeyError: 'A'`	Chain not in PDB	Check chain IDs in your PDB file
`JSONDecodeError`	Invalid JSONL format	Validate JSON syntax (see Common Mistakes)
`IndexError: list index`	Empty chain or residue list	Check PDB has atoms, not just HEADER
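
Two of these errors (`KeyError` on a chain ID and `IndexError` on an empty structure) can be ruled out with a pre-flight check on the PDB file. The sketch below reads the chain ID from column 22 of ATOM/HETATM records, as defined by the PDB format; `chains_in_pdb` and `check_chains` are hypothetical helpers.

```python
def chains_in_pdb(pdb_text):
    """Return the sorted chain IDs found on ATOM/HETATM records."""
    chains = set()
    for line in pdb_text.splitlines():
        # The chain ID is column 22 of the fixed-width record (index 21).
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains.add(line[21])
    return sorted(chains)


def check_chains(pdb_text, requested):
    """Raise ValueError if the file has no atoms or lacks a requested chain."""
    present = chains_in_pdb(pdb_text)
    if not present:
        raise ValueError("no ATOM records found (file may contain only a HEADER)")
    missing = [chain for chain in requested if chain not in present]
    if missing:
        raise ValueError(f"chains {missing} not in PDB; available: {present}")
    return present


# Usage (path is an example):
# with open("backbone.pdb", encoding="utf-8") as fh:
#     check_chains(fh.read(), ["A", "B"])
```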

Next: Structure prediction for validation → protein-qc for filtering.
