proteinmpnn

ProteinMPNN Sequence Design

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.8+ | 3.10 |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8 GB | 16 GB (T4) |
| RAM | 8 GB | 16 GB |

How to run

First time? See Installation Guide to set up Modal and biomodals.

Option 1: Local installation (recommended)

bash
git clone https://github.com/dauparas/ProteinMPNN.git
cd ProteinMPNN

python protein_mpnn_run.py \
  --pdb_path backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1"
GPU: a T4 (16 GB) is sufficient | Throughput: ~50-100 sequences/minute

Option 2: Modal (via LigandMPNN wrapper)

bash
cd biomodals
modal run modal_ligandmpnn.py \
  --pdb-path backbone.pdb \
  --num-seq-per-target 16
Note: LigandMPNN includes ProteinMPNN functionality.

Config Schema

Core Parameters

| Parameter | Default | Range | Description |
|---|---|---|---|
| `--pdb_path` | required | path | Single PDB input |
| `--pdb_path_chains` | all | `"A B"` | Chains to design (space-separated, quoted as one argument) |
| `--out_folder` | required | path | Output directory |
| `--num_seq_per_target` | 1 | 1-1000 | Sequences per structure |
| `--sampling_temp` | `"0.1"` | `"0.0001-1.0"` | Temperature, parsed from a string; several values can be given space-separated, e.g. `"0.1 0.2"` |
| `--seed` | 0 | int | Random seed |
| `--batch_size` | 1 | 1-32 | Batch size |
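
When the script is launched from Python rather than typed into a shell, the same parameters can be assembled as an argument list. The following is a minimal sketch, not part of ProteinMPNN: `build_mpnn_command` is a hypothetical helper, and it assumes that chains and temperatures are each passed as a single whitespace-separated string.

```python
from typing import List, Optional, Sequence


def build_mpnn_command(
    pdb_path: str,
    out_folder: str,
    num_seq_per_target: int = 1,
    sampling_temps: Sequence[float] = (0.1,),
    chains: Optional[Sequence[str]] = None,
    seed: int = 0,
    batch_size: int = 1,
) -> List[str]:
    """Assemble an argv list for protein_mpnn_run.py (hypothetical helper)."""
    if not 1 <= num_seq_per_target <= 1000:
        raise ValueError("num_seq_per_target should be between 1 and 1000")
    cmd = [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_seq_per_target),
        # Temperatures travel as ONE string argument, space-separated.
        "--sampling_temp", " ".join(str(t) for t in sampling_temps),
        "--seed", str(seed),
        "--batch_size", str(batch_size),
    ]
    if chains:
        # Assumption: chains are also passed as one space-separated argument.
        cmd += ["--pdb_path_chains", " ".join(chains)]
    return cmd


cmd = build_mpnn_command("backbone.pdb", "output/", num_seq_per_target=16, chains=["A", "B"])
print(cmd)
# To execute: subprocess.run(cmd, check=True) from inside the ProteinMPNN checkout.
```

Passing a list to `subprocess.run` avoids shell quoting altogether, because each list element reaches the script as exactly one argument.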

Temperature Guide

0.1  -> Low diversity, high recovery (production; default)
0.2  -> Moderate diversity
0.3  -> Higher diversity (exploration)
0.5+ -> Very diverse, lower quality
IMPORTANT: --sampling_temp is parsed as a string. A single value works with or without shell quotes; to sample at several temperatures, pass them space-separated inside one quoted string (e.g. "0.1 0.2").

Common mistakes

Temperature Parameter

Correct:
bash
--sampling_temp "0.1"      # Single temperature (the shell strips the quotes, so they are optional here)
--sampling_temp "0.1 0.2"  # Multiple temperatures: space-separated, quoted as one argument
Wrong:
bash
--sampling_temp 0.1,0.2    # Commas are not a valid separator
--sampling_temp 0.1 0.2    # Unquoted: the shell passes 0.2 as a separate argument
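
To catch a malformed temperature string before launching a long run, it can be checked with a few lines of Python. This is a sketch under the assumption that the value is split on whitespace and each piece converted to a float; `parse_sampling_temps` is a hypothetical name, and the bounds are taken from the Range column above.

```python
from typing import List


def parse_sampling_temps(value: str, low: float = 0.0001, high: float = 1.0) -> List[float]:
    """Split a whitespace-separated temperature string and range-check each value."""
    temps = [float(item) for item in value.split()]  # float() raises ValueError on "0.1,0.2"
    if not temps:
        raise ValueError("no temperature given")
    for t in temps:
        if not low <= t <= high:
            raise ValueError(f"temperature {t} outside [{low}, {high}]")
    return temps


print(parse_sampling_temps("0.1"))
print(parse_sampling_temps("0.1 0.2"))
```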

Fixed Positions JSONL

Correct:
json
{"A": [1, 2, 3, 10, 11], "B": [5, 6]}
Wrong:
json
{"A": "1,2,3,10,11"}     # String instead of list
{A: [1, 2, 3]}           # Missing quotes on key
{"A": [1,2,3,]}          # Trailing comma
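
The three mistakes above can be detected mechanically before the file is handed to the script. The sketch below assumes the chain-to-positions layout shown in the "Correct" example and assumes positions are 1-based integers; `parse_fixed_positions` is a hypothetical helper, not a ProteinMPNN function.

```python
import json
from typing import Dict, List


def parse_fixed_positions(line: str) -> Dict[str, List[int]]:
    """Parse one JSONL line mapping chain ID -> list of residue positions."""
    data = json.loads(line)  # raises json.JSONDecodeError on unquoted keys or trailing commas
    if not isinstance(data, dict):
        raise TypeError("top level must be a JSON object")
    for chain, positions in data.items():
        if not isinstance(positions, list):
            raise TypeError(f"chain {chain!r}: expected a list, got {type(positions).__name__}")
        for pos in positions:
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(pos, bool) or not isinstance(pos, int) or pos < 1:
                raise ValueError(f"chain {chain!r}: invalid position {pos!r}")
    return data


candidates = [
    '{"A": [1, 2, 3, 10, 11], "B": [5, 6]}',
    '{"A": "1,2,3,10,11"}',
    '{A: [1, 2, 3]}',
    '{"A": [1,2,3,]}',
]
for candidate in candidates:
    try:
        parse_fixed_positions(candidate)
        print("ok     ", candidate)
    except (TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError subclass
        print("invalid", candidate, "->", type(exc).__name__)
```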

Chain Selection

Correct:
bash
--pdb_path_chains "A B"  # Space-separated, quoted as one argument
--pdb_path_chains A      # A single chain needs no quotes
Wrong:
bash
--pdb_path_chains A, B   # Unquoted space: the shell passes B as a separate argument
--pdb_path_chains A,B    # Commas are not split; read as one chain named "A,B"

Amino Acid Biases

bash
# Bias toward certain AAs (positive = favor)
--bias_AA_jsonl '{"A": {"A": 1.5, "W": -2.0}}'

# Omit specific AAs globally
--omit_AAs "CM"  # No cysteine or methionine

# Per-position omission
--omit_AA_jsonl '{"A": {"1": "C", "2": "CM"}}'

Multi-Chain Design

bash
# Design chains A and B together
--pdb_path_chains "A B"

# Tie chains (same sequence)
--tied_positions_jsonl tied.jsonl

Variants Comparison

| Variant | Use Case | Key Difference |
|---|---|---|
| ProteinMPNN | General | Original model |
| SolubleMPNN | Expression | Trained on soluble proteins |
| LigandMPNN | Small molecules | Ligand-aware context |

Output format

output/
├── seqs/
│   └── backbone.fa          # FASTA sequences
└── backbone_pdb/
    └── backbone_0001.pdb    # PDBs with designed sequence

FASTA Header Format

>backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85
MKTAYIAKQRQISFVKSHFSRQLE...
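
For downstream filtering it is convenient to turn a header of this shape into a dictionary. The following sketch assumes the comma-separated `key=value` layout shown above; `parse_header` is a hypothetical helper, and headers with a different layout would need adjusting.

```python
from typing import Dict, Union


def parse_header(header: str) -> Dict[str, Union[str, float]]:
    """Parse '>name, key=value, ...' into a dict; numeric values become floats."""
    fields = [f.strip() for f in header.lstrip(">").split(",")]
    record: Dict[str, Union[str, float]] = {"name": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        try:
            record[key] = float(value)
        except ValueError:
            record[key] = value  # keep non-numeric metadata as text
    return record


info = parse_header(">backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85")
print(info)
```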

Common workflows

Binder Sequence Design

bash
python protein_mpnn_run.py \
  --pdb_path binder_backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1" \
  --pdb_path_chains B  # Design binder chain only

Interface Redesign

bash
# Fix core, design interface
python protein_mpnn_run.py \
  --pdb_path complex.pdb \
  --out_folder output/ \
  --fixed_positions_jsonl core_positions.jsonl \
  --num_seq_per_target 32

Multi-State Design

bash
# Design for multiple conformations
python protein_mpnn_run.py \
  --pdb_path_multi state1.pdb,state2.pdb \
  --out_folder output/ \
  --num_seq_per_target 16

Sample output

Successful run

$ python protein_mpnn_run.py --pdb_path backbone.pdb --out_folder output/ --num_seq_per_target 8
Loading model weights...
Designing sequences for backbone.pdb
Generated 8 sequences in 2.3 seconds

output/seqs/backbone.fa:
>backbone_0001, score=1.234, global_score=1.189, seq_recovery=0.82
MKTAYIAKQRQISFVKSHFSRQLEERGLTKE...
>backbone_0002, score=1.198, global_score=1.156, seq_recovery=0.79
MKTAYIAKQRQISFVKSQFSRQLDERGLTKE...
What good output looks like:
  • Score: 1.0-2.0 (lower = more confident)
  • Seq recovery: 0.3-0.6 for de novo, 0.7-0.9 for redesign
  • Diverse sequences (not all identical) when temp > 0.1
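
The "diverse sequences" criterion can be put into numbers with a mean pairwise identity. This is one simple measure chosen for illustration, not something the tool reports. The function names are hypothetical, and the two inputs are only the visible prefixes of the truncated sample sequences above.

```python
from itertools import combinations
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_identity(seqs: List[str]) -> float:
    """Average identity over all sequence pairs; 1.0 means every design is identical."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


# Visible prefixes of the two sample sequences (the originals are truncated).
designs = [
    "MKTAYIAKQRQISFVKSHFSRQLEERGLTKE",
    "MKTAYIAKQRQISFVKSQFSRQLDERGLTKE",
]
print(f"mean pairwise identity: {mean_pairwise_identity(designs):.3f}")
```

A value of exactly 1.0 across a whole batch means every design is identical, which is the failure the last bullet warns about.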

Decision tree

Should I use ProteinMPNN?
├─ Have a backbone structure?
│  ├─ Yes → Continue below
│  └─ No → Use RFdiffusion first
├─ What's in the binding site?
│  ├─ Nothing / protein only → ProteinMPNN ✓
│  ├─ Small molecule / ligand → Use LigandMPNN
│  └─ Metal / cofactor → Use LigandMPNN
├─ Priority?
│  ├─ Solubility/expression → Consider SolubleMPNN
│  ├─ Speed → ProteinMPNN ✓
│  └─ AF2 optimization → Consider ColabDesign
└─ Need fixed positions?
   ├─ Yes → Use --fixed_positions_jsonl
   └─ No → ProteinMPNN ✓ (design all)
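
For use inside a pipeline script, the tree above can be written as a small function. This is only an illustrative encoding: `recommend_tools` and the string labels for `binding_site` and `priority` are hypothetical choices, and the returned advice strings are copied from the tree.

```python
from typing import List


def recommend_tools(
    has_backbone: bool,
    binding_site: str = "protein",   # "protein", "ligand", "metal" or "cofactor"
    priority: str = "speed",         # "speed", "solubility" or "af2"
    need_fixed_positions: bool = False,
) -> List[str]:
    """Walk the decision tree and collect its recommendations in order."""
    if not has_backbone:
        return ["Use RFdiffusion first"]
    advice: List[str] = []
    if binding_site in ("ligand", "metal", "cofactor"):
        advice.append("Use LigandMPNN")
    else:
        advice.append("ProteinMPNN")
    if priority == "solubility":
        advice.append("Consider SolubleMPNN")
    elif priority == "af2":
        advice.append("Consider ColabDesign")
    if need_fixed_positions:
        advice.append("Use --fixed_positions_jsonl")
    return advice


print(recommend_tools(True, binding_site="ligand", need_fixed_positions=True))
```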

Typical performance

| Campaign Size | Time (T4) | Cost (Modal) | Notes |
|---|---|---|---|
| 100 backbones × 8 seq | 15-20 min | ~$2 | Standard |
| 500 backbones × 8 seq | 1-1.5 h | ~$8 | Large campaign |
| 1000 backbones × 16 seq | 3-4 h | ~$18 | Comprehensive |

Throughput: ~50-100 sequences/minute on T4 GPU.
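
The quoted throughput gives a quick way to budget a campaign of any size. The sketch below is plain arithmetic on the 50-100 sequences/minute figure. It counts sampling time only and ignores model loading, per-backbone setup and Modal start-up, so real runs can take longer. `estimate_minutes` is a hypothetical helper.

```python
from typing import Tuple


def estimate_minutes(
    backbones: int,
    seqs_per_backbone: int,
    rate_low: float = 50.0,    # sequences per minute, pessimistic
    rate_high: float = 100.0,  # sequences per minute, optimistic
) -> Tuple[float, float]:
    """Return (best case, worst case) sampling time in minutes."""
    total = backbones * seqs_per_backbone
    return total / rate_high, total / rate_low


for backbones, seqs in [(100, 8), (500, 8), (1000, 16)]:
    best, worst = estimate_minutes(backbones, seqs)
    print(f"{backbones} x {seqs}: {best:.0f}-{worst:.0f} min")
```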

Verify

bash
# grep -c on several files prints one count per file, so concatenate first to get a single total
cat output/seqs/*.fa | grep -c "^>"  # Should match backbone_count × num_seq_per_target
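
The same check can be scripted so that a pipeline fails loudly on a short count. This is a sketch with a hypothetical helper name. The demo writes two throwaway FASTA files so that it runs anywhere; in practice the pattern would point at `output/seqs/*.fa`.

```python
import glob
import os
import tempfile


def count_fasta_headers(pattern: str) -> int:
    """Count lines starting with '>' across all files matching a glob pattern."""
    total = 0
    for path in glob.glob(pattern):
        with open(path) as handle:
            total += sum(1 for line in handle if line.startswith(">"))
    return total


# Self-contained demo: two backbones with two designed sequences each.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bb1.fa", "bb2.fa"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(">d1\nMKT\n>d2\nMKS\n")
    found = count_fasta_headers(os.path.join(tmp, "*.fa"))
    expected = 2 * 2  # backbone_count x num_seq_per_target
    print("OK" if found == expected else f"mismatch: {found} != {expected}")
```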

Troubleshooting

Low sequence diversity: Increase sampling_temp to 0.2-0.3
Poor recovery: Decrease sampling_temp to 0.1
OOM errors: Reduce batch_size
Unwanted cysteines: Use --omit_AAs "C"

Error interpretation

| Error | Cause | Fix |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Long protein or large batch | Reduce batch_size or use larger GPU |
| `KeyError: 'A'` | Chain not in PDB | Check chain IDs in your PDB file |
| `JSONDecodeError` | Invalid JSONL format | Validate JSON syntax (see Common mistakes) |
| `IndexError: list index` | Empty chain or residue list | Check PDB has atoms, not just HEADER |
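
Two of these errors (`KeyError` on a chain and `IndexError` on an empty structure) can be caught before launching by listing the chains actually present in the file. The sketch below relies on the fixed-column PDB format, where the chain ID sits in column 22 of ATOM and HETATM records. The helper names and the three-atom example structure are made up for illustration.

```python
from typing import List


def chains_in_pdb(pdb_text: str) -> List[str]:
    """Return chain IDs (column 22 of ATOM/HETATM records) in order of first appearance."""
    chains: List[str] = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chain = line[21]
            if chain not in chains:
                chains.append(chain)
    return chains


def missing_chains(pdb_text: str, requested: List[str]) -> List[str]:
    """Return the requested chain IDs that do not occur in the structure."""
    present = chains_in_pdb(pdb_text)
    return [c for c in requested if c not in present]


# Made-up three-atom structure, for illustration only.
example = (
    "HEADER    TEST\n"
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N\n"
    "ATOM      2  CA  MET A   1      12.560  13.300   2.200  1.00 20.00           C\n"
    "ATOM      3  N   GLY B   1      21.104  23.207   5.100  1.00 20.00           N\n"
)
print(chains_in_pdb(example))
print(missing_chains(example, ["A", "C"]))
```

An empty list from `chains_in_pdb` indicates a file with no atom records, which is the situation described in the last table row.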

Next: Structure prediction for validation → protein-qc for filtering.