proteinmpnn
ProteinMPNN Sequence Design
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.8+ | 3.10 |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8GB | 16GB (T4) |
| RAM | 8GB | 16GB |
How to run
First time? See Installation Guide to set up Modal and biomodals.
Option 1: Local installation (recommended)
```bash
git clone https://github.com/dauparas/ProteinMPNN.git
cd ProteinMPNN
python protein_mpnn_run.py \
  --pdb_path backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1"
```
GPU: T4 (16GB) sufficient | Time: ~50-100 sequences/minute
Option 2: Modal (via LigandMPNN wrapper)
```bash
cd biomodals
modal run modal_ligandmpnn.py \
  --pdb-path backbone.pdb \
  --num-seq-per-target 16
```
Note: LigandMPNN includes ProteinMPNN functionality.
Config Schema
Core Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| --pdb_path | required | path | Single PDB input |
| --pdb_path_chains | all | A,B | Chains to design (comma-separated) |
| --out_folder | required | path | Output directory |
| --num_seq_per_target | 1 | 1-1000 | Sequences per structure |
| --sampling_temp | "0.1" | "0.0001-1.0" | Temperature (string!) |
| --seed | 0 | int | Random seed |
| --batch_size | 1 | 1-32 | Batch size |
Temperature Guide
```
0.1  -> Low diversity, high recovery (production; default)
0.2  -> Moderate diversity
0.3  -> Higher diversity (exploration)
0.5+ -> Very diverse, lower quality
```
IMPORTANT: Temperature must be passed as a string, not a float.
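To see why a lower temperature gives less diverse sequences, here is a minimal sketch of temperature-scaled sampling probabilities. It is a generic illustration of the idea, not ProteinMPNN's own sampling code; the function name temperature_probs and the logits are made up for the example.

```python
import math

def temperature_probs(logits, temperature):
    """Softmax of logits / temperature, stabilized by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate residues at one position.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.3, 1.0):
    probs = temperature_probs(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, the probability mass concentrates on the top-scoring residue, so repeated samples agree more often; raising it flattens the distribution and lets lower-scoring residues through.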
Common mistakes
Temperature Parameter
✅ Correct:
```bash
--sampling_temp "0.1"      # String with quotes
```
❌ Wrong:
```bash
--sampling_temp 0.1        # Float without quotes - may cause errors
--sampling_temp 0.1,0.2    # Comma-separated; pass several temps as one quoted, space-separated string: "0.1 0.2"
```
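The string requirement matters most when the script is launched from Python rather than typed into a shell, because subprocess accepts only string-like arguments and raises a TypeError on a bare float. The sketch below builds the argument list with every value converted to a string. The helper name build_mpnn_command is hypothetical; the flags are the ones used elsewhere in this document.

```python
def build_mpnn_command(pdb_path, out_folder, num_seqs=16, temperatures=(0.1,)):
    """Return an argv list for protein_mpnn_run.py; every element is a str."""
    # Several temperatures travel as ONE space-separated string argument.
    temp_arg = " ".join(str(t) for t in temperatures)
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", str(pdb_path),
        "--out_folder", str(out_folder),
        "--num_seq_per_target", str(num_seqs),
        "--sampling_temp", temp_arg,
    ]

cmd = build_mpnn_command("backbone.pdb", "output/", temperatures=(0.1, 0.2))
print(cmd)
# To launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string also sidesteps quoting problems, since each element reaches the script exactly as written.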
Fixed Positions JSONL
✅ Correct:
```json
{"A": [1, 2, 3, 10, 11], "B": [5, 6]}
```
❌ Wrong:
```json
{"A": "1,2,3,10,11"}  # String instead of list
{A: [1, 2, 3]}        # Missing quotes on key
{"A": [1,2,3,]}       # Trailing comma
```
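These mistakes are easy to catch before a run. The following is a minimal sketch of a validator using Python's standard json module. The function name validate_fixed_positions is hypothetical, and it checks only the flat chain-to-positions layout shown above; if your file nests that mapping under a PDB name, validate each inner object instead.

```python
import json

def validate_fixed_positions(line):
    """Parse one JSONL line and check it maps chain IDs to lists of ints."""
    data = json.loads(line)  # raises json.JSONDecodeError on bad syntax
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")
    for chain, positions in data.items():
        if not isinstance(positions, list):
            raise ValueError(f"chain {chain!r}: expected a list of positions")
        if not all(isinstance(p, int) and not isinstance(p, bool) for p in positions):
            raise ValueError(f"chain {chain!r}: positions must be integers")
    return data

good = '{"A": [1, 2, 3, 10, 11], "B": [5, 6]}'
print(validate_fixed_positions(good))

for bad in ['{"A": "1,2,3,10,11"}', '{A: [1, 2, 3]}', '{"A": [1,2,3,]}']:
    try:
        validate_fixed_positions(bad)
    except ValueError as err:  # JSONDecodeError is a subclass of ValueError
        print("rejected:", bad, "->", err)
```

The first bad line parses as JSON but fails the type check; the other two are rejected by the parser itself.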
Chain Selection
✅ Correct:
```bash
--pdb_path_chains A,B    # No spaces
```
❌ Wrong:
```bash
--pdb_path_chains A, B   # Space after comma
--pdb_path_chains "A,B"  # Quotes may cause issues
```
Amino Acid Biases
```bash
# Bias toward certain AAs (positive = favor)
--bias_AA_jsonl '{"A": {"A": 1.5, "W": -2.0}}'

# Omit specific AAs globally
--omit_AAs "CM"  # No cysteine or methionine

# Per-position omission
--omit_AA_jsonl '{"A": {"1": "C", "2": "CM"}}'
```
Multi-Chain Design
```bash
# Design chains A and B together
--pdb_path_chains A,B

# Tie chains (same sequence)
--tied_positions_jsonl tied.jsonl
```
Variants Comparison
| Variant | Use Case | Key Difference |
|---|---|---|
| ProteinMPNN | General | Original model |
| SolubleMPNN | Expression | Trained on soluble proteins |
| LigandMPNN | Small molecules | Ligand-aware context |
Output format
```
output/
├── seqs/
│   └── backbone.fa          # FASTA sequences
└── backbone_pdb/
    └── backbone_0001.pdb    # PDBs with designed sequence
```
FASTA Header Format
```
>backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85
MKTAYIAKQRQISFVKSHFSRQLE...
```
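For downstream filtering it helps to turn each header into structured data. This is a minimal sketch that assumes the comma-separated name, key=value layout shown above, with every value numeric. The function name parse_header is hypothetical, and headers carrying non-numeric fields would need extra handling.

```python
def parse_header(header):
    """Split '>name, key=value, ...' into (name, {key: float})."""
    fields = [f.strip() for f in header.lstrip(">").split(",")]
    name, metrics = fields[0], {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        metrics[key] = float(value)  # assumes numeric values only
    return name, metrics

name, metrics = parse_header(
    ">backbone_0001, score=1.234, global_score=1.234, seq_recovery=0.85"
)
print(name, metrics)
```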
Common workflows
Binder Sequence Design
```bash
python protein_mpnn_run.py \
  --pdb_path binder_backbone.pdb \
  --out_folder output/ \
  --num_seq_per_target 16 \
  --sampling_temp "0.1" \
  --pdb_path_chains B  # Design binder chain only
```
Interface Redesign
```bash
# Fix core, design interface
python protein_mpnn_run.py \
  --pdb_path complex.pdb \
  --out_folder output/ \
  --fixed_positions_jsonl core_positions.jsonl \
  --num_seq_per_target 32
```
Multi-State Design
```bash
# Design for multiple conformations
python protein_mpnn_run.py \
  --pdb_path_multi state1.pdb,state2.pdb \
  --out_folder output/ \
  --num_seq_per_target 16
```
Sample output
Successful run
```
$ python protein_mpnn_run.py --pdb_path backbone.pdb --out_folder output/ --num_seq_per_target 8
Loading model weights...
Designing sequences for backbone.pdb
Generated 8 sequences in 2.3 seconds

output/seqs/backbone.fa:
>backbone_0001, score=1.234, global_score=1.189, seq_recovery=0.82
MKTAYIAKQRQISFVKSHFSRQLEERGLTKE...
>backbone_0002, score=1.198, global_score=1.156, seq_recovery=0.79
MKTAYIAKQRQISFVKSQFSRQLDERGLTKE...
```
What good output looks like:
- Score: 1.0-2.0 (lower = more confident)
- Seq recovery: 0.3-0.6 for de novo, 0.7-0.9 for redesign
- Diverse sequences (not all identical) when temp > 0.1
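The first and last checks can be automated once scores and sequences are in memory. Here is a minimal sketch that picks the lowest-scoring design and counts distinct sequences. The function name summarize_designs is hypothetical, and the two sequences are the truncated prefixes from the sample output above, used only as placeholder data.

```python
def summarize_designs(designs):
    """designs: list of (score, sequence). Return (best design, unique count)."""
    best = min(designs, key=lambda d: d[0])    # lower score = more confident
    unique = len({seq for _, seq in designs})  # identical sequences collapse
    return best, unique

designs = [
    (1.234, "MKTAYIAKQRQISFVKSHFSRQLEERGLTKE"),
    (1.198, "MKTAYIAKQRQISFVKSQFSRQLDERGLTKE"),
]
best, unique = summarize_designs(designs)
print(best[0], unique)
```

A unique count of 1 across many samples suggests the temperature is too low for the diversity you want.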
Decision tree
```
Should I use ProteinMPNN?
│
├─ Have a backbone structure?
│  ├─ Yes → Continue below
│  └─ No → Use RFdiffusion first
│
├─ What's in the binding site?
│  ├─ Nothing / protein only → ProteinMPNN ✓
│  ├─ Small molecule / ligand → Use LigandMPNN
│  └─ Metal / cofactor → Use LigandMPNN
│
├─ Priority?
│  ├─ Solubility/expression → Consider SolubleMPNN
│  ├─ Speed → ProteinMPNN ✓
│  └─ AF2 optimization → Consider ColabDesign
│
└─ Need fixed positions?
   ├─ Yes → Use --fixed_positions_jsonl
   └─ No → ProteinMPNN ✓ (design all)
```
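The same choices can be written as a small function, which is handy when a pipeline has to pick a tool automatically. This is only a sketch of the tree above: the function name choose_tool and the string labels ("ligand", "metal", "solubility", "af2") are hypothetical encodings of the branches, and the fixed-positions question is left out because it changes flags rather than the tool.

```python
def choose_tool(has_backbone, binding_site="protein", priority="speed"):
    """Mirror the decision tree and return the suggested tool name."""
    if not has_backbone:
        return "RFdiffusion"      # generate a backbone first
    if binding_site in ("ligand", "metal"):
        return "LigandMPNN"       # small molecule, metal, or cofactor
    if priority == "solubility":
        return "SolubleMPNN"      # solubility / expression
    if priority == "af2":
        return "ColabDesign"      # AF2 optimization
    return "ProteinMPNN"          # protein-only site, speed priority

print(choose_tool(True))
print(choose_tool(True, binding_site="ligand"))
print(choose_tool(False))
```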
Typical performance
| Campaign Size | Time (T4) | Cost (Modal) | Notes |
|---|---|---|---|
| 100 backbones × 8 seq | 15-20 min | ~$2 | Standard |
| 500 backbones × 8 seq | 1-1.5h | ~$8 | Large campaign |
| 1000 backbones × 16 seq | 3-4h | ~$18 | Comprehensive |
Throughput: ~50-100 sequences/minute on T4 GPU.
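The throughput figure gives a quick sanity check on the campaign times. This sketch divides the total sequence count by the quoted 50-100 sequences/minute. It models sampling time only, so model loading, file I/O and Modal startup are not included, and the function name estimate_minutes is hypothetical.

```python
def estimate_minutes(n_backbones, seqs_per_backbone, seqs_per_minute):
    """Pure sampling time in minutes, ignoring model loading and I/O."""
    return n_backbones * seqs_per_backbone / seqs_per_minute

for n, k in [(100, 8), (500, 8), (1000, 16)]:
    fast = estimate_minutes(n, k, 100)  # upper end of quoted throughput
    slow = estimate_minutes(n, k, 50)   # lower end of quoted throughput
    print(f"{n} backbones x {k} seq: {fast:.0f}-{slow:.0f} min")
```

The table's times should land in or near these ranges; where they run longer, unmodeled overhead is the likely reason.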
Verify
```bash
grep -c "^>" output/seqs/*.fa  # One count per file; the total should match backbone_count × num_seq_per_target
```
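When several FASTA files are present, grep -c prints one count per file, so the total still has to be summed by hand. The sketch below does the same count in Python and returns per-file numbers that can be summed directly. The function name count_fasta_records is hypothetical, and the demo writes a throwaway file instead of reading real output. If your build also writes the input sequence as an extra record in each file, allow for that when comparing against the expected total.

```python
import tempfile
from pathlib import Path

def count_fasta_records(seq_dir):
    """Return {file name: number of '>' header lines} for every .fa file."""
    counts = {}
    for fa in sorted(Path(seq_dir).glob("*.fa")):
        with open(fa) as handle:
            counts[fa.name] = sum(1 for line in handle if line.startswith(">"))
    return counts

# Self-contained demo on a temporary directory with two records.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "backbone.fa").write_text(">s1\nMKT\n>s2\nMKA\n")
    counts = count_fasta_records(tmp)
    print(counts, sum(counts.values()))
```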
Troubleshooting
Low sequence diversity: Increase sampling_temp to 0.2-0.3
Poor recovery: Decrease sampling_temp to 0.1
OOM errors: Reduce batch_size
Unwanted cysteines: Use --omit_AAs "C"
Error interpretation
| Cause | Fix |
|---|---|
| Long protein or large batch | Reduce batch_size or use a larger GPU |
| Chain not in PDB | Check chain IDs in your PDB file |
| Invalid JSONL format | Validate JSON syntax (see Common mistakes) |
| Empty chain or residue list | Check that the PDB has atoms, not just a HEADER |
Next: Structure prediction for validation → protein-qc for filtering.