
ESM2 Protein Language Model

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.8+ | 3.10 |
| PyTorch | 1.10+ | 2.0+ |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8 GB | 24 GB (A10G) |
| RAM | 16 GB | 32 GB |
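A quick sanity check of the environment against the table above can catch setup problems before the first run. This is a minimal sketch; the version thresholds are taken from the table, and the torch import is guarded in case PyTorch is not yet installed:

```python
import importlib.util
import sys

# Check the Python interpreter against the 3.8+ minimum from the table above.
assert sys.version_info >= (3, 8), "Python 3.8+ required"

# PyTorch version and CUDA checks only make sense if torch is importable.
if importlib.util.find_spec("torch") is None:
    print("PyTorch not installed - see the Installation Guide")
else:
    import torch
    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
```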

How to run

First time? See Installation Guide to set up Modal and biomodals.

Option 1: Modal

```bash
cd biomodals
modal run modal_esm2_predict_masked.py \
  --input-faa sequences.fasta \
  --out-dir embeddings/
```

GPU: A10G (24GB) | Timeout: 300s default

Option 2: Python API (recommended)

```python
import torch
import esm
```

Load model

```python
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model = model.eval().cuda()
```

Process sequences

```python
data = [("seq1", "MKTAYIAKQRQISFVK...")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens.cuda(), repr_layers=[33])
```

Get embeddings

```python
embeddings = results["representations"][33]
```
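The `representations` tensor is per-residue; for clustering or diversity analysis you usually want one vector per sequence, which is what mean pooling gives you. A minimal sketch, using a random dummy tensor in place of real model output (real ESM tokenization adds BOS and EOS tokens around the residues, which should be excluded from the pool):

```python
import torch

# Dummy stand-in for results["representations"][33]: (batch, tokens, 1280).
# Token layout for ESM models: [BOS, residue_1 ... residue_L, EOS, padding].
reps = torch.randn(1, 10, 1280)
seq_len = 8  # actual residue count for this sequence

# Mean-pool over residue positions only: skip BOS at index 0,
# stop before EOS at index seq_len + 1.
seq_embedding = reps[0, 1 : seq_len + 1].mean(dim=0)
print(tuple(seq_embedding.shape))  # (1280,)
```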

Key parameters

ESM2 Models

| Model | Parameters | Speed | Quality |
|---|---|---|---|
| esm2_t6_8M | 8M | Fastest | Fast screening |
| esm2_t12_35M | 35M | Fast | Good |
| esm2_t33_650M | 650M | Medium | Better |
| esm2_t36_3B | 3B | Slow | Best |

Output format

```
embeddings/
├── embeddings.npy       # (N, 1280) array
├── pll_scores.csv       # PLL for each sequence
└── metadata.json        # Sequence info
```

Sample output

Successful run

```
$ modal run modal_esm2_predict_masked.py --input-faa designs.fasta
[INFO] Loading ESM2-650M model...
[INFO] Processing 100 sequences...
[INFO] Computing pseudo-log-likelihood...

embeddings/pll_scores.csv:
sequence_id,pll,pll_normalized,length
design_0,-0.82,0.15,78
design_1,-0.95,0.08,85
design_2,-1.23,-0.12,72
...

Summary:
  Mean PLL: -0.91
  Sequences with PLL > 0: 42/100 (42%)
```

What good output looks like:
  • PLL_normalized: > 0.0 (more natural-like)
  • Embeddings shape: (N, 1280) for the 650M model
  • Higher PLL = more natural sequence

Decision tree

```
Should I use ESM2?
├─ What do you need?
│  ├─ Sequence plausibility score → ESM2 PLL ✓
│  ├─ Embeddings for clustering → ESM2 ✓
│  ├─ Variant effect prediction → ESM2 ✓
│  └─ Structure prediction → Use ESMFold
├─ What model size?
│  ├─ Fast screening → esm2_t12_35M
│  ├─ Standard use → esm2_t33_650M ✓
│  └─ Best quality → esm2_t36_3B
└─ Use case?
   ├─ QC filtering → PLL > 0.0 threshold
   ├─ Diversity analysis → Mean-pooled embeddings
   └─ Mutation scanning → Per-position log-odds
```

PLL interpretation

| Normalized PLL | Interpretation |
|---|---|
| > 0.2 | Very natural sequence |
| 0.0 to 0.2 | Good, natural-like |
| -0.5 to 0.0 | Acceptable |
| < -0.5 | May be unnatural |
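The scores in this table come from masked-token prediction: each position is masked in turn, the model predicts the residue, and the log-probabilities of the true residues are averaged. The exact normalization used by the script is not shown here, so this is a simplified sketch with dummy per-position log-probabilities standing in for real model output:

```python
import numpy as np

def pseudo_log_likelihood(log_probs: np.ndarray, token_ids: np.ndarray) -> float:
    """Average log-probability of the true residue at each position.

    log_probs: (L, vocab) per-position log-probabilities; in real use,
    row i comes from masking position i and running the model.
    token_ids: (L,) index of the true residue at each position.
    """
    return float(np.mean(log_probs[np.arange(len(token_ids)), token_ids]))

# Toy example: 3 positions, 4-token vocabulary.
lp = np.log(np.array([
    [0.7, 0.1, 0.1, 0.1],    # position 0: confident in token 0
    [0.1, 0.6, 0.2, 0.1],    # position 1: confident in token 1
    [0.25, 0.25, 0.25, 0.25] # position 2: uniform (uncertain)
]))
print(pseudo_log_likelihood(lp, np.array([0, 1, 2])))
```

Higher (less negative) values mean the model finds the sequence more natural, which is what the thresholds above are quantifying.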

Typical performance

| Campaign Size | Time (A10G) | Cost (Modal) | Notes |
|---|---|---|---|
| 100 sequences | 5-10 min | ~$1 | Quick screen |
| 1000 sequences | 30-60 min | ~$5 | Standard |
| 5000 sequences | 2-3 h | ~$20 | Large batch |

Throughput: ~100-200 sequences/minute with the 650M model.


Verify

```bash
wc -l embeddings/pll_scores.csv  # Should equal the number of input sequences + 1 (header)
```

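The same check can be done in Python by comparing FASTA record count against CSV data rows. A minimal sketch using inline strings (in practice, read the real input FASTA and `pll_scores.csv`):

```python
def count_fasta_records(fasta_text: str) -> int:
    """Count sequences in FASTA text: one '>' header per record."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

def count_csv_rows(csv_text: str) -> int:
    """Count data rows, excluding the header line."""
    return max(0, len(csv_text.strip().splitlines()) - 1)

fasta = ">design_0\nMKTAYIAK\n>design_1\nMKTAYIAQ\n"
csv = ("sequence_id,pll,pll_normalized,length\n"
       "design_0,-0.82,0.15,8\n"
       "design_1,-0.95,0.08,8\n")
assert count_fasta_records(fasta) == count_csv_rows(csv)  # every input was scored
```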

Troubleshooting

  • OOM errors: use a smaller model or process sequences in batches
  • Slow processing: use esm2_t12_35M for speed
  • Low PLL scores: may indicate unusual/designed sequences
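For the OOM case, splitting the input into fixed-size chunks and running the model once per chunk keeps peak GPU memory bounded. A minimal batching sketch (the `batched` helper and batch size are illustrative, not part of the script's API):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a sequence list."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

# Toy input: 10 (label, sequence) pairs, processed 4 at a time.
data = [(f"seq{i}", "MKTAYIAK") for i in range(10)]
for chunk in batched(data, 4):
    # In real use: tokenize with batch_converter(chunk) and run the
    # model on each chunk, then concatenate the per-chunk results.
    print(len(chunk))  # prints 4, 4, 2
```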

Error interpretation

| Error | Cause | Fix |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Sequence too long or large batch | Reduce batch size |
| `KeyError: representation` | Wrong layer requested | Use layer 33 for the 650M model |
| `ValueError: sequence` | Invalid amino acid | Check for non-standard AAs |

Next: structure prediction with `chai` or `boltz`, then `protein-qc` for filtering.