# ESM2 Protein Language Model
## Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.8+ | 3.10 |
| PyTorch | 1.10+ | 2.0+ |
| CUDA | 11.0+ | 11.7+ |
| GPU VRAM | 8GB | 24GB (A10G) |
| RAM | 16GB | 32GB |
## How to run
First time? See Installation Guide to set up Modal and biomodals.
### Option 1: Modal
```bash
cd biomodals
modal run modal_esm2_predict_masked.py \
  --input-faa sequences.fasta \
  --out-dir embeddings/
```

GPU: A10G (24GB) | Timeout: 300s default
### Option 2: Python API (recommended)
```python
import torch
import esm

# Load model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model = model.eval().cuda()

# Process sequences
data = [("seq1", "MKTAYIAKQRQISFVK...")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    results = model(batch_tokens.cuda(), repr_layers=[33])

# Get embeddings
embeddings = results["representations"][33]
```
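For clustering or diversity analysis, the per-residue embeddings are usually collapsed to one fixed-size vector per sequence. A minimal sketch, assuming the `(batch, tokens, 1280)` tensor returned above and that the batch converter prepends a BOS token and appends an EOS token (`mean_pool` is an illustrative helper, not part of the esm package):

```python
import torch

def mean_pool(embeddings, seq_lens):
    """Mean-pool per-residue embeddings into one vector per sequence.

    Token position 0 is BOS and position len+1 is EOS, so only
    positions 1..len are averaged for each sequence.
    """
    pooled = []
    for i, n in enumerate(seq_lens):
        pooled.append(embeddings[i, 1 : n + 1].mean(dim=0))
    return torch.stack(pooled)
```

With the variables above: `seq_embeddings = mean_pool(embeddings, [len(s) for _, s in data])`.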
## Key parameters
### ESM2 Models
| Model | Parameters | Speed | Quality |
|---|---|---|---|
| esm2_t6_8M | 8M | Fastest | Fast screening |
| esm2_t12_35M | 35M | Fast | Good |
| esm2_t33_650M | 650M | Medium | Better |
| esm2_t36_3B | 3B | Slow | Best |
## Output format

```
embeddings/
├── embeddings.npy    # (N, 1280) array
├── pll_scores.csv    # PLL for each sequence
└── metadata.json     # Sequence info
```
## Sample output
### Successful run
```
$ modal run modal_esm2_predict_masked.py --input-faa designs.fasta
[INFO] Loading ESM2-650M model...
[INFO] Processing 100 sequences...
[INFO] Computing pseudo-log-likelihood...

embeddings/pll_scores.csv:
sequence_id,pll,pll_normalized,length
design_0,-0.82,0.15,78
design_1,-0.95,0.08,85
design_2,-1.23,-0.12,72
...

Summary:
  Mean PLL: -0.91
  Sequences with PLL > 0: 42/100 (42%)
```

What good output looks like:
- PLL_normalized > 0.0 (more natural-like)
- Embeddings shape: (N, 1280) for the 650M model
- Higher PLL = more natural sequence
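A run like this is typically followed by a QC pass over `pll_scores.csv`. A hedged sketch using only the standard library (the `filter_designs` helper and the 0.0 default threshold are illustrative, not part of the pipeline):

```python
import csv

def filter_designs(csv_path, threshold=0.0):
    """Return the sequence IDs whose normalized PLL clears the QC threshold."""
    passing = []
    with open(csv_path) as fh:
        for row in csv.DictReader(fh):
            if float(row["pll_normalized"]) > threshold:
                passing.append(row["sequence_id"])
    return passing
```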
## Decision tree
```
Should I use ESM2?
│
├─ What do you need?
│   ├─ Sequence plausibility score → ESM2 PLL ✓
│   ├─ Embeddings for clustering → ESM2 ✓
│   ├─ Variant effect prediction → ESM2 ✓
│   └─ Structure prediction → Use ESMFold
│
├─ What model size?
│   ├─ Fast screening → esm2_t12_35M
│   ├─ Standard use → esm2_t33_650M ✓
│   └─ Best quality → esm2_t36_3B
│
└─ Use case?
    ├─ QC filtering → PLL > 0.0 threshold
    ├─ Diversity analysis → Mean-pooled embeddings
    └─ Mutation scanning → Per-position log-odds
```
## PLL interpretation
| Normalized PLL | Interpretation |
|---|---|
| > 0.2 | Very natural sequence |
| 0.0 - 0.2 | Good, natural-like |
| -0.5 - 0.0 | Acceptable |
| < -0.5 | May be unnatural |
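The masked pseudo-log-likelihood behind this table is obtained by masking one position at a time and scoring the true residue under the model. A minimal, unoptimized sketch against the fair-esm API (one forward pass per position; real implementations batch the masked copies, and this helper is illustrative rather than the Modal script's exact code):

```python
import torch

def pseudo_log_likelihood(model, alphabet, sequence, device="cpu"):
    """Length-normalized PLL: mean log-probability of each true residue
    when that position is replaced by the mask token."""
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("query", sequence)])
    tokens = tokens.to(device)
    total = 0.0
    # residues sit at token positions 1..L (0 is BOS, L+1 is EOS)
    for i in range(1, len(sequence) + 1):
        masked = tokens.clone()
        masked[0, i] = alphabet.mask_idx
        with torch.no_grad():
            logits = model(masked)["logits"]
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[tokens[0, i]].item()
    return total / len(sequence)
```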
## Typical performance
| Campaign Size | Time (A10G) | Cost (Modal) | Notes |
|---|---|---|---|
| 100 sequences | 5-10 min | ~$1 | Quick screen |
| 1000 sequences | 30-60 min | ~$5 | Standard |
| 5000 sequences | 2-3h | ~$20 | Large batch |
Throughput: ~100-200 sequences/minute with the 650M model.
## Verify

```bash
wc -l embeddings/pll_scores.csv  # Should match input + 1 (header)
```
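Beyond the row count, the embedding array's shape can be checked directly. A small sketch assuming numpy is installed and the 650M model's 1280-dim output (the `check_embeddings` helper is illustrative):

```python
import numpy as np

def check_embeddings(path, n_sequences, dim=1280):
    """Confirm the saved array has one dim-sized row per input sequence."""
    emb = np.load(path)
    assert emb.shape == (n_sequences, dim), f"unexpected shape {emb.shape}"
    return emb.shape
```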
## Troubleshooting
- **OOM errors**: Use a smaller model or batch sequences
- **Slow processing**: Use esm2_t12_35M for speed
- **Low PLL scores**: May indicate unusual/designed sequences
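"Batch sequences" can be as simple as greedily packing records under a residue budget so each forward pass fits in VRAM. A hedged sketch (the `batch_sequences` helper and the 4096-token default are assumptions to tune per GPU):

```python
def batch_sequences(records, max_tokens=4096):
    """Greedily pack (id, sequence) records into batches whose total
    residue count stays under max_tokens, preserving input order."""
    batches, current, size = [], [], 0
    for name, seq in records:
        if current and size + len(seq) > max_tokens:
            batches.append(current)
            current, size = [], 0
        current.append((name, seq))
        size += len(seq)
    if current:
        batches.append(current)
    return batches
```

A sequence longer than the budget still gets its own single-item batch rather than being dropped.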
## Error interpretation
| Cause | Fix |
|---|---|
| Sequence too long or large batch | Reduce batch size |
| Wrong layer requested | Use layer 33 for the 650M model |
| Invalid amino acid | Check for non-standard AAs |
Next: Structure prediction with chai or boltz → protein-qc for filtering.