torchdrug

TorchDrug

Overview

TorchDrug is a PyTorch-based machine learning toolbox for drug discovery and molecular science. It applies graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs. Supported tasks include molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning, backed by 40+ curated datasets and 20+ model architectures.

When to Use This Skill

This skill should be used when working with:

Data Types:

SMILES strings or molecular structures
Protein sequences or 3D structures (PDB files)
Chemical reactions and retrosynthesis
Biomedical knowledge graphs
Drug discovery datasets

Tasks:

Predicting molecular properties (solubility, toxicity, activity)
Protein function or structure prediction
Drug-target binding prediction
Generating new molecular structures
Planning chemical synthesis routes
Link prediction in biomedical knowledge bases
Training graph neural networks on scientific data

Libraries and Integration:

TorchDrug is the primary library
Often used with RDKit for cheminformatics
Compatible with PyTorch and PyTorch Lightning
Integrates with AlphaFold and ESM for proteins

Getting Started

Installation

```bash
uv pip install torchdrug

# Or with optional dependencies (quoted so the shell does not expand the brackets)
uv pip install "torchdrug[full]"
```

Quick Example

```python
import torch
from torchdrug import data, datasets, models, tasks

# Load molecular dataset
dataset = datasets.BBBP("~/molecule-datasets/")

# Random 80/10/10 split into train, validation, and test sets
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

# Define GNN model
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256],
    edge_input_dim=dataset.edge_feature_dim,
    batch_norm=True,
    readout="mean",
)

# Create property prediction task
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=["auroc", "auprc"],
)
# Build the prediction head and label statistics (core.Engine normally does this)
task.preprocess(train_set, valid_set, test_set)

# Train with PyTorch
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
# TorchDrug's DataLoader collates molecules into packed graph batches
train_loader = data.DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(100):
    for batch in train_loader:
        loss, metric = task(batch)  # the task returns (loss, metric dict)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Core Capabilities

1. Molecular Property Prediction

Predict chemical, physical, and biological properties of molecules from structure.

Use Cases:

Drug-likeness and ADMET properties
Toxicity screening
Quantum chemistry properties
Binding affinity prediction

Key Components:

20+ molecular datasets (BBBP, HIV, Tox21, QM9, etc.)
GNN models (GIN, GAT, SchNet)
PropertyPrediction and MultipleBinaryClassification tasks

Reference: See `references/molecular_property_prediction.md` for:

Complete dataset catalog
Model selection guide
Training workflows and best practices
Feature engineering details

2. Protein Modeling

Work with protein sequences, structures, and properties.

Use Cases:

Enzyme function prediction
Protein stability and solubility
Subcellular localization
Protein-protein interactions
Structure prediction

Key Components:

15+ protein datasets (EnzymeCommission, GeneOntology, PDBBind, etc.)
Sequence models (ESM, ProteinBERT, ProteinLSTM)
Structure models (GearNet, SchNet)
Multiple task types for different prediction levels

Reference: See `references/protein_modeling.md` for:

Protein-specific datasets
Sequence vs structure models
Pre-training strategies
Integration with AlphaFold and ESM

3. Knowledge Graph Reasoning

Predict missing links and relationships in biological knowledge graphs.

Use Cases:

Drug repurposing
Disease mechanism discovery
Gene-disease associations
Multi-hop biomedical reasoning

Key Components:

General KGs (FB15k, WN18) and biomedical (Hetionet)
Embedding models (TransE, RotatE, ComplEx)
KnowledgeGraphCompletion task

Reference: See `references/knowledge_graphs.md` for:

Knowledge graph datasets (including Hetionet with 45k biomedical entities)
Embedding model comparison
Evaluation metrics and protocols
Biomedical applications
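
The embedding models listed above score a (head, relation, tail) triple by how well the relation maps the head onto the tail. The following library-free sketch illustrates the TransE and RotatE scoring ideas on toy vectors. It is not TorchDrug's implementation: the function names `transe_score` and `rotate_score` and all embedding values are made up for illustration.

```python
import cmath
import math


def transe_score(head, relation, tail):
    """TransE idea: a true triple should satisfy head + relation ≈ tail.

    Returns the negative L1 distance, so a higher score means more plausible.
    """
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def rotate_score(head, phases, tail):
    """RotatE idea: the relation rotates each complex head coordinate by a phase.

    Returns the negative distance between the rotated head and the tail.
    """
    return -sum(
        abs(h * cmath.exp(1j * p) - t) for h, p, t in zip(head, phases, tail)
    )


# Toy 2-dimensional embeddings (hypothetical values, not from a trained model)
drug = [0.5, 1.0]
treats = [0.5, -1.0]
disease_a = [1.0, 0.0]  # exactly drug + treats
disease_b = [3.0, 2.0]

score_a = transe_score(drug, treats, disease_a)
score_b = transe_score(drug, treats, disease_b)

# A quarter-turn rotation maps 1 onto i, so this triple scores close to zero
rotated = rotate_score([1 + 0j], [math.pi / 2], [1j])
```

Under these toy values `disease_a` scores higher than `disease_b`, which is the ordering a link-prediction query would use to rank candidate tails.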

4. Molecular Generation

Generate novel molecular structures with desired properties.

Use Cases:

De novo drug design
Lead optimization
Chemical space exploration
Property-guided generation

Key Components:

Autoregressive generation
GCPN (policy-based generation)
GraphAutoregressiveFlow
Property optimization workflows

Reference: See `references/molecular_generation.md` for:

Generation strategies (unconditional, conditional, scaffold-based)
Multi-objective optimization
Validation and filtering
Integration with property prediction

5. Retrosynthesis

Predict synthetic routes from target molecules to starting materials.

Use Cases:

Synthesis planning
Route optimization
Synthetic accessibility assessment
Multi-step planning

Key Components:

USPTO-50k reaction dataset
CenterIdentification (reaction center prediction)
SynthonCompletion (reactant prediction)
End-to-end Retrosynthesis pipeline

Reference: See `references/retrosynthesis.md` for:

Task decomposition (center ID → synthon completion)
Multi-step synthesis planning
Commercial availability checking
Integration with other retrosynthesis tools

6. Graph Neural Network Models

Comprehensive catalog of GNN architectures for different data types and tasks.

Available Models:

General GNNs: GCN, GAT, GIN, RGCN, MPNN
3D-aware: SchNet, GearNet
Protein-specific: ESM, ProteinBERT, GearNet
Knowledge graph: TransE, RotatE, ComplEx, SimplE
Generative: GraphAutoregressiveFlow

Reference: See `references/models_architectures.md` for:

Detailed model descriptions
Model selection guide by task and dataset
Architecture comparisons
Implementation tips

7. Datasets

40+ curated datasets spanning chemistry, biology, and knowledge graphs.

Categories:

Molecular properties (drug discovery, quantum chemistry)
Protein properties (function, structure, interactions)
Knowledge graphs (general and biomedical)
Retrosynthesis reactions

Reference: See `references/datasets.md` for:

Complete dataset catalog with sizes and tasks
Dataset selection guide
Loading and preprocessing
Splitting strategies (random, scaffold)
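
To show what distinguishes a scaffold split from a random one, here is a minimal sketch that keeps every molecule sharing a scaffold in the same subset. It assumes the scaffold keys were computed beforehand (for example as Bemis-Murcko scaffold SMILES with RDKit). The function below is a generic illustration and may differ from TorchDrug's own splitting utilities.

```python
from collections import defaultdict


def scaffold_split(scaffolds, fractions=(0.8, 0.1, 0.1)):
    """Split sample indices so that no scaffold appears in more than one subset.

    `scaffolds` holds one scaffold key per sample, in dataset order.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold families first, so they fill the training set
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = fractions[0] * n
    valid_cap = (fractions[0] + fractions[1]) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Ten molecules: six share one ring system, two share another, two are unique
scaffolds = ["c1ccccc1"] * 6 + ["C1CCCCC1"] * 2 + ["c1ccncc1", "c1ccoc1"]
train, valid, test = scaffold_split(scaffolds)
```

Because rare scaffolds end up in validation and test, the held-out molecules are structurally unlike the training set, which is why scaffold splits give a more realistic estimate than random splits.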

Common Workflows

Workflow 1: Molecular Property Prediction

Scenario: Predict blood-brain barrier penetration for drug candidates.

Steps:

Load dataset: `datasets.BBBP()`
Choose model: GIN for molecular graphs
Define task: `PropertyPrediction` with binary classification
Train with scaffold split for realistic evaluation
Evaluate using AUROC and AUPRC

Navigation: `references/molecular_property_prediction.md` → Dataset selection → Model selection → Training
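
For readers unfamiliar with the AUROC metric used in the evaluation step, the sketch below computes it directly from its definition: the probability that a randomly chosen positive is scored above a randomly chosen negative. This is a plain-Python illustration with made-up labels and scores, not the routine TorchDrug uses internally.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four molecules (1 = penetrates the barrier)
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
result = auroc(labels, scores)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75. A value of 0.5 corresponds to random ranking.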

Workflow 2: Protein Function Prediction

Scenario: Predict enzyme function from sequence.

Steps:

Load dataset: `datasets.EnzymeCommission()`
Choose model: ESM (pre-trained) or GearNet (with structure)
Define task: `PropertyPrediction` with multi-class classification
Fine-tune pre-trained model or train from scratch
Evaluate using accuracy and per-class metrics

Navigation: `references/protein_modeling.md` → Model selection (sequence vs structure) → Pre-training strategies
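
The evaluation step above mentions accuracy and per-class metrics. As a rough illustration of why both are worth reporting, the sketch below computes overall accuracy alongside per-class recall for a multi-class problem. The helper name and the class labels are hypothetical and are not tied to any TorchDrug API.

```python
from collections import Counter


def accuracy_and_per_class_recall(targets, predictions):
    """Return overall accuracy and a {class: recall} mapping."""
    correct = Counter()
    total = Counter()
    for target, prediction in zip(targets, predictions):
        total[target] += 1
        if target == prediction:
            correct[target] += 1

    accuracy = sum(correct.values()) / len(targets)
    recall = {cls: correct[cls] / total[cls] for cls in sorted(total)}
    return accuracy, recall


# Hypothetical enzyme class labels for six proteins
targets = [1, 1, 1, 2, 2, 3]
predictions = [1, 1, 2, 2, 2, 1]
accuracy, recall = accuracy_and_per_class_recall(targets, predictions)
```

In this toy case the overall accuracy is 4/6, yet class 3 is never predicted correctly. Per-class numbers expose that kind of failure on rare classes, which a single accuracy figure hides.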

Workflow 3: Drug Repurposing via Knowledge Graphs

Scenario: Find new disease treatments in Hetionet.

Steps:

Load dataset: `datasets.Hetionet()`
Choose model: RotatE or ComplEx
Define task: `KnowledgeGraphCompletion`
Train with negative sampling
Query for "Compound-treats-Disease" predictions
Filter by plausibility and mechanism

Navigation: `references/knowledge_graphs.md` → Hetionet dataset → Model selection → Biomedical applications
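
Negative sampling, mentioned in the steps above, trains the model to separate known triples from corrupted ones. A minimal sketch of the idea follows: it replaces the tail of a true triple with random entities and skips any corruption that is itself a known fact. The function name, entity names, and triples are invented for illustration, and this is not how TorchDrug implements sampling internally.

```python
import random


def corrupt_tails(triple, entities, known_triples, num_negatives, rng):
    """Build negative triples by swapping in tails that are not known to be true."""
    head, relation, _ = triple
    candidates = [
        entity
        for entity in entities
        if (head, relation, entity) not in known_triples
    ]
    chosen = rng.sample(candidates, min(num_negatives, len(candidates)))
    return [(head, relation, entity) for entity in chosen]


# Hypothetical miniature knowledge graph
known = {("aspirin", "treats", "pain"), ("aspirin", "treats", "fever")}
entities = ["pain", "fever", "asthma", "migraine"]

negatives = corrupt_tails(
    ("aspirin", "treats", "pain"), entities, known, num_negatives=2,
    rng=random.Random(0),
)
```

Filtering against the known triples matters in a repurposing setting: without it, a true but different treatment could be sampled as a negative and the model would be penalized for ranking it highly.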

Workflow 4: De Novo Molecule Generation

Scenario: Generate drug-like molecules optimized for target binding.

Steps:

Train property predictor on activity data
Choose generation approach: GCPN for RL-based optimization
Define reward function combining affinity, drug-likeness, synthesizability
Generate candidates with property constraints
Validate chemistry and filter by drug-likeness
Rank by multi-objective scoring

Navigation: `references/molecular_generation.md` → Conditional generation → Multi-objective optimization
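
One simple way to combine affinity, drug-likeness, and synthesizability into a single reward is a weighted sum, sketched below. The weights, the candidate names, and the assumption that every score is already scaled to [0, 1] are all illustrative choices, not values prescribed by TorchDrug or GCPN.

```python
def reward(affinity, drug_likeness, synthesizability, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three scores, each assumed to be pre-scaled to [0, 1]."""
    w_affinity, w_drug, w_synth = weights
    return (
        w_affinity * affinity
        + w_drug * drug_likeness
        + w_synth * synthesizability
    )


# Hypothetical candidates: (affinity, drug-likeness, synthesizability)
candidates = {
    "mol_a": (0.9, 0.4, 0.5),
    "mol_b": (0.6, 0.8, 0.9),
    "mol_c": (0.3, 0.9, 0.2),
}

# Rank candidates from highest to lowest combined reward
ranked = sorted(
    candidates, key=lambda name: reward(*candidates[name]), reverse=True
)
```

Here `mol_b` outranks `mol_a` despite its lower affinity, because it is balanced across all three objectives. Changing the weights shifts that trade-off, so they are worth tuning per project.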

Workflow 5: Retrosynthesis Planning

Scenario: Plan synthesis route for target molecule.

Steps:

Load dataset: `datasets.USPTO50k()`
Train center identification model (RGCN)
Train synthon completion model (GIN)
Combine into end-to-end retrosynthesis pipeline
Apply recursively for multi-step planning
Check commercial availability of building blocks

Navigation: `references/retrosynthesis.md` → Task types → Multi-step planning
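
The recursive step can be sketched as a depth-limited search that keeps expanding a molecule until every leaf is purchasable. In the sketch below, `one_step` stands in for a trained single-step retrosynthesis model and is backed by a hypothetical lookup table. The function names, molecule names, and reactions are placeholders, not TorchDrug APIs.

```python
def plan_route(target, one_step, purchasable, max_depth=5):
    """Recursively expand `target` until all leaves are purchasable.

    Returns a nested dict {molecule: [sub-routes]}, or None when no route
    is found within `max_depth` reaction steps.
    """
    if target in purchasable:
        return {target: []}
    if max_depth == 0:
        return None

    for reactants in one_step(target):
        subroutes = [
            plan_route(reactant, one_step, purchasable, max_depth - 1)
            for reactant in reactants
        ]
        if all(route is not None for route in subroutes):
            return {target: subroutes}
    return None


# Hypothetical single-step predictions standing in for a trained model
toy_reactions = {
    "amide": [["acid", "amine"]],
    "acid": [["ester"]],
}


def toy_one_step(molecule):
    return toy_reactions.get(molecule, [])


purchasable = {"amine", "ester"}
route = plan_route("amide", toy_one_step, purchasable)
```

This version returns the first route it finds. A practical planner would also score alternative routes and cache molecules it has already expanded.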

Integration Patterns

With RDKit

Convert between TorchDrug molecules and RDKit:

```python
from torchdrug import data
from rdkit import Chem

# SMILES → TorchDrug molecule
smiles = "CCO"
mol = data.Molecule.from_smiles(smiles)

# TorchDrug → RDKit
rdkit_mol = mol.to_molecule()

# RDKit → TorchDrug
rdkit_mol = Chem.MolFromSmiles(smiles)
mol = data.Molecule.from_molecule(rdkit_mol)
```

With AlphaFold/ESM

Use predicted structures:

```python
from torchdrug import data, layers
from torchdrug.layers import geometry

# Load AlphaFold predicted structure
protein = data.Protein.from_pdb("AF-P12345-F1-model_v4.pdb")

# Build a residue-level graph: alpha-carbon nodes with sequential and
# spatial (radius) edges
graph_construction = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[
        geometry.SequentialEdge(max_distance=2),
        geometry.SpatialEdge(radius=10.0, min_distance=5),
    ],
)
# The construction layer operates on packed (batched) proteins
graph = graph_construction(data.Protein.pack([protein]))
```
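
To make the two edge types concrete, the following library-free sketch builds sequential and radius edges from a list of alpha-carbon coordinates. It only illustrates the geometry and is not TorchDrug's implementation: the function name `residue_edges` and the toy coordinates are made up.

```python
import math


def residue_edges(ca_coords, radius_cutoff=10.0):
    """Return directed (i, j, edge_type) tuples for a residue-level graph.

    `ca_coords` lists (x, y, z) alpha-carbon positions in angstroms,
    ordered along the chain.
    """
    edges = []
    n = len(ca_coords)

    # Sequential edges link neighbours along the chain, in both directions
    for i in range(n - 1):
        edges.append((i, i + 1, "sequential"))
        edges.append((i + 1, i, "sequential"))

    # Radius edges link every pair of residues closer than the cutoff
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(ca_coords[i], ca_coords[j]) <= radius_cutoff:
                edges.append((i, j, "radius"))
    return edges


# Three residues spaced 3.8 angstroms apart, plus one far from the rest
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = residue_edges(coords)
num_sequential = sum(1 for e in edges if e[2] == "sequential")
num_radius = sum(1 for e in edges if e[2] == "radius")
```

The distant fourth residue is still connected through a sequential edge but receives no radius edges, which shows how the two edge types carry different information (chain order versus spatial proximity).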

With PyTorch Lightning

Wrap tasks for Lightning training:

```python
import torch
import pytorch_lightning as pl


class LightningTask(pl.LightningModule):
    def __init__(self, torchdrug_task):
        super().__init__()
        self.task = torchdrug_task

    def training_step(self, batch, batch_idx):
        # TorchDrug tasks return (loss, metric dict); Lightning expects the loss
        loss, metric = self.task(batch)
        return loss

    def validation_step(self, batch, batch_idx):
        pred = self.task.predict(batch)
        target = self.task.target(batch)
        return {"pred": pred, "target": target}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

Technical Details

For deep dives into TorchDrug's architecture:

Core Concepts: See `references/core_concepts.md` for:

Architecture philosophy (modular, configurable)
Data structures (Graph, Molecule, Protein, PackedGraph)
Model interface and forward function signature
Task interface (predict, target, forward, evaluate)
Training workflows and best practices
Loss functions and metrics
Common pitfalls and debugging

Quick Reference Cheat Sheet

Choose Dataset:

Molecular property → `references/datasets.md` → Molecular section
Protein task → `references/datasets.md` → Protein section
Knowledge graph → `references/datasets.md` → Knowledge graph section

Choose Model:

Molecules → `references/models_architectures.md` → GNN section → GIN/GAT/SchNet
Proteins (sequence) → `references/models_architectures.md` → Protein section → ESM
Proteins (structure) → `references/models_architectures.md` → Protein section → GearNet
Knowledge graph → `references/models_architectures.md` → KG section → RotatE/ComplEx

Common Tasks:

Property prediction → `references/molecular_property_prediction.md` or `references/protein_modeling.md`
Generation → `references/molecular_generation.md`
Retrosynthesis → `references/retrosynthesis.md`
KG reasoning → `references/knowledge_graphs.md`

Understand Architecture:

Data structures → `references/core_concepts.md` → Data Structures
Model design → `references/core_concepts.md` → Model Interface
Task design → `references/core_concepts.md` → Task Interface

Troubleshooting Common Issues

Issue: Dimension mismatch errors
→ Check that `model.input_dim` matches `dataset.node_feature_dim`
→ See `references/core_concepts.md` → Essential Attributes

Issue: Poor performance on molecular tasks
→ Use scaffold splitting, not random
→ Try GIN instead of GCN
→ See `references/molecular_property_prediction.md` → Best Practices

Issue: Protein model not learning
→ Use pre-trained ESM for sequence tasks
→ Check edge construction for structure models
→ See `references/protein_modeling.md` → Training Workflows

Issue: Memory errors with large graphs
→ Reduce batch size
→ Use gradient accumulation
→ See `references/core_concepts.md` → Memory Efficiency

Issue: Generated molecules are invalid
→ Add validity constraints
→ Post-process with RDKit validation
→ See `references/molecular_generation.md` → Validation and Filtering
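
On the gradient accumulation suggestion for memory errors: the idea is to process several small micro-batches, sum their scaled gradients, and only then update the weights, so the update matches what one large batch would have produced. The toy example below checks this on a one-parameter least-squares model in plain Python. The data and function name are made up, and in PyTorch the same effect is usually obtained by dividing each loss by the number of accumulation steps and calling `optimizer.step()` only every few batches.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over `batch`, w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical (x, y) pairs and an initial weight
samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.0

# Gradient from one batch of four samples
full_batch_grad = gradient(w, samples)

# Same gradient from two micro-batches of two samples each:
# scale each micro-batch gradient by 1 / num_steps and sum before updating
micro_batches = [samples[:2], samples[2:]]
accumulated_grad = 0.0
for micro_batch in micro_batches:
    accumulated_grad += gradient(w, micro_batch) / len(micro_batches)
```

Both gradients come out equal, so peak memory scales with the micro-batch size while the optimization step behaves like the larger batch. The equivalence is exact only when micro-batches have equal size and the model has no batch-dependent layers such as batch normalization.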

Resources

Official Documentation: https://torchdrug.ai/docs/
GitHub: https://github.com/DeepGraphLearning/torchdrug
Paper: TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery

Summary

Navigate to the appropriate reference file based on your task:

Molecular property prediction → `molecular_property_prediction.md`
Protein modeling → `protein_modeling.md`
Knowledge graphs → `knowledge_graphs.md`
Molecular generation → `molecular_generation.md`
Retrosynthesis → `retrosynthesis.md`
Model selection → `models_architectures.md`
Dataset selection → `datasets.md`
Technical details → `core_concepts.md`

Each reference provides comprehensive coverage of its domain with examples, best practices, and common use cases.
