pytdc
PyTDC (Therapeutics Data Commons)
Overview
PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. Access curated datasets spanning the entire therapeutics pipeline with standardized evaluation metrics and meaningful data splits, organized into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions, DDI), and generation (molecule generation, retrosynthesis).
When to Use This Skill
This skill should be used when:
- Working with drug discovery or therapeutic ML datasets
- Benchmarking machine learning models on standardized pharmaceutical tasks
- Predicting molecular properties (ADME, toxicity, bioactivity)
- Predicting drug-target or drug-drug interactions
- Generating novel molecules with desired properties
- Accessing curated datasets with proper train/test splits (scaffold, cold-split)
- Using molecular oracles for property optimization
Installation & Setup
Install PyTDC using pip (the commands below go through uv's pip interface):
```bash
uv pip install PyTDC
```
To upgrade to the latest version:
```bash
uv pip install PyTDC --upgrade
```
Core dependencies (automatically installed):
- numpy, pandas, tqdm, seaborn, scikit_learn, fuzzywuzzy
Additional packages are installed automatically as needed for specific features.
Quick Start
The basic pattern for accessing any TDC dataset follows this structure:
```python
from tdc.<problem> import <Task>
data = <Task>(name='<Dataset>')
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
df = data.get_data(format='df')
```
Where:
- `<problem>`: One of `single_pred`, `multi_pred`, or `generation`
- `<Task>`: Specific task category (e.g., ADME, DTI, MolGen)
- `<Dataset>`: Dataset name within that task
Example - Loading ADME data:
```python
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold')
# Returns dict with 'train', 'valid', 'test' DataFrames
```
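To make the `frac` and `seed` arguments concrete, here is a minimal sketch of a seeded random split in plain Python. It is an illustration only, not TDC's implementation: `random_split` and the toy records are hypothetical, and the sketch assumes `frac` is ordered as train, valid, test.

```python
import random

def random_split(records, frac=(0.7, 0.1, 0.2), seed=1):
    """Shuffle records with a fixed seed and cut them into train/valid/test."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # test takes the remainder
    }

# Ten toy (identifier, label) records; the values are made up for illustration.
toy = [(f"mol_{i}", float(i)) for i in range(10)]
parts = random_split(toy, frac=(0.7, 0.1, 0.2), seed=1)
sizes = {name: len(rows) for name, rows in parts.items()}
print(sizes)
```

The returned dictionary mirrors the shape described above (one entry each for 'train', 'valid', and 'test'), with lists standing in for DataFrames.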
Single-Instance Prediction Tasks
Single-instance prediction involves forecasting properties of individual biomedical entities (molecules, proteins, etc.).
Available Task Categories
1. ADME (Absorption, Distribution, Metabolism, Excretion)
Predict pharmacokinetic properties of drug molecules.
```python
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')  # Intestinal permeability
# Other datasets: HIA_Hou, Bioavailability_Ma, Lipophilicity_AstraZeneca, etc.
```
**Common ADME datasets:**
- Caco2 - Intestinal permeability
- HIA - Human intestinal absorption
- Bioavailability - Oral bioavailability
- Lipophilicity - Octanol-water partition coefficient
- Solubility - Aqueous solubility
- BBB - Blood-brain barrier penetration
- CYP - Cytochrome P450 metabolism
2. Toxicity (Tox)
Predict toxicity and adverse effects of compounds.
```python
from tdc.single_pred import Tox
data = Tox(name='hERG')  # Cardiotoxicity
# Other datasets: AMES, DILI, Carcinogens_Lagunin, etc.
```
**Common toxicity datasets:**
- hERG - Cardiac toxicity
- AMES - Mutagenicity
- DILI - Drug-induced liver injury
- Carcinogens - Carcinogenicity
- ClinTox - Clinical trial toxicity
3. HTS (High-Throughput Screening)
Bioactivity predictions from screening data.
```python
from tdc.single_pred import HTS
data = HTS(name='SARSCoV2_Vitro_Touret')
```
4. QM (Quantum Mechanics)
Quantum mechanical properties of molecules.
```python
from tdc.single_pred import QM
data = QM(name='QM7')
```
5. Other Single Prediction Tasks
- Yields: Chemical reaction yield prediction
- Epitope: Epitope prediction for biologics
- Develop: Development-stage predictions
- CRISPROutcome: Gene editing outcome prediction
Data Format
Single prediction datasets typically return DataFrames with columns:
- `Drug_ID` or `Compound_ID`: Unique identifier
- `Drug` or `X`: SMILES string or molecular representation
- `Y`: Target label (continuous or binary)
Multi-Instance Prediction Tasks
Multi-instance prediction involves forecasting properties of interactions between multiple biomedical entities.
Available Task Categories
1. DTI (Drug-Target Interaction)
Predict binding affinity between drugs and protein targets.
```python
from tdc.multi_pred import DTI
data = DTI(name='BindingDB_Kd')
split = data.get_split()
```
Available datasets:
- BindingDB_Kd - Dissociation constant (52,284 pairs)
- BindingDB_IC50 - Half-maximal inhibitory concentration (991,486 pairs)
- BindingDB_Ki - Inhibition constant (375,032 pairs)
- DAVIS, KIBA - Kinase binding datasets
Data format: Drug_ID, Target_ID, Drug (SMILES), Target (sequence), Y (binding affinity)
2. DDI (Drug-Drug Interaction)
Predict interactions between drug pairs.
```python
from tdc.multi_pred import DDI
data = DDI(name='DrugBank')
split = data.get_split()
```
This is a multi-class classification task that predicts interaction types. The dataset contains 191,808 DDI pairs covering 1,706 drugs.
3. PPI (Protein-Protein Interaction)
Predict protein-protein interactions.
```python
from tdc.multi_pred import PPI
data = PPI(name='HuRI')
```
4. Other Multi-Prediction Tasks
- GDA: Gene-disease associations
- DrugRes: Drug resistance prediction
- DrugSyn: Drug synergy prediction
- PeptideMHC: Peptide-MHC binding
- AntibodyAff: Antibody affinity prediction
- MTI: miRNA-target interactions
- Catalyst: Catalyst prediction
- TrialOutcome: Clinical trial outcome prediction
Generation Tasks
Generation tasks involve creating novel biomedical entities with desired properties.
1. Molecular Generation (MolGen)
Generate diverse, novel molecules with desirable chemical properties.
```python
from tdc.generation import MolGen
data = MolGen(name='ChEMBL_V29')
split = data.get_split()
```
Use with oracles to optimize for specific properties:
```python
from tdc import Oracle
oracle = Oracle(name='GSK3B')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')  # Evaluate SMILES
```
See `references/oracles.md` for all available oracle functions.
2. Retrosynthesis (RetroSyn)
Predict reactants needed to synthesize a target molecule.
```python
from tdc.generation import RetroSyn
data = RetroSyn(name='USPTO')
split = data.get_split()
```
The dataset contains 1,939,253 reactions from the USPTO database.
3. Paired Molecule Generation
Generate molecule pairs (e.g., prodrug-drug pairs).
```python
from tdc.generation import PairMolGen
data = PairMolGen(name='Prodrug')
```
For detailed oracle documentation and molecular generation workflows, refer to `references/oracles.md` and `scripts/molecular_generation.py`.
Benchmark Groups
Benchmark groups provide curated collections of related datasets for systematic model evaluation.
ADMET Benchmark Group
```python
from tdc.benchmark_group import admet_group
group = admet_group(path='data/')

# Get one benchmark; it holds 'name', 'train_val', and 'test'
benchmark = group.get('Caco2_Wang')
name, test = benchmark['name'], benchmark['test']

predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    # Re-split the training pool into train/valid for this seed
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # Train model here (user implements `model`)
    predictions_list.append({name: model.predict(test['Drug'])})

# Evaluate with required 5 seeds
results = group.evaluate_many(predictions_list)
```
**ADMET Group includes 22 datasets** covering absorption, distribution, metabolism, excretion, and toxicity.
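The 5-seed protocol exists so that a result can be reported as an average with a spread rather than as a single lucky run. A minimal sketch of that summary step follows; the per-seed scores are made-up numbers, `summarize` is a hypothetical helper, and the choice of sample standard deviation is an assumption (the benchmark's own aggregation may differ).

```python
from statistics import mean, stdev

# Hypothetical per-seed MAE values, made up for illustration only.
mae_by_seed = {1: 0.41, 2: 0.39, 3: 0.43, 4: 0.40, 5: 0.42}

def summarize(scores_by_seed):
    """Return (mean, sample standard deviation) over the per-seed scores."""
    values = list(scores_by_seed.values())
    return mean(values), stdev(values)

avg, spread = summarize(mae_by_seed)
print(f"MAE: {avg:.3f} +/- {spread:.3f}")
```

Reporting both numbers makes it easier to tell a real improvement from seed-to-seed noise.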
Other Benchmark Groups
Available benchmark groups include collections for:
- ADMET properties
- Drug-target interactions
- Drug combination prediction
- And more specialized therapeutic tasks
For benchmark evaluation workflows, see `scripts/benchmark_evaluation.py`.
Data Functions
TDC provides comprehensive data processing utilities organized into four categories.
1. Dataset Splits
Retrieve train/validation/test partitions with various strategies:
```python
# Scaffold split (default for most tasks)
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])

# Random split
split = data.get_split(method='random', seed=42, frac=[0.8, 0.1, 0.1])

# Cold split (for DTI/DDI tasks)
split = data.get_split(method='cold_drug', seed=1)    # Unseen drugs in test
split = data.get_split(method='cold_target', seed=1)  # Unseen targets in test
```
**Available split strategies:**
- `random`: Random shuffling
- `scaffold`: Scaffold-based (for chemical diversity)
- `cold_drug`, `cold_target`, `cold_drug_target`: For DTI tasks
- `temporal`: Time-based splits for temporal datasets
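The idea behind a cold split is that the entities themselves are partitioned rather than the rows, so every drug (or target) in the test fold is absent from training. The sketch below shows one way to do this in plain Python. It is not TDC's implementation: `cold_split` is a hypothetical helper, the toy pairs are invented, and the fractions here apply to the number of entities, not the number of pairs.

```python
import random

def cold_split(pairs, key, frac=(0.7, 0.1, 0.2), seed=1):
    """Split pair records so each entity under `key` lands in exactly one fold."""
    entities = sorted({row[key] for row in pairs})
    random.Random(seed).shuffle(entities)
    n_train = round(frac[0] * len(entities))
    n_valid = round(frac[1] * len(entities))
    fold_of = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            fold_of[entity] = "train"
        elif i < n_train + n_valid:
            fold_of[entity] = "valid"
        else:
            fold_of[entity] = "test"
    split = {"train": [], "valid": [], "test": []}
    for row in pairs:
        split[fold_of[row[key]]].append(row)  # a row follows its entity's fold
    return split

# Toy drug-target pairs: 10 hypothetical drugs, each measured on 3 targets.
pairs = [
    {"Drug_ID": f"D{d}", "Target_ID": f"T{t}", "Y": float(d + t)}
    for d in range(10) for t in range(3)
]
split = cold_split(pairs, key="Drug_ID")
train_drugs = {row["Drug_ID"] for row in split["train"]}
test_drugs = {row["Drug_ID"] for row in split["test"]}
print(train_drugs.isdisjoint(test_drugs))
```

Passing `key="Target_ID"` instead gives the analogous split with unseen targets in the test fold.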
2. Model Evaluation
Use standardized metrics for evaluation:
```python
from tdc import Evaluator

# For binary classification
evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred)

# For regression
evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)
```
**Available metrics:** ROC-AUC, PR-AUC, F1, Accuracy, RMSE, MAE, R2, Spearman, Pearson, and more.
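For readers who want to see what a few of these metrics compute, here is a plain-Python sketch of RMSE, MAE, and ROC-AUC using their textbook definitions. The function names and toy values are illustrative and are not taken from TDC; the `Evaluator` class may handle edge cases differently.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: large errors are penalized quadratically."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Regression toy data: three exact predictions and one that is off by 4.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # 2.0 1.0

# Classification toy data: one positive is ranked below one negative.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

The single large error doubles RMSE relative to MAE here, which is why the two metrics can rank models differently.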
3. Data Processing
TDC provides 11 key processing utilities:
```python
from tdc.chem_utils import MolConvert

# Molecule format conversion
converter = MolConvert(src='SMILES', dst='PyG')
pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
```
**Processing utilities include:**
- Molecule format conversion (SMILES, SELFIES, PyG, DGL, ECFP, etc.)
- Molecule filters (PAINS, drug-likeness)
- Label binarization and unit conversion
- Data balancing (over/under-sampling)
- Negative sampling for pair data
- Graph transformation
- Entity retrieval (CID to SMILES, UniProt to sequence)
For comprehensive utilities documentation, see `references/utilities.md`.
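Negative sampling for pair data is easiest to understand from a small example. Interaction datasets often record only observed (positive) pairs, so negatives are drawn from the combinations that were never observed. The sketch below is a generic illustration, not TDC's utility: `sample_negatives` and the toy IDs are hypothetical, and an unobserved pair is only assumed to be negative.

```python
import random

def sample_negatives(positives, n_negatives, seed=0):
    """Draw drug-target pairs that do not appear among the observed positives."""
    observed = set(positives)
    drugs = sorted({d for d, _ in positives})
    targets = sorted({t for _, t in positives})
    unobserved = [(d, t) for d in drugs for t in targets if (d, t) not in observed]
    if n_negatives > len(unobserved):
        raise ValueError("not enough unobserved pairs to sample from")
    return random.Random(seed).sample(unobserved, n_negatives)

# Three observed interactions between hypothetical drugs and targets.
positives = [("D1", "T1"), ("D2", "T2"), ("D3", "T3")]
negatives = sample_negatives(positives, n_negatives=3)
print(negatives)
```

Sampling as many negatives as positives, as done here, yields a balanced training set; other ratios are a modeling choice.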
4. Molecule Generation Oracles
TDC provides 17+ oracle functions for molecular optimization:
```python
from tdc import Oracle

# Single oracle
oracle = Oracle(name='DRD2')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

# Batch evaluation: one oracle scoring a list of SMILES strings
oracle = Oracle(name='JNK3')
scores = oracle(['SMILES1', 'SMILES2', 'SMILES3'])  # placeholders; substitute real SMILES
```
For complete oracle documentation, see `references/oracles.md`.
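Because an oracle is simply a callable that maps a SMILES string to a score, the simplest optimization step is to score a pool of candidates and keep the best ones. The sketch below shows that selection step with a stand-in scorer so it runs without any downloads. `toy_oracle` and `top_k` are hypothetical, and the toy score (share of aromatic-carbon characters) has no chemical meaning.

```python
def toy_oracle(smiles):
    """Stand-in scorer: share of characters that are aromatic carbons ('c')."""
    return smiles.count("c") / len(smiles) if smiles else 0.0

def top_k(candidates, oracle, k=2):
    """Score every candidate once and return the k best as (score, smiles)."""
    scored = [(oracle(s), s) for s in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

candidates = ["CCO", "c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(O)=O"]
best = top_k(candidates, toy_oracle, k=2)
for score, smiles in best:
    print(f"{score:.3f}  {smiles}")
```

Under the calling convention shown above, an `Oracle` instance could presumably be passed in place of `toy_oracle`; a generative model would then propose new candidates from the survivors and repeat.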
Advanced Features
Retrieve Available Datasets
```python
from tdc.utils import retrieve_dataset_names

# Get all ADME datasets
adme_datasets = retrieve_dataset_names('ADME')

# Get all DTI datasets
dti_datasets = retrieve_dataset_names('DTI')
```
Label Transformations
```python
# Get label mapping
label_map = data.get_label_map(name='DrugBank')

# Convert labels
from tdc.chem_utils import label_transform
transformed = label_transform(y, from_unit='nM', to_unit='p')
```
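The 'nM' to 'p' conversion refers to the usual negative-log scale for affinities (as in pKd or pIC50): the value is expressed in molar units and then mapped to its negative base-10 logarithm. A minimal sketch of that arithmetic follows. `nm_to_p` and `p_to_nm` are hypothetical helpers, and a library implementation may add a small offset to guard against zero concentrations.

```python
import math

def nm_to_p(value_nm):
    """Convert a concentration in nM to -log10 of its molar value."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(value_nm)

def p_to_nm(p_value):
    """Inverse transform: recover the nM concentration from a p-scale value."""
    return 10.0 ** (9.0 - p_value)

kd_nm = [1.0, 10.0, 1000.0]
p_values = [nm_to_p(v) for v in kd_nm]
print(p_values)  # [9.0, 8.0, 6.0]
```

On this scale a tenfold tighter binder gains exactly one unit, which tends to make regression targets better behaved than raw concentrations.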
Database Queries
```python
from tdc.utils import cid2smiles, uniprot2seq

# Convert PubChem CID to SMILES
smiles = cid2smiles(2244)

# Convert UniProt ID to amino acid sequence
sequence = uniprot2seq('P12345')
```
Common Workflows
Workflow 1: Train a Single Prediction Model
See `scripts/load_and_split_data.py` for a complete example:
```python
from tdc.single_pred import ADME
from tdc import Evaluator

# Load data
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42)
train, valid, test = split['train'], split['valid'], split['test']

# Train model (user implements `model`)
model.fit(train['Drug'], train['Y'])
predictions = model.predict(test['Drug'])

# Evaluate
evaluator = Evaluator(name='MAE')
score = evaluator(test['Y'], predictions)
```
Workflow 2: Benchmark Evaluation
See `scripts/benchmark_evaluation.py` for a complete example with multiple seeds and proper evaluation protocol.
Workflow 3: Molecular Generation with Oracles
See `scripts/molecular_generation.py` for an example of goal-directed generation using oracle functions.
Resources
This skill includes bundled resources for common TDC workflows:
scripts/
- `load_and_split_data.py`: Template for loading and splitting TDC datasets with various strategies
- `benchmark_evaluation.py`: Template for running benchmark group evaluations with proper 5-seed protocol
- `molecular_generation.py`: Template for molecular generation using oracle functions
references/
- `datasets.md`: Comprehensive catalog of all available datasets organized by task type
- `oracles.md`: Complete documentation of all 17+ molecule generation oracles
- `utilities.md`: Detailed guide to data processing, splitting, and evaluation utilities
Additional Resources
- Official Website: https://tdcommons.ai
- Documentation: https://tdc.readthedocs.io
- GitHub: https://github.com/mims-harvard/TDC
- Paper: NeurIPS 2021 - "Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development"