DeepChem

Overview

DeepChem is a comprehensive Python library for applying machine learning to chemistry, materials science, and biology. It supports molecular property prediction, drug discovery, materials design, and biomolecule analysis through specialized neural networks, molecular featurization methods, and pretrained models.

When to Use This Skill

This skill should be used when:
  • Loading and processing molecular data (SMILES strings, SDF files, protein sequences)
  • Predicting molecular properties (solubility, toxicity, binding affinity, ADMET properties)
  • Training models on chemical/biological datasets
  • Using MoleculeNet benchmark datasets (Tox21, BBBP, Delaney, etc.)
  • Converting molecules to ML-ready features (fingerprints, graph representations, descriptors)
  • Implementing graph neural networks for molecules (GCN, GAT, MPNN, AttentiveFP)
  • Applying transfer learning with pretrained models (ChemBERTa, GROVER, MolFormer)
  • Predicting crystal/materials properties (bandgap, formation energy)
  • Analyzing protein or DNA sequences

Core Capabilities

1. Molecular Data Loading and Processing

DeepChem provides specialized loaders for various chemical data formats:

```python
import deepchem as dc

# Load CSV with SMILES
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],
    feature_field='smiles',
    featurizer=featurizer
)
dataset = loader.create_dataset('molecules.csv')

# Load SDF files
loader = dc.data.SDFLoader(tasks=['activity'], featurizer=featurizer)
dataset = loader.create_dataset('compounds.sdf')

# Load protein sequences
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
```

**Key Loaders**:
- `CSVLoader`: Tabular data with molecular identifiers
- `SDFLoader`: Molecular structure files
- `FASTALoader`: Protein/DNA sequences
- `ImageLoader`: Molecular images
- `JsonLoader`: JSON-formatted datasets

2. Molecular Featurization

Convert molecules into numerical representations for ML models.

Decision Tree for Featurizer Selection

Is the model a graph neural network?
├─ YES → Use graph featurizers
│   ├─ Standard GNN → MolGraphConvFeaturizer
│   ├─ Message passing → DMPNNFeaturizer
│   └─ Pretrained → GroverFeaturizer
└─ NO → What type of model?
    ├─ Traditional ML (RF, XGBoost, SVM)
    │   ├─ Fast baseline → CircularFingerprint (ECFP)
    │   ├─ Interpretable → RDKitDescriptors
    │   └─ Maximum coverage → MordredDescriptors
    ├─ Deep learning (non-graph)
    │   ├─ Dense networks → CircularFingerprint
    │   └─ CNN → SmilesToImage
    ├─ Sequence models (LSTM, Transformer)
    │   └─ SmilesToSeq
    └─ 3D structure analysis
        └─ CoulombMatrix
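
The decision tree above can be encoded as a small lookup helper. This is only a sketch: the name `choose_featurizer` and the category labels are hypothetical, and the helper returns the featurizer class names from the tree as plain strings rather than instantiating anything from DeepChem.

```python
# Hypothetical helper mirroring the featurizer decision tree; not part of DeepChem.
FEATURIZER_TREE = {
    "gnn": {
        "standard": "MolGraphConvFeaturizer",
        "message_passing": "DMPNNFeaturizer",
        "pretrained": "GroverFeaturizer",
    },
    "traditional_ml": {
        "fast_baseline": "CircularFingerprint",
        "interpretable": "RDKitDescriptors",
        "maximum_coverage": "MordredDescriptors",
    },
    "deep_learning": {
        "dense": "CircularFingerprint",
        "cnn": "SmilesToImage",
    },
    "sequence": {"default": "SmilesToSeq"},
    "structure_3d": {"default": "CoulombMatrix"},
}


def choose_featurizer(model_family: str, variant: str = "default") -> str:
    """Return the featurizer class name suggested by the decision tree."""
    try:
        return FEATURIZER_TREE[model_family][variant]
    except KeyError:
        raise ValueError(
            f"No recommendation for {model_family!r} / {variant!r}"
        ) from None


suggestion = choose_featurizer("gnn", "standard")
```

Keeping the tree as data rather than nested `if` statements makes it easy to extend when a new model family is added.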

Example Featurization

```python
# Fingerprints (for traditional ML)
fp = dc.feat.CircularFingerprint(radius=2, size=2048)

# Descriptors (for interpretable models)
desc = dc.feat.RDKitDescriptors()

# Graph features (for GNNs)
graph_feat = dc.feat.MolGraphConvFeaturizer()

# Apply featurization
features = fp.featurize(['CCO', 'c1ccccc1'])
```

**Selection Guide**:
- **Small datasets (<1K)**: CircularFingerprint or RDKitDescriptors
- **Medium datasets (1K-100K)**: CircularFingerprint or graph featurizers
- **Large datasets (>100K)**: Graph featurizers (MolGraphConvFeaturizer, DMPNNFeaturizer)
- **Transfer learning**: Pretrained model featurizers (GroverFeaturizer)

See `references/api_reference.md` for complete featurizer documentation.

3. Data Splitting

Critical: For drug discovery tasks, use `ScaffoldSplitter` to prevent data leakage from similar molecular structures appearing in both training and test sets.

```python
# Scaffold splitting (recommended for molecules)
# Note: the dataset ids must be SMILES strings for scaffolds to be computed
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)

# Random splitting (for non-molecular data)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)

# Stratified splitting (for imbalanced classification)
splitter = dc.splits.RandomStratifiedSplitter()
train, test = splitter.train_test_split(dataset)
```

**Available Splitters**:
- `ScaffoldSplitter`: Split by molecular scaffolds (prevents leakage)
- `ButinaSplitter`: Clustering-based molecular splitting
- `MaxMinSplitter`: Maximize diversity between sets
- `RandomSplitter`: Random splitting
- `RandomStratifiedSplitter`: Preserves class distributions
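
To make the idea behind scaffold splitting concrete, here is a minimal pure-Python sketch. It assumes each molecule already has a scaffold key (the `scaffolds` list is hypothetical; in practice the keys would come from a cheminformatics toolkit), and it approximates the grouping strategy rather than reproducing `ScaffoldSplitter` itself: whole scaffold groups, largest first, are assigned to train, then valid, then test, so no scaffold is shared between sets.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that each scaffold group lands in exactly one set."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first, so rare scaffolds tend to end up in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    train_cutoff = frac_train * n_total
    valid_cutoff = (frac_train + frac_valid) * n_total

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold keys for ten molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
```

Because groups are never broken up, the realized fractions only approximate the requested ones; that trade-off is what prevents near-duplicate structures from straddling the split.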

4. Model Selection and Training

Quick Model Selection Guide

| Dataset Size | Task | Recommended Model | Featurizer |
|---|---|---|---|
| < 1K samples | Any | SklearnModel (RandomForest) | CircularFingerprint |
| 1K-100K | Classification/Regression | GBDTModel or MultitaskRegressor | CircularFingerprint |
| > 100K | Molecular properties | GCNModel, AttentiveFPModel, DMPNNModel | MolGraphConvFeaturizer |
| Any (small preferred) | Transfer learning | ChemBERTa, GROVER, MolFormer | Model-specific |
| Crystal structures | Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Protein sequences | Protein properties | ProtBERT | Sequence-based |

Example: Traditional ML

```python
from sklearn.ensemble import RandomForestRegressor

# Wrap scikit-learn model
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(model=sklearn_model)
model.fit(train)
```

Example: Deep Learning

```python
# Multitask regressor (for fingerprints)
model = dc.models.MultitaskRegressor(
    n_tasks=2,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001
)
model.fit(train, nb_epoch=50)
```

Example: Graph Neural Networks

```python
# These models expect datasets featurized with MolGraphConvFeaturizer

# Graph Convolutional Network
model = dc.models.GCNModel(
    n_tasks=1,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)
model.fit(train, nb_epoch=50)

# Graph Attention Network
model = dc.models.GATModel(n_tasks=1, mode='classification')
model.fit(train, nb_epoch=50)

# Attentive Fingerprint
# (uses bond features: featurize with MolGraphConvFeaturizer(use_edges=True))
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(train, nb_epoch=50)
```

5. MoleculeNet Benchmarks

Quick access to 30+ curated benchmark datasets with standardized train/valid/test splits:

```python
# Load benchmark dataset
# GCNModel needs graph features from MolGraphConvFeaturizer; string shortcuts
# such as 'ECFP', 'GraphConv', 'Weave', or 'Raw' select featurizers for other models
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer=dc.feat.MolGraphConvFeaturizer(),
    splitter='scaffold',  # or 'random', 'stratified'
    reload=False
)
train, valid, test = datasets

# Train and evaluate
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
```

**Common Datasets**:
- **Classification**: `load_tox21()`, `load_bbbp()`, `load_hiv()`, `load_clintox()`
- **Regression**: `load_delaney()`, `load_freesolv()`, `load_lipo()`
- **Quantum properties**: `load_qm7()`, `load_qm8()`, `load_qm9()`
- **Materials**: `load_perovskite()`, `load_bandgap()`, `load_mp_formation_energy()`

See `references/api_reference.md` for complete dataset list.

6. Transfer Learning

Leverage pretrained models for improved performance, especially on small datasets:

```python
# ChemBERTa (BERT pretrained on 77M molecules)
model = dc.models.HuggingFaceModel(
    model='seyonec/ChemBERTa-zinc-base-v1',
    task='classification',
    n_tasks=1,
    learning_rate=2e-5  # Lower LR for fine-tuning
)
model.fit(train, nb_epoch=10)

# GROVER (graph transformer pretrained on 10M molecules)
model = dc.models.GroverModel(
    task='regression',
    n_tasks=1
)
model.fit(train, nb_epoch=20)
```

**When to use transfer learning**:
- Small datasets (< 1000 samples)
- Novel molecular scaffolds
- Limited computational resources
- Need for rapid prototyping

Use the `scripts/transfer_learning.py` script for guided transfer learning workflows.

7. Model Evaluation

```python
# Define metrics
classification_metrics = [
    dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
    dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
    dc.metrics.Metric(dc.metrics.f1_score, name='F1')
]

regression_metrics = [
    dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
    dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
    dc.metrics.Metric(dc.metrics.rms_score, name='RMSE')  # DeepChem's RMSE function
]

# Evaluate
train_scores = model.evaluate(train, classification_metrics)
test_scores = model.evaluate(test, classification_metrics)
```
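
For intuition about what the regression metrics above report, the following sketch computes MAE, RMSE, and R² by hand in plain Python. The function name `regression_report` and the sample values are illustrative assumptions; in practice `model.evaluate` computes these scores for you.

```python
import math


def regression_report(y_true, y_pred):
    """Compute MAE, RMSE, and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Illustrative values only: one of four predictions is off by 2.
report = regression_report([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```

Note how a single large error is penalized more by RMSE than by MAE, which is why reporting both is informative.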

8. Making Predictions

```python
# Predict on test set
predictions = model.predict(test)

# Predict on new molecules
new_smiles = ['CCO', 'c1ccccc1', 'CC(C)O']
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)

# Apply same transformations as training
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)

# Passing the transformers undoes any target (y) normalization on the output
predictions = model.predict(new_dataset, transformers)
```

Typical Workflows

Workflow A: Quick Benchmark Evaluation

For evaluating a model on standard benchmarks:

```python
import deepchem as dc

# 1. Load benchmark (GCNModel needs MolGraphConvFeaturizer graph features)
tasks, datasets, _ = dc.molnet.load_bbbp(
    featurizer=dc.feat.MolGraphConvFeaturizer(),
    splitter='scaffold'
)
train, valid, test = datasets

# 2. Train model
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)

# 3. Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```

Workflow B: Custom Data Prediction

For training on custom molecular datasets:

```python
import deepchem as dc

# 1. Load and featurize data
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=featurizer
)
dataset = loader.create_dataset('my_molecules.csv')

# 2. Split data (use ScaffoldSplitter for molecules!)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)

# 3. Normalize (optional but recommended)
transformers = [dc.trans.NormalizationTransformer(
    transform_y=True, dataset=train
)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)

# 4. Train model
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25
)
model.fit(train, nb_epoch=50)

# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
test_score = model.evaluate(test, [metric])
```

Workflow C: Transfer Learning on Small Dataset

For leveraging pretrained models:

```python
import deepchem as dc

# 1. Load data (pretrained models often need raw SMILES)
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()  # Model handles featurization
)
dataset = loader.create_dataset('small_dataset.csv')

# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# 3. Load pretrained model
model = dc.models.HuggingFaceModel(
    model='seyonec/ChemBERTa-zinc-base-v1',
    task='classification',
    n_tasks=1,
    learning_rate=2e-5
)

# 4. Fine-tune
model.fit(train, nb_epoch=10)

# 5. Evaluate
predictions = model.predict(test)
```

See `references/workflows.md` for eight detailed workflow examples covering molecular generation, materials science, protein analysis, and more.

Example Scripts

This skill includes three production-ready scripts in the `scripts/` directory:

1. predict_solubility.py

Train and evaluate solubility prediction models. Works with Delaney benchmark or custom CSV data.

```bash
# Use Delaney benchmark
python scripts/predict_solubility.py

# Use custom data
python scripts/predict_solubility.py \
    --data my_data.csv \
    --smiles-col smiles \
    --target-col solubility \
    --predict "CCO" "c1ccccc1"
```
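
The script's source is not shown here, so the following is only a guess at how a command line like the one above could be parsed. It is a sketch using `argparse`: the flag names are taken from the example invocation, while the defaults, help strings, and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the flags used in the example invocation."""
    parser = argparse.ArgumentParser(description="Solubility prediction (sketch)")
    parser.add_argument("--data", default=None,
                        help="Path to a custom CSV; assumed to fall back to the benchmark when omitted")
    parser.add_argument("--smiles-col", default="smiles",
                        help="Name of the CSV column holding SMILES strings")
    parser.add_argument("--target-col", default="solubility",
                        help="Name of the CSV column holding the target value")
    parser.add_argument("--predict", nargs="*", default=[],
                        help="SMILES strings to predict after training")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--data", "my_data.csv", "--predict", "CCO", "c1ccccc1"]
)
```

`argparse` converts the dashed flag names to attributes with underscores, so `--smiles-col` is read back as `args.smiles_col`.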

2. graph_neural_network.py

Train various graph neural network architectures on molecular data.

```bash
# Train GCN on Tox21
python scripts/graph_neural_network.py --model gcn --dataset tox21

# Train AttentiveFP on custom data
python scripts/graph_neural_network.py \
    --model attentivefp \
    --data molecules.csv \
    --task-type regression \
    --targets activity \
    --epochs 100
```

3. transfer_learning.py

Fine-tune pretrained models (ChemBERTa, GROVER) on molecular property prediction tasks.

```bash
# Fine-tune ChemBERTa on BBBP
python scripts/transfer_learning.py --model chemberta --dataset bbbp

# Fine-tune GROVER on custom data
python scripts/transfer_learning.py \
    --model grover \
    --data small_dataset.csv \
    --target activity \
    --task-type classification \
    --epochs 20
```

Common Patterns and Best Practices

Pattern 1: Always Use Scaffold Splitting for Molecules

```python
# GOOD: Prevents data leakage
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# BAD: Similar molecules in train and test
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
```
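
One way to see why random splitting leaks information is to check whether any scaffold occurs on both sides of a split. The sketch below does this in plain Python; the scaffold keys are hypothetical placeholders and `shared_scaffolds` is not a DeepChem function.

```python
def shared_scaffolds(train_scaffolds, test_scaffolds):
    """Return the sorted scaffold keys that occur in both the training and the test set."""
    return sorted(set(train_scaffolds) & set(test_scaffolds))


# Hypothetical keys after a random split: "benzene" lands on both sides.
random_split_overlap = shared_scaffolds(
    ["benzene", "benzene", "pyridine"], ["benzene", "indole"]
)

# Hypothetical keys after a scaffold split: the two sides share nothing.
scaffold_split_overlap = shared_scaffolds(
    ["benzene", "benzene", "pyridine"], ["indole", "furan"]
)
```

A non-empty result means the test set contains close relatives of training molecules, so the reported test score is likely optimistic.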

Pattern 2: Normalize Features and Targets

```python
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,  # Normalize target values (add transform_X=True to also normalize features)
        dataset=train      # Statistics are computed on the training set only
    )
]
for transformer in transformers:
    train = transformer.transform(train)
    test = transformer.transform(test)
```
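
Conceptually, target normalization is a z-score transform whose statistics are fitted on the training set only and then reused for every other split. A plain-Python sketch follows; the class name `ZScore` and its methods are illustrative and are not DeepChem's implementation.

```python
import math


class ZScore:
    """Z-score normalizer whose mean and standard deviation come from training data."""

    def __init__(self, train_values):
        n = len(train_values)
        self.mean = sum(train_values) / n
        variance = sum((v - self.mean) ** 2 for v in train_values) / n
        # Fall back to 1.0 so constant targets do not cause a division by zero.
        self.std = math.sqrt(variance) or 1.0

    def transform(self, values):
        """Normalize values using the training statistics."""
        return [(v - self.mean) / self.std for v in values]

    def untransform(self, values):
        """Map normalized values (e.g. predictions) back to the original units."""
        return [v * self.std + self.mean for v in values]


scaler = ZScore([1.0, 2.0, 3.0])             # statistics from the training targets
test_scaled = scaler.transform([2.0, 4.0])   # reuse them for the test targets
restored = scaler.untransform(test_scaled)   # back to the original scale
```

Fitting on the training set alone matters: computing the mean and standard deviation over all data would let test-set information influence training.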

Pattern 3: Start Simple, Then Scale

  1. Start with Random Forest + CircularFingerprint (fast baseline)
  2. Try XGBoost/LightGBM if RF works well
  3. Move to deep learning (MultitaskRegressor) if you have >5K samples
  4. Try GNNs if you have >10K samples
  5. Use transfer learning for small datasets or novel scaffolds

Pattern 4: Handle Imbalanced Data

```python
# Option 1: Balancing transformer
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)

# Option 2: Use balanced metrics
metric = dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
```
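
The idea behind weight-based balancing can be shown without any library: give each sample a weight so that every class contributes the same total weight to the loss. This sketch approximates that idea only; the `balanced_weights` name is an assumption and the exact weighting used by `BalancingTransformer` may differ.

```python
from collections import Counter


def balanced_weights(labels):
    """Give each sample a weight so every class contributes the same total weight."""
    counts = Counter(labels)
    largest = max(counts.values())
    return [largest / counts[label] for label in labels]


labels = [0, 0, 0, 0, 1]  # illustrative 4:1 class imbalance
weights = balanced_weights(labels)

# Total weight per class after balancing.
totals = {}
for label, weight in zip(labels, weights):
    totals[label] = totals.get(label, 0.0) + weight
```

Reweighting leaves the data untouched, unlike over- or under-sampling, so it combines cleanly with scaffold-based splits.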

Pattern 5: Avoid Memory Issues

```python
# Use DiskDataset for large datasets
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)

# Use smaller batch sizes
model = dc.models.GCNModel(n_tasks=1, batch_size=32)  # Instead of 128
```
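
Smaller batches and disk-backed datasets help for the same reason: only one slice of the data has to be in memory at a time. A generic sketch of lazy batching in plain Python follows (the `iter_batches` name is an assumption, not a DeepChem API).

```python
from itertools import islice


def iter_batches(samples, batch_size):
    """Yield lists of at most batch_size items without materializing all of them."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# range() is lazy, so only one batch of integers exists in memory at a time.
batch_lengths = [len(batch) for batch in iter_batches(range(100), 32)]
```

Halving the batch size roughly halves peak activation memory during training, at the cost of more optimizer steps per epoch.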

Common Pitfalls

Issue 1: Data Leakage in Drug Discovery

Problem: Random splitting allows similar molecules to appear in both the training and test sets.

Solution: Always use `ScaffoldSplitter` for molecular datasets.

Issue 2: GNN Underperforming vs Fingerprints

Problem: Graph neural networks perform worse than simple fingerprints.

Solutions:
  • Ensure dataset is large enough (>10K samples typically)
  • Increase training epochs (50-100)
  • Try different architectures (AttentiveFP, DMPNN instead of GCN)
  • Use pretrained models (GROVER)

Issue 3: Overfitting on Small Datasets

Problem: Model memorizes training data.

Solutions:
  • Use stronger regularization (increase dropout to 0.5)
  • Use simpler models (Random Forest instead of deep learning)
  • Apply transfer learning (ChemBERTa, GROVER)
  • Collect more data
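
A large gap between training and test scores is the usual symptom of memorization. Since `model.evaluate` returns a dictionary of scores, the gap can be computed per metric; the helper name `generalization_gap` and the numbers below are purely illustrative.

```python
def generalization_gap(train_scores, test_scores):
    """Return train-minus-test score for every metric present in both dictionaries."""
    return {
        name: train_scores[name] - test_scores[name]
        for name in train_scores
        if name in test_scores
    }


# Illustrative numbers, not real results.
gap = generalization_gap({"ROC-AUC": 0.99}, {"ROC-AUC": 0.5})
```

Tracking this gap while applying the remedies above shows whether regularization or a simpler model is actually helping.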

Issue 4: Import Errors

Problem: Module not found errors.

Solution: Ensure DeepChem is installed with required dependencies:

```bash
uv pip install deepchem

# For PyTorch models
uv pip install "deepchem[torch]"

# For all features
uv pip install "deepchem[all]"
```
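
Before reinstalling anything, it can help to check which packages are actually importable in the current environment. The sketch below uses only the standard library; the helper name `missing_packages` and the list of package names are assumptions about what a given model needs.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed package names; adjust the list to the models you plan to use.
missing = missing_packages(["deepchem", "rdkit", "torch"])
if missing:
    print("Missing packages:", ", ".join(missing))
```

`find_spec` only locates a package without importing it, so the check is fast and has no side effects.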

Reference Documentation

This skill includes comprehensive reference documentation:

references/api_reference.md

Complete API documentation including:
  • All data loaders and their use cases
  • Dataset classes and when to use each
  • Complete featurizer catalog with selection guide
  • Model catalog organized by category (50+ models)
  • MoleculeNet dataset descriptions
  • Metrics and evaluation functions
  • Common code patterns
When to reference: Search this file when you need specific API details, parameter names, or want to explore available options.

references/workflows.md

Eight detailed end-to-end workflows:
  1. Molecular property prediction from SMILES
  2. Using MoleculeNet benchmarks
  3. Hyperparameter optimization
  4. Transfer learning with pretrained models
  5. Molecular generation with GANs
  6. Materials property prediction
  7. Protein sequence analysis
  8. Custom model integration
When to reference: Use these workflows as templates for implementing complete solutions.

Installation Notes

Basic installation:

```bash
uv pip install deepchem
```

For PyTorch models (GCN, GAT, etc.):

```bash
uv pip install "deepchem[torch]"
```

For all features:

```bash
uv pip install "deepchem[all]"
```

If import errors occur, the user may need specific dependencies. Check the DeepChem documentation for detailed installation instructions.

Additional Resources
