DeepChem
Overview
DeepChem is a comprehensive Python library for applying machine learning to chemistry, materials science, and biology. It supports molecular property prediction, drug discovery, materials design, and biomolecule analysis through specialized neural networks, molecular featurization methods, and pretrained models.
When to Use This Skill
This skill should be used when:
- Loading and processing molecular data (SMILES strings, SDF files, protein sequences)
- Predicting molecular properties (solubility, toxicity, binding affinity, ADMET properties)
- Training models on chemical/biological datasets
- Using MoleculeNet benchmark datasets (Tox21, BBBP, Delaney, etc.)
- Converting molecules to ML-ready features (fingerprints, graph representations, descriptors)
- Implementing graph neural networks for molecules (GCN, GAT, MPNN, AttentiveFP)
- Applying transfer learning with pretrained models (ChemBERTa, GROVER, MolFormer)
- Predicting crystal/materials properties (bandgap, formation energy)
- Analyzing protein or DNA sequences
Core Capabilities
1. Molecular Data Loading and Processing
DeepChem provides specialized loaders for various chemical data formats:
python
import deepchem as dc

# Load CSV with SMILES
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],
    feature_field='smiles',
    featurizer=featurizer
)
dataset = loader.create_dataset('molecules.csv')

# Load SDF files
loader = dc.data.SDFLoader(tasks=['activity'], featurizer=featurizer)
dataset = loader.create_dataset('compounds.sdf')

# Load protein sequences
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
**Key Loaders**:
- `CSVLoader`: Tabular data with molecular identifiers
- `SDFLoader`: Molecular structure files
- `FASTALoader`: Protein/DNA sequences
- `ImageLoader`: Molecular images
- `JsonLoader`: JSON-formatted datasets
2. Molecular Featurization
Convert molecules into numerical representations for ML models.
Decision Tree for Featurizer Selection
Is the model a graph neural network?
├─ YES → Use graph featurizers
│  ├─ Standard GNN → MolGraphConvFeaturizer
│  ├─ Message passing → DMPNNFeaturizer
│  └─ Pretrained → GroverFeaturizer
│
└─ NO → What type of model?
   ├─ Traditional ML (RF, XGBoost, SVM)
   │  ├─ Fast baseline → CircularFingerprint (ECFP)
   │  ├─ Interpretable → RDKitDescriptors
   │  └─ Maximum coverage → MordredDescriptors
   │
   ├─ Deep learning (non-graph)
   │  ├─ Dense networks → CircularFingerprint
   │  └─ CNN → SmilesToImage
   │
   ├─ Sequence models (LSTM, Transformer)
   │  └─ SmilesToSeq
   │
   └─ 3D structure analysis
      └─ CoulombMatrix
Example Featurization
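The featurizer decision tree can also be read as a plain lookup table. The sketch below is only an illustration of that reading: the function name `choose_featurizer` and the key strings (`"gnn"`, `"fast_baseline"`, and so on) are hypothetical labels invented here, and the returned values are just the featurizer class names listed in the tree.

```python
# Hypothetical lookup that mirrors the decision tree; keys are illustrative labels,
# values are the featurizer names from the tree (returned as strings only).
FEATURIZER_TREE = {
    ("gnn", "standard"): "MolGraphConvFeaturizer",
    ("gnn", "message_passing"): "DMPNNFeaturizer",
    ("gnn", "pretrained"): "GroverFeaturizer",
    ("traditional_ml", "fast_baseline"): "CircularFingerprint",
    ("traditional_ml", "interpretable"): "RDKitDescriptors",
    ("traditional_ml", "max_coverage"): "MordredDescriptors",
    ("deep_learning", "dense"): "CircularFingerprint",
    ("deep_learning", "cnn"): "SmilesToImage",
    ("sequence", "any"): "SmilesToSeq",
    ("3d_structure", "any"): "CoulombMatrix",
}


def choose_featurizer(model_family: str, variant: str = "any") -> str:
    """Return the featurizer name suggested for a model family and variant."""
    try:
        return FEATURIZER_TREE[(model_family, variant)]
    except KeyError:
        raise ValueError(
            f"no recommendation for {model_family!r} / {variant!r}"
        ) from None


print(choose_featurizer("gnn", "standard"))
print(choose_featurizer("sequence"))
```

Keeping the mapping in one dictionary makes it easy to review against the tree and to extend if more featurizers are added.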
python
# Fingerprints (for traditional ML)
fp = dc.feat.CircularFingerprint(radius=2, size=2048)

# Descriptors (for interpretable models)
desc = dc.feat.RDKitDescriptors()

# Graph features (for GNNs)
graph_feat = dc.feat.MolGraphConvFeaturizer()

# Apply featurization
features = fp.featurize(['CCO', 'c1ccccc1'])
**Selection Guide**:
- **Small datasets (<1K)**: CircularFingerprint or RDKitDescriptors
- **Medium datasets (1K-100K)**: CircularFingerprint or graph featurizers
- **Large datasets (>100K)**: Graph featurizers (MolGraphConvFeaturizer, DMPNNFeaturizer)
- **Transfer learning**: Pretrained model featurizers (GroverFeaturizer)
See `references/api_reference.md` for complete featurizer documentation.
3. Data Splitting
Critical: For drug discovery tasks, use `ScaffoldSplitter` to prevent data leakage from similar molecular structures appearing in both training and test sets.
python
# Scaffold splitting (recommended for molecules)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)

# Random splitting (for non-molecular data)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)

# Stratified splitting (for imbalanced classification)
splitter = dc.splits.RandomStratifiedSplitter()
train, test = splitter.train_test_split(dataset)
**Available Splitters**:
- `ScaffoldSplitter`: Split by molecular scaffolds (prevents leakage)
- `ButinaSplitter`: Clustering-based molecular splitting
- `MaxMinSplitter`: Maximize diversity between sets
- `RandomSplitter`: Random splitting
- `RandomStratifiedSplitter`: Preserves class distributions
4. Model Selection and Training
Quick Model Selection Guide
| Dataset Size | Task | Recommended Model | Featurizer |
|---|---|---|---|
| < 1K samples | Any | SklearnModel (RandomForest) | CircularFingerprint |
| 1K-100K | Classification/Regression | GBDTModel or MultitaskRegressor | CircularFingerprint |
| > 100K | Molecular properties | GCNModel, AttentiveFPModel, DMPNNModel | MolGraphConvFeaturizer |
| Any (small preferred) | Transfer learning | ChemBERTa, GROVER, MolFormer | Model-specific |
| Crystal structures | Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Protein sequences | Protein properties | ProtBERT | Sequence-based |
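As a rough illustration, the size thresholds in the table can be written as a small helper. This is a sketch, not part of DeepChem: `recommend_model` and the `data_kind` labels are hypothetical, and the function only returns the model and featurizer names from the table as strings.

```python
from typing import Tuple


def recommend_model(n_samples: int, data_kind: str = "molecules") -> Tuple[str, str]:
    """Return (recommended model, featurizer) following the selection table."""
    # Non-molecular inputs have their own rows in the table.
    if data_kind == "crystal":
        return ("CGCNNModel, MEGNetModel", "Structure-based")
    if data_kind == "protein":
        return ("ProtBERT", "Sequence-based")
    # Molecular data: choose by dataset size.
    if n_samples < 1_000:
        return ("SklearnModel (RandomForest)", "CircularFingerprint")
    if n_samples <= 100_000:
        return ("GBDTModel or MultitaskRegressor", "CircularFingerprint")
    return ("GCNModel, AttentiveFPModel, DMPNNModel", "MolGraphConvFeaturizer")


for size in (500, 20_000, 250_000):
    print(size, recommend_model(size))
```

The transfer-learning row is left out on purpose, since that choice depends on whether a suitable pretrained model exists rather than on dataset size alone.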
Example: Traditional ML
python
from sklearn.ensemble import RandomForestRegressor

# Wrap scikit-learn model
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(model=sklearn_model)
model.fit(train)
Example: Deep Learning
python
# Multitask regressor (for fingerprints)
model = dc.models.MultitaskRegressor(
    n_tasks=2,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001
)
model.fit(train, nb_epoch=50)
Example: Graph Neural Networks
python
# Graph Convolutional Network
model = dc.models.GCNModel(
    n_tasks=1,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)
model.fit(train, nb_epoch=50)

# Graph Attention Network
model = dc.models.GATModel(n_tasks=1, mode='classification')
model.fit(train, nb_epoch=50)

# Attentive Fingerprint (expects edge features: MolGraphConvFeaturizer(use_edges=True))
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(train, nb_epoch=50)
5. MoleculeNet Benchmarks
Quick access to 30+ curated benchmark datasets with standardized train/valid/test splits:
python
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
    # GCNModel expects graph features; string options include 'ECFP', 'GraphConv', 'Weave', 'Raw'
    featurizer=dc.feat.MolGraphConvFeaturizer(),
    splitter='scaffold',  # or 'random', 'stratified'
    reload=False
)
train, valid, test = datasets

# Train and evaluate
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
**Common Datasets**:
- **Classification**: `load_tox21()`, `load_bbbp()`, `load_hiv()`, `load_clintox()`
- **Regression**: `load_delaney()`, `load_freesolv()`, `load_lipo()`
- **Quantum properties**: `load_qm7()`, `load_qm8()`, `load_qm9()`
- **Materials**: `load_perovskite()`, `load_bandgap()`, `load_mp_formation_energy()`
See `references/api_reference.md` for complete dataset list.
6. Transfer Learning
Leverage pretrained models for improved performance, especially on small datasets:
python
# ChemBERTa (BERT pretrained on 77M molecules)
model = dc.models.HuggingFaceModel(
    model='seyonec/ChemBERTa-zinc-base-v1',
    task='classification',
    n_tasks=1,
    learning_rate=2e-5  # Lower LR for fine-tuning
)
model.fit(train, nb_epoch=10)

# GROVER (graph transformer pretrained on 10M molecules)
model = dc.models.GroverModel(
    task='regression',
    n_tasks=1
)
model.fit(train, nb_epoch=20)
**When to use transfer learning**:
- Small datasets (< 1000 samples)
- Novel molecular scaffolds
- Limited computational resources
- Need for rapid prototyping
Use the `scripts/transfer_learning.py` script for guided transfer learning workflows.
7. Model Evaluation
python
# Define metrics
classification_metrics = [
    dc.metrics.Metric(dc.metrics.roc_auc_score, name='ROC-AUC'),
    dc.metrics.Metric(dc.metrics.accuracy_score, name='Accuracy'),
    dc.metrics.Metric(dc.metrics.f1_score, name='F1')
]
regression_metrics = [
    dc.metrics.Metric(dc.metrics.r2_score, name='R²'),
    dc.metrics.Metric(dc.metrics.mean_absolute_error, name='MAE'),
    dc.metrics.Metric(dc.metrics.rms_score, name='RMSE')  # DeepChem's root-mean-square error helper
]

# Evaluate
train_scores = model.evaluate(train, classification_metrics)
test_scores = model.evaluate(test, classification_metrics)
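For readers who want to see what the regression metrics measure, here is a minimal, library-free sketch of MAE, RMSE, and R². The helper name `regression_report` is hypothetical and the numbers are toy values; in practice the DeepChem metric objects would compute these.

```python
import math


def regression_report(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for two equal-length sequences of floats."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy example: a model that always predicts the mean of the targets has R^2 of 0.
print(regression_report([0.0, 2.0], [1.0, 1.0]))
```

An R² near zero therefore means the model does no better than predicting the average target, which is a useful sanity check when comparing train and test scores.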
8. Making Predictions
python
# Predict on test set
predictions = model.predict(test)

# Predict on new molecules
new_smiles = ['CCO', 'c1ccccc1', 'CC(C)O']
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)

# Apply same transformations as training
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)
# Passing the transformers lets predict() undo any target (y) transformation
predictions = model.predict(new_dataset, transformers)
Typical Workflows
Workflow A: Quick Benchmark Evaluation
For evaluating a model on standard benchmarks:
python
import deepchem as dc

# 1. Load benchmark
tasks, datasets, _ = dc.molnet.load_bbbp(
    featurizer=dc.feat.MolGraphConvFeaturizer(),  # graph features expected by GCNModel
    splitter='scaffold'
)
train, valid, test = datasets

# 2. Train model
model = dc.models.GCNModel(n_tasks=len(tasks), mode='classification')
model.fit(train, nb_epoch=50)

# 3. Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
Workflow B: Custom Data Prediction
For training on custom molecular datasets:
python
import deepchem as dc

# 1. Load and featurize data
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=featurizer
)
dataset = loader.create_dataset('my_molecules.csv')

# 2. Split data (use ScaffoldSplitter for molecules!)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)

# 3. Normalize (optional but recommended)
transformers = [dc.trans.NormalizationTransformer(
    transform_y=True, dataset=train
)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)

# 4. Train model
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25
)
model.fit(train, nb_epoch=50)

# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
test_score = model.evaluate(test, [metric])
Workflow C: Transfer Learning on Small Dataset
For leveraging pretrained models:
python
import deepchem as dc

# 1. Load data (pretrained models often need raw SMILES)
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()  # Model handles featurization
)
dataset = loader.create_dataset('small_dataset.csv')

# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# 3. Load pretrained model
model = dc.models.HuggingFaceModel(
    model='seyonec/ChemBERTa-zinc-base-v1',
    task='classification',
    n_tasks=1,
    learning_rate=2e-5
)

# 4. Fine-tune
model.fit(train, nb_epoch=10)

# 5. Evaluate
predictions = model.predict(test)
See `references/workflows.md` for 8 detailed workflow examples covering molecular generation, materials science, protein analysis, and more.
Example Scripts
This skill includes three production-ready scripts in the `scripts/` directory:
1. `predict_solubility.py`
Train and evaluate solubility prediction models. Works with the Delaney benchmark or custom CSV data.
bash
# Use Delaney benchmark
python scripts/predict_solubility.py

# Use custom data
python scripts/predict_solubility.py \
    --data my_data.csv \
    --smiles-col smiles \
    --target-col solubility \
    --predict "CCO" "c1ccccc1"
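To make the flag semantics concrete, the following sketch shows one way a command line like the one for `predict_solubility.py` could be parsed with `argparse`. It is not the script's actual source: the defaults, help strings, and the `build_parser` helper are assumptions made for illustration; only the flag names come from the usage shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the flags shown in the usage example."""
    parser = argparse.ArgumentParser(prog="predict_solubility.py")
    # Assumed behavior: without --data, the Delaney benchmark is used.
    parser.add_argument("--data", default=None, help="path to a custom CSV file")
    parser.add_argument("--smiles-col", default="smiles", help="SMILES column name")
    parser.add_argument("--target-col", default="solubility", help="target column name")
    # Zero or more SMILES strings to predict after training.
    parser.add_argument("--predict", nargs="*", default=[], help="SMILES to predict")
    return parser


args = build_parser().parse_args(
    ["--data", "my_data.csv", "--predict", "CCO", "c1ccccc1"]
)
print(args.data, args.smiles_col, args.target_col, args.predict)
```

Note that `argparse` turns `--smiles-col` into the attribute `smiles_col`, and `nargs="*"` collects every SMILES string after `--predict` into a list.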
2. `graph_neural_network.py`
Train various graph neural network architectures on molecular data.
bash
# Train GCN on Tox21
python scripts/graph_neural_network.py --model gcn --dataset tox21

# Train AttentiveFP on custom data
python scripts/graph_neural_network.py \
    --model attentivefp \
    --data molecules.csv \
    --task-type regression \
    --targets activity \
    --epochs 100
3. `transfer_learning.py`
Fine-tune pretrained models (ChemBERTa, GROVER) on molecular property prediction tasks.
bash
# Fine-tune ChemBERTa on BBBP
python scripts/transfer_learning.py --model chemberta --dataset bbbp

# Fine-tune GROVER on custom data
python scripts/transfer_learning.py \
    --model grover \
    --data small_dataset.csv \
    --target activity \
    --task-type classification \
    --epochs 20
Common Patterns and Best Practices
Pattern 1: Always Use Scaffold Splitting for Molecules
python
# GOOD: Prevents data leakage
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# BAD: Similar molecules in train and test
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
Pattern 2: Normalize Features and Targets
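The key point of this pattern is that normalization statistics come from the training split only and are then reused for every other split. The library-free sketch below illustrates that idea with z-score scaling; the helper names (`fit_normalizer`, `transform`, `untransform`) and the toy values are hypothetical and are not DeepChem functions.

```python
from statistics import mean, pstdev


def fit_normalizer(values):
    """Learn mean and standard deviation from the training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for constant targets
    return mu, sigma


def transform(values, mu, sigma):
    """Apply z-score scaling with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


def untransform(values, mu, sigma):
    """Map scaled values (e.g. model predictions) back to the original units."""
    return [v * sigma + mu for v in values]


train_y = [1.0, 2.0, 3.0]
test_y = [2.0, 4.0]
mu, sigma = fit_normalizer(train_y)      # statistics from train only
train_z = transform(train_y, mu, sigma)
test_z = transform(test_y, mu, sigma)    # test reuses the train statistics
print(train_z, test_z, untransform(test_z, mu, sigma))
```

Fitting on the training split alone keeps information about the test distribution out of the model, and `untransform` is the step that returns predictions to physical units.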
python
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,  # Normalize target values (use transform_X=True for features)
        dataset=train
    )
]
for transformer in transformers:
    train = transformer.transform(train)
    test = transformer.transform(test)
Pattern 3: Start Simple, Then Scale
- Start with Random Forest + CircularFingerprint (fast baseline)
- Try XGBoost/LightGBM if RF works well
- Move to deep learning (MultitaskRegressor) if you have >5K samples
- Try GNNs if you have >10K samples
- Use transfer learning for small datasets or novel scaffolds
Pattern 4: Handle Imbalanced Data
python
# Option 1: Balancing transformer
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)

# Option 2: Use balanced metrics
metric = dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
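Both options rest on simple arithmetic, sketched here without any library. The weighting formula is the common "inverse class frequency" scheme and is an assumption for illustration; it is not claimed to match the exact weights `BalancingTransformer` produces. The helper names are hypothetical.

```python
from collections import Counter


def balancing_weights(labels):
    """Per-sample weights so that every class carries the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[y]) for y in labels]


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


labels = [0, 0, 0, 1]                 # 3 inactive molecules, 1 active
always_inactive = [0, 0, 0, 0]        # a model that ignores the rare class
print(balancing_weights(labels))
print(balanced_accuracy(labels, always_inactive))
```

A model that always predicts the majority class reaches 75% plain accuracy on this toy set but only 0.5 balanced accuracy, which is why balanced metrics are the safer choice for skewed data.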
Pattern 5: Avoid Memory Issues
python
# Use DiskDataset for large datasets
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)

# Use smaller batch sizes
model = dc.models.GCNModel(n_tasks=1, mode='regression', batch_size=32)  # Instead of 128
Common Pitfalls
Issue 1: Data Leakage in Drug Discovery
Problem: Using random splitting allows similar molecules in train/test sets.
Solution: Always use `ScaffoldSplitter` for molecular datasets.
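The idea behind scaffold-based splitting can be shown without any chemistry library: molecules that share a scaffold are kept together, so no scaffold appears on both sides of the split. The sketch below assumes the scaffold labels are already computed (in practice they would come from a cheminformatics toolkit); `group_split` and the toy labels are hypothetical, and the "largest groups first" ordering is one common heuristic rather than a statement about DeepChem's exact procedure.

```python
from collections import defaultdict


def group_split(ids, scaffolds, frac_train=0.8):
    """Split ids into train/test so that each scaffold lands in exactly one set."""
    groups = defaultdict(list)
    for mol_id, scaffold in zip(ids, scaffolds):
        groups[scaffold].append(mol_id)
    # Largest scaffold families first, so common chemotypes fill the training set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    cutoff = frac_train * len(ids)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


ids = ["m1", "m2", "m3", "m4", "m5"]
scaffolds = ["benzene", "benzene", "benzene", "pyridine", "acyclic"]  # toy labels
train_ids, test_ids = group_split(ids, scaffolds)
print(train_ids, test_ids)
```

Because whole groups move together, the test set contains only scaffolds the model never saw during training, which gives a more honest estimate of performance on new chemical series than a random split.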
Issue 2: GNN Underperforming vs Fingerprints
Problem: Graph neural networks perform worse than simple fingerprints.
Solutions:
- Ensure dataset is large enough (>10K samples typically)
- Increase training epochs (50-100)
- Try different architectures (AttentiveFP, DMPNN instead of GCN)
- Use pretrained models (GROVER)
Issue 3: Overfitting on Small Datasets
Problem: Model memorizes training data.
Solutions:
- Use stronger regularization (increase dropout to 0.5)
- Use simpler models (Random Forest instead of deep learning)
- Apply transfer learning (ChemBERTa, GROVER)
- Collect more data
Issue 4: Import Errors
Problem: Module not found errors.
Solution: Ensure DeepChem is installed with required dependencies:
bash
uv pip install deepchem

# For PyTorch models
uv pip install deepchem[torch]

# For all features
uv pip install deepchem[all]
Reference Documentation
This skill includes comprehensive reference documentation:
`references/api_reference.md`
Complete API documentation including:
- All data loaders and their use cases
- Dataset classes and when to use each
- Complete featurizer catalog with selection guide
- Model catalog organized by category (50+ models)
- MoleculeNet dataset descriptions
- Metrics and evaluation functions
- Common code patterns
When to reference: Search this file when you need specific API details, parameter names, or want to explore available options.
`references/workflows.md`
Eight detailed end-to-end workflows:
- Molecular property prediction from SMILES
- Using MoleculeNet benchmarks
- Hyperparameter optimization
- Transfer learning with pretrained models
- Molecular generation with GANs
- Materials property prediction
- Protein sequence analysis
- Custom model integration
When to reference: Use these workflows as templates for implementing complete solutions.
Installation Notes
Basic installation:
bash
uv pip install deepchem
For PyTorch models (GCN, GAT, etc.):
bash
uv pip install deepchem[torch]
For all features:
bash
uv pip install deepchem[all]
If import errors occur, the user may need specific dependencies. Check the DeepChem documentation for detailed installation instructions.
Additional Resources
- Official documentation: https://deepchem.readthedocs.io/
- GitHub repository: https://github.com/deepchem/deepchem
- Tutorials: https://deepchem.readthedocs.io/en/latest/get_started/tutorials.html
- Paper: "MoleculeNet: A Benchmark for Molecular Machine Learning"