# Molfeat - Molecular Featurization Hub

## Overview
Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications. Features fast parallel processing, scikit-learn compatible transformers, and built-in caching.
## When to Use This Skill
This skill should be used when working with:
- Molecular machine learning: Building QSAR/QSPR models, property prediction
- Virtual screening: Ranking compound libraries for biological activity
- Similarity searching: Finding structurally similar molecules
- Chemical space analysis: Clustering, visualization, dimensionality reduction
- Deep learning: Training neural networks on molecular data
- Featurization pipelines: Converting SMILES to ML-ready representations
- Cheminformatics: Any task requiring molecular feature extraction
## Installation

```bash
uv pip install molfeat
```

With all optional dependencies:

```bash
uv pip install "molfeat[all]"
```
**Optional dependencies for specific featurizers:**
- `molfeat[dgl]` - GNN models (GIN variants)
- `molfeat[graphormer]` - Graphormer models
- `molfeat[transformer]` - ChemBERTa, ChemGPT, MolT5
- `molfeat[fcd]` - FCD descriptors
- `molfeat[map4]` - MAP4 fingerprints
## Core Concepts
Molfeat organizes featurization into three hierarchical classes:
### 1. Calculators (`molfeat.calc`)

Callable objects that convert individual molecules into feature vectors. They accept RDKit `Chem.Mol` objects or SMILES strings.

Use calculators for:
- Single molecule featurization
- Custom processing loops
- Direct feature computation
Example:

```python
from molfeat.calc import FPCalculator

calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO")  # Returns numpy array (2048,)
```
### 2. Transformers (`molfeat.trans`)

Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
Use transformers for:
- Batch featurization of molecular datasets
- Integration with scikit-learn pipelines
- Parallel processing (automatic CPU utilization)
Example:

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)  # Parallel processing
```
### 3. Pretrained Transformers (`molfeat.trans.pretrained`)

Specialized transformers for deep learning models with batched inference and caching.
Use pretrained transformers for:
- State-of-the-art molecular embeddings
- Transfer learning from large chemical datasets
- Deep learning feature extraction
Example:

```python
from molfeat.trans.pretrained import PretrainedMolTransformer

transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list)  # Deep learning embeddings
```
## Quick Start Workflow

### Basic Featurization
```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}")  # (4, 2048)
```

### Save and Load Configuration
```python
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")

# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
```

### Handle Errors Gracefully
```python
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    ignore_errors=True,  # Continue on failures
    verbose=True         # Log error details
)
features = transformer(smiles_with_errors)
# Returns None for failed molecules
```
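With `ignore_errors=True`, failed molecules come back as `None`, so downstream arrays must be filtered with labels kept aligned. A minimal sketch — the `features` and `labels` values here are hypothetical placeholders standing in for a transformer's output:

```python
import numpy as np

# Hypothetical transformer output: None marks molecules that failed to parse
features = [np.ones(4), None, np.zeros(4), None]
labels = [1.0, 2.0, 3.0, 4.0]

# Keep only successfully featurized molecules, with labels aligned
ok = [i for i, f in enumerate(features) if f is not None]
X = np.vstack([features[i] for i in ok])
y = np.array([labels[i] for i in ok])
print(X.shape, y.tolist())  # (2, 4) [1.0, 3.0]
```

Tracking the surviving indices (`ok`) rather than filtering in place keeps features, labels, and the original SMILES list in sync.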
## Choosing the Right Featurizer

### For Traditional Machine Learning (RF, SVM, XGBoost)
Start with fingerprints:

```python
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)

# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")

# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
```

**For interpretable models:**

```python
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()

# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
```

**Combine multiple featurizers:**

```python
from molfeat.trans import FeatConcat

concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])  # Result: 2215-dimensional combined features
```

### For Deep Learning
Transformer-based embeddings:

```python
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")

# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
```

**Graph neural networks:**

```python
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")

# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
```

### For Similarity Searching
```python
# ECFP - General purpose, most widely used
FPCalculator("ecfp")

# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")

# MAP4 - Efficient for large databases
FPCalculator("map4")

# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
```

### For Pharmacophore-Based Approaches
```python
# FCFP - Functional group based
FPCalculator("fcfp")

# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")

# Gobbi - Explicit pharmacophore features
FPCalculator("gobbi2D")
```

## Common Workflows

### Building a QSAR Model
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)

# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")

# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
```

### Virtual Screening Pipeline
```python
from sklearn.ensemble import RandomForestClassifier

# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)

# Screen large library
X_screen = transformer(screening_library)  # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]

# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
```

### Similarity Search
```python
from sklearn.metrics.pairwise import cosine_similarity

# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)

# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)

# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
```
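Cosine similarity works, but for binary fingerprints the Tanimoto (Jaccard) coefficient is the conventional metric in cheminformatics. A minimal NumPy sketch — the toy 4-bit fingerprints are purely illustrative; real ones come from `FPCalculator`:

```python
import numpy as np

def tanimoto_similarity(query_fp, database_fps):
    """Tanimoto similarity between one binary fingerprint and a matrix of them."""
    q = np.asarray(query_fp, dtype=bool)
    db = np.asarray(database_fps, dtype=bool)
    intersection = (db & q).sum(axis=1)
    union = (db | q).sum(axis=1)
    return intersection / np.maximum(union, 1)  # guard against all-zero fingerprints

# Toy 4-bit fingerprints for illustration
query = np.array([1, 1, 0, 1])
database = np.array([
    [1, 1, 0, 1],  # identical      -> 1.0
    [1, 0, 0, 0],  # 1 of 3 shared  -> 0.333...
    [0, 0, 1, 0],  # disjoint       -> 0.0
])
sims = tanimoto_similarity(query, database)
top = sims.argsort()[::-1]  # rank database entries by similarity
```

The boolean bitwise operations vectorize across the whole database, so ranking even a large fingerprint matrix stays a single pass.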
### Scikit-learn Pipeline Integration
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Create end-to-end pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```

### Comparing Multiple Featurizers
```python
featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'Descriptors': RDKitDescriptors2D(),
    'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")
}

results = {}
for name, feat in featurizers.items():
    transformer = MoleculeTransformer(feat, n_jobs=-1)
    X = transformer(smiles)
    # Evaluate with your ML model
    score = evaluate_model(X, y)
    results[name] = score
```

## Discovering Available Featurizers
Use the ModelStore to explore all available featurizers:

```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")

# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")

# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage()  # Display usage examples

# Load model
transformer = store.load("ChemBERTa-77M-MLM")
```

## Advanced Features

### Custom Preprocessing
```python
import datamol as dm

class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing pipeline"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        mol = dm.standardize_mol(mol)
        mol = dm.remove_salts(mol)
        return mol

transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
```

### Batch Processing Large Datasets
```python
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    """Process large datasets in chunks to manage memory"""
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
    return np.vstack(all_features)
```

### Caching Expensive Embeddings
```python
import pickle

cache_file = "embeddings_cache.pkl"
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

try:
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
except FileNotFoundError:
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
```

## Performance Tips
- **Use parallelization**: Set `n_jobs=-1` to utilize all CPU cores
- **Batch processing**: Process multiple molecules at once instead of looping one at a time
- **Choose appropriate featurizers**: Fingerprints are faster than deep learning models
- **Cache pretrained models**: Leverage built-in caching for repeated use
- **Use float32**: Set `dtype=np.float32` when precision allows
- **Handle errors efficiently**: Use `ignore_errors=True` for large datasets
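The float32 tip matters at screening scale. A quick back-of-the-envelope calculation of the memory footprint of a hypothetical 1M x 2048 fingerprint matrix under different dtypes — pure NumPy arithmetic, no molfeat required:

```python
import numpy as np

n_mols, n_bits = 1_000_000, 2048
footprints = {
    np.dtype(dt).name: n_mols * n_bits * np.dtype(dt).itemsize / 1e9
    for dt in (np.float64, np.float32, np.uint8)
}
for name, gb in footprints.items():
    print(f"{name}: {gb:.2f} GB")
# float64: 16.38 GB, float32: 8.19 GB, uint8: 2.05 GB
```

Halving the dtype halves the matrix, and for binary fingerprints an integer dtype is smaller still; whether downstream models accept it depends on the library.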
## Common Featurizers Reference
Quick reference for frequently used featurizers:

| Featurizer | Type | Dimensions | Speed | Use Case |
|---|---|---|---|---|
| ECFP | Fingerprint | 2048 | Fast | General purpose |
| MACCS | Fingerprint | 167 | Very fast | Scaffold similarity |
| RDKit 2D | Descriptors | 200+ | Fast | Interpretable models |
| Mordred | Descriptors | 1800+ | Medium | Comprehensive features |
| MAP4 | Fingerprint | 1024 | Fast | Large-scale screening |
| ChemBERTa | Deep learning | 768 | Slow* | Transfer learning |
| GIN | GNN | Variable | Slow* | Graph-based models |

*First run is slow; subsequent runs benefit from caching
## Resources

This skill includes comprehensive reference documentation:

### references/api_reference.md
Complete API documentation covering:
- `molfeat.calc` - All calculator classes and parameters
- `molfeat.trans` - Transformer classes and methods
- `molfeat.store` - ModelStore usage
- Common patterns and integration examples
- Performance optimization tips

**When to load:** Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.

### references/available_featurizers.md
Comprehensive catalog of all 100+ featurizers organized by category:
- Transformer-based language models (ChemBERTa, ChemGPT)
- Graph neural networks (GIN, Graphormer)
- Molecular descriptors (RDKit, Mordred)
- Fingerprints (ECFP, MACCS, MAP4, and 15+ others)
- Pharmacophore descriptors (CATS, Gobbi)
- Shape descriptors (USR, ElectroShape)
- Scaffold-based descriptors

**When to load:** Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.

**Search tip:** Use grep to find specific featurizer types:

```bash
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
```

### references/examples.md
Practical code examples for common scenarios:
- Installation and quick start
- Calculator and transformer examples
- Pretrained model usage
- Scikit-learn and PyTorch integration
- Virtual screening workflows
- QSAR model building
- Similarity searching
- Troubleshooting and best practices

**When to load:** Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.

## Troubleshooting

### Invalid Molecules
Enable error handling to skip invalid SMILES:

```python
transformer = MoleculeTransformer(
    calc,
    ignore_errors=True,
    verbose=True
)
```

### Memory Issues with Large Datasets
Process in chunks or use streaming approaches for datasets > 100K molecules.
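One way to stream rather than materialize the full feature matrix is a generator that featurizes fixed-size chunks lazily. A sketch with a stand-in featurizer — the `featurize` callable here is a hypothetical placeholder for a `MoleculeTransformer`:

```python
import numpy as np

def stream_featurize(smiles_iter, featurize, chunk_size=10_000):
    """Yield feature arrays chunk by chunk instead of holding everything in memory."""
    chunk = []
    for smi in smiles_iter:
        chunk.append(smi)
        if len(chunk) == chunk_size:
            yield featurize(chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        yield featurize(chunk)

# Stand-in featurizer: one feature per molecule (its SMILES length)
fake_featurizer = lambda batch: np.array([[len(s)] for s in batch])
chunks = stream_featurize(["CCO", "C", "c1ccccc1", "CC"], fake_featurizer, chunk_size=3)
shapes = [c.shape for c in chunks]
print(shapes)  # [(3, 1), (1, 1)]
```

Each yielded array can be written to disk (e.g. appended to an HDF5 or memory-mapped file) so peak memory stays bounded by the chunk size, not the dataset size.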
### Pretrained Model Dependencies
Some models require additional packages. Install specific extras:

```bash
uv pip install "molfeat[transformer]"  # For ChemBERTa/ChemGPT
uv pip install "molfeat[dgl]"          # For GIN models
```

### Reproducibility
Save exact configurations and document versions:

```python
transformer.to_state_yaml_file("config.yml")

import molfeat
print(f"molfeat version: {molfeat.__version__}")
```

## Additional Resources
- Official Documentation: https://molfeat-docs.datamol.io/
- GitHub Repository: https://github.com/datamol-io/molfeat
- PyPI Package: https://pypi.org/project/molfeat/
- Tutorial: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6