molfeat


Molfeat - Molecular Featurization Hub


Overview


Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications. Features fast parallel processing, scikit-learn compatible transformers, and built-in caching.

When to Use This Skill


This skill should be used when working with:
  • Molecular machine learning: Building QSAR/QSPR models, property prediction
  • Virtual screening: Ranking compound libraries for biological activity
  • Similarity searching: Finding structurally similar molecules
  • Chemical space analysis: Clustering, visualization, dimensionality reduction
  • Deep learning: Training neural networks on molecular data
  • Featurization pipelines: Converting SMILES to ML-ready representations
  • Cheminformatics: Any task requiring molecular feature extraction

Installation


```bash
uv pip install molfeat
```

With all optional dependencies:

```bash
uv pip install "molfeat[all]"
```

**Optional dependencies for specific featurizers:**
- `molfeat[dgl]` - GNN models (GIN variants)
- `molfeat[graphormer]` - Graphormer models
- `molfeat[transformer]` - ChemBERTa, ChemGPT, MolT5
- `molfeat[fcd]` - FCD descriptors
- `molfeat[map4]` - MAP4 fingerprints

Core Concepts


Molfeat organizes featurization into three hierarchical classes:

1. Calculators (`molfeat.calc`)
Callable objects that convert individual molecules into feature vectors. Accept RDKit `Chem.Mol` objects or SMILES strings.
Use calculators for:
  • Single molecule featurization
  • Custom processing loops
  • Direct feature computation
Example:

```python
from molfeat.calc import FPCalculator

calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO")  # Returns numpy array (2048,)
```

2. Transformers (`molfeat.trans`)
Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
Use transformers for:
  • Batch featurization of molecular datasets
  • Integration with scikit-learn pipelines
  • Parallel processing (automatic CPU utilization)
Example:

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)  # Parallel processing
```

3. Pretrained Transformers (`molfeat.trans.pretrained`)
Specialized transformers for deep learning models with batched inference and caching.
Use pretrained transformers for:
  • State-of-the-art molecular embeddings
  • Transfer learning from large chemical datasets
  • Deep learning feature extraction
Example:

```python
from molfeat.trans.pretrained import PretrainedMolTransformer

transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list)  # Deep learning embeddings
```

Quick Start Workflow


Basic Featurization


```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}")  # (4, 2048)
```

Save and Load Configuration


```python
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")

# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
```

Handle Errors Gracefully


```python
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    ignore_errors=True,  # Continue on failures
    verbose=True,        # Log error details
)
features = transformer(smiles_with_errors)
# Returns None for failed molecules
```
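Since failed molecules come back as `None`, downstream models need the failures dropped while keeping SMILES and features aligned. A minimal sketch in plain Python/numpy (the `drop_failures` helper and the simulated inputs are illustrative, not part of molfeat):

```python
import numpy as np

def drop_failures(smiles_list, features):
    """Keep only (smiles, feature) pairs where featurization succeeded."""
    kept = [(smi, feat) for smi, feat in zip(smiles_list, features) if feat is not None]
    kept_smiles = [smi for smi, _ in kept]
    kept_features = np.stack([feat for _, feat in kept])
    return kept_smiles, kept_features

# Example with a simulated failure in position 1
smiles = ["CCO", "not_a_smiles", "c1ccccc1"]
feats = [np.ones(4), None, np.zeros(4)]
ok_smiles, ok_feats = drop_failures(smiles, feats)
print(ok_smiles)       # ['CCO', 'c1ccccc1']
print(ok_feats.shape)  # (2, 4)
```

Keeping the filtered SMILES list alongside the feature matrix preserves the row-to-molecule mapping for later inspection.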

Choosing the Right Featurizer


For Traditional Machine Learning (RF, SVM, XGBoost)


Start with fingerprints:

```python
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)

# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")

# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
```

**For interpretable models:**
```python
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()

# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
```

**Combine multiple featurizers:**
```python
from molfeat.trans import FeatConcat

concat = FeatConcat([
    FPCalculator("maccs"),      # 167 dimensions
    FPCalculator("ecfp")         # 2048 dimensions
])  # Result: 2215-dimensional combined features
```
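The dimension arithmetic behind concatenation is plain horizontal stacking; it can be sanity-checked with numpy alone (the zero arrays below are stand-ins for real fingerprint batches, not molfeat output):

```python
import numpy as np

maccs_like = np.zeros((5, 167))   # stand-in for a MACCS batch of 5 molecules
ecfp_like = np.zeros((5, 2048))   # stand-in for an ECFP batch of 5 molecules

combined = np.hstack([maccs_like, ecfp_like])
print(combined.shape)  # (5, 2215)
```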

For Deep Learning


Transformer-based embeddings:

```python
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")

# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
```

**Graph neural networks:**
```python
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")

# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
```

For Similarity Searching


```python
# ECFP - General purpose, most widely used
FPCalculator("ecfp")

# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")

# MAP4 - Efficient for large databases
FPCalculator("map4")

# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
```

For Pharmacophore-Based Approaches


```python
# FCFP - Functional group based
FPCalculator("fcfp")

# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")

# Gobbi - Explicit pharmacophore features
FPCalculator("gobbi2D")
```

Common Workflows


Building a QSAR Model


```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)

# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")

# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
```

Virtual Screening Pipeline


```python
from sklearn.ensemble import RandomForestClassifier

# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)

# Screen large library
X_screen = transformer(screening_library)  # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]

# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
```
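The ranking step relies on `argsort` returning indices in ascending score order, so reversing and slicing yields the top-k. A small self-contained check of that idiom:

```python
import numpy as np

scores = np.array([0.10, 0.90, 0.40, 0.70])
top2 = scores.argsort()[::-1][:2]  # indices of the two highest scores
print(top2.tolist())  # [1, 3]
```

For libraries in the millions, `np.argpartition(scores, -k)[-k:]` avoids a full sort when only the top-k set (unordered) is needed.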

Similarity Search


```python
from sklearn.metrics.pairwise import cosine_similarity

# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)

# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)

# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
```
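For binary fingerprints such as ECFP bit vectors, Tanimoto similarity is the more conventional metric than cosine. A minimal numpy sketch (the `tanimoto` helper and toy fingerprints are illustrative, not part of molfeat's API):

```python
import numpy as np

def tanimoto(query, database):
    """Tanimoto similarity between one binary fingerprint and a batch."""
    intersection = (database & query).sum(axis=1)
    union = database.sum(axis=1) + query.sum() - intersection
    return intersection / union

query = np.array([1, 1, 0, 1], dtype=np.int64)
db = np.array([[1, 1, 0, 1],
               [1, 0, 0, 1],
               [0, 0, 1, 0]], dtype=np.int64)
sims = tanimoto(query, db)  # identical, partial overlap, disjoint
```

The same `argsort` ranking shown above then applies to `sims` unchanged.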

Scikit-learn Pipeline Integration


```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Create end-to-end pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```

Comparing Multiple Featurizers


```python
featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'Descriptors': RDKitDescriptors2D(),
    'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")
}

results = {}
for name, feat in featurizers.items():
    transformer = MoleculeTransformer(feat, n_jobs=-1)
    X = transformer(smiles)
    # Evaluate with your ML model
    score = evaluate_model(X, y)
    results[name] = score
```

Discovering Available Featurizers


Use the ModelStore to explore all available featurizers:

```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")

# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")

# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage()  # Display usage examples

# Load model
transformer = store.load("ChemBERTa-77M-MLM")
```

Advanced Features


Custom Preprocessing


```python
class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing pipeline"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        mol = dm.standardize_mol(mol)
        mol = dm.remove_salts(mol)
        return mol

transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
```

Batch Processing Large Datasets


```python
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    """Process large datasets in chunks to manage memory"""
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
    return np.vstack(all_features)
```

Caching Expensive Embeddings


```python
import pickle

cache_file = "embeddings_cache.pkl"
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

try:
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
except FileNotFoundError:
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
```
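A pitfall of a fixed cache filename is silently reusing embeddings computed for a different molecule list or model. Deriving the filename from a hash of the inputs avoids that; a standard-library sketch (`cache_key` is a hypothetical helper, not a molfeat feature):

```python
import hashlib

def cache_key(smiles_list, model_name):
    """Stable cache filename derived from the molecule list and model name."""
    payload = model_name + "\n" + "\n".join(smiles_list)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"embeddings_{digest}.pkl"

key_a = cache_key(["CCO", "CC(=O)O"], "ChemBERTa-77M-MLM")
key_b = cache_key(["CCO", "CC(=O)O"], "ChemBERTa-77M-MLM")
key_c = cache_key(["CCO"], "ChemBERTa-77M-MLM")
print(key_a == key_b)  # True  (same inputs, same key)
print(key_a == key_c)  # False (different molecule list)
```

The same try/except pattern as above then works per key, and stale caches are simply never hit.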

Performance Tips


  1. Use parallelization: Set `n_jobs=-1` to utilize all CPU cores
  2. Batch processing: Process multiple molecules at once instead of loops
  3. Choose appropriate featurizers: Fingerprints are faster than deep learning models
  4. Cache pretrained models: Leverage built-in caching for repeated use
  5. Use float32: Set `dtype=np.float32` when precision allows
  6. Handle errors efficiently: Use `ignore_errors=True` for large datasets
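Tip 5 halves memory: float64 stores 8 bytes per value versus 4 for float32, which matters for large fingerprint matrices. A quick numpy check:

```python
import numpy as np

X = np.zeros((10_000, 2048), dtype=np.float64)  # ~156 MiB
X32 = X.astype(np.float32)                      # ~78 MiB
print(X.nbytes // X32.nbytes)  # 2
```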

Common Featurizers Reference


Quick reference for frequently used featurizers:

| Featurizer | Type | Dimensions | Speed | Use Case |
|---|---|---|---|---|
| `ecfp` | Fingerprint | 2048 | Fast | General purpose |
| `maccs` | Fingerprint | 167 | Very fast | Scaffold similarity |
| `desc2D` | Descriptors | 200+ | Fast | Interpretable models |
| `mordred` | Descriptors | 1800+ | Medium | Comprehensive features |
| `map4` | Fingerprint | 1024 | Fast | Large-scale screening |
| `ChemBERTa-77M-MLM` | Deep learning | 768 | Slow* | Transfer learning |
| `gin-supervised-masking` | GNN | Variable | Slow* | Graph-based models |

*First run is slow; subsequent runs benefit from caching

Resources


This skill includes comprehensive reference documentation:

references/api_reference.md


Complete API documentation covering:
  • `molfeat.calc` - All calculator classes and parameters
  • `molfeat.trans` - Transformer classes and methods
  • `molfeat.store` - ModelStore usage
  • Common patterns and integration examples
  • Performance optimization tips
When to load: Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.

references/available_featurizers.md


Comprehensive catalog of all 100+ featurizers organized by category:
  • Transformer-based language models (ChemBERTa, ChemGPT)
  • Graph neural networks (GIN, Graphormer)
  • Molecular descriptors (RDKit, Mordred)
  • Fingerprints (ECFP, MACCS, MAP4, and 15+ others)
  • Pharmacophore descriptors (CATS, Gobbi)
  • Shape descriptors (USR, ElectroShape)
  • Scaffold-based descriptors
When to load: Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.
Search tip: Use grep to find specific featurizer types:

```bash
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
```

references/examples.md


Practical code examples for common scenarios:
  • Installation and quick start
  • Calculator and transformer examples
  • Pretrained model usage
  • Scikit-learn and PyTorch integration
  • Virtual screening workflows
  • QSAR model building
  • Similarity searching
  • Troubleshooting and best practices
When to load: Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.

Troubleshooting


Invalid Molecules


Enable error handling to skip invalid SMILES:

```python
transformer = MoleculeTransformer(
    calc,
    ignore_errors=True,
    verbose=True
)
```

Memory Issues with Large Datasets


Process in chunks or use streaming approaches for datasets > 100K molecules.

Pretrained Model Dependencies


Some models require additional packages. Install specific extras:

```bash
uv pip install "molfeat[transformer]"  # For ChemBERTa/ChemGPT
uv pip install "molfeat[dgl]"          # For GIN models
```

Reproducibility


Save exact configurations and document versions:

```python
transformer.to_state_yaml_file("config.yml")
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```

Additional Resources
