datamol

Datamol Cheminformatics Skill


Overview


Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
Key capabilities:
  • Molecular format conversion (SMILES, SELFIES, InChI)
  • Structure standardization and sanitization
  • Molecular descriptors and fingerprints
  • 3D conformer generation and analysis
  • Clustering and diversity selection
  • Scaffold and fragment analysis
  • Chemical reaction application
  • Visualization and alignment
  • Batch processing with parallelization
  • Cloud storage support via fsspec

Installation and Setup


Guide users to install datamol:
```bash
uv pip install datamol
```
Import convention:
```python
import datamol as dm
```

Core Workflows


1. Basic Molecule Handling

Creating molecules from SMILES:
```python
import datamol as dm

# Single molecule
mol = dm.to_mol("CCO")  # Ethanol

# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]

# Error handling
mol = dm.to_mol("invalid_smiles")  # Returns None
if mol is None:
    print("Failed to parse SMILES")
```

**Converting molecules to SMILES**:
```python
# Canonical SMILES
smiles = dm.to_smiles(mol)

# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)

# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```

**Standardization and sanitization** (always recommend for user-provided molecules):
```python
# Sanitize molecule
mol = dm.sanitize_mol(mol)

# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True
)

# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```

2. Reading and Writing Molecular Files

Refer to `references/io_module.md` for comprehensive I/O documentation.
Reading files:
```python
# SDF files (most common in chemistry)
# as_df=True returns a DataFrame; without it read_sdf returns a list of molecules
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column='mol')

# SMILES files (returns a list of molecules, one per line of the file)
mols = dm.read_smi("molecules.smi")

# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")

# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")

# Universal reader (auto-detects format)
df = dm.open_df("file.sdf")  # Works with .sdf, .csv, .xlsx, .parquet, .json
```

**Writing files**:
```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")

# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")

# Save as SMILES file
dm.to_smi(mols, "output.smi")

# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
```

**Remote file support** (S3, GCS, HTTP):
```python
# Read from cloud storage
df = dm.read_sdf("s3://bucket/compounds.sdf", as_df=True)
df = dm.read_csv("https://example.com/data.csv")

# Write to cloud storage
dm.to_sdf(mols, "s3://bucket/output.sdf")
```

3. Molecular Descriptors and Properties

Refer to `references/descriptors_viz.md` for detailed descriptor documentation.
Computing descriptors for a single molecule:
```python
# Get standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns a dict such as {'mw': 46.07, 'n_lipinski_hbd': 1, 'n_lipinski_hba': 1,
#                         'tpsa': 20.23, ...}
# Key names can differ between datamol versions; check descriptors.keys().
```

**Batch descriptor computation** (recommended for datasets):
```python
# Compute for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,      # Use all CPU cores
    progress=True   # Show progress bar
)
```

**Specific descriptors**:
```python
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)

# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)

# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
```

**Drug-likeness filtering (Lipinski's Rule of Five)**:
```python
# Filter compounds
def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    # Key names follow recent datamol releases; adjust them if desc.keys() differs
    return (
        desc['mw'] <= 500 and
        desc['clogp'] <= 5 and
        desc['n_lipinski_hbd'] <= 5 and
        desc['n_lipinski_hba'] <= 10
    )

druglike_mols = [mol for mol in mols if is_druglike(mol)]
```
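Lipinski's original formulation tolerates a single violation, whereas the filter above rejects a compound as soon as one limit is exceeded. The sketch below counts violations instead. It takes plain numbers so that it runs without datamol; the helper names and the example descriptor values are hypothetical and only illustrate the rule.

```python
def count_lipinski_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count how many Rule of Five limits a compound exceeds."""
    checks = [mw > 500, logp > 5, hbd > 5, hba > 10]
    return sum(checks)


def passes_rule_of_five(mw, logp, hbd, hba, max_violations: int = 1) -> bool:
    """Accept a compound with at most `max_violations` exceeded limits."""
    return count_lipinski_violations(mw, logp, hbd, hba) <= max_violations


# Hypothetical descriptor values, not computed from real molecules.
small = dict(mw=180.2, logp=1.2, hbd=1, hba=4)
heavy = dict(mw=650.0, logp=6.1, hbd=2, hba=9)

print(passes_rule_of_five(**small))  # True: no limit exceeded
print(passes_rule_of_five(**heavy))  # False: mw and logp both exceed their limits
```

Setting `max_violations=0` reproduces the strict behaviour of the filter above.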

4. Molecular Fingerprints and Similarity

Generating fingerprints:
```python
# ECFP (Extended Connectivity Fingerprint, default)
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)

# Other fingerprint types
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
```

**Similarity calculations**:
```python
# Pairwise distances within a set
distance_matrix = dm.pdist(mols, n_jobs=-1)

# Distances between two sets
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)

# Find most similar molecules
# squareform=True requests the full N x N matrix, so no scipy conversion is needed
dist_matrix = dm.pdist(mols, squareform=True)

# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
```
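The relation between Tanimoto similarity and distance can be checked by hand on two small fingerprints. The sketch below represents each fingerprint as the set of its "on" bit positions; the function names and bit sets are made up for illustration and do not use datamol.

```python
def tanimoto_similarity(bits_a: set, bits_b: set) -> float:
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def tanimoto_distance(bits_a: set, bits_b: set) -> float:
    """Distance as used in the text: 1 - similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Hypothetical on-bit positions of two fingerprints.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}

# 2 shared bits out of 5 distinct bits
print(f"{tanimoto_similarity(fp_a, fp_b):.2f}")  # 0.40
print(f"{tanimoto_distance(fp_a, fp_b):.2f}")    # 0.60
```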

5. Clustering and Diversity Selection

Refer to `references/core_api.md` for clustering details.
Butina clustering:
```python
# Cluster molecules by structural similarity
# cluster_mols returns the index clusters together with the matching molecule clusters
clusters, clustered_mols = dm.cluster_mols(
    mols,
    cutoff=0.2,  # Tanimoto distance threshold (0=identical, 1=completely different)
    n_jobs=-1    # Parallel processing
)

# Each cluster is a list of molecule indices
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}: {len(cluster)} molecules")
    cluster_mols = [mols[idx] for idx in cluster]
```

**Important**: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+.

**Diversity selection**:
```python
# Pick diverse subset (returns the picked indices and the picked molecules)
diverse_idx, diverse_mols = dm.pick_diverse(
    mols,
    npick=100  # Select 100 diverse molecules
)

# Pick cluster centroids (same return shape)
centroid_idx, centroids = dm.pick_centroids(
    mols,
    npick=50  # Select 50 representative molecules
)
```
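The size limit mentioned for Butina clustering follows from the quadratic growth of the pairwise distance matrix. The sketch below estimates the number of distances and a lower bound on their storage, assuming 8 bytes per value (a float64 array); the helper names are hypothetical, and real memory use will be higher once neighbour lists and Python object overhead are included.

```python
def n_pairwise_distances(n_mols: int) -> int:
    """Number of unique unordered pairs among n_mols molecules."""
    return n_mols * (n_mols - 1) // 2


def condensed_matrix_bytes(n_mols: int, bytes_per_value: int = 8) -> int:
    """Lower bound on storage for the condensed distance matrix."""
    return n_pairwise_distances(n_mols) * bytes_per_value


for n in (1_000, 10_000, 100_000):
    pairs = n_pairwise_distances(n)
    megabytes = condensed_matrix_bytes(n) / 1e6
    print(f"{n:>7} molecules -> {pairs:>13,} distances, ~{megabytes:,.0f} MB")
```

Under these assumptions, 1,000 molecules need about 4 MB, 10,000 need about 400 MB, and 100,000 need about 40 GB for the distances alone, and the time to compute them grows at the same rate.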

6. Scaffold Analysis

Refer to `references/fragments_scaffolds.md` for complete scaffold documentation.
Extracting Murcko scaffolds:
```python
# Get Bemis-Murcko scaffold (core structure)
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
```

**Scaffold-based analysis**:
```python
# Group compounds by scaffold
from collections import Counter

scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)

# Create scaffold-to-molecules mapping
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
    if scaf_smi not in scaffold_groups:
        scaffold_groups[scaf_smi] = []
    scaffold_groups[scaf_smi].append(mol)
```

**Scaffold-based train/test splitting** (for ML):
```python
# Ensure train and test sets have different scaffolds
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
    if scaf not in scaffold_to_mols:
        scaffold_to_mols[scaf] = []
    scaffold_to_mols[scaf].append(mol)

# Split scaffolds into train/test
import random

scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]

# Get molecules for each split
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
```
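The split above divides the scaffolds 80/20, so the molecule counts can drift far from that ratio when a few scaffolds dominate the dataset. One possible variant, sketched below, fills the training set with whole scaffold groups, largest first, until a target fraction of molecules is reached. It operates on scaffold SMILES strings only and returns molecule indices; the function name, the seed handling, and the example scaffolds are assumptions for illustration, not a datamol API.

```python
import random
from collections import defaultdict


def scaffold_split(scaffold_smiles, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) such that no scaffold appears in both sets."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffold_smiles):
        groups[scaf].append(idx)

    # Shuffle for reproducible tie-breaking, then place the largest groups first
    # (list.sort is stable, so equal-sized groups keep their shuffled order).
    buckets = list(groups.values())
    random.Random(seed).shuffle(buckets)
    buckets.sort(key=len, reverse=True)

    target = train_fraction * len(scaffold_smiles)
    train_idx, test_idx = [], []
    for bucket in buckets:
        if len(train_idx) + len(bucket) <= target:
            train_idx.extend(bucket)
        else:
            test_idx.extend(bucket)
    return train_idx, test_idx


# Hypothetical scaffold SMILES, one entry per molecule.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffolds)
print(len(train_idx), len(test_idx))  # 8 2
```

The returned indices can be used to select rows from the molecule list, the descriptor table, and the target vector consistently.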

7. Molecular Fragmentation

Refer to `references/fragments_scaffolds.md` for fragmentation details.
BRICS fragmentation (16 bond types):
```python
# Fragment molecule
fragments = dm.fragment.brics(mol)
# Returns a list of fragment molecules; convert them to SMILES to get
# strings with attachment points like '[1*]CCN'
fragment_smiles = {dm.to_smiles(frag) for frag in fragments}
```

**RECAP fragmentation** (11 bond types):
```python
fragments = dm.fragment.recap(mol)
```

Fragment analysis:
```python
# Find common fragments across compound library
from collections import Counter

all_fragments = []
for mol in mols:
    frags = dm.fragment.brics(mol)
    # Count fragments by SMILES so that identical structures are merged
    all_fragments.extend(dm.to_smiles(f) for f in frags)

fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)

# Fragment-based scoring (reference_fragments is a set of fragment SMILES)
def fragment_score(mol, reference_fragments):
    mol_frags = {dm.to_smiles(f) for f in dm.fragment.brics(mol)}
    overlap = mol_frags.intersection(reference_fragments)
    return len(overlap) / len(mol_frags) if mol_frags else 0
```

8. 3D Conformer Generation

Refer to `references/conformers_module.md` for detailed conformer documentation.
Generating conformers:
```python
# Generate 3D conformers
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,            # Number to generate (auto if None)
    rms_cutoff=0.5,        # Filter similar conformers (Ångströms)
    minimize_energy=True,  # Minimize with UFF force field
    method='ETKDGv3'       # Embedding method (recommended)
)

# Access conformers
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0)    # Get first conformer
positions = conf.GetPositions()  # Nx3 array of atom coordinates
```

**Conformer clustering**:
```python
# Cluster conformers by RMSD
clusters = dm.conformers.cluster(
    mol_3d,
    rms_cutoff=1.0,
    centroids=False
)

# Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
```

**SASA calculation**:
```python
# Calculate solvent accessible surface area
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)

# Access SASA from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```
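The `rms_cutoff` values above are thresholds on the root-mean-square deviation (RMSD) between conformers. As a rough illustration of the quantity, the sketch below computes the RMSD between two coordinate lists with matching atom order. It deliberately skips the superposition step that conformer tools normally perform first, so it is an upper bound on the aligned RMSD; the function name and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """RMSD between two equally ordered lists of (x, y, z) points, without alignment."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical three-atom conformers (coordinates in Ångströms).
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]  # third atom moved 1 Å in z

print(f"{rmsd(conf_a, conf_b):.3f}")  # 0.577, i.e. sqrt(1/3)
```

With `rms_cutoff=0.5`, these two conformers would count as distinct under this unaligned measure.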

9. Visualization

Refer to `references/descriptors_viz.md` for visualization documentation.
Basic molecule grid:
```python
# Visualize molecules
dm.viz.to_image(
    mols[:20],
    legends=[dm.to_smiles(m) for m in mols[:20]],
    n_cols=5,
    mol_size=(300, 300)
)

# Save to file
dm.viz.to_image(mols, outfile="molecules.png")

# SVG for publications
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
```

**Aligned visualization** (for SAR analysis):
```python
# Align molecules by common substructure
dm.viz.to_image(
    similar_mols,
    align=True,  # Enable MCS alignment
    legends=activity_labels,
    n_cols=4
)
```

**Highlighting substructures**:
```python
# Highlight specific atoms and bonds
dm.viz.to_image(
    mol,
    highlight_atom=[0, 1, 2, 3],  # Atom indices
    highlight_bond=[0, 1, 2]      # Bond indices
)
```

**Conformer visualization**:
```python
# Display multiple conformers
dm.viz.conformers(
    mol_3d,
    n_confs=10,
    align_conf=True,
    n_cols=3
)
```

10. Chemical Reactions

Refer to `references/reactions_data.md` for reactions documentation.
Applying reactions:
```python
from rdkit.Chem import rdChemReactions

# Define reaction from SMARTS (carboxylic acid to acyl chloride)
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)

# Apply to molecule
reactant = dm.to_mol("CC(=O)O")  # Acetic acid
product = dm.reactions.apply_reaction(
    rxn,
    (reactant,),
    sanitize=True
)

# Convert to SMILES
product_smiles = dm.to_smiles(product)
```

**Batch reaction application**:
```python
# Apply reaction to library
products = []
for mol in reactant_mols:
    try:
        prod = dm.reactions.apply_reaction(rxn, (mol,))
        if prod is not None:
            products.append(prod)
    except Exception as e:
        print(f"Reaction failed: {e}")
```

Parallelization

Datamol includes built-in parallelization for many operations. Use the `n_jobs` parameter:
  • `n_jobs=1`: Sequential (no parallelization)
  • `n_jobs=-1`: Use all available CPU cores
  • `n_jobs=4`: Use 4 cores
Functions supporting parallelization:
  • `dm.read_sdf(..., n_jobs=-1)`
  • `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)`
  • `dm.cluster_mols(..., n_jobs=-1)`
  • `dm.pdist(..., n_jobs=-1)`
  • `dm.conformers.sasa(..., n_jobs=-1)`
Progress bars: Many batch operations support the `progress=True` parameter.
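The `n_jobs` values listed above follow the convention used by joblib, where negative values count back from the number of available cores. The sketch below shows how such a value could be turned into a worker count. It is a hypothetical helper written for illustration under that assumption, not datamol's own implementation.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None) -> int:
    """Translate an n_jobs setting into a concrete number of workers."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        return 1  # treat "not set" as sequential
    if n_jobs == 0:
        raise ValueError("n_jobs=0 has no meaning")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all cores but one, never fewer than one worker
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


# Example on a machine assumed to have 8 cores.
for setting in (1, 4, -1, -2):
    print(setting, "->", resolve_n_jobs(setting, n_cpus=8))
```

On the assumed 8-core machine this maps 1, 4, -1, and -2 to 1, 4, 8, and 7 workers respectively.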

Common Workflows and Patterns


Complete Pipeline: Data Loading → Filtering → Analysis

```python
import datamol as dm
import pandas as pd

# 1. Load molecules (as_df=True returns a DataFrame with a 'mol' column)
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column="mol")

# 2. Standardize
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
# Remove failed molecules and reset the index so it lines up with desc_df below
df = df[df['mol'].notna()].reset_index(drop=True)

# 3. Compute descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df['mol'].tolist(),
    n_jobs=-1,
    progress=True
)

# 4. Filter by drug-likeness
# Column names follow recent datamol releases; adjust them if desc_df.columns differs
druglike = (
    (desc_df['mw'] <= 500) &
    (desc_df['clogp'] <= 5) &
    (desc_df['n_lipinski_hbd'] <= 5) &
    (desc_df['n_lipinski_hba'] <= 10)
)
filtered_df = df[druglike]

# 5. Cluster and select diverse subset (pick_diverse returns indices and molecules)
_, diverse_mols = dm.pick_diverse(
    filtered_df['mol'].tolist(),
    npick=100
)

# 6. Visualize results
dm.viz.to_image(
    diverse_mols,
    legends=[dm.to_smiles(m) for m in diverse_mols],
    outfile="diverse_compounds.png",
    n_cols=10
)
```

Structure-Activity Relationship (SAR) Analysis

```python
import pandas as pd

# Group by scaffold
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Create DataFrame with activities
sar_df = pd.DataFrame({
    'mol': mols,
    'scaffold': scaffold_smiles,
    'activity': activities  # User-provided activity data
})

# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
    if len(group) >= 3:  # Need multiple examples
        print(f"\nScaffold: {scaffold}")
        print(f"Count: {len(group)}")
        print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")

        # Visualize with activities as legends
        dm.viz.to_image(
            group['mol'].tolist(),
            legends=[f"Activity: {act:.2f}" for act in group['activity']],
            align=True  # Align by common substructure
        )
```

Virtual Screening Pipeline

```python
import numpy as np

# 1. Generate fingerprints for query and library
#    (optional here: dm.cdist in step 2 works on the molecules directly)
query_fps = [dm.to_fp(mol) for mol in query_actives]
library_fps = [dm.to_fp(mol) for mol in library_mols]

# 2. Calculate similarities
#    distances has shape (len(query_actives), len(library_mols))
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)

# 3. Find closest matches (min distance to any query)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances  # Convert distance to similarity

# 4. Rank and select top hits
top_indices = np.argsort(similarities)[::-1][:100]  # Top 100
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]

# 5. Visualize hits
dm.viz.to_image(
    top_hits[:20],
    legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
    outfile="screening_hits.png"
)
```

Reference Documentation

For detailed API documentation, consult these reference files:
  • `references/core_api.md`: Core namespace functions (conversions, standardization, fingerprints, clustering)
  • `references/io_module.md`: File I/O operations (read/write SDF, CSV, Excel, remote files)
  • `references/conformers_module.md`: 3D conformer generation, clustering, SASA calculations
  • `references/descriptors_viz.md`: Molecular descriptors and visualization functions
  • `references/fragments_scaffolds.md`: Scaffold extraction, BRICS/RECAP fragmentation
  • `references/reactions_data.md`: Chemical reactions and toy datasets

Best Practices

  1. Always standardize molecules from external sources:
    ```python
    mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
    ```
  2. Check for None values after molecule parsing:
    ```python
    mol = dm.to_mol(smiles)
    if mol is None:
        pass  # Handle invalid SMILES (skip, log, or raise)
    ```
  3. Use parallel processing for large datasets:
    ```python
    # dm.operation is a placeholder for any datamol function that accepts n_jobs
    result = dm.operation(..., n_jobs=-1, progress=True)
    ```
  4. Leverage fsspec for cloud storage:
    ```python
    df = dm.read_sdf("s3://bucket/compounds.sdf")
    ```
  5. Use appropriate fingerprints for similarity:
    • ECFP (Morgan): General purpose, structural similarity
    • MACCS: Fast, smaller feature space
    • Atom pairs: Considers atom pairs and distances
  6. Consider scale limitations:
    • Butina clustering: ~1,000 molecules (full distance matrix)
    • For larger datasets: Use diversity selection or hierarchical methods
  7. Scaffold splitting for ML: Ensure proper train/test separation by scaffold
  8. Align molecules when visualizing SAR series

Error Handling

```python
# Safe molecule creation
def safe_to_mol(smiles):
    try:
        mol = dm.to_mol(smiles)
        if mol is not None:
            mol = dm.standardize_mol(mol)
        return mol
    except Exception as e:
        print(f"Failed to process {smiles}: {e}")
        return None

# Safe batch processing
valid_mols = []
for smiles in smiles_list:
    mol = safe_to_mol(smiles)
    if mol is not None:
        valid_mols.append(mol)
```
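The batch loop above silently drops inputs that fail, which makes it hard to report which records were lost. A possible extension, sketched below, keeps the original position and a reason for every failure. The parser is passed in as an argument so that the example runs without datamol; `parse_batch` and `toy_parser` are hypothetical names, and in practice `safe_to_mol` or `dm.to_mol` would take the place of the stand-in parser.

```python
def parse_batch(smiles_list, parser):
    """Apply parser to each SMILES; return (valid, failed) with original indices."""
    valid, failed = [], []
    for idx, smi in enumerate(smiles_list):
        try:
            mol = parser(smi)
        except Exception as exc:
            failed.append((idx, smi, str(exc)))
            continue
        if mol is None:
            failed.append((idx, smi, "parser returned None"))
        else:
            valid.append((idx, mol))
    return valid, failed


# Stand-in parser for the demo: rejects '?' and raises on empty strings.
def toy_parser(smi):
    if not smi:
        raise ValueError("empty string")
    return None if "?" in smi else smi.upper()


valid, failed = parse_batch(["CCO", "C?C", ""], toy_parser)
print(valid)   # [(0, 'CCO')]
print(failed)  # [(1, 'C?C', 'parser returned None'), (2, '', 'empty string')]
```

Keeping the index makes it straightforward to drop the matching rows from any activity or metadata table.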

Integration with Machine Learning

```python
import numpy as np

# Feature generation
X = np.array([dm.to_fp(mol) for mol in mols])

# Or descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values

# Train model (y_target is the user-provided target vector)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, y_target)

# Predict (X_test must be featurized the same way as X)
predictions = model.predict(X_test)
```

Troubleshooting

Issue: Molecule parsing fails
  • Solution: Use `dm.standardize_smiles()` first or try `dm.fix_mol()`
Issue: Memory errors with clustering
  • Solution: Use `dm.pick_diverse()` instead of full clustering for large sets
Issue: Slow conformer generation
  • Solution: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers
Issue: Remote file access fails
  • Solution: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)

Additional Resources
