Datamol Cheminformatics Skill
Overview
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
Key capabilities:
- Molecular format conversion (SMILES, SELFIES, InChI)
- Structure standardization and sanitization
- Molecular descriptors and fingerprints
- 3D conformer generation and analysis
- Clustering and diversity selection
- Scaffold and fragment analysis
- Chemical reaction application
- Visualization and alignment
- Batch processing with parallelization
- Cloud storage support via fsspec
Installation and Setup
Guide users to install datamol:
```bash
uv pip install datamol
```
Import convention:
```python
import datamol as dm
```
Core Workflows
1. Basic Molecule Handling
**Creating molecules from SMILES**:
```python
import datamol as dm

# Single molecule
mol = dm.to_mol("CCO")  # Ethanol

# From list of SMILES
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
mols = [dm.to_mol(smi) for smi in smiles_list]

# Error handling
mol = dm.to_mol("invalid_smiles")  # Returns None
if mol is None:
    print("Failed to parse SMILES")
```
**Converting molecules to SMILES**:
```python
# Canonical SMILES
smiles = dm.to_smiles(mol)

# Isomeric SMILES (includes stereochemistry)
smiles = dm.to_smiles(mol, isomeric=True)

# Other formats
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)
```
**Standardization and sanitization** (always recommend for user-provided molecules):
```python
# Sanitize molecule
mol = dm.sanitize_mol(mol)

# Full standardization (recommended for datasets)
mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True
)

# For SMILES strings directly
clean_smiles = dm.standardize_smiles(smiles)
```
2. Reading and Writing Molecular Files
Refer to `references/io_module.md` for comprehensive I/O documentation.
**Reading files**:
```python
# SDF files (most common in chemistry)
# as_df=True returns a DataFrame; without it read_sdf returns a list of molecules
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column='mol')

# SMILES files
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')

# CSV with SMILES column
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")

# Excel files
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")

# Universal reader (auto-detects format)
df = dm.open_df("file.sdf")  # Works with .sdf, .csv, .xlsx, .parquet, .json
```
**Writing files**:
```python
# Save as SDF
dm.to_sdf(mols, "output.sdf")
# Or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")

# Save as SMILES file
dm.to_smi(mols, "output.smi")

# Excel with rendered molecule images
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
```
**Remote file support** (S3, GCS, HTTP):
```python
# Read from cloud storage
df = dm.read_sdf("s3://bucket/compounds.sdf", as_df=True)
df = dm.read_csv("https://example.com/data.csv")

# Write to cloud storage
dm.to_sdf(mols, "s3://bucket/output.sdf")
```
3. Molecular Descriptors and Properties
Refer to `references/descriptors_viz.md` for detailed descriptor documentation.
**Computing descriptors for a single molecule**:
```python
# Get standard descriptor set
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
#           'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
```
**Batch descriptor computation** (recommended for datasets):
```python
# Compute for all molecules in parallel
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,      # Use all CPU cores
    progress=True   # Show progress bar
)
```
**Specific descriptors**:
```python
# Aromaticity
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)

# Stereochemistry
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)

# Flexibility
n_rigid = dm.descriptors.n_rigid_bonds(mol)
```
**Drug-likeness filtering (Lipinski's Rule of Five)**:
```python
# Filter compounds
def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    return (
        desc['mw'] <= 500 and
        desc['logp'] <= 5 and
        desc['hbd'] <= 5 and
        desc['hba'] <= 10
    )

druglike_mols = [mol for mol in mols if is_druglike(mol)]
```
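The all-or-nothing filter above hides which threshold a compound fails. The following is a minimal, dependency-free sketch that reports the violated rules instead. It assumes descriptor dictionaries keyed `mw`, `logp`, `hbd`, and `hba` as in the example above; `lipinski_violations` is a hypothetical helper and the descriptor values are made up for illustration.

```python
# Upper limits from the Rule of Five filter above.
RULE_OF_FIVE = {"mw": 500, "logp": 5, "hbd": 5, "hba": 10}


def lipinski_violations(desc):
    """Return the names of the descriptors that exceed their limit."""
    return [name for name, limit in RULE_OF_FIVE.items() if desc[name] > limit]


# Illustrative descriptor values (not computed by datamol).
small_polar = {"mw": 46.07, "logp": -0.03, "hbd": 1, "hba": 1}
large_lipophilic = {"mw": 612.4, "logp": 6.1, "hbd": 2, "hba": 8}

for name, desc in [("small_polar", small_polar), ("large_lipophilic", large_lipophilic)]:
    violations = lipinski_violations(desc)
    print(f"{name}: {len(violations)} violation(s) {violations}")
```

Returning the list of violations makes it easy to relax the filter, for example by accepting compounds with at most one violation, which is how the rule is often applied in practice.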
4. Molecular Fingerprints and Similarity
**Generating fingerprints**:
```python
# ECFP (Extended Connectivity Fingerprint, default)
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)

# Other fingerprint types
fp_maccs = dm.to_fp(mol, fp_type='maccs')
fp_topological = dm.to_fp(mol, fp_type='topological')
fp_atompair = dm.to_fp(mol, fp_type='atompair')
```
**Similarity calculations**:
```python
# Pairwise distances within a set
distance_matrix = dm.pdist(mols, n_jobs=-1)

# Distances between two sets
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)

# Find most similar molecules
from scipy.spatial.distance import squareform
dist_matrix = squareform(dm.pdist(mols))
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
```
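To make the relation "Tanimoto distance = 1 - Tanimoto similarity" concrete, here is a small dependency-free sketch. It represents each fingerprint as a set of "on" bit indices, which is an assumption made for illustration and not how datamol stores fingerprints; the function names are hypothetical.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity: shared on-bits divided by total on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)


def tanimoto_distance(bits_a, bits_b):
    """Distance is the complement of similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 13, 42, 77}

similarity = tanimoto_similarity(fp_a, fp_b)  # 3 shared bits out of 6 distinct bits
distance = tanimoto_distance(fp_a, fp_b)
print(f"similarity={similarity:.2f}, distance={distance:.2f}")
```

Identical fingerprints give a distance of 0 and fingerprints with no shared bits give a distance of 1, which matches the scale used by the clustering cutoff later in this document.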
5. Clustering and Diversity Selection
Refer to `references/core_api.md` for clustering details.
**Butina clustering**:
```python
# Cluster molecules by structural similarity
clusters = dm.cluster_mols(
    mols,
    cutoff=0.2,  # Tanimoto distance threshold (0=identical, 1=completely different)
    n_jobs=-1    # Parallel processing
)

# Each cluster is a list of molecule indices
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}: {len(cluster)} molecules")
    cluster_mols = [mols[idx] for idx in cluster]
```
**Important**: Butina clustering builds a full distance matrix, so it suits roughly 1,000 molecules but not 10,000 or more.
**Diversity selection**:
```python
# Pick diverse subset
diverse_mols = dm.pick_diverse(
    mols,
    npick=100  # Select 100 diverse molecules
)

# Pick cluster centroids
centroids = dm.pick_centroids(
    mols,
    npick=50  # Select 50 representative molecules
)
```
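The scale limit noted for Butina clustering follows from the quadratic growth of the pairwise distance matrix. The sketch below estimates that growth with plain arithmetic. It assumes one 8-byte float per distinct pair, which is an illustrative assumption and not a statement about how RDKit or datamol store distances internally.

```python
def n_pairwise_distances(n_mols):
    """Number of distinct unordered pairs among n_mols molecules."""
    return n_mols * (n_mols - 1) // 2


def condensed_matrix_megabytes(n_mols, bytes_per_value=8):
    """Approximate size of the condensed distance matrix in megabytes."""
    return n_pairwise_distances(n_mols) * bytes_per_value / 1e6


for n in (1_000, 10_000, 100_000):
    pairs = n_pairwise_distances(n)
    size_mb = condensed_matrix_megabytes(n)
    print(f"{n:>7} molecules: {pairs:>13,} pairs, about {size_mb:,.0f} MB")
```

Each tenfold increase in library size multiplies the number of distances by roughly one hundred, which is why diversity selection is the safer choice for large sets.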
6. Scaffold Analysis
Refer to `references/fragments_scaffolds.md` for complete scaffold documentation.
**Extracting Murcko scaffolds**:
```python
# Get Bemis-Murcko scaffold (core structure)
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
```
**Scaffold-based analysis**:
```python
# Group compounds by scaffold
from collections import Counter

scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)

# Create scaffold-to-molecules mapping
scaffold_groups = {}
for mol, scaf_smi in zip(mols, scaffold_smiles):
    if scaf_smi not in scaffold_groups:
        scaffold_groups[scaf_smi] = []
    scaffold_groups[scaf_smi].append(mol)
```
**Scaffold-based train/test splitting** (for ML):
```python
# Ensure train and test sets have different scaffolds
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
    if scaf not in scaffold_to_mols:
        scaffold_to_mols[scaf] = []
    scaffold_to_mols[scaf].append(mol)

# Split scaffolds into train/test
import random
scaffolds = list(scaffold_to_mols.keys())
random.shuffle(scaffolds)
split_idx = int(0.8 * len(scaffolds))
train_scaffolds = scaffolds[:split_idx]
test_scaffolds = scaffolds[split_idx:]

# Get molecules for each split
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
```
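The split above depends on the global random state, so it changes between runs. A reproducible variant can be sketched as follows, using only the standard library. It assumes the scaffold strings are already available; the ones below are hand-written placeholders rather than output from `dm.to_scaffold_murcko`, and `scaffold_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def scaffold_split(items, scaffolds, train_fraction=0.8, seed=0):
    """Split items so that no scaffold appears in both train and test."""
    groups = defaultdict(list)
    for item, scaf in zip(items, scaffolds):
        groups[scaf].append(item)

    keys = sorted(groups)              # fixed starting order before shuffling
    random.Random(seed).shuffle(keys)  # local RNG, global state untouched

    split_idx = int(train_fraction * len(keys))
    train_keys, test_keys = keys[:split_idx], keys[split_idx:]
    train = [x for k in train_keys for x in groups[k]]
    test = [x for k in test_keys for x in groups[k]]
    return train, test, set(train_keys), set(test_keys)


# Placeholder molecules and scaffolds for illustration only.
smiles = ["Cc1ccccc1", "CCc1ccccc1", "OC1CCCCC1", "Cc1ccncc1", "Nc1ccncc1"]
scaffold_smiles = ["c1ccccc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1", "c1ccncc1"]

train, test, train_scafs, test_scafs = scaffold_split(smiles, scaffold_smiles, seed=42)
print(f"train={len(train)} test={len(test)} overlap={train_scafs & test_scafs}")
```

Note that the 80/20 fraction applies to scaffolds, not molecules, so the molecule-level ratio can differ noticeably when a few scaffolds dominate the dataset.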
7. Molecular Fragmentation
Refer to `references/fragments_scaffolds.md` for fragmentation details.
**BRICS fragmentation** (16 bond types):
```python
# Fragment molecule
fragments = dm.fragment.brics(mol)
# Returns: set of fragment SMILES with attachment points like '[1*]CCN'
```
**RECAP fragmentation** (11 bond types):
```python
fragments = dm.fragment.recap(mol)
```
**Fragment analysis**:
```python
# Find common fragments across compound library
from collections import Counter

all_fragments = []
for mol in mols:
    frags = dm.fragment.brics(mol)
    all_fragments.extend(frags)

fragment_counts = Counter(all_fragments)
common_frags = fragment_counts.most_common(20)

# Fragment-based scoring
def fragment_score(mol, reference_fragments):
    mol_frags = dm.fragment.brics(mol)
    overlap = mol_frags.intersection(reference_fragments)
    return len(overlap) / len(mol_frags) if mol_frags else 0
```
8. 3D Conformer Generation
Refer to `references/conformers_module.md` for detailed conformer documentation.
**Generating conformers**:
```python
# Generate 3D conformers
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,            # Number to generate (auto if None)
    rms_cutoff=0.5,        # Filter similar conformers (Ångströms)
    minimize_energy=True,  # Minimize with UFF force field
    method='ETKDGv3'       # Embedding method (recommended)
)

# Access conformers
n_conformers = mol_3d.GetNumConformers()
conf = mol_3d.GetConformer(0)    # Get first conformer
positions = conf.GetPositions()  # Nx3 array of atom coordinates
```
**Conformer clustering**:
```python
# Cluster conformers by RMSD
clusters = dm.conformers.cluster(
    mol_3d,
    rms_cutoff=1.0,
    centroids=False
)

# Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
```
**SASA calculation**:
```python
# Calculate solvent accessible surface area
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)

# Access SASA from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```
9. Visualization
Refer to `references/descriptors_viz.md` for visualization documentation.
**Basic molecule grid**:
```python
# Visualize molecules
dm.viz.to_image(
    mols[:20],
    legends=[dm.to_smiles(m) for m in mols[:20]],
    n_cols=5,
    mol_size=(300, 300)
)

# Save to file
dm.viz.to_image(mols, outfile="molecules.png")

# SVG for publications
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
```
**Aligned visualization** (for SAR analysis):
```python
# Align molecules by common substructure
dm.viz.to_image(
    similar_mols,
    align=True,  # Enable MCS alignment
    legends=activity_labels,
    n_cols=4
)
```
**Highlighting substructures**:
```python
# Highlight specific atoms and bonds
dm.viz.to_image(
    mol,
    highlight_atom=[0, 1, 2, 3],  # Atom indices
    highlight_bond=[0, 1, 2]      # Bond indices
)
```
**Conformer visualization**:
```python
# Display multiple conformers
dm.viz.conformers(
    mol_3d,
    n_confs=10,
    align_conf=True,
    n_cols=3
)
```
10. Chemical Reactions
Refer to `references/reactions_data.md` for reactions documentation.
**Applying reactions**:
```python
from rdkit.Chem import rdChemReactions

# Define reaction from SMARTS (carboxylic acid to acid chloride)
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)

# Apply to molecule
reactant = dm.to_mol("CC(=O)O")  # Acetic acid
product = dm.reactions.apply_reaction(
    rxn,
    (reactant,),
    sanitize=True
)

# Convert to SMILES
product_smiles = dm.to_smiles(product)
```
**Batch reaction application**:
```python
# Apply reaction to library
products = []
for mol in reactant_mols:
    try:
        prod = dm.reactions.apply_reaction(rxn, (mol,))
        if prod is not None:
            products.append(prod)
    except Exception as e:
        print(f"Reaction failed: {e}")
```
Parallelization
Datamol includes built-in parallelization for many operations. Use the `n_jobs` parameter:
- `n_jobs=1`: Sequential (no parallelization)
- `n_jobs=-1`: Use all available CPU cores
- `n_jobs=4`: Use 4 cores
Functions supporting parallelization:
- `dm.read_sdf(..., n_jobs=-1)`
- `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)`
- `dm.cluster_mols(..., n_jobs=-1)`
- `dm.pdist(..., n_jobs=-1)`
- `dm.conformers.sasa(..., n_jobs=-1)`
Progress bars: Many batch operations support the `progress=True` parameter.
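The `n_jobs` values listed above can be read as a mapping from a setting to a worker count. The sketch below illustrates that mapping with the standard library only. `resolve_n_jobs` is a hypothetical helper that follows the widely used joblib-style convention; it is an assumption for illustration and not datamol's internal code.

```python
import os


def resolve_n_jobs(n_jobs):
    """Translate an n_jobs setting into a concrete number of workers."""
    cpu_count = os.cpu_count() or 1
    if n_jobs is None or n_jobs == 1:
        return 1  # sequential execution
    if n_jobs == 0:
        raise ValueError("n_jobs=0 is not a valid setting")
    if n_jobs < 0:
        # -1 means all cores, -2 means all but one, and so on (never below 1).
        return max(1, cpu_count + 1 + n_jobs)
    return n_jobs  # explicit positive worker count


for setting in (1, -1, 4):
    print(f"n_jobs={setting} -> {resolve_n_jobs(setting)} worker(s)")
```

On shared machines an explicit positive value such as `n_jobs=4` is often kinder than `-1`, since it leaves cores free for other users.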
Common Workflows and Patterns
Complete Pipeline: Data Loading → Filtering → Analysis
```python
import datamol as dm
import pandas as pd

# 1. Load molecules (as_df=True returns a DataFrame with a 'mol' column)
df = dm.read_sdf("compounds.sdf", as_df=True, mol_column="mol")

# 2. Standardize
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
# Remove failed molecules; reset the index so it lines up with desc_df below
df = df[df['mol'].notna()].reset_index(drop=True)

# 3. Compute descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df['mol'].tolist(),
    n_jobs=-1,
    progress=True
)

# 4. Filter by drug-likeness
druglike = (
    (desc_df['mw'] <= 500) &
    (desc_df['logp'] <= 5) &
    (desc_df['hbd'] <= 5) &
    (desc_df['hba'] <= 10)
)
filtered_df = df[druglike]

# 5. Cluster and select diverse subset
diverse_mols = dm.pick_diverse(
    filtered_df['mol'].tolist(),
    npick=100
)

# 6. Visualize results
dm.viz.to_image(
    diverse_mols,
    legends=[dm.to_smiles(m) for m in diverse_mols],
    outfile="diverse_compounds.png",
    n_cols=10
)
```
Structure-Activity Relationship (SAR) Analysis
```python
import datamol as dm
import pandas as pd

# Group by scaffold
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Create DataFrame with activities
sar_df = pd.DataFrame({
    'mol': mols,
    'scaffold': scaffold_smiles,
    'activity': activities  # User-provided activity data
})

# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
    if len(group) >= 3:  # Need multiple examples
        print(f"\nScaffold: {scaffold}")
        print(f"Count: {len(group)}")
        print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")

        # Visualize with activities as legends
        dm.viz.to_image(
            group['mol'].tolist(),
            legends=[f"Activity: {act:.2f}" for act in group['activity']],
            align=True  # Align by common substructure
        )
```
Virtual Screening Pipeline
```python
import numpy as np

# 1. Generate fingerprints for query and library
query_fps = [dm.to_fp(mol) for mol in query_actives]
library_fps = [dm.to_fp(mol) for mol in library_mols]

# 2. Calculate similarities (dm.cdist works on the molecules directly)
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)

# 3. Find closest matches (min distance to any query)
min_distances = distances.min(axis=0)
similarities = 1 - min_distances  # Convert distance to similarity

# 4. Rank and select top hits
top_indices = np.argsort(similarities)[::-1][:100]  # Top 100
top_hits = [library_mols[i] for i in top_indices]
top_scores = [similarities[i] for i in top_indices]

# 5. Visualize hits
dm.viz.to_image(
    top_hits[:20],
    legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
    outfile="screening_hits.png"
)
```
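Steps 3 and 4 of the pipeline above can be checked by hand on a tiny distance matrix. The sketch below reproduces the same ranking logic with plain Python lists. It assumes rows are query molecules and columns are library molecules; the distance values are invented for illustration and `rank_hits` is a hypothetical helper.

```python
def rank_hits(distances, top_k):
    """Rank library items by their best (smallest) distance to any query."""
    n_library = len(distances[0])
    # Smallest distance from each library item to any query molecule.
    min_distances = [min(row[col] for row in distances) for col in range(n_library)]
    similarities = [1.0 - d for d in min_distances]
    # Highest similarity first.
    order = sorted(range(n_library), key=lambda col: similarities[col], reverse=True)
    return [(col, similarities[col]) for col in order[:top_k]]


# Two queries (rows) against four library molecules (columns); made-up values.
distances = [
    [0.9, 0.2, 0.5, 0.7],
    [0.4, 0.6, 0.1, 0.8],
]

hits = rank_hits(distances, top_k=2)
for index, score in hits:
    print(f"library index {index}: similarity {score:.2f}")
```

Taking the minimum over queries rewards a library molecule for resembling any single active, which favors recall over consensus across the query set.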
Reference Documentation
For detailed API documentation, consult these reference files:
- `references/core_api.md`: Core namespace functions (conversions, standardization, fingerprints, clustering)
- `references/io_module.md`: File I/O operations (read/write SDF, CSV, Excel, remote files)
- `references/conformers_module.md`: 3D conformer generation, clustering, SASA calculations
- `references/descriptors_viz.md`: Molecular descriptors and visualization functions
- `references/fragments_scaffolds.md`: Scaffold extraction, BRICS/RECAP fragmentation
- `references/reactions_data.md`: Chemical reactions and toy datasets
Best Practices
- Always standardize molecules from external sources:
  ```python
  mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
  ```
- Check for None values after molecule parsing:
  ```python
  mol = dm.to_mol(smiles)
  if mol is None:
      # Handle invalid SMILES
      pass
  ```
- Use parallel processing for large datasets:
  ```python
  result = dm.operation(..., n_jobs=-1, progress=True)
  ```
- Leverage fsspec for cloud storage:
  ```python
  df = dm.read_sdf("s3://bucket/compounds.sdf")
  ```
- Use appropriate fingerprints for similarity:
  - ECFP (Morgan): General purpose, structural similarity
  - MACCS: Fast, smaller feature space
  - Atom pairs: Considers atom pairs and distances
- Consider scale limitations:
  - Butina clustering: ~1,000 molecules (full distance matrix)
  - For larger datasets: Use diversity selection or hierarchical methods
- Scaffold splitting for ML: Ensure proper train/test separation by scaffold
- Align molecules when visualizing SAR series
Error Handling
```python
# Safe molecule creation
def safe_to_mol(smiles):
    try:
        mol = dm.to_mol(smiles)
        if mol is not None:
            mol = dm.standardize_mol(mol)
        return mol
    except Exception as e:
        print(f"Failed to process {smiles}: {e}")
        return None

# Safe batch processing
valid_mols = []
for smiles in smiles_list:
    mol = safe_to_mol(smiles)
    if mol is not None:
        valid_mols.append(mol)
```
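The batch loop above silently drops inputs that fail, which makes later debugging harder. One possible refinement is to keep a record of each failure alongside the valid results. The sketch below shows that pattern with the standard library only. `process_batch` and `toy_parse` are hypothetical names, and `toy_parse` is a stand-in for a function like `safe_to_mol`, not a real SMILES parser.

```python
def process_batch(inputs, parse):
    """Apply parse to each input, collecting results and failure records."""
    valid, failed = [], []
    for index, raw in enumerate(inputs):
        try:
            result = parse(raw)
        except Exception as exc:
            failed.append((index, raw, str(exc)))
            continue
        if result is None:
            failed.append((index, raw, "parser returned None"))
        else:
            valid.append(result)
    return valid, failed


def toy_parse(text):
    """Stand-in parser for this sketch: rejects non-strings and blank input."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    return text.strip() or None


valid, failed = process_batch(["CCO", "   ", 42, "c1ccccc1"], toy_parse)
print(f"{len(valid)} valid, {len(failed)} failed")
for index, raw, reason in failed:
    print(f"  input {index} ({raw!r}): {reason}")
```

Keeping the original index in each failure record makes it straightforward to trace a rejected entry back to its row in the source file.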
Integration with Machine Learning
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature generation
X = np.array([dm.to_fp(mol) for mol in mols])

# Or descriptors
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
X = desc_df.values

# Train model (y_target is user-provided)
model = RandomForestRegressor()
model.fit(X, y_target)

# Predict (X_test is built the same way as X)
predictions = model.predict(X_test)
```
Troubleshooting
Issue: Molecule parsing fails
- Solution: Use `dm.standardize_smiles()` first or try `dm.fix_mol()`
Issue: Memory errors with clustering
- Solution: Use `dm.pick_diverse()` instead of full clustering for large sets
Issue: Slow conformer generation
- Solution: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers
Issue: Remote file access fails
- Solution: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)
Additional Resources
- Datamol Documentation: https://docs.datamol.io/
- RDKit Documentation: https://www.rdkit.org/docs/
- GitHub Repository: https://github.com/datamol-io/datamol