rdkit
RDKit: Python Cheminformatics Library
Summary
RDKit (v2023+) provides comprehensive Python APIs for molecular structure manipulation, property calculation, and chemical informatics. It requires Python 3 and NumPy, offering modular components for molecule parsing, descriptors, fingerprints, substructure search, conformer generation, and reaction processing.
Applicable Scenarios
This skill applies when you need to:
| Task Category | Examples |
|---|---|
| Molecule I/O | Parse SMILES, MOL, SDF, InChI; write structures |
| Property Calculation | Molecular weight, LogP, TPSA, H-bond donors/acceptors |
| Fingerprinting | Morgan (ECFP), MACCS keys, atom pairs, topological |
| Similarity Analysis | Tanimoto, Dice, clustering compounds |
| Substructure Search | SMARTS patterns, functional group detection |
| 3D Conformers | Generate, optimize, align molecular geometries |
| Chemical Reactions | Define and execute transformations |
| Drug-Likeness | Lipinski rules, QED, lead-likeness filters |
| Visualization | 2D depictions, highlighting, grid images |
Module Organization
| Module | Purpose | Reference |
|---|---|---|
| rdkit.Chem | Core molecule parsing, serialization, substructure | |
| rdkit.Chem.Descriptors | Property calculations | |
| rdkit.Chem.rdFingerprintGenerator | Modern fingerprint API | |
| rdkit.DataStructs | Similarity metrics, bulk operations | |
| rdkit.Chem.AllChem | 3D coordinates, reactions, optimization | |
| rdkit.Chem.Draw | Visualization and depiction | |
| SMARTS patterns | Substructure query language | |
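The reaction support listed for `rdkit.Chem.AllChem` can be illustrated with a minimal sketch. It assumes a simple amide coupling written as reaction SMARTS; the reactants and the pattern are illustrative choices, not part of the original text.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Reaction SMARTS: carboxylic acid + amine -> amide.
# Atom map numbers (:1, :2, :3) track atoms from reactants to products.
rxn = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])[OX2H1].[N:3]>>[C:1](=[O:2])[N:3]'
)

acid = Chem.MolFromSmiles('CC(=O)O')   # acetic acid (illustrative reactant)
amine = Chem.MolFromSmiles('CN')       # methylamine (illustrative reactant)

product_smiles = set()
for product_set in rxn.RunReactants((acid, amine)):
    for product in product_set:
        Chem.SanitizeMol(product)      # products are returned unsanitized
        product_smiles.add(Chem.MolToSmiles(product))

print(product_smiles)
```

`RunReactants` returns one tuple of products per way the reactant templates match, so collecting canonical SMILES in a set is a simple way to deduplicate equivalent outcomes.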
Setup
Install via pip or conda:
```bash
# Conda (recommended)
conda install -c conda-forge rdkit

# Pip (the PyPI package is named rdkit; rdkit-pypi is the older, deprecated name)
pip install rdkit
```
Standard imports:
```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw
from rdkit import DataStructs
```
Quick Reference
Parse and Validate Molecules
```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc(O)cc1')
if mol is None:
    print("Invalid SMILES")
```
Compute Properties
```python
from rdkit.Chem import Descriptors

mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
tpsa = Descriptors.TPSA(mol)
```
Generate Fingerprints
```python
from rdkit.Chem import rdFingerprintGenerator

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = gen.GetFingerprint(mol)
```
Similarity Search
```python
from rdkit import DataStructs

similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
```
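The similarity call takes two fingerprints, `fp1` and `fp2`, which the snippet does not define. A minimal end-to-end sketch follows, assuming two illustrative molecules (phenol and p-cresol) and a shared Morgan generator; the Dice comparison is added only for contrast.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# One generator for both molecules so the fingerprints are comparable
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

phenol = Chem.MolFromSmiles('c1ccc(O)cc1')    # illustrative molecule
cresol = Chem.MolFromSmiles('Cc1ccc(O)cc1')   # illustrative molecule

fp1 = gen.GetFingerprint(phenol)
fp2 = gen.GetFingerprint(cresol)

similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
dice = DataStructs.DiceSimilarity(fp1, fp2)
self_similarity = DataStructs.TanimotoSimilarity(fp1, fp1)

print(f"Tanimoto={similarity:.2f}  Dice={dice:.2f}  self={self_similarity:.2f}")
```

Both metrics range from 0 to 1, and for the same pair of bit vectors the Dice score is never lower than the Tanimoto score.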
Substructure Match
```python
pattern = Chem.MolFromSmarts('[OH1][C]')  # Alcohol
has_alcohol = mol.HasSubstructMatch(pattern)
```
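In SMARTS, an uppercase `C` matches only aliphatic carbon and a lowercase `c` only aromatic carbon, so the alcohol pattern does not match a phenolic hydroxyl. The sketch below shows the difference, assuming ethanol and phenol as illustrative inputs and a hypothetical aromatic variant of the pattern.

```python
from rdkit import Chem

aliphatic_oh = Chem.MolFromSmarts('[OH1][C]')  # hydroxyl on aliphatic carbon
aromatic_oh = Chem.MolFromSmarts('[OH1][c]')   # hydroxyl on aromatic carbon (illustrative variant)

ethanol = Chem.MolFromSmiles('CCO')
phenol = Chem.MolFromSmiles('c1ccc(O)cc1')

results = {
    'ethanol/aliphatic': ethanol.HasSubstructMatch(aliphatic_oh),
    'ethanol/aromatic': ethanol.HasSubstructMatch(aromatic_oh),
    'phenol/aliphatic': phenol.HasSubstructMatch(aliphatic_oh),
    'phenol/aromatic': phenol.HasSubstructMatch(aromatic_oh),
}

# GetSubstructMatches returns atom-index tuples, ordered like the pattern atoms
phenol_matches = phenol.GetSubstructMatches(aromatic_oh)
print(results, phenol_matches)
```

`HasSubstructMatch` answers yes or no, while `GetSubstructMatches` is the call to use when the matched atom indices are needed, for example for highlighting.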
Generate 3D Conformer
```python
from rdkit.Chem import AllChem

mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=42)
AllChem.MMFFOptimizeMolecule(mol)
```
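Both the embedding and the optimization calls return status codes that are easy to ignore. A minimal sketch that checks them, assuming phenol as an illustrative input:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('c1ccc(O)cc1'))  # illustrative molecule

# EmbedMolecule returns the new conformer ID, or -1 if embedding failed
conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)
if conf_id == -1:
    raise RuntimeError("3D embedding failed")

# MMFFOptimizeMolecule returns 0 if converged, 1 if more iterations
# are needed, and -1 if the force field could not be set up
status = AllChem.MMFFOptimizeMolecule(mol)

# Coordinates come back as an (n_atoms, 3) NumPy array
positions = mol.GetConformer(conf_id).GetPositions()
print(status, positions.shape)
```

When the status is 1, calling the optimizer again or raising its `maxIters` argument is the usual remedy.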
Implementation Patterns
Drug-Likeness Assessment
```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def assess_druglikeness(smiles: str) -> dict | None:
    """Return drug-likeness descriptors, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)
    return {
        'MW': mw,
        'LogP': logp,
        'HBD': hbd,
        'HBA': hba,
        'TPSA': Descriptors.TPSA(mol),
        'RotBonds': Descriptors.NumRotatableBonds(mol),
        # Strict check: all four Lipinski criteria must hold
        'Lipinski': mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10,
        'QED': Descriptors.qed(mol),
    }
```
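The `Lipinski` flag above requires all four criteria to hold, whereas the rule of five is commonly applied with one violation tolerated. A sketch of that looser reading follows; the helper name `lipinski_violations` and the aspirin input are hypothetical, chosen only for illustration.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def lipinski_violations(mol: Chem.Mol) -> int:
    """Count how many of the four rule-of-five criteria a molecule breaks."""
    checks = [
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Descriptors.NumHDonors(mol) > 5,
        Descriptors.NumHAcceptors(mol) > 10,
    ]
    return sum(checks)

aspirin = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # illustrative molecule
n_violations = lipinski_violations(aspirin)
passes_ro5 = n_violations <= 1  # tolerate at most one violation

print(n_violations, passes_ro5)
```

Returning the count instead of a boolean lets the caller choose how strict the filter should be.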
Batch Similarity Search
```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def find_similar(query_smiles: str, database: list[str], threshold: float = 0.7) -> list:
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        return []
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)
    query_fp = gen.GetFingerprint(query)
    hits = []
    for idx, smi in enumerate(database):
        mol = Chem.MolFromSmiles(smi)
        if mol:
            fp = gen.GetFingerprint(mol)
            sim = DataStructs.TanimotoSimilarity(query_fp, fp)
            if sim >= threshold:
                hits.append((idx, smi, sim))
    return sorted(hits, key=lambda x: x[2], reverse=True)
```
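For larger libraries, the per-molecule similarity call can be replaced by a single bulk call. The sketch below is one possible variant; the function name `rank_by_similarity` and the sample SMILES are hypothetical.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def rank_by_similarity(query_smiles: str, database: list[str]) -> list[tuple[str, float]]:
    """Rank parseable database entries by Tanimoto similarity to the query."""
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        return []
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

    kept, fps = [], []
    for smi in database:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:          # skip entries that fail to parse
            kept.append(smi)
            fps.append(gen.GetFingerprint(mol))

    # One call computes the query-vs-all similarities
    sims = DataStructs.BulkTanimotoSimilarity(gen.GetFingerprint(query), fps)
    return sorted(zip(kept, sims), key=lambda pair: pair[1], reverse=True)

ranked = rank_by_similarity(
    'c1ccc(O)cc1',
    ['Cc1ccc(O)cc1', 'CCO', 'not_a_smiles', 'c1ccc(O)cc1'],
)
print(ranked)
```

The unparseable entry is dropped before fingerprinting, so the returned list can be shorter than the input database.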
Functional Group Screening
```python
from rdkit import Chem

FUNCTIONAL_GROUPS = {
    'alcohol': '[OH1][C]',
    'amine': '[NH2,NH1][C]',
    'carboxylic_acid': 'C(=O)[OH1]',
    'amide': 'C(=O)N',
    'ester': 'C(=O)O[C]',
    'nitro': '[N+](=O)[O-]'
}

def detect_functional_groups(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    found = []
    for name, smarts in FUNCTIONAL_GROUPS.items():
        pattern = Chem.MolFromSmarts(smarts)
        if mol.HasSubstructMatch(pattern):
            found.append(name)
    return found
```
Conformer Generation with Clustering
```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def generate_diverse_conformers(smiles: str, n_confs: int = 50, rmsd_thresh: float = 0.5) -> list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    mol = Chem.AddHs(mol)
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)
    # Optimize all conformers
    for cid in conf_ids:
        AllChem.MMFFOptimizeMolecule(mol, confId=cid)
    # Cluster by RMSD to get diverse set
    if len(conf_ids) < 2:
        return list(conf_ids)
    dists = []
    for i in range(len(conf_ids)):
        for j in range(i):
            rmsd = AllChem.GetConformerRMS(mol, conf_ids[j], conf_ids[i])
            dists.append(rmsd)
    clusters = Butina.ClusterData(dists, len(conf_ids), rmsd_thresh, isDistData=True)
    return [conf_ids[c[0]] for c in clusters]  # Cluster centroids
```
Batch Processing SDF Files
```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def process_sdf(input_path: str, output_path: str, min_mw: float = 200, max_mw: float = 500):
    """Filter compounds by molecular weight and add property columns."""
    supplier = Chem.SDMolSupplier(input_path)
    writer = Chem.SDWriter(output_path)
    for mol in supplier:
        if mol is None:
            continue
        mw = Descriptors.MolWt(mol)
        if not (min_mw <= mw <= max_mw):
            continue
        # Add computed properties
        mol.SetProp('MW', f'{mw:.2f}')
        mol.SetProp('LogP', f'{Descriptors.MolLogP(mol):.2f}')
        mol.SetProp('TPSA', f'{Descriptors.TPSA(mol):.2f}')
        writer.write(mol)
    writer.close()
```
Guidelines
Always validate parsed molecules:
```python
# smiles_list is a placeholder for any iterable of SMILES strings
for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Parse failed: {smiles}")
        continue
```
Use bulk operations for performance:
```python
fps = [gen.GetFingerprint(m) for m in mols]
sims = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
```
Add hydrogens for 3D work:
```python
mol = Chem.AddHs(mol)  # Add explicit hydrogens before EmbedMolecule
AllChem.EmbedMolecule(mol)
```
Stream large files:
```python
# Memory-efficient: process one at a time
for mol in Chem.ForwardSDMolSupplier(file_handle):
    if mol:
        process(mol)

# Avoid: loading entire file
all_mols = list(Chem.SDMolSupplier('huge.sdf'))
```
**Thread safety:** Most operations are thread-safe except for concurrent access to MolSupplier objects.
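The streaming guideline uses a `file_handle` that the snippet does not define. A self-contained sketch follows, assuming a small temporary SDF written on the fly; the file name and molecules are illustrative. `ForwardSDMolSupplier` reads from a file-like object, which should be opened in binary mode.

```python
import os
import tempfile

from rdkit import Chem

# Write a tiny SDF so the example does not depend on external data
path = os.path.join(tempfile.mkdtemp(), 'demo.sdf')
writer = Chem.SDWriter(path)
for smi in ('CCO', 'c1ccccc1', 'CC(=O)O'):
    writer.write(Chem.MolFromSmiles(smi))
writer.close()

# Stream the file: only one molecule is held in memory at a time
count = 0
heavy_atoms = []
with open(path, 'rb') as file_handle:
    for mol in Chem.ForwardSDMolSupplier(file_handle):
        if mol is None:      # skip records that fail to parse
            continue
        count += 1
        heavy_atoms.append(mol.GetNumHeavyAtoms())

print(count, heavy_atoms)
```

Unlike `SDMolSupplier`, the forward supplier offers no random access or `len()`, which is the trade-off for being able to read from streams such as compressed files.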
Troubleshooting
| Issue | Resolution |
|---|---|
| Invalid SMILES syntax; check input |
| Sanitization error | Use |
| Wrong 3D geometry | Call |
| Fingerprint size mismatch | Use same |
| SMARTS not matching | Check aromatic vs aliphatic atoms ( |
| Slow SDF processing | Use |
| Memory issues with large files | Stream with |
Reference Documentation
Each reference file contains detailed API documentation:
| File | Contents |
|---|---|
| Complete function/class listings by module |
| All molecular descriptors with examples |
| Common SMARTS patterns for substructure search |
External Resources
- RDKit Documentation: https://www.rdkit.org/docs/
- Getting Started Guide: https://www.rdkit.org/docs/GettingStartedInPython.html
- GitHub Repository: https://github.com/rdkit/rdkit
- Cookbook: https://www.rdkit.org/docs/Cookbook.html