RDKit - Cheminformatics and Drug Discovery
RDKit is a widely used open-source toolkit for cheminformatics. It provides tools for molecular manipulation, descriptor calculation, fingerprinting, substructure searching, and 3D molecular modeling. It is used extensively in the pharmaceutical industry for drug discovery and virtual screening.
When to Use
- Reading and writing chemical file formats (SMILES, SDF, MOL2, PDB).
- Calculating molecular descriptors and drug-like properties (Lipinski's Rule of Five).
- Generating molecular fingerprints for similarity searching.
- Substructure searching and chemical pattern matching (SMARTS).
- 3D conformer generation and molecular alignment.
- Virtual screening of compound libraries.
- Pharmacophore modeling and shape similarity.
- QSAR (Quantitative Structure-Activity Relationship) modeling.
- Reaction enumeration and retrosynthesis.
- Visualizing chemical structures in 2D and 3D.
- Building machine learning models for molecular property prediction.
Reference Documentation
Official docs: https://www.rdkit.org/docs/
RDKit Book: https://www.rdkit.org/docs/RDKit_Book.html
GitHub: https://github.com/rdkit/rdkit
Search patterns: `rdkit.Chem`, `rdkit.Chem.Descriptors`, `rdkit.Chem.AllChem`, `rdkit.DataStructs`
Core Principles
Molecular Representation
RDKit represents molecules as graphs where atoms are nodes and bonds are edges. The core object is `Mol`, which can be created from SMILES, SDF files, or built programmatically.
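To make the graph view concrete, the following minimal sketch walks the atoms (nodes) and bonds (edges) of ethanol. The variable names are illustrative only.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Atoms are the nodes of the molecular graph
atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]

# Bonds are the edges; each one connects two atom indices
edges = [
    (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), str(bond.GetBondType()))
    for bond in mol.GetBonds()
]

print(atom_symbols)
print(edges)
```

Hydrogens are implicit here, so only the three heavy atoms appear as nodes.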
SMILES (Simplified Molecular Input Line Entry System)
A text-based notation for chemical structures. Example: `CCO` is ethanol, `c1ccccc1` is benzene. RDKit can parse and generate SMILES strings.
Fingerprints for Similarity
Molecular fingerprints are binary vectors encoding structural features. They enable fast similarity searching and clustering of large compound libraries.
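As a rough illustration of why bit vectors make comparison cheap, the Tanimoto coefficient between two fingerprints is the number of shared on-bits divided by the number of on-bits in either. The sketch below uses plain Python sets with made-up bit positions; the `tanimoto` helper is hypothetical and is not RDKit's implementation.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        return 0.0  # Convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Made-up on-bit positions for two hypothetical molecules
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 7, 42, 99}

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 6 bits in the union
```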
Lazy Evaluation
Many RDKit operations are lazy: work is done only when it is needed. For example, `SDMolSupplier` parses a record only when that molecule is requested, which keeps memory use low when iterating over large libraries.
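The same on-demand idea can be applied in user code with a Python generator, so that only one molecule is parsed and held at a time. This is a minimal sketch; `iter_valid_mols` is a hypothetical helper, not an RDKit function.

```python
from rdkit import Chem

def iter_valid_mols(smiles_iterable):
    """Yield (smiles, mol) pairs one at a time, skipping entries that fail to parse."""
    for smi in smiles_iterable:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            yield smi, mol

# Any iterable works here, including a file object read line by line
smiles_stream = iter(["CCO", "not_a_smiles", "c1ccccc1"])

heavy_atom_counts = {
    smi: mol.GetNumHeavyAtoms() for smi, mol in iter_valid_mols(smiles_stream)
}
print(heavy_atom_counts)
```

Because nothing is parsed until the consumer asks for the next item, the input can be far larger than available memory.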
Quick Reference
Installation
```bash
# Via conda (recommended)
conda install -c conda-forge rdkit

# Via pip
pip install rdkit

# For visualization
pip install rdkit pillow
```
Standard Imports
```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw, Lipinski
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import numpy as np
import pandas as pd
```
Basic Pattern - SMILES to Molecule
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# 1. Create molecule from SMILES
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # Aspirin
mol = Chem.MolFromSmiles(smiles)

# 2. Check if molecule is valid
if mol is None:
    print("Invalid SMILES")
else:
    print(f"Molecular formula: {rdMolDescriptors.CalcMolFormula(mol)}")
    print(f"Molecular weight: {Descriptors.MolWt(mol):.2f}")
```
Basic Pattern - Calculate Properties
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")

# Calculate drug-like properties
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Lipinski.NumHDonors(mol)
hba = Lipinski.NumHAcceptors(mol)
print(f"MW: {mw:.2f}, LogP: {logp:.2f}, HBD: {hbd}, HBA: {hba}")

# Check Lipinski's Rule of Five
lipinski_pass = (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10)
print(f"Lipinski compliant: {lipinski_pass}")
```
Critical Rules
✅ DO
- Always Validate Molecules - Check `mol is not None` after parsing SMILES/files to catch invalid structures.
- Use Canonical SMILES - Use `Chem.MolToSmiles(mol)` to get canonical (standardized) SMILES for comparison.
- Sanitize Molecules - RDKit auto-sanitizes by default (valence checking, aromaticity). Keep it enabled unless you have a specific reason.
- Generate 3D Coordinates - Use `AllChem.EmbedMolecule()` before 3D operations like alignment or docking.
- Use Fingerprints for Large Libraries - For similarity searching in millions of compounds, comparing fingerprints can be orders of magnitude (on the order of 1000x) faster than comparing structures directly.
- Specify Random Seeds - For reproducible conformer generation, always set `randomSeed`.
- Handle Stereochemistry - Use `Chem.AssignStereochemistry()` to properly assign R/S and E/Z labels.
- Batch Processing - Use generators or chunking for processing millions of molecules to avoid memory issues.
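The canonical SMILES and stereochemistry rules above interact: canonical SMILES only distinguish stereoisomers when stereo information is kept. The sketch below uses the two enantiomers of alanine as an arbitrary example; the variable names are illustrative.

```python
from rdkit import Chem

# Two enantiomers of alanine, differing only in the chirality tag
ala_a = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
ala_b = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")

# Stereo-aware canonical SMILES keep the enantiomers apart
iso_a = Chem.MolToSmiles(ala_a)
iso_b = Chem.MolToSmiles(ala_b)

# Dropping stereo information collapses them to the same string
flat_a = Chem.MolToSmiles(ala_a, isomericSmiles=False)
flat_b = Chem.MolToSmiles(ala_b, isomericSmiles=False)

# CIP labels as a list of (atom index, 'R' or 'S')
centers_a = Chem.FindMolChiralCenters(ala_a)
centers_b = Chem.FindMolChiralCenters(ala_b)

print(iso_a, iso_b, flat_a == flat_b)
print(centers_a, centers_b)
```

Whether to compare with or without stereo depends on whether the application treats stereoisomers as the same compound.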
❌ DON'T
- Don't Ignore Invalid Molecules - Always handle the case when `MolFromSmiles()` returns `None`.
- Don't Compare SMILES Strings Directly - Two different SMILES can represent the same molecule. Use canonical SMILES or InChI.
- Don't Skip Kekulization - For aromatic systems, ensure proper Kekulé structure assignment.
- Don't Use Descriptors for Similarity - Use fingerprints (much faster and more appropriate).
- Don't Forget Hydrogens - Add explicit hydrogens with `Chem.AddHs()` when needed for 3D operations.
- Don't Overuse 3D Minimization - Energy minimization is slow; only use when necessary (docking, visualization).
Anti-Patterns (NEVER)
```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# ❌ BAD: Not checking if molecule is valid
smiles = "INVALID_SMILES"
mol = Chem.MolFromSmiles(smiles)  # Returns None
# mw = Descriptors.MolWt(mol)  # Raises an exception because mol is None

# ✅ GOOD: Always validate
mol = Chem.MolFromSmiles(smiles)
if mol is not None:
    mw = Descriptors.MolWt(mol)
else:
    print("Invalid SMILES")

# ❌ BAD: Comparing SMILES strings directly
smiles1 = "CC(C)C"    # isobutane
smiles2 = "C(C)(C)C"  # same molecule, different SMILES
if smiles1 == smiles2:  # False, but same molecule!
    print("Same")

# ✅ GOOD: Use canonical SMILES
mol1 = Chem.MolFromSmiles(smiles1)
mol2 = Chem.MolFromSmiles(smiles2)
can1 = Chem.MolToSmiles(mol1)
can2 = Chem.MolToSmiles(mol2)
if can1 == can2:  # True
    print("Same molecule")

# Reference molecule with 3D coordinates, used by both cases below
ref_mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(ref_mol, randomSeed=42)

# ❌ BAD: 3D operations without 3D coordinates
mol = Chem.MolFromSmiles("CCO")
# AllChem.AlignMol(mol, ref_mol)  # Fails! mol has no 3D coords

# ✅ GOOD: Generate 3D coordinates first
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(mol, randomSeed=7)
AllChem.AlignMol(mol, ref_mol)
```
Molecular I/O and Conversion
SMILES Parsing
```python
from rdkit import Chem

# Parse SMILES
mol = Chem.MolFromSmiles("CCO")

# Parse SMILES with sanitization control
mol = Chem.MolFromSmiles("CCO", sanitize=True)  # Default

# Generate canonical SMILES
canonical = Chem.MolToSmiles(mol)

# Generate isomeric SMILES (includes stereochemistry)
iso_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)

# Generate SMILES without stereochemistry
non_iso = Chem.MolToSmiles(mol, isomericSmiles=False)

# Handle invalid SMILES
smiles_list = ["CCO", "INVALID", "c1ccccc1"]
mols = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        mols.append(mol)
    else:
        print(f"Failed to parse: {smi}")
```
Reading SDF Files
```python
from rdkit import Chem

# Read single molecule from file
mol = Chem.MolFromMolFile("molecule.mol")

# Read multiple molecules from SDF
suppl = Chem.SDMolSupplier("compounds.sdf")

# Iterate through molecules
for mol in suppl:
    if mol is None:
        continue
    smiles = Chem.MolToSmiles(mol)
    print(f"SMILES: {smiles}")
    # Access SDF properties
    if mol.HasProp("_Name"):
        name = mol.GetProp("_Name")
        print(f"Name: {name}")

# Read with removeHs=False to keep explicit hydrogens
suppl = Chem.SDMolSupplier("compounds.sdf", removeHs=False)
```
Writing SDF Files
```python
from rdkit import Chem

# Write single molecule
mol = Chem.MolFromSmiles("CCO")
writer = Chem.SDWriter("output.sdf")
writer.write(mol)
writer.close()

# Write multiple molecules
mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
writer = Chem.SDWriter("output.sdf")
for mol in mols:
    if mol is not None:
        writer.write(mol)
writer.close()

# Add properties to molecules
mol = Chem.MolFromSmiles("CCO")
mol.SetProp("_Name", "Ethanol")
mol.SetProp("Activity", "10.5")
writer = Chem.SDWriter("output.sdf")
writer.write(mol)
writer.close()
```
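A quick way to check that names and properties survive a write is to read the file back. The following is a small round-trip sketch using a temporary directory; the file name `roundtrip.sdf` is an arbitrary choice for illustration.

```python
import os
import tempfile

from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
mol.SetProp("_Name", "Ethanol")
mol.SetProp("Activity", "10.5")

# Write to a temporary SDF file
path = os.path.join(tempfile.mkdtemp(), "roundtrip.sdf")
writer = Chem.SDWriter(path)
writer.write(mol)
writer.close()  # Closing flushes the data to disk

# Read it back and recover the name and the property
suppl = Chem.SDMolSupplier(path)
loaded = [m for m in suppl if m is not None]
name = loaded[0].GetProp("_Name")
activity = float(loaded[0].GetProp("Activity"))  # Stored as text, so convert explicitly

print(name, activity)
```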
InChI and InChIKey
```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin

# Generate InChI (unique chemical identifier)
inchi = Chem.MolToInchi(mol)
print(f"InChI: {inchi}")

# Generate InChIKey (hashed InChI, good for database lookups)
inchikey = Chem.MolToInchiKey(mol)
print(f"InChIKey: {inchikey}")

# Parse InChI
mol_from_inchi = Chem.MolFromInchi(inchi)
```
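Because the InChIKey is a fixed-length hash, it is convenient as a dictionary key for removing duplicates. A minimal sketch, assuming the RDKit build includes InChI support (the usual case for conda and pip packages):

```python
from rdkit import Chem

# The first three entries are different SMILES spellings of ethanol
smiles_list = ["CCO", "OCC", "C(O)C", "c1ccccc1"]

unique = {}  # InChIKey -> first SMILES seen for that structure
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue
    key = Chem.MolToInchiKey(mol)
    unique.setdefault(key, smi)

print(f"{len(smiles_list)} inputs, {len(unique)} unique structures")
```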
Molecular Descriptors
Common Descriptors
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, Crippen

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin

# Basic properties
mw = Descriptors.MolWt(mol)
num_atoms = mol.GetNumAtoms()
num_heavy_atoms = Lipinski.HeavyAtomCount(mol)

# Lipinski's Rule of Five parameters
logp = Descriptors.MolLogP(mol)
hbd = Lipinski.NumHDonors(mol)
hba = Lipinski.NumHAcceptors(mol)
rotatable_bonds = Lipinski.NumRotatableBonds(mol)

# Topological descriptors
tpsa = Descriptors.TPSA(mol)  # Topological polar surface area
rings = Lipinski.RingCount(mol)
aromatic_rings = Lipinski.NumAromaticRings(mol)

# Complexity
bertz_ct = Descriptors.BertzCT(mol)  # Molecular complexity

print(f"""
Molecular Weight: {mw:.2f}
LogP: {logp:.2f}
HBD: {hbd}
HBA: {hba}
TPSA: {tpsa:.2f}
Rotatable Bonds: {rotatable_bonds}
Aromatic Rings: {aromatic_rings}
""")
```
Calculate All Descriptors
```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCO")

# Get all available descriptors
descriptor_names = [desc[0] for desc in Descriptors.descList]

# Calculate all descriptors
descriptors = {}
for name in descriptor_names:
    calc = getattr(Descriptors, name)
    descriptors[name] = calc(mol)

print(f"Total descriptors: {len(descriptors)}")
print(f"First 5: {list(descriptors.items())[:5]}")
```
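When only a handful of descriptors are needed for many molecules, computing the full list is wasteful. One option is `MolecularDescriptorCalculator`, which takes a list of descriptor names; the selection below is an arbitrary example.

```python
from rdkit import Chem
from rdkit.ML.Descriptors import MoleculeDescriptors

# Arbitrary selection of descriptor names for illustration
selected = ["MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors"]
calculator = MoleculeDescriptors.MolecularDescriptorCalculator(selected)

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
rows = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    values = calculator.CalcDescriptors(mol)  # Tuple in the same order as `selected`
    rows.append(dict(zip(selected, values)))

for smi, row in zip(smiles_list, rows):
    print(smi, row)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` if a table is wanted.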
Drug-Likeness Filters
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def check_lipinski(mol):
    """Check Lipinski's Rule of Five."""
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Lipinski.NumHDonors(mol)
    hba = Lipinski.NumHAcceptors(mol)
    rules = {
        'MW <= 500': mw <= 500,
        'LogP <= 5': logp <= 5,
        'HBD <= 5': hbd <= 5,
        'HBA <= 10': hba <= 10
    }
    passed = all(rules.values())
    return passed, rules

def check_veber(mol):
    """Check Veber's rules for oral bioavailability."""
    rotatable = Lipinski.NumRotatableBonds(mol)
    tpsa = Descriptors.TPSA(mol)
    rules = {
        'Rotatable bonds <= 10': rotatable <= 10,
        'TPSA <= 140': tpsa <= 140
    }
    passed = all(rules.values())
    return passed, rules

# Usage
mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin
lipinski_pass, lipinski_rules = check_lipinski(mol)
veber_pass, veber_rules = check_veber(mol)
print(f"Lipinski: {lipinski_pass}")
print(f"Veber: {veber_pass}")
```
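`check_lipinski` above is strict: every criterion must hold. A commonly used relaxation tolerates a single violation. The sketch below counts violations instead; `count_lipinski_violations` is a hypothetical helper, and the one-violation cutoff is an assumption to adjust per project.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def count_lipinski_violations(mol):
    """Return how many of the four Rule of Five criteria are violated."""
    checks = [
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ]
    return sum(checks)

aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
long_alkane = Chem.MolFromSmiles("C" * 40)  # Heavy and very lipophilic

violations = count_lipinski_violations(aspirin)
alkane_violations = count_lipinski_violations(long_alkane)
drug_like = violations <= 1  # Assumed cutoff: at most one violation

print(violations, alkane_violations, drug_like)
```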
Molecular Fingerprints
Morgan Fingerprints (Circular)
```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import numpy as np

# Create molecule
mol = Chem.MolFromSmiles("CCO")

# Generate Morgan fingerprint (ECFP4)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# Convert to numpy array (the target array is resized to the fingerprint length)
arr = np.zeros((0,), dtype=np.int8)
DataStructs.ConvertToNumpyArray(fp, arr)

# Generate count-based fingerprint (for feature importance)
fp_counts = AllChem.GetMorganFingerprint(mol, radius=2)

# Get feature info
info = {}
fp_info = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048, bitInfo=info)
print(f"Number of on-bits: {len(info)}")
```
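Recent RDKit releases steer users toward the generator-based API in `rdFingerprintGenerator` (already listed under Standard Imports) for Morgan fingerprints. The following sketch shows the equivalent calls, assuming a reasonably recent RDKit version:

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

mol = Chem.MolFromSmiles("CCO")

# One generator object can be reused for every molecule in a library
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

bit_fp = generator.GetFingerprint(mol)         # Bit vector
count_fp = generator.GetCountFingerprint(mol)  # Counts instead of bits
arr = generator.GetFingerprintAsNumPy(mol)     # NumPy array, ready for ML libraries

print(bit_fp.GetNumBits(), bit_fp.GetNumOnBits(), arr.shape)
```

Creating the generator once and reusing it keeps the fingerprint settings consistent across a whole data set.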
Fingerprint Similarity
```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

# Create molecules
mol1 = Chem.MolFromSmiles("CCO")
mol2 = Chem.MolFromSmiles("CCCO")
mol3 = Chem.MolFromSmiles("c1ccccc1")

# Generate fingerprints
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2)
fp3 = AllChem.GetMorganFingerprintAsBitVect(mol3, radius=2)

# Calculate Tanimoto similarity
sim_12 = DataStructs.TanimotoSimilarity(fp1, fp2)
sim_13 = DataStructs.TanimotoSimilarity(fp1, fp3)
print(f"Ethanol vs Propanol: {sim_12:.3f}")  # Expected to be the higher of the two
print(f"Ethanol vs Benzene: {sim_13:.3f}")   # Expected to be the lower of the two

# Calculate Dice similarity
dice_12 = DataStructs.DiceSimilarity(fp1, fp2)

# Bulk similarity (compare one to many)
fps = [fp1, fp2, fp3]
similarities = DataStructs.BulkTanimotoSimilarity(fp1, fps)
print(f"Bulk similarities: {similarities}")
```
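Bulk similarity is typically followed by a ranking step. The sketch below ranks a tiny, made-up library against a query; the compound names and the generator settings are illustrative assumptions.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Made-up mini library for illustration
library = {
    "ethanol": "CCO",
    "propanol": "CCCO",
    "butanol": "CCCCO",
    "benzene": "c1ccccc1",
}

generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
query_fp = generator.GetFingerprint(Chem.MolFromSmiles("CCO"))

names = list(library)
fps = [generator.GetFingerprint(Chem.MolFromSmiles(library[n])) for n in names]

# One call scores the query against every library fingerprint
scores = DataStructs.BulkTanimotoSimilarity(query_fp, fps)

# Sort from most to least similar
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```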
Other Fingerprint Types
```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("CCO")

# RDKit fingerprint (topological)
fp_rdkit = Chem.RDKFingerprint(mol)

# Atom pair fingerprint
fp_atompair = AllChem.GetHashedAtomPairFingerprintAsBitVect(mol)

# Topological torsion fingerprint
fp_torsion = AllChem.GetHashedTopologicalTorsionFingerprintAsBitVect(mol)

# MACCS keys (166-bit structural keys)
fp_maccs = MACCSkeys.GenMACCSKeys(mol)
```
Substructure Searching
SMARTS Pattern Matching
```python
from rdkit import Chem

# Define molecule
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin

# SMARTS pattern for carboxylic acid
# ([OH] requires a hydrogen on the oxygen, so the ester group is not matched)
pattern = Chem.MolFromSmarts("C(=O)[OH]")

# Check if substructure exists
has_match = mol.HasSubstructMatch(pattern)
print(f"Contains carboxylic acid: {has_match}")

# Get matching atoms
matches = mol.GetSubstructMatches(pattern)
print(f"Number of matches: {len(matches)}")
print(f"Matching atom indices: {matches}")

# Common SMARTS patterns
patterns = {
    'Carboxylic acid': 'C(=O)[OH]',
    'Ester': 'C(=O)O[C,c]',
    'Amide': 'C(=O)N',
    'Alcohol': '[OX2H][CX4]',  # sp3 carbon only, so acids and phenols are excluded
    'Primary amine': '[NH2][C,c]',
    'Aromatic ring': 'c1ccccc1'
}
for name, smarts in patterns.items():
    pattern = Chem.MolFromSmarts(smarts)
    if mol.HasSubstructMatch(pattern):
        print(f"Contains {name}")
```
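Loose patterns can over-match: a bare `C(=O)O` fits both the acid and the ester group of aspirin. The sketch below counts each group separately with more specific SMARTS; the exact patterns are one reasonable choice, not the only one.

```python
from rdkit import Chem

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# More specific patterns (one possible formulation)
group_smarts = {
    "carboxylic acid": "[CX3](=O)[OX2H1]",  # Oxygen must carry a hydrogen
    "ester": "[CX3](=O)[OX2][#6]",          # Oxygen must be bonded to another carbon
}

counts = {}
for name, smarts in group_smarts.items():
    query = Chem.MolFromSmarts(smarts)
    counts[name] = len(aspirin.GetSubstructMatches(query))

# For comparison: the loose pattern matches both groups
loose_count = len(aspirin.GetSubstructMatches(Chem.MolFromSmarts("C(=O)O")))

print(counts, loose_count)
```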
Substructure Filtering
```python
from rdkit import Chem

# Library of molecules
smiles_list = [
    "CC(=O)O",         # Acetic acid
    "CCO",             # Ethanol
    "c1ccccc1C(=O)O",  # Benzoic acid
    "CCCC",            # Butane
]
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

# Filter molecules containing carboxylic acid
pattern = Chem.MolFromSmarts("C(=O)[OH]")
filtered = [mol for mol in mols if mol.HasSubstructMatch(pattern)]
print(f"Molecules with carboxylic acid: {len(filtered)}/{len(mols)}")

# Filter by multiple patterns (AND logic)
pattern1 = Chem.MolFromSmarts("c1ccccc1")   # Aromatic ring
pattern2 = Chem.MolFromSmarts("C(=O)[OH]")  # Carboxylic acid
aromatic_acids = [
    mol for mol in mols
    if mol.HasSubstructMatch(pattern1) and mol.HasSubstructMatch(pattern2)
]
```
Replace Substructures
```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Replace carboxylic acid with ester
mol = Chem.MolFromSmiles("CC(=O)O")  # Acetic acid

# Define replacement as a reaction SMARTS (mapped carbon is kept, acid becomes methyl ester)
rxn = AllChem.ReactionFromSmarts('[C:1](=O)[OH]>>[C:1](=O)OC')

# Apply reaction
products = rxn.RunReactants((mol,))
if products:
    product = products[0][0]
    Chem.SanitizeMol(product)  # Reaction products are returned unsanitized
    print(f"Product: {Chem.MolToSmiles(product)}")  # Methyl acetate
```
3D Conformer Generation
Generate 3D Coordinates
```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Create molecule
mol = Chem.MolFromSmiles("CCO")

# Add hydrogens (required for 3D)
mol = Chem.AddHs(mol)

# Generate 3D coordinates
result = AllChem.EmbedMolecule(mol, randomSeed=42)
if result == 0:  # Success
    print("3D coordinates generated")
else:
    print("Failed to generate 3D coordinates")

# Optimize geometry with MMFF force field
AllChem.MMFFOptimizeMolecule(mol)

# Get atomic positions
conf = mol.GetConformer()
for i in range(mol.GetNumAtoms()):
    pos = conf.GetAtomPosition(i)
    print(f"Atom {i}: ({pos.x:.3f}, {pos.y:.3f}, {pos.z:.3f})")
```
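Embedding can fail and MMFF does not have parameters for every molecule, so a slightly more defensive variant is often useful. The sketch below uses an ETKDGv3 parameter object and falls back to UFF; the iteration limit of 500 is an arbitrary choice.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))

# Embedding parameters are collected in one object
params = AllChem.ETKDGv3()
params.randomSeed = 42  # Fixed seed for reproducibility
status = AllChem.EmbedMolecule(mol, params)  # 0 on success, -1 on failure

if status == 0:
    # Prefer MMFF when parameters exist for every atom, otherwise fall back to UFF
    if AllChem.MMFFHasAllMoleculeParams(mol):
        converged = AllChem.MMFFOptimizeMolecule(mol, maxIters=500)
    else:
        converged = AllChem.UFFOptimizeMolecule(mol, maxIters=500)
    # The optimizers return 0 when converged and 1 when more iterations are needed
    print(f"Embedded; optimizer return code: {converged}")
else:
    print("Embedding failed; no 3D coordinates available")
```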
Multiple Conformers
```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCCC")  # Butane
mol = Chem.AddHs(mol)

# Generate multiple conformers
conf_ids = AllChem.EmbedMultipleConfs(
    mol,
    numConfs=10,
    randomSeed=42,
    pruneRmsThresh=0.5  # Remove similar conformers
)
print(f"Generated {len(conf_ids)} conformers")

# Optimize each conformer
for conf_id in conf_ids:
    AllChem.MMFFOptimizeMolecule(mol, confId=conf_id)

# Get energies
props = AllChem.MMFFGetMoleculeProperties(mol)
for conf_id in conf_ids:
    ff = AllChem.MMFFGetMoleculeForceField(mol, props, confId=conf_id)
    energy = ff.CalcEnergy()
    print(f"Conformer {conf_id}: {energy:.2f} kcal/mol")
```
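A frequent next step is picking the lowest-energy conformer. As a sketch, `MMFFOptimizeMoleculeConfs` optimizes all conformers in one call and returns a (not-converged flag, energy) pair for each; the number of conformers and iterations below are arbitrary.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CCCC"))  # Butane
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42))

# One (not_converged, energy) tuple per conformer, in conformer order
results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)

energies = {
    conf_id: energy
    for conf_id, (not_converged, energy) in zip(conf_ids, results)
}
best_id = min(energies, key=energies.get)
print(f"Lowest-energy conformer: {best_id} ({energies[best_id]:.2f} kcal/mol)")
```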
Molecular Alignment
```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Reference molecule
ref_mol = Chem.MolFromSmiles("c1ccccc1C")  # Toluene
ref_mol = Chem.AddHs(ref_mol)
AllChem.EmbedMolecule(ref_mol, randomSeed=42)

# Probe molecule
probe_mol = Chem.MolFromSmiles("c1ccccc1CC")  # Ethylbenzene
probe_mol = Chem.AddHs(probe_mol)
AllChem.EmbedMolecule(probe_mol, randomSeed=42)

# Map the shared core (ring plus attached carbon) between the two molecules.
# Without an explicit atomMap, AlignMol attempts a whole-molecule substructure
# match, which does not succeed for these two different molecules.
core = Chem.MolFromSmarts("c1ccccc1C")
atom_map = list(zip(probe_mol.GetSubstructMatch(core), ref_mol.GetSubstructMatch(core)))

# Align probe to reference
rmsd = AllChem.AlignMol(probe_mol, ref_mol, atomMap=atom_map)
print(f"RMSD: {rmsd:.3f} Å")

# Get aligned coordinates
# Now probe_mol has coordinates aligned to ref_mol
```
Molecular Visualization
2D Drawings
```python
from rdkit import Chem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt

# Single molecule
mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin
img = Draw.MolToImage(mol, size=(300, 300))
plt.imshow(img)
plt.axis('off')
plt.show()

# Multiple molecules
mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
legends = ["Ethanol", "Benzene", "Acetic acid"]
img = Draw.MolsToGridImage(
    mols,
    molsPerRow=3,
    subImgSize=(200, 200),
    legends=legends
)
plt.imshow(img)
plt.axis('off')
plt.show()
```
Highlight Substructures

```python
from rdkit import Chem
from rdkit.Chem import Draw

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin

# Highlight carboxylic acid group
# [OX2H1] requires a hydroxyl oxygen, so the acetyl ester is not matched
pattern = Chem.MolFromSmarts("C(=O)[OX2H1]")
match = mol.GetSubstructMatch(pattern)

# Draw with highlighted atoms
img = Draw.MolToImage(mol, highlightAtoms=match, size=(300, 300))
```
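`GetSubstructMatch` returns only the first match. When a pattern occurs several times, a sketch like the following collects every match and renders the result as SVG. The function name `highlight_matches_svg` and the carbonyl pattern are illustrative choices, not part of the original text.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D


def highlight_matches_svg(smiles, smarts, size=(300, 300)):
    """Return (highlighted atom indices, SVG text) for all matches of a SMARTS pattern."""
    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmarts(smarts)
    if mol is None or pattern is None:
        raise ValueError("Could not parse the SMILES or SMARTS input")

    # Flatten all matches into one sorted list of unique atom indices
    atoms = sorted({idx for match in mol.GetSubstructMatches(pattern) for idx in match})

    drawer = rdMolDraw2D.MolDraw2DSVG(size[0], size[1])
    rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol, highlightAtoms=atoms)
    drawer.FinishDrawing()
    return atoms, drawer.GetDrawingText()


# Both carbonyl groups of aspirin (ester and acid) are highlighted
atoms, svg = highlight_matches_svg("CC(=O)OC1=CC=CC=C1C(=O)O", "C=O")
print(atoms)
```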
Save to File

```python
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D

mol = Chem.MolFromSmiles("CCO")

# Save as PNG
Draw.MolToFile(mol, "molecule.png", size=(300, 300))

# Save as SVG (vector graphics)
drawer = rdMolDraw2D.MolDraw2DSVG(300, 300)
drawer.DrawMolecule(mol)
drawer.FinishDrawing()
svg = drawer.GetDrawingText()
with open("molecule.svg", "w") as f:
    f.write(svg)
```
Practical Workflows

1. Virtual Screening Pipeline

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors, FilterCatalog, Lipinski
import pandas as pd


def screen_library(library_file, reference_smiles, similarity_threshold=0.7):
    """Screen compound library for similar, drug-like molecules."""
    # Reference molecule and fingerprint
    ref_mol = Chem.MolFromSmiles(reference_smiles)
    if ref_mol is None:
        raise ValueError(f"Invalid reference SMILES: {reference_smiles}")
    ref_fp = AllChem.GetMorganFingerprintAsBitVect(ref_mol, radius=2)

    # Build the PAINS (Pan-Assay Interference Compounds) catalog once,
    # not once per molecule
    params = FilterCatalog.FilterCatalogParams()
    params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
    catalog = FilterCatalog.FilterCatalog(params)

    # Results
    hits = []

    # Read library
    suppl = Chem.SDMolSupplier(library_file)
    for i, mol in enumerate(suppl):
        if mol is None:
            continue

        # Step 1: Drug-likeness filter (Lipinski)
        mw = Descriptors.MolWt(mol)
        logp = Descriptors.MolLogP(mol)
        hbd = Lipinski.NumHDonors(mol)
        hba = Lipinski.NumHAcceptors(mol)
        if not (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10):
            continue

        # Step 2: Similarity filter
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2)
        similarity = DataStructs.TanimotoSimilarity(ref_fp, fp)
        if similarity < similarity_threshold:
            continue

        # Step 3: PAINS filter
        if catalog.HasMatch(mol):
            continue

        # Passed all filters
        hits.append({
            'id': i,
            'smiles': Chem.MolToSmiles(mol),
            'similarity': similarity,
            'mw': mw,
            'logp': logp
        })

    # Convert to DataFrame; explicit columns keep the sort valid when there are no hits
    columns = ['id', 'smiles', 'similarity', 'mw', 'logp']
    df_hits = pd.DataFrame(hits, columns=columns)
    df_hits = df_hits.sort_values('similarity', ascending=False)
    return df_hits


# Usage
hits = screen_library('compounds.sdf', 'CC(=O)OC1=CC=CC=C1C(=O)O')
print(f"Found {len(hits)} hits")
```
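The drug-likeness step in the pipeline rejects a molecule on any single Lipinski violation. A common variant tolerates one violation, so it can help to see which rules fail. The sketch below returns the list of violated rules; `lipinski_violations` is a hypothetical helper, and the thresholds are the same ones used in the filter above.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski


def lipinski_violations(mol):
    """Return the names of the Rule of Five criteria that a molecule violates."""
    checks = {
        "MW > 500": Descriptors.MolWt(mol) > 500,
        "LogP > 5": Descriptors.MolLogP(mol) > 5,
        "HBD > 5": Lipinski.NumHDonors(mol) > 5,
        "HBA > 10": Lipinski.NumHAcceptors(mol) > 10,
    }
    return [name for name, violated in checks.items() if violated]


aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
long_alkane = Chem.MolFromSmiles("C" * 40)  # large and very lipophilic

print(lipinski_violations(aspirin))      # no violations expected for a small drug
print(lipinski_violations(long_alkane))  # too heavy and too lipophilic
```

A filter that allows one violation would then read `len(lipinski_violations(mol)) <= 1`.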
2. Diversity Selection

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.SimDivFilters import MaxMinPicker


def select_diverse_set(smiles_list, n_select=100):
    """Select diverse subset using MaxMin algorithm.

    Returns indices into smiles_list; unparseable SMILES are skipped.
    """
    # Generate fingerprints, remembering which input index each one came from
    valid_indices = []
    fps = []
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid_indices.append(idx)
            fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius=2))

    if len(fps) <= n_select:
        return valid_indices

    # MaxMin diversity picking
    def distance_function(i, j):
        return 1 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

    picker = MaxMinPicker()
    picks = picker.LazyPick(
        distance_function,
        len(fps),
        n_select,
        seed=42
    )
    # Map fingerprint positions back to positions in smiles_list
    return [valid_indices[i] for i in picks]


# Usage
smiles_list = ["CCO", "CCCO", "c1ccccc1", "CC(=O)O", "CCCCCCCC"]
diverse_indices = select_diverse_set(smiles_list, n_select=3)
diverse_smiles = [smiles_list[i] for i in diverse_indices]
print(f"Selected: {diverse_smiles}")
```
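`LazyPick` calls a Python distance function for every pair it inspects, which can become the bottleneck on large sets. A possible alternative, sketched here, hands the bit vectors to `MaxMinPicker.LazyBitVectorPick` so the Tanimoto distances are computed inside RDKit. The function name `pick_diverse` is hypothetical, and the fingerprints come from `rdFingerprintGenerator`, a swap for the `GetMorganFingerprintAsBitVect` call used in this section.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.SimDivFilters import MaxMinPicker


def pick_diverse(smiles_list, n_select, seed=42):
    """Return indices into smiles_list for a MaxMin-diverse subset."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

    kept, fps = [], []
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:  # skip unparseable SMILES but keep index bookkeeping
            kept.append(idx)
            fps.append(generator.GetFingerprint(mol))

    if len(fps) <= n_select:
        return kept

    picks = MaxMinPicker().LazyBitVectorPick(fps, len(fps), n_select, seed=seed)
    return [kept[i] for i in picks]


library = ["CCO", "not_a_smiles", "CCCO", "c1ccccc1", "CC(=O)O", "CCCCCCCC"]
chosen = pick_diverse(library, n_select=3)
print([library[i] for i in chosen])
```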
3. QSAR Model Building

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np


def build_qsar_model(smiles_list, activities):
    """Build QSAR model from SMILES and activities."""
    # Generate fingerprints
    fps = []
    valid_activities = []
    for smi, act in zip(smiles_list, activities):
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            # Morgan fingerprint as features
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to nBits
            fps.append(arr)
            valid_activities.append(act)

    X = np.array(fps)
    y = np.array(valid_activities)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    # RMSE via np.sqrt, because the squared=False argument is deprecated in recent scikit-learn
    rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    print(f"Train R²: {r2_score(y_train, y_pred_train):.3f}")
    print(f"Test R²: {r2_score(y_test, y_pred_test):.3f}")
    print(f"Test RMSE: {rmse:.3f}")
    return model


# Usage (toy data: with four compounds the test split holds a single sample,
# so the test R² is undefined; a real model needs far more data)
smiles = ["CCO", "CCCO", "c1ccccc1", "CC(=O)O"]
activities = [5.2, 4.8, 6.1, 5.5]  # pIC50 values
model = build_qsar_model(smiles, activities)
```
4. Scaffold Analysis

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from collections import Counter


def analyze_scaffolds(smiles_list):
    """Analyze Murcko scaffolds in compound set."""
    scaffolds = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            # Get Murcko scaffold
            scaffold = MurckoScaffold.GetScaffoldForMol(mol)
            scaffold_smi = Chem.MolToSmiles(scaffold)
            scaffolds.append(scaffold_smi)

    # Count scaffolds
    scaffold_counts = Counter(scaffolds)
    print(f"Unique scaffolds: {len(scaffold_counts)}")
    print("\nTop 5 scaffolds:")
    for scaffold, count in scaffold_counts.most_common(5):
        print(f"{scaffold}: {count}")
    return scaffold_counts


# Usage
smiles_list = ["c1ccccc1CC", "c1ccccc1CCC", "c1ccc(O)cc1"]
scaffolds = analyze_scaffolds(smiles_list)
```
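Counting scaffolds tells you how many chemotypes a set contains; grouping the compounds by scaffold tells you which compounds belong to each one. The sketch below does the grouping and can optionally collapse scaffolds to their generic carbon framework with `MurckoScaffold.MakeScaffoldGeneric`. The name `group_by_scaffold` and the `generic` flag are illustrative assumptions.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def group_by_scaffold(smiles_list, generic=False):
    """Map each Murcko scaffold SMILES to the input SMILES that share it."""
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable input
        scaffold = MurckoScaffold.GetScaffoldForMol(mol)
        if generic:
            # Replace every atom with carbon and every bond with a single bond
            scaffold = MurckoScaffold.MakeScaffoldGeneric(scaffold)
        groups[Chem.MolToSmiles(scaffold)].append(smi)
    return dict(groups)


exact = group_by_scaffold(["c1ccccc1CC", "c1ccccc1CCC", "c1ccc(O)cc1"])
generic = group_by_scaffold(["c1ccccc1CC", "c1ccncc1C"], generic=True)
print(exact)
print(generic)
```

With `generic=True`, benzene and pyridine derivatives fall into the same six-membered ring group, which is useful when heteroatom changes should not count as a new chemotype.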
5. Reaction Enumeration

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def enumerate_amide_coupling(acids, amines):
    """Enumerate amide products (first product per acid/amine pair)."""
    # Define reaction SMARTS
    rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]')
    products = []
    for acid_smi in acids:
        for amine_smi in amines:
            acid = Chem.MolFromSmiles(acid_smi)
            amine = Chem.MolFromSmiles(amine_smi)
            if acid is None or amine is None:
                continue

            # Run reaction
            products_tuple = rxn.RunReactants((acid, amine))
            if products_tuple:
                product = products_tuple[0][0]
                Chem.SanitizeMol(product)
                product_smi = Chem.MolToSmiles(product)
                products.append({
                    'acid': acid_smi,
                    'amine': amine_smi,
                    'product': product_smi
                })
    return products


# Usage
acids = ["CC(=O)O", "c1ccccc1C(=O)O"]
amines = ["CCN", "c1ccccc1N"]
products = enumerate_amide_coupling(acids, amines)
print(f"Generated {len(products)} amides")
```
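The reaction SMARTS in this section is deliberately loose: `C(=O)O` also matches esters, and `[N:3]` matches any nitrogen, including amide nitrogens. A stricter template, sketched below under the assumption that only free carboxylic acids and primary or secondary non-amide amines should react, also deduplicates products that arise from symmetric matches. The constant `AMIDE_SMARTS` and the function `couple` are hypothetical names.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Acid: carbonyl carbon bearing a hydroxyl oxygen.
# Amine: trivalent N with one or two hydrogens that is not already part of an amide.
AMIDE_SMARTS = (
    "[CX3:1](=[O:2])[OX2H1].[NX3;H2,H1;!$(NC=O):3]>>[C:1](=[O:2])[N:3]"
)


def couple(acid_smi, amine_smi):
    """Return the sorted, unique amide product SMILES for one acid/amine pair."""
    rxn = AllChem.ReactionFromSmarts(AMIDE_SMARTS)
    acid = Chem.MolFromSmiles(acid_smi)
    amine = Chem.MolFromSmiles(amine_smi)
    if acid is None or amine is None:
        return []

    unique = set()
    for product_set in rxn.RunReactants((acid, amine)):
        product = product_set[0]
        try:
            Chem.SanitizeMol(product)
        except ValueError:
            continue  # drop chemically invalid products
        unique.add(Chem.MolToSmiles(product))
    return sorted(unique)


print(couple("CC(=O)O", "CCN"))      # acid + primary amine
print(couple("CC(=O)OC", "CCN"))     # ester: no reaction under this template
print(couple("CC(=O)O", "CC(N)=O"))  # amide nitrogen: excluded
```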
Performance Optimization

Bulk Operations

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, PandasTools
import pandas as pd

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]  # example input

# Instead of a manual loop
# ❌ Verbose: separate lists to keep in sync, and it fails on invalid SMILES
fps = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    fps.append(fp)

# ✅ Tidier: use PandasTools to keep molecules and properties in one DataFrame
df = pd.DataFrame({'SMILES': smiles_list})
PandasTools.AddMoleculeColumnToFrame(df, 'SMILES', 'Molecule')

# Calculate properties in bulk
df['MW'] = df['Molecule'].apply(lambda x: Descriptors.MolWt(x) if x else None)
df['LogP'] = df['Molecule'].apply(lambda x: Descriptors.MolLogP(x) if x else None)
```
Parallel Processing

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
from multiprocessing import Pool
import pandas as pd


def process_molecule(smiles):
    """Process single molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        'smiles': smiles,
        'mw': Descriptors.MolWt(mol),
        'logp': Descriptors.MolLogP(mol),
        'fp': AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    }


def process_library_parallel(smiles_list, n_jobs=4):
    """Process library in parallel."""
    with Pool(n_jobs) as pool:
        results = pool.map(process_molecule, smiles_list)
    # Filter None results
    results = [r for r in results if r is not None]
    return pd.DataFrame(results)


# Usage (large_smiles_list is your own list of SMILES strings).
# The __main__ guard is required where worker processes are spawned
# rather than forked (Windows, macOS).
if __name__ == "__main__":
    df = process_library_parallel(large_smiles_list, n_jobs=8)
```
Caching Calculations

```python
from functools import lru_cache
from rdkit import Chem
from rdkit.Chem import Descriptors


@lru_cache(maxsize=10000)
def get_mol_properties(smiles):
    """Calculate properties with caching."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        'mw': Descriptors.MolWt(mol),
        'logp': Descriptors.MolLogP(mol),
        'tpsa': Descriptors.TPSA(mol)
    }


# Repeated calls will use cache
props1 = get_mol_properties("CCO")  # Calculated
props2 = get_mol_properties("CCO")  # Cached (instant)
```
Common Pitfalls and Solutions

The "Invalid SMILES" Problem

Not all SMILES strings are valid. `Chem.MolFromSmiles` returns `None` for a string it cannot parse instead of raising an exception.

```python
from rdkit import Chem

# ❌ Problem: Assuming all SMILES are valid
smiles_list = ["CCO", "INVALID", "c1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
# Contains None!

# ✅ Solution: Filter invalid molecules
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
valid_mols = [m for m in mols if m is not None]

# ✅ Better: Track which failed
results = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        results.append({'smiles': smi, 'mol': mol, 'valid': True})
    else:
        results.append({'smiles': smi, 'mol': None, 'valid': False})
```
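Knowing that a SMILES failed is often less useful than knowing why. A rough way to separate syntax errors from chemistry errors, sketched below, is to parse with `sanitize=False` and then ask `Chem.DetectChemistryProblems` what is wrong. The function name `diagnose_smiles` and its return strings are illustrative assumptions.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit's parser messages for this demo


def diagnose_smiles(smiles):
    """Classify a SMILES string as 'ok', 'syntax error', or a list of chemistry problems."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return "syntax error"  # the string could not be parsed at all
    problems = Chem.DetectChemistryProblems(mol)
    if problems:
        return "; ".join(p.GetType() for p in problems)
    return "ok"


for smi in ["CCO", "INVALID", "CC(C)(C)(C)(C)C"]:
    print(f"{smi}: {diagnose_smiles(smi)}")
```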
The "Stereochemistry Loss" Problem

Writing SMILES with `isomericSmiles=False` drops stereochemistry.

```python
from rdkit import Chem

# Molecule with stereochemistry
chiral_smiles = "C[C@H](O)CC"  # (S)-2-butanol
mol = Chem.MolFromSmiles(chiral_smiles)

# ❌ BAD: Lose stereochemistry
non_iso = Chem.MolToSmiles(mol, isomericSmiles=False)
print(non_iso)  # no "@" in the output: chirality is lost

# ✅ GOOD: Preserve stereochemistry
iso = Chem.MolToSmiles(mol, isomericSmiles=True)
print(iso)  # output still contains "@": chirality is preserved
```
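A quick way to check whether stereochemistry is present before and after a round trip is to list the chiral centers. The sketch below uses `Chem.FindMolChiralCenters` with `includeUnassigned=True`, which reports unspecified centers with a `?` label. The helper name `stereo_report` is an assumption.

```python
from rdkit import Chem


def stereo_report(smiles):
    """Summarize stereo information: isomeric SMILES, flat SMILES, and chiral centers."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "isomeric": Chem.MolToSmiles(mol),
        "flat": Chem.MolToSmiles(mol, isomericSmiles=False),
        # (atom index, label) pairs; '?' marks a center with no assigned configuration
        "centers": Chem.FindMolChiralCenters(mol, includeUnassigned=True),
    }


specified = stereo_report("C[C@H](O)CC")
unspecified = stereo_report("CC(O)CC")
print(specified)
print(unspecified)
```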
The "3D Without Hydrogens" Problem

Add explicit hydrogens before generating 3D coordinates; embedding without them tends to give poor geometries.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")

# ❌ BAD: Generate 3D without hydrogens
result = AllChem.EmbedMolecule(mol)
# Poor quality or failure

# ✅ GOOD: Add hydrogens first
mol_h = Chem.AddHs(mol)
result = AllChem.EmbedMolecule(mol_h)
AllChem.MMFFOptimizeMolecule(mol_h)
```
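`EmbedMolecule` signals failure by returning -1, and MMFF does not have parameters for every molecule, so a more defensive version checks both. The sketch below is one such wrapper; the name `make_3d`, the fixed seed, and the fallback to UFF are assumptions, not the only reasonable choices.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def make_3d(smiles, seed=42):
    """Return a molecule with explicit hydrogens and one optimized 3D conformer."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    mol = Chem.AddHs(mol)

    params = AllChem.ETKDGv3()
    params.randomSeed = seed  # fixed seed for reproducible coordinates
    if AllChem.EmbedMolecule(mol, params) == -1:
        raise RuntimeError(f"3D embedding failed for {smiles}")

    # Prefer MMFF; fall back to UFF when MMFF lacks parameters for some atom
    if AllChem.MMFFHasAllMoleculeParams(mol):
        AllChem.MMFFOptimizeMolecule(mol)
    else:
        AllChem.UFFOptimizeMolecule(mol)
    return mol


ethanol_3d = make_3d("CCO")
print(ethanol_3d.GetNumAtoms(), ethanol_3d.GetConformer().Is3D())
```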
The "Fingerprint Type Mismatch" Problem

Comparing different fingerprint types gives meaningless results.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys
from rdkit import DataStructs

mol1 = Chem.MolFromSmiles("CCO")
mol2 = Chem.MolFromSmiles("CCCO")

# ❌ BAD: Comparing different fingerprint types
fp1_morgan = AllChem.GetMorganFingerprintAsBitVect(mol1, 2)
fp2_maccs = MACCSkeys.GenMACCSKeys(mol2)
# This will error or give nonsense!
similarity = DataStructs.TanimotoSimilarity(fp1_morgan, fp2_maccs)

# ✅ GOOD: Use same fingerprint type
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2)
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
```
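One way to make a mismatch hard to introduce is to create a single fingerprint generator and use it for every molecule in a comparison. The sketch below does that with `rdFingerprintGenerator` (a swap for the `GetMorganFingerprintAsBitVect` calls in this section) and scores a query against a list with `DataStructs.BulkTanimotoSimilarity`. The function name and its parameters are hypothetical.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def similarities_to_query(query_smiles, library_smiles, radius=2, fp_size=2048):
    """Tanimoto similarity of each library molecule to the query, same fingerprint for all."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)

    query = Chem.MolFromSmiles(query_smiles)
    mols = [Chem.MolFromSmiles(s) for s in library_smiles]
    if query is None or any(m is None for m in mols):
        raise ValueError("All SMILES must be valid")

    query_fp = generator.GetFingerprint(query)
    library_fps = [generator.GetFingerprint(m) for m in mols]
    return DataStructs.BulkTanimotoSimilarity(query_fp, library_fps)


sims = similarities_to_query("CCO", ["CCO", "CCCO", "c1ccccc1"])
print([round(s, 3) for s in sims])
```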
The "Memory Explosion" Problem

Processing millions of molecules can exhaust memory.

```python
from rdkit import Chem

# ❌ BAD: Load entire library into memory
suppl = Chem.SDMolSupplier('huge_library.sdf')
mols = [mol for mol in suppl]  # Out of memory!

# ✅ GOOD: Process in batches
def process_in_batches(sdf_file, batch_size=10000):
    suppl = Chem.SDMolSupplier(sdf_file)
    batch = []
    for mol in suppl:
        if mol is not None:
            batch.append(mol)
        if len(batch) >= batch_size:
            # Process batch
            yield batch
            batch = []
    # Process remaining
    if batch:
        yield batch

# Usage
for batch in process_in_batches('huge_library.sdf'):
    # Process each batch
    pass
```
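The batching idea is independent of where the molecules come from. The sketch below is a generic version built on `itertools.islice` that accepts any iterator of molecules, for example an `SDMolSupplier` or a generator over SMILES. The name `iter_batches` and the small in-memory demo are assumptions for illustration.

```python
from itertools import islice

from rdkit import Chem


def iter_batches(mol_iter, batch_size):
    """Yield lists of up to batch_size valid molecules from any molecule iterator."""
    valid = (mol for mol in mol_iter if mol is not None)  # lazily drop parse failures
    while True:
        batch = list(islice(valid, batch_size))
        if not batch:
            return
        yield batch


# Demo with a lazy generator; only one batch is held in memory at a time
smiles_stream = ["CCO", "CCCO", "INVALID", "c1ccccc1", "CC(=O)O", "CCN"]
mol_stream = (Chem.MolFromSmiles(s) for s in smiles_stream)
batch_sizes = [len(batch) for batch in iter_batches(mol_stream, batch_size=2)]
print(batch_sizes)
```

For SDF input, passing `Chem.SDMolSupplier(path)` or `Chem.ForwardSDMolSupplier(file_object)` as `mol_iter` should work the same way, since both can be iterated lazily.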
RDKit is the cornerstone of computational drug discovery and cheminformatics. Its comprehensive toolkit for molecular manipulation, descriptor calculation, and similarity searching makes it indispensable for pharmaceutical research, virtual screening, and chemical data analysis. Master RDKit, and you'll have the power to computationally explore vast chemical spaces and accelerate drug discovery.