rdkit

RDKit: Python Cheminformatics Library

Summary

RDKit (v2023+) provides comprehensive Python APIs for molecular structure manipulation, property calculation, and chemical informatics. It requires Python 3 and NumPy, offering modular components for molecule parsing, descriptors, fingerprints, substructure search, conformer generation, and reaction processing.

Applicable Scenarios

This skill applies when you need to:

| Task Category | Examples |
|---|---|
| Molecule I/O | Parse SMILES, MOL, SDF, InChI; write structures |
| Property Calculation | Molecular weight, LogP, TPSA, H-bond donors/acceptors |
| Fingerprinting | Morgan (ECFP), MACCS keys, atom pairs, topological |
| Similarity Analysis | Tanimoto, Dice, clustering compounds |
| Substructure Search | SMARTS patterns, functional group detection |
| 3D Conformers | Generate, optimize, align molecular geometries |
| Chemical Reactions | Define and execute transformations |
| Drug-Likeness | Lipinski rules, QED, lead-likeness filters |
| Visualization | 2D depictions, highlighting, grid images |

Module Organization

| Module | Purpose | Reference |
|---|---|---|
| `rdkit.Chem` | Core molecule parsing, serialization, substructure | `references/api-reference.md` |
| `rdkit.Chem.Descriptors` | Property calculations | `references/descriptors-reference.md` |
| `rdkit.Chem.rdFingerprintGenerator` | Modern fingerprint API | `references/api-reference.md` |
| `rdkit.DataStructs` | Similarity metrics, bulk operations | `references/api-reference.md` |
| `rdkit.Chem.AllChem` | 3D coordinates, reactions, optimization | `references/api-reference.md` |
| `rdkit.Chem.Draw` | Visualization and depiction | `references/api-reference.md` |
| SMARTS patterns | Substructure query language | `references/smarts-patterns.md` |

Setup

Install via conda or pip:

```bash
# Conda (recommended)
conda install -c conda-forge rdkit

# Pip (the package is published on PyPI as `rdkit`; `rdkit-pypi` is the legacy name)
pip install rdkit
```

Standard imports:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw
from rdkit import DataStructs
```

Quick Reference

Parse and Validate Molecules

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc(O)cc1')
if mol is None:
    print("Invalid SMILES")
```
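When many inputs arrive at once, the same `None` check can separate usable molecules from rejects. The following is a minimal sketch; `parse_all` is a hypothetical helper name and the input strings are illustrative. It also shows `Chem.MolToSmiles`, which writes a canonical SMILES for each parsed molecule.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parser messages for the deliberately broken input below.
RDLogger.DisableLog('rdApp.*')

def parse_all(smiles_list):
    """Return (canonical SMILES of valid inputs, list of rejected strings)."""
    valid, rejected = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            rejected.append(smi)                 # parsing or sanitization failed
        else:
            valid.append(Chem.MolToSmiles(mol))  # canonical form
    return valid, rejected

# Phenol, ethanol, and a ring that is never closed.
valid, rejected = parse_all(['c1ccc(O)cc1', 'OCC', 'c1ccccc'])
print(valid, rejected)
```

Canonicalizing on the way in makes later deduplication a plain string comparison.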

Compute Properties

```python
from rdkit.Chem import Descriptors

mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
tpsa = Descriptors.TPSA(mol)
```

Generate Fingerprints

```python
from rdkit.Chem import rdFingerprintGenerator

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = gen.GetFingerprint(mol)
```
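The task table also lists MACCS keys. As a sketch of how the two fingerprint types differ in size, the snippet below builds both for phenol (an illustrative molecule) and inspects the resulting bit vectors.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator

mol = Chem.MolFromSmiles('c1ccc(O)cc1')  # phenol, chosen only for illustration

# Hashed Morgan fingerprint: the length is whatever fpSize you request.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
morgan_fp = morgan_gen.GetFingerprint(mol)

# MACCS keys: a fixed dictionary of structural keys, so the length is fixed.
maccs_fp = MACCSkeys.GenMACCSKeys(mol)

print(morgan_fp.GetNumBits(), morgan_fp.GetNumOnBits())
print(maccs_fp.GetNumBits(), maccs_fp.GetNumOnBits())
```

Because the lengths differ, similarity values are only meaningful between fingerprints of the same type and size.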

Similarity Search

```python
from rdkit import DataStructs

# fp1 and fp2 are fingerprints produced with the same generator settings
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
```
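A self-contained version of the comparison, assuming Morgan fingerprints and two illustrative molecules (phenol and p-cresol). It computes the Dice coefficient alongside Tanimoto for the same pair.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

fp1 = gen.GetFingerprint(Chem.MolFromSmiles('c1ccc(O)cc1'))   # phenol
fp2 = gen.GetFingerprint(Chem.MolFromSmiles('Cc1ccc(O)cc1'))  # p-cresol

tanimoto = DataStructs.TanimotoSimilarity(fp1, fp2)
dice = DataStructs.DiceSimilarity(fp1, fp2)
self_similarity = DataStructs.TanimotoSimilarity(fp1, fp1)

print(f"Tanimoto={tanimoto:.3f}  Dice={dice:.3f}  self={self_similarity:.1f}")
```

For bit vectors Dice is never smaller than Tanimoto, so thresholds are not interchangeable between the two metrics.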

Substructure Match

```python
pattern = Chem.MolFromSmarts('[OH1][C]')  # Alcohol
has_alcohol = mol.HasSubstructMatch(pattern)
```
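`HasSubstructMatch` only answers yes or no. When the matching atoms themselves are needed, `GetSubstructMatches` returns their indices. The sketch below also contrasts the pattern above with a stricter one, `[OX2H][CX4]`, which is an assumption of this example and not taken from the reference files: it requires an sp3 carbon, so carboxylic acid OH groups no longer count as alcohols.

```python
from rdkit import Chem

loose = Chem.MolFromSmarts('[OH1][C]')      # pattern used above
strict = Chem.MolFromSmarts('[OX2H][CX4]')  # assumed stricter alcohol pattern

glycerol = Chem.MolFromSmiles('OCC(O)CO')
acetic_acid = Chem.MolFromSmiles('CC(=O)O')
phenol = Chem.MolFromSmiles('c1ccc(O)cc1')

# One tuple of atom indices per match, ordered like the atoms in the SMARTS.
matches = glycerol.GetSubstructMatches(strict)
print(matches)

acid_loose = acetic_acid.HasSubstructMatch(loose)    # the acid OH matches
acid_strict = acetic_acid.HasSubstructMatch(strict)  # carbonyl carbon is not CX4
phenol_loose = phenol.HasSubstructMatch(loose)       # 'C' in SMARTS is aliphatic only
```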

Generate 3D Conformer

```python
from rdkit.Chem import AllChem

mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=42)
AllChem.MMFFOptimizeMolecule(mol)
```
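Both calls report their outcome through return values, which the short form above ignores. A sketch that checks them, using phenol as an illustrative input and an arbitrary `maxIters` value:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('c1ccc(O)cc1'))

# EmbedMolecule returns the new conformer ID, or -1 if embedding failed.
conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)
if conf_id == -1:
    raise RuntimeError("3D embedding failed")

# MMFFOptimizeMolecule returns 0 when converged, 1 when more iterations
# are needed, and -1 when the force field could not be set up.
status = AllChem.MMFFOptimizeMolecule(mol, maxIters=500)
if status == -1:
    raise RuntimeError("MMFF parameters unavailable for this molecule")

print(conf_id, status, mol.GetConformer(conf_id).Is3D())
```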

Implementation Patterns

Drug-Likeness Assessment

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def assess_druglikeness(smiles: str) -> dict | None:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None

    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)

    return {
        'MW': mw,
        'LogP': logp,
        'HBD': hbd,
        'HBA': hba,
        'TPSA': Descriptors.TPSA(mol),
        'RotBonds': Descriptors.NumRotatableBonds(mol),
        'Lipinski': mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10,
        'QED': Descriptors.qed(mol)
    }
```
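The `Lipinski` flag above is all-or-nothing, whereas the rule is commonly stated as tolerating at most one violation. A minimal sketch that counts violations instead, so the caller can pick the tolerance; `lipinski_violations` is a hypothetical helper name and aspirin is only an illustrative input.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def lipinski_violations(mol) -> int:
    """Count how many of the four rule-of-five criteria the molecule breaks."""
    checks = [
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Descriptors.NumHDonors(mol) > 5,
        Descriptors.NumHAcceptors(mol) > 10,
    ]
    return sum(checks)

aspirin = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
violations = lipinski_violations(aspirin)
qed_score = QED.qed(aspirin)  # QED is a score between 0 and 1

print(violations, round(qed_score, 2))
```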

Batch Similarity Search

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def find_similar(query_smiles: str, database: list[str], threshold: float = 0.7) -> list:
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        return []

    gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)
    query_fp = gen.GetFingerprint(query)

    hits = []
    for idx, smi in enumerate(database):
        mol = Chem.MolFromSmiles(smi)
        if mol:
            fp = gen.GetFingerprint(mol)
            sim = DataStructs.TanimotoSimilarity(query_fp, fp)
            if sim >= threshold:
                hits.append((idx, smi, sim))

    return sorted(hits, key=lambda x: x[2], reverse=True)
```
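The loop above calls `TanimotoSimilarity` once per database entry. A variant sketch that collects the fingerprints first and scores them in one `DataStructs.BulkTanimotoSimilarity` call; `find_similar_bulk` is a hypothetical name, and the small database and threshold are illustrative.

```python
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog('rdApp.*')  # keep the demo quiet for the invalid entry

def find_similar_bulk(query_smiles, database, threshold=0.7):
    """Return (index, smiles, similarity) tuples, most similar first."""
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        return []

    gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)
    query_fp = gen.GetFingerprint(query)

    # Keep the original index next to each fingerprint so hits stay traceable
    # even though unparsable entries are skipped.
    kept = []
    for idx, smi in enumerate(database):
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            kept.append((idx, smi, gen.GetFingerprint(mol)))
    if not kept:
        return []

    sims = DataStructs.BulkTanimotoSimilarity(query_fp, [fp for _, _, fp in kept])
    hits = [(idx, smi, sim) for (idx, smi, _), sim in zip(kept, sims) if sim >= threshold]
    return sorted(hits, key=lambda h: h[2], reverse=True)

database = ['c1ccc(O)cc1', 'CCO', 'not_a_smiles', 'Cc1ccc(O)cc1']
hits = find_similar_bulk('c1ccc(O)cc1', database, threshold=0.3)
print(hits)
```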

Functional Group Screening

```python
from rdkit import Chem

FUNCTIONAL_GROUPS = {
    'alcohol': '[OH1][C]',
    'amine': '[NH2,NH1][C]',
    'carboxylic_acid': 'C(=O)[OH1]',
    'amide': 'C(=O)N',
    'ester': 'C(=O)O[C]',
    'nitro': '[N+](=O)[O-]'
}

def detect_functional_groups(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []

    found = []
    for name, smarts in FUNCTIONAL_GROUPS.items():
        pattern = Chem.MolFromSmarts(smarts)
        if mol.HasSubstructMatch(pattern):
            found.append(name)
    return found
```
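The same SMARTS vocabulary also drives reaction definitions, which the task table lists under Chemical Reactions. As a hedged sketch, the reaction SMARTS below encodes a simple amide coupling; the template and the two reactants (acetic acid and methylamine) are illustrative assumptions, not taken from the reference files.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Carboxylic acid + amine bearing at least one H -> amide.
# Atom-map numbers (:1, :2, :3) tie reactant atoms to product atoms.
rxn = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])[OH].[N;!H0:3]>>[C:1](=[O:2])[N:3]'
)

acid = Chem.MolFromSmiles('CC(=O)O')
amine = Chem.MolFromSmiles('NC')

# RunReactants returns one tuple of product molecules per way the templates match.
products = set()
for product_set in rxn.RunReactants((acid, amine)):
    for product in product_set:
        Chem.SanitizeMol(product)  # products come back unsanitized
        products.add(Chem.MolToSmiles(product))

print(products)
```

Collecting canonical SMILES in a set removes duplicate products that arise when a template matches the same reactant in more than one equivalent way.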

Conformer Generation with Clustering

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def generate_diverse_conformers(smiles: str, n_confs: int = 50, rmsd_thresh: float = 0.5) -> list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []

    mol = Chem.AddHs(mol)
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)

    # Optimize all conformers
    for cid in conf_ids:
        AllChem.MMFFOptimizeMolecule(mol, confId=cid)

    # Cluster by RMSD to get diverse set
    if len(conf_ids) < 2:
        return list(conf_ids)

    dists = []
    for i in range(len(conf_ids)):
        for j in range(i):
            rmsd = AllChem.GetConformerRMS(mol, conf_ids[j], conf_ids[i])
            dists.append(rmsd)

    clusters = Butina.ClusterData(dists, len(conf_ids), rmsd_thresh, isDistData=True)
    return [conf_ids[c[0]] for c in clusters]  # Cluster centroids
```
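An alternative sketch for the distance step: `AllChem.GetConformerRMSMatrix` returns the lower-triangle list that `Butina.ClusterData` expects, and its `atomIds` argument can restrict the RMSD to heavy atoms. It aligns every conformer to the first one instead of aligning each pair, so the values can differ from the pairwise loop above. The molecule (1-pentanol), conformer count, and threshold are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

mol = Chem.AddHs(Chem.MolFromSmiles('CCCCCO'))  # 1-pentanol
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42))

# Optimize every conformer in one call.
AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)

# Ignore hydrogens so rotating methyl or hydroxyl H atoms does not inflate the RMSD.
heavy_atoms = [a.GetIdx() for a in mol.GetAtoms() if a.GetAtomicNum() > 1]

# Flat lower-triangle distance list: (1,0), (2,0), (2,1), (3,0), ...
dists = AllChem.GetConformerRMSMatrix(mol, atomIds=heavy_atoms, prealigned=False)

clusters = Butina.ClusterData(dists, len(conf_ids), 0.5, isDistData=True, reordering=True)
representatives = [conf_ids[c[0]] for c in clusters]  # first member is the centroid

print(len(conf_ids), len(clusters), representatives)
```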

Batch Processing SDF Files

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def process_sdf(input_path: str, output_path: str, min_mw: float = 200, max_mw: float = 500):
    """Filter compounds by molecular weight and add property columns."""
    supplier = Chem.SDMolSupplier(input_path)
    writer = Chem.SDWriter(output_path)

    for mol in supplier:
        if mol is None:
            continue

        mw = Descriptors.MolWt(mol)
        if not (min_mw <= mw <= max_mw):
            continue

        # Add computed properties
        mol.SetProp('MW', f'{mw:.2f}')
        mol.SetProp('LogP', f'{Descriptors.MolLogP(mol):.2f}')
        mol.SetProp('TPSA', f'{Descriptors.TPSA(mol):.2f}')

        writer.write(mol)

    writer.close()
```

Guidelines

Always validate parsed molecules:

```python
for smiles in smiles_list:  # smiles_list: any iterable of SMILES strings
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Parse failed: {smiles}")
        continue
    # ... work with mol here
```

Use bulk operations for performance:

```python
fps = [gen.GetFingerprint(m) for m in mols]
sims = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
```

Add hydrogens for 3D work:

```python
mol = Chem.AddHs(mol)  # Add explicit hydrogens before EmbedMolecule
AllChem.EmbedMolecule(mol)
```

Stream large files:

```python
# Memory-efficient: process one at a time
# file_handle: a file object opened in binary mode
for mol in Chem.ForwardSDMolSupplier(file_handle):
    if mol:
        process(mol)

# Avoid: loading entire file
all_mols = list(Chem.SDMolSupplier('huge.sdf'))
```

**Thread safety:** Most operations are thread-safe except for concurrent access to MolSupplier objects.
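Because `ForwardSDMolSupplier` reads from a file object, the streaming guideline extends to gzip-compressed SDF files. The sketch below writes a tiny demo file first so it can run on its own; `count_heavy_atoms_streaming` and the temporary file names are hypothetical.

```python
import gzip
import os
import tempfile

from rdkit import Chem

def count_heavy_atoms_streaming(path: str) -> int:
    """Sum heavy atoms over an SDF (optionally .gz) without loading it whole."""
    opener = gzip.open if path.endswith('.gz') else open
    total = 0
    with opener(path, 'rb') as handle:
        for mol in Chem.ForwardSDMolSupplier(handle):
            if mol is None:  # skip records that fail to parse
                continue
            total += mol.GetNumHeavyAtoms()
    return total

# Build a two-molecule demo file (ethanol and benzene), then a gzipped copy.
tmp_dir = tempfile.mkdtemp()
sdf_path = os.path.join(tmp_dir, 'demo.sdf')
writer = Chem.SDWriter(sdf_path)
for smi in ('CCO', 'c1ccccc1'):
    writer.write(Chem.MolFromSmiles(smi))
writer.close()

gz_path = sdf_path + '.gz'
with open(sdf_path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
    dst.write(src.read())

plain_total = count_heavy_atoms_streaming(sdf_path)
gz_total = count_heavy_atoms_streaming(gz_path)
print(plain_total, gz_total)
```

A forward supplier can be iterated only once and does not support indexing or `len()`, which is the price of the constant memory footprint.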

Troubleshooting

| Issue | Resolution |
|---|---|
| `MolFromSmiles` returns `None` | Invalid SMILES syntax; check input |
| Sanitization error | Use `Chem.DetectChemistryProblems(mol)` to diagnose |
| Wrong 3D geometry | Call `AddHs(mol)` before embedding |
| Fingerprint size mismatch | Use same `fpSize` parameter for all comparisons |
| SMARTS not matching | Check aromatic vs aliphatic atoms (`c` vs `C`) |
| Slow SDF processing | Use `ForwardSDMolSupplier` or `MultithreadedSDMolSupplier` |
| Memory issues with large files | Stream with `ForwardSDMolSupplier`, don't load all |
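To make the sanitization row concrete: parsing with `sanitize=False` yields a molecule object even for chemically invalid input, and `Chem.DetectChemistryProblems` then reports what is wrong. A sketch using a neutral nitrogen with four bonds as the illustrative bad input:

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog('rdApp.*')  # suppress the error log for the failing parse

smiles = 'CN(C)(C)C'  # neutral nitrogen with four bonds: valid syntax, invalid valence

default_parse = Chem.MolFromSmiles(smiles)             # sanitization fails -> None
raw_mol = Chem.MolFromSmiles(smiles, sanitize=False)   # parsed, but unchecked

problems = Chem.DetectChemistryProblems(raw_mol)
for problem in problems:
    print(problem.GetType(), problem.GetAtomIdx(), problem.Message())
```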

Reference Documentation

Each reference file contains detailed API documentation:

| File | Contents |
|---|---|
| `references/api-reference.md` | Complete function/class listings by module |
| `references/descriptors-reference.md` | All molecular descriptors with examples |
| `references/smarts-patterns.md` | Common SMARTS patterns for substructure search |

External Resources
