rdkit

RDKit: Python Cheminformatics Library

Summary

RDKit (v2023+) provides comprehensive Python APIs for molecular structure manipulation, property calculation, and chemical informatics. It requires Python 3 and NumPy, offering modular components for molecule parsing, descriptors, fingerprints, substructure search, conformer generation, and reaction processing.

Applicable Scenarios

This skill applies when you need to:

| Task Category | Examples |
| --- | --- |
| Molecule I/O | Parse SMILES, MOL, SDF, InChI; write structures |
| Property Calculation | Molecular weight, LogP, TPSA, H-bond donors/acceptors |
| Fingerprinting | Morgan (ECFP), MACCS keys, atom pairs, topological |
| Similarity Analysis | Tanimoto, Dice, clustering compounds |
| Substructure Search | SMARTS patterns, functional group detection |
| 3D Conformers | Generate, optimize, align molecular geometries |
| Chemical Reactions | Define and execute transformations |
| Drug-Likeness | Lipinski rules, QED, lead-likeness filters |
| Visualization | 2D depictions, highlighting, grid images |

Module Organization

| Module | Purpose | Reference |
| --- | --- | --- |
| rdkit.Chem | Core molecule parsing, serialization, substructure | `references/api-reference.md` |
| rdkit.Chem.Descriptors | Property calculations | `references/descriptors-reference.md` |
| rdkit.Chem.rdFingerprintGenerator | Modern fingerprint API | `references/api-reference.md` |
| rdkit.DataStructs | Similarity metrics, bulk operations | `references/api-reference.md` |
| rdkit.Chem.AllChem | 3D coordinates, reactions, optimization | `references/api-reference.md` |
| rdkit.Chem.Draw | Visualization and depiction | `references/api-reference.md` |
| SMARTS patterns | Substructure query language | `references/smarts-patterns.md` |

Setup

Install via pip or conda:

```bash
# Conda (recommended)
conda install -c conda-forge rdkit

# Pip (the package is published as "rdkit"; "rdkit-pypi" is the legacy name)
pip install rdkit
```

Standard imports:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, Draw
from rdkit import DataStructs
```

Quick Reference

Parse and Validate Molecules

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc(O)cc1')
if mol is None:
    print("Invalid SMILES")
```
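As a hedged illustration of validating and round-tripping a structure, the sketch below parses the same phenol SMILES, writes it back as canonical SMILES and as a MOL block, and shows that a malformed string yields `None`. The variable names and the deliberately broken SMILES are illustrative assumptions, not part of the original reference.

```python
from rdkit import Chem

# Parse a valid SMILES and write it back in canonical form.
mol = Chem.MolFromSmiles('c1ccc(O)cc1')
canonical = Chem.MolToSmiles(mol)

# Round-trip through a MOL block; the canonical SMILES should be unchanged.
mol_block = Chem.MolToMolBlock(mol)
reparsed = Chem.MolFromMolBlock(mol_block)
roundtrip = Chem.MolToSmiles(reparsed)

# An unclosed ring is a syntax error, so the parser returns None
# (RDKit also logs a message to stderr).
broken = Chem.MolFromSmiles('c1ccc(O)cc')

print(canonical, roundtrip, broken)
```

Comparing canonical SMILES is a convenient way to check that two inputs describe the same structure.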

Compute Properties

```python
from rdkit.Chem import Descriptors

mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
tpsa = Descriptors.TPSA(mol)
```

Generate Fingerprints

```python
from rdkit.Chem import rdFingerprintGenerator

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = gen.GetFingerprint(mol)
```
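A minimal sketch of inspecting the resulting fingerprint, assuming the same generator settings as above. It reads the bit-vector length and the number of set bits, and also requests the fingerprint as a NumPy array, which is often handier for machine-learning pipelines. The phenol input and variable names are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

mol = Chem.MolFromSmiles('c1ccc(O)cc1')
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

# Bit-vector form: suitable for DataStructs similarity functions.
fp = gen.GetFingerprint(mol)
n_bits = fp.GetNumBits()      # total length, set by fpSize
n_on = fp.GetNumOnBits()      # number of bits that are set

# NumPy form: one array element per bit position.
arr = gen.GetFingerprintAsNumPy(mol)

print(n_bits, n_on, arr.shape)
```

Fingerprints are only comparable when they come from generators with the same `radius` and `fpSize`.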

Similarity Search

```python
from rdkit import DataStructs

# fp1 and fp2 are fingerprints produced by the same generator settings
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
```
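Because the snippet above leaves `fp1` and `fp2` undefined, here is a self-contained sketch under the assumption that both fingerprints come from one Morgan generator. The two molecules (phenol and toluene) are arbitrary illustrative choices.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

phenol = Chem.MolFromSmiles('c1ccc(O)cc1')
toluene = Chem.MolFromSmiles('Cc1ccccc1')

fp1 = gen.GetFingerprint(phenol)
fp2 = gen.GetFingerprint(toluene)

# Tanimoto: shared on-bits divided by the union of on-bits.
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
# A fingerprint compared with itself always scores 1.0.
self_similarity = DataStructs.TanimotoSimilarity(fp1, fp1)
# Dice weights the shared bits twice, so it is never below Tanimoto.
dice = DataStructs.DiceSimilarity(fp1, fp2)

print(f"Tanimoto={similarity:.3f}  Dice={dice:.3f}")
```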

Substructure Match

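```python
from rdkit import Chem

# Hydroxyl on an aliphatic carbon (alcohols; also matches the OH of carboxylic acids)
pattern = Chem.MolFromSmarts('[OH1][C]')
has_alcohol = mol.HasSubstructMatch(pattern)
```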
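The sketch below shows what this pattern does and does not match, and how to retrieve the matching atom indices. The stricter pattern `[CX4][OX2H]` is an assumption added here for comparison; it is a common way to restrict the match to hydroxyls on sp3 carbons and is not taken from the original reference.

```python
from rdkit import Chem

loose = Chem.MolFromSmarts('[OH1][C]')       # pattern used above
strict = Chem.MolFromSmarts('[CX4][OX2H]')   # assumed stricter alternative

ethanol = Chem.MolFromSmiles('CCO')
phenol = Chem.MolFromSmiles('c1ccc(O)cc1')
acetic_acid = Chem.MolFromSmiles('CC(=O)O')

# GetSubstructMatches returns one tuple of atom indices per match,
# ordered like the atoms in the SMARTS (here: O first, then C).
ethanol_matches = ethanol.GetSubstructMatches(loose)

# Aromatic carbons are written 'c', so '[C]' does not match phenol.
phenol_hit = phenol.HasSubstructMatch(loose)

# The loose pattern also fires on the acid OH; the strict one does not.
acid_loose = acetic_acid.HasSubstructMatch(loose)
acid_strict = acetic_acid.HasSubstructMatch(strict)

print(ethanol_matches, phenol_hit, acid_loose, acid_strict)
```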

Generate 3D Conformer

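```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(mol)                      # explicit hydrogens give better geometries
AllChem.EmbedMolecule(mol, randomSeed=42)  # fixed seed for reproducible coordinates
AllChem.MMFFOptimizeMolecule(mol)          # force-field clean-up of the embedded conformer
```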
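Both calls report success through their return values, which the snippet above ignores. A hedged sketch of checking them, using phenol as an arbitrary example:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('c1ccc(O)cc1'))

# EmbedMolecule returns the new conformer ID, or -1 if embedding failed.
conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)
if conf_id == -1:
    raise RuntimeError("Embedding failed; no 3D coordinates were generated")

# MMFFOptimizeMolecule returns 0 when converged, 1 when more iterations
# are needed, and -1 when the force field could not be set up.
status = AllChem.MMFFOptimizeMolecule(mol)
if status == 1:
    status = AllChem.MMFFOptimizeMolecule(mol, maxIters=2000)

is_3d = mol.GetConformer(conf_id).Is3D()
print(conf_id, status, is_3d)
```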

Implementation Patterns

Drug-Likeness Assessment

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def assess_druglikeness(smiles: str) -> dict | None:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None

    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)

    return {
        'MW': mw,
        'LogP': logp,
        'HBD': hbd,
        'HBA': hba,
        'TPSA': Descriptors.TPSA(mol),
        'RotBonds': Descriptors.NumRotatableBonds(mol),
        'Lipinski': mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10,
        'QED': Descriptors.qed(mol)
    }
```
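The function above treats the Lipinski rules as all-or-nothing. As a variation, the sketch below counts violations so that callers can choose their own tolerance. The helper name `lipinski_violations` and the aspirin example are hypothetical, and the sketch calls `QED.qed` from `rdkit.Chem.QED` directly rather than through `Descriptors`.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def lipinski_violations(mol) -> int:
    """Count how many of the four Lipinski thresholds a molecule exceeds."""
    checks = [
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Descriptors.NumHDonors(mol) > 5,
        Descriptors.NumHAcceptors(mol) > 10,
    ]
    return sum(checks)

aspirin = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
violations = lipinski_violations(aspirin)
qed_score = QED.qed(aspirin)   # drug-likeness score between 0 and 1

print(violations, round(qed_score, 2))
```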

Batch Similarity Search

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def find_similar(query_smiles: str, database: list[str], threshold: float = 0.7) -> list:
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        return []

    gen = rdFingerprintGenerator.GetMorganGenerator(radius=2)
    query_fp = gen.GetFingerprint(query)

    hits = []
    for idx, smi in enumerate(database):
        mol = Chem.MolFromSmiles(smi)
        if mol:
            fp = gen.GetFingerprint(mol)
            sim = DataStructs.TanimotoSimilarity(query_fp, fp)
            if sim >= threshold:
                hits.append((idx, smi, sim))

    return sorted(hits, key=lambda x: x[2], reverse=True)
```

Functional Group Screening

```python
from rdkit import Chem

# Deliberately simple patterns; some overlap (see comments).
FUNCTIONAL_GROUPS = {
    'alcohol': '[OH1][C]',           # also matches the OH of carboxylic acids
    'amine': '[NH2,NH1][C]',         # also matches primary/secondary amides
    'carboxylic_acid': 'C(=O)[OH1]',
    'amide': 'C(=O)N',
    'ester': 'C(=O)O[C]',
    'nitro': '[N+](=O)[O-]'
}

def detect_functional_groups(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []

    found = []
    for name, smarts in FUNCTIONAL_GROUPS.items():
        pattern = Chem.MolFromSmarts(smarts)
        if mol.HasSubstructMatch(pattern):
            found.append(name)
    return found
```

Conformer Generation with Clustering

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def generate_diverse_conformers(smiles: str, n_confs: int = 50, rmsd_thresh: float = 0.5) -> list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []

    mol = Chem.AddHs(mol)
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, randomSeed=42)

    # Optimize all conformers
    for cid in conf_ids:
        AllChem.MMFFOptimizeMolecule(mol, confId=cid)

    # Cluster by RMSD to get diverse set
    if len(conf_ids) < 2:
        return list(conf_ids)

    dists = []
    for i in range(len(conf_ids)):
        for j in range(i):
            rmsd = AllChem.GetConformerRMS(mol, conf_ids[j], conf_ids[i])
            dists.append(rmsd)

    clusters = Butina.ClusterData(dists, len(conf_ids), rmsd_thresh, isDistData=True)
    return [conf_ids[c[0]] for c in clusters]  # Cluster centroids
```
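A more compact variant is sketched below, assuming `AllChem.GetConformerRMSMatrix` is an acceptable replacement for the explicit double loop. To my understanding it first aligns all conformers to the first one rather than aligning each pair, so the distances can differ somewhat from the pairwise values computed above. The molecule (1-butanol) and the conformer count are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

mol = Chem.AddHs(Chem.MolFromSmiles('CCCCO'))
AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)
AllChem.MMFFOptimizeMoleculeConfs(mol)   # optimize every conformer in one call

n = mol.GetNumConformers()

# Lower-triangle RMSD list in the order Butina.ClusterData expects.
dists = AllChem.GetConformerRMSMatrix(mol, prealigned=False)

clusters = Butina.ClusterData(dists, n, 0.5, isDistData=True, reordering=True)

# The first index in each cluster is its centroid. Indices refer to
# positions in the conformer list, which equal the conformer IDs here
# because EmbedMultipleConfs numbers them from 0.
centroid_ids = [c[0] for c in clusters]
print(len(clusters), centroid_ids)
```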

Batch Processing SDF Files

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def process_sdf(input_path: str, output_path: str, min_mw: float = 200, max_mw: float = 500):
    """Filter compounds by molecular weight and add property columns."""
    supplier = Chem.SDMolSupplier(input_path)
    writer = Chem.SDWriter(output_path)

    for mol in supplier:
        if mol is None:
            continue

        mw = Descriptors.MolWt(mol)
        if not (min_mw <= mw <= max_mw):
            continue

        # Add computed properties
        mol.SetProp('MW', f'{mw:.2f}')
        mol.SetProp('LogP', f'{Descriptors.MolLogP(mol):.2f}')
        mol.SetProp('TPSA', f'{Descriptors.TPSA(mol):.2f}')

        writer.write(mol)

    writer.close()
```

Guidelines

Always validate parsed molecules:

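```python
for smiles in smiles_list:  # smiles_list: your own iterable of SMILES strings
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"Parse failed: {smiles}")
        continue
    # ... work with mol here
```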

Use bulk operations for performance:

```python
fps = [gen.GetFingerprint(m) for m in mols]
sims = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
```
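A self-contained sketch of the bulk call, assuming a small illustrative list of SMILES. It compares the first fingerprint against all the others in one call and confirms that the result agrees with the pairwise function.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Illustrative inputs; the last entry repeats the query molecule.
smiles_list = ['c1ccc(O)cc1', 'Cc1ccccc1', 'CCO', 'c1ccc(O)cc1']
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(m) for m in mols]

# One call returns a list with one similarity per target fingerprint.
sims = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])

# Same value as the one-at-a-time function, without the Python-level loop.
pairwise = DataStructs.TanimotoSimilarity(fps[0], fps[1])

print([round(s, 3) for s in sims])
```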

Add hydrogens for 3D work:

```python
mol = Chem.AddHs(mol)  # Required before EmbedMolecule
AllChem.EmbedMolecule(mol)
```

Stream large files:

```python
# Memory-efficient: process one molecule at a time
# (process() stands in for your own per-molecule function)
with open('huge.sdf', 'rb') as file_handle:
    for mol in Chem.ForwardSDMolSupplier(file_handle):
        if mol:
            process(mol)

# Avoid: loading the entire file into memory
all_mols = list(Chem.SDMolSupplier('huge.sdf'))
```
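The streaming guideline can be tried end to end with a small, hypothetical SDF written to a temporary directory. The file name, the three molecules, and the counting loop are assumptions made for the demonstration; a real workload would replace the counter with its own processing.

```python
import os
import tempfile

from rdkit import Chem

# Write a tiny demonstration SDF (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), 'demo.sdf')
writer = Chem.SDWriter(path)
for smi in ['CCO', 'c1ccccc1', 'CC(=O)O']:
    writer.write(Chem.MolFromSmiles(smi))
writer.close()

# Stream it back: ForwardSDMolSupplier reads from a binary file object
# and yields one molecule at a time, so memory use stays flat.
count = 0
with open(path, 'rb') as file_handle:
    for mol in Chem.ForwardSDMolSupplier(file_handle):
        if mol is None:      # skip records that fail to parse
            continue
        count += 1

print(count)
```

Because the supplier only needs a file-like object, the same loop should also work with a handle from `gzip.open` for compressed files.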

**Thread safety:** Most operations are thread-safe except for concurrent access to MolSupplier objects.

Troubleshooting

| Issue | Resolution |
| --- | --- |
| `MolFromSmiles` returns `None` | Invalid SMILES syntax or a structure that fails sanitization; check the input |
| Sanitization error | Use `Chem.DetectChemistryProblems(mol)` to diagnose |
| Wrong 3D geometry | Call `AddHs(mol)` before embedding |
| Fingerprint size mismatch | Use the same `fpSize` parameter for all comparisons |
| SMARTS not matching | Check aromatic vs aliphatic atoms (`c` vs `C`) |
| Slow SDF processing | Use `ForwardSDMolSupplier` or `MultithreadedSDMolSupplier` |
| Memory issues with large files | Stream with `ForwardSDMolSupplier`; don't load everything at once |
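For the sanitization row, a minimal sketch of the diagnosis step. It parses with `sanitize=False` so that a molecule object exists even when the chemistry is invalid, then asks RDKit what is wrong. The example SMILES (a neutral nitrogen with four bonds) is an illustrative assumption.

```python
from rdkit import Chem

bad_smiles = 'CN(C)(C)C'   # neutral nitrogen with four bonds

# Normal parsing fails sanitization and returns None.
mol = Chem.MolFromSmiles(bad_smiles)

# Skipping sanitization yields a molecule that can be inspected.
raw = Chem.MolFromSmiles(bad_smiles, sanitize=False)
problems = Chem.DetectChemistryProblems(raw)

for problem in problems:
    # Each problem reports its type and a human-readable message.
    print(problem.GetType(), '-', problem.Message())
```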

Reference Documentation

Each reference file contains detailed API documentation:

| File | Contents |
| --- | --- |
| `references/api-reference.md` | Complete function/class listings by module |
| `references/descriptors-reference.md` | All molecular descriptors with examples |
| `references/smarts-patterns.md` | Common SMARTS patterns for substructure search |

External Resources

RDKit Documentation: https://www.rdkit.org/docs/
Getting Started Guide: https://www.rdkit.org/docs/GettingStartedInPython.html
GitHub Repository: https://github.com/rdkit/rdkit
Cookbook: https://www.rdkit.org/docs/Cookbook.html
