tooluniverse-binder-discovery

Small Molecule Binder Discovery Strategy

Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.
KEY PRINCIPLES:
  1. Report-first approach - Create report file FIRST, then populate progressively
  2. Target validation FIRST - Confirm druggability before compound searching
  3. Multi-strategy approach - Combine structure-based and ligand-based methods
  4. ADMET-aware filtering - Eliminate poor compounds early
  5. Evidence grading - Grade candidates by supporting evidence
  6. Actionable output - Provide prioritized candidates with rationale
  7. English-first queries - Always use English terms in tool calls, even if the user writes in another language. Only try original-language terms as a fallback. Respond in the user's language

Critical Workflow Requirements

1. Report-First Approach (MANDATORY)

DO NOT show search process or tool outputs to the user. Instead:
  1. Create the report file FIRST - Before any data collection:
    • File name: `[TARGET]_binder_discovery_report.md`
    • Initialize with all section headers from the template
    • Add placeholder text `[Researching...]` in each section
  2. Progressively update the report - As you gather data:
    • Update each section with findings immediately
    • The user sees the report growing, not the search process
  3. Output separate data files:
    • `[TARGET]_candidate_compounds.csv` - Prioritized compounds with SMILES, scores
    • `[TARGET]_bibliography.json` - Literature references (optional)
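
The report-first steps can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `SECTIONS` list, the `##` header style, and the helper names `init_report` and `update_section` are assumptions, since the report template itself is not shown here.

```python
from pathlib import Path

# Hypothetical section list; replace with the headers of the real report template.
SECTIONS = [
    "1. Target Validation",
    "2. Known Ligand Mining",
    "3. Structure Analysis",
    "4. Compound Expansion",
    "5. ADMET Filtering",
    "6. Candidate Prioritization",
]
PLACEHOLDER = "[Researching...]"


def init_report(target: str, out_dir: str = ".") -> Path:
    """Create the report file with every section set to the placeholder."""
    path = Path(out_dir) / f"{target}_binder_discovery_report.md"
    lines = [f"# {target} Binder Discovery Report", ""]
    for title in SECTIONS:
        lines += [f"## {title}", "", PLACEHOLDER, ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


def update_section(path: Path, title: str, body: str) -> None:
    """Replace the placeholder under one section header with findings."""
    text = path.read_text(encoding="utf-8")
    old = f"## {title}\n\n{PLACEHOLDER}"
    if old not in text:
        raise ValueError(f"section not found or already filled: {title}")
    new = f"## {title}\n\n{body}"
    path.write_text(text.replace(old, new, 1), encoding="utf-8")
```

Calling `init_report("EGFR")` before any tool call, then `update_section(...)` after each phase, keeps the file the only thing the user watches grow.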

2. Citation Requirements (MANDATORY)

Every piece of information MUST include its source:

3.2 Known Inhibitors

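| Compound | ChEMBL ID | IC50 (nM) | Selectivity | Source |
|----------|-----------|-----------|-------------|--------|
| Imatinib | CHEMBL941 | 38 | ABL-selective | ChEMBL |
| Dasatinib | CHEMBL1421 | 0.5 | Multi-kinase | ChEMBL |

Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL1862)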

---

Workflow Overview

Phase 0: Tool Verification (check parameter names)
Phase 1: Target Validation
    ├─ 1.1 Resolve identifiers (UniProt, Ensembl, ChEMBL target ID)
    ├─ 1.2 Assess druggability/tractability
    │   └─ 1.2.5 Check therapeutic antibodies (Thera-SAbDab) [NEW]
    ├─ 1.3 Identify binding sites
    └─ 1.4 Predict structure (NvidiaNIM_alphafold2/esmfold)
Phase 2: Known Ligand Mining
    ├─ Extract ChEMBL bioactivity data
    ├─ Get GtoPdb interactions
    ├─ Identify chemical probes
    ├─ BindingDB affinity data (NEW - Ki/IC50/Kd)
    ├─ PubChem BioAssay HTS data (NEW - screening hits)
    └─ Analyze SAR from known actives
Phase 3: Structure Analysis
    ├─ Get PDB structures with ligands
    ├─ Check EMDB for cryo-EM structures (NEW - for membrane targets)
    ├─ Analyze binding pocket
    └─ Identify key interactions
Phase 3.5: Docking Validation (NvidiaNIM_diffdock/boltz2) [NEW]
    ├─ Dock reference inhibitor
    └─ Validate binding pocket geometry
Phase 4: Compound Expansion
    ├─ 4.1-4.3 Similarity/substructure search
    └─ 4.4 De novo generation (NvidiaNIM_genmol/molmim) [NEW]
Phase 5: ADMET Filtering
    ├─ Predict physicochemical properties
    ├─ Predict ADMET endpoints
    └─ Flag liabilities
Phase 6: Candidate Docking & Prioritization
    ├─ Dock all candidates (NvidiaNIM_diffdock/boltz2) [UPDATED]
    ├─ Score by docking + ADMET + novelty
    ├─ Assess synthesis feasibility
    └─ Generate final ranked list
Phase 7: Report Synthesis

Phase 0: Tool Verification

CRITICAL: Verify tool parameters before calling unfamiliar tools.
python
# Check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="ChEMBL_get_target_activities")

Known Parameter Corrections

| Tool | WRONG Parameter | CORRECT Parameter |
|------|-----------------|-------------------|
| `OpenTargets_get_target_tractability_by_ensemblID` | `ensembl_id` | `ensemblId` |
| `ChEMBL_get_target_activities` | `chembl_target_id` | `target_chembl_id` |
| `ChEMBL_search_similar_molecules` | `smiles` | `molecule` (accepts SMILES, ChEMBL ID, or name) |
| `alphafold_get_prediction` | `uniprot` | `accession` |
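
One way to apply these corrections mechanically is a small rename map consulted before each call. This is a sketch only: `PARAM_FIXES` and `fix_params` are hypothetical helper names, and the map contains nothing beyond the four corrections listed above.

```python
# Wrong -> correct parameter names, copied from the corrections listed above.
PARAM_FIXES = {
    "OpenTargets_get_target_tractability_by_ensemblID": {"ensembl_id": "ensemblId"},
    "ChEMBL_get_target_activities": {"chembl_target_id": "target_chembl_id"},
    "ChEMBL_search_similar_molecules": {"smiles": "molecule"},
    "alphafold_get_prediction": {"uniprot": "accession"},
}


def fix_params(tool_name: str, kwargs: dict) -> dict:
    """Rename known-wrong parameter names; leave everything else untouched."""
    fixes = PARAM_FIXES.get(tool_name, {})
    return {fixes.get(key, key): value for key, value in kwargs.items()}
```

For example, `fix_params("ChEMBL_get_target_activities", {"chembl_target_id": "CHEMBL203"})` returns the arguments keyed by `target_chembl_id`. Tools not in the map pass through unchanged, so checking `get_tool_info` is still needed for anything unfamiliar.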

Phase 1: Target Validation

1.1 Identifier Resolution Chain

1. UniProt_search(query=target_name, organism="human")
   └─ Extract: UniProt accession, gene name, protein name

2. MyGene_query_genes(q=gene_symbol, species="human")
   └─ Extract: Ensembl gene ID, NCBI gene ID

3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens")
   └─ Extract: ChEMBL target ID, target type

4. GtoPdb_get_targets(query=target_name)
   └─ Extract: GtoPdb target ID (if GPCR/ion channel/enzyme)
Store all IDs for downstream queries:
ids = {
    'uniprot': 'P00533',
    'ensembl': 'ENSG00000146648',
    'chembl_target': 'CHEMBL203',
    'gene_symbol': 'EGFR',
    'gtopdb': '1797'  # if available
}
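
Before the stored IDs feed downstream queries, a quick format check can catch a gene symbol pasted where an accession belongs. The sketch below is an assumption about how one might do this, not part of the workflow itself: `validate_ids` is a hypothetical helper, the UniProt pattern follows the published accession format, and the Ensembl pattern covers unversioned human gene IDs only.

```python
import re

# Expected formats for the identifiers that the downstream tools rely on.
ID_PATTERNS = {
    "uniprot": re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    ),
    "ensembl": re.compile(r"^ENSG\d{11}$"),       # human gene IDs, no version suffix
    "chembl_target": re.compile(r"^CHEMBL\d+$"),
}


def validate_ids(ids: dict) -> list:
    """Return the keys whose values are missing or malformed."""
    problems = []
    for key, pattern in ID_PATTERNS.items():
        value = ids.get(key)
        if not value or not pattern.match(value):
            problems.append(key)
    return problems
```

An empty list means the three core identifiers look well-formed; it does not confirm that they refer to the same protein.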

1.2 Druggability Assessment

Multi-Source Triangulation:
1. OpenTargets_get_target_tractability_by_ensemblID(ensemblId)
   └─ Extract: Small molecule tractability score, bucket
   
2. DGIdb_get_gene_druggability(genes=[gene_symbol])
   └─ Extract: Druggability categories, known drug count
   
3. OpenTargets_get_target_classes_by_ensemblID(ensemblId)
   └─ Extract: Target class (kinase, GPCR, etc.)

4. GPCRdb_get_protein(protein=entry_name)  # NEW - for GPCRs
   └─ Extract: GPCR family, receptor state, ligand binding data

1.2a GPCRdb Integration (NEW - for GPCR Targets)

~35% of all approved drugs target GPCRs. For GPCR targets, use specialized data:
python
def check_if_gpcr_and_enrich(tu, target_name, uniprot_id):
    """Check if target is GPCR and get specialized data."""
    
    # Build the GPCRdb entry name (UniProt-style, e.g. "adrb2_human").
    # This assumes target_name is the gene symbol (ADRB2), not a free-text name.
    entry_name = f"{target_name.lower()}_human"
    
    # Check if it's a GPCR
    gpcr_info = tu.tools.GPCRdb_get_protein(
        operation="get_protein",
        protein=entry_name
    )
    
    if gpcr_info.get('status') == 'success':
        # It's a GPCR - get specialized data
        
        # Get known structures (active/inactive states)
        structures = tu.tools.GPCRdb_get_structures(
            operation="get_structures",
            protein=entry_name
        )
        
        # Get known ligands
        ligands = tu.tools.GPCRdb_get_ligands(
            operation="get_ligands",
            protein=entry_name
        )
        
        # Get mutation data (important for SAR)
        mutations = tu.tools.GPCRdb_get_mutations(
            operation="get_mutations",
            protein=entry_name
        )
        
        return {
            'is_gpcr': True,
            'gpcr_family': gpcr_info['data'].get('family'),
            'gpcr_class': gpcr_info['data'].get('receptor_class'),
            'structures': structures['data'].get('structures', []),
            'ligands': ligands['data'].get('ligands', []),
            'mutation_data': mutations['data'].get('mutations', [])
        }
    
    return {'is_gpcr': False}
GPCRdb Advantages:
  • GPCR-specific sequence alignments (Ballesteros-Weinstein numbering)
  • Active vs. inactive state structures
  • Curated ligand binding data
  • Experimental mutation effects on ligand binding
Druggability Scorecard:

| Factor | Assessment | Score |
|--------|------------|-------|
| Known small molecule drugs | Yes (3+) | ★★★ |
| Tractability bucket | 1-3 | ★★☆-★★★ |
| Target class | Enzyme/GPCR/Ion channel | ★★★ |
| Binding site known | Yes (X-ray) | ★★★ |
| GPCRdb ligands available | Yes (10+) | ★★★ (GPCR only) |
| Therapeutic antibodies exist | Check Thera-SAbDab | See 1.2.5 |

Decision Point: If druggability score < ★★☆, warn user about challenges.
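
The decision point needs a single score, but the scorecard does not say how the per-factor stars combine. The sketch below assumes a plain mean of filled stars with a threshold of 2.0 (★★☆); both the aggregation rule and the helper names are assumptions for illustration.

```python
def star_score(stars: str) -> int:
    """Count filled stars in a rating such as '★★☆'."""
    return stars.count("★")


def assess_druggability(ratings: dict, threshold: float = 2.0) -> dict:
    """Average the per-factor star ratings and flag targets below the threshold.

    The mean-of-stars rule is an assumed aggregation, not one defined by the scorecard.
    """
    if not ratings:
        raise ValueError("at least one factor rating is required")
    scores = [star_score(value) for value in ratings.values()]
    mean = sum(scores) / len(scores)
    return {"mean_stars": round(mean, 2), "warn_user": mean < threshold}
```

A weighted scheme (for example, weighting "Binding site known" more heavily) may suit some projects better; the point is only that the warning rule should be explicit and reproducible.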

1.2.5 Therapeutic Antibody Landscape (NEW)

Check if therapeutic antibodies already target this protein - important for:
  • Understanding competitive landscape
  • Validating target tractability (if antibodies work, target is validated)
  • Identifying potential combination approaches
python
def check_therapeutic_antibodies(tu, target_name):
    """
    Check Thera-SAbDab for therapeutic antibodies against target.
    """
    # Search by target name
    results = tu.tools.TheraSAbDab_search_by_target(
        target=target_name
    )
    
    if results.get('status') == 'success':
        antibodies = results['data'].get('therapeutics', [])
        
        # Categorize by clinical stage
        by_phase = {'Approved': [], 'Phase 3': [], 'Phase 2': [], 'Phase 1': [], 'Preclinical': []}
        for ab in antibodies:
            phase = ab.get('phase', 'Unknown')
            for key in by_phase.keys():
                if key.lower() in phase.lower():
                    by_phase[key].append(ab)
                    break
        
        return {
            'total_antibodies': len(antibodies),
            'by_phase': by_phase,
            'antibodies': antibodies[:10],  # Top 10
            'competitive_alert': len(by_phase.get('Approved', [])) > 0
        }
    return None

def get_antibody_landscape(tu, target_name, uniprot_id=None):
    """
    Comprehensive antibody competitive landscape.
    """
    # Thera-SAbDab search
    therasabdab = check_therapeutic_antibodies(tu, target_name)
    
    # Also search by synonyms; skip uniprot_id when it is missing or a duplicate
    synonyms = [target_name]
    if uniprot_id and uniprot_id != target_name:
        synonyms.append(uniprot_id)
    
    all_antibodies = []
    for synonym in synonyms:
        results = tu.tools.TheraSAbDab_search_therapeutics(query=synonym)
        if results.get('status') == 'success':
            all_antibodies.extend(results['data'].get('therapeutics', []))
    
    # Deduplicate
    seen = set()
    unique = []
    for ab in all_antibodies:
        inn = ab.get('inn_name')
        if inn and inn not in seen:
            seen.add(inn)
            unique.append(ab)
    
    return {
        'antibodies': unique,
        'count': len(unique),
        'has_approved': any(ab.get('phase', '').lower() == 'approved' for ab in unique),
        'source': 'Thera-SAbDab'
    }
Report Output:

1.2.5 Therapeutic Antibody Landscape (NEW)

Thera-SAbDab Search Results:

| Antibody (INN) | Target | Format | Phase | PDB |
|----------------|--------|--------|-------|-----|
| Pembrolizumab | PD-1 | IgG4 | Approved | 5DK3 |
| Nivolumab | PD-1 | IgG4 | Approved | 5WT9 |
| Cemiplimab | PD-1 | IgG4 | Approved | 7WVM |

Competitive Landscape: ⚠️ 3 approved antibodies target this protein

Strategic Implication: Small molecule approach offers differentiation (oral dosing, CNS penetration, cost)

Source: Thera-SAbDab via `TheraSAbDab_search_by_target`

**Why Include Antibody Landscape**:
- **Validation**: Approved antibodies = validated target
- **Competition**: Understand what's already in market/clinic
- **Strategy**: Identify gaps (no oral, no CNS-penetrant)
- **Synergy**: Potential combination opportunities

1.3 Binding Site Analysis

1. ChEMBL_search_binding_sites(target_chembl_id)
   └─ Extract: Binding site names, types
   
2. get_binding_affinity_by_pdb_id(pdb_id)  # For each PDB with ligand
   └─ Extract: Kd, Ki, IC50 values for co-crystallized ligands
   
3. InterPro_get_protein_domains(uniprot_accession)
   └─ Extract: Domain architecture, active sites
Output for Report:

1.3 Binding Site Assessment

Known Binding Sites:

| Site | Type | Evidence | Key Residues | Source |
|------|------|----------|--------------|--------|
| ATP pocket | Orthosteric | X-ray (23 PDBs) | K745, E762, M793 | PDB/ChEMBL |
| Allosteric pocket | Allosteric | X-ray (3 PDBs) | T790, C797 | PDB |

Binding Site Druggability: ★★★ (well-defined pocket, multiple co-crystal structures)

Source: ChEMBL via `ChEMBL_search_binding_sites`, PDB structures

1.4 Structure Prediction (NVIDIA NIM)

When no experimental structure is available, or for custom domain predictions.
Requires: `NVIDIA_API_KEY` environment variable
Option A: AlphaFold2 (High accuracy, async)
NvidiaNIM_alphafold2(
    sequence=kinase_domain_sequence,
    algorithm="mmseqs2",
    relax_prediction=False
)
└─ Returns: PDB structure with pLDDT confidence scores
└─ Use when: Accuracy is critical, time is available (~5-15 min)
Option B: ESMFold (Fast, synchronous)
NvidiaNIM_esmfold(sequence=kinase_domain_sequence)
└─ Returns: PDB structure (max 1024 AA)
└─ Use when: Quick assessment needed (~30 sec)
Report pLDDT Confidence:

1.4 Structure Prediction Quality

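Method: AlphaFold2 via NVIDIA NIM

Mean pLDDT: 90.94 (very high confidence)

| Confidence Level | Range | Fraction | Interpretation |
|------------------|-------|----------|----------------|
| Very High | ≥90 | 74.3% | Highly reliable |
| Confident | 70-90 | 16.0% | Reliable |
| Low | 50-70 | 9.0% | Use caution |
| Very Low | <50 | 0.7% | Unreliable |

Key Binding Residue Confidence:

| Residue | Function | pLDDT |
|---------|----------|-------|
| K745 | ATP binding | 90.0 |
| T790 | Gatekeeper | 92.3 |
| M793 | Hinge region | 95.3 |
| D855 | DFG motif | 89.5 |

Source: NVIDIA NIM via `NvidiaNIM_alphafold2`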
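
The confidence breakdown in this report section can be computed directly from the predicted structure. The sketch below assumes the returned PDB text follows the AlphaFold convention of storing per-residue pLDDT (0-100) in the B-factor column; whether a given NIM response does so should be verified. `plddt_from_pdb` and `summarize_plddt` are hypothetical helper names.

```python
def plddt_from_pdb(pdb_text: str) -> list:
    """Read one pLDDT value per residue from the B-factor column of CA atoms."""
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values


def summarize_plddt(values: list) -> dict:
    """Mean pLDDT plus the percentage of residues in each confidence band."""
    if not values:
        raise ValueError("no pLDDT values found")
    bands = {"Very High": 0, "Confident": 0, "Low": 0, "Very Low": 0}
    for value in values:
        if value >= 90:
            bands["Very High"] += 1
        elif value >= 70:
            bands["Confident"] += 1
        elif value >= 50:
            bands["Low"] += 1
        else:
            bands["Very Low"] += 1
    total = len(values)
    return {
        "mean": round(sum(values) / total, 2),
        "fractions": {name: round(100 * count / total, 1) for name, count in bands.items()},
    }
```

Looking up individual residues (for example the gatekeeper or hinge positions) would additionally need the residue number from columns 23-26, which this sketch does not extract.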

---

Phase 2: Known Ligand Mining

2.1 ChEMBL Bioactivity Data

1. ChEMBL_get_target_activities(target_chembl_id, limit=500)
   └─ Filter: standard_type in ["IC50", "Ki", "Kd", "EC50"]
   └─ Filter: standard_value < 10000 nM
   └─ Extract: ChEMBL molecule IDs, SMILES, potency values

2. ChEMBL_get_molecule(molecule_chembl_id)  # For top actives
   └─ Extract: Full molecular data, max_phase, oral flag
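
The two filters in step 1 can be applied to the returned records roughly as follows. This is a sketch under assumptions: it expects a list of dicts using ChEMBL's activity field names (`standard_type`, `standard_value`, `standard_units`, `molecule_chembl_id`, `canonical_smiles`), and the exact shape of the tool's response should be checked before reuse. `filter_actives` is a hypothetical helper name.

```python
POTENCY_TYPES = {"IC50", "Ki", "Kd", "EC50"}


def filter_actives(activities: list, max_nm: float = 10000.0) -> list:
    """Keep potency records below max_nm and return the best value per molecule."""
    best = {}
    for act in activities:
        if act.get("standard_type") not in POTENCY_TYPES:
            continue
        if act.get("standard_units") != "nM":
            continue  # skip records in other units rather than guess a conversion
        try:
            value = float(act.get("standard_value"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric potency
        if value >= max_nm:
            continue
        mol = act.get("molecule_chembl_id")
        if mol not in best or value < best[mol]["value_nM"]:
            best[mol] = {
                "molecule_chembl_id": mol,
                "smiles": act.get("canonical_smiles"),
                "type": act["standard_type"],
                "value_nM": value,
            }
    return sorted(best.values(), key=lambda record: record["value_nM"])
```

Taking the lowest value across IC50, Ki, Kd, and EC50 mixes measurement types; for SAR work it may be cleaner to run the filter once per type.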
Activity Summary Table:

2.1 Known Active Compounds (ChEMBL)

Total Bioactivity Points: 2,847 (IC50: 1,234 | Ki: 892 | Kd: 456 | EC50: 265)

Compounds with IC50 < 100 nM: 156

Approved Drugs for This Target: 5

| Compound | ChEMBL ID | IC50 (nM) | Max Phase | SMILES (truncated) |
|----------|-----------|-----------|-----------|--------------------|
| Erlotinib | CHEMBL553 | 2 | 4 | COc1cc2ncnc(Nc3ccc... |
| Gefitinib | CHEMBL939 | 5 | 4 | COc1cc2ncnc(Nc3ccc... |
| [Novel] | CHEMBL123 | 12 | 0 | c1ccc(NC(=O)c2ccc... |

Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL203)

2.2 GtoPdb Interactions

GtoPdb_get_target_interactions(target_id)
└─ Extract: Ligands with pKi/pIC50, selectivity data

2.3 Chemical Probes

OpenTargets_get_chemical_probes_by_target_ensemblID(ensemblId)
└─ Extract: Validated chemical probes with ratings
Output for Report:

2.3 Chemical Probes

| Probe | Target | Rating | Use | Caveat | Source |
|-------|--------|--------|-----|--------|--------|
| Probe-X | EGFR | ★★★★ | In vivo | None | Chemical Probes Portal |
| Probe-Y | EGFR | ★★★☆ | In vitro | Off-target kinase activity | Open Targets |

Recommended Probe for Target Validation: Probe-X (highest rating, validated in vivo)

2.4 SAR Analysis from Actives

Identify common scaffolds and SAR trends:

2.4 Structure-Activity Relationships

Core Scaffolds Identified:
  1. 4-Anilinoquinazoline (34 compounds, IC50 range: 2-500 nM)
    • N1 position: Aryl preferred
    • C6/C7: Methoxy groups improve potency
  2. Pyrimidine-amine (12 compounds, IC50 range: 15-800 nM)
    • Less potent than quinazolines
    • Better selectivity profile
Key SAR Insights:
  • Halogen at meta position of aniline increases potency 3-5x
  • C7 ethoxy group critical for binding (H-bond to M793)

2.5 BindingDB Affinity Data (NEW)

BindingDB provides experimental binding affinity data complementary to ChEMBL:
python
def get_bindingdb_ligands(tu, uniprot_id, affinity_cutoff=10000):
    """
    Get ligands from BindingDB with measured affinities.
    
    BindingDB advantages:
    - May have compounds not in ChEMBL
    - Different affinity types (Ki, IC50, Kd)
    - Direct literature links
    """
    
    result = tu.tools.BindingDB_get_ligands_by_uniprot(
        uniprot=uniprot_id,
        affinity_cutoff=affinity_cutoff  # nM
    )
    
    if result:
        ligands = []
        for entry in result:
            ligands.append({
                'smiles': entry.get('smile'),
                'affinity_type': entry.get('affinity_type'),
                'affinity_nM': entry.get('affinity'),
                'pmid': entry.get('pmid'),
                'monomer_id': entry.get('monomerid')
            })
        
        # Sort by potency
        ligands.sort(key=lambda x: float(x['affinity_nM']) if x['affinity_nM'] else 1e6)
        return ligands[:50]  # Top 50
    
    return []

def find_compound_polypharmacology(tu, smiles, similarity_cutoff=0.85):
    """Find off-target interactions for selectivity analysis."""
    
    targets = tu.tools.BindingDB_get_targets_by_compound(
        smiles=smiles,
        similarity_cutoff=similarity_cutoff
    )
    
    return targets  # Other proteins this compound may bind
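
The potency sort in `get_bindingdb_ligands` calls `float()` directly on the affinity field, which raises `ValueError` if a value arrives with a qualifier such as `>10000` or as free text. Whether the tool returns such strings is an assumption worth checking; if it does, a tolerant parser along these lines can stand in for the lambda. `parse_affinity_nm` and `sort_by_potency` are hypothetical helper names.

```python
def parse_affinity_nm(raw):
    """Parse an affinity field into (value_nM, qualifier), or None if unusable.

    The qualifier is '=', '<' or '>' so censored values stay distinguishable.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    qualifier = "="
    if text[0] in "<>":
        qualifier = text[0]
        text = text[1:].strip()
    try:
        return float(text), qualifier
    except ValueError:
        return None


def sort_by_potency(ligands: list) -> list:
    """Sort ligands by ascending affinity; unparseable entries go last."""
    def key(ligand):
        parsed = parse_affinity_nm(ligand.get("affinity_nM"))
        return parsed[0] if parsed else float("inf")
    return sorted(ligands, key=key)
```

Keeping the qualifier matters for ranking: a reported `>10000` nM is a lower bound on weakness, not a measured potency, and should not be compared with exact values as if it were.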
BindingDB Output for Report:

2.5 Additional Ligands (BindingDB) (NEW)

Total Unique Ligands: 89 (non-overlapping with ChEMBL)

Most Potent: 0.3 nM Ki

| SMILES | Affinity Type | Value (nM) | PMID | BindingDB ID |
|--------|---------------|------------|------|--------------|
| CC(C)Cc1ccc... | Ki | 0.3 | 15737014 | 12345 |
| COc1cc2ncnc... | IC50 | 2.1 | 16460808 | 12346 |

Novel Scaffolds from BindingDB: 3 scaffolds not seen in ChEMBL data

Source: BindingDB via `BindingDB_get_ligands_by_uniprot`

2.6 PubChem BioAssay Screening Data (NEW)

PubChem BioAssay provides HTS screening results and dose-response data:
python
def get_pubchem_assays_for_target(tu, gene_symbol):
    """
    Get bioassays and active compounds from PubChem.
    
    Advantages:
    - HTS data not in ChEMBL
    - NIH-funded screening programs (MLPCN)
    - Dose-response curves for IC50 calculation
    """
    
    # Search assays targeting this gene
    assays = tu.tools.PubChem_search_assays_by_target_gene(
        gene_symbol=gene_symbol
    )
    
    results = {
        'assays': [],
        'total_active_compounds': 0
    }
    
    if assays.get('data', {}).get('aids'):
        for aid in assays['data']['aids'][:10]:  # Top 10 assays
            # Get assay summary
            summary = tu.tools.PubChem_get_assay_summary(aid=aid)
            
            # Get active compounds
            actives = tu.tools.PubChem_get_assay_active_compounds(aid=aid)
            active_cids = actives.get('data', {}).get('cids', [])
            
            results['assays'].append({
                'aid': aid,
                'summary': summary.get('data', {}),
                'active_count': len(active_cids)
            })
            results['total_active_compounds'] += len(active_cids)
    
    return results

def get_dose_response_data(tu, aid):
    """Get dose-response curves for IC50/EC50 determination."""
    
    dr_data = tu.tools.PubChem_get_assay_dose_response(aid=aid)
    return dr_data

def get_compound_bioactivity_profile(tu, cid):
    """Get all bioactivity data for a compound."""
    
    profile = tu.tools.PubChem_get_compound_bioactivity(cid=cid)
    return profile
PubChem BioAssay Output for Report:

2.6 PubChem HTS Screening Data (NEW)

Assays Found: 45

Total Active Compounds Across Assays: ~1,200

| AID | Assay Type | Active Compounds | Target | Description |
|-----|------------|------------------|--------|-------------|
| 504526 | HTS | 234 | EGFR | qHTS inhibition screen |
| 1053104 | Dose-response | 12 | EGFR kinase | Confirmatory IC50 |
| 651564 | Cellular | 8 | EGFR | Cell proliferation assay |

Novel Actives (not in ChEMBL/BindingDB):
  • CID 12345678: Active in AID 504526, IC50 = 45 nM
  • CID 23456789: Active in AID 1053104, IC50 = 120 nM

Source: PubChem via `PubChem_search_assays_by_target_gene`, `PubChem_get_assay_active_compounds`

**Why Use Both BindingDB and PubChem**:
| Source | Strengths | Best For |
|--------|-----------|----------|
| **ChEMBL** | Curated, standardized, SAR data | Primary ligand source |
| **BindingDB** | Direct affinity measurements | Ki/Kd values, PMIDs |
| **PubChem BioAssay** | HTS data, NIH screens | Novel scaffolds, broad coverage |

---

Phase 3: Structure Analysis

3.1 PDB Structure Retrieval

1. PDB_search_similar_structures(query=uniprot_accession, type="sequence")
   └─ Extract: PDB IDs with ligands
   
2. get_protein_metadata_by_pdb_id(pdb_id)
   └─ Extract: Resolution, method, ligand codes
   
3. alphafold_get_prediction(accession=uniprot_accession)
   └─ Extract: Predicted structure (if no experimental)
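
Once metadata has been collected, the choice of a docking structure can be made explicit. The sketch below is illustrative only: the record fields (`pdb_id`, `resolution`, `ligands`) and the `pick_best_structure` helper are assumptions, not the actual output schema of the tools above, and the IDs are placeholders.

```python
def pick_best_structure(records):
    """Prefer ligand-bound structures, then the lowest (best) resolution.

    `records` is a list of dicts with hypothetical keys:
    'pdb_id', 'resolution' (in angstroms, or None), 'ligands' (list of codes).
    """
    experimental = [r for r in records if r.get("resolution") is not None]
    if not experimental:
        return None  # caller should fall back to a predicted model
    return min(
        experimental,
        key=lambda r: (0 if r.get("ligands") else 1, r["resolution"]),
    )


# Placeholder records, not real PDB entries
structures = [
    {"pdb_id": "AAAA", "resolution": 2.6, "ligands": ["LIG"]},
    {"pdb_id": "BBBB", "resolution": 2.1, "ligands": ["LIG"]},
    {"pdb_id": "CCCC", "resolution": 1.8, "ligands": []},
]
best = pick_best_structure(structures)
```

Here the apo structure `CCCC` has the best resolution but is ranked below the ligand-bound entries, because a co-crystallized ligand defines the pocket used for docking.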

3.1b EMDB Cryo-EM Structures (NEW)

Prioritize EMDB for: Membrane proteins (GPCRs, ion channels), large complexes, targets with multiple conformational states.
```python
def get_cryoem_structures(tu, target_name, uniprot_accession):
    """Get cryo-EM structures for membrane targets."""
    
    # Search EMDB
    emdb_results = tu.tools.emdb_search(
        query=f"{target_name} membrane receptor"
    )
    
    structures = []
    for entry in emdb_results[:5]:
        details = tu.tools.emdb_get_entry(entry_id=entry['emdb_id'])
        
        # Get associated PDB model (essential for docking)
        pdb_models = details.get('pdb_ids', [])
        
        structures.append({
            'emdb_id': entry['emdb_id'],
            'resolution': entry.get('resolution', 'N/A'),
            'title': entry.get('title', 'N/A'),
            'conformational_state': details.get('state', 'Unknown'),
            'pdb_models': pdb_models
        })
    
    return structures
```
When to use cryo-EM over X-ray:

| Target Type | Prefer cryo-EM? | Reason |
|-------------|-----------------|--------|
| GPCR | Yes | Native membrane conformation |
| Ion channel | Yes | Multiple functional states |
| Receptor-ligand complex | Yes | Physiological state |
| Kinase | Usually X-ray | Higher resolution typically |

Structure Summary:

3.1 Available Structures

| PDB ID | Resolution | Method | Ligand | Affinity | State |
|--------|------------|--------|--------|----------|-------|
| 1M17 | 2.6 Å | X-ray | Erlotinib | Ki=0.4 nM | Active |
| 4HJO | 2.1 Å | X-ray | Lapatinib | Ki=3 nM | Inactive |
| AF-P00533 | - | Predicted | None | - | - |

3.1b Cryo-EM Structures (EMDB)

| EMDB ID | Resolution | PDB Model | Conformation | Ligand |
|---------|------------|-----------|--------------|--------|
| EMD-12345 | 3.2 Å | 7ABC | Active | Agonist |
| EMD-23456 | 3.5 Å | 8DEF | Inactive | Antagonist |

Best Structure for Docking: 1M17 (high resolution, relevant ligand)
Source: RCSB PDB, EMDB, AlphaFold DB

3.2 Binding Pocket Analysis

get_binding_affinity_by_pdb_id(pdb_id)
└─ Extract: Binding affinities for co-crystallized ligands
Output for Report:

3.2 Binding Pocket Characterization

Pocket Volume: ~850 ų (well-defined)
Key Interaction Residues:
  • Hinge region: M793 (backbone H-bond donor/acceptor)
  • Gatekeeper: T790 (small residue, allows access)
  • DFG motif: D855 (active conformation)
  • Selectivity pocket: L788, G796 (unique to EGFR)
Druggability Assessment: High (enclosed pocket, conserved interactions)

---

Phase 3.5: Docking Validation (NVIDIA NIM)

Validate the structure and score compounds using molecular docking.
Requires: `NVIDIA_API_KEY` environment variable

3.5.1 Reference Compound Docking


Dock a known inhibitor to validate the structure captures the binding pocket correctly.
Option A: DiffDock (Blind docking, PDB + SDF input)
NvidiaNIM_diffdock(
    protein=pdb_content,        # PDB text content
    ligand=reference_sdf,       # SDF/MOL2 content
    num_poses=10
)
└─ Returns: Docking poses with confidence scores
└─ Use: When you have PDB structure and ligand SDF file
Option B: Boltz2 (From sequence + SMILES)
NvidiaNIM_boltz2(
    polymers=[{"molecule_type": "protein", "sequence": kinase_sequence}],
    ligands=[{"smiles": "COc1cc2ncnc(Nc3ccc(C#C)cc3)c2cc1OCCOC"}],
    sampling_steps=50,
    diffusion_samples=1
)
└─ Returns: Protein-ligand complex structure
└─ Use: When starting from SMILES, no SDF needed

3.5.2 Docking Score Interpretation

| Score vs Reference | Priority | Symbol |
|--------------------|----------|--------|
| Higher than reference | Top priority | ★★★★ |
| Within 5% of reference | High priority | ★★★ |
| Within 20% of reference | Moderate priority | ★★☆ |
| >20% lower | Low priority | ★☆☆ |
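
The thresholds above can be applied mechanically. The following is a minimal sketch, assuming that higher docking scores are better and that the reference score is positive (as with a DiffDock-style confidence); `docking_priority` is a hypothetical helper, not part of ToolUniverse.

```python
def docking_priority(score, reference):
    """Map a candidate score to (priority, symbol) relative to a reference score.

    Assumes higher is better and reference > 0.
    """
    if reference <= 0:
        raise ValueError("reference score must be positive")
    delta = (score - reference) / reference  # fractional difference vs reference
    if delta > 0:
        return "Top priority", "★★★★"
    if delta >= -0.05:
        return "High priority", "★★★"
    if delta >= -0.20:
        return "Moderate priority", "★★☆"
    return "Low priority", "★☆☆"


reference_score = 0.906  # example reference confidence
tiers = [docking_priority(s, reference_score) for s in (0.95, 0.88, 0.75, 0.50)]
```

If the scoring function is energy-like (lower is better), the sign of `delta` would need to be flipped before applying the same cutoffs.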
Report Format:

3.5 Docking Validation Results

Reference Compound: Erlotinib
Method: DiffDock via NVIDIA NIM

| Metric | Value | Interpretation |
|--------|-------|----------------|
| Best Pose Confidence | 0.906 | Excellent |
| Steric Clashes | None | Clean binding pose |

Validation Status: ✓ Structure captures binding pocket correctly
Source: NVIDIA NIM via `NvidiaNIM_diffdock`

---

Phase 4: Compound Expansion

4.1 Similarity Search

Starting from top actives, expand chemical space:
1. ChEMBL_search_similar_molecules(molecule=top_active_smiles, similarity=70)
   └─ Extract: Similar compounds not yet tested on target
   
2. PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
   └─ Extract: PubChem CIDs with similar structures
Strategy:
  • Use 3-5 diverse actives as seeds
  • Similarity threshold: 70-85% (balance novelty vs. activity)
  • Prioritize compounds NOT in ChEMBL bioactivity for target
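
The 70-85% window can be illustrated with a plain Tanimoto calculation. This is a sketch under stated assumptions: fingerprints are represented as Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit), and both helper names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def in_novelty_window(seed_fp, candidate_fp, low=0.70, high=0.85):
    """True if the candidate is similar enough to be active but not a near-duplicate."""
    return low <= tanimoto(seed_fp, candidate_fp) <= high


seed = {1, 2, 3, 4}
close_analog = {1, 2, 3, 4, 5}   # shares 4 of 5 bits
distant = {1, 2, 3, 5}           # shares 3 of 5 bits
```

Candidates above the upper bound are usually trivial analogs of the seed, while those below the lower bound are less likely to retain activity.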

4.2 Substructure Search

1. ChEMBL_search_substructure(smiles=core_scaffold)
   └─ Extract: Compounds containing scaffold
   
2. PubChem_search_compounds_by_substructure(smiles=core_scaffold)
   └─ Extract: Additional scaffold-containing compounds

4.3 Cross-Database Mining

1. STITCH_get_chemical_protein_interactions(identifier=target_gene)
   └─ Extract: Additional chemical-protein links
   
2. DGIdb_get_drug_gene_interactions(genes=[gene_symbol])
   └─ Extract: Approved/investigational drugs
Output for Report:

4. Compound Expansion Results

Starting Seeds: 5 diverse actives (IC50 < 100 nM)
Similarity Expansion: 847 compounds (70% threshold)
Substructure Search: 234 scaffold matches
Cross-Database: 45 additional hits
After Deduplication: 923 unique candidate compounds

| Source | Compounds | Already Tested | Novel Candidates |
|--------|-----------|----------------|------------------|
| ChEMBL similarity | 456 | 234 | 222 |
| PubChem similarity | 391 | 156 | 235 |
| ChEMBL substructure | 178 | 89 | 89 |
| STITCH | 45 | 23 | 22 |
| Total Unique | 923 | 355 | 568 |
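
Deduplication across sources is where the "Total Unique" row comes from, since per-source counts overlap. A minimal sketch follows, assuming every hit has already been reduced to one canonical identifier (for example an InChIKey); the function name and the toy identifiers are hypothetical.

```python
def merge_candidates(hits_by_source, already_tested):
    """Merge per-source hit lists and split off compounds not yet tested on the target.

    Returns (merged, novel), each a dict mapping identifier -> set of sources.
    """
    merged = {}
    for source, ids in hits_by_source.items():
        for cid in ids:
            merged.setdefault(cid, set()).add(source)  # remember provenance
    novel = {cid: srcs for cid, srcs in merged.items() if cid not in already_tested}
    return merged, novel


# Toy identifiers for illustration only
hits = {
    "chembl_similarity": ["A", "B", "C"],
    "pubchem_similarity": ["B", "D"],
    "stitch": ["D", "E"],
}
merged, novel = merge_candidates(hits, already_tested={"A", "D"})
```

Keeping the set of sources per compound makes it easy to report provenance in the candidate CSV and to favor compounds found by more than one strategy.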

4.4 De Novo Molecule Generation (NVIDIA NIM)

When database mining yields insufficient candidates, generate novel molecules.
Requires: `NVIDIA_API_KEY` environment variable
Option A: GenMol (Scaffold Hopping with Masked Regions)
NvidiaNIM_genmol(
    smiles="COc1cc2ncnc(Nc3ccc([*{3-8}])c([*{1-3}])c3)c2cc1OCCCN1CCOCC1",
    num_molecules=100,
    temperature=2.0,
    scoring="QED"
)
└─ Input: SMILES with [*{min-max}] masked regions
└─ Output: Generated molecules with QED/LogP scores
└─ Use: Explore specific positions while keeping scaffold
Mask Design Strategy:

| Position | Mask | Purpose |
|----------|------|---------|
| Aniline substituent | `[*{1-3}]` | Small groups (halogen, methyl) |
| Solubilizing group | `[*{5-10}]` | Morpholine, piperazine variants |
| Linker region | `[*{3-6}]` | Spacer variations |

Example Masked SMILES for EGFR:

Keep the quinazoline core; vary the aniline and tail:
`COc1cc2ncnc(Nc3ccc([*{1-3}])c([*{1-3}])c3)c2cc1[*{5-12}]`

**Option B: MolMIM (Controlled Generation from Reference)**
NvidiaNIM_molmim(
    smi="COc1cc2ncnc(Nc3ccc(Cl)cc3)c2cc1OCCN1CCOCC1",
    num_molecules=50,
    algorithm="CMA-ES"
)
└─ Input: Reference SMILES (known active)
└─ Output: Optimized analogs with property scores
└─ Use: Generate close analogs of top actives

**Generation Workflow**:
1. Identify top 3-5 actives from Phase 2
2. Design masked SMILES for GenMol OR use as reference for MolMIM
3. Generate 50-100 molecules per seed
4. Pass generated molecules to Phase 5 (ADMET filtering)
5. Dock survivors in Phase 6 for final ranking

**Report Format**:

4.4 De Novo Generation Results

Method: GenMol via NVIDIA NIM
Seed Scaffold: 4-anilinoquinazoline (from erlotinib)
Masked Positions: Aniline (3,4), solubilizing tail

| Metric | Value |
|--------|-------|
| Molecules Generated | 100 |
| Passing Lipinski | 95 (95%) |
| Mean QED Score | 0.72 |
| Unique Scaffolds | 12 |

Top Generated Compounds:

| ID | SMILES | QED | LogP | Novelty |
|----|--------|-----|------|---------|
| GEN-001 | `COc1cc2ncnc(Nc3ccc(Cl)c(Cl)c3)c2cc1OCCCN1CCOCC1` | 0.81 | 4.2 | Novel substitution |
| GEN-002 | `COc1cc2ncnc(Nc3ccc(C#N)c(F)c3)c2cc1OCCCN1CCOCC1` | 0.78 | 3.8 | Novel substitution |

Source: NVIDIA NIM via `NvidiaNIM_genmol`

---

Phase 5: ADMET Filtering

5.1 Physicochemical Properties

ADMETAI_predict_physicochemical_properties(smiles=[compound_list])
└─ Filter: Lipinski violations ≤ 1
└─ Filter: QED > 0.3
└─ Filter: MW 200-600
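
The three filters can be combined into one predicate once properties are available. The sketch below assumes each compound is a dict with hypothetical keys `mw`, `logp`, `hbd`, `hba`, and `qed`; the real ADMET-AI output fields may be named differently. The Lipinski cutoffs are the standard rule-of-five values.

```python
def lipinski_violations(p):
    """Count rule-of-five violations: MW > 500, logP > 5, HBD > 5, HBA > 10."""
    return sum([p["mw"] > 500, p["logp"] > 5, p["hbd"] > 5, p["hba"] > 10])


def passes_physchem(p, max_violations=1, min_qed=0.3, mw_range=(200, 600)):
    """Apply the Phase 5.1 filters to one property record."""
    return (
        lipinski_violations(p) <= max_violations
        and p["qed"] > min_qed
        and mw_range[0] <= p["mw"] <= mw_range[1]
    )


# Illustrative property records, not predictions from any tool
drug_like = {"mw": 393.4, "logp": 3.3, "hbd": 1, "hba": 7, "qed": 0.60}
borderline = {"mw": 550.0, "logp": 4.0, "hbd": 2, "hba": 8, "qed": 0.45}
too_heavy = {"mw": 650.0, "logp": 6.0, "hbd": 2, "hba": 8, "qed": 0.40}
```

Note that `borderline` passes with a single violation (MW above 500 but within the 200-600 window), which is the behavior the "violations ≤ 1" rule is meant to allow.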

5.2 ADMET Endpoints

1. ADMETAI_predict_bioavailability(smiles=[compound_list])
   └─ Filter: Oral bioavailability > 0.3
   
2. ADMETAI_predict_toxicity(smiles=[compound_list])
   └─ Filter: AMES < 0.5, hERG < 0.5, DILI < 0.5
   
3. ADMETAI_predict_CYP_interactions(smiles=[compound_list])
   └─ Flag: CYP3A4 inhibitors (drug interaction risk)

5.3 Structural Alerts

ChEMBL_search_compound_structural_alerts(smiles=compound_smiles)
└─ Flag: PAINS, reactive groups, toxicophores
ADMET Filter Summary:

5. ADMET Filtering Results

| Filter Stage | Input | Passed | Failed | Pass Rate |
|--------------|-------|--------|--------|-----------|
| Physicochemical (Lipinski) | 568 | 456 | 112 | 80% |
| Drug-likeness (QED > 0.3) | 456 | 398 | 58 | 87% |
| Bioavailability (> 0.3) | 398 | 312 | 86 | 78% |
| Toxicity filters | 312 | 267 | 45 | 86% |
| Structural alerts | 267 | 234 | 33 | 88% |
| Final Candidates | 568 | 234 | 334 | 41% |

Common Failure Reasons:
  1. High molecular weight (>600): 67 compounds
  2. Low predicted bioavailability: 86 compounds
  3. hERG liability: 28 compounds
  4. PAINS alerts: 18 compounds
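
The funnel table is simple bookkeeping: each stage takes the survivors of the previous one. A small sketch that reproduces the example numbers above (the `build_funnel` helper is hypothetical):

```python
def build_funnel(start, stages):
    """Return one row per filter stage with input, passed, failed, and pass rate (%)."""
    rows, current = [], start
    for name, passed in stages:
        rate = round(100 * passed / current) if current else None
        rows.append({
            "stage": name,
            "input": current,
            "passed": passed,
            "failed": current - passed,
            "pass_rate": rate,
        })
        current = passed  # survivors feed the next stage
    return rows


stages = [
    ("Physicochemical (Lipinski)", 456),
    ("Drug-likeness (QED > 0.3)", 398),
    ("Bioavailability (> 0.3)", 312),
    ("Toxicity filters", 267),
    ("Structural alerts", 234),
]
rows = build_funnel(568, stages)
overall_rate = round(100 * rows[-1]["passed"] / 568)
```

Generating the table from counts rather than typing it by hand keeps the per-stage and overall pass rates consistent with each other.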

---

Phase 6: Candidate Prioritization

6.1 Scoring Framework

Score each candidate on multiple dimensions:

| Dimension | Weight | Scoring Criteria |
|-----------|--------|------------------|
| Structural Similarity | 25% | Tanimoto to actives (0.7-1.0 → 1-5) |
| Novelty | 20% | Not in ChEMBL bioactivity = +2; Novel scaffold = +3 |
| ADMET Score | 25% | Composite of property predictions |
| Synthesis Feasibility | 15% | SA score (1-10), commercial availability |
| Scaffold Diversity | 15% | Cluster representative bonus |
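
The weights above combine into a single overall score. The sketch below assumes each dimension has already been normalized to a common 1-5 scale; that normalization, the dictionary keys, and the `overall_score` helper are assumptions for illustration, since the source does not specify how the sub-scores are scaled.

```python
WEIGHTS = {
    "similarity": 0.25,
    "novelty": 0.20,
    "admet": 0.25,
    "synthesis": 0.15,
    "diversity": 0.15,
}


def overall_score(sub_scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores (all assumed to be on a 1-5 scale)."""
    missing = set(weights) - set(sub_scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[k] * sub_scores[k] for k in weights)


candidate = {"similarity": 4, "novelty": 5, "admet": 4.5, "synthesis": 3, "diversity": 4}
score = overall_score(candidate)
```

For this example the result is 4.175; sorting candidates by this value gives the ranked list used in the final table.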

6.2 Synthesis Feasibility


6.2 Synthesis Feasibility Assessment

| Candidate | SA Score | Commercial | Estimated Steps | Flag |
|-----------|----------|------------|-----------------|------|
| Compound-1 | 2.3 | Yes (Enamine) | 0 | ★★★ |
| Compound-2 | 3.5 | Building block | 2-3 | ★★☆ |
| Compound-3 | 5.8 | No | 6-8 | ★☆☆ |

SA Score Interpretation:
  • 1-3: Easy synthesis
  • 3-5: Moderate complexity
  • 5-10: Challenging synthesis

6.3 Final Prioritized List


6.3 Top 20 Candidate Compounds

| Rank | ID | SMILES | Sim. Score | ADMET | Novelty | Overall | Rationale |
|------|----|--------|------------|-------|---------|---------|-----------|
| 1 | CPD-001 | Cc1ccc... | 0.82 | 4.5 | Novel scaffold | 4.2 | High similarity, clean ADMET |
| 2 | CPD-002 | COc1cc... | 0.78 | 4.3 | Not tested | 4.0 | Quinazoline analog |
| 3 | CPD-003 | Nc1ccc... | 0.75 | 4.1 | Novel core | 3.9 | New chemotype |
| ... | ... | ... | ... | ... | ... | ... | ... |

Scaffold Diversity: 7 distinct scaffolds in top 20
Commercial Availability: 12/20 available for purchase
Estimated Hit Rate: 15-25% (based on similarity to actives)

---

Phase 6.5: Literature Evidence (NEW)

6.5.1 Literature Search for Validation

Search literature to validate candidate compounds and understand target context.
```python
def search_binder_literature(tu, target_name, compound_scaffolds):
    """Search literature for compound and target evidence."""
    
    # PubMed: Published SAR studies
    sar_papers = tu.tools.PubMed_search_articles(
        query=f"{target_name} inhibitor SAR structure-activity",
        limit=30
    )
    
    # BioRxiv: Latest unpublished findings
    preprints = tu.tools.BioRxiv_search_preprints(
        query=f"{target_name} small molecule discovery",
        limit=15
    )
    
    # MedRxiv: Clinical data on inhibitors
    clinical = tu.tools.MedRxiv_search_preprints(
        query=f"{target_name} inhibitor clinical trial",
        limit=10
    )
    
    # Citation analysis for key papers
    key_papers = sar_papers[:10]
    for paper in key_papers:
        citation = tu.tools.openalex_search_works(
            query=paper['title'],
            limit=1
        )
        paper['citations'] = citation[0].get('cited_by_count', 0) if citation else 0
    
    return {
        'published_sar': sar_papers,
        'preprints': preprints,
        'clinical_preprints': clinical,
        'high_impact_papers': sorted(key_papers, key=lambda x: x.get('citations', 0), reverse=True)
    }
```

6.5.2 Output for Report


6.5 Literature Evidence

Published SAR Studies

| PMID | Title | Year | Key Insight |
|------|-------|------|-------------|
| 34567890 | Discovery of novel EGFR inhibitors... | 2024 | C7 substitution critical |
| 33456789 | Structure-activity relationship of... | 2023 | Fluorine improves potency |

Recent Preprints (⚠️ Not Peer-Reviewed)

| Source | Title | Posted | Relevance |
|--------|-------|--------|-----------|
| BioRxiv | Novel scaffolds for EGFR... | 2024-02 | New chemotype discovery |
| MedRxiv | Clinical activity of... | 2024-01 | Phase 2 results |

High-Impact References

| PMID | Citations | Title |
|------|-----------|-------|
| 32123456 | 523 | Landmark EGFR inhibitor study... |
| 31234567 | 312 | Comprehensive SAR analysis... |

Source: PubMed, BioRxiv, MedRxiv, OpenAlex

---

Report Template

File: `[TARGET]_binder_discovery_report.md`

Small Molecule Binder Discovery: [TARGET]

Generated: [Date] | Query: [Original query] | Status: In Progress

Executive Summary

[Researching...]

1. Target Validation

1.1 Target Identifiers

[Researching...]

1.2 Druggability Assessment

[Researching...]

1.3 Binding Site Analysis

[Researching...]

2. Known Ligand Landscape

2.1 ChEMBL Bioactivity Summary

[Researching...]

2.2 Approved Drugs & Clinical Compounds

[Researching...]

2.3 Chemical Probes

[Researching...]

2.4 SAR Insights

[Researching...]

3. Structural Information

3.1 Available Structures

[Researching...]

3.2 Binding Pocket Analysis

[Researching...]

3.3 Key Interactions

[Researching...]

4. Compound Expansion

4.1 Similarity Search Results

[Researching...]

4.2 Substructure Search Results

[Researching...]

4.3 Cross-Database Mining

[Researching...]

5. ADMET Filtering

5.1 Physicochemical Filters

[Researching...]

5.2 ADMET Predictions

[Researching...]

5.3 Structural Alerts

[Researching...]

5.4 Filter Summary

[Researching...]

6. Candidate Prioritization

6.1 Scoring Methodology

[Researching...]

6.2 Synthesis Feasibility

[Researching...]

6.3 Top 20 Candidates

[Researching...]

7. Recommendations

7.1 Immediate Actions

[Researching...]

7.2 Experimental Validation Plan

[Researching...]

7.3 Backup Strategies

[Researching...]

8. Data Gaps & Limitations

[Researching...]

9. Data Sources

[Will be populated as research progresses...]

10. Methods Summary

| Step | Tool | Purpose |
|------|------|---------|
| Sequence retrieval | `UniProt_search` | Get protein sequence |
| Structure prediction | `NvidiaNIM_alphafold2` / `NvidiaNIM_esmfold` | 3D structure with pLDDT |
| Docking validation | `NvidiaNIM_diffdock` / `NvidiaNIM_boltz2` | Validate binding pocket |
| Known ligands | `ChEMBL_get_target_activities` | Bioactivity data |
| Similarity search | `ChEMBL_search_similar_molecules` | Expand chemical space |
| De novo generation | `NvidiaNIM_genmol` / `NvidiaNIM_molmim` | Novel molecule design |
| ADMET filtering | `ADMETAI_predict_*` | Drug-likeness assessment |
| Candidate docking | `NvidiaNIM_diffdock` / `NvidiaNIM_boltz2` | Final scoring |

---

Evidence Grading

| Tier | Symbol | Description | Example |
|------|--------|-------------|---------|
| T0 | ★★★★ | Docking score > reference inhibitor | Better than erlotinib |
| T1 | ★★★ | Experimental IC50/Ki < 100 nM | ChEMBL bioactivity |
| T2 | ★★☆ | Docking within 5% of reference OR IC50 100-1000 nM | High priority |
| T3 | ★☆☆ | Structural similarity > 80% to T1 | Predicted active |
| T4 | ☆☆☆ | Similarity 70-80%, scaffold match | Lower confidence |
| T5 | ○○○ | Generated molecule, ADMET-passed, no docking | Speculative |

Docking-Enhanced Grading: When NVIDIA NIM docking is available, compounds gain evidence:
  • Docking > reference → upgrade to T0 (★★★★)
  • Docking within 5% → upgrade to T2 (★★☆)
  • Docking within 20% → maintain current tier
  • Docking >20% worse → downgrade one tier
Apply to all candidate compounds:
| Compound | Evidence | Docking vs Ref | Rationale |
|----------|----------|----------------|-----------|
| CPD-001 | ★★★★ | +8.3% | 85% similar, docking > erlotinib |
| CPD-002 | ★★★ | -2.1% | IC50=45nM, validated by docking |
| CPD-003 | ★★☆ | -4.5% | 78% similar, good docking |
| GEN-001 | ★☆☆ | -15% | Generated, ADMET-passed |
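
The docking-enhanced rules can be written as a small tier-adjustment function. This is one possible reading, not a definitive implementation: tiers are integers 0 (T0, best) to 5 (T5), the docking difference is a signed percentage relative to the reference (positive means better), and "upgrade to T2" is interpreted as never lowering a compound that is already T0 or T1. The `adjust_tier` name is hypothetical.

```python
def adjust_tier(current_tier, docking_delta_pct):
    """Apply docking-enhanced grading to a tier in 0..5 (0 = T0, strongest evidence)."""
    if not 0 <= current_tier <= 5:
        raise ValueError("tier must be between 0 and 5")
    if docking_delta_pct > 0:          # better than reference
        return 0
    if docking_delta_pct >= -5:        # within 5%: at least T2 (assumed: never a downgrade)
        return min(current_tier, 2)
    if docking_delta_pct >= -20:       # within 20%: keep current tier
        return current_tier
    return min(current_tier + 1, 5)    # more than 20% worse: drop one tier, floor at T5
```

For example, a T3 compound docking 4.5% below the reference moves up to T2, while a T1 compound docking 2.1% below the reference stays at T1.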


Mandatory Completeness Checklist

Phase 1: Target Validation

  • UniProt accession resolved
  • ChEMBL target ID obtained
  • Druggability assessed (≥2 sources)
  • Target class identified
  • Binding site information (or "No structural data")

Phase 2: Known Ligands


  • ChEMBL activities queried (≥100 or all available)
  • Activity statistics (count, potency range)
  • Top 10 actives listed with IC50
  • Chemical probes identified (or "None available")
  • SAR insights summarized

Phase 3: Structure


  • PDB structures listed (or "No experimental structure")
  • Best structure for docking identified
  • Binding pocket described (or "Predicted from AlphaFold")

Phase 4: Expansion


  • ≥3 seed compounds used
  • Similarity search completed (≥100 results or exhausted)
  • Substructure search completed
  • Deduplicated candidate count reported

Phase 5: ADMET


  • Physicochemical filters applied
  • Toxicity predictions run
  • Structural alerts checked
  • Filter funnel table included

Phase 6: Prioritization


  • ≥20 candidates ranked (or all if fewer)
  • Scoring methodology explained
  • Synthesis feasibility assessed
  • Scaffold diversity noted

Phase 7: Recommendations


  • ≥3 immediate actions listed
  • Experimental validation plan outlined
  • Data gaps aggregated


Tool Reference by Phase

Phase 1: Target Validation

| Tool | Purpose |
|------|---------|
| `UniProt_search` | Resolve UniProt accession |
| `MyGene_query_genes` | Get Ensembl/NCBI IDs |
| `ChEMBL_search_targets` | Get ChEMBL target ID |
| `OpenTargets_get_target_tractability_by_ensemblID` | Tractability assessment |
| `DGIdb_get_gene_druggability` | Druggability categories |
| `ChEMBL_search_binding_sites` | Binding site info |
| `InterPro_get_protein_domains` | Domain architecture |

Phase 2: Known Ligands

| Tool | Purpose |
|------|---------|
| `ChEMBL_get_target_activities` | Bioactivity data |
| `ChEMBL_get_molecule` | Molecule details |
| `GtoPdb_get_target_interactions` | Pharmacology data |
| `OpenTargets_get_chemical_probes_by_target_ensemblID` | Chemical probes |
| `OpenTargets_get_associated_drugs_by_target_ensemblID` | Known drugs |

Phase 1.4: Structure Prediction (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_alphafold2` | High-accuracy structure prediction with pLDDT |
| `NvidiaNIM_esmfold` | Fast structure prediction (max 1024 AA) |
| `NvidiaNIM_msa_search` | MSA generation for AlphaFold |

Phase 3: Structure

| Tool | Purpose |
|------|---------|
| `PDB_search_similar_structures` | Find PDB structures |
| `get_protein_metadata_by_pdb_id` | Structure metadata |
| `get_binding_affinity_by_pdb_id` | Ligand affinities |
| `alphafold_get_prediction` | Predicted structure (AlphaFold DB) |
| `get_ligand_smiles_by_chem_comp_id` | Ligand structures |
| `emdb_search` | Search cryo-EM structures (NEW) |
| `emdb_get_entry` | Get EMDB entry details (NEW) |

Phase 3.5: Docking Validation (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_diffdock` | Blind molecular docking (PDB + SDF) |
| `NvidiaNIM_boltz2` | Protein-ligand complex (sequence + SMILES) |

Phase 4: Expansion

| Tool | Purpose |
|------|---------|
| `ChEMBL_search_similar_molecules` | Similarity search |
| `PubChem_search_compounds_by_similarity` | PubChem similarity |
| `ChEMBL_search_substructure` | Substructure search |
| `PubChem_search_compounds_by_substructure` | PubChem substructure |
| `STITCH_get_chemical_protein_interactions` | Cross-database |

Phase 4.4: De Novo Generation (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_genmol` | Scaffold hopping with masked regions |
| `NvidiaNIM_molmim` | Controlled generation from reference |

Phase 5: ADMET

| Tool | Purpose |
|------|---------|
| `ADMETAI_predict_physicochemical_properties` | Drug-likeness |
| `ADMETAI_predict_bioavailability` | Oral absorption |
| `ADMETAI_predict_toxicity` | Toxicity flags |
| `ADMETAI_predict_CYP_interactions` | CYP liabilities |
| `ChEMBL_search_compound_structural_alerts` | PAINS, alerts |

Phase 6: Candidate Docking (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_diffdock` | Score all candidates by docking |
| `NvidiaNIM_boltz2` | Alternative docking from SMILES |

Phase 6.5: Literature Evidence (NEW)

| Tool | Purpose |
|------|---------|
| `PubMed_search_articles` | Published SAR studies |
| `BioRxiv_search_preprints` | Latest biology preprints |
| `MedRxiv_search_preprints` | Clinical preprints |
| `openalex_search_works` | Citation analysis |
| `SemanticScholar_search` | AI-ranked papers |

Fallback Chains

| Primary Tool | Fallback 1 | Fallback 2 | Use When |
|--------------|------------|------------|----------|
| `ChEMBL_get_target_activities` | `GtoPdb_get_target_interactions` | `PubChem_search_assays` | No ChEMBL data |
| `ChEMBL_search_similar_molecules` | `PubChem_search_compounds_by_similarity` | `STITCH_get_chemical_protein_interactions` | ChEMBL exhausted |
| `PDB_search_similar_structures` | `NvidiaNIM_alphafold2` | `alphafold_get_prediction` | No PDB structure |
| `alphafold_get_prediction` | `NvidiaNIM_alphafold2` | `NvidiaNIM_esmfold` | AlphaFold DB unavailable |
| `NvidiaNIM_alphafold2` | `NvidiaNIM_esmfold` | `alphafold_get_prediction` | AlphaFold2 NIM error |
| `NvidiaNIM_diffdock` | `NvidiaNIM_boltz2` | Skip docking, use similarity | Docking error |
| `NvidiaNIM_genmol` | `NvidiaNIM_molmim` | Manual scaffold hopping | Generation error |
| `OpenTargets_get_target_tractability` | `DGIdb_get_gene_druggability` | Document "Unknown" | Open Targets error |
| `ADMETAI_*` | SwissADME tools | Basic Lipinski | Invalid SMILES |
| `PDB_search_similar_structures` | `emdb_search` + PDB | `NvidiaNIM_alphafold2` | Membrane proteins |
| `PubMed_search_articles` | `openalex_search_works` | `SemanticScholar_search` | Literature search |
| `BioRxiv_search_preprints` | `MedRxiv_search_preprints` | Skip preprints | Preprint sources |

NVIDIA NIM API Key Required: Tools with the `NvidiaNIM_` prefix require the `NVIDIA_API_KEY` environment variable. Check availability at start:
```python
import os
nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
```

If not available, fall back to non-NIM alternatives
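
One way to apply the fallback chains together with the API key check is a small runner that tries each tool in order and skips `NvidiaNIM_` tools when no key is set. This is a minimal sketch under stated assumptions: `run_with_fallbacks` and the `call_tool` callable are hypothetical stand-ins for however tools are actually invoked, and an empty result is treated as a reason to move on.

```python
import os
from typing import Any, Callable, List, Optional, Sequence, Tuple


def nim_available() -> bool:
    """True when the NVIDIA_API_KEY environment variable is set and non-empty."""
    return bool(os.environ.get("NVIDIA_API_KEY"))


def run_with_fallbacks(
    chain: Sequence[str],
    call_tool: Callable[[str, dict], Any],   # hypothetical tool invoker
    arguments: dict,
    nvidia_ok: Optional[bool] = None,
) -> Tuple[Optional[str], Any, List[Tuple[str, str]]]:
    """Try each tool name in `chain` until one returns a non-empty result.

    Returns (tool_used, result, attempts); `attempts` records why earlier
    tools were skipped or failed, so the report can document the fallback.
    """
    if nvidia_ok is None:
        nvidia_ok = nim_available()
    attempts: List[Tuple[str, str]] = []
    for name in chain:
        if name.startswith("NvidiaNIM_") and not nvidia_ok:
            attempts.append((name, "skipped: NVIDIA_API_KEY not set"))
            continue
        try:
            result = call_tool(name, arguments)
        except Exception as exc:  # any tool failure triggers the next fallback
            attempts.append((name, f"error: {exc}"))
            continue
        if result:
            return name, result, attempts
        attempts.append((name, "empty result"))
    return None, None, attempts
```

Keeping the `attempts` log makes it straightforward to state in the report which primary tool failed and which fallback supplied the data.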


---

Common Use Cases

Well-Characterized Target

User: "Find novel binders for EGFR" → Rich ChEMBL data; focus on novel scaffolds, selectivity, ADMET

Novel Target

User: "Find small molecules for [new target with no known ligands]" → Limited bioactivity; rely on structure-based assessment, similar target ligands

Lead Optimization

User: "Find analogs of compound X for target Y" → Deep similarity search around specific compound; focus on SAR

Selectivity Challenge

User: "Find selective inhibitors for kinase X vs kinase Y" → Include selectivity analysis; filter by off-target predictions

When NOT to Use This Skill

  • Drug research → Use tooluniverse-drug-research (existing drug profiling)
  • Target research only → Use tooluniverse-target-research
  • Single compound ADMET → Call ADMET tools directly
  • Literature search → Use tooluniverse-literature-deep-research
  • Protein structure only → Use tooluniverse-protein-structure-retrieval
Use this skill for discovering new compounds for a protein target.

Additional Resources

  • Checklist: CHECKLIST.md - Pre-delivery verification
  • Examples: EXAMPLES.md - Detailed workflow examples
  • Tool corrections: TOOLS_REFERENCE.md - Parameter corrections