tooluniverse-binder-discovery

Small Molecule Binder Discovery Strategy

Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.

KEY PRINCIPLES:

1. Report-first approach - Create the report file FIRST, then populate it progressively
2. Target validation FIRST - Confirm druggability before compound searching
3. Multi-strategy approach - Combine structure-based and ligand-based methods
4. ADMET-aware filtering - Eliminate poor compounds early
5. Evidence grading - Grade candidates by supporting evidence
6. Actionable output - Provide prioritized candidates with rationale
7. English-first queries - Always use English terms in tool calls, even if the user writes in another language. Only try original-language terms as a fallback. Respond in the user's language.

Critical Workflow Requirements

1. Report-First Approach (MANDATORY)

DO NOT show search process or tool outputs to the user. Instead:

1. Create the report file FIRST - Before any data collection:
   - File name: `[TARGET]_binder_discovery_report.md`
   - Initialize with all section headers from the template
   - Add placeholder text `[Researching...]` in each section
2. Progressively update the report - As you gather data:
   - Update each section with findings immediately
   - The user sees the report growing, not the search process
3. Output separate data files:
   - `[TARGET]_candidate_compounds.csv` - Prioritized compounds with SMILES, scores
   - `[TARGET]_bibliography.json` - Literature references (optional)
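The report-first steps above can be sketched as two small helpers. This is a minimal illustration, not part of ToolUniverse: the function names `init_report` and `update_section`, the `##` heading prefix, and the idea of passing the template headers as a plain list are all assumptions.

```python
from pathlib import Path

PLACEHOLDER = "[Researching...]"


def init_report(target, sections, out_dir="."):
    """Create [TARGET]_binder_discovery_report.md with one placeholder per section."""
    path = Path(out_dir) / f"{target}_binder_discovery_report.md"
    lines = [f"# {target} Binder Discovery Report", ""]
    for title in sections:
        lines += [f"## {title}", "", PLACEHOLDER, ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


def update_section(path, title, body):
    """Replace the placeholder under one section header with findings."""
    text = Path(path).read_text(encoding="utf-8")
    old = f"## {title}\n\n{PLACEHOLDER}"
    if old not in text:
        raise ValueError(f"section {title!r} is missing or already filled")
    new = f"## {title}\n\n{body}"
    Path(path).write_text(text.replace(old, new, 1), encoding="utf-8")
```

Calling `init_report("EGFR", [...])` before any tool call, then `update_section` after each phase, keeps the file growing while the search process stays hidden.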

2. Citation Requirements (MANDATORY)

Every piece of information MUST include its source:

3.2 Known Inhibitors

| Compound | ChEMBL ID | IC50 (nM) | Selectivity | Source |
|----------|-----------|-----------|-------------|--------|
| Imatinib | CHEMBL941 | 38 | ABL-selective | ChEMBL |
| Dasatinib | CHEMBL1421 | 0.5 | Multi-kinase | ChEMBL |

Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL1862)

---

Workflow Overview

Phase 0: Tool Verification (check parameter names)
    ↓
Phase 1: Target Validation
    ├─ 1.1 Resolve identifiers (UniProt, Ensembl, ChEMBL target ID)
    ├─ 1.2 Assess druggability/tractability
    │   └─ 1.2.5 Check therapeutic antibodies (Thera-SAbDab) [NEW]
    ├─ 1.3 Identify binding sites
    └─ 1.4 Predict structure (NvidiaNIM_alphafold2/esmfold)
    ↓
Phase 2: Known Ligand Mining
    ├─ Extract ChEMBL bioactivity data
    ├─ Get GtoPdb interactions
    ├─ Identify chemical probes
    ├─ BindingDB affinity data (NEW - Ki/IC50/Kd)
    ├─ PubChem BioAssay HTS data (NEW - screening hits)
    └─ Analyze SAR from known actives
    ↓
Phase 3: Structure Analysis
    ├─ Get PDB structures with ligands
    ├─ Check EMDB for cryo-EM structures (NEW - for membrane targets)
    ├─ Analyze binding pocket
    └─ Identify key interactions
    ↓
Phase 3.5: Docking Validation (NvidiaNIM_diffdock/boltz2) [NEW]
    ├─ Dock reference inhibitor
    └─ Validate binding pocket geometry
    ↓
Phase 4: Compound Expansion
    ├─ 4.1-4.3 Similarity/substructure search
    └─ 4.4 De novo generation (NvidiaNIM_genmol/molmim) [NEW]
    ↓
Phase 5: ADMET Filtering
    ├─ Predict physicochemical properties
    ├─ Predict ADMET endpoints
    └─ Flag liabilities
    ↓
Phase 6: Candidate Docking & Prioritization
    ├─ Dock all candidates (NvidiaNIM_diffdock/boltz2) [UPDATED]
    ├─ Score by docking + ADMET + novelty
    ├─ Assess synthesis feasibility
    └─ Generate final ranked list
    ↓
Phase 7: Report Synthesis

Phase 0: Tool Verification

CRITICAL: Verify tool parameters before calling unfamiliar tools.

python

# Check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="ChEMBL_get_target_activities")

Known Parameter Corrections

| Tool | WRONG Parameter | CORRECT Parameter |
|------|-----------------|-------------------|
| `OpenTargets_get_target_tractability_by_ensemblID` | `ensembl_id` | `ensemblId` |
| `ChEMBL_get_target_activities` | `chembl_target_id` | `target_chembl_id` |
| `ChEMBL_search_similar_molecules` | `smiles` | `molecule` (accepts SMILES, ChEMBL ID, or name) |
| `alphafold_get_prediction` | `uniprot` | `accession` |
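One way to apply these corrections automatically is a rename table consulted before each call. The sketch below is illustrative only: `PARAM_CORRECTIONS` and `fix_params` are hypothetical names, and the mapping simply restates the corrections listed above.

```python
# Wrong -> correct parameter names, copied from the corrections table above.
PARAM_CORRECTIONS = {
    "OpenTargets_get_target_tractability_by_ensemblID": {"ensembl_id": "ensemblId"},
    "ChEMBL_get_target_activities": {"chembl_target_id": "target_chembl_id"},
    "ChEMBL_search_similar_molecules": {"smiles": "molecule"},
    "alphafold_get_prediction": {"uniprot": "accession"},
}


def fix_params(tool_name, params):
    """Return a copy of params with known-wrong names renamed for this tool."""
    renames = PARAM_CORRECTIONS.get(tool_name, {})
    return {renames.get(name, name): value for name, value in params.items()}
```

Tools without an entry pass through unchanged, so the helper is safe to wrap around every call.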

Phase 1: Target Validation

1.1 Identifier Resolution Chain

1. UniProt_search(query=target_name, organism="human")
   └─ Extract: UniProt accession, gene name, protein name

2. MyGene_query_genes(q=gene_symbol, species="human")
   └─ Extract: Ensembl gene ID, NCBI gene ID

3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens")
   └─ Extract: ChEMBL target ID, target type

4. GtoPdb_get_targets(query=target_name)
   └─ Extract: GtoPdb target ID (if GPCR/ion channel/enzyme)

Store all IDs for downstream queries:

ids = {
    'uniprot': 'P00533',
    'ensembl': 'ENSG00000146648',
    'chembl_target': 'CHEMBL203',
    'gene_symbol': 'EGFR',
    'gtopdb': '1797'  # if available
}
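Before moving to downstream phases, it can help to confirm that the stored IDs are present and well-formed. The following is a rough sketch: `check_ids` is a hypothetical helper, the UniProt pattern follows the published accession format, and the Ensembl pattern assumes unversioned human gene IDs.

```python
import re

ID_PATTERNS = {
    # UniProt accession format (6 or 10 characters)
    "uniprot": re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    ),
    # Unversioned human Ensembl gene ID (assumption: no ".18"-style suffix)
    "ensembl": re.compile(r"^ENSG\d{11}$"),
    "chembl_target": re.compile(r"^CHEMBL\d+$"),
}


def check_ids(ids):
    """Return a list of problems; an empty list means the IDs look usable."""
    problems = []
    for key, pattern in ID_PATTERNS.items():
        value = ids.get(key)
        if not value:
            problems.append(f"missing {key}")
        elif not pattern.match(value):
            problems.append(f"malformed {key}: {value}")
    return problems
```

A gene symbol accidentally stored under `uniprot` is reported as malformed, which catches a common mix-up early.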

1.2 Druggability Assessment

Multi-Source Triangulation:

1. OpenTargets_get_target_tractability_by_ensemblID(ensemblId)
   └─ Extract: Small molecule tractability score, bucket
   
2. DGIdb_get_gene_druggability(genes=[gene_symbol])
   └─ Extract: Druggability categories, known drug count
   
3. OpenTargets_get_target_classes_by_ensemblID(ensemblId)
   └─ Extract: Target class (kinase, GPCR, etc.)

4. GPCRdb_get_protein(protein=entry_name)  # NEW - for GPCRs
   └─ Extract: GPCR family, receptor state, ligand binding data

1.2a GPCRdb Integration (NEW - for GPCR Targets)

~35% of all approved drugs target GPCRs. For GPCR targets, use specialized data:

python

def check_if_gpcr_and_enrich(tu, target_name, uniprot_id):
    """Check if target is GPCR and get specialized data."""
    
    # Build GPCRdb entry name (e.g., "adrb2_human")
    entry_name = f"{target_name.lower()}_human"
    
    # Check if it's a GPCR
    gpcr_info = tu.tools.GPCRdb_get_protein(
        operation="get_protein",
        protein=entry_name
    )
    
    if gpcr_info.get('status') == 'success':
        # It's a GPCR - get specialized data
        
        # Get known structures (active/inactive states)
        structures = tu.tools.GPCRdb_get_structures(
            operation="get_structures",
            protein=entry_name
        )
        
        # Get known ligands
        ligands = tu.tools.GPCRdb_get_ligands(
            operation="get_ligands",
            protein=entry_name
        )
        
        # Get mutation data (important for SAR)
        mutations = tu.tools.GPCRdb_get_mutations(
            operation="get_mutations",
            protein=entry_name
        )
        
        return {
            'is_gpcr': True,
            'gpcr_family': gpcr_info.get('data', {}).get('family'),
            'gpcr_class': gpcr_info.get('data', {}).get('receptor_class'),
            # Follow-up calls may fail independently, so fall back to empty lists
            'structures': structures.get('data', {}).get('structures', []),
            'ligands': ligands.get('data', {}).get('ligands', []),
            'mutation_data': mutations.get('data', {}).get('mutations', [])
        }
    
    return {'is_gpcr': False}

GPCRdb Advantages:

- GPCR-specific sequence alignments (Ballesteros-Weinstein numbering)
- Active vs. inactive state structures
- Curated ligand binding data
- Experimental mutation effects on ligand binding

Druggability Scorecard:

| Factor | Assessment | Score |
|--------|------------|-------|
| Known small molecule drugs | Yes (3+) | ★★★ |
| Tractability bucket | 1-3 | ★★☆-★★★ |
| Target class | Enzyme/GPCR/Ion channel | ★★★ |
| Binding site known | Yes (X-ray) | ★★★ |
| GPCRdb ligands available | Yes (10+) | ★★★ (GPCR only) |
| Therapeutic antibodies exist | Check Thera-SAbDab | See 1.2.5 |

Decision Point: If druggability score < ★★☆, warn user about challenges.
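The scorecard does not say how per-factor stars combine into one score. As a sketch under an assumed rule (average the factor scores and round down, so borderline targets are flagged), the decision point could look like this; `stars` and `druggability_summary` are hypothetical names.

```python
def stars(n):
    """Render an integer 0-3 as a three-star rating string."""
    return "★" * n + "☆" * (3 - n)


def druggability_summary(factor_scores):
    """Combine per-factor star counts (0-3) into an overall rating.

    Assumption: the overall score is the mean of the factors, rounded down.
    Returns (rating_string, warn), where warn is True below two stars.
    """
    if not factor_scores:
        raise ValueError("no factors scored")
    mean = sum(factor_scores.values()) / len(factor_scores)
    overall = int(mean)  # round down so borderline targets are flagged
    return stars(overall), overall < 2
```

A stricter alternative would take the minimum factor score instead of the mean; which rule fits depends on how much weight a single weak factor should carry.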

1.2.5 Therapeutic Antibody Landscape (NEW)

Check if therapeutic antibodies already target this protein - important for:

- Understanding competitive landscape
- Validating target tractability (if antibodies work, target is validated)
- Identifying potential combination approaches

python

def check_therapeutic_antibodies(tu, target_name):
    """
    Check Thera-SAbDab for therapeutic antibodies against target.
    """
    # Search by target name
    results = tu.tools.TheraSAbDab_search_by_target(
        target=target_name
    )
    
    if results.get('status') == 'success':
        antibodies = results['data'].get('therapeutics', [])
        
        # Categorize by clinical stage
        by_phase = {'Approved': [], 'Phase 3': [], 'Phase 2': [], 'Phase 1': [], 'Preclinical': []}
        for ab in antibodies:
            phase = ab.get('phase', 'Unknown')
            for key in by_phase.keys():
                if key.lower() in phase.lower():
                    by_phase[key].append(ab)
                    break
        
        return {
            'total_antibodies': len(antibodies),
            'by_phase': by_phase,
            'antibodies': antibodies[:10],  # Top 10
            'competitive_alert': len(by_phase.get('Approved', [])) > 0
        }
    return None

def get_antibody_landscape(tu, target_name, uniprot_id=None):
    """
    Comprehensive antibody competitive landscape.
    """
    # Start from the target-name search (top hits only)
    by_target = check_therapeutic_antibodies(tu, target_name) or {}
    all_antibodies = list(by_target.get('antibodies', []))
    
    # Also search by common synonyms; skip the accession when it was not supplied
    synonyms = [target_name]
    if uniprot_id and uniprot_id != target_name:
        synonyms.append(uniprot_id)
    
    for synonym in synonyms:
        results = tu.tools.TheraSAbDab_search_therapeutics(query=synonym)
        if results.get('status') == 'success':
            all_antibodies.extend(results['data'].get('therapeutics', []))
    
    # Deduplicate by INN
    seen = set()
    unique = []
    for ab in all_antibodies:
        inn = ab.get('inn_name')
        if inn and inn not in seen:
            seen.add(inn)
            unique.append(ab)
    
    return {
        'antibodies': unique,
        'count': len(unique),
        # Substring match, consistent with the phase bucketing above
        'has_approved': any('approved' in (ab.get('phase') or '').lower() for ab in unique),
        'source': 'Thera-SAbDab'
    }

Report Output:

1.2.5 Therapeutic Antibody Landscape (NEW)

Thera-SAbDab Search Results:

| Antibody (INN) | Target | Format | Phase | PDB |
|----------------|--------|--------|-------|-----|
| Pembrolizumab | PD-1 | IgG4 | Approved | 5DK3 |
| Nivolumab | PD-1 | IgG4 | Approved | 5WT9 |
| Cemiplimab | PD-1 | IgG4 | Approved | 7WVM |

Competitive Landscape: ⚠️ 3 approved antibodies target this protein
Strategic Implication: Small molecule approach offers differentiation (oral dosing, CNS penetration, cost)

Source: Thera-SAbDab via `TheraSAbDab_search_by_target`


**Why Include Antibody Landscape**:
- **Validation**: Approved antibodies = validated target
- **Competition**: Understand what's already in market/clinic
- **Strategy**: Identify gaps (no oral, no CNS-penetrant)
- **Synergy**: Potential combination opportunities

1.3 Binding Site Analysis

1. ChEMBL_search_binding_sites(target_chembl_id)
   └─ Extract: Binding site names, types
   
2. get_binding_affinity_by_pdb_id(pdb_id)  # For each PDB with ligand
   └─ Extract: Kd, Ki, IC50 values for co-crystallized ligands
   
3. InterPro_get_protein_domains(uniprot_accession)
   └─ Extract: Domain architecture, active sites

Output for Report:

1.3 Binding Site Assessment

Known Binding Sites:

| Site | Type | Evidence | Key Residues | Source |
|------|------|----------|--------------|--------|
| ATP pocket | Orthosteric | X-ray (23 PDBs) | K745, E762, M793 | PDB/ChEMBL |
| Allosteric pocket | Allosteric | X-ray (3 PDBs) | T790, C797 | PDB |

Binding Site Druggability: ★★★ (well-defined pocket, multiple co-crystal structures)

Source: ChEMBL via `ChEMBL_search_binding_sites`, PDB structures

1.4 Structure Prediction (NVIDIA NIM)

Use this step when no experimental structure is available, or for custom domain predictions.

Requires: `NVIDIA_API_KEY` environment variable

Option A: AlphaFold2 (High accuracy, async)

NvidiaNIM_alphafold2(
    sequence=kinase_domain_sequence,
    algorithm="mmseqs2",
    relax_prediction=False
)
└─ Returns: PDB structure with pLDDT confidence scores
└─ Use when: Accuracy is critical, time is available (~5-15 min)

Option B: ESMFold (Fast, synchronous)

NvidiaNIM_esmfold(sequence=kinase_domain_sequence)
└─ Returns: PDB structure (max 1024 AA)
└─ Use when: Quick assessment needed (~30 sec)
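The choice between the two options can be written as a small rule. This is a sketch only: `choose_structure_tool` is a hypothetical helper, and it encodes just the two criteria stated above (the 1024-residue ESMFold limit and whether accuracy outweighs speed).

```python
ESMFOLD_MAX_LENGTH = 1024  # residue limit stated above for NvidiaNIM_esmfold


def choose_structure_tool(sequence, need_high_accuracy=False):
    """Pick a predictor name from sequence length and accuracy needs."""
    if need_high_accuracy or len(sequence) > ESMFOLD_MAX_LENGTH:
        return "NvidiaNIM_alphafold2"  # slower (minutes), higher accuracy
    return "NvidiaNIM_esmfold"  # fast (seconds), length-limited
```

For a long target, passing only the domain of interest (for example a kinase domain) keeps the sequence under the ESMFold limit.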

Report pLDDT Confidence:

1.4 Structure Prediction Quality

Method: AlphaFold2 via NVIDIA NIM
Mean pLDDT: 90.94 (very high confidence)

| Confidence Level | Range | Fraction | Interpretation |
|------------------|-------|----------|----------------|
| Very High | ≥90 | 74.3% | Highly reliable |
| Confident | 70-90 | 16.0% | Reliable |
| Low | 50-70 | 9.0% | Use caution |
| Very Low | <50 | 0.7% | Unreliable |

Key Binding Residue Confidence:

| Residue | Function | pLDDT |
|---------|----------|-------|
| K745 | ATP binding | 90.0 |
| T790 | Gatekeeper | 92.3 |
| M793 | Hinge region | 95.3 |
| D855 | DFG motif | 89.5 |

Source: NVIDIA NIM via `NvidiaNIM_alphafold2`
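The confidence table can be computed from per-residue pLDDT values with a simple binning step. The sketch below assumes the scores have already been extracted into a list of floats (how they are read from the tool output is not shown); `plddt_level` and `plddt_summary` are hypothetical names.

```python
LEVELS = ("Very High", "Confident", "Low", "Very Low")


def plddt_level(score):
    """Map one pLDDT value to the confidence level used in the report table."""
    if score >= 90:
        return "Very High"
    if score >= 70:
        return "Confident"
    if score >= 50:
        return "Low"
    return "Very Low"


def plddt_summary(scores):
    """Return (mean pLDDT, percentage of residues per confidence level)."""
    if not scores:
        raise ValueError("no pLDDT scores supplied")
    counts = {level: 0 for level in LEVELS}
    for score in scores:
        counts[plddt_level(score)] += 1
    total = len(scores)
    fractions = {level: round(100 * n / total, 1) for level, n in counts.items()}
    return round(sum(scores) / total, 2), fractions
```

Boundary values fall into the higher band here, so a residue at exactly 90.0 is reported as Very High.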

---

Phase 2: Known Ligand Mining

2.1 ChEMBL Bioactivity Data

1. ChEMBL_get_target_activities(target_chembl_id, limit=500)
   └─ Filter: standard_type in ["IC50", "Ki", "Kd", "EC50"]
   └─ Filter: standard_value < 10000 nM
   └─ Extract: ChEMBL molecule IDs, SMILES, potency values

2. ChEMBL_get_molecule(molecule_chembl_id)  # For top actives
   └─ Extract: Full molecular data, max_phase, oral flag
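The two filters in step 1 can be sketched as a plain function over activity records. It assumes each record is a dict with the ChEMBL activity fields `standard_type`, `standard_value`, and `standard_units`; the exact shape returned by the ToolUniverse wrapper is not confirmed here, and `filter_activities` is a hypothetical name.

```python
POTENCY_TYPES = {"IC50", "Ki", "Kd", "EC50"}


def filter_activities(activities, max_nm=10000):
    """Keep potency records below max_nm (in nM), most potent first."""
    kept = []
    for act in activities:
        if act.get("standard_type") not in POTENCY_TYPES:
            continue
        if act.get("standard_units") != "nM":
            continue  # skip percent inhibition and other units
        try:
            value = float(act.get("standard_value"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric value
        if value < max_nm:
            kept.append({**act, "standard_value": value})
    return sorted(kept, key=lambda a: a["standard_value"])
```

Checking units explicitly avoids comparing a percent-inhibition value against a nanomolar cutoff.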

Activity Summary Table:

2.1 Known Active Compounds (ChEMBL)

Total Bioactivity Points: 2,847 (IC50: 1,234 | Ki: 892 | Kd: 456 | EC50: 265)
Compounds with IC50 < 100 nM: 156
Approved Drugs for This Target: 5

| Compound | ChEMBL ID | IC50 (nM) | Max Phase | SMILES (truncated) |
|----------|-----------|-----------|-----------|--------------------|
| Erlotinib | CHEMBL553 | 2 | 4 | COc1cc2ncnc(Nc3ccc... |
| Gefitinib | CHEMBL939 | 5 | 4 | COc1cc2ncnc(Nc3ccc... |
| [Novel] | CHEMBL123 | 12 | 0 | c1ccc(NC(=O)c2ccc... |

Source: ChEMBL via `ChEMBL_get_target_activities` (CHEMBL203)

2.2 GtoPdb Interactions

GtoPdb_get_target_interactions(target_id)
└─ Extract: Ligands with pKi/pIC50, selectivity data
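GtoPdb reports potency on the negative log scale (pKi, pIC50), while the ChEMBL and BindingDB steps use nM. To compare them in one table, convert with p = 9 - log10(value in nM). A minimal sketch with hypothetical helper names:

```python
import math


def p_to_nm(p_value):
    """Convert pKi/pIC50 (negative log10 of molar) to nanomolar."""
    return 10 ** (9 - p_value)


def nm_to_p(nm):
    """Convert a nanomolar potency to the pKi/pIC50 scale."""
    if nm <= 0:
        raise ValueError("potency must be positive")
    return 9 - math.log10(nm)
```

On this scale a pKi of 9 corresponds to 1 nM and a pKi of 6 to 1,000 nM.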

2.3 Chemical Probes

OpenTargets_get_chemical_probes_by_target_ensemblID(ensemblId)
└─ Extract: Validated chemical probes with ratings

Output for Report:

2.3 Chemical Probes

| Probe | Target | Rating | Use | Caveat | Source |
|-------|--------|--------|-----|--------|--------|
| Probe-X | EGFR | ★★★★ | In vivo | None | Chemical Probes Portal |
| Probe-Y | EGFR | ★★★☆ | In vitro | Off-target kinase activity | Open Targets |

Recommended Probe for Target Validation: Probe-X (highest rating, validated in vivo)

2.4 SAR Analysis from Actives

Identify common scaffolds and SAR trends:

2.4 Structure-Activity Relationships

Core Scaffolds Identified:

1. 4-Anilinoquinazoline (34 compounds, IC50 range: 2-500 nM)
   - N1 position: Aryl preferred
   - C6/C7: Methoxy groups improve potency
2. Pyrimidine-amine (12 compounds, IC50 range: 15-800 nM)
   - Less potent than quinazolines
   - Better selectivity profile

Key SAR Insights:

- Halogen at meta position of aniline increases potency 3-5x
- C7 ethoxy group critical for binding (H-bond to M793)

2.5 BindingDB Affinity Data (NEW)

BindingDB provides experimental binding affinity data complementary to ChEMBL:

python

def get_bindingdb_ligands(tu, uniprot_id, affinity_cutoff=10000):
    """
    Get ligands from BindingDB with measured affinities.
    
    BindingDB advantages:
    - May have compounds not in ChEMBL
    - Different affinity types (Ki, IC50, Kd)
    - Direct literature links
    """
    
    result = tu.tools.BindingDB_get_ligands_by_uniprot(
        uniprot=uniprot_id,
        affinity_cutoff=affinity_cutoff  # nM
    )
    
    if result:
        ligands = []
        for entry in result:
            ligands.append({
                'smiles': entry.get('smile'),
                'affinity_type': entry.get('affinity_type'),
                'affinity_nM': entry.get('affinity'),
                'pmid': entry.get('pmid'),
                'monomer_id': entry.get('monomerid')
            })
        
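        # Sort by potency; values may carry qualifiers such as ">" or "<"
        def potency_nm(ligand):
            try:
                return float(str(ligand['affinity_nM']).lstrip('<>=~ '))
            except (TypeError, ValueError):
                return 1e6  # missing or unparseable values sort last
        
        ligands.sort(key=potency_nm)
        return ligands[:50]  # Top 50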
    
    return []

def find_compound_polypharmacology(tu, smiles, similarity_cutoff=0.85):
    """Find off-target interactions for selectivity analysis."""
    
    targets = tu.tools.BindingDB_get_targets_by_compound(
        smiles=smiles,
        similarity_cutoff=similarity_cutoff
    )
    
    return targets  # Other proteins this compound may bind

BindingDB Output for Report:

2.5 Additional Ligands (BindingDB) (NEW)

Total Unique Ligands: 89 (non-overlapping with ChEMBL)
Most Potent: 0.3 nM Ki

| SMILES | Affinity Type | Value (nM) | PMID | BindingDB ID |
|--------|---------------|------------|------|--------------|
| CC(C)Cc1ccc... | Ki | 0.3 | 15737014 | 12345 |
| COc1cc2ncnc... | IC50 | 2.1 | 16460808 | 12346 |

Novel Scaffolds from BindingDB: 3 scaffolds not seen in ChEMBL data

Source: BindingDB via `BindingDB_get_ligands_by_uniprot`

2.6 PubChem BioAssay Screening Data (NEW)

PubChem BioAssay provides HTS screening results and dose-response data:

python

def get_pubchem_assays_for_target(tu, gene_symbol):
    """
    Get bioassays and active compounds from PubChem.
    
    Advantages:
    - HTS data not in ChEMBL
    - NIH-funded screening programs (MLPCN)
    - Dose-response curves for IC50 calculation
    """
    
    # Search assays targeting this gene
    assays = tu.tools.PubChem_search_assays_by_target_gene(
        gene_symbol=gene_symbol
    )
    
    results = {
        'assays': [],
        'total_active_compounds': 0
    }
    
    if assays.get('data', {}).get('aids'):
        for aid in assays['data']['aids'][:10]:  # Top 10 assays
            # Get assay summary
            summary = tu.tools.PubChem_get_assay_summary(aid=aid)
            
            # Get active compounds
            actives = tu.tools.PubChem_get_assay_active_compounds(aid=aid)
            active_cids = actives.get('data', {}).get('cids', [])
            
            results['assays'].append({
                'aid': aid,
                'summary': summary.get('data', {}),
                'active_count': len(active_cids)
            })
            results['total_active_compounds'] += len(active_cids)
    
    return results

def get_dose_response_data(tu, aid):
    """Get dose-response curves for IC50/EC50 determination."""
    
    dr_data = tu.tools.PubChem_get_assay_dose_response(aid=aid)
    return dr_data

def get_compound_bioactivity_profile(tu, cid):
    """Get all bioactivity data for a compound."""
    
    profile = tu.tools.PubChem_get_compound_bioactivity(cid=cid)
    return profile

PubChem BioAssay Output for Report:

2.6 PubChem HTS Screening Data (NEW)

Assays Found: 45
Total Active Compounds Across Assays: ~1,200

| AID | Assay Type | Active Compounds | Target | Description |
|-----|------------|------------------|--------|-------------|
| 504526 | HTS | 234 | EGFR | qHTS inhibition screen |
| 1053104 | Dose-response | 12 | EGFR kinase | Confirmatory IC50 |
| 651564 | Cellular | 8 | EGFR | Cell proliferation assay |

Novel Actives (not in ChEMBL/BindingDB):

- CID 12345678: Active in AID 504526, IC50 = 45 nM
- CID 23456789: Active in AID 1053104, IC50 = 120 nM

Source: PubChem via `PubChem_search_assays_by_target_gene`, `PubChem_get_assay_active_compounds`

**Why Use Both BindingDB and PubChem**:
| Source | Strengths | Best For |
|--------|-----------|----------|
| **ChEMBL** | Curated, standardized, SAR data | Primary ligand source |
| **BindingDB** | Direct affinity measurements | Ki/Kd values, PMIDs |
| **PubChem BioAssay** | HTS data, NIH screens | Novel scaffolds, broad coverage |
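
Combining the three sources requires deduplication, since the same compound can appear in more than one. The sketch below merges records keyed on the SMILES string and keeps the best potency and the list of contributing sources. It assumes every record has already been normalized to `smiles` and `affinity_nM` keys, and it matches SMILES by exact string; real data would need canonicalization (for example with a cheminformatics toolkit) before matching. `merge_ligands` is a hypothetical name.

```python
def merge_ligands(*sources):
    """Merge (source_name, records) pairs into one potency-sorted list.

    Each record needs 'smiles' and a numeric 'affinity_nM'.
    """
    merged = {}
    for source_name, records in sources:
        for rec in records:
            key = rec["smiles"]  # assumption: SMILES already canonical
            entry = merged.setdefault(
                key, {"smiles": key, "best_nM": rec["affinity_nM"], "sources": []}
            )
            entry["best_nM"] = min(entry["best_nM"], rec["affinity_nM"])
            if source_name not in entry["sources"]:
                entry["sources"].append(source_name)
    return sorted(merged.values(), key=lambda e: e["best_nM"])
```

Compounds whose `sources` list lacks ChEMBL are the candidates for the "novel actives" rows in the report.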

---

Phase 3: Structure Analysis

3.1 PDB Structure Retrieval

1. PDB_search_similar_structures(query=uniprot_accession, type="sequence")
   └─ Extract: PDB IDs with ligands
   
2. get_protein_metadata_by_pdb_id(pdb_id)
   └─ Extract: Resolution, method, ligand codes
   
3. alphafold_get_prediction(accession=uniprot_accession)
   └─ Extract: Predicted structure (if no experimental)

3.1b EMDB Cryo-EM Structures (NEW)

Prioritize EMDB for: Membrane proteins (GPCRs, ion channels), large complexes, targets with multiple conformational states.

python

def get_cryoem_structures(tu, target_name, uniprot_accession):
    """Get cryo-EM structures for membrane targets."""
    
    # Search EMDB
    emdb_results = tu.tools.emdb_search(
        query=f"{target_name} membrane receptor"
    )
    
    structures = []
    for entry in emdb_results[:5]:
        details = tu.tools.emdb_get_entry(entry_id=entry['emdb_id'])
        
        # Get associated PDB model (essential for docking)
        pdb_models = details.get('pdb_ids', [])
        
        structures.append({
            'emdb_id': entry['emdb_id'],
            'resolution': entry.get('resolution', 'N/A'),
            'title': entry.get('title', 'N/A'),
            'conformational_state': details.get('state', 'Unknown'),
            'pdb_models': pdb_models
        })
    
    return structures

When to use cryo-EM over X-ray:

| Target Type | Prefer cryo-EM? | Reason |
|-------------|-----------------|--------|
| GPCR | Yes | Native membrane conformation |
| Ion channel | Yes | Multiple functional states |
| Receptor-ligand complex | Yes | Physiological state |
| Kinase | Usually X-ray | Higher resolution typically |

Structure Summary:

3.1 Available Structures

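| PDB ID | Resolution | Method | Ligand | Affinity | State |
|--------|------------|--------|--------|----------|-------|
| 1M17 | 2.6 Å | X-ray | Erlotinib | Ki=0.4 nM | Active |
| 4HJO | 2.1 Å | X-ray | Lapatinib | Ki=3 nM | Inactive |
| AF-P00533 | - | Predicted | None | - | - |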

3.1b Cryo-EM Structures (EMDB)

| EMDB ID | Resolution | PDB Model | Conformation | Ligand |
|---------|------------|-----------|--------------|--------|
| EMD-12345 | 3.2 Å | 7ABC | Active | Agonist |
| EMD-23456 | 3.5 Å | 8DEF | Inactive | Antagonist |

Best Structure for Docking: 1M17 (high resolution, relevant ligand)
Source: RCSB PDB, EMDB, AlphaFold DB

3.2 Binding Pocket Analysis

get_binding_affinity_by_pdb_id(pdb_id)
└─ Extract: Binding affinities for co-crystallized ligands

Output for Report:

3.2 Binding Pocket Characterization

Pocket Volume: ~850 Å³ (well-defined)

Key Interaction Residues:

- Hinge region: M793 (backbone H-bond donor/acceptor)
- Gatekeeper: T790 (small residue, allows access)
- DFG motif: D855 (active conformation)
- Selectivity pocket: L788, G796 (unique to EGFR)

Druggability Assessment: High (enclosed pocket, conserved interactions)

---

Phase 3.5: Docking Validation (NVIDIA NIM)

Validate structure and score compounds using molecular docking.

Requires: `NVIDIA_API_KEY` environment variable

3.5.1 Reference Compound Docking

Dock a known inhibitor to validate that the structure captures the binding pocket correctly.

Option A: DiffDock (Blind docking, PDB + SDF input)

NvidiaNIM_diffdock(
    protein=pdb_content,        # PDB text content
    ligand=reference_sdf,       # SDF/MOL2 content
    num_poses=10
)
└─ Returns: Docking poses with confidence scores
└─ Use: When you have PDB structure and ligand SDF file

Option B: Boltz2 (From sequence + SMILES)

NvidiaNIM_boltz2(
    polymers=[{"molecule_type": "protein", "sequence": kinase_sequence}],
    ligands=[{"smiles": "COc1cc2ncnc(Nc3ccc(C#C)cc3)c2cc1OCCOC"}],
    sampling_steps=50,
    diffusion_samples=1
)
└─ Returns: Protein-ligand complex structure
└─ Use: When starting from SMILES, no SDF needed

3.5.2 Docking Score Interpretation

| Score vs Reference | Priority | Symbol |
|--------------------|----------|--------|
| Higher than reference | Top priority | ★★★★ |
| Within 5% of reference | High priority | ★★★ |
| Within 20% of reference | Moderate priority | ★★☆ |
| >20% lower | Low priority | ★☆☆ |
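
The thresholds in this table can be applied mechanically. The following is a minimal sketch, not part of ToolUniverse: the function name `docking_priority` is hypothetical, and it assumes that higher scores are better (as with a DiffDock-style confidence) and that the reference score is positive so a percentage difference is meaningful.

```python
def docking_priority(candidate_score: float, reference_score: float) -> str:
    """Map a candidate's docking score to a priority label relative to a reference.

    Assumption: higher score = better pose, and reference_score > 0.
    """
    if reference_score <= 0:
        raise ValueError("reference_score must be positive for a percentage comparison")
    # Signed relative difference: +0.08 means 8% better than the reference.
    relative = (candidate_score - reference_score) / reference_score
    if relative > 0:
        return "Top priority"
    if relative >= -0.05:
        return "High priority"
    if relative >= -0.20:
        return "Moderate priority"
    return "Low priority"


if __name__ == "__main__":
    for score in (1.00, 0.88, 0.80, 0.50):
        print(score, docking_priority(score, reference_score=0.90))
```

If the docking tool reports energies where lower is better, the sign of the comparison would need to be flipped before using this mapping.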

Report Format:

3.5 Docking Validation Results

Reference Compound: Erlotinib
Method: DiffDock via NVIDIA NIM

| Metric | Value | Interpretation |
|--------|-------|----------------|
| Best Pose Confidence | 0.906 | Excellent |
| Steric Clashes | None | Clean binding pose |

Validation Status: ✓ Structure captures binding pocket correctly

Source: NVIDIA NIM via `NvidiaNIM_diffdock`

---

Phase 4: Compound Expansion

4.1 Similarity Search

Starting from top actives, expand chemical space:

1. ChEMBL_search_similar_molecules(molecule=top_active_smiles, similarity=70)
   └─ Extract: Similar compounds not yet tested on target
   
2. PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
   └─ Extract: PubChem CIDs with similar structures

Strategy:

Use 3-5 diverse actives as seeds
Similarity threshold: 70-85% (balance novelty vs. activity)
Prioritize compounds NOT in ChEMBL bioactivity for target
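
To make the 70-85% window concrete, here is a small sketch of Tanimoto similarity on fingerprints represented as sets of "on" bit indices. It is an illustration only: the search tools above compute similarity server-side, and the helper names `tanimoto` and `in_expansion_window` are hypothetical.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared bits divided by the union of bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def in_expansion_window(seed_fp: set, candidate_fp: set,
                        low: float = 0.70, high: float = 0.85) -> bool:
    """True if the candidate is similar enough to be plausible but not a near-duplicate."""
    return low <= tanimoto(seed_fp, candidate_fp) <= high


if __name__ == "__main__":
    seed = set(range(10))        # toy fingerprint with 10 bits set
    analog = set(range(8))       # shares 8 of those bits
    print(tanimoto(seed, analog), in_expansion_window(seed, analog))
```

Compounds scoring above the upper bound are usually close analogs of the seed, so excluding them biases the expansion toward novelty.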

4.2 Substructure Search

1. ChEMBL_search_substructure(smiles=core_scaffold)
   └─ Extract: Compounds containing scaffold
   
2. PubChem_search_compounds_by_substructure(smiles=core_scaffold)
   └─ Extract: Additional scaffold-containing compounds

4.3 Cross-Database Mining

1. STITCH_get_chemical_protein_interactions(identifier=target_gene)
   └─ Extract: Additional chemical-protein links
   
2. DGIdb_get_drug_gene_interactions(genes=[gene_symbol])
   └─ Extract: Approved/investigational drugs

Output for Report:

4. Compound Expansion Results

Starting Seeds: 5 diverse actives (IC50 < 100 nM)
Similarity Expansion: 847 compounds (70% threshold)
Substructure Search: 234 scaffold matches
Cross-Database: 45 additional hits

After Deduplication: 923 unique candidate compounds

| Source | Compounds | Already Tested | Novel Candidates |
|--------|-----------|----------------|------------------|
| ChEMBL similarity | 456 | 234 | 222 |
| PubChem similarity | 391 | 156 | 235 |
| ChEMBL substructure | 178 | 89 | 89 |
| STITCH | 45 | 23 | 22 |
| Total Unique | 923 | 355 | 568 |
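
The deduplication step behind a table like this could look roughly as follows. This is a sketch under assumptions: each hit has already been reduced to a structure key (for example an InChIKey or canonical SMILES), and the function name `merge_candidates` and the source labels are hypothetical.

```python
def merge_candidates(hits_by_source: dict, tested_keys: set) -> dict:
    """Merge hits from several sources and split them into tested vs. novel.

    hits_by_source: source name -> iterable of structure keys (e.g. InChIKeys).
    tested_keys: keys that already have bioactivity data for the target.
    """
    sources_for = {}
    for source, keys in hits_by_source.items():
        for key in keys:
            found_in = sources_for.setdefault(key, [])
            if source not in found_in:  # ignore repeats within one source
                found_in.append(source)
    novel = {k: v for k, v in sources_for.items() if k not in tested_keys}
    return {
        "unique": len(sources_for),
        "already_tested": len(sources_for) - len(novel),
        "novel": novel,
    }


if __name__ == "__main__":
    hits = {
        "chembl_similarity": ["A", "B", "C"],
        "pubchem_similarity": ["B", "C", "D"],
        "stitch": ["D", "E"],
    }
    print(merge_candidates(hits, tested_keys={"A", "D"}))
```

Keeping the list of sources per compound makes it easy to fill the per-source columns of the report and to spot hits that several databases agree on.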

4.4 De Novo Molecule Generation (NVIDIA NIM)

When database mining yields insufficient candidates, generate novel molecules.

Requires: `NVIDIA_API_KEY` environment variable

Option A: GenMol (Scaffold Hopping with Masked Regions)

NvidiaNIM_genmol(
    smiles="COc1cc2ncnc(Nc3ccc([*{3-8}])c([*{1-3}])c3)c2cc1OCCCN1CCOCC1",
    num_molecules=100,
    temperature=2.0,
    scoring="QED"
)
└─ Input: SMILES with [*{min-max}] masked regions
└─ Output: Generated molecules with QED/LogP scores
└─ Use: Explore specific positions while keeping scaffold

Mask Design Strategy:

| Position | Mask | Purpose |
|----------|------|---------|
| Aniline substituent | `[*{1-3}]` | Small groups (halogen, methyl) |
| Solubilizing group | `[*{5-10}]` | Morpholine, piperazine variants |
| Linker region | `[*{3-6}]` | Spacer variations |
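
Masked SMILES strings are easy to mistype by hand, so a small helper can assemble them from a template. The sketch below is illustrative only: `mask` and `build_masked_smiles` are hypothetical helpers, and the template uses ordinary `str.format` placeholders that are an assumption of this example, not GenMol syntax.

```python
def mask(min_atoms: int, max_atoms: int) -> str:
    """Return a mask token of the form [*{min-max}]."""
    if not 1 <= min_atoms <= max_atoms:
        raise ValueError("expected 1 <= min_atoms <= max_atoms")
    return f"[*{{{min_atoms}-{max_atoms}}}]"


def build_masked_smiles(template: str, **regions) -> str:
    """Fill named placeholders in a SMILES template with mask tokens.

    regions maps a placeholder name to a (min_atoms, max_atoms) tuple.
    """
    tokens = {name: mask(lo, hi) for name, (lo, hi) in regions.items()}
    return template.format(**tokens)


if __name__ == "__main__":
    # Quinazoline core kept fixed; two aniline positions and the tail are masked.
    template = "COc1cc2ncnc(Nc3ccc({para})c({meta})c3)c2cc1{tail}"
    print(build_masked_smiles(template, para=(1, 3), meta=(1, 3), tail=(5, 12)))
```

Because SMILES itself does not use curly braces, `str.format` placeholders do not collide with the scaffold text.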

Example Masked SMILES for EGFR:

# Keep quinazoline core, vary aniline and tail
COc1cc2ncnc(Nc3ccc([*{1-3}])c([*{1-3}])c3)c2cc1[*{5-12}]

Option B: MolMIM (Controlled Generation from Reference)

NvidiaNIM_molmim(
    smi="COc1cc2ncnc(Nc3ccc(Cl)cc3)c2cc1OCCN1CCOCC1",
    num_molecules=50,
    algorithm="CMA-ES"
)
└─ Input: Reference SMILES (known active)
└─ Output: Optimized analogs with property scores
└─ Use: Generate close analogs of top actives

Generation Workflow:

1. Identify top 3-5 actives from Phase 2
2. Design masked SMILES for GenMol OR use as reference for MolMIM
3. Generate 50-100 molecules per seed
4. Pass generated molecules to Phase 5 (ADMET filtering)
5. Dock survivors in Phase 6 for final ranking

Report Format:

4.4 De Novo Generation Results

Method: GenMol via NVIDIA NIM
Seed Scaffold: 4-anilinoquinazoline (from erlotinib)
Masked Positions: Aniline (3,4), solubilizing tail

| Metric | Value |
|--------|-------|
| Molecules Generated | 100 |
| Passing Lipinski | 95 (95%) |
| Mean QED Score | 0.72 |
| Unique Scaffolds | 12 |

Top Generated Compounds:

| ID | SMILES | QED | LogP | Novelty |
|----|--------|-----|------|---------|
| GEN-001 | COc1cc2ncnc(Nc3ccc(Cl)c(Cl)c3)c2cc1OCCCN1CCOCC1 | 0.81 | 4.2 | Novel substitution |
| GEN-002 | COc1cc2ncnc(Nc3ccc(C#N)c(F)c3)c2cc1OCCCN1CCOCC1 | 0.78 | 3.8 | Novel substitution |

Source: NVIDIA NIM via `NvidiaNIM_genmol`

---

Phase 5: ADMET Filtering

5.1 Physicochemical Properties

ADMETAI_predict_physicochemical_properties(smiles=[compound_list])
└─ Filter: Lipinski violations ≤ 1
└─ Filter: QED > 0.3
└─ Filter: MW 200-600
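
The three physicochemical cutoffs above can be expressed as a single predicate. This is a sketch only: the dictionary keys (`lipinski_violations`, `qed`, `mw`) are hypothetical and would need to be mapped from whatever field names the prediction tool actually returns.

```python
def passes_physchem(props: dict) -> bool:
    """Apply the physicochemical filters: Lipinski <= 1 violation, QED > 0.3, MW 200-600.

    props uses hypothetical keys; adapt them to the real tool output.
    """
    return (
        props["lipinski_violations"] <= 1
        and props["qed"] > 0.3
        and 200 <= props["mw"] <= 600
    )


if __name__ == "__main__":
    compounds = [
        {"id": "CPD-A", "lipinski_violations": 0, "qed": 0.72, "mw": 393.4},
        {"id": "CPD-B", "lipinski_violations": 2, "qed": 0.41, "mw": 612.7},
    ]
    kept = [c["id"] for c in compounds if passes_physchem(c)]
    print(kept)
```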

5.2 ADMET Endpoints

1. ADMETAI_predict_bioavailability(smiles=[compound_list])
   └─ Filter: Oral bioavailability > 0.3
   
2. ADMETAI_predict_toxicity(smiles=[compound_list])
   └─ Filter: AMES < 0.5, hERG < 0.5, DILI < 0.5
   
3. ADMETAI_predict_CYP_interactions(smiles=[compound_list])
   └─ Flag: CYP3A4 inhibitors (drug interaction risk)
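
A possible way to combine these endpoints is to separate hard failures from soft flags, so that CYP3A4 inhibition is reported without removing the compound. The sketch below assumes predictions arrive as probabilities in a flat dictionary; the key names and the 0.5 cutoff used for the CYP3A4 flag are assumptions of this example.

```python
def triage_admet(pred: dict, tox_cutoff: float = 0.5,
                 bioavail_cutoff: float = 0.3) -> dict:
    """Classify one compound's ADMET predictions into pass/fail plus flags.

    pred uses hypothetical keys: bioavailability, AMES, hERG, DILI,
    and optionally CYP3A4_inhibition (all probabilities in 0..1).
    """
    fail_reasons = []
    if pred["bioavailability"] <= bioavail_cutoff:
        fail_reasons.append("low bioavailability")
    for endpoint in ("AMES", "hERG", "DILI"):
        if pred[endpoint] >= tox_cutoff:
            fail_reasons.append(f"{endpoint} liability")
    # CYP3A4 inhibition is a warning, not an exclusion criterion.
    flags = []
    if pred.get("CYP3A4_inhibition", 0.0) >= tox_cutoff:
        flags.append("CYP3A4 inhibitor")
    return {"passed": not fail_reasons, "fail_reasons": fail_reasons, "flags": flags}


if __name__ == "__main__":
    example = {"bioavailability": 0.6, "AMES": 0.1, "hERG": 0.2,
               "DILI": 0.3, "CYP3A4_inhibition": 0.8}
    print(triage_admet(example))
```

Recording the failure reasons per compound also provides the counts needed for a "Common Failure Reasons" summary.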

5.3 Structural Alerts

ChEMBL_search_compound_structural_alerts(smiles=compound_smiles)
└─ Flag: PAINS, reactive groups, toxicophores

ADMET Filter Summary:

5. ADMET Filtering Results

| Filter Stage | Input | Passed | Failed | Pass Rate |
|--------------|-------|--------|--------|-----------|
| Physicochemical (Lipinski) | 568 | 456 | 112 | 80% |
| Drug-likeness (QED > 0.3) | 456 | 398 | 58 | 87% |
| Bioavailability (> 0.3) | 398 | 312 | 86 | 78% |
| Toxicity filters | 312 | 267 | 45 | 86% |
| Structural alerts | 267 | 234 | 33 | 88% |
| Final Candidates | 568 | 234 | 334 | 41% |

Common Failure Reasons:

- High molecular weight (>600): 67 compounds
- Low predicted bioavailability: 86 compounds
- hERG liability: 28 compounds
- PAINS alerts: 18 compounds
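
The funnel arithmetic is simple to reproduce: each stage takes the previous stage's survivors as input, and the pass rate is survivors divided by input. A minimal sketch (the function name `funnel_report` is hypothetical), using the illustrative counts from the table:

```python
def funnel_report(start: int, stages: list) -> tuple:
    """Compute per-stage rows and the overall pass rate for a sequential filter funnel.

    stages: list of (stage_name, number_passed) in the order the filters run.
    Returns (rows, overall_percent), where each row is
    (name, input, passed, failed, pass_rate_percent).
    """
    rows = []
    remaining = start
    for name, passed in stages:
        rate = round(100 * passed / remaining) if remaining else 0
        rows.append((name, remaining, passed, remaining - passed, rate))
        remaining = passed
    overall = round(100 * remaining / start) if start else 0
    return rows, overall


if __name__ == "__main__":
    stages = [
        ("Physicochemical (Lipinski)", 456),
        ("Drug-likeness (QED > 0.3)", 398),
        ("Bioavailability (> 0.3)", 312),
        ("Toxicity filters", 267),
        ("Structural alerts", 234),
    ]
    rows, overall = funnel_report(568, stages)
    for row in rows:
        print(row)
    print("overall pass rate (%):", overall)
```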

---

Phase 6: Candidate Prioritization

6.1 Scoring Framework

Score each candidate on multiple dimensions:

| Dimension | Weight | Scoring Criteria |
|-----------|--------|------------------|
| Structural Similarity | 25% | Tanimoto to actives (0.7-1.0 → 1-5) |
| Novelty | 20% | Not in ChEMBL bioactivity = +2; Novel scaffold = +3 |
| ADMET Score | 25% | Composite of property predictions |
| Synthesis Feasibility | 15% | SA score (1-10), commercial availability |
| Scaffold Diversity | 15% | Cluster representative bonus |
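
One way to turn these weights into a single number is a weighted sum of per-dimension sub-scores. The sketch below makes two assumptions that the table does not state: every sub-score is first brought onto a common 0-5 scale, and the Tanimoto-to-points mapping is linear between 0.7 and 1.0. All names are hypothetical.

```python
WEIGHTS = {
    "similarity": 0.25,
    "novelty": 0.20,
    "admet": 0.25,
    "synthesis": 0.15,
    "diversity": 0.15,
}


def similarity_points(tanimoto: float) -> float:
    """Map Tanimoto 0.7-1.0 linearly onto 1-5 points (values outside are clipped)."""
    t = min(max(tanimoto, 0.7), 1.0)
    return 1.0 + (t - 0.7) / 0.3 * 4.0


def overall_score(sub_scores: dict) -> float:
    """Weighted sum of sub-scores; each is assumed to be on a 0-5 scale."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    candidate = {
        "similarity": similarity_points(0.82),
        "novelty": 3.0,
        "admet": 4.5,
        "synthesis": 4.0,
        "diversity": 5.0,
    }
    print(round(overall_score(candidate), 2))
```

Because the weights sum to 1, the overall score stays on the same 0-5 scale as the inputs, which keeps it comparable across candidates.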

6.2 Synthesis Feasibility

6.2 Synthesis Feasibility Assessment

| Candidate | SA Score | Commercial | Estimated Steps | Flag |
|-----------|----------|------------|-----------------|------|
| Compound-1 | 2.3 | Yes (Enamine) | 0 | ★★★ |
| Compound-2 | 3.5 | Building block | 2-3 | ★★☆ |
| Compound-3 | 5.8 | No | 6-8 | ★☆☆ |

SA Score Interpretation:

- 1-3: Easy synthesis
- 3-5: Moderate complexity
- 5-10: Challenging synthesis
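
The interpretation bands can be encoded as a small lookup. Since the listed ranges share their boundary values (3 and 5), this sketch assumes a boundary score falls into the easier band; that tie-breaking rule and the function name `sa_category` are assumptions of the example.

```python
def sa_category(sa_score: float) -> str:
    """Translate a synthetic accessibility score (1-10) into an interpretation band.

    Assumption: boundary values (exactly 3 or 5) are assigned to the easier band.
    """
    if not 1 <= sa_score <= 10:
        raise ValueError("SA score is expected to lie between 1 and 10")
    if sa_score <= 3:
        return "Easy synthesis"
    if sa_score <= 5:
        return "Moderate complexity"
    return "Challenging synthesis"


if __name__ == "__main__":
    for name, score in [("Compound-1", 2.3), ("Compound-2", 3.5), ("Compound-3", 5.8)]:
        print(name, score, sa_category(score))
```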

6.3 Final Prioritized List

6.3 Top 20 Candidate Compounds

| Rank | ID | SMILES | Sim. Score | ADMET | Novelty | Overall | Rationale |
|------|----|--------|------------|-------|---------|---------|-----------|
| 1 | CPD-001 | Cc1ccc... | 0.82 | 4.5 | Novel scaffold | 4.2 | High similarity, clean ADMET |
| 2 | CPD-002 | COc1cc... | 0.78 | 4.3 | Not tested | 4.0 | Quinazoline analog |
| 3 | CPD-003 | Nc1ccc... | 0.75 | 4.1 | Novel core | 3.9 | New chemotype |
| ... | ... | ... | ... | ... | ... | ... | ... |

Scaffold Diversity: 7 distinct scaffolds in top 20
Commercial Availability: 12/20 available for purchase
Estimated Hit Rate: 15-25% (based on similarity to actives)

---

Phase 6.5: Literature Evidence (NEW)

6.5.1 Literature Search for Validation

Search literature to validate candidate compounds and understand target context.

python

def search_binder_literature(tu, target_name, compound_scaffolds):
    """Search literature for compound and target evidence."""
    
    # PubMed: Published SAR studies
    sar_papers = tu.tools.PubMed_search_articles(
        query=f"{target_name} inhibitor SAR structure-activity",
        limit=30
    )
    
    # BioRxiv: Latest unpublished findings
    preprints = tu.tools.BioRxiv_search_preprints(
        query=f"{target_name} small molecule discovery",
        limit=15
    )
    
    # MedRxiv: Clinical data on inhibitors
    clinical = tu.tools.MedRxiv_search_preprints(
        query=f"{target_name} inhibitor clinical trial",
        limit=10
    )
    
    # Citation analysis for key papers
    key_papers = sar_papers[:10]
    for paper in key_papers:
        citation = tu.tools.openalex_search_works(
            query=paper['title'],
            limit=1
        )
        paper['citations'] = citation[0].get('cited_by_count', 0) if citation else 0
    
    return {
        'published_sar': sar_papers,
        'preprints': preprints,
        'clinical_preprints': clinical,
        'high_impact_papers': sorted(key_papers, key=lambda x: x.get('citations', 0), reverse=True)
    }

6.5.2 Output for Report

6.5 Literature Evidence

Published SAR Studies

| PMID | Title | Year | Key Insight |
|------|-------|------|-------------|
| 34567890 | Discovery of novel EGFR inhibitors... | 2024 | C7 substitution critical |
| 33456789 | Structure-activity relationship of... | 2023 | Fluorine improves potency |

Recent Preprints (⚠️ Not Peer-Reviewed)

| Source | Title | Posted | Relevance |
|--------|-------|--------|-----------|
| BioRxiv | Novel scaffolds for EGFR... | 2024-02 | New chemotype discovery |
| MedRxiv | Clinical activity of... | 2024-01 | Phase 2 results |

High-Impact References

| PMID | Citations | Title |
|------|-----------|-------|
| 32123456 | 523 | Landmark EGFR inhibitor study... |
| 31234567 | 312 | Comprehensive SAR analysis... |

Source: PubMed, BioRxiv, MedRxiv, OpenAlex

---

Report Template

File: `[TARGET]_binder_discovery_report.md`

Small Molecule Binder Discovery: [TARGET]

Generated: [Date] | Query: [Original query] | Status: In Progress

Executive Summary

[Researching...]

1. Target Validation

1.1 Target Identifiers

[Researching...]

1.2 Druggability Assessment

[Researching...]

1.3 Binding Site Analysis

[Researching...]

2. Known Ligand Landscape

2.1 ChEMBL Bioactivity Summary

[Researching...]

2.2 Approved Drugs & Clinical Compounds

[Researching...]

2.3 Chemical Probes

[Researching...]

2.4 SAR Insights

[Researching...]

3. Structural Information

3.1 Available Structures

[Researching...]

3.2 Binding Pocket Analysis

[Researching...]

3.3 Key Interactions

[Researching...]

4. Compound Expansion

4.1 Similarity Search Results

[Researching...]

4.2 Substructure Search Results

[Researching...]

4.3 Cross-Database Mining

[Researching...]

5. ADMET Filtering

5.1 Physicochemical Filters

[Researching...]

5.2 ADMET Predictions

[Researching...]

5.3 Structural Alerts

[Researching...]

5.4 Filter Summary

[Researching...]

6. Candidate Prioritization

6.1 Scoring Methodology

[Researching...]

6.2 Synthesis Feasibility

[Researching...]

6.3 Top 20 Candidates

[Researching...]

7. Recommendations

7.1 Immediate Actions

[Researching...]

7.2 Experimental Validation Plan

[Researching...]

7.3 Backup Strategies

[Researching...]

8. Data Gaps & Limitations

[Researching...]

9. Data Sources

[Will be populated as research progresses...]

10. Methods Summary

| Step | Tool | Purpose |
|------|------|---------|
| Sequence retrieval | UniProt_search | Get protein sequence |
| Structure prediction | NvidiaNIM_alphafold2 / NvidiaNIM_esmfold | 3D structure with pLDDT |
| Docking validation | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Validate binding pocket |
| Known ligands | ChEMBL_get_target_activities | Bioactivity data |
| Similarity search | ChEMBL_search_similar_molecules | Expand chemical space |
| De novo generation | NvidiaNIM_genmol / NvidiaNIM_molmim | Novel molecule design |
| ADMET filtering | ADMETAI_predict_* | Drug-likeness assessment |
| Candidate docking | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Final scoring |

---

Evidence Grading

| Tier | Symbol | Description | Example |
|------|--------|-------------|---------|
| T0 | ★★★★ | Docking score > reference inhibitor | Better than erlotinib |
| T1 | ★★★ | Experimental IC50/Ki < 100 nM | ChEMBL bioactivity |
| T2 | ★★☆ | Docking within 5% of reference OR IC50 100-1000 nM | High priority |
| T3 | ★☆☆ | Structural similarity > 80% to T1 | Predicted active |
| T4 | ☆☆☆ | Similarity 70-80%, scaffold match | Lower confidence |
| T5 | ○○○ | Generated molecule, ADMET-passed, no docking | Speculative |

Docking-Enhanced Grading: When NVIDIA NIM docking is available, compounds gain evidence:

Docking > reference → upgrade to T0 (★★★★)
Docking within 5% → upgrade to T2 (★★☆)
Docking within 20% → maintain current tier
Docking >20% worse → downgrade one tier
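
These four rules can be sketched as a tier-adjustment function. The sketch makes explicit a few choices the rules leave open, and they are assumptions of this example: tiers are integers 0-5 (T0-T5), higher docking scores are better, "upgrade to T2" never demotes a compound that is already T0 or T1, and a downgrade cannot go below T5. The name `adjust_tier` is hypothetical.

```python
def adjust_tier(current_tier: int, candidate_score: float,
                reference_score: float) -> int:
    """Apply docking-enhanced grading to an evidence tier (0 = T0 ... 5 = T5).

    Assumptions: higher score = better docking, reference_score > 0.
    """
    if not 0 <= current_tier <= 5:
        raise ValueError("tier must be between 0 and 5")
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    relative = (candidate_score - reference_score) / reference_score
    if relative > 0:
        return 0                          # better than reference: T0
    if relative >= -0.05:
        return min(current_tier, 2)       # within 5%: at least T2
    if relative >= -0.20:
        return current_tier               # within 20%: unchanged
    return min(current_tier + 1, 5)       # more than 20% worse: down one tier


if __name__ == "__main__":
    print(adjust_tier(3, 1.083, 1.0))  # docking above reference
    print(adjust_tier(3, 0.955, 1.0))  # within 5% of reference
    print(adjust_tier(4, 0.50, 1.0))   # much worse than reference
```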

Apply to all candidate compounds:

| Compound | Evidence | Docking vs Ref | Rationale |
|----------|----------|----------------|-----------|
| CPD-001 | ★★★★ | +8.3% | 85% similar, docking > erlotinib |
| CPD-002 | ★★★ | -2.1% | IC50=45nM, validated by docking |
| CPD-003 | ★★☆ | -4.5% | 78% similar, good docking |
| GEN-001 | ★☆☆ | -15% | Generated, ADMET-passed |

Mandatory Completeness Checklist

Phase 1: Target Validation

Phase 2: Known Ligands

Phase 3: Structure

PDB structures listed (or "No experimental structure")
Best structure for docking identified
Binding pocket described (or "Predicted from AlphaFold")

Phase 4: Expansion

≥3 seed compounds used
Similarity search completed (≥100 results or exhausted)
Substructure search completed
Deduplicated candidate count reported

Phase 5: ADMET

Phase 6: Prioritization

≥20 candidates ranked (or all if fewer)
Scoring methodology explained
Synthesis feasibility assessed
Scaffold diversity noted

Phase 7: Recommendations

Tool Reference by Phase

Phase 1: Target Validation

| Tool | Purpose |
|------|---------|
| `UniProt_search` | Resolve UniProt accession |
| `MyGene_query_genes` | Get Ensembl/NCBI IDs |
| `ChEMBL_search_targets` | Get ChEMBL target ID |
| `OpenTargets_get_target_tractability_by_ensemblID` | Tractability assessment |
| `DGIdb_get_gene_druggability` | Druggability categories |
| `ChEMBL_search_binding_sites` | Binding site info |
| `InterPro_get_protein_domains` | Domain architecture |

Phase 2: Known Ligands

| Tool | Purpose |
|------|---------|
| `ChEMBL_get_target_activities` | Bioactivity data |
| `ChEMBL_get_molecule` | Molecule details |
| `GtoPdb_get_target_interactions` | Pharmacology data |
| `OpenTargets_get_chemical_probes_by_target_ensemblID` | Chemical probes |
| `OpenTargets_get_associated_drugs_by_target_ensemblID` | Known drugs |

Phase 1.4: Structure Prediction (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_alphafold2` | High-accuracy structure prediction with pLDDT |
| `NvidiaNIM_esmfold` | Fast structure prediction (max 1024 AA) |
| `NvidiaNIM_msa_search` | MSA generation for AlphaFold |

Phase 3: Structure

| Tool | Purpose |
|------|---------|
| `PDB_search_similar_structures` | Find PDB structures |
| `get_protein_metadata_by_pdb_id` | Structure metadata |
| `get_binding_affinity_by_pdb_id` | Ligand affinities |
| `alphafold_get_prediction` | Predicted structure (AlphaFold DB) |
| `get_ligand_smiles_by_chem_comp_id` | Ligand structures |
| `emdb_search` | Search cryo-EM structures (NEW) |
| `emdb_get_entry` | Get EMDB entry details (NEW) |

Phase 3.5: Docking Validation (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_diffdock` | Blind molecular docking (PDB + SDF) |
| `NvidiaNIM_boltz2` | Protein-ligand complex (sequence + SMILES) |

Phase 4: Expansion

| Tool | Purpose |
|------|---------|
| `ChEMBL_search_similar_molecules` | Similarity search |
| `PubChem_search_compounds_by_similarity` | PubChem similarity |
| `ChEMBL_search_substructure` | Substructure search |
| `PubChem_search_compounds_by_substructure` | PubChem substructure |
| `STITCH_get_chemical_protein_interactions` | Cross-database |

Phase 4.4: De Novo Generation (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_genmol` | Scaffold hopping with masked regions |
| `NvidiaNIM_molmim` | Controlled generation from reference |

Phase 5: ADMET

| Tool | Purpose |
|------|---------|
| `ADMETAI_predict_physicochemical_properties` | Drug-likeness |
| `ADMETAI_predict_bioavailability` | Oral absorption |
| `ADMETAI_predict_toxicity` | Toxicity flags |
| `ADMETAI_predict_CYP_interactions` | CYP liabilities |
| `ChEMBL_search_compound_structural_alerts` | PAINS, alerts |

Phase 6: Candidate Docking (NVIDIA NIM)

| Tool | Purpose |
|------|---------|
| `NvidiaNIM_diffdock` | Score all candidates by docking |
| `NvidiaNIM_boltz2` | Alternative docking from SMILES |

Phase 6.5: Literature Evidence (NEW)

| Tool | Purpose |
|------|---------|
| `PubMed_search_articles` | Published SAR studies |
| `BioRxiv_search_preprints` | Latest biology preprints |
| `MedRxiv_search_preprints` | Clinical preprints |
| `openalex_search_works` | Citation analysis |
| `SemanticScholar_search` | AI-ranked papers |

Fallback Chains

| Primary Tool | Fallback 1 | Fallback 2 | Use When |
|--------------|------------|------------|----------|
| `ChEMBL_get_target_activities` | `GtoPdb_get_target_interactions` | `PubChem_search_assays` | No ChEMBL data |
| `ChEMBL_search_similar_molecules` | `PubChem_search_compounds_by_similarity` | `STITCH_get_chemical_protein_interactions` | ChEMBL exhausted |
| `PDB_search_similar_structures` | `NvidiaNIM_alphafold2` | `alphafold_get_prediction` | No PDB structure |
| `alphafold_get_prediction` | `NvidiaNIM_alphafold2` | `NvidiaNIM_esmfold` | AlphaFold DB unavailable |
| `NvidiaNIM_alphafold2` | `NvidiaNIM_esmfold` | `alphafold_get_prediction` | AlphaFold2 NIM error |
| `NvidiaNIM_diffdock` | `NvidiaNIM_boltz2` | Skip docking, use similarity | Docking error |
| `NvidiaNIM_genmol` | `NvidiaNIM_molmim` | Manual scaffold hopping | Generation error |
| `OpenTargets_get_target_tractability` | `DGIdb_get_gene_druggability` | Document "Unknown" | Open Targets error |
| `ADMETAI_*` | SwissADME tools | Basic Lipinski | Invalid SMILES |
| `PDB_search_similar_structures` | `emdb_search` + PDB | `NvidiaNIM_alphafold2` | Membrane proteins |
| `PubMed_search_articles` | `openalex_search_works` | `SemanticScholar_search` | Literature search |
| `BioRxiv_search_preprints` | `MedRxiv_search_preprints` | Skip preprints | Preprint sources |

NVIDIA NIM API Key Required: Tools with the `NvidiaNIM_` prefix require the `NVIDIA_API_KEY` environment variable. Check availability at start:

```python
import os
nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
```

If the key is not set, fall back to the non-NIM alternatives in the fallback chains above.
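
The fallback chains and the API-key check can be combined into one small helper. The sketch below is illustrative only: `FALLBACK_CHAINS` copies three rows from the table, and `usable_tools`, `run_with_fallback`, and the `call_tool(name, arguments)` adapter are hypothetical names, not part of ToolUniverse. It assumes that a tool signals failure either by raising an exception or by returning an empty result.

```python
import os
from typing import Any, Callable, Dict, List, Optional, Tuple

# Three chains copied from the fallback table; extend with the remaining rows as needed.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "ChEMBL_get_target_activities": [
        "ChEMBL_get_target_activities",
        "GtoPdb_get_target_interactions",
        "PubChem_search_assays",
    ],
    "alphafold_get_prediction": [
        "alphafold_get_prediction",
        "NvidiaNIM_alphafold2",
        "NvidiaNIM_esmfold",
    ],
    "NvidiaNIM_diffdock": ["NvidiaNIM_diffdock", "NvidiaNIM_boltz2"],
}


def usable_tools(primary: str, nvidia_available: bool) -> List[str]:
    """Return the chain for `primary`, dropping NvidiaNIM_ tools when no key is set."""
    chain = FALLBACK_CHAINS.get(primary, [primary])
    return [t for t in chain if nvidia_available or not t.startswith("NvidiaNIM_")]


def run_with_fallback(
    primary: str,
    arguments: Dict[str, Any],
    call_tool: Callable[[str, Dict[str, Any]], Any],
    nvidia_available: Optional[bool] = None,
) -> Tuple[Optional[str], Any]:
    """Try each tool in order; return (tool_name, result) for the first non-empty result.

    `call_tool` is a hypothetical adapter supplied by the caller. Returns
    (None, None) when every tool fails, so the caller can apply the table's
    last-resort step (for example, skip docking and use similarity).
    """
    if nvidia_available is None:
        nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
    for name in usable_tools(primary, nvidia_available):
        try:
            result = call_tool(name, arguments)
        except Exception:
            continue  # tool error: move on to the next fallback
        if result:  # None or empty results count as "no data"
            return name, result
    return None, None
```

Returning the name of the tool that succeeded alongside its result lets the report record which source each finding came from.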

---

Common Use Cases

Well-Characterized Target

User: "Find novel binders for EGFR" → Rich ChEMBL data; focus on novel scaffolds, selectivity, ADMET

Novel Target

User: "Find small molecules for [new target with no known ligands]" → Limited bioactivity; rely on structure-based assessment, similar target ligands

Lead Optimization

User: "Find analogs of compound X for target Y" → Deep similarity search around specific compound; focus on SAR

Selectivity Challenge

User: "Find selective inhibitors for kinase X vs kinase Y" → Include selectivity analysis; filter by off-target predictions

When NOT to Use This Skill

Drug research → Use tooluniverse-drug-research (existing drug profiling)
Target research only → Use tooluniverse-target-research
Single compound ADMET → Call ADMET tools directly
Literature search → Use tooluniverse-literature-deep-research
Protein structure only → Use tooluniverse-protein-structure-retrieval

Use this skill for discovering new compounds for a protein target.

Additional Resources

Checklist: CHECKLIST.md - Pre-delivery verification
Examples: EXAMPLES.md - Detailed workflow examples
Tool corrections: TOOLS_REFERENCE.md - Parameter corrections
