tooluniverse-binder-discovery
Small Molecule Binder Discovery Strategy
Systematic discovery of novel small molecule binders using 60+ ToolUniverse tools across druggability assessment, known ligand mining, similarity expansion, ADMET filtering, and synthesis feasibility.
KEY PRINCIPLES:
- Report-first approach - Create report file FIRST, then populate progressively
- Target validation FIRST - Confirm druggability before compound searching
- Multi-strategy approach - Combine structure-based and ligand-based methods
- ADMET-aware filtering - Eliminate poor compounds early
- Evidence grading - Grade candidates by supporting evidence
- Actionable output - Provide prioritized candidates with rationale
- English-first queries - Always use English terms in tool calls, even if the user writes in another language. Only try original-language terms as a fallback. Respond in the user's language
Critical Workflow Requirements
1. Report-First Approach (MANDATORY)
DO NOT show search process or tool outputs to the user. Instead:
- Create the report file FIRST - Before any data collection:
  - File name: [TARGET]_binder_discovery_report.md
  - Initialize with all section headers from the template
  - Add placeholder text in each section: [Researching...]
- Progressively update the report - As you gather data:
  - Update each section with findings immediately
  - The user sees the report growing, not the search process
- Output separate data files:
  - [TARGET]_candidate_compounds.csv - Prioritized compounds with SMILES, scores
  - [TARGET]_bibliography.json - Literature references (optional)
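The report-first steps above can be sketched as follows. This is a minimal illustration rather than the skill's actual implementation: the helper names `init_report` and `update_section`, the `##` heading layout, and the section titles passed in are assumptions made for the example. Only the file name pattern and the `[Researching...]` placeholder come from the text.

```python
from pathlib import Path

PLACEHOLDER = "[Researching...]"


def init_report(target, sections, out_dir="."):
    """Create the report file with every section header and a placeholder."""
    path = Path(out_dir) / f"{target}_binder_discovery_report.md"
    lines = [f"# {target} Binder Discovery Report", ""]
    for title in sections:
        # Assumed layout: one "##" heading per section, then the placeholder
        lines += [f"## {title}", "", PLACEHOLDER, ""]
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


def update_section(path, title, content):
    """Replace the placeholder under one section heading with findings."""
    text = path.read_text(encoding="utf-8")
    marker = f"## {title}\n\n{PLACEHOLDER}"
    if marker not in text:
        raise ValueError(f"No pending placeholder for section {title!r}")
    updated = text.replace(marker, f"## {title}\n\n{content}", 1)
    path.write_text(updated, encoding="utf-8")
```

Calling `init_report` before any tool call and `update_section` after each phase keeps the file on disk in step with what has been gathered so far.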
2. Citation Requirements (MANDATORY)
Every piece of information MUST include its source:
markdown
3.2 Known Inhibitors
| Compound | ChEMBL ID | IC50 (nM) | Selectivity | Source |
|---|---|---|---|---|
| Imatinib | CHEMBL941 | 38 | ABL-selective | ChEMBL |
| Dasatinib | CHEMBL1421 | 0.5 | Multi-kinase | ChEMBL |
Source: ChEMBL via ChEMBL_get_target_activities (CHEMBL1862)
---
Workflow Overview
Phase 0: Tool Verification (check parameter names)
↓
Phase 1: Target Validation
├─ 1.1 Resolve identifiers (UniProt, Ensembl, ChEMBL target ID)
├─ 1.2 Assess druggability/tractability
│ └─ 1.2.5 Check therapeutic antibodies (Thera-SAbDab) [NEW]
├─ 1.3 Identify binding sites
└─ 1.4 Predict structure (NvidiaNIM_alphafold2/esmfold)
↓
Phase 2: Known Ligand Mining
├─ Extract ChEMBL bioactivity data
├─ Get GtoPdb interactions
├─ Identify chemical probes
├─ BindingDB affinity data (NEW - Ki/IC50/Kd)
├─ PubChem BioAssay HTS data (NEW - screening hits)
└─ Analyze SAR from known actives
↓
Phase 3: Structure Analysis
├─ Get PDB structures with ligands
├─ Check EMDB for cryo-EM structures (NEW - for membrane targets)
├─ Analyze binding pocket
└─ Identify key interactions
↓
Phase 3.5: Docking Validation (NvidiaNIM_diffdock/boltz2) [NEW]
├─ Dock reference inhibitor
└─ Validate binding pocket geometry
↓
Phase 4: Compound Expansion
├─ 4.1-4.3 Similarity/substructure search
└─ 4.4 De novo generation (NvidiaNIM_genmol/molmim) [NEW]
↓
Phase 5: ADMET Filtering
├─ Predict physicochemical properties
├─ Predict ADMET endpoints
└─ Flag liabilities
↓
Phase 6: Candidate Docking & Prioritization
├─ Dock all candidates (NvidiaNIM_diffdock/boltz2) [UPDATED]
├─ Score by docking + ADMET + novelty
├─ Assess synthesis feasibility
└─ Generate final ranked list
↓
Phase 7: Report Synthesis
Phase 0: Tool Verification
CRITICAL: Verify tool parameters before calling unfamiliar tools.
python
# Check tool params to prevent silent failures
tool_info = tu.tools.get_tool_info(tool_name="ChEMBL_get_target_activities")
Known Parameter Corrections
| Tool | WRONG Parameter | CORRECT Parameter |
|---|---|---|
| | |
| | |
| | |
| | |
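One way to act on the verification step is to compare the keyword arguments you intend to pass against the names a tool accepts before making the call. The sketch below is an illustration under assumptions: `find_unknown_params` is a hypothetical helper, the accepted-name list is made up for the example, and how that list is extracted from the `get_tool_info` result is not shown because its return structure is not documented here.

```python
from difflib import get_close_matches


def find_unknown_params(accepted_names, kwargs):
    """Map each unrecognized keyword to close matches among accepted names."""
    accepted_names = list(accepted_names)
    return {
        name: get_close_matches(name, accepted_names, n=3, cutoff=0.6)
        for name in kwargs
        if name not in accepted_names
    }


# Hypothetical accepted parameter names, for illustration only
example_accepted = ["target_chembl_id", "limit"]
problems = find_unknown_params(
    example_accepted, {"chembl_id": "CHEMBL203", "limit": 500}
)
for name, suggestions in problems.items():
    print(f"Unknown parameter {name!r}; closest accepted names: {suggestions}")
```

Failing early on an unknown keyword is cheaper than discovering later that a tool silently ignored it.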
Phase 1: Target Validation
1.1 Identifier Resolution Chain
1. UniProt_search(query=target_name, organism="human")
└─ Extract: UniProt accession, gene name, protein name
2. MyGene_query_genes(q=gene_symbol, species="human")
└─ Extract: Ensembl gene ID, NCBI gene ID
3. ChEMBL_search_targets(query=target_name, organism="Homo sapiens")
└─ Extract: ChEMBL target ID, target type
4. GtoPdb_get_targets(query=target_name)
└─ Extract: GtoPdb target ID (if GPCR/ion channel/enzyme)
Store all IDs for downstream queries:
ids = {
    'uniprot': 'P00533',
    'ensembl': 'ENSG00000146648',
    'chembl_target': 'CHEMBL203',
    'gene_symbol': 'EGFR',
    'gtopdb': '1797'  # if available
}
1.2 Druggability Assessment
Multi-Source Triangulation:
1. OpenTargets_get_target_tractability_by_ensemblID(ensemblId)
└─ Extract: Small molecule tractability score, bucket
2. DGIdb_get_gene_druggability(genes=[gene_symbol])
└─ Extract: Druggability categories, known drug count
3. OpenTargets_get_target_classes_by_ensemblID(ensemblId)
└─ Extract: Target class (kinase, GPCR, etc.)
4. GPCRdb_get_protein(protein=entry_name) # NEW - for GPCRs
└─ Extract: GPCR family, receptor state, ligand binding data
1.2a GPCRdb Integration (NEW - for GPCR Targets)
~35% of all approved drugs target GPCRs. For GPCR targets, use specialized data:
python
def check_if_gpcr_and_enrich(tu, target_name, uniprot_id):
    """Check if target is GPCR and get specialized data."""
    # Build GPCRdb entry name (e.g., "adrb2_human").
    # This assumes target_name is a gene symbol such as ADRB2;
    # uniprot_id is accepted for symmetry with other helpers but is not used here.
    entry_name = f"{target_name.lower()}_human"

    # Check if it's a GPCR
    gpcr_info = tu.tools.GPCRdb_get_protein(
        operation="get_protein",
        protein=entry_name
    )
    if gpcr_info.get('status') == 'success':
        # It's a GPCR - get specialized data
        # Get known structures (active/inactive states)
        structures = tu.tools.GPCRdb_get_structures(
            operation="get_structures",
            protein=entry_name
        )
        # Get known ligands
        ligands = tu.tools.GPCRdb_get_ligands(
            operation="get_ligands",
            protein=entry_name
        )
        # Get mutation data (important for SAR)
        mutations = tu.tools.GPCRdb_get_mutations(
            operation="get_mutations",
            protein=entry_name
        )
        # Use .get('data', {}) so a failed follow-up call yields an empty list
        # instead of raising KeyError
        return {
            'is_gpcr': True,
            'gpcr_family': gpcr_info.get('data', {}).get('family'),
            'gpcr_class': gpcr_info.get('data', {}).get('receptor_class'),
            'structures': structures.get('data', {}).get('structures', []),
            'ligands': ligands.get('data', {}).get('ligands', []),
            'mutation_data': mutations.get('data', {}).get('mutations', [])
        }
    return {'is_gpcr': False}
GPCRdb Advantages:
- GPCR-specific sequence alignments (Ballesteros-Weinstein numbering)
- Active vs. inactive state structures
- Curated ligand binding data
- Experimental mutation effects on ligand binding
Druggability Scorecard:
| Factor | Assessment | Score |
|---|---|---|
| Known small molecule drugs | Yes (3+) | ★★★ |
| Tractability bucket | 1-3 | ★★☆-★★★ |
| Target class | Enzyme/GPCR/Ion channel | ★★★ |
| Binding site known | Yes (X-ray) | ★★★ |
| GPCRdb ligands available | Yes (10+) | ★★★ (GPCR only) |
| Therapeutic antibodies exist | Check Thera-SAbDab | See 1.2.5 |
Decision Point: If druggability score < ★★☆, warn user about challenges.
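The scorecard does not say how the per-factor stars combine into one druggability score, so the sketch below makes an explicit assumption: each factor is rated 0 to 3 stars, the overall score is the plain mean, and the warning fires when that mean is below 2 (★★☆). The function names and the factor keys are hypothetical.

```python
import math


def star_string(n, max_stars=3):
    """Render an integer rating as filled and empty stars."""
    return "★" * n + "☆" * (max_stars - n)


def summarize_druggability(factor_stars, warn_below=2.0):
    """Average per-factor star ratings and flag targets below the threshold."""
    if not factor_stars:
        raise ValueError("At least one factor rating is required")
    mean = sum(factor_stars.values()) / len(factor_stars)
    return {
        "mean_stars": round(mean, 2),
        # Round down so the displayed stars never overstate the score
        "display": star_string(math.floor(mean)),
        "warn": mean < warn_below,
    }
```

A weighted mean, or a rule that any single one-star factor triggers the warning, would be equally defensible; the choice should be stated in the report either way.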
1.2.5 Therapeutic Antibody Landscape (NEW)
Check if therapeutic antibodies already target this protein - important for:
- Understanding competitive landscape
- Validating target tractability (if antibodies work, target is validated)
- Identifying potential combination approaches
python
def check_therapeutic_antibodies(tu, target_name):
    """
    Check Thera-SAbDab for therapeutic antibodies against target.
    """
    # Search by target name
    results = tu.tools.TheraSAbDab_search_by_target(
        target=target_name
    )
    if results.get('status') == 'success':
        antibodies = results['data'].get('therapeutics', [])
        # Categorize by clinical stage
        by_phase = {'Approved': [], 'Phase 3': [], 'Phase 2': [], 'Phase 1': [], 'Preclinical': []}
        for ab in antibodies:
            phase = ab.get('phase') or 'Unknown'
            for key in by_phase:
                if key.lower() in phase.lower():
                    by_phase[key].append(ab)
                    break
        return {
            'total_antibodies': len(antibodies),
            'by_phase': by_phase,
            'antibodies': antibodies[:10],  # First 10 as returned (not ranked)
            'competitive_alert': len(by_phase['Approved']) > 0
        }
    return None


def get_antibody_landscape(tu, target_name, uniprot_id=None):
    """
    Comprehensive antibody competitive landscape.
    """
    # Thera-SAbDab search by target (provides the per-phase breakdown)
    therasabdab = check_therapeutic_antibodies(tu, target_name)
    # Also search by common synonyms; skip a missing or duplicate UniProt ID
    synonyms = [target_name]
    if uniprot_id and uniprot_id != target_name:
        synonyms.append(uniprot_id)
    all_antibodies = []
    for synonym in synonyms:
        results = tu.tools.TheraSAbDab_search_therapeutics(query=synonym)
        if results.get('status') == 'success':
            all_antibodies.extend(results['data'].get('therapeutics', []))
    # Deduplicate by INN name
    seen = set()
    unique = []
    for ab in all_antibodies:
        inn = ab.get('inn_name')
        if inn and inn not in seen:
            seen.add(inn)
            unique.append(ab)
    return {
        'antibodies': unique,
        'count': len(unique),
        'by_phase': therasabdab['by_phase'] if therasabdab else {},
        'has_approved': any((ab.get('phase') or '').lower() == 'approved' for ab in unique),
        'source': 'Thera-SAbDab'
    }
Report Output:
markdown
1.2.5 Therapeutic Antibody Landscape (NEW)
Thera-SAbDab Search Results:
| Antibody (INN) | Target | Format | Phase | PDB |
|---|---|---|---|---|
| Pembrolizumab | PD-1 | IgG4 | Approved | 5DK3 |
| Nivolumab | PD-1 | IgG4 | Approved | 5WT9 |
| Cemiplimab | PD-1 | IgG4 | Approved | 7WVM |
Competitive Landscape: ⚠️ 3 approved antibodies target this protein
Strategic Implication: Small molecule approach offers differentiation (oral dosing, CNS penetration, cost)
Source: Thera-SAbDab via TheraSAbDab_search_by_target
**Why Include Antibody Landscape**:
- **Validation**: Approved antibodies = validated target
- **Competition**: Understand what's already in market/clinic
- **Strategy**: Identify gaps (no oral, no CNS-penetrant)
- **Synergy**: Potential combination opportunities
1.3 Binding Site Analysis
1. ChEMBL_search_binding_sites(target_chembl_id)
└─ Extract: Binding site names, types
2. get_binding_affinity_by_pdb_id(pdb_id) # For each PDB with ligand
└─ Extract: Kd, Ki, IC50 values for co-crystallized ligands
3. InterPro_get_protein_domains(uniprot_accession)
└─ Extract: Domain architecture, active sites
Output for Report:
markdown
1.3 Binding Site Assessment
Known Binding Sites:
| Site | Type | Evidence | Key Residues | Source |
|---|---|---|---|---|
| ATP pocket | Orthosteric | X-ray (23 PDBs) | K745, E762, M793 | PDB/ChEMBL |
| Allosteric pocket | Allosteric | X-ray (3 PDBs) | T790, C797 | PDB |
Binding Site Druggability: ★★★ (well-defined pocket, multiple co-crystal structures)
Source: ChEMBL via ChEMBL_search_binding_sites, PDB structures
1.4 Structure Prediction (NVIDIA NIM)
Use structure prediction when no experimental structure is available, or for custom domain predictions.
Requires: NVIDIA_API_KEY environment variable
Option A: AlphaFold2 (High accuracy, async)
NvidiaNIM_alphafold2(
    sequence=kinase_domain_sequence,
    algorithm="mmseqs2",
    relax_prediction=False
)
└─ Returns: PDB structure with pLDDT confidence scores
└─ Use when: Accuracy is critical, time is available (~5-15 min)
Option B: ESMFold (Fast, synchronous)
NvidiaNIM_esmfold(sequence=kinase_domain_sequence)
└─ Returns: PDB structure (max 1024 AA)
└─ Use when: Quick assessment needed (~30 sec)
Report pLDDT Confidence:
markdown
1.4 Structure Prediction Quality
Method: AlphaFold2 via NVIDIA NIM
Mean pLDDT: 90.94 (very high confidence)
| Confidence Level | Range | Fraction | Interpretation |
|---|---|---|---|
| Very High | ≥90 | 74.3% | Highly reliable |
| Confident | 70-90 | 16.0% | Reliable |
| Low | 50-70 | 9.0% | Use caution |
| Very Low | <50 | 0.7% | Unreliable |
Key Binding Residue Confidence:
| Residue | Function | pLDDT |
|---|---|---|
| K745 | ATP binding | 90.0 |
| T790 | Gatekeeper | 92.3 |
| M793 | Hinge region | 95.3 |
| D855 | DFG motif | 89.5 |
Source: NVIDIA NIM via NvidiaNIM_alphafold2
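The confidence table in the report template can be filled from per-residue pLDDT values with a few lines of code. The sketch below uses the band boundaries from the table (90, 70, 50). The function name `summarize_plddt` is hypothetical, and how the per-residue scores are extracted from the prediction output is left out because that format is not specified here.

```python
def summarize_plddt(scores):
    """Return the mean pLDDT and the percentage of residues in each band."""
    if not scores:
        raise ValueError("No pLDDT scores supplied")
    bands = {
        "Very High (>=90)": 0,
        "Confident (70-90)": 0,
        "Low (50-70)": 0,
        "Very Low (<50)": 0,
    }
    for s in scores:
        if s >= 90:
            bands["Very High (>=90)"] += 1
        elif s >= 70:
            bands["Confident (70-90)"] += 1
        elif s >= 50:
            bands["Low (50-70)"] += 1
        else:
            bands["Very Low (<50)"] += 1
    n = len(scores)
    return {
        "mean": round(sum(scores) / n, 2),
        "fractions": {k: round(100 * v / n, 1) for k, v in bands.items()},
    }
```

Reporting the pLDDT of the binding-site residues separately, as the template does, matters more for docking than the global mean.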
---
Phase 2: Known Ligand Mining
2.1 ChEMBL Bioactivity Data
1. ChEMBL_get_target_activities(target_chembl_id, limit=500)
└─ Filter: standard_type in ["IC50", "Ki", "Kd", "EC50"]
└─ Filter: standard_value < 10000 nM
└─ Extract: ChEMBL molecule IDs, SMILES, potency values
2. ChEMBL_get_molecule(molecule_chembl_id) # For top actives
└─ Extract: Full molecular data, max_phase, oral flag
Activity Summary Table:
markdown
2.1 Known Active Compounds (ChEMBL)
Total Bioactivity Points: 2,847 (IC50: 1,234 | Ki: 892 | Kd: 456 | EC50: 265)
Compounds with IC50 < 100 nM: 156
Approved Drugs for This Target: 5
| Compound | ChEMBL ID | IC50 (nM) | Max Phase | SMILES (truncated) |
|---|---|---|---|---|
| Erlotinib | CHEMBL553 | 2 | 4 | COc1cc2ncnc(Nc3ccc... |
| Gefitinib | CHEMBL939 | 5 | 4 | COc1cc2ncnc(Nc3ccc... |
| [Novel] | CHEMBL123 | 12 | 0 | c1ccc(NC(=O)c2ccc... |
Source: ChEMBL via ChEMBL_get_target_activities (CHEMBL203)
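The two filters listed for ChEMBL_get_target_activities in 2.1 (potency type and a 10,000 nM ceiling) can be applied as sketched below. The record keys `standard_type`, `standard_value` and `standard_units` follow ChEMBL's activity field names, but whether the ToolUniverse tool returns records in exactly this shape is an assumption, as is the helper name `filter_activities`.

```python
POTENCY_TYPES = {"IC50", "Ki", "Kd", "EC50"}


def filter_activities(records, max_nM=10000.0):
    """Keep potency measurements below max_nM, sorted most potent first."""
    kept = []
    for rec in records:
        if rec.get("standard_type") not in POTENCY_TYPES:
            continue
        if rec.get("standard_units") != "nM":
            continue  # skip values that are not directly comparable
        try:
            value = float(rec.get("standard_value"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric value
        if value < max_nM:
            kept.append({**rec, "standard_value": value})
    return sorted(kept, key=lambda r: r["standard_value"])
```

Dropping rows with non-nM units is a conservative choice; converting them would also work but needs a unit table.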
2.2 GtoPdb Interactions
GtoPdb_get_target_interactions(target_id)
└─ Extract: Ligands with pKi/pIC50, selectivity data
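GtoPdb reports potencies on a negative log scale (pKi, pIC50), while the ChEMBL and BindingDB tables in this workflow use nM. Since pKi = -log10(Ki in mol/L), the conversion is Ki (nM) = 10^(9 - pKi). The helper names below are hypothetical; the arithmetic is standard.

```python
import math


def p_to_nM(p_value):
    """Convert pKi/pIC50/pKd to a concentration in nM."""
    return 10 ** (9 - p_value)


def nM_to_p(value_nM):
    """Convert a concentration in nM to the negative log10 molar scale."""
    if value_nM <= 0:
        raise ValueError("Concentration must be positive")
    return 9 - math.log10(value_nM)
```

For example, a pKi of 9 corresponds to 1 nM and a pKi of 6 to 1,000 nM, so the 10,000 nM cutoff used for ChEMBL data is equivalent to requiring a p-value above 5.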
2.3 Chemical Probes
OpenTargets_get_chemical_probes_by_target_ensemblID(ensemblId)
└─ Extract: Validated chemical probes with ratings
Output for Report:
markdown
2.3 Chemical Probes
| Probe | Target | Rating | Use | Caveat | Source |
|---|---|---|---|---|---|
| Probe-X | EGFR | ★★★★ | In vivo | None | Chemical Probes Portal |
| Probe-Y | EGFR | ★★★☆ | In vitro | Off-target kinase activity | Open Targets |
Recommended Probe for Target Validation: Probe-X (highest rating, validated in vivo)
2.4 SAR Analysis from Actives
Identify common scaffolds and SAR trends:
markdown
2.4 Structure-Activity Relationships
Core Scaffolds Identified:
- 4-Anilinoquinazoline (34 compounds, IC50 range: 2-500 nM)
  - N1 position: Aryl preferred
  - C6/C7: Methoxy groups improve potency
- Pyrimidine-amine (12 compounds, IC50 range: 15-800 nM)
  - Less potent than quinazolines
  - Better selectivity profile
Key SAR Insights:
- Halogen at meta position of aniline increases potency 3-5x
- C7 ethoxy group critical for binding (H-bond to M793)
2.5 BindingDB Affinity Data (NEW)
BindingDB provides experimental binding affinity data complementary to ChEMBL:
python
def _affinity_to_float(value):
    """Parse an affinity such as '0.3' or '>10000'; unparseable values sort last."""
    try:
        return float(str(value).lstrip('<>=~ '))
    except (TypeError, ValueError):
        return 1e6


def get_bindingdb_ligands(tu, uniprot_id, affinity_cutoff=10000):
    """
    Get ligands from BindingDB with measured affinities.
    BindingDB advantages:
    - May have compounds not in ChEMBL
    - Different affinity types (Ki, IC50, Kd)
    - Direct literature links
    """
    result = tu.tools.BindingDB_get_ligands_by_uniprot(
        uniprot=uniprot_id,
        affinity_cutoff=affinity_cutoff  # nM
    )
    if result:
        ligands = []
        for entry in result:
            ligands.append({
                'smiles': entry.get('smile'),
                'affinity_type': entry.get('affinity_type'),
                'affinity_nM': entry.get('affinity'),
                'pmid': entry.get('pmid'),
                'monomer_id': entry.get('monomerid')
            })
        # Sort by potency (lowest nM first); tolerate qualifiers or missing values
        ligands.sort(key=lambda x: _affinity_to_float(x['affinity_nM']))
        return ligands[:50]  # 50 most potent
    return []


def find_compound_polypharmacology(tu, smiles, similarity_cutoff=0.85):
    """Find off-target interactions for selectivity analysis."""
    targets = tu.tools.BindingDB_get_targets_by_compound(
        smiles=smiles,
        similarity_cutoff=similarity_cutoff
    )
    return targets  # Other proteins this compound may bind
BindingDB Output for Report:
markdown
2.5 Additional Ligands (BindingDB) (NEW)
Total Unique Ligands: 89 (non-overlapping with ChEMBL)
Most Potent: 0.3 nM Ki
| SMILES | Affinity Type | Value (nM) | PMID | BindingDB ID |
|---|---|---|---|---|
| CC(C)Cc1ccc... | Ki | 0.3 | 15737014 | 12345 |
| COc1cc2ncnc... | IC50 | 2.1 | 16460808 | 12346 |
Novel Scaffolds from BindingDB: 3 scaffolds not seen in ChEMBL data
Source: BindingDB via BindingDB_get_ligands_by_uniprot
2.6 PubChem BioAssay Screening Data (NEW)
PubChem BioAssay provides HTS screening results and dose-response data:
python
def get_pubchem_assays_for_target(tu, gene_symbol):
    """
    Get bioassays and active compounds from PubChem.
    Advantages:
    - HTS data not in ChEMBL
    - NIH-funded screening programs (MLPCN)
    - Dose-response curves for IC50 calculation
    """
    # Search assays targeting this gene
    assays = tu.tools.PubChem_search_assays_by_target_gene(
        gene_symbol=gene_symbol
    )
    results = {
        'assays': [],
        'total_active_compounds': 0
    }
    if assays.get('data', {}).get('aids'):
        for aid in assays['data']['aids'][:10]:  # First 10 assays
            # Get assay summary
            summary = tu.tools.PubChem_get_assay_summary(aid=aid)
            # Get active compounds
            actives = tu.tools.PubChem_get_assay_active_compounds(aid=aid)
            active_cids = actives.get('data', {}).get('cids', [])
            results['assays'].append({
                'aid': aid,
                'summary': summary.get('data', {}),
                'active_count': len(active_cids)
            })
            results['total_active_compounds'] += len(active_cids)
    return results


def get_dose_response_data(tu, aid):
    """Get dose-response curves for IC50/EC50 determination."""
    dr_data = tu.tools.PubChem_get_assay_dose_response(aid=aid)
    return dr_data


def get_compound_bioactivity_profile(tu, cid):
    """Get all bioactivity data for a compound."""
    profile = tu.tools.PubChem_get_compound_bioactivity(cid=cid)
    return profile
PubChem BioAssay Output for Report:
markdown
2.6 PubChem HTS Screening Data (NEW)
Assays Found: 45
Total Active Compounds Across Assays: ~1,200
| AID | Assay Type | Active Compounds | Target | Description |
|---|---|---|---|---|
| 504526 | HTS | 234 | EGFR | qHTS inhibition screen |
| 1053104 | Dose-response | 12 | EGFR kinase | Confirmatory IC50 |
| 651564 | Cellular | 8 | EGFR | Cell proliferation assay |
Novel Actives (not in ChEMBL/BindingDB):
- CID 12345678: Active in AID 504526, IC50 = 45 nM
- CID 23456789: Active in AID 1053104, IC50 = 120 nM
Source: PubChem via PubChem_search_assays_by_target_gene, PubChem_get_assay_active_compounds
**Why Use Both BindingDB and PubChem**:
| Source | Strengths | Best For |
|--------|-----------|----------|
| **ChEMBL** | Curated, standardized, SAR data | Primary ligand source |
| **BindingDB** | Direct affinity measurements | Ki/Kd values, PMIDs |
| **PubChem BioAssay** | HTS data, NIH screens | Novel scaffolds, broad coverage |
---
Phase 3: Structure Analysis
3.1 PDB Structure Retrieval
1. PDB_search_similar_structures(query=uniprot_accession, type="sequence")
└─ Extract: PDB IDs with ligands
2. get_protein_metadata_by_pdb_id(pdb_id)
└─ Extract: Resolution, method, ligand codes
3. alphafold_get_prediction(accession=uniprot_accession)
└─ Extract: Predicted structure (if no experimental)
3.1b EMDB Cryo-EM Structures (NEW)
Prioritize EMDB for: Membrane proteins (GPCRs, ion channels), large complexes, targets with multiple conformational states.
python
def get_cryoem_structures(tu, target_name, uniprot_accession):
    """Get cryo-EM structures for membrane targets."""
    # Search EMDB
    emdb_results = tu.tools.emdb_search(
        query=f"{target_name} membrane receptor"
    )
    structures = []
    for entry in emdb_results[:5]:
        details = tu.tools.emdb_get_entry(entry_id=entry['emdb_id'])
        # Get associated PDB model (essential for docking)
        pdb_models = details.get('pdb_ids', [])
        structures.append({
            'emdb_id': entry['emdb_id'],
            'resolution': entry.get('resolution', 'N/A'),
            'title': entry.get('title', 'N/A'),
            'conformational_state': details.get('state', 'Unknown'),
            'pdb_models': pdb_models
        })
    return structures
When to use cryo-EM over X-ray:
| Target Type | Prefer cryo-EM? | Reason |
|---|---|---|
| GPCR | Yes | Native membrane conformation |
| Ion channel | Yes | Multiple functional states |
| Receptor-ligand complex | Yes | Physiological state |
| Kinase | Usually X-ray | Higher resolution typically |
Structure Summary:
markdown
3.1 Available Structures
| PDB ID | Resolution | Method | Ligand | Affinity | State |
|---|---|---|---|---|---|
| 1M17 | 2.6 Å | X-ray | Erlotinib | Ki=0.4 nM | Active |
| 4HJO | 2.1 Å | X-ray | Lapatinib | Ki=3 nM | Inactive |
| AF-P00533 | - | Predicted | None | - | - |
3.1b Cryo-EM Structures (EMDB)
| EMDB ID | Resolution | PDB Model | Conformation | Ligand |
|---|---|---|---|---|
| EMD-12345 | 3.2 Å | 7ABC | Active | Agonist |
| EMD-23456 | 3.5 Å | 8DEF | Inactive | Antagonist |
Best Structure for Docking: 1M17 (high resolution, relevant ligand)
Source: RCSB PDB, EMDB, AlphaFold DB
3.2 Binding Pocket Analysis
```
get_binding_affinity_by_pdb_id(pdb_id)
└─ Extract: Binding affinities for co-crystallized ligands
```
Output for Report:
3.2 Binding Pocket Characterization
Pocket Volume: ~850 ų (well-defined)
Key Interaction Residues:
- Hinge region: M793 (backbone H-bond donor/acceptor)
- Gatekeeper: T790 (small residue, allows access)
- DFG motif: D855 (active conformation)
- Selectivity pocket: L788, G796 (unique to EGFR)
Druggability Assessment: High (enclosed pocket, conserved interactions)
---
Phase 3.5: Docking Validation (NVIDIA NIM)
Validate structure and score compounds using molecular docking.
Requires: NVIDIA_API_KEY environment variable
3.5.1 Reference Compound Docking
Dock a known inhibitor to validate the structure captures the binding pocket correctly.
Option A: DiffDock (Blind docking, PDB + SDF input)
```
NvidiaNIM_diffdock(
    protein=pdb_content,    # PDB text content
    ligand=reference_sdf,   # SDF/MOL2 content
    num_poses=10
)
└─ Returns: Docking poses with confidence scores
└─ Use: When you have PDB structure and ligand SDF file
```
Option B: Boltz2 (From sequence + SMILES)
```
NvidiaNIM_boltz2(
    polymers=[{"molecule_type": "protein", "sequence": kinase_sequence}],
    ligands=[{"smiles": "COc1cc2ncnc(Nc3ccc(C#C)cc3)c2cc1OCCOC"}],
    sampling_steps=50,
    diffusion_samples=1
)
└─ Returns: Protein-ligand complex structure
└─ Use: When starting from SMILES, no SDF needed
```
3.5.2 Docking Score Interpretation
| Score vs Reference | Priority | Symbol |
|---|---|---|
| Higher than reference | Top priority | ★★★★ |
| Within 5% of reference | High priority | ★★★ |
| Within 20% of reference | Moderate priority | ★★☆ |
| >20% lower | Low priority | ★☆☆ |
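The table above can be read as a simple threshold rule on the relative difference between a candidate's score and the reference score. The sketch below is one possible encoding, assuming that higher scores are better (as with a DiffDock confidence value) and that the reference score is positive; `docking_priority` is a hypothetical helper name, not a ToolUniverse function.

```python
def docking_priority(score: float, reference: float) -> str:
    """Map a docking score to a priority label relative to a reference score.

    Assumes higher is better and reference > 0 (both are assumptions).
    """
    if reference <= 0:
        raise ValueError("reference score must be positive")
    # Relative difference: +0.05 means 5% higher than the reference
    relative = (score - reference) / reference
    if relative > 0:
        return "Top priority"
    if relative >= -0.05:
        return "High priority"
    if relative >= -0.20:
        return "Moderate priority"
    return "Low priority"
```

A score equal to the reference falls into "High priority" here, since the first row of the table requires a strictly higher score.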
Report Format:
3.5 Docking Validation Results
Reference Compound: Erlotinib
Method: DiffDock via NVIDIA NIM
| Metric | Value | Interpretation |
|---|---|---|
| Best Pose Confidence | 0.906 | Excellent |
| Steric Clashes | None | Clean binding pose |
Validation Status: ✓ Structure captures binding pocket correctly
Source: NVIDIA NIM via NvidiaNIM_diffdock
---
Phase 4: Compound Expansion
4.1 Similarity Search
Starting from top actives, expand chemical space:
```
1. ChEMBL_search_similar_molecules(molecule=top_active_smiles, similarity=70)
   └─ Extract: Similar compounds not yet tested on target
2. PubChem_search_compounds_by_similarity(smiles, threshold=0.7)
   └─ Extract: PubChem CIDs with similar structures
```
Strategy:
- Use 3-5 diverse actives as seeds
- Similarity threshold: 70-85% (balance novelty vs. activity)
- Prioritize compounds NOT in ChEMBL bioactivity for target
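The strategy above amounts to keeping candidates whose best similarity to any seed falls inside the 70-85% window and that have no recorded activity on the target. The following is a minimal sketch of that filter, assuming fingerprints are already available as sets of on-bit indices; `tanimoto` and `select_novel_analogs` are illustrative names, and in practice the fingerprints would come from a cheminformatics toolkit.

```python
from __future__ import annotations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_novel_analogs(seeds, candidates, tested_ids, low=0.70, high=0.85):
    """Keep untested candidates whose best seed similarity lies in [low, high].

    seeds / candidates: dicts mapping an ID to a fingerprint set (assumed format).
    tested_ids: IDs that already have bioactivity data for the target.
    """
    selected = []
    for cid, fingerprint in candidates.items():
        if cid in tested_ids:
            continue  # already tested on the target, so not novel
        best = max(tanimoto(fingerprint, seed) for seed in seeds.values())
        if low <= best <= high:
            selected.append((cid, round(best, 2)))
    # Most similar first
    return sorted(selected, key=lambda item: item[1], reverse=True)
```

Using the best similarity across all seeds, rather than the mean, keeps a candidate that is a close analog of a single seed even when the seeds themselves are diverse.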
4.2 Substructure Search
```
1. ChEMBL_search_substructure(smiles=core_scaffold)
   └─ Extract: Compounds containing scaffold
2. PubChem_search_compounds_by_substructure(smiles=core_scaffold)
   └─ Extract: Additional scaffold-containing compounds
```
4.3 Cross-Database Mining
```
1. STITCH_get_chemical_protein_interactions(identifier=target_gene)
   └─ Extract: Additional chemical-protein links
2. DGIdb_get_drug_gene_interactions(genes=[gene_symbol])
   └─ Extract: Approved/investigational drugs
```
Output for Report:
4. Compound Expansion Results
Starting Seeds: 5 diverse actives (IC50 < 100 nM)
Similarity Expansion: 847 compounds (70% threshold)
Substructure Search: 234 scaffold matches
Cross-Database: 45 additional hits
After Deduplication: 923 unique candidate compounds
| Source | Compounds | Already Tested | Novel Candidates |
|---|---|---|---|
| ChEMBL similarity | 456 | 234 | 222 |
| PubChem similarity | 391 | 156 | 235 |
| ChEMBL substructure | 178 | 89 | 89 |
| STITCH | 45 | 23 | 22 |
| Total Unique | 923 | 355 | 568 |
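The "After Deduplication" and "Novel Candidates" figures come from merging hits across sources and removing compounds already tested on the target. A minimal sketch of that bookkeeping follows, assuming each hit is a dict with `inchikey` and `smiles` fields; the field names and the `merge_hits` helper are hypothetical and not part of any tool's documented output.

```python
def merge_hits(hits_by_source, tested_keys):
    """Merge hits from several sources, keyed by InChIKey, and split off novel ones.

    hits_by_source: dict of source name -> list of {"inchikey", "smiles"} dicts.
    tested_keys: set of InChIKeys with existing bioactivity data for the target.
    Returns (merged, novel), both dicts keyed by InChIKey.
    """
    merged = {}
    for source, hits in hits_by_source.items():
        for hit in hits:
            record = merged.setdefault(
                hit["inchikey"], {"smiles": hit["smiles"], "sources": []}
            )
            if source not in record["sources"]:
                record["sources"].append(source)  # remember provenance
    novel = {key: rec for key, rec in merged.items() if key not in tested_keys}
    return merged, novel
```

Keeping the list of sources per compound makes it straightforward to fill the per-source rows of the table while still reporting a single deduplicated total.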
4.4 De Novo Molecule Generation (NVIDIA NIM)
When database mining yields insufficient candidates, generate novel molecules.
Requires: NVIDIA_API_KEY environment variable
Option A: GenMol (Scaffold Hopping with Masked Regions)
```
NvidiaNIM_genmol(
    smiles="COc1cc2ncnc(Nc3ccc([*{3-8}])c([*{1-3}])c3)c2cc1OCCCN1CCOCC1",
    num_molecules=100,
    temperature=2.0,
    scoring="QED"
)
└─ Input: SMILES with [*{min-max}] masked regions
└─ Output: Generated molecules with QED/LogP scores
└─ Use: Explore specific positions while keeping scaffold
```
Mask Design Strategy:
| Position | Mask | Purpose |
|---|---|---|
| Aniline substituent | | Small groups (halogen, methyl) |
| Solubilizing group | | Morpholine, piperazine variants |
| Linker region | | Spacer variations |
Example Masked SMILES for EGFR:
```
# Keep quinazoline core, vary aniline and tail
COc1cc2ncnc(Nc3ccc([*{1-3}])c([*{1-3}])c3)c2cc1[*{5-12}]
```
Option B: MolMIM (Controlled Generation from Reference)
```
NvidiaNIM_molmim(
    smi="COc1cc2ncnc(Nc3ccc(Cl)cc3)c2cc1OCCN1CCOCC1",
    num_molecules=50,
    algorithm="CMA-ES"
)
└─ Input: Reference SMILES (known active)
└─ Output: Optimized analogs with property scores
└─ Use: Generate close analogs of top actives
```
Generation Workflow:
1. Identify top 3-5 actives from Phase 2
2. Design masked SMILES for GenMol OR use as reference for MolMIM
3. Generate 50-100 molecules per seed
4. Pass generated molecules to Phase 5 (ADMET filtering)
5. Dock survivors in Phase 6 for final ranking
Report Format:
4.4 De Novo Generation Results
Method: GenMol via NVIDIA NIM
Seed Scaffold: 4-anilinoquinazoline (from erlotinib)
Masked Positions: Aniline (3,4), solubilizing tail
| Metric | Value |
|---|---|
| Molecules Generated | 100 |
| Passing Lipinski | 95 (95%) |
| Mean QED Score | 0.72 |
| Unique Scaffolds | 12 |
Top Generated Compounds:
| ID | SMILES | QED | LogP | Novelty |
|---|---|---|---|---|
| GEN-001 | COc1cc2ncnc(Nc3ccc(Cl)c(Cl)c3)c2cc1OCCCN1CCOCC1 | 0.81 | 4.2 | Novel substitution |
| GEN-002 | COc1cc2ncnc(Nc3ccc(C#N)c(F)c3)c2cc1OCCCN1CCOCC1 | 0.78 | 3.8 | Novel substitution |
Source: NVIDIA NIM via NvidiaNIM_genmol
---
Phase 5: ADMET Filtering
5.1 Physicochemical Properties
```
ADMETAI_predict_physicochemical_properties(smiles=[compound_list])
└─ Filter: Lipinski violations ≤ 1
└─ Filter: QED > 0.3
└─ Filter: MW 200-600
```
5.2 ADMET Endpoints
```
1. ADMETAI_predict_bioavailability(smiles=[compound_list])
   └─ Filter: Oral bioavailability > 0.3
2. ADMETAI_predict_toxicity(smiles=[compound_list])
   └─ Filter: AMES < 0.5, hERG < 0.5, DILI < 0.5
3. ADMETAI_predict_CYP_interactions(smiles=[compound_list])
   └─ Flag: CYP3A4 inhibitors (drug interaction risk)
```
5.3 Structural Alerts
```
ChEMBL_search_compound_structural_alerts(smiles=compound_smiles)
└─ Flag: PAINS, reactive groups, toxicophores
```
ADMET Filter Summary:
5. ADMET Filtering Results
| Filter Stage | Input | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| Physicochemical (Lipinski) | 568 | 456 | 112 | 80% |
| Drug-likeness (QED > 0.3) | 456 | 398 | 58 | 87% |
| Bioavailability (> 0.3) | 398 | 312 | 86 | 78% |
| Toxicity filters | 312 | 267 | 45 | 86% |
| Structural alerts | 267 | 234 | 33 | 88% |
| Final Candidates | 568 | 234 | 334 | 41% |
Common Failure Reasons:
- High molecular weight (>600): 67 compounds
- Low predicted bioavailability: 86 compounds
- hERG liability: 28 compounds
- PAINS alerts: 18 compounds
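The funnel table can be produced by applying the Phase 5 thresholds as sequential stages and recording counts at each step. The sketch below assumes every compound is a dict of already-predicted properties; the field names (`lipinski_violations`, `mw`, `qed`, `bioavailability`, `ames`, `herg`, `dili`, `alerts`) are placeholders chosen for illustration and do not reflect the actual ADMET-AI output schema.

```python
# Each stage: (label, predicate returning True when the compound passes).
# Thresholds are taken from sections 5.1-5.3; field names are assumptions.
STAGES = [
    ("Physicochemical (Lipinski)",
     lambda c: c["lipinski_violations"] <= 1 and 200 <= c["mw"] <= 600),
    ("Drug-likeness (QED > 0.3)", lambda c: c["qed"] > 0.3),
    ("Bioavailability (> 0.3)", lambda c: c["bioavailability"] > 0.3),
    ("Toxicity filters",
     lambda c: c["ames"] < 0.5 and c["herg"] < 0.5 and c["dili"] < 0.5),
    ("Structural alerts", lambda c: not c["alerts"]),
]


def run_funnel(compounds, stages=STAGES):
    """Apply stages in order; return (survivors, per-stage summary rows)."""
    rows = []
    current = list(compounds)
    for label, passes in stages:
        passed = [c for c in current if passes(c)]
        rows.append({
            "stage": label,
            "input": len(current),
            "passed": len(passed),
            "failed": len(current) - len(passed),
            "pass_rate": len(passed) / len(current) if current else 0.0,
        })
        current = passed  # only survivors enter the next stage
    return current, rows
```

Because each stage only sees the survivors of the previous one, the "Input" column of one row always equals the "Passed" column of the row above it, as in the table.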
---
Phase 6: Candidate Prioritization
6.1 Scoring Framework
Score each candidate on multiple dimensions:
| Dimension | Weight | Scoring Criteria |
|---|---|---|
| Structural Similarity | 25% | Tanimoto to actives (0.7-1.0 → 1-5) |
| Novelty | 20% | Not in ChEMBL bioactivity = +2; Novel scaffold = +3 |
| ADMET Score | 25% | Composite of property predictions |
| Synthesis Feasibility | 15% | SA score (1-10), commercial availability |
| Scaffold Diversity | 15% | Cluster representative bonus |
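One way to combine these dimensions is a weighted sum of sub-scores that share a common 1-5 scale. The sketch below uses the weights from the table and a linear mapping of Tanimoto 0.7-1.0 onto 1-5; how the novelty bonuses, SA score, and cluster bonus are converted to that scale is not specified in the table, so the caller is assumed to supply those sub-scores already normalized. All names are illustrative.

```python
WEIGHTS = {
    "similarity": 0.25,
    "novelty": 0.20,
    "admet": 0.25,
    "synthesis": 0.15,
    "diversity": 0.15,
}


def similarity_points(tanimoto: float) -> float:
    """Linearly map Tanimoto 0.7-1.0 onto 1-5, clamping values outside the range."""
    clamped = min(max(tanimoto, 0.7), 1.0)
    return 1.0 + (clamped - 0.7) / 0.3 * 4.0


def overall_score(subscores: dict) -> float:
    """Weighted sum of sub-scores; every dimension is expected on a 1-5 scale."""
    missing = set(WEIGHTS) - set(subscores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)
```

With all sub-scores on the same scale, the overall score stays between 1 and 5 and can be compared directly with the "Overall" column of the final candidate table.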
6.2 Synthesis Feasibility
6.2 Synthesis Feasibility Assessment
| Candidate | SA Score | Commercial | Estimated Steps | Flag |
|---|---|---|---|---|
| Compound-1 | 2.3 | Yes (Enamine) | 0 | ★★★ |
| Compound-2 | 3.5 | Building block | 2-3 | ★★☆ |
| Compound-3 | 5.8 | No | 6-8 | ★☆☆ |
SA Score Interpretation:
- 1-3: Easy synthesis
- 3-5: Moderate complexity
- 5-10: Challenging synthesis
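The three bands share their boundary values (3 and 5), so any implementation has to pick which side a boundary score falls on. The sketch below assumes the upper bound of each band is inclusive; that choice, and the `sa_bucket` name, are assumptions rather than something the interpretation list specifies.

```python
def sa_bucket(sa_score: float) -> str:
    """Classify a synthetic accessibility (SA) score on the 1-10 scale.

    Boundary handling (3 -> easy, 5 -> moderate) is an assumption.
    """
    if not 1 <= sa_score <= 10:
        raise ValueError("SA score is expected to lie between 1 and 10")
    if sa_score <= 3:
        return "Easy synthesis"
    if sa_score <= 5:
        return "Moderate complexity"
    return "Challenging synthesis"
```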
6.3 Final Prioritized List
6.3 Top 20 Candidate Compounds
| Rank | ID | SMILES | Sim. Score | ADMET | Novelty | Overall | Rationale |
|---|---|---|---|---|---|---|---|
| 1 | CPD-001 | Cc1ccc... | 0.82 | 4.5 | Novel scaffold | 4.2 | High similarity, clean ADMET |
| 2 | CPD-002 | COc1cc... | 0.78 | 4.3 | Not tested | 4.0 | Quinazoline analog |
| 3 | CPD-003 | Nc1ccc... | 0.75 | 4.1 | Novel core | 3.9 | New chemotype |
| ... | ... | ... | ... | ... | ... | ... | ... |
Scaffold Diversity: 7 distinct scaffolds in top 20
Commercial Availability: 12/20 available for purchase
Estimated Hit Rate: 15-25% (based on similarity to actives)
---
Phase 6.5: Literature Evidence (NEW)
6.5.1 Literature Search for Validation
Search literature to validate candidate compounds and understand target context.
```python
def search_binder_literature(tu, target_name, compound_scaffolds):
    """Search literature for compound and target evidence."""
    # PubMed: Published SAR studies
    sar_papers = tu.tools.PubMed_search_articles(
        query=f"{target_name} inhibitor SAR structure-activity",
        limit=30
    )
    # BioRxiv: Latest unpublished findings
    preprints = tu.tools.BioRxiv_search_preprints(
        query=f"{target_name} small molecule discovery",
        limit=15
    )
    # MedRxiv: Clinical data on inhibitors
    clinical = tu.tools.MedRxiv_search_preprints(
        query=f"{target_name} inhibitor clinical trial",
        limit=10
    )
    # Citation analysis for key papers (first 10 PubMed hits)
    key_papers = sar_papers[:10]
    for paper in key_papers:
        citation = tu.tools.openalex_search_works(
            query=paper['title'],
            limit=1
        )
        # Default to 0 citations when OpenAlex returns no match
        paper['citations'] = citation[0].get('cited_by_count', 0) if citation else 0
    return {
        'published_sar': sar_papers,
        'preprints': preprints,
        'clinical_preprints': clinical,
        'high_impact_papers': sorted(
            key_papers, key=lambda x: x.get('citations', 0), reverse=True
        )
    }
```
6.5.2 Output for Report
6.5 Literature Evidence
Published SAR Studies
| PMID | Title | Year | Key Insight |
|---|---|---|---|
| 34567890 | Discovery of novel EGFR inhibitors... | 2024 | C7 substitution critical |
| 33456789 | Structure-activity relationship of... | 2023 | Fluorine improves potency |
Recent Preprints (⚠️ Not Peer-Reviewed)
| Source | Title | Posted | Relevance |
|---|---|---|---|
| BioRxiv | Novel scaffolds for EGFR... | 2024-02 | New chemotype discovery |
| MedRxiv | Clinical activity of... | 2024-01 | Phase 2 results |
High-Impact References
| PMID | Citations | Title |
|---|---|---|
| 32123456 | 523 | Landmark EGFR inhibitor study... |
| 31234567 | 312 | Comprehensive SAR analysis... |
Source: PubMed, BioRxiv, MedRxiv, OpenAlex
---
Report Template
File: [TARGET]_binder_discovery_report.md
Small Molecule Binder Discovery: [TARGET]
Generated: [Date] | Query: [Original query] | Status: In Progress
Executive Summary
[Researching...]
1. Target Validation
1.1 Target Identifiers
[Researching...]
1.2 Druggability Assessment
[Researching...]
1.3 Binding Site Analysis
[Researching...]
2. Known Ligand Landscape
2.1 ChEMBL Bioactivity Summary
[Researching...]
2.2 Approved Drugs & Clinical Compounds
[Researching...]
2.3 Chemical Probes
[Researching...]
2.4 SAR Insights
[Researching...]
3. Structural Information
3.1 Available Structures
[Researching...]
3.2 Binding Pocket Analysis
[Researching...]
3.3 Key Interactions
[Researching...]
4. Compound Expansion
4.1 Similarity Search Results
[Researching...]
4.2 Substructure Search Results
[Researching...]
4.3 Cross-Database Mining
[Researching...]
5. ADMET Filtering
5.1 Physicochemical Filters
[Researching...]
5.2 ADMET Predictions
[Researching...]
5.3 Structural Alerts
[Researching...]
5.4 Filter Summary
[Researching...]
6. Candidate Prioritization
6.1 Scoring Methodology
[Researching...]
6.2 Synthesis Feasibility
[Researching...]
6.3 Top 20 Candidates
[Researching...]
7. Recommendations
7.1 Immediate Actions
[Researching...]
7.2 Experimental Validation Plan
[Researching...]
7.3 Backup Strategies
[Researching...]
8. Data Gaps & Limitations
[Researching...]
9. Data Sources
[Will be populated as research progresses...]
10. Methods Summary
| Step | Tool | Purpose |
|---|---|---|
| Sequence retrieval | UniProt_search | Get protein sequence |
| Structure prediction | NvidiaNIM_alphafold2 / NvidiaNIM_esmfold | 3D structure with pLDDT |
| Docking validation | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Validate binding pocket |
| Known ligands | ChEMBL_get_target_activities | Bioactivity data |
| Similarity search | ChEMBL_search_similar_molecules | Expand chemical space |
| De novo generation | NvidiaNIM_genmol / NvidiaNIM_molmim | Novel molecule design |
| ADMET filtering | ADMETAI_predict_* | Drug-likeness assessment |
| Candidate docking | NvidiaNIM_diffdock / NvidiaNIM_boltz2 | Final scoring |
---
Evidence Grading
| Tier | Symbol | Description | Example |
|---|---|---|---|
| T0 | ★★★★ | Docking score > reference inhibitor | Better than erlotinib |
| T1 | ★★★ | Experimental IC50/Ki < 100 nM | ChEMBL bioactivity |
| T2 | ★★☆ | Docking within 5% of reference OR IC50 100-1000 nM | High priority |
| T3 | ★☆☆ | Structural similarity > 80% to T1 | Predicted active |
| T4 | ☆☆☆ | Similarity 70-80%, scaffold match | Lower confidence |
| T5 | ○○○ | Generated molecule, ADMET-passed, no docking | Speculative |
Docking-Enhanced Grading: When NVIDIA NIM docking is available, compounds gain evidence:
- Docking > reference → upgrade to T0 (★★★★)
- Docking within 5% → upgrade to T2 (★★☆)
- Docking within 20% → maintain current tier
- Docking >20% worse → downgrade one tier
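The docking-enhanced rules can be expressed as a small tier-adjustment function. The sketch below assumes the docking result is given as a signed percentage difference from the reference (positive means better), and it treats "upgrade to T2" as applying only to compounds currently graded below T2, so a T1 compound is never demoted by a near-reference docking score. Both the function name and that reading of the rule are assumptions.

```python
TIERS = ["T0", "T1", "T2", "T3", "T4", "T5"]  # best to worst


def adjust_tier(current: str, docking_delta_pct: float) -> str:
    """Adjust an evidence tier from the docking score relative to the reference.

    docking_delta_pct: e.g. +8.3 for 8.3% better, -15 for 15% worse.
    """
    index = TIERS.index(current)
    if docking_delta_pct > 0:
        return "T0"  # docking better than reference
    if docking_delta_pct >= -5:
        return TIERS[min(index, TIERS.index("T2"))]  # upgrade only, never demote
    if docking_delta_pct >= -20:
        return current  # within 20%: maintain current tier
    return TIERS[min(index + 1, len(TIERS) - 1)]  # more than 20% worse: demote once
```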
Apply to all candidate compounds:
| Compound | Evidence | Docking vs Ref | Rationale |
|----------|----------|----------------|-----------|
| CPD-001 | ★★★★ | +8.3% | 85% similar, docking > erlotinib |
| CPD-002 | ★★★ | -2.1% | IC50=45nM, validated by docking |
| CPD-003 | ★★☆ | -4.5% | 78% similar, good docking |
| GEN-001 | ★☆☆ | -15% | Generated, ADMET-passed |
Mandatory Completeness Checklist
Phase 1: Target Validation
- UniProt accession resolved
- ChEMBL target ID obtained
- Druggability assessed (≥2 sources)
- Target class identified
- Binding site information (or "No structural data")
Phase 2: Known Ligands
- ChEMBL activities queried (≥100 or all available)
- Activity statistics (count, potency range)
- Top 10 actives listed with IC50
- Chemical probes identified (or "None available")
- SAR insights summarized
Phase 3: Structure
- PDB structures listed (or "No experimental structure")
- Best structure for docking identified
- Binding pocket described (or "Predicted from AlphaFold")
Phase 4: Expansion
- ≥3 seed compounds used
- Similarity search completed (≥100 results or exhausted)
- Substructure search completed
- Deduplicated candidate count reported
Phase 5: ADMET
- Physicochemical filters applied
- Toxicity predictions run
- Structural alerts checked
- Filter funnel table included
Phase 6: Prioritization
- ≥20 candidates ranked (or all if fewer)
- Scoring methodology explained
- Synthesis feasibility assessed
- Scaffold diversity noted
Phase 7: Recommendations
- ≥3 immediate actions listed
- Experimental validation plan outlined
- Data gaps aggregated
Tool Reference by Phase
Phase 1: Target Validation
| Tool | Purpose |
|---|---|
| | Resolve UniProt accession |
| | Get Ensembl/NCBI IDs |
| | Get ChEMBL target ID |
| | Tractability assessment |
| | Druggability categories |
| | Binding site info |
| | Domain architecture |
Phase 2: Known Ligands
| Tool | Purpose |
|---|---|
| | Bioactivity data |
| | Molecule details |
| | Pharmacology data |
| | Chemical probes |
| | Known drugs |
Phase 1.4: Structure Prediction (NVIDIA NIM)
| Tool | Purpose |
|---|---|
| | High-accuracy structure prediction with pLDDT |
| | Fast structure prediction (max 1024 AA) |
| | MSA generation for AlphaFold |
Phase 3: Structure
| Tool | Purpose |
|---|---|
| | Find PDB structures |
| | Structure metadata |
| | Ligand affinities |
| | Predicted structure (AlphaFold DB) |
| | Ligand structures |
| | Search cryo-EM structures (NEW) |
| | Get EMDB entry details (NEW) |
Phase 3.5: Docking Validation (NVIDIA NIM)
| Tool | Purpose |
|---|---|
| | Blind molecular docking (PDB + SDF) |
| | Protein-ligand complex (sequence + SMILES) |
Phase 4: Expansion
| Tool | Purpose |
|---|---|
| | Similarity search |
| | PubChem similarity |
| | Substructure search |
| | PubChem substructure |
| | Cross-database |
Phase 4.4: De Novo Generation (NVIDIA NIM)
| Tool | Purpose |
|---|---|
| | Scaffold hopping with masked regions |
| | Controlled generation from reference |
Phase 5: ADMET
| Tool | Purpose |
|---|---|
| | Drug-likeness |
| | Oral absorption |
| | Toxicity flags |
| | CYP liabilities |
| | PAINS, alerts |
Phase 6: Candidate Docking (NVIDIA NIM)
| Tool | Purpose |
|---|---|
| | Score all candidates by docking |
| | Alternative docking from SMILES |
Phase 6.5: Literature Evidence (NEW)
| Tool | Purpose |
|---|---|
| | Published SAR studies |
| | Latest biology preprints |
| | Clinical preprints |
| | Citation analysis |
| | AI-ranked papers |
Fallback Chains
Fallback Chains
| Primary Tool | Fallback 1 | Fallback 2 | Use When |
|---|---|---|---|
| | | | No ChEMBL data |
| | | | ChEMBL exhausted |
| | | | No PDB structure |
| | | | AlphaFold DB unavailable |
| | | | AlphaFold2 NIM error |
| | | Skip docking, use similarity | Docking error |
| | | Manual scaffold hopping | Generation error |
| | | Document "Unknown" | Open Targets error |
| | SwissADME tools | Basic Lipinski | Invalid SMILES |
| | | | Membrane proteins |
| | | | Literature search |
| | | Skip preprints | Preprint sources |
NVIDIA NIM API Key Required: Tools with the NvidiaNIM_ prefix require the NVIDIA_API_KEY environment variable. Check availability at start:
```python
import os
nvidia_available = bool(os.environ.get("NVIDIA_API_KEY"))
# If not available, fall back to non-NIM alternatives
```
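The availability check can be folded into a small planning step that decides up front which NIM-dependent phases will run. The sketch below is only an illustration: the `plan_nim_phases` helper and the phase labels in the returned dict are hypothetical, and the fallback text mirrors the "Skip docking, use similarity" and "Manual scaffold hopping" entries of the fallback table.

```python
import os


def plan_nim_phases(env=None):
    """Decide how NIM-dependent phases are handled, based on NVIDIA_API_KEY.

    env: mapping to read the key from; defaults to os.environ.
    """
    env = os.environ if env is None else env
    if env.get("NVIDIA_API_KEY"):
        return {
            "docking": "NvidiaNIM_diffdock",
            "generation": "NvidiaNIM_genmol",
        }
    # No key: use the non-NIM fallbacks listed in the fallback table
    return {
        "docking": "Skip docking, use similarity",
        "generation": "Manual scaffold hopping",
    }
```

Passing the environment as an argument keeps the function easy to exercise without modifying the real process environment.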
---
Common Use Cases
Well-Characterized Target
User: "Find novel binders for EGFR"
→ Rich ChEMBL data; focus on novel scaffolds, selectivity, ADMET
Novel Target
User: "Find small molecules for [new target with no known ligands]"
→ Limited bioactivity; rely on structure-based assessment, similar target ligands
Lead Optimization
User: "Find analogs of compound X for target Y"
→ Deep similarity search around specific compound; focus on SAR
Selectivity Challenge
User: "Find selective inhibitors for kinase X vs kinase Y"
→ Include selectivity analysis; filter by off-target predictions
When NOT to Use This Skill
- Drug research → Use tooluniverse-drug-research (existing drug profiling)
- Target research only → Use tooluniverse-target-research
- Single compound ADMET → Call ADMET tools directly
- Literature search → Use tooluniverse-literature-deep-research
- Protein structure only → Use tooluniverse-protein-structure-retrieval
Use this skill for discovering new compounds for a protein target.
Additional Resources
- Checklist: CHECKLIST.md - Pre-delivery verification
- Examples: EXAMPLES.md - Detailed workflow examples
- Tool corrections: TOOLS_REFERENCE.md - Parameter corrections