target-prioritization
Target Prioritization
A multi-source drug-target due-diligence pipeline for ranked gene lists.
When this skill triggers
The user has a list of candidate genes (typically from a DE / DEG / scRNA-seq
analysis) and wants a per-gene dossier across multiple evidence dimensions
plus a composite re-ranking. The DE statistical rank is just the entry
point; the final priority is informed by protein biology, genetics,
druggability, and research maturity.
Common input shapes:
- A CSV with a `gene` column (DE output like `expression_table_pass_either_1s.csv`)
- A plain-text gene list (one symbol per line)
- A list of symbols inline in the user's message
Output
Three files inside `<output_dir>/`:
- `targets_report.md`: one section per gene, sorted by composite score, with a short LLM-written rationale and recommended next step
- `targets_summary.csv`: flat table for sorting/filtering in Excel/pandas
- `raw_data/<source>.json`: raw API responses (audit trail, reusable across future re-scorings)
Pipeline
input gene list
      │
      ▼
scripts/orchestrate.py
      │
      ├─► fetch_uniprot.py     → protein localization, surface, MHC, coding
      ├─► fetch_opentargets.py → tractability, approved drugs, associated
      │                          diseases (subsumes GWAS Catalog via OT's
      │                          integrated genetics evidence), DepMap CRISPR
      │                          essentiality, gnomAD LOEUF / pLI constraint
      ├─► fetch_pubmed.py      → paper counts (total + focus_disease + cell_context)
      ├─► fetch_hpa.py         → HPA tissue / single-cell specificity + nCPM,
      │                          expression cluster, cancer prognostics
      └─► fetch_chembl.py      → top-potency tool compounds per gene (pIC50,
                                 IC50 nM, mechanism); dossier-only, no score
      │
      ▼
scripts/aggregate.py
      │
      ▼
output_dir/
      ├─ raw_data/*.json
      ├─ targets_summary.csv   ← composite-score-ranked
      └─ targets_report.md     ← Claude fills the rationale sections
How to invoke
```bash
python3 ~/myagents/myskills/target-prioritization/scripts/orchestrate.py \
  --input <gene_list.csv_or_txt> \
  --output <output_dir> \
  [--gene-col gene] \
  [--top 50]
```
- `--input` accepts a CSV (with `--gene-col`, default `gene`), a `.txt`/`.tsv`, or any file where the first column has gene symbols. The header row is skipped if its first cell is `gene`/`symbol` (case-insensitive).
- `--top` limits the dossier to the top N input genes (default 50). Input order is preserved up to that cut; the composite score then re-ranks the genes within it.
`orchestrate.py` runs the fetchers, which write `<output_dir>/raw_data/<source>.json`; `aggregate.py` then applies `weights.yaml` and writes `targets_summary.csv` and `targets_report.md`.
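The input rules above (a named gene column, a first-column fallback, header skipping, and the `--top` cut) can be sketched as follows. This is a minimal illustration, not the code in `orchestrate.py`: the function name `read_gene_symbols` and the tab-versus-comma delimiter guess are assumptions.

```python
import csv
import io


def read_gene_symbols(text, gene_col="gene", top=50):
    """Hypothetical helper: parse gene symbols from CSV/TSV/plain-text content."""
    first_line = text.splitlines()[0] if text.strip() else ""
    # Assumption: a tab in the first line means TSV, otherwise treat as CSV.
    delimiter = "\t" if "\t" in first_line else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [r for r in reader if any(c.strip() for c in r)]
    if not rows:
        return []
    header = [c.strip().lower() for c in rows[0]]
    col = 0  # default: first column holds the symbols
    if gene_col.lower() in header:
        col = header.index(gene_col.lower())
        rows = rows[1:]  # drop the header row
    elif header[0] in ("gene", "symbol"):
        rows = rows[1:]  # header detected by its first cell
    symbols = [r[col].strip() for r in rows if len(r) > col and r[col].strip()]
    return symbols[:top]  # input order preserved up to the cut


if __name__ == "__main__":
    demo = "gene,logFC\nIL2RA,2.1\nCTLA4,1.8\nFOXP3,1.5\n"
    print(read_gene_symbols(demo, top=2))
```

Truncating before scoring means a gene below the cut is never fetched, so `--top` should be set generously when the DE ranking is noisy.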
Composite score
Weights live in `weights.yaml` and can be overridden per-run with `--weights`.
Defaults aim for "find druggable, genetically supported targets with a clean
therapeutic window and expression in the cell of interest":
composite_score = w1  * druggability_score       (approved drugs, tractability, clinical trials)
                + w2  * disease_genetics_score   (OpenTargets disease associations + focus-disease bonus)
                + w3  * tractability_bonus       (surface or secreted vs intracellular)
                + w4  * tissue_specificity       (HPA tissue tag; narrow expression = cleaner window)
                + w5  * cell_context_score       (HPA single-cell nCPM rank in FOCUS_CELL_TYPES)
                + w6  * essentiality_score       (DepMap CRISPR % essential, pan-essentials capped)
                + w7  * safety_constraint_score  (gnomAD LOEUF; high = LoF tolerated → safer to inhibit)
                + w8  * expression_score         (from input DE if present)
                + w9  * novelty_bonus            (favors moderately studied)
                - w10 * over_studied_penalty     (PubMed total > cap → diminishing returns)
ChEMBL contributes dossier columns (`chembl_target_id`, `chembl_best_pchembl`,
`chembl_best_ic50_nm`, `chembl_top_compounds`) but no score component; its
job is to surface concrete tool compounds for the "Suggested next step" slot.
Each component is normalized to [0, 1]. With non-negative weights, the
composite is therefore roughly in [-w10, sum(w1..w9)] and is min-max rescaled
before reporting. Read `weights.yaml` for the current defaults.
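The weighted sum and the min-max rescaling can be sketched as below. This is a minimal illustration under stated assumptions, not the logic in `aggregate.py`: the function name `composite_scores`, the `penalty_keys` argument, and the weight values in the demo are all hypothetical, and missing components are treated as 0.

```python
def composite_scores(components_by_gene, weights, penalty_keys=("over_studied_penalty",)):
    """Hypothetical sketch: weighted sum of [0, 1] components, then min-max rescale."""
    raw = {}
    for gene, comps in components_by_gene.items():
        total = 0.0
        for name, w in weights.items():
            # Clamp to [0, 1]; a missing component counts as 0 (assumption).
            value = min(1.0, max(0.0, comps.get(name, 0.0)))
            # Penalty terms are subtracted, all other terms are added.
            total += -w * value if name in penalty_keys else w * value
        raw[gene] = total
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # If every gene has the same raw score, report 0.0 for all of them.
    return {g: (s - lo) / span if span else 0.0 for g, s in raw.items()}


if __name__ == "__main__":
    # Illustrative weights only; the real defaults live in weights.yaml.
    demo_weights = {"druggability_score": 2.0, "disease_genetics_score": 1.0,
                    "over_studied_penalty": 1.0}
    demo = {
        "GENE_A": {"druggability_score": 1.0, "disease_genetics_score": 1.0},
        "GENE_B": {"druggability_score": 0.5},
        "GENE_C": {"over_studied_penalty": 1.0},
    }
    print(composite_scores(demo, demo_weights))
```

Because the rescaling is relative to the genes in the current run, a reported score is comparable within one report but not across reports built from different gene lists.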
Writing the rationale
After `aggregate.py` produces `targets_report.md` with blank rationale
slots, Claude reads the per-gene dossier rows and writes a 2-3 sentence
rationale per gene. Use the template in `prompts/rationale_template.md`;
it specifies the structure (one line on the most compelling evidence, one
line on the main risk, one line on the suggested next experimental step).
For the top 5–10 genes by composite score, also write a short executive
summary at the top of the report. Keep it factual and grounded in the
dossier data; do not hallucinate beyond what the JSONs contain.
Data source notes
All sources are free and need no API key. Rate limits are handled in the fetchers:
- UniProt REST: 100 req/sec, batched via `accession` query
- OpenTargets GraphQL: generous, single endpoint; provides disease genetics signal via integrated `associatedDiseases`
- PubMed E-utilities: 3 req/sec without key; fetchers respect this
- Human Protein Atlas: `search_download.php` for symbol→ENSG, then per-ENSG `/<ENSG>.json`; no rate limit documented, fetcher sleeps 0.15s/gene
- DepMap CRISPR essentiality: fetched via `target.depMapEssentiality` inside the OpenTargets call (no separate endpoint)
- gnomAD constraint: fetched via `target.geneticConstraint` inside the OpenTargets call (avoids gnomAD's WAF on direct API access)
- ChEMBL REST: `target/search.json` then `activity.json`; ~5 req/sec friendly, fetcher sleeps 0.2s/gene
For deeper API details and field mappings, see `references/api_endpoints.md`.
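The per-gene sleeps listed above amount to enforcing a minimum interval between requests. A minimal sketch of that pattern follows; `fetch_throttled` and its parameters are hypothetical names, and the real fetchers may simply call `time.sleep` after each request. The `fetch_one` callable stands in for the actual HTTP call, so the sketch runs without network access.

```python
import time


def fetch_throttled(genes, fetch_one, min_interval_s=0.2,
                    sleep=time.sleep, clock=time.monotonic):
    """Hypothetical sketch: call fetch_one(gene) at most once per min_interval_s."""
    results = {}
    last_call = None
    for gene in genes:
        if last_call is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            wait = min_interval_s - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[gene] = fetch_one(gene)
    return results


if __name__ == "__main__":
    # Stand-in fetcher; a real one would query ChEMBL, HPA, etc.
    out = fetch_throttled(["IL2RA", "CTLA4"], lambda g: {"symbol": g}, min_interval_s=0.2)
    print(out)
```

With `min_interval_s=0.2` this stays at or below 5 requests per second, and `0.34` or more would keep a keyless PubMed client under 3 requests per second.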
Retargeting the focus disease + cell context
The skill ships with an autoimmunity / T-cell default but is intentionally
disease-agnostic. Three edits switch the focus:
- `scripts/fetch_opentargets.py` and `scripts/aggregate.py`: change `FOCUS_DISEASE_TERMS` to the lowercased substrings that should mark a drug or disease association as "in-scope" (e.g. `("cancer", "carcinoma", "lymphoma")` for oncology; `("alzheimer", "parkinson", "huntington", "als")` for neurodegeneration; `("diabetes", "obesity", "fatty liver", "nash")` for metabolic disease).
- `scripts/aggregate.py`: change `FOCUS_CELL_TYPES` to the HPA single-cell type names that should drive `cell_context_score`. These must match HPA's exact strings (case-sensitive); see the comment block above the tuple for examples per domain.
- `scripts/fetch_pubmed.py`: adjust the `focus_disease` and `cell_context` queries in `CONTEXTS` (these power the PubMed counts in the dossier).
No other code changes are needed; the CSV column names already use the
neutral `focus_disease_*` / `cell_context` prefixes.
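How a `FOCUS_DISEASE_TERMS` tuple could mark an association as in-scope is sketched below. This assumes plain lowercased substring matching, as the description of the terms suggests; the helper name `is_in_scope` is hypothetical and the actual check in the scripts may differ.

```python
# Oncology example tuple taken from the text above.
FOCUS_DISEASE_TERMS = ("cancer", "carcinoma", "lymphoma")


def is_in_scope(disease_name, terms=FOCUS_DISEASE_TERMS):
    """Hypothetical helper: True if any focus term occurs in the lowercased name."""
    name = disease_name.lower()
    return any(term in name for term in terms)


if __name__ == "__main__":
    for label in ("Hepatocellular Carcinoma", "rheumatoid arthritis"):
        print(label, is_in_scope(label))
```

Under substring matching, a very short term such as `"als"` also matches inside longer unrelated words, so longer and more specific terms are the safer choice.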
When NOT to use this skill
- Single-gene look-ups (overkill — just ask Claude to web-search)
- Non-human genes (most APIs are human-only; fetchers will silently return empty)
- Pure literature review without target ambition: use `scholar-deep-research` or `literature-review` instead
Iteration tips
The pipeline is designed to be re-runnable cheaply:
- Raw JSON cache means re-scoring with a different `weights.yaml` is a one-second `aggregate.py` rerun
- To add a new evidence source, add `scripts/fetch_<source>.py` that writes `raw_data/<source>.json` with the same `{gene: {fields}}` shape, then add a corresponding term in `aggregate.py::compute_composite_score`.
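A new fetcher mostly needs to produce the `{gene: {fields}}` JSON file described above. A minimal sketch of that output step follows; `write_source_json` and the `fetch_one` callable are hypothetical, and the field names in the demo are placeholders rather than a schema the pipeline defines.

```python
import json
from pathlib import Path


def write_source_json(output_dir, source, genes, fetch_one):
    """Hypothetical sketch: write raw_data/<source>.json in {gene: {fields}} shape."""
    raw_dir = Path(output_dir) / "raw_data"
    raw_dir.mkdir(parents=True, exist_ok=True)
    # One entry per gene; fetch_one returns that gene's dict of fields.
    payload = {gene: fetch_one(gene) for gene in genes}
    out_path = raw_dir / f"{source}.json"
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder fields; a real fetcher would fill these from its API.
        path = write_source_json(tmp, "mysource", ["IL2RA"], lambda g: {"example_field": 1})
        print(path.read_text(encoding="utf-8"))
```

Keeping the file keyed by gene symbol is what lets a re-scoring run read every source the same way without re-fetching.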