target-prioritization

Target Prioritization

A multi-source drug-target due-diligence pipeline for ranked gene lists.

When this skill triggers

The user has a list of candidate genes (typically from a DE / DEG / scRNA-seq analysis) and wants a per-gene dossier across multiple evidence dimensions plus a composite re-ranking. The DE statistical rank is just the entry point; the final priority is informed by protein biology, genetics, druggability, and research maturity.
Common input shapes:
  • A CSV with a `gene` column (DE output like `expression_table_pass_either_1s.csv`)
  • A plain-text gene list (one symbol per line)
  • A list of symbols inline in the user's message

Output

Three files inside `<output_dir>/`:
  1. `targets_report.md`: one section per gene, sorted by composite score, with a short LLM-written rationale and recommended next step
  2. `targets_summary.csv`: flat table for sorting/filtering in Excel/pandas
  3. `raw_data/<source>.json`: raw API responses (audit trail, reusable across future re-scorings)

Pipeline

input gene list
scripts/orchestrate.py
   ├─► fetch_uniprot.py        → protein localization, surface, MHC, coding
   ├─► fetch_opentargets.py    → tractability, approved drugs, associated
   │                              diseases (subsumes GWAS Catalog via OT's
   │                              integrated genetics evidence), DepMap CRISPR
   │                              essentiality, gnomAD LOEUF / pLI constraint
   ├─► fetch_pubmed.py         → paper counts (total + focus_disease + cell_context)
   ├─► fetch_hpa.py            → HPA tissue / single-cell specificity + nCPM,
   │                              expression cluster, cancer prognostics
   └─► fetch_chembl.py         → top-potency tool compounds per gene (pIC50,
                                  IC50 nM, mechanism) — dossier-only, no score
scripts/aggregate.py
output_dir/
  ├─ raw_data/*.json
  ├─ targets_summary.csv       ← composite-score-ranked
  └─ targets_report.md         ← Claude fills the rationale sections

How to invoke

```bash
python3 ~/myagents/myskills/target-prioritization/scripts/orchestrate.py \
    --input <gene_list.csv_or_txt> \
    --output <output_dir> \
    [--gene-col gene] \
    [--top 50]
```
  • `--input` accepts a CSV (gene column selected with `--gene-col`, default `gene`), a `.txt` / `.tsv`, or any file whose first column holds gene symbols. The header row is skipped if its first cell is `gene` or `symbol` (case-insensitive).
  • `--top` limits the dossier to the first N input genes (default 50). Input order decides which genes make the cut; the composite score then re-ranks the genes that remain.
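The input handling described above can be sketched roughly as follows. This is an illustration only: the function name `read_genes`, the tab-versus-comma sniffing, and the fallback to the first column are assumptions, not code taken from `orchestrate.py`.

```python
import csv
import io

# Assumed header labels that mark a header row (compared case-insensitively).
HEADER_NAMES = {"gene", "symbol"}


def read_genes(text, gene_col="gene", top=50):
    """Return up to `top` gene symbols, in input order, from CSV/TSV/plain text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Naive delimiter guess: tab if the first line has one, otherwise comma.
    delim = "\t" if "\t" in lines[0] else ","
    rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delim))
    header = [cell.strip().lower() for cell in rows[0]]
    if gene_col.lower() in header:
        # Named gene column found: use it and drop the header row.
        idx = header.index(gene_col.lower())
        rows = rows[1:]
    else:
        # Fall back to the first column; skip a gene/symbol header if present.
        idx = 0
        if header[0] in HEADER_NAMES:
            rows = rows[1:]
    genes = [row[idx].strip() for row in rows if idx < len(row) and row[idx].strip()]
    return genes[:top]
```

For example, `read_genes("gene,logFC\nIL2RA,2.1\nCTLA4,1.8\nFOXP3,1.5\n", top=2)` keeps the first two symbols in input order; re-ranking by composite score happens later, in aggregation.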
`orchestrate.py` runs the five fetchers in parallel (Python threads, since all calls are I/O-bound). Each writes a self-contained JSON to `<output_dir>/raw_data/<source>.json`. Then `aggregate.py` merges them, computes the composite score using `weights.yaml`, writes `targets_summary.csv`, and emits a `targets_report.md` skeleton with one section per gene; the rationale and risks fields are left blank for Claude to fill.
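A minimal sketch of the thread-based fan-out, assuming each fetcher can be represented as a callable that takes the gene list and returns a `{gene: {fields}}` dict. The `run_fetchers` helper and the stub fetchers in the usage note are hypothetical; the real scripts may be launched differently (for example as subprocesses).

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_fetchers(fetchers, genes, output_dir):
    """Run every fetcher in its own thread and write raw_data/<source>.json."""
    raw_dir = Path(output_dir) / "raw_data"
    raw_dir.mkdir(parents=True, exist_ok=True)

    def run_one(item):
        source, fetch = item
        data = fetch(genes)  # I/O-bound call in the real pipeline
        path = raw_dir / f"{source}.json"
        path.write_text(json.dumps(data, indent=2))
        return source, path

    # One worker per source is enough because the work is network-bound.
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        return dict(pool.map(run_one, fetchers.items()))
```

Calling `run_fetchers({"uniprot": fake_uniprot, "pubmed": fake_pubmed}, ["IL2RA"], out_dir)` with two stand-in callables would leave `raw_data/uniprot.json` and `raw_data/pubmed.json` under `out_dir`, ready for a later aggregation pass.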

Composite score

Weights live in `weights.yaml` and can be overridden per-run with `--weights`. Defaults aim for "find druggable, genetically supported targets with clean therapeutic window and expression in the cell of interest":
composite_score = w1 * druggability_score          (approved drugs, tractability, clinical trials)
                + w2 * disease_genetics_score      (OpenTargets disease associations + focus-disease bonus)
                + w3 * tractability_bonus          (surface or secreted vs intracellular)
                + w4 * tissue_specificity          (HPA tissue tag — narrow expression = cleaner window)
                + w5 * cell_context_score          (HPA single-cell nCPM rank in FOCUS_CELL_TYPES)
                + w6 * essentiality_score          (DepMap CRISPR % essential, pan-essentials capped)
                + w7 * safety_constraint_score     (gnomAD LOEUF — high = LoF tolerated → safer to inhibit)
                + w8 * expression_score            (from input DE if present)
                + w9 * novelty_bonus               (favors moderately studied)
                - w10 * over_studied_penalty       (PubMed total > cap → diminishing returns)
ChEMBL contributes dossier columns (`chembl_target_id`, `chembl_best_pchembl`, `chembl_best_ic50_nm`, `chembl_top_compounds`) but no score component; its job is to surface concrete tool compounds for the "Suggested next step" slot.
Each component is normalized to [0, 1]. Since w10 is the only subtracted term, the raw composite lies roughly in [-w10, sum(w1..w9)]; it is min-max rescaled before reporting. Read `weights.yaml` for the current defaults.
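The weighted sum and the rescaling step can be sketched as below. This assumes the weights are keyed by component name and that a missing component counts as 0; both are assumptions, as are the helper names `raw_composite` and `min_max_rescale`. The actual layout of `weights.yaml` may differ.

```python
# Components that add to the score, in the order listed in the formula.
POSITIVE = [
    "druggability_score",
    "disease_genetics_score",
    "tractability_bonus",
    "tissue_specificity",
    "cell_context_score",
    "essentiality_score",
    "safety_constraint_score",
    "expression_score",
    "novelty_bonus",
]
PENALTY = "over_studied_penalty"  # the single subtracted component


def raw_composite(components, weights):
    """Weighted sum of [0, 1] components minus the weighted penalty."""
    score = sum(weights[k] * components.get(k, 0.0) for k in POSITIVE)
    return score - weights[PENALTY] * components.get(PENALTY, 0.0)


def min_max_rescale(scores):
    """Map raw scores onto [0, 1]; a constant list collapses to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {gene: 0.0 for gene in scores}
    return {gene: (s - lo) / (hi - lo) for gene, s in scores.items()}
```

Because the rescaling is relative to the genes in the current run, a reported score is only comparable within one report, not across runs with different gene lists.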

Writing the rationale

After `aggregate.py` produces `targets_report.md` with blank rationale slots, Claude reads the per-gene dossier rows and writes a 2-3 sentence rationale per gene. Use the template in `prompts/rationale_template.md`, which specifies the structure (one line on the most compelling evidence, one line on the main risk, one line on the suggested next experimental step).
For the top 5–10 genes by composite score, also write a short executive summary at the top of the report. Keep it factual and grounded in the dossier data; do not hallucinate beyond what the JSONs contain.

Data source notes

All free, no API key needed. Rate limits handled in fetchers:
  • UniProt REST: 100 req/sec, batched via `accession` query
  • OpenTargets GraphQL: generous, single endpoint; provides disease genetics signal via integrated `associatedDiseases`
  • PubMed E-utilities: 3 req/sec without key; fetchers respect this
  • Human Protein Atlas: `search_download.php` for symbol→ENSG, then per-ENSG `/<ENSG>.json`; no rate limit documented, fetcher sleeps 0.15s/gene
  • DepMap CRISPR essentiality: fetched via `target.depMapEssentiality` inside the OpenTargets call (no separate endpoint)
  • gnomAD constraint: fetched via `target.geneticConstraint` inside the OpenTargets call (avoids gnomAD's WAF on direct API access)
  • ChEMBL REST: `target/search.json` then `activity.json`; ~5 req/sec friendly, fetcher sleeps 0.2s/gene
For deeper API details and field mappings, see `references/api_endpoints.md`.
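The per-gene pauses listed above could be implemented with a small throttle like the one below. Note this is a variant, not the fetchers' actual code: the notes describe a fixed sleep per gene, whereas this sketch sleeps only for whatever remains of the interval after the request itself. The `throttled` name and the injectable `clock` / `sleep` parameters are assumptions made for testability.

```python
import time


def throttled(items, min_interval, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            # Time already spent by the caller counts toward the interval.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

A fetch loop would then read `for gene in throttled(genes, 0.2): ...` for a roughly 5 req/sec source, or use an interval a little above 1/3 second to stay under 3 req/sec.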

Retargeting the focus disease + cell context

The skill ships with an autoimmunity / T-cell default but is intentionally disease-agnostic. Three edits switch the focus:
  • `scripts/fetch_opentargets.py` and `scripts/aggregate.py`: change `FOCUS_DISEASE_TERMS` to the lowercased substrings that should mark a drug or disease association as "in-scope" (e.g. `("cancer", "carcinoma", "lymphoma")` for oncology; `("alzheimer", "parkinson", "huntington", "als")` for neurodegeneration; `("diabetes", "obesity", "fatty liver", "nash")` for metabolic disease).
  • `scripts/aggregate.py`: change `FOCUS_CELL_TYPES` to the HPA single-cell type names that should drive `cell_context_score`. Must match HPA's exact strings (case-sensitive); see comment block above the tuple for examples per domain.
  • `scripts/fetch_pubmed.py`: adjust the `focus_disease` and `cell_context` queries in `CONTEXTS` (these power the PubMed counts in the dossier).
No other code changes are needed; the CSV column names already use the neutral `focus_disease_*` / `cell_context` prefixes.
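As an illustration of how lowercased substring terms mark an association as in-scope, here is a minimal sketch using the oncology tuple from the example above. The helper names `is_in_scope` and `count_in_scope` are hypothetical; the real matching logic lives in the scripts and may differ.

```python
# Oncology example tuple; terms must be lowercase because names are lowercased.
FOCUS_DISEASE_TERMS = ("cancer", "carcinoma", "lymphoma")


def is_in_scope(name, terms=FOCUS_DISEASE_TERMS):
    """True if any focus term occurs as a substring of the lowercased name."""
    lowered = name.lower()
    return any(term in lowered for term in terms)


def count_in_scope(names, terms=FOCUS_DISEASE_TERMS):
    """Number of disease or indication names that hit at least one term."""
    return sum(is_in_scope(name, terms) for name in names)
```

Plain substring matching is permissive: a short term such as `"als"` also matches inside unrelated words (for instance `is_in_scope("signals", ("als",))` is true), so very short terms are worth checking against the raw JSON after a retarget.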

When NOT to use this skill

  • Single-gene look-ups (overkill — just ask Claude to web-search)
  • Non-human genes (most APIs are human-only; fetchers will silently return empty)
  • Pure literature review without target ambition: use `scholar-deep-research` or `literature-review` instead

Iteration tips

The pipeline is designed to be re-runnable cheaply:
  • Raw JSON cache means re-scoring with different `weights.yaml` is a one-second `aggregate.py` rerun
  • To add a new evidence source, add `scripts/fetch_<source>.py` that writes `raw_data/<source>.json` with the same `{gene: {fields}}` shape, then add a corresponding term in `aggregate.py::compute_composite_score`.
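A skeleton for a new evidence source might look like the following. Everything named here is a placeholder: `fetch_example`, `example_field`, and `example_score` are hypothetical, and the clamp to [0, 1] simply mirrors the normalization used for the existing components rather than any code in `aggregate.py`.

```python
import json
import tempfile
from pathlib import Path


def fetch_example(genes):
    """Stand-in for a real API call; returns the {gene: {fields}} shape."""
    return {gene: {"example_field": None} for gene in genes}


def write_raw(source, data, output_dir):
    """Write one source's results to <output_dir>/raw_data/<source>.json."""
    raw_dir = Path(output_dir) / "raw_data"
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{source}.json"
    path.write_text(json.dumps(data, indent=2, sort_keys=True))
    return path


def example_score(fields):
    """Turn the new source's field into a [0, 1] component (missing -> 0.0)."""
    value = fields.get("example_field")
    if value is None:
        return 0.0
    return max(0.0, min(1.0, float(value)))
```

Keeping the fetch step and the scoring step separate preserves the cheap re-scoring property: changing how `example_score` maps the field never requires hitting the API again.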