tooluniverse-comparative-genomics

Comparative Genomics & Ortholog Analysis

Cross-species gene comparison, ortholog identification, sequence retrieval, and functional conservation analysis integrating Ensembl Compara, NCBI, UniProt, OLS, Monarch, and OpenTargets.

LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

When to Use This Skill

Triggers:
  • "Find the mouse ortholog of [human gene]"
  • "Compare [gene] across species"
  • "Is [gene] conserved in [organism]?"
  • "What are the orthologs of [gene]?"
  • "Cross-species comparison of [gene/protein]"
  • "Evolutionary conservation of [gene]"
  • "Compare GO annotations between human and mouse [gene]"
Use Cases:
  1. Ortholog Discovery: Find equivalent genes in other species for a human gene
  2. Conservation Analysis: Assess how conserved a gene is across evolutionary distance
  3. Functional Comparison: Compare GO terms, domains, and annotations across orthologs
  4. Model Organism Selection: Determine which model organism best recapitulates human gene function
  5. Gene Tree Analysis: Visualize evolutionary history of a gene family
  6. Cross-Species Phenotype Bridging: Link human disease phenotypes to model organism phenotypes via orthologs

Conservation Reasoning Framework

Understanding conservation requires distinguishing between types of evolutionary patterns and what they imply about function.
High conservation signals functional constraint. When a gene is maintained as a 1:1 ortholog from yeast to humans, purifying selection has prevented sequence divergence, which implies that the gene's function is important and cannot be easily altered. Highly conserved positions within a gene (PhastCons > 0.8, or GERP RS > 4; both are per-nucleotide scores) are under strong constraint; mutations at these positions are disproportionately pathogenic. For non-coding regions, conservation in mammals at PhastCons > 0.5 suggests a candidate regulatory element.
Low conservation in one lineage has two possible explanations: relaxed selection or positive selection. Use the dN/dS ratio (nonsynonymous to synonymous substitution rate) to distinguish them. A dN/dS ratio near 1 suggests neutral evolution — the gene is no longer under purifying selection (relaxed constraint, possibly reflecting loss of function in that lineage). A dN/dS ratio > 1 indicates positive selection — the gene is diverging faster than neutral expectation, often because it is adapting to a new environment or function. A dN/dS ratio << 1 is the signature of purifying selection (functional constraint). When a vertebrate gene shows high divergence in a specific branch of the tree, ask which explanation applies before concluding that function is lost.
Ortholog relationship type shapes interpretation. A 1:1 ortholog (one gene in human, one in mouse) is the highest-confidence functional equivalent — it has not been duplicated in either lineage, so it most likely performs the same ancestral role. A 1:many relationship (one gene in human, multiple in mouse) means the target species has duplicated the gene; the copies may have subfunctionalized (each copy performs a subset of the original roles) or neofunctionalized (one copy gained a new role). Do not assume both copies retain full ancestral function. A many:many relationship reflects complex duplication history in both species and requires analyzing each paralog pair individually.
Conservation depth predicts essentiality. A gene conserved across all vertebrates suggests a fundamental cellular process. A gene conserved only in mammals suggests a more specialized mammalian innovation. A gene present only in primates or only in humans is likely a recent evolutionary acquisition, possibly involved in human-specific biology but often lacking the depth of functional characterization available for deeply conserved genes.
Absence of an ortholog is a finding, not an error. Lineage-specific genes exist and are biologically meaningful. Before concluding a gene is lineage-specific, check: (1) whether BLAST with relaxed thresholds finds distant homologs, (2) whether a highly divergent ortholog exists that Ensembl Compara missed, and (3) whether the gene belongs to a rapidly evolving family (immune genes, olfactory receptors, reproductive proteins) where turnover is expected.
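
The dN/dS and ortholog-type rules above can be sketched as a small lookup. This is a minimal illustration, not part of any ToolUniverse tool: the function name, the guidance table, and the `tolerance` parameter (how close to 1 counts as "near 1") are assumptions made for the example.

```python
def classify_selection(dn_ds: float, tolerance: float = 0.2) -> str:
    """Map a dN/dS ratio to the selection regime described in the framework.

    `tolerance` is a hypothetical cutoff for "near 1"; choose it to suit
    the precision of your dN/dS estimates.
    """
    if dn_ds < 0:
        raise ValueError("dN/dS cannot be negative")
    if dn_ds > 1 + tolerance:
        return "positive selection"
    if dn_ds >= 1 - tolerance:
        return "neutral evolution (relaxed constraint)"
    return "purifying selection"


# Interpretation notes keyed by homology type, paraphrasing the text above.
HOMOLOGY_GUIDANCE = {
    "one2one": "highest-confidence functional equivalent",
    "one2many": "duplicated in target; check sub/neofunctionalization per copy",
    "many2many": "duplications in both lineages; analyze each paralog pair",
}

if __name__ == "__main__":
    for ratio in (0.1, 1.0, 2.5):
        print(ratio, "->", classify_selection(ratio))
```

A ratio alone does not settle the question; it only indicates which explanation to investigate first for a diverged branch.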

Workflow Overview

Input (gene symbol/ID + reference species)
  |
  v
Phase 1: Gene Identification & Validation
  |
  v
Phase 2: Ortholog Discovery (Ensembl Compara + OpenTargets)
  |
  v
Phase 3: Sequence Retrieval (NCBI + Ensembl)
  |
  v
Phase 4: Functional Annotation Comparison (UniProt + OLS GO terms)
  |
  v
Phase 5: Cross-Species Phenotype Bridging (Monarch)
  |
  v
Phase 6: Gene Tree & Evolutionary Context (Ensembl Compara)
  |
  v
Report: Conservation summary, ortholog evidence, functional comparison, phenotype bridging

Phase 1: Gene Identification & Validation

ensembl_lookup_gene takes gene_id (symbol or Ensembl ID). The species parameter is REQUIRED when using gene symbols (e.g., species="homo_sapiens"); omitting it causes errors. Extract the Ensembl gene ID, description, biotype, and chromosomal coordinates for downstream queries. For non-human references, adjust species accordingly (e.g., "mus_musculus", "danio_rerio").
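
The species requirement can be enforced before the tool is ever called. The sketch below only builds the argument dictionary for ensembl_lookup_gene; the helper name and the regular expression for recognizing Ensembl gene IDs are assumptions for illustration, and how the arguments are dispatched depends on your ToolUniverse client.

```python
import re
from typing import Optional

# Assumption: stable Ensembl gene IDs are "ENS", optional species letters,
# "G", then 11 digits (e.g. ENSG00000141510 for human).
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")


def build_lookup_args(gene_id: str, species: Optional[str] = None) -> dict:
    """Return arguments for ensembl_lookup_gene, requiring species for symbols."""
    args = {"gene_id": gene_id}
    if species:
        args["species"] = species
    elif not ENSEMBL_GENE_ID.match(gene_id):
        # A bare symbol such as "TP53" is ambiguous without a species.
        raise ValueError(
            f"species is required when gene_id is a symbol: {gene_id!r}"
        )
    return args


if __name__ == "__main__":
    print(build_lookup_args("TP53", species="homo_sapiens"))
    print(build_lookup_args("ENSG00000141510"))
```

Failing early with a clear message is cheaper than decoding an error returned by the lookup itself.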

Phase 2: Ortholog Discovery

EnsemblCompara_get_orthologues is the primary tool. It takes gene (symbol or Ensembl ID), species (source species, default "human"), and optionally target_species (e.g., "mouse", "zebrafish") or target_taxon (NCBI taxon ID). Omit target_species to get all orthologs across the tree; filter client-side for specific species. It returns homology type (one2one, one2many, many2many) and the taxonomy divergence level for each ortholog.
ensembl_get_homology is the alternative when you need sequence-level data alongside the ortholog mapping. Use sequence="protein" and aligned=true for aligned sequence comparison across species.
OpenTargets_get_target_homologues_by_ensemblID (takes ensemblId) provides supplementary ortholog data from OpenTargets, which can add druggability context and cross-reference with model organism phenotype data.
Reasoning: Prioritize 1:1 orthologs as high-confidence functional equivalents. For 1:many cases, report all copies and flag the need for paralog-specific functional analysis. If no Ensembl Compara entry exists, try BLAST as a last resort (note: BLAST protein search against swissprot is slow, 5-30 minutes; against nr may take longer).
Key model organisms to check: mouse (taxon 10090), rat (10116), zebrafish (7955), fruit fly (7227), C. elegans (6239), S. cerevisiae (4932).
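
Client-side filtering and the 1:1 prioritization rule might look like the following. The record layout (keys `species`, `id`, `type`) and the placeholder gene IDs are assumptions; the actual fields returned by EnsemblCompara_get_orthologues may differ, so adapt the key names to the real response.

```python
from collections import defaultdict

# Hypothetical ortholog records with placeholder IDs (not real genes).
SAMPLE = [
    {"species": "mouse", "id": "MOUSE_GENE_1", "type": "ortholog_one2one"},
    {"species": "zebrafish", "id": "FISH_GENE_1", "type": "ortholog_one2many"},
    {"species": "zebrafish", "id": "FISH_GENE_2", "type": "ortholog_one2many"},
]


def group_orthologs(records, target_species=None):
    """Group ortholog records by homology type, optionally for one species."""
    grouped = defaultdict(list)
    for rec in records:
        if target_species and rec["species"] != target_species:
            continue
        # Accept both "ortholog_one2one" and plain "one2one" spellings.
        kind = rec["type"].replace("ortholog_", "")
        grouped[kind].append(rec)
    return dict(grouped)


def confidence_note(grouped):
    """Summarize how directly the orthologs can stand in for the source gene."""
    if grouped.get("one2one"):
        return "1:1 ortholog found: high-confidence functional equivalent"
    if grouped.get("one2many") or grouped.get("many2many"):
        return "duplicated orthologs: report all copies, analyze each paralog"
    return "no ortholog found: consider fallbacks before calling it lineage-specific"


if __name__ == "__main__":
    for species in ("mouse", "zebrafish", "fruit fly"):
        print(species, "->", confidence_note(group_orthologs(SAMPLE, species)))
```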

Phase 3: Sequence Retrieval

Use NCBI_search_nucleotide (takes organism as full name, e.g., "Homo sapiens"; gene; seq_type = "mRNA") to find sequence records, then NCBI_fetch_accessions to convert UIDs to accession numbers, then NCBI_get_sequence to retrieve FASTA data. Prefer RefSeq (NM_* for mRNA, NP_* for protein) over other accessions for canonical sequence.
When aligned sequences are needed directly, ensembl_get_homology with sequence="cdna" or sequence="protein" is faster than running BLAST. Use BLAST only when Ensembl Compara does not find orthologs.
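
Two small helpers illustrate this phase: picking RefSeq mRNA accessions from a mixed list, and computing percent identity from a pair of aligned sequences. Both are sketches under stated assumptions: the accessions in the demo are placeholders, and identity is computed over columns where neither sequence has a gap, which is only one of several common conventions.

```python
import re

# RefSeq curated mRNA accessions start with "NM_" (per the text above).
REFSEQ_MRNA = re.compile(r"^NM_\d+(\.\d+)?$")


def prefer_refseq(accessions, pattern=REFSEQ_MRNA):
    """Return matching RefSeq accessions, or the full list if none match."""
    curated = [acc for acc in accessions if pattern.match(acc)]
    return curated or list(accessions)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over ungapped columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Placeholder accessions, not real records.
    print(prefer_refseq(["XM_999999.1", "NM_999999.1", "AB999999.1"]))
    print(round(percent_identity("MEEP-QS", "MEDPAQS"), 1))
```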

Phase 4: Functional Annotation Comparison

UniProt_search takes a query in UniProt syntax (e.g., "gene:TP53 AND organism_id:9606 AND reviewed:true") and fields to retrieve specific annotation columns including GO terms. Use reviewed:true to restrict to Swiss-Prot curated entries.
UniProt_get_function_by_accession takes a UniProt accession and returns a list of function description strings (not a dict).
For each species being compared, retrieve GO terms and group them by Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Shared GO terms indicate conserved function; terms present in human but absent in the ortholog may reflect annotation bias (less-studied organisms have fewer GO annotations) rather than true functional divergence. Focus conservation claims on shared terms.
Reasoning about annotation gaps: If a mouse ortholog lacks a GO term present in the human protein, consider that this may reflect incomplete annotation of the mouse gene rather than functional divergence. The inverse — a GO term in mouse that is absent in human — is less common but can indicate diverged or acquired function.
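
The per-aspect comparison reduces to set operations once GO terms are grouped. In this sketch the input layout (a dict mapping "BP", "MF", "CC" to term lists) is an assumption about how you have parsed the UniProt results, and the terms in the demo are illustrative inputs, not retrieved annotations.

```python
def compare_go(human: dict, ortholog: dict) -> dict:
    """Compare GO terms per aspect and report shared and one-sided terms."""
    report = {}
    for aspect in ("BP", "MF", "CC"):
        h = set(human.get(aspect, ()))
        o = set(ortholog.get(aspect, ()))
        report[aspect] = {
            "shared": sorted(h & o),         # evidence of conserved function
            "human_only": sorted(h - o),     # often annotation bias
            "ortholog_only": sorted(o - h),  # possible diverged function
        }
    return report


if __name__ == "__main__":
    # Illustrative inputs only.
    human = {"BP": ["apoptotic process", "DNA repair"], "MF": ["DNA binding"]}
    mouse = {"BP": ["apoptotic process"], "MF": ["DNA binding"], "CC": ["nucleus"]}
    for aspect, result in compare_go(human, mouse).items():
        print(aspect, result)
```

Base conservation claims on the `shared` lists; treat `human_only` entries as prompts to check annotation depth rather than as evidence of divergence.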

Phase 5: Cross-Species Phenotype Bridging

Monarch_search_gene (takes query as gene symbol) returns gene CURIEs needed for Monarch queries.
Monarch_get_gene_phenotypes and Monarch_get_gene_diseases take a gene CURIE (e.g., "HGNC:11998") and return phenotype/disease associations spanning multiple species.
Phenotype ontologies by species: Human = HP (HPO), Mouse = MP (Mammalian Phenotype), Zebrafish = ZP, Fly = FBcv. Monarch integrates across species; compare phenotype themes (e.g., "tumor susceptibility" in human and "increased tumor incidence" in mouse) rather than requiring exact term matches.
Reasoning for model organism selection: A mouse ortholog that has a 1:1 relationship AND shows phenotypes in Monarch that recapitulate the human disease is a strong disease model candidate. If the mouse phenotype diverges significantly from the human disease phenotype, this is worth flagging — it could indicate species-specific function or a limitation of the model.
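
Theme-level comparison, rather than exact term matching, can be approximated with a keyword map. Everything here is a rough sketch: the `THEME_KEYWORDS` table is hypothetical and far too small for real use, and substring matching is a stand-in for proper ontology-based similarity.

```python
# Hypothetical theme table; extend it for the disease area under study.
THEME_KEYWORDS = {
    "tumor": ("tumor", "tumour", "neoplasm", "cancer"),
    "growth": ("growth", "body size"),
}


def themes(labels):
    """Return the set of themes whose keywords appear in any phenotype label."""
    found = set()
    for label in labels:
        text = label.lower()
        for theme, keywords in THEME_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                found.add(theme)
    return found


def assess_model(homology_type, human_labels, model_labels):
    """Apply the model-selection reasoning: 1:1 AND recapitulated phenotypes."""
    shared = themes(human_labels) & themes(model_labels)
    if homology_type == "one2one" and shared:
        verdict = "strong candidate"
    elif shared:
        verdict = "candidate; check paralog-specific function"
    else:
        verdict = "flag: phenotypes diverge or data are missing"
    return {"shared_themes": sorted(shared), "verdict": verdict}


if __name__ == "__main__":
    print(assess_model("one2one",
                       ["tumor susceptibility"],
                       ["increased tumor incidence"]))
```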

Phase 6: Gene Tree & Evolutionary Context

EnsemblCompara_get_gene_tree (takes gene, species) returns the gene tree members, species distribution, and speciation vs. duplication events.
EnsemblCompara_get_paralogues returns all paralogs in the source species.
From the gene tree, assess: (1) how many species contain a member of this gene family; (2) when gene duplication events occurred (ancient vs. recent); (3) whether the gene family expanded in particular lineages. A gene present in a single copy across all vertebrates (deep conservation, no duplication) is likely under strong selective constraint.
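
The three assessments can be computed with one recursive walk over the tree. The nested-dict layout below (internal nodes with `event` and `children`, leaves with `species`) is an assumed simplification; map the real EnsemblCompara_get_gene_tree output onto it before use.

```python
from collections import Counter


def summarize_tree(node, summary=None):
    """Count leaves per species and duplication/speciation events in a tree."""
    if summary is None:
        summary = {"species": Counter(), "duplication": 0, "speciation": 0}
    children = node.get("children", [])
    if not children:
        summary["species"][node["species"]] += 1  # leaf = one gene copy
        return summary
    event = node.get("event")
    if event in ("duplication", "speciation"):
        summary[event] += 1
    for child in children:
        summarize_tree(child, summary)
    return summary


def expanded_lineages(summary):
    """Species holding more than one family member (lineage-specific expansion)."""
    return sorted(sp for sp, n in summary["species"].items() if n > 1)


# Toy tree: a duplication on the mouse branch, single copies elsewhere.
TOY_TREE = {"event": "speciation", "children": [
    {"species": "zebrafish"},
    {"event": "speciation", "children": [
        {"species": "human"},
        {"event": "duplication", "children": [
            {"species": "mouse"}, {"species": "mouse"}]},
    ]},
]}

if __name__ == "__main__":
    result = summarize_tree(TOY_TREE)
    print(len(result["species"]), "species;",
          result["duplication"], "duplication(s);",
          "expanded in:", expanded_lineages(result))
```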

Synthesis Questions

When interpreting the assembled evidence, work through these questions:
  1. Is the ortholog relationship 1:1 or has duplication created paralogs that may have diverged in function? This determines how directly findings in the model organism translate to the human gene.
  2. Do orthologs share conserved GO terms (especially Biological Process), or are there lineage-specific functional annotations suggesting divergence?
  3. For disease gene studies, does the model organism ortholog recapitulate relevant human phenotypes (via Monarch), supporting its use as a disease model?
  4. Are non-coding regulatory regions around the gene also conserved (PhastCons/GERP from OpenCRAVAT), suggesting conservation of gene regulation beyond protein function?
  5. If no ortholog is found, is the gene truly lineage-specific, or might a highly divergent homolog exist that is only detectable by sensitive sequence methods?

Fallback Strategies

  • Ortholog not found in Ensembl Compara: Try ensembl_get_homology, then OpenTargets_get_target_homologues_by_ensemblID, then BLAST as last resort
  • Sequence retrieval fails: Use ensembl_get_homology with sequence="cdna" as alternative to NCBI
  • UniProt returns empty with reviewed:true: Try without that filter; organism may have only TrEMBL entries
  • Monarch returns no data: Use MonarchV3_get_associations with category="biolink:GeneToPhenotypicFeatureAssociation" as alternative
  • Gene symbol ambiguous across species: Use Ensembl IDs throughout to avoid symbol confusion (e.g., "p53" vs "tp53" in zebrafish)
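
Each fallback above is an ordered chain: try the preferred source, and move on when it errors or returns nothing. A generic helper for that pattern could look like this; the helper and the stub callables are hypothetical, and in practice each callable would wrap one of the tool calls named in the list.

```python
def first_nonempty(steps):
    """Run (label, callable) steps in order; return the first non-empty result.

    Returns (label, result, skipped), where `skipped` records why earlier
    steps were passed over, so the report can state which source was used.
    """
    skipped = []
    for label, call in steps:
        try:
            result = call()
        except Exception as exc:  # any tool failure triggers the next fallback
            skipped.append((label, repr(exc)))
            continue
        if result:
            return label, result, skipped
        skipped.append((label, "empty result"))
    return None, None, skipped


if __name__ == "__main__":
    def unavailable():
        raise RuntimeError("service unavailable")

    # Stub callables standing in for real tool calls.
    chain = [
        ("EnsemblCompara_get_orthologues", lambda: []),
        ("ensembl_get_homology", unavailable),
        ("BLAST", lambda: ["distant homolog"]),
    ]
    print(first_nonempty(chain))
```

Ordering the chain from fastest to slowest keeps BLAST as the last resort, consistent with its 5-30 minute runtime noted earlier.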

Limitations

  • Ensembl Compara: Best for vertebrates; invertebrate and plant coverage is limited for some gene families
  • BLAST_protein_search: Very slow (5-30 min); use only as last resort for ortholog discovery
  • Monarch: Phenotype coverage varies by organism; mouse and zebrafish are best covered; fly and worm data are sparser
  • UniProt GO annotations: Bias toward well-studied organisms; absence of annotation does not mean absence of function
  • NCBI_search_nucleotide: May return many isoforms; filter for RefSeq (NM_*) for canonical transcripts
  • Conservation does not equal essentiality: Some highly conserved genes are dispensable in specific organisms