Ensembl Database: ID Mapping and Genomic Features
Prerequisites
- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://useast.ensembl.org/index.html and https://github.com/Ensembl/ensembl-rest/wiki, and (2) create the file, recording the notification text and timestamp.
Overview
The Ensembl database is a resource for genome annotation. This skill allows you
to interact with the Ensembl REST API to resolve ambiguous symbols,
cross-reference IDs (RefSeq, HGNC, UniProt, ENSG), fetch raw sequences, and
retrieve detailed transcript structures.
Key Concepts:
- ENSG (Gene): Stable identifier for a human gene. Other species will have different three-letter species codes.
- ENST (Transcript): Stable identifier for a transcript (splicing isoform).
- ENSP (Protein): Stable identifier for a translated protein.
- MANE Select: The consensus primary transcript agreed upon by Ensembl and NCBI.
- Canonical: Ensembl's representative transcript (used if MANE is not available or non-human).
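The identifier prefixes above can be told apart mechanically. The following is a minimal sketch, not part of the provided scripts: the function name `classify_stable_id` is hypothetical, and the pattern assumes a three-letter species code (as described above), an 11-digit numeric part, and an optional `.version` suffix.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: "ENS" + optional 3-letter species code + feature letter
# + 11 digits + optional ".version". Human IDs carry no species code.
_STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTP])(\d{11})(?:\.(\d+))?$")
_FEATURES = {"G": "gene", "T": "transcript", "P": "protein"}


class StableId(NamedTuple):
    species_code: Optional[str]  # None for human-style IDs such as ENSG...
    feature: str                 # "gene", "transcript", or "protein"
    version: Optional[int]       # None when no ".N" suffix is present


def classify_stable_id(identifier: str) -> Optional[StableId]:
    """Return the parsed parts of an Ensembl-style stable ID, or None."""
    match = _STABLE_ID.match(identifier.strip())
    if match is None:
        return None  # e.g. a gene symbol such as "TP53"
    species, letter, _digits, version = match.groups()
    return StableId(species, _FEATURES[letter], int(version) if version else None)
```

A check like this can help decide whether an input needs `resolve-gene` first (it returns `None` for symbols) or can be passed straight to an ID-based command.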
Core Rules
- Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts automatically enforce the required rate limit gracefully.
- Default Species: If the species is absent or ambiguous in the prompt, default to `"human"`. You MUST explicitly flag this default to the user to ensure they are aware.
- Primary Transcripts: When listing transcripts for a gene, only return the MANE Select transcript (for human) or the Canonical transcript (for others) unless the user explicitly asks for all alternative isoforms. You MUST flag to the user when multiple transcripts are available and you are defaulting to the primary one.
- Assembly Handling: The default assembly is GRCh38. For GRCh37 requests, you MUST use the `--assembly GRCh37` flag. You MUST explicitly flag to the user when a non-default assembly is being used.
- Output Location: The script writes full JSON/FASTA output to temporary files in `/tmp` by default, or to a user-specified file using the `--output` flag. It also prints a concise summary to stdout.
- Notification: If this skill is used, ensure this is mentioned in the output.
Available Commands
1. Resolve Gene ID: Resolve a symbol, alias, or RefSeq ID to ENSG ID(s). Automatically falls back to resolving synonyms if the primary symbol is not found.

    uv run scripts/ensembl_api.py resolve-gene TP53 --species human --output tp53.json
    uv run scripts/ensembl_api.py resolve-gene PCL2 --output pcl2.json  # Falls back to synonym resolution

2. Map ID to External Database: Cross-reference an Ensembl ID to UniProt, HGNC, RefSeq, etc.

    uv run scripts/ensembl_api.py map-id ENSG00000141510 --external-db UniProt --output uniprot_map.json
    uv run scripts/ensembl_api.py map-id ENST00000269305 --external-db RefSeq_mRNA --output refseq_map.json

3. Get Genomic Sequence: Fetch raw DNA for a coordinate window. Supports GRCh37 via `--assembly GRCh37`.

    uv run scripts/ensembl_api.py get-sequence 17:7661779-7687550 --species human --output seq.txt
    uv run scripts/ensembl_api.py get-sequence chr9:21971100-21971200 --assembly GRCh37 --output seq_grch37.txt

4. Gene Summary: High-level metadata: symbol, biotype, description, chromosomal location.

    uv run scripts/ensembl_api.py gene-summary ENSG00000141510 --output gene_summary.json

5. List Transcripts: All transcripts for a gene, with optional `--only-mane` or `--only-canonical` filters. Output includes Transcript Support Level (TSL).

    uv run scripts/ensembl_api.py transcripts ENSG00000141510 --only-mane --output transcripts_mane.json
    uv run scripts/ensembl_api.py transcripts ENSG00000141510 --only-canonical --output transcripts_canonical.json
    uv run scripts/ensembl_api.py transcripts ENSG00000141510 --output transcripts_all.json

5b. Canonical TSS: Get the single coordinate of the Transcription Start Site (TSS) for the canonical transcript of a gene.

[!NOTE] Unlike the standard `transcripts` command, `canonical-tss` accepts both symbols (e.g., `TP53`) and Ensembl IDs, and automatically resolves them. It also does the math for strand orientation (TSS is `Start` for the `+` strand and `End` for the `-` strand), outputting the single integer coordinate directly.

    uv run scripts/ensembl_api.py canonical-tss TP53 --output tp53_tss.json
    uv run scripts/ensembl_api.py canonical-tss ENSG00000141510 --output tss.json

6. Transcript Structure: Exon coordinates, CDS boundaries, and computed 5'/3' UTR regions for a transcript.

    uv run scripts/ensembl_api.py transcript-structure ENST00000269305 --output structure.json

7. Protein Info: ENSP ID and sequence length for a transcript.

    uv run scripts/ensembl_api.py protein-info ENST00000269305 --output protein_info.json

8. Protein Sequence: Amino acid FASTA for a transcript (ENST) or protein (ENSP) ID.

    uv run scripts/ensembl_api.py protein-sequence ENST00000269305 --output protein.fasta
    uv run scripts/ensembl_api.py protein-sequence ENSP00000269305 --output protein_ensp.fasta

9. Variant Consequence (VEP): Predict molecular consequences for a genomic variant. Includes open-licensed plugins: AlphaMissense, Conservation, DosageSensitivity, IntAct, MaveDB, OpenTargets, LoF (Loftee), NMD, UTRAnnotator, mutfunc, LOEUF.

    uv run scripts/ensembl_api.py vep 9:21971147:T:C --species human --output vep.json
    uv run scripts/ensembl_api.py vep rs699 --species human --output vep_rs699.json

Example VEP stdout output:

    [*] Variant: 9:21971147:T>C
    [*] Most severe consequence: missense_variant
    [*] Found 15 transcript consequences.
    [*] VEP Predictions:
    - ENST00000304494 (CDKN2A): Consequence = missense_variant
    - ENST00000304494 (CDKN2A): Amino Acids = N/S
    - ENST00000304494 (CDKN2A): SIFT = deleterious (0.01)
    - ENST00000304494 (CDKN2A): AlphaMissense Class = likely_benign
    - ENST00000304494 (CDKN2A): AlphaMissense Pathogenicity = 0.2129
    - ENST00000304494 (CDKN2A): Conservation = 2.05
    - ENST00000304494 (CDKN2A): Dosage Sensitivity (Haplo) = 0.889228328567991
    - ENST00000304494 (CDKN2A): Dosage Sensitivity (Triplo) = 0.135514349094646
    - ENST00000304494 (CDKN2A): Loss of Function (LOEUF) = 0.791

Presenting VEP Results: After running the VEP command, you MUST present the full VEP Predictions list from stdout to the user. This list contains both standard VEP predictions (Consequence, Amino Acids, SIFT, PolyPhen) and open-license plugin results (AlphaMissense, Conservation, Dosage Sensitivity, LOEUF, Loftee LoF, NMD, UTRAnnotator, Mutfunc). Do NOT just summarize; show the complete list so the user can see all predictions. If the list is very long (many transcripts), show the MANE Select / canonical transcript rows in full and note that the complete data is in the JSON output.
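When the prediction list is long, the rows for one transcript can be pulled out of the captured stdout. The sketch below is illustrative only: `parse_vep_predictions` is a hypothetical helper, and it assumes every prediction row follows the `- ENST... (GENE): Field = Value` shape seen in the example output above.

```python
import re
from collections import defaultdict

# ASSUMPTION: rows look like "- ENST00000304494 (CDKN2A): SIFT = deleterious (0.01)",
# matching the example stdout; other line shapes are ignored.
_ROW = re.compile(r"^\s*-\s+(ENST\d+)\s+\(([^)]*)\):\s+(.+?)\s+=\s+(.*)$")


def parse_vep_predictions(stdout_text):
    """Group prediction rows as {(transcript_id, gene): {field: value}}."""
    grouped = defaultdict(dict)
    for line in stdout_text.splitlines():
        match = _ROW.match(line)
        if match:
            transcript_id, gene, field, value = match.groups()
            grouped[(transcript_id, gene)][field] = value.strip()
    return dict(grouped)
```

Looking up the MANE Select or canonical transcript ID in the returned dictionary gives the rows to show in full; the remaining transcripts stay available in the JSON output.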
Parsing Outputs
If the user needs detailed, nested structural data (like the precise integer coordinates of Exon 2 of a transcript) that isn't summarized in stdout:
- Locate the JSON file (either specified via `--output` or the temporary file path printed by the script).
- Use terminal tools like `jq` or write a quick, disposable Python snippet to extract the specific data point requested. Do not attempt to read the entire JSON file into your context if it is very large.
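A disposable snippet for the Exon 2 case might look like the sketch below. The JSON layout is an assumption, not something this document specifies: it supposes a top-level `exons` list in transcript order whose entries carry `start` and `end` keys. Check the actual keys in the file (for example with `jq 'keys'`) before relying on it.

```python
import json


def load_json(path):
    """Load the script's JSON output from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def exon_coordinates(structure, exon_number):
    """Return (start, end) for the 1-based exon_number.

    ASSUMPTION: structure["exons"] is a list in transcript order and each
    entry has integer "start" and "end" keys. These field names are
    hypothetical; adjust them to the real output.
    """
    exon = structure["exons"][exon_number - 1]
    return exon["start"], exon["end"]
```

For instance, `exon_coordinates(load_json("structure.json"), 2)` would print only the two integers requested, so the full file never has to be read into context.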
Custom Queries
If you need to make an API call that the script does not support (e.g., fetching protein domain annotations, coordinate mapping between assemblies, homology searches, linkage disequilibrium, or phenotype lookups), read `references/ensembl_rest_api_reference.md` for a complete reference of available endpoints, parameters, and response fields.

CRITICAL: When writing custom scripts or using alternatives to the provided scripts, you MUST respect the Ensembl REST API rate limits (maximum 15 requests per second) and handle errors gracefully (e.g., `429 Too Many Requests` with exponential backoff).
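One way to honor both requirements in a custom script is sketched below using only the standard library. It is an illustration under stated assumptions, not the provided script's implementation: `fetch_with_backoff` is a hypothetical name, the pacing is a simple fixed pause of 1/15 s before each request, and the `Retry-After` header is used only when the server sends a numeric value.

```python
import time
import urllib.error
import urllib.request

MIN_INTERVAL = 1.0 / 15  # pause that keeps a single caller under 15 requests/second


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_retries=5):
    """GET url as JSON bytes, retrying 429 responses with exponential backoff.

    opener and sleep are injectable so the logic can be exercised offline.
    """
    delay = 1.0  # first backoff wait in seconds; doubles after each 429
    for attempt in range(max_retries + 1):
        sleep(MIN_INTERVAL)  # crude client-side rate limiting
        request = urllib.request.Request(url, headers={"Accept": "application/json"})
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            if error.code != 429 or attempt == max_retries:
                raise  # other errors, or retries exhausted: surface to the caller
            retry_after = error.headers.get("Retry-After") if error.headers else None
            try:
                wait = float(retry_after) if retry_after else delay
            except ValueError:
                wait = delay  # non-numeric header: fall back to our own delay
            sleep(wait)
            delay *= 2
```

The per-request pause is deliberately conservative; a script that issues requests from several threads or processes would need a shared limiter instead.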