ensembl-database

Ensembl Database

Overview

Access and query the Ensembl genome database, a comprehensive resource for vertebrate genomic data maintained by EMBL-EBI. The database provides gene annotations, sequences, variants, regulatory information, and comparative genomics data for over 250 species. Current release is 115 (September 2025).

When to Use This Skill

This skill should be used when:
  • Querying gene information by symbol or Ensembl ID
  • Retrieving DNA, transcript, or protein sequences
  • Analyzing genetic variants using the Variant Effect Predictor (VEP)
  • Finding orthologs and paralogs across species
  • Accessing regulatory features and genomic annotations
  • Converting coordinates between genome assemblies (e.g., GRCh37 to GRCh38)
  • Performing comparative genomics analyses
  • Integrating Ensembl data into genomic research pipelines

Core Capabilities

1. Gene Information Retrieval

Query gene data by symbol, Ensembl ID, or external database identifiers.
Common operations:
  • Look up gene information by symbol (e.g., "BRCA2", "TP53")
  • Retrieve transcript and protein information
  • Get gene coordinates and chromosomal locations
  • Access cross-references to external databases (UniProt, RefSeq, etc.)
Using the ensembl_rest package:
```python
from ensembl_rest import EnsemblClient

client = EnsemblClient()

# Look up gene by symbol
gene_data = client.symbol_lookup(
    species='human',
    symbol='BRCA2'
)

# Get detailed gene information
gene_info = client.lookup_id(
    id='ENSG00000139618',  # BRCA2 Ensembl ID
    expand=True
)
```

**Direct REST API (no package):**
```python
import requests

server = "https://rest.ensembl.org"

# Symbol lookup
response = requests.get(
    f"{server}/lookup/symbol/homo_sapiens/BRCA2",
    headers={"Content-Type": "application/json"}
)
gene_data = response.json()
```
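The remaining operations in this section (coordinates and cross-references) can be sketched against the REST API directly. The following is a minimal sketch, not the package's own implementation: it uses only the standard library, and the helper names `build_url`, `fetch_json`, and `xref_databases` are illustrative. It assumes the `/lookup/id` and `/xrefs/id` endpoints and that each cross-reference entry carries `dbname` and `primary_id` fields.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"


def build_url(path, **params):
    """Join the server, an endpoint path, and optional query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{SERVER}{path}" + (f"?{query}" if query else "")


def fetch_json(url):
    """GET a URL and decode the JSON body (performs a network request)."""
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def xref_databases(xrefs):
    """Group cross-reference IDs by their source database name."""
    grouped = {}
    for entry in xrefs:
        dbname = entry.get("dbname", "unknown")
        grouped.setdefault(dbname, []).append(entry.get("primary_id"))
    return grouped
```

For example, `fetch_json(build_url("/lookup/id/ENSG00000139618", expand=1))` should return the gene record with its coordinates, and `xref_databases(fetch_json(build_url("/xrefs/id/ENSG00000139618")))` groups its external identifiers by database.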

2. Sequence Retrieval

Fetch genomic, transcript, or protein sequences in various formats (JSON, FASTA, plain text).
Operations:
  • Get DNA sequences for genes or genomic regions
  • Retrieve transcript sequences (cDNA)
  • Access protein sequences
  • Extract sequences with flanking regions or modifications
Example:
```python
# Using ensembl_rest package
sequence = client.sequence_id(
    id='ENSG00000139618',  # Gene ID
    content_type='application/json'
)

# Get sequence for a genomic region
region_seq = client.sequence_region(
    species='human',
    region='7:140424943-140624564'  # chromosome:start-end
)
```
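To illustrate the choice of sequence type and output format, here is a rough standard-library sketch that requests FASTA from the REST API and parses it. The helper names are hypothetical, and it assumes the `/sequence/id` endpoint accepts a `type` parameter with the values `genomic`, `cdna`, `cds`, and `protein`, and honours a `text/x-fasta` content type.

```python
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"
SEQUENCE_TYPES = {"genomic", "cdna", "cds", "protein"}


def sequence_url(stable_id, seq_type="genomic"):
    """Build a /sequence/id URL for the requested sequence type."""
    if seq_type not in SEQUENCE_TYPES:
        raise ValueError(f"seq_type must be one of {sorted(SEQUENCE_TYPES)}")
    query = urllib.parse.urlencode({"type": seq_type})
    return f"{SERVER}/sequence/id/{stable_id}?{query}"


def fetch_fasta(url):
    """GET a sequence as FASTA text (performs a network request)."""
    request = urllib.request.Request(
        url, headers={"Content-Type": "text/x-fasta"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}
```

A call such as `parse_fasta(fetch_fasta(sequence_url("ENST00000380152", "cdna")))` would then yield the transcript's cDNA keyed by its FASTA header.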

3. Variant Analysis

Query genetic variation data and predict variant consequences using the Variant Effect Predictor (VEP).
Capabilities:
  • Look up variants by rsID or genomic coordinates
  • Predict functional consequences of variants
  • Access population frequency data
  • Retrieve phenotype associations
VEP example:
```python
# Predict variant consequences
vep_result = client.vep_hgvs(
    species='human',
    hgvs_notation='ENST00000380152.7:c.803C>T'
)

# Query variant by rsID
variant = client.variation_id(
    species='human',
    id='rs699'
)
```
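VEP responses are nested, so a small flattening step is usually needed before the consequences can be read. The sketch below is an assumption-laden illustration: it assumes the REST paths `/vep/:species/id/:id` and `/vep/:species/hgvs/:notation`, and that each result contains `most_severe_consequence` plus a `transcript_consequences` list with `gene_symbol`, `transcript_id`, `consequence_terms`, and `impact`. The function names are hypothetical.

```python
import urllib.parse


def vep_id_path(rsid, species="human"):
    """Endpoint path for VEP on a known variant identifier."""
    return f"/vep/{species}/id/{rsid}"


def vep_hgvs_path(hgvs_notation, species="human"):
    """Endpoint path for VEP on HGVS notation (characters such as '>' are escaped)."""
    encoded = urllib.parse.quote(hgvs_notation, safe=":")
    return f"/vep/{species}/hgvs/{encoded}"


def summarize_vep(vep_results):
    """Flatten a decoded VEP response into one row per transcript consequence."""
    rows = []
    for result in vep_results:
        for consequence in result.get("transcript_consequences", []):
            rows.append({
                "input": result.get("input"),
                "most_severe": result.get("most_severe_consequence"),
                "gene": consequence.get("gene_symbol"),
                "transcript": consequence.get("transcript_id"),
                "consequences": ", ".join(consequence.get("consequence_terms", [])),
                "impact": consequence.get("impact"),
            })
    return rows
```

Passing the decoded JSON from either path to `summarize_vep` gives a list of flat dictionaries that is easy to filter by `impact` or write to a table.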

4. Comparative Genomics

Perform cross-species comparisons to identify orthologs, paralogs, and evolutionary relationships.
Operations:
  • Find orthologs (same gene in different species)
  • Identify paralogs (related genes in same species)
  • Access gene trees showing evolutionary relationships
  • Retrieve gene family information
Example:
```python
# Find orthologs for a human gene
orthologs = client.homology_ensemblgene(
    id='ENSG00000139618',  # Human BRCA2
    target_species='mouse'
)

# Get gene tree
gene_tree = client.genetree_member_symbol(
    species='human',
    symbol='BRCA2'
)
```
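A homology response can mix orthologs and paralogs, so it helps to split them by type. This is a sketch under assumptions: it presumes the decoded response has the shape `{"data": [{"homologies": [...]}]}`, that each homology has a `type` such as `ortholog_one2one` or `within_species_paralog`, and a `target` with `id` and `species`. The path format in `homology_path` reflects recent REST releases and may differ on older servers.

```python
def homology_path(gene_id, species="human", target_species=None):
    """Endpoint path for a homology query (path layout is an assumption)."""
    path = f"/homology/id/{species}/{gene_id}"
    if target_species:
        path += f"?target_species={target_species}"
    return path


def split_homologies(response):
    """Separate a decoded homology response into orthologs and paralogs."""
    orthologs, paralogs = [], []
    for entry in response.get("data", []):
        for homology in entry.get("homologies", []):
            kind = homology.get("type", "")
            target = homology.get("target", {})
            record = {
                "type": kind,
                "species": target.get("species"),
                "id": target.get("id"),
            }
            if kind.startswith("ortholog"):
                orthologs.append(record)
            elif "paralog" in kind:
                paralogs.append(record)
    return orthologs, paralogs
```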

5. Genomic Region Analysis

Find all genomic features (genes, transcripts, regulatory elements) in a specific region.
Use cases:
  • Identify all genes in a chromosomal region
  • Find regulatory features (promoters, enhancers)
  • Locate variants within a region
  • Retrieve structural features
Example:
```python
# Find all features in a region
features = client.overlap_region(
    species='human',
    region='7:140424943-140624564',
    feature='gene'
)
```
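Region queries are capped in size by the server, so long regions generally have to be split before querying. The sketch below shows one way to do that. The 5 Mb default for `max_span` is an assumption about the overlap endpoint's limit and should be checked against the current API documentation; the helper names are illustrative, and the path assumes multiple `feature` parameters can be joined with `;`.

```python
def parse_region(region):
    """Split 'chromosome:start-end' into (chromosome, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(value) for value in span.split("-"))
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return chrom, start, end


def chunk_region(region, max_span=5_000_000):
    """Break a region into consecutive sub-regions of at most max_span bases."""
    chrom, start, end = parse_region(region)
    chunks = []
    position = start
    while position <= end:
        stop = min(position + max_span - 1, end)
        chunks.append(f"{chrom}:{position}-{stop}")
        position = stop + 1
    return chunks


def overlap_path(region, features=("gene",), species="human"):
    """Endpoint path requesting one or more feature types for a region."""
    query = ";".join(f"feature={feature}" for feature in features)
    return f"/overlap/region/{species}/{region}?{query}"
```

Each chunk from `chunk_region` can be passed to `overlap_path` in turn; features that span a chunk boundary may then appear twice and should be de-duplicated by their IDs.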

6. Assembly Mapping

Convert coordinates between different genome assemblies (e.g., GRCh37 to GRCh38).
Important: Use `https://grch37.rest.ensembl.org` for GRCh37/hg19 queries and `https://rest.ensembl.org` for current assemblies.
Example:
```python
from ensembl_rest import AssemblyMapper

# Map coordinates from GRCh37 to GRCh38
mapper = AssemblyMapper(
    species='human',
    asm_from='GRCh37',
    asm_to='GRCh38'
)
mapped = mapper.map(chrom='7', start=140453136, end=140453136)
```
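The same conversion can be requested from the REST API without the package. This is a minimal sketch with hypothetical helper names. It assumes the `/map/:species/:asm_from/:region/:asm_to` endpoint with regions written as `chrom:start..end:strand`, and a response of the form `{"mappings": [{"original": {...}, "mapped": {...}}]}`.

```python
def map_path(chrom, start, end, asm_from="GRCh37", asm_to="GRCh38",
             species="human", strand=1):
    """Endpoint path for converting a region between two assemblies."""
    return f"/map/{species}/{asm_from}/{chrom}:{start}..{end}:{strand}/{asm_to}"


def mapped_regions(response):
    """Extract (chromosome, start, end, strand) tuples from a decoded map response."""
    regions = []
    for mapping in response.get("mappings", []):
        target = mapping["mapped"]
        regions.append((
            target["seq_region_name"],
            target["start"],
            target["end"],
            target["strand"],
        ))
    return regions
```

An empty list from `mapped_regions` means the region has no equivalent in the target assembly, and more than one tuple means the region was split across several target locations, so callers should handle both cases.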

API Best Practices

Rate Limiting

The Ensembl REST API has rate limits. Follow these practices:
  1. Respect rate limits: Maximum 15 requests per second for anonymous users
  2. Handle 429 responses: When rate-limited, check the `Retry-After` header and wait
  3. Use batch endpoints: When querying multiple items, use batch endpoints where available
  4. Cache results: Store frequently accessed data to reduce API calls
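Practices 1 and 3 can be enforced on the client side. The following sketch is illustrative rather than an official client: `RateLimiter` spaces calls so that they stay under a chosen rate, and `batches` splits a list of IDs for a batch (POST) endpoint. The batch size of 1000 is an assumption and should be checked against the limit documented for the endpoint in use.

```python
import time


class RateLimiter:
    """Space out calls so that at most max_per_second are made per second."""

    def __init__(self, max_per_second=15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def batches(items, size=1000):
    """Yield consecutive slices of items, each at most `size` long."""
    for index in range(0, len(items), size):
        yield items[index:index + size]
```

Calling `limiter.wait()` immediately before each request keeps a single-threaded script under the limit; the `clock` and `sleep` arguments exist only so the timing can be replaced in tests.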

Error Handling

Always implement proper error handling:
```python
import requests
import time

def query_ensembl(endpoint, params=None, max_retries=3):
    """GET an Ensembl REST endpoint, retrying when rate limited (HTTP 429)."""
    server = "https://rest.ensembl.org"
    headers = {"Content-Type": "application/json"}

    for attempt in range(max_retries):
        response = requests.get(
            f"{server}{endpoint}",
            headers=headers,
            params=params,
            timeout=30
        )

        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            # Rate limited - wait and retry.
            # Retry-After may be fractional (e.g. "40.0"), so parse it as a float.
            retry_after = float(response.headers.get('Retry-After', 1))
            time.sleep(retry_after)
        else:
            # Raise an exception for other error statuses (4xx/5xx)
            response.raise_for_status()

    raise RuntimeError(f"Failed after {max_retries} attempts")
```

Installation

Python Package (Recommended)

```bash
uv pip install ensembl_rest
```
The `ensembl_rest` package provides a Pythonic interface to all Ensembl REST API endpoints.

Direct REST API

No Ensembl-specific package is needed; any standard HTTP library such as `requests` will do:
```bash
uv pip install requests
```

Resources

references/

  • `api_endpoints.md`: Comprehensive documentation of all 17 API endpoint categories with examples and parameters

scripts/

  • `ensembl_query.py`: Reusable Python script for common Ensembl queries with built-in rate limiting and error handling

Common Workflows

Workflow 1: Gene Annotation Pipeline

  1. Look up gene by symbol to get Ensembl ID
  2. Retrieve transcript information
  3. Get protein sequences for all transcripts
  4. Find orthologs in other species
  5. Export results
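The five steps above can be strung together roughly as follows. This is a sketch under stated assumptions, not a reference pipeline: `fetch` is a caller-supplied function (hypothetical) that takes an endpoint path and returns decoded JSON, for example a wrapper around the retrying request helper from the error-handling section. It also assumes that an expanded lookup returns a `Transcript` list whose coding entries carry a `Translation` with an `id`, and that sequence responses contain a `seq` field.

```python
import json


def annotate_gene(symbol, fetch, species="homo_sapiens"):
    """Run the workflow steps; `fetch` maps an endpoint path to decoded JSON."""
    # Steps 1-2: an expanded symbol lookup returns the gene ID and its transcripts
    gene = fetch(f"/lookup/symbol/{species}/{symbol}?expand=1")
    transcripts = gene.get("Transcript", [])

    # Step 3: protein sequence for every transcript that has a translation
    proteins = {}
    for transcript in transcripts:
        translation = transcript.get("Translation")
        if translation:
            record = fetch(f"/sequence/id/{translation['id']}?type=protein")
            proteins[transcript["id"]] = record.get("seq")

    # Step 4: orthologs in other species (path layout is an assumption)
    homology = fetch(f"/homology/id/{species}/{gene['id']}?type=orthologues")
    orthologs = [
        item["target"]["id"]
        for entry in homology.get("data", [])
        for item in entry.get("homologies", [])
    ]

    # Step 5: export the collected annotation as JSON text
    return json.dumps(
        {
            "gene_id": gene["id"],
            "symbol": symbol,
            "transcripts": [t["id"] for t in transcripts],
            "proteins": proteins,
            "orthologs": orthologs,
        },
        indent=2,
        sort_keys=True,
    )
```

Injecting `fetch` keeps the workflow logic separate from networking, so rate limiting, caching, or canned test data can be swapped in without touching the steps themselves.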

Workflow 2: Variant Analysis

  1. Query variant by rsID or coordinates
  2. Use VEP to predict functional consequences
  3. Check population frequencies
  4. Retrieve phenotype associations
  5. Generate report
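As a rough illustration of these steps, the sketch below assembles a plain-text report from two requests. The `fetch` argument is a hypothetical caller-supplied function that maps an endpoint path to decoded JSON. The code assumes the variation endpoint accepts `pops=1` and `phenotypes=1` and returns `mappings`, `populations`, and `phenotypes` lists with the field names used here; treat those names as assumptions to verify against a live response.

```python
def variant_report(rsid, fetch, species="human"):
    """Build a short text report for one variant from variation and VEP data."""
    # Steps 1, 3, 4: one variation query with frequencies and phenotypes included
    variant = fetch(f"/variation/{species}/{rsid}?pops=1;phenotypes=1")
    # Step 2: predicted consequences from VEP
    vep_results = fetch(f"/vep/{species}/id/{rsid}")

    # Step 5: assemble the report line by line
    lines = [f"Variant: {variant.get('name', rsid)}"]
    for mapping in variant.get("mappings", []):
        lines.append(
            f"Location: {mapping.get('location')} ({mapping.get('allele_string')})"
        )
    for result in vep_results:
        lines.append(
            f"Most severe consequence: {result.get('most_severe_consequence')}"
        )
    for population in variant.get("populations", []):
        lines.append(
            f"Frequency: {population.get('population')} "
            f"{population.get('allele')}={population.get('frequency')}"
        )
    for phenotype in variant.get("phenotypes", []):
        lines.append(
            f"Phenotype: {phenotype.get('trait')} [{phenotype.get('source')}]"
        )
    return "\n".join(lines)
```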

Workflow 3: Comparative Analysis

  1. Start with gene of interest in reference species
  2. Find orthologs in target species
  3. Retrieve sequences for all orthologs
  4. Compare gene structures and features
  5. Analyze evolutionary conservation
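For the conservation step, one simple measure is the percent identity between aligned sequences. The sketch below computes it over gap-free alignment columns; this is a deliberately basic measure and not necessarily how Ensembl derives its own identity scores. It assumes a homology response that was requested with aligned sequences, so that each homology has `source` and `target` entries with an `align_seq` string; those field names are assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" or residue_b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if residue_a == residue_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def ortholog_identities(response):
    """Map each target gene ID to its identity with the source gene."""
    identities = {}
    for entry in response.get("data", []):
        for homology in entry.get("homologies", []):
            source = homology["source"]
            target = homology["target"]
            identities[target["id"]] = percent_identity(
                source["align_seq"], target["align_seq"]
            )
    return identities
```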

Species and Assembly Information

To query available species and assemblies:
```python
# List all available species
species_list = client.info_species()

# Get assembly information for a species
assembly_info = client.info_assembly(species='human')
```

Common species identifiers:
- Human: `homo_sapiens` or `human`
- Mouse: `mus_musculus` or `mouse`
- Zebrafish: `danio_rerio` or `zebrafish`
- Fruit fly: `drosophila_melanogaster`
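Because a species can be referred to by several names, it can be useful to resolve whatever the user typed to the canonical identifier before querying. The sketch below does this against a decoded species listing. It assumes the listing has the form `{"species": [...]}` with `name`, `common_name`, `aliases`, and `assembly` fields per entry; the function name is illustrative.

```python
def _normalize(name):
    """Lower-case a species name and use underscores instead of spaces."""
    return (name or "").strip().lower().replace(" ", "_")


def resolve_species(name, species_info):
    """Return (canonical_name, assembly) for a name or alias, or None if unknown."""
    wanted = _normalize(name)
    if not wanted:
        return None
    for entry in species_info.get("species", []):
        known = {
            _normalize(entry.get("name")),
            _normalize(entry.get("common_name")),
        }
        known.update(_normalize(alias) for alias in entry.get("aliases", []))
        if wanted in known:
            return entry["name"], entry.get("assembly")
    return None
```

Given a listing that contains the human entry, both `resolve_species("Human", listing)` and `resolve_species("homo sapiens", listing)` resolve to the same canonical name.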

Additional Resources