ena-database

ENA Database

Overview

The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.

When to Use This Skill

This skill should be used when:
  • Retrieving nucleotide sequences or raw sequencing reads by accession
  • Searching for samples, studies, or assemblies by metadata criteria
  • Downloading FASTQ files or genome assemblies for analysis
  • Querying taxonomic information for organisms
  • Accessing sequence annotations and functional data
  • Integrating ENA data into bioinformatics pipelines
  • Performing cross-reference searches to related databases
  • Bulk downloading datasets via FTP or Aspera

Core Capabilities

1. Data Types and Structure

ENA organizes data into hierarchical object types:
Studies/Projects - Group related data and control release dates. Studies are the primary unit for citing archived data.
Samples - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.
Raw Reads - Consist of:
  • Experiments: Metadata about sequencing methods, library preparation, and instrument details
  • Runs: References to data files containing raw sequencing reads from a single sequencing run
Assemblies - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.
Sequences - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.
Analyses - Results from computational analyses of sequence data.
Taxonomy Records - Taxonomic information including lineage and rank.
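Each object type carries its own accession format, so a pipeline can often tell what kind of record it holds before making any request. The sketch below is illustrative only: the patterns follow the commonly documented INSDC accession formats, and `classify_accession` is a hypothetical helper, not part of any ENA client.

```python
import re

# Accession patterns for several of the object types listed above.
# Assumption: these follow the commonly documented INSDC formats.
ACCESSION_PATTERNS = [
    ("project", re.compile(r"^PRJ[EDN][A-Z]\d+$")),
    ("study", re.compile(r"^[EDS]RP\d{6,}$")),
    ("biosample", re.compile(r"^SAM[EDN][A-Z]?\d+$")),
    ("sample", re.compile(r"^[EDS]RS\d{6,}$")),
    ("experiment", re.compile(r"^[EDS]RX\d{6,}$")),
    ("run", re.compile(r"^[EDS]RR\d{6,}$")),
    ("analysis", re.compile(r"^[EDS]RZ\d{6,}$")),
    ("assembly", re.compile(r"^GCA_\d{9}(\.\d+)?$")),
]


def classify_accession(accession: str) -> str:
    """Return the object type an accession looks like, or 'unknown'."""
    cleaned = accession.strip()
    for object_type, pattern in ACCESSION_PATTERNS:
        if pattern.match(cleaned):
            return object_type
    return "unknown"
```

A classifier like this can route an accession to the matching `result` type before a search is built.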

2. Programmatic Access

ENA provides multiple REST APIs for data access. Consult references/api_reference.md for detailed endpoint documentation.
Key APIs:
ENA Portal API - Advanced search functionality across all ENA data types
ENA Browser API - Direct retrieval of records and metadata
ENA Taxonomy REST API - Query taxonomic information
  • Access lineage, rank, and related taxonomic data
ENA Cross Reference Service - Access related records from external databases
CRAM Reference Registry - Retrieve reference sequences
Rate Limiting: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).
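One way to stay under the limit is to space requests on the client side. The following is a minimal sketch, assuming a single-threaded client; `Throttle` and its parameters are hypothetical names introduced for illustration.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=50.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the next slot one interval after this call's slot.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Calling `throttle.wait()` immediately before each request keeps a tight loop from exceeding the configured rate.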

3. Searching and Retrieving Data

Browser-Based Search:
  • Free text search across all fields
  • Sequence similarity search (BLAST integration)
  • Cross-reference search to find related records
  • Advanced search with Rulespace query builder
Programmatic Queries:
  • Use Portal API for advanced searches at scale
  • Filter by data type, date range, taxonomy, or metadata fields
  • Download results as tabulated metadata summaries or XML records
Example API Query Pattern:
```python
import requests

# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    "query": 'study_accession="PRJEB1234"',  # text values are quoted in ENA query syntax
    "format": "json",
    "limit": 100,
}

response = requests.get(base_url, params=params)
response.raise_for_status()  # stop on HTTP errors such as 429
samples = response.json()
```
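A query that matches more records than `limit` has to be read page by page. The sketch below assumes the Portal API's `offset` parameter is used alongside `limit`; `search_params`, `iter_records`, and `fetch_page_http` are hypothetical helpers, and the HTTP call uses only the standard library.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def search_params(result: str, query: str, limit: int, offset: int) -> dict:
    """Build the query parameters for one page of results."""
    return {"result": result, "query": query, "format": "json",
            "limit": limit, "offset": offset}


def iter_records(fetch_page: Callable[[dict], list], result: str, query: str,
                 page_size: int = 100) -> Iterator[dict]:
    """Yield every record, requesting pages until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(search_params(result, query, page_size, offset))
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


def fetch_page_http(params: dict) -> list:
    """Fetch one page over HTTP; an empty body is treated as no records."""
    with urlopen(f"{SEARCH_URL}?{urlencode(params)}", timeout=30) as response:
        body = response.read()
    return json.loads(body) if body.strip() else []
```

Passing the page fetcher in as an argument keeps the paging logic testable without network access, for example `iter_records(fetch_page_http, "sample", 'study_accession="PRJEB1234"')`.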

4. Data Retrieval Formats

Metadata Formats:
  • XML (native ENA format)
  • JSON (via Portal API)
  • TSV/CSV (tabulated summaries)
Sequence Data:
  • FASTQ (raw reads)
  • BAM/CRAM (aligned reads)
  • FASTA (assembled sequences)
  • EMBL flat file format (annotated sequences)
Download Methods:
  • Direct API download (small files)
  • FTP for bulk data transfer
  • Aspera for high-speed transfer of large datasets
  • enaBrowserTools command-line utility for bulk downloads
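For small records, the format is chosen in the request path. The sketch below assumes the Browser API exposes records at `/ena/browser/api/{format}/{accession}` for the `xml`, `fasta`, and `embl` formats; `record_url` and `parse_fasta` are hypothetical helpers written for this example.

```python
BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"
RECORD_FORMATS = {"xml", "fasta", "embl"}


def record_url(accession: str, record_format: str = "xml") -> str:
    """Build the Browser API URL for one record in the requested format."""
    if record_format not in RECORD_FORMATS:
        raise ValueError(f"unsupported format: {record_format}")
    return f"{BROWSER_API}/{record_format}/{accession}"


def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} mapping."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}
```

The parser suits small downloads only; large sequence files are better streamed from FTP than held in memory.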

5. Common Use Cases

Retrieve raw sequencing reads by accession: download run files using the Browser API.
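A possible route to the read files themselves, assumed here rather than taken from this document, is the Portal API file report, which lists FASTQ locations for a run. The field names (`run_accession`, `fastq_ftp`, `fastq_md5`) and the helpers `filereport_url` and `fastq_files` are assumptions of this sketch.

```python
import csv
import io
from urllib.parse import urlencode

FILEREPORT_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession: str) -> str:
    """Build a file report request for one run, study, or sample accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    }
    return f"{FILEREPORT_URL}?{urlencode(params)}"


def fastq_files(report_tsv: str) -> list:
    """Return (run_accession, url, md5) tuples from a TSV file report.

    Assumes paired files are semicolon-separated and paths lack a scheme.
    """
    files = []
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        sums = (row.get("fastq_md5") or "").split(";")
        for index, path in enumerate(paths):
            md5 = sums[index] if index < len(sums) else ""
            files.append((row["run_accession"], f"ftp://{path}", md5))
    return files
```

The resulting URLs can be handed to any FTP-capable downloader, and the checksums allow each file to be verified afterwards.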


Search for all samples in a study: use the Portal API to list samples.


Find assemblies for a specific organism: search assemblies by taxonomy.
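A taxonomy-based assembly search might be built as follows. This sketch assumes the Portal API query functions `tax_eq` (exact taxon) and `tax_tree` (taxon plus descendants); the `fields` list and the helper `assembly_search_url` are illustrative and should be checked against references/api_reference.md.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def assembly_search_url(tax_id: int, include_descendants: bool = True,
                        limit: int = 100) -> str:
    """Build a Portal API URL that lists assemblies for a taxon."""
    tax_function = "tax_tree" if include_descendants else "tax_eq"
    params = {
        "result": "assembly",
        "query": f"{tax_function}({tax_id})",
        # Assumed field names; adjust to the fields the API reports.
        "fields": "accession,assembly_name,scientific_name,tax_id",
        "format": "tsv",
        "limit": limit,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"
```

For example, `assembly_search_url(562)` targets Escherichia coli and every taxon beneath it, while `include_descendants=False` restricts the search to the species node itself.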


Get taxonomic lineage:
```python
# Query taxonomy API
taxon_id = "562"  # Escherichia coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```
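The taxonomy response can then be turned into an ordered list of ancestor names. The sketch below assumes the JSON record has a `lineage` field holding a semicolon-separated string; `split_lineage` and `fetch_taxon` are hypothetical helpers, and the cache implements the local-caching advice for repeated lookups.

```python
import json
from functools import lru_cache
from urllib.request import urlopen

TAXONOMY_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{tax_id}"


def split_lineage(record: dict) -> list:
    """Split a 'A; B; C; ' lineage string into ['A', 'B', 'C']."""
    lineage = record.get("lineage") or ""
    return [name.strip() for name in lineage.split(";") if name.strip()]


@lru_cache(maxsize=1024)
def fetch_taxon(tax_id: str) -> dict:
    """Fetch one taxonomy record; repeated IDs are served from the cache."""
    with urlopen(TAXONOMY_URL.format(tax_id=tax_id), timeout=30) as response:
        return json.load(response)
```

With these in place, `split_lineage(fetch_taxon("562"))` would return the lineage from the root down to the parent of the requested taxon, provided the response has the assumed shape.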

6. Integration with Analysis Pipelines

Bulk Download Pattern:
  1. Search for accessions matching criteria using Portal API
  2. Extract file URLs from search results
  3. Download files via FTP or using enaBrowserTools
  4. Process downloaded data in pipeline
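The download and processing steps of this pattern can be sketched as below, assuming the search step has already produced `(url, md5)` pairs. The helpers `md5_of`, `download`, and `bulk_download` are hypothetical, and checksum verification is an addition of this sketch rather than a step named above.

```python
import hashlib
import shutil
from pathlib import Path
from urllib.request import urlopen


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download(url: str, dest_dir: Path, expected_md5: str = "") -> Path:
    """Download one file, skipping it if an intact copy already exists."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists() and expected_md5 and md5_of(target) == expected_md5:
        return target  # already downloaded and verified
    with urlopen(url, timeout=60) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    if expected_md5 and md5_of(target) != expected_md5:
        target.unlink()
        raise IOError(f"checksum mismatch for {url}")
    return target


def bulk_download(files, dest_dir: Path, process) -> None:
    """Download each (url, md5) pair and pass the local path to `process`."""
    for url, md5 in files:
        process(download(url, dest_dir, md5))
```

Skipping files whose checksum already matches makes the loop safe to rerun after an interrupted transfer.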
BLAST Integration: Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.

7. Best Practices

Rate Limiting:
  • Implement exponential backoff when receiving HTTP 429 responses
  • Batch requests when possible to stay within 50 req/sec limit
  • Use bulk download tools for large datasets instead of iterating API calls
Data Citation:
  • Always cite using Study/Project accessions when publishing
  • Include accession numbers for specific samples, runs, or assemblies used
API Response Handling:
  • Check HTTP status codes before processing responses
  • Parse XML responses using proper XML libraries (not regex)
  • Handle pagination for large result sets
Performance:
  • Use FTP/Aspera for downloading large files (>100MB)
  • Prefer TSV/JSON formats over XML when only metadata is needed
  • Cache taxonomy lookups locally when processing many records
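The backoff advice above can be sketched as follows. This is a minimal example, assuming the standard-library `urllib` client, which raises `HTTPError` for a 429 response; `get_with_backoff` and its parameters are hypothetical, and the added jitter is a design choice of this sketch.

```python
import random
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def _fetch(url: str) -> bytes:
    """Default fetcher: return the response body or raise HTTPError."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def get_with_backoff(url, fetch=_fetch, max_attempts=5, base_delay=0.5,
                     sleep=time.sleep):
    """Retry on HTTP 429, doubling the base wait after every failed attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except HTTPError as error:
            # Re-raise anything other than 429, and the final 429.
            if error.code != 429 or attempt == max_attempts - 1:
                raise
            # Waits of 0.5 s, 1 s, 2 s, ... plus jitter to avoid lockstep retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The `fetch` and `sleep` arguments are injectable so the retry logic can be exercised without network access or real delays.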

Resources

This skill includes detailed reference documentation for working with ENA:

references/

api_reference.md - Comprehensive API endpoint documentation including:
  • Detailed parameters for Portal API and Browser API
  • Response format specifications
  • Advanced query syntax and operators
  • Field names for filtering and searching
  • Common API patterns and examples
Load this reference when constructing complex API queries, debugging API responses, or needing specific parameter details.