ena-database
ENA Database
Overview
The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.
When to Use This Skill
This skill should be used when:
- Retrieving nucleotide sequences or raw sequencing reads by accession
- Searching for samples, studies, or assemblies by metadata criteria
- Downloading FASTQ files or genome assemblies for analysis
- Querying taxonomic information for organisms
- Accessing sequence annotations and functional data
- Integrating ENA data into bioinformatics pipelines
- Performing cross-reference searches to related databases
- Bulk downloading datasets via FTP or Aspera
Core Capabilities
1. Data Types and Structure
ENA organizes data into hierarchical object types:
Studies/Projects - Group related data and control release dates. Studies are the primary unit for citing archived data.
Samples - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.
Raw Reads - Consist of:
- Experiments: Metadata about sequencing methods, library preparation, and instrument details
- Runs: References to data files containing raw sequencing reads from a single sequencing run
Assemblies - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.
Sequences - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.
Analyses - Results from computational analyses of sequence data.
Taxonomy Records - Taxonomic information including lineage and rank.
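Each object type above is usually recognizable from the shape of its accession. The sketch below maps an accession string to an object type. The pattern table reflects commonly documented ENA/INSDC accession formats and is an assumption here, not an exhaustive or authoritative list; `classify_accession` is a hypothetical helper name.

```python
import re

# Illustrative accession patterns (assumed; not exhaustive).
ACCESSION_PATTERNS = [
    ("project", re.compile(r"^PRJ[EDN][A-Z]\d+$")),
    ("study", re.compile(r"^[EDS]RP\d{6,}$")),
    ("biosample", re.compile(r"^SAM[EDN][A-Z]?\d+$")),
    ("sample", re.compile(r"^[EDS]RS\d{6,}$")),
    ("experiment", re.compile(r"^[EDS]RX\d{6,}$")),
    ("run", re.compile(r"^[EDS]RR\d{6,}$")),
    ("analysis", re.compile(r"^[EDS]RZ\d{6,}$")),
    ("assembly", re.compile(r"^GCA_\d{9}(\.\d+)?$")),
]


def classify_accession(accession):
    """Return the object type suggested by an accession, or None if unrecognized."""
    acc = accession.strip().upper()
    for kind, pattern in ACCESSION_PATTERNS:
        if pattern.match(acc):
            return kind
    return None
```

A pipeline could use such a check to decide whether an identifier should be sent to a run-level or a study-level query before any request is made.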
2. Programmatic Access
ENA provides multiple REST APIs for data access. Consult references/api_reference.md for detailed endpoint documentation.
Key APIs:
ENA Portal API - Advanced search functionality across all ENA data types
- Documentation: https://www.ebi.ac.uk/ena/portal/api/doc
- Use for complex queries and metadata searches
ENA Browser API - Direct retrieval of records and metadata
- Documentation: https://www.ebi.ac.uk/ena/browser/api/doc
- Use for downloading specific records by accession
- Returns data in XML format
ENA Taxonomy REST API - Query taxonomic information
- Access lineage, rank, and related taxonomic data
ENA Cross Reference Service - Access related records from external databases
- Endpoint: https://www.ebi.ac.uk/ena/xref/rest/
CRAM Reference Registry - Retrieve reference sequences
- Endpoint: https://www.ebi.ac.uk/ena/cram/
- Query by MD5 or SHA1 checksums
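As an illustration of checksum-based lookup in the CRAM Reference Registry, the sketch below computes the MD5 of a reference sequence and builds a lookup URL. Two assumptions are made: the sequence is normalized as in the SAM `M5` convention (whitespace removed, uppercased), and the registry exposes an `md5/<checksum>` path under the endpoint listed above. The helper names are hypothetical.

```python
import hashlib

CRAM_REGISTRY = "https://www.ebi.ac.uk/ena/cram"


def sequence_md5(sequence):
    """MD5 of a sequence after stripping whitespace and uppercasing (assumed convention)."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def cram_md5_url(sequence):
    """Build a registry lookup URL; the md5/ path segment is an assumption."""
    return f"{CRAM_REGISTRY}/md5/{sequence_md5(sequence)}"
```

Normalizing before hashing matters: the same reference stored with different line wrapping or letter case would otherwise yield different checksums.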
Rate Limiting: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).
3. Searching and Retrieving Data
Browser-Based Search:
- Free text search across all fields
- Sequence similarity search (BLAST integration)
- Cross-reference search to find related records
- Advanced search with Rulespace query builder
Programmatic Queries:
- Use Portal API for advanced searches at scale
- Filter by data type, date range, taxonomy, or metadata fields
- Download results as tabulated metadata summaries or XML records
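When results are downloaded as XML records, a standard XML parser is enough to pull out the fields of interest. The following is a minimal sketch using Python's built-in `xml.etree.ElementTree`. `SAMPLE_XML` is a hand-written, heavily trimmed stand-in for a run record; the element names and the accessions in it are assumptions for illustration, and real records carry many more elements.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written stand-in for a run record (illustrative only).
SAMPLE_XML = """<RUN_SET>
  <RUN accession="ERR123456" alias="example_run">
    <EXPERIMENT_REF accession="ERX000001"/>
  </RUN>
</RUN_SET>"""


def summarize_runs(xml_text):
    """Return one dict per RUN element with its accession and linked experiment."""
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.iter("RUN"):
        experiment = run.find("EXPERIMENT_REF")
        runs.append({
            "run": run.get("accession"),
            "experiment": experiment.get("accession") if experiment is not None else None,
        })
    return runs
```

Calling `summarize_runs(SAMPLE_XML)` walks the tree rather than pattern-matching on text, so attribute order and whitespace in the response do not affect the result.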
Example API Query Pattern:
```python
import requests

# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    # Text values in Portal API queries are enclosed in double quotes
    "query": 'study_accession="PRJEB1234"',
    "format": "json",
    "limit": 100
}
response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()  # stop on HTTP errors such as 429 before parsing
samples = response.json()
```
4. Data Retrieval Formats
Metadata Formats:
- XML (native ENA format)
- JSON (via Portal API)
- TSV/CSV (tabulated summaries)
Sequence Data:
- FASTQ (raw reads)
- BAM/CRAM (aligned reads)
- FASTA (assembled sequences)
- EMBL flat file format (annotated sequences)
Download Methods:
- Direct API download (small files)
- FTP for bulk data transfer
- Aspera for high-speed transfer of large datasets
- enaBrowserTools command-line utility for bulk downloads
5. Common Use Cases
Retrieve raw sequencing reads by accession:
```python
# Fetch the XML record for a run using the Browser API
# (this returns the run's metadata, not the read files themselves)
accession = "ERR123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
```
Search for all samples in a study:
```python
# Use Portal API to list samples
# (%22 is a URL-encoded double quote around the text value)
study_id = "PRJNA123456"
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=study_accession=%22{study_id}%22&format=tsv"
```
Find assemblies for a specific organism:
```python
# Search assemblies by taxonomy
# (tax_tree takes an NCBI taxon ID and also matches descendant taxa)
taxon_id = "562"  # Escherichia coli
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree({taxon_id})&format=json"
```
Get taxonomic lineage:
```python
# Query taxonomy API
taxon_id = "562"  # E. coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```
6. Integration with Analysis Pipelines
Bulk Download Pattern:
- Search for accessions matching criteria using Portal API
- Extract file URLs from search results
- Download files via FTP or using enaBrowserTools
- Process downloaded data in pipeline
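The second step of the pattern above, extracting file URLs from search results, can be sketched as follows. The sketch assumes the results were requested as TSV with a `run_accession` column and a `fastq_ftp` column whose value is a semicolon-separated list of paths without a URL scheme; both the column names and the example paths in `EXAMPLE_TSV` are assumptions for illustration.

```python
import csv
import io


def extract_fastq_urls(tsv_text, column="fastq_ftp"):
    """Map each run accession to a list of download URLs taken from a TSV report."""
    urls = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # A paired-end run typically lists two files separated by ';'
        paths = [p for p in (row.get(column) or "").split(";") if p]
        # Add a scheme only when the report omits one
        urls[row["run_accession"]] = [
            p if "://" in p else f"ftp://{p}" for p in paths
        ]
    return urls


# Illustrative input; the paths are made up for this example.
EXAMPLE_TSV = (
    "run_accession\tfastq_ftp\n"
    "ERR000001\tftp.example.org/fastq/ERR000001_1.fastq.gz;"
    "ftp.example.org/fastq/ERR000001_2.fastq.gz\n"
    "ERR000002\t\n"
)
```

Runs with an empty file column map to an empty list, which lets the pipeline log or skip them instead of failing during download.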
BLAST Integration:
Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.
7. Best Practices
Rate Limiting:
- Implement exponential backoff when receiving HTTP 429 responses
- Batch requests when possible to stay within 50 req/sec limit
- Use bulk download tools for large datasets instead of iterating API calls
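A minimal sketch of exponential backoff on HTTP 429 is shown below. It is deliberately transport-agnostic: `send` is a hypothetical callable that performs one request and returns a `(status_code, body)` tuple, and the retry count and delays are arbitrary defaults rather than values recommended by ENA.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it stops returning 429, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Exponential growth plus a little jitter so parallel clients spread out
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    raise RuntimeError("still rate limited after retries")
```

With the `requests` library, `send` could wrap a single `requests.get(...)` call and return `(response.status_code, response.text)`. Passing `sleep` as a parameter keeps the function easy to test without real waiting.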
Data Citation:
- Always cite using Study/Project accessions when publishing
- Include accession numbers for specific samples, runs, or assemblies used
API Response Handling:
- Check HTTP status codes before processing responses
- Parse XML responses using proper XML libraries (not regex)
- Handle pagination for large result sets
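Pagination for large result sets can be sketched as a generator that requests consecutive pages until one comes back short. `fetch_all` is independent of any HTTP library; `portal_page` is one possible page fetcher and assumes the Portal API search endpoint accepts `limit` and `offset` parameters and may answer an empty result with no body. The query and result type in it are placeholders.

```python
def fetch_all(fetch_page, page_size=1000):
    """Yield every record by requesting pages until a short (or empty) page arrives."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size


def portal_page(offset, limit):
    """Example page fetcher (assumed parameters; performs a network call)."""
    import requests  # imported lazily so fetch_all has no third-party dependency

    response = requests.get(
        "https://www.ebi.ac.uk/ena/portal/api/search",
        params={
            "result": "read_run",
            "query": "tax_tree(562)",
            "format": "json",
            "limit": limit,
            "offset": offset,
        },
        timeout=60,
    )
    response.raise_for_status()
    # Treat an empty body as "no more records" (assumed behavior)
    if not response.text.strip():
        return []
    return response.json()
```

Iterating with `for record in fetch_all(portal_page): ...` processes records as they arrive instead of holding the full result set in memory.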
Performance:
- Use FTP/Aspera for downloading large files (>100MB)
- Prefer TSV/JSON formats over XML when only metadata is needed
- Cache taxonomy lookups locally when processing many records
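Local caching of taxonomy lookups can be as simple as memoizing the lookup function. The sketch below wraps any fetch function with `functools.lru_cache`; `make_cached_lookup` and the `fetch` callable are hypothetical names, and the cache size is an arbitrary choice.

```python
from functools import lru_cache


def make_cached_lookup(fetch, maxsize=4096):
    """Wrap fetch(tax_id) so each taxon ID is requested at most once per process."""
    cached = lru_cache(maxsize=maxsize)(fetch)

    def lookup(tax_id):
        # Normalize to str so 562 and "562" share one cache entry
        return cached(str(tax_id))

    return lookup
```

In practice `fetch` would perform the HTTP request against the taxonomy endpoint and return the parsed record. When thousands of reads map to a handful of organisms, this turns thousands of requests into a handful.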
Resources
This skill includes detailed reference documentation for working with ENA:
references/
api_reference.md - Comprehensive API endpoint documentation including:
- Detailed parameters for Portal API and Browser API
- Response format specifications
- Advanced query syntax and operators
- Field names for filtering and searching
- Common API patterns and examples
Load this reference when constructing complex API queries, debugging API responses, or looking up specific parameter details.