gene-database

Gene Database

Overview

NCBI Gene is a comprehensive database integrating gene information from diverse species. It provides nomenclature, reference sequences (RefSeqs), chromosomal maps, biological pathways, genetic variations, phenotypes, and cross-references to global genomic resources.

When to Use This Skill

Use this skill when working with gene data: searching by gene symbol or ID, retrieving gene sequences and metadata, analyzing gene functions and pathways, or performing batch gene lookups.

Quick Start

NCBI provides two main APIs for gene data access:
  1. E-utilities (Traditional): Full-featured API for all Entrez databases with flexible querying
  2. NCBI Datasets API (Newer): Optimized for gene data retrieval with simplified workflows
Choose E-utilities for complex queries and cross-database searches. Choose Datasets API for straightforward gene data retrieval with metadata and sequences in a single request.

Common Workflows

Search Genes by Symbol or Name

To search for genes by symbol or name across organisms:
  1. Use the scripts/query_gene.py script with E-utilities ESearch
  2. Specify the gene symbol and organism (e.g., "BRCA1 in human")
  3. The script returns matching Gene IDs
Example query patterns:
  • Gene symbol: insulin[gene name] AND human[organism]
  • Gene with disease: dystrophin[gene name] AND muscular dystrophy[disease]
  • Chromosome location: human[organism] AND 17q21[chromosome]
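The search step can be sketched in a few lines of Python. This is a minimal illustration of how a query term might be sent to E-utilities ESearch, not the contents of scripts/query_gene.py. The helper names `build_esearch_url` and `extract_gene_ids` are hypothetical, and the `retmax` default is an arbitrary choice.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20, api_key=None):
    """Return an ESearch URL that searches the Gene database for `term`."""
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": retmax}
    if api_key:
        params["api_key"] = api_key
    # urlencode escapes spaces and brackets in the query term.
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_gene_ids(esearch_json):
    """Pull the list of Gene IDs out of a decoded ESearch JSON response."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])


url = build_esearch_url("insulin[gene name] AND human[organism]")
```

Fetching `url` and passing the decoded JSON body to `extract_gene_ids` would yield the matching Gene IDs as strings.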

Retrieve Gene Information by ID

To fetch detailed information for known Gene IDs:
  1. Use scripts/fetch_gene_data.py with the Datasets API for comprehensive data
  2. Alternatively, use scripts/query_gene.py with E-utilities EFetch for specific formats
  3. Specify desired output format (JSON, XML, or text)
The Datasets API returns:
  • Gene nomenclature and aliases
  • Reference sequences (RefSeqs) for transcripts and proteins
  • Chromosomal location and mapping
  • Gene Ontology (GO) annotations
  • Associated publications
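As a rough sketch of the Datasets API lookup, the following builds (but does not send) requests for Gene IDs and for symbols within a taxon. The `v2` base path, the `/gene/id/...` and `/gene/symbol/.../taxon/...` routes, and the `api-key` header reflect my understanding of the Datasets REST API and should be checked against references/api_reference.md. The function names are hypothetical, and this is not the implementation of scripts/fetch_gene_data.py.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: version 2 of the Datasets REST API.
DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def _headers(api_key):
    headers = {"Accept": "application/json"}
    if api_key:
        headers["api-key"] = api_key  # assumed header name for the NCBI API key
    return headers


def gene_id_request(gene_ids, api_key=None):
    """Build a request for one or more numeric Gene IDs."""
    ids = ",".join(str(int(g)) for g in gene_ids)
    return Request(f"{DATASETS_BASE}/gene/id/{ids}", headers=_headers(api_key))


def gene_symbol_request(symbols, taxon, api_key=None):
    """Build a request for gene symbols within one taxon."""
    joined = quote(",".join(symbols), safe=",")
    url = f"{DATASETS_BASE}/gene/symbol/{joined}/taxon/{quote(taxon)}"
    return Request(url, headers=_headers(api_key))


by_id = gene_id_request([672, 7157])
by_symbol = gene_symbol_request(["TP53"], "Homo sapiens")
```

Either request can be passed to `urllib.request.urlopen` and the body decoded with `json.load`.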

Batch Gene Lookups

To look up multiple genes at once:
  1. Use scripts/batch_gene_lookup.py for efficient batch processing
  2. Provide a list of gene symbols or IDs
  3. Specify the organism for symbol-based queries
  4. The script handles rate limiting automatically (10 requests/second with API key)
This workflow is useful for:
  • Validating gene lists
  • Retrieving metadata for gene panels
  • Cross-referencing gene identifiers
  • Building gene annotation tables
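The batching and rate-limiting ideas can be sketched as follows. This is an illustrative outline, not the logic of scripts/batch_gene_lookup.py. `chunked` and `Throttle` are hypothetical helpers, and the batch size of 2 is chosen only to keep the demo small.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class Throttle:
    """Space out calls so that at most `rate` of them happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


gene_ids = [672, 7157, 5594]
throttle = Throttle(rate=10)  # 10 requests/second, the limit with an API key
batches = []
for batch in chunked(gene_ids, size=2):
    throttle.wait()
    # A real script would send one request per batch here,
    # for example with the IDs joined into a comma-separated parameter.
    batches.append(",".join(str(g) for g in batch))
```

Grouping IDs into one request per batch reduces the number of calls, and the throttle keeps the remaining calls under the per-second limit.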

Search by Biological Context

To find genes associated with specific biological functions or phenotypes:
  1. Use E-utilities with Gene Ontology (GO) terms or phenotype keywords
  2. Query by pathway names or disease associations
  3. Filter by organism, chromosome, or other attributes
Example searches:
  • By GO term: GO:0006915[biological process] (apoptosis)
  • By phenotype: diabetes[phenotype] AND mouse[organism]
  • By pathway: insulin signaling pathway[pathway]
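Field-tagged clauses like these can be composed programmatically before being sent to ESearch. The sketch below only assembles query strings, using the field names from the examples above. `field_clause`, `all_of`, and `any_of` are hypothetical helper names.

```python
def field_clause(value, field):
    """Format one Entrez clause such as diabetes[phenotype]."""
    return f"{value}[{field}]"


def all_of(*clauses):
    """Require every clause to match."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Accept any of the clauses; parentheses keep the OR group together."""
    return "(" + " OR ".join(clauses) + ")"


# Phenotype search restricted to one organism.
phenotype_term = all_of(
    field_clause("diabetes", "phenotype"),
    field_clause("mouse", "organism"),
)

# Pathway search filtered to either of two organisms.
pathway_term = all_of(
    field_clause("insulin signaling pathway", "pathway"),
    any_of(field_clause("human", "organism"), field_clause("mouse", "organism")),
)
```

Building terms this way makes it easy to add or drop an organism or chromosome filter without editing strings by hand.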

API Access Patterns

Rate Limits:
  • Without API key: 3 requests/second for E-utilities, 5 requests/second for Datasets API
  • With API key: 10 requests/second for both APIs
Authentication: Register for a free NCBI API key at https://www.ncbi.nlm.nih.gov/account/ to increase rate limits.
Error Handling: Both APIs return standard HTTP status codes. Common errors include:
  • 400: Malformed query or invalid parameters
  • 429: Rate limit exceeded
  • 404: Gene ID not found
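Retry rate-limited (429) and other transient failures with exponential backoff. A 400 or 404 indicates a problem with the request itself, so correct the query or ID rather than retrying.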
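A minimal sketch of exponential backoff is shown below. It assumes the request function raises `urllib.error.HTTPError` on failure. Treating 5xx codes as transient, the starting delay of 0.5 seconds, and the small jitter are illustrative choices, not documented NCBI requirements. `call_with_backoff` is a hypothetical helper.

```python
import random
import time
from urllib.error import HTTPError

# Assumption: 429 plus common 5xx codes are treated as transient.
RETRYABLE_CODES = {429, 500, 502, 503, 504}


def call_with_backoff(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying transient HTTP errors with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except HTTPError as err:
            out_of_attempts = attempt == max_attempts - 1
            if err.code not in RETRYABLE_CODES or out_of_attempts:
                raise
            delay = base_delay * (2 ** attempt)
            # A little random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


# Demo with a stand-in request that is rate limited twice, then succeeds.
attempts = []
recorded_delays = []


def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise HTTPError("https://example.invalid", 429, "Too Many Requests", None, None)
    return "ok"


result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
```

The demo records the delays instead of sleeping. In a real script the default `time.sleep` would be used, and errors such as 400 or 404 would propagate on the first attempt.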

Script Usage

query_gene.py

Query NCBI Gene using E-utilities (ESearch, ESummary, EFetch).
bash
python scripts/query_gene.py --search "BRCA1" --organism "human"
python scripts/query_gene.py --id 672 --format json
python scripts/query_gene.py --search "insulin[gene] AND diabetes[disease]"

fetch_gene_data.py

Fetch comprehensive gene data using NCBI Datasets API.
bash
python scripts/fetch_gene_data.py --gene-id 672
python scripts/fetch_gene_data.py --symbol BRCA1 --taxon human
python scripts/fetch_gene_data.py --symbol TP53 --taxon "Homo sapiens" --output json

batch_gene_lookup.py

Process multiple gene queries efficiently.
bash
python scripts/batch_gene_lookup.py --file gene_list.txt --organism human
python scripts/batch_gene_lookup.py --ids 672,7157,5594 --output results.json

API References

For detailed API documentation including endpoints, parameters, response formats, and examples, refer to:
  • references/api_reference.md - Comprehensive API documentation for E-utilities and Datasets API
  • references/common_workflows.md - Additional examples and use case patterns
Consult these references when you need specific API endpoint details, parameter options, or response structure information.

Data Formats

NCBI Gene data can be retrieved in multiple formats:
  • JSON: Structured data ideal for programmatic processing
  • XML: Detailed hierarchical format with full metadata
  • GenBank: Sequence data with annotations
  • FASTA: Sequence data only
  • Text: Human-readable summaries
Choose JSON for modern applications, XML for legacy systems requiring detailed metadata, and FASTA for sequence analysis workflows.
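For sequence analysis workflows, FASTA text is simple enough to parse with the standard library. The sketch below is a generic FASTA reader with a made-up sample record. `parse_fasta` is a hypothetical helper, and established libraries exist for production use.

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; sequence lines that follow belong to it.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up record for illustration only.
sample = ">example_record hypothetical header\nATGGCC\nTTAA\n"
records = parse_fasta(sample)
```

Sequence lines are concatenated, so a record wrapped across several lines comes back as a single string.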

Best Practices

  1. Always specify organism when searching by gene symbol to avoid ambiguity
  2. Use Gene IDs for precise lookups when available
  3. Batch requests when working with multiple genes to minimize API calls
  4. Cache results locally to reduce redundant queries
  5. Include an API key in scripts for higher rate limits
  6. Handle errors gracefully with retry logic for transient failures
  7. Validate gene symbols before batch processing to catch typos
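Local caching (practice 4) can be as simple as a JSON file keyed by Gene ID. The sketch below is one possible approach under that assumption. `GeneCache` and `fake_fetch` are hypothetical names, and the stand-in fetch function takes the place of a real API call.

```python
import json
import tempfile
from pathlib import Path


class GeneCache:
    """Store fetched gene records in a JSON file keyed by Gene ID."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get_or_fetch(self, gene_id, fetch):
        key = str(gene_id)  # JSON object keys are always strings
        if key not in self.data:
            self.data[key] = fetch(gene_id)
            self.path.write_text(json.dumps(self.data, indent=2))
        return self.data[key]


# Demo with a stand-in fetch function that records how often it is called.
calls = []


def fake_fetch(gene_id):
    calls.append(gene_id)
    return {"gene_id": gene_id}


with tempfile.TemporaryDirectory() as tmp:
    cache = GeneCache(Path(tmp) / "gene_cache.json")
    first = cache.get_or_fetch(672, fake_fetch)
    second = cache.get_or_fetch(672, fake_fetch)  # served from the cache
    reloaded = GeneCache(Path(tmp) / "gene_cache.json").get_or_fetch(672, fake_fetch)
```

Only the first lookup reaches the fetch function. Later lookups, including those from a new `GeneCache` pointed at the same file, are answered from disk.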

Resources

This skill includes:

scripts/

  • query_gene.py - Query genes using E-utilities (ESearch, ESummary, EFetch)
  • fetch_gene_data.py - Fetch gene data using NCBI Datasets API
  • batch_gene_lookup.py - Handle multiple gene queries efficiently

references/

  • api_reference.md - Detailed API documentation for both E-utilities and Datasets API
  • common_workflows.md - Examples of common gene queries and use cases