uniprot-database

UniProt Database

Overview

UniProt is the world's leading comprehensive resource for protein sequence and functional information. Use it to search proteins by name, gene, or accession; retrieve sequences in FASTA format; map IDs across databases; and access Swiss-Prot/TrEMBL annotations through the REST API for protein analysis.

When to Use This Skill

This skill should be used when:
  • Searching for protein entries by name, gene symbol, accession, or organism
  • Retrieving protein sequences in FASTA or other formats
  • Mapping identifiers between UniProt and external databases (Ensembl, RefSeq, PDB, etc.)
  • Accessing protein annotations including GO terms, domains, and functional descriptions
  • Batch retrieving multiple protein entries efficiently
  • Querying reviewed (Swiss-Prot) vs. unreviewed (TrEMBL) protein data
  • Streaming large protein datasets
  • Building custom queries with field-specific search syntax

Core Capabilities

1. Searching for Proteins

Search UniProt using natural language queries or structured search syntax.
Common search patterns (Python):

# Search by protein name
query = 'insulin AND organism_name:"Homo sapiens"'

# Search by gene name
query = "gene:BRCA1 AND reviewed:true"

# Search by accession
query = "accession:P12345"

# Search by sequence length
query = "length:[100 TO 500]"

# Search by taxonomy
query = "taxonomy_id:9606"  # Human proteins

# Search by GO term
query = "go:0005515"  # Protein binding

Use the API search endpoint:
https://rest.uniprot.org/uniprotkb/search?query={query}&format={format}
Supported formats: JSON, TSV, Excel, XML, FASTA, RDF, TXT
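The query strings above contain spaces, colons, and quotes, so they need percent-encoding before they go into the search URL. The following is a minimal sketch using only the Python standard library. The function names and the `size` parameter (page size) are assumptions for illustration and are not taken from the helper script.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="json", size=25):
    """Return a search URL with the query string percent-encoded."""
    # urlencode escapes spaces, quotes, colons, and brackets in the query.
    # `size` (results per page) is an assumption; check the API documentation.
    params = {"query": query, "format": fmt, "size": size}
    return f"{SEARCH_URL}?{urlencode(params)}"


def search(query, fmt="json", size=25, timeout=30):
    """Fetch one page of results as text (performs a network request)."""
    with urlopen(build_search_url(query, fmt, size), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the URL does not touch the network.
url = build_search_url("gene:BRCA1 AND reviewed:true", fmt="tsv", size=5)
print(url)
```

Keeping URL construction separate from the request makes it easy to log or inspect the exact URL before sending it.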

2. Retrieving Individual Protein Entries

Retrieve specific protein entries by accession number.
Accession number formats:
  • Classic: P12345, Q1AAA9, O15530 (6 characters: letter + 5 alphanumeric)
  • Extended: A0A022YWF9 (10 characters for newer entries)
Retrieve endpoint:
https://rest.uniprot.org/uniprotkb/{accession}.{format}
Example:
https://rest.uniprot.org/uniprotkb/P12345.fasta
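Checking an accession before building the URL can catch typos early. The sketch below assumes the accession pattern given in UniProt's help pages; verify it against the current documentation before relying on it. The function names are hypothetical.

```python
import re

# Pattern for 6- and 10-character accessions (assumed from UniProt's help pages).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(accession):
    """Return True if the whole string looks like a UniProt accession."""
    return ACCESSION_RE.fullmatch(accession) is not None


def entry_url(accession, fmt="fasta"):
    """Build the retrieve URL for one entry, rejecting malformed accessions."""
    if not is_valid_accession(accession):
        raise ValueError(f"Not a UniProt accession: {accession!r}")
    return f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"


print(entry_url("P12345"))
```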

3. Batch Retrieval and ID Mapping

Map protein identifiers between different database systems and retrieve multiple entries efficiently.
ID Mapping workflow:
  1. Submit mapping job to: https://rest.uniprot.org/idmapping/run
  2. Check job status: https://rest.uniprot.org/idmapping/status/{jobId}
  3. Retrieve results: https://rest.uniprot.org/idmapping/results/{jobId}
Supported databases for mapping:
  • UniProtKB AC/ID
  • Gene names
  • Ensembl, RefSeq, EMBL
  • PDB, AlphaFoldDB
  • KEGG, GO terms
  • And many more (see /references/id_mapping_databases.md)
Limitations:
  • Maximum 100,000 IDs per job
  • Results stored for 7 days
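The submit, poll, and retrieve steps above can be sketched with the standard library as follows. Several details are assumptions rather than facts from this document: the form field names (`from`, `to`, `ids`), the `jobId` and `jobStatus` response keys, the status values, and the database identifiers passed in. Check them against the ID mapping documentation and /references/id_mapping_databases.md.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://rest.uniprot.org/idmapping"
MAX_IDS = 100_000  # per-job limit stated above


def encode_job(ids, from_db, to_db):
    """Form-encode the job parameters for the POST to /run."""
    if len(ids) > MAX_IDS:
        raise ValueError(f"Too many IDs for one job: {len(ids)}")
    # Field names are assumed; the IDs are sent as one comma-separated value.
    fields = {"from": from_db, "to": to_db, "ids": ",".join(ids)}
    return urlencode(fields).encode("ascii")


def job_is_done(status):
    """Interpret a decoded status response (keys and values are assumed)."""
    state = status.get("jobStatus")
    if state in ("NEW", "RUNNING"):
        return False
    # A missing jobStatus is treated as "redirected to finished results".
    if state is None or state == "FINISHED":
        return True
    raise RuntimeError(f"ID mapping job failed: {state}")


def _get_json(url, timeout=30):
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def map_ids(ids, from_db, to_db, poll_seconds=3):
    """Submit a job, poll until it finishes, return the decoded results."""
    request = Request(f"{API}/run", data=encode_job(ids, from_db, to_db))
    with urlopen(request, timeout=30) as resp:  # data present, so this is a POST
        job_id = json.load(resp)["jobId"]
    while not job_is_done(_get_json(f"{API}/status/{job_id}")):
        time.sleep(poll_seconds)
    return _get_json(f"{API}/results/{job_id}")
```

Only `map_ids` touches the network; `encode_job` and `job_is_done` are pure, so they can be checked offline.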

4. Streaming Large Result Sets

For large queries that exceed pagination limits, use the stream endpoint:
https://rest.uniprot.org/uniprotkb/stream?query={query}&format={format}
The stream endpoint returns all results without pagination, suitable for downloading complete datasets.
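Because a streamed download can be large, it helps to process records as they arrive instead of holding the whole response in memory. This is a minimal sketch, assuming FASTA output and standard-library HTTP; `iter_fasta` and `stream_fasta` are hypothetical names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def stream_fasta(query, timeout=60):
    """Stream all FASTA records for a query (performs a network request)."""
    url = f"{STREAM_URL}?{urlencode({'query': query, 'format': 'fasta'})}"
    with urlopen(url, timeout=timeout) as resp:
        # Decode line by line so the full download never sits in memory.
        yield from iter_fasta(line.decode("utf-8") for line in resp)


# Offline check with made-up records (not real UniProt data).
sample = [">demo1 first\n", "MKT\n", "AAL\n", "\n", ">demo2 second\n", "GG\n"]
print(list(iter_fasta(sample)))
```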

5. Customizing Retrieved Fields

Specify exactly which fields to retrieve for efficient data transfer.
Common fields:
  • accession - UniProt accession number
  • id - Entry name
  • gene_names - Gene name(s)
  • organism_name - Organism
  • protein_name - Protein names
  • sequence - Amino acid sequence
  • length - Sequence length
  • go_* - Gene Ontology annotations
  • cc_* - Comment fields (function, interaction, etc.)
  • ft_* - Feature annotations (domains, sites, etc.)
Example:
https://rest.uniprot.org/uniprotkb/search?query=insulin&fields=accession,gene_names,organism_name,length,sequence&format=tsv
See /references/api_fields.md for the complete field list.
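A field-restricted TSV request pairs naturally with Python's csv module. The sketch below builds the example URL from a list of field names and parses a TSV body into dictionaries. The function names are hypothetical, and the column headers in the sample are placeholders, since the headers UniProt returns may differ from the field names requested.

```python
import csv
import io
from urllib.parse import urlencode

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_fields_url(query, fields):
    """Build a TSV search URL that requests only the listed fields."""
    params = {"query": query, "fields": ",".join(fields), "format": "tsv"}
    # safe="," keeps the commas in the field list readable; the rest is escaped.
    return f"{SEARCH_URL}?{urlencode(params, safe=',')}"


def parse_tsv(text):
    """Turn a TSV response body into a list of dicts keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


url = build_fields_url(
    "insulin", ["accession", "gene_names", "organism_name", "length", "sequence"]
)
print(url)

# Made-up TSV body for an offline check (not real UniProt data).
rows = parse_tsv("Entry\tLength\nX00001\t110\n")
print(rows)
```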

Python Implementation

For programmatic access, use the provided helper script scripts/uniprot_client.py, which implements:
  • search_proteins(query, format) - Search UniProt with any query
  • get_protein(accession, format) - Retrieve single protein entry
  • map_ids(ids, from_db, to_db) - Map between identifier types
  • batch_retrieve(accessions, format) - Retrieve multiple entries
  • stream_results(query, format) - Stream large result sets
Alternative Python packages:
  • Unipressed: Modern, typed Python client for UniProt REST API
  • bioservices: Comprehensive bioinformatics web services client
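The internals of scripts/uniprot_client.py are not shown here, so the following is only one hypothetical way a batch retrieval could be prepared: split the accession list into chunks and turn each chunk into an OR query using the documented `accession:` field. The chunk size and both function names are assumptions.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def accession_query(accessions):
    """Join accessions into a single OR query on the accession field."""
    return " OR ".join(f"accession:{acc}" for acc in accessions)


accessions = ["P12345", "Q1AAA9", "O15530"]
# One query string per chunk; each can be sent to the search or stream endpoint.
queries = [accession_query(chunk) for chunk in chunked(accessions, 2)]
print(queries)
```

Chunking keeps each query string short enough to fit comfortably in a URL.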

Query Syntax Examples

Boolean operators:
kinase AND organism_name:human
(diabetes OR insulin) AND reviewed:true
cancer NOT lung
Field-specific searches:
gene:BRCA1
accession:P12345
organism_id:9606
taxonomy_name:"Homo sapiens"
annotation:(type:signal)
Range queries:
length:[100 TO 500]
mass:[50000 TO 100000]
Wildcards:
gene:BRCA*
protein_name:kinase*
See /references/query_syntax.md for comprehensive syntax documentation.
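Queries like these can be assembled from small helpers instead of hand-written strings, which avoids quoting mistakes. This is a minimal sketch with hypothetical helper names; it assumes values contain no double quotes and covers only the field, range, AND, and OR forms shown above.

```python
def field(name, value):
    """Return 'name:value', quoting values that contain whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = f'"{text}"'
    return f"{name}:{text}"


def between(name, low, high):
    """Return an inclusive range clause such as length:[100 TO 500]."""
    return f"{name}:[{low} TO {high}]"


def any_of(*clauses):
    """Combine clauses with OR."""
    return " OR ".join(clauses)


def all_of(*clauses):
    """Combine clauses with AND, parenthesizing any OR groups."""
    return " AND ".join(f"({c})" if " OR " in c else c for c in clauses)


query = all_of(any_of("diabetes", "insulin"), field("reviewed", "true"))
print(query)
```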

Best Practices

  1. Use reviewed entries when possible: Filter with reviewed:true for Swiss-Prot (manually curated) entries
  2. Specify format explicitly: Choose the most appropriate format (FASTA for sequences, TSV for tabular data, JSON for programmatic parsing)
  3. Use field selection: Only request fields you need to reduce bandwidth and processing time
  4. Handle pagination: For large result sets, implement proper pagination or use the stream endpoint
  5. Cache results: Store frequently accessed data locally to minimize API calls
  6. Rate limiting: Be respectful of API resources; implement delays for large batch operations
  7. Check data quality: TrEMBL entries are automatically annotated and unreviewed; Swiss-Prot entries are manually reviewed
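Practices 4 and 6 can be combined in one loop that follows the next-page link and pauses between requests. The sketch assumes the search endpoint advertises the next page in a `Link` response header with `rel="next"`; confirm this in the API documentation. The function names and the one-second delay are illustrative choices.

```python
import re
import time
from urllib.request import urlopen

# Matches the URL inside <...> that is marked rel="next" (assumed header format).
NEXT_LINK_RE = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Extract the next-page URL from a Link header, or None if absent."""
    if not link_header:
        return None
    match = NEXT_LINK_RE.search(link_header)
    return match.group(1) if match else None


def iter_pages(first_url, delay_seconds=1.0, timeout=30):
    """Yield each page body in turn (performs network requests)."""
    url = first_url
    while url:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            url = next_page_url(resp.headers.get("Link"))
        yield body
        if url:
            time.sleep(delay_seconds)  # pause before requesting the next page


# Offline check with a made-up cursor value.
header = '<https://rest.uniprot.org/uniprotkb/search?cursor=abc&size=500>; rel="next"'
print(next_page_url(header))
```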

Resources

scripts/

uniprot_client.py - Python client with helper functions for common UniProt operations including search, retrieval, ID mapping, and streaming.

references/

  • api_fields.md - Complete list of available fields for customizing queries
  • id_mapping_databases.md - Supported databases for ID mapping operations
  • query_syntax.md - Comprehensive query syntax with advanced examples
  • api_examples.md - Code examples in multiple languages (Python, curl, R)

Additional Resources
