UniProt Database


UniProt serves as the authoritative resource for protein sequence data and functional annotations. This skill enables programmatic access to search proteins by various criteria, retrieve FASTA sequences, translate identifiers between biological databases, and query both manually curated (Swiss-Prot) and computationally predicted (TrEMBL) protein records.

Use Cases


  • Retrieve protein sequences in FASTA format for downstream analysis
  • Query proteins by name, gene symbol, organism, or functional terms
  • Convert identifiers between UniProt, Ensembl, RefSeq, PDB, and 100+ databases
  • Access functional annotations including GO terms, domains, and pathways
  • Download curated datasets for machine learning or comparative studies
  • Build protein datasets filtered by organism, size, or annotation quality

Installation


No package installation is required; UniProt provides a REST API accessed via HTTP requests:

```python
import requests

# Test connectivity
resp = requests.get("https://rest.uniprot.org/uniprotkb/P53_HUMAN.json")
print(resp.json()["primaryAccession"])  # P04637
```

Searching the Database


Basic Text Search


Find proteins by keywords, names, or descriptions:
```python
import requests

endpoint = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "hemoglobin AND organism_id:9606 AND reviewed:true",
    "format": "json",
    "size": 10
}

resp = requests.get(endpoint, params=params)
results = resp.json()

for entry in results["results"]:
    acc = entry["primaryAccession"]
    name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
    print(f"{acc}: {name}")
```

Query Syntax


UniProt uses a powerful query language with field prefixes and boolean operators:

Boolean combinations

```
hemoglobin AND organism_id:9606
(kinase OR phosphatase) AND reviewed:true
receptor NOT bacteria
```

Field-specific queries

```
gene:TP53
accession:P00533
organism_name:"Homo sapiens"
```

Numeric ranges

```
length:[100 TO 500]
mass:[20000 TO 50000]
```

Wildcards

```
gene:IL*
protein_name:transport*
```

Existence checks

```
cc_function:*   # has function annotation
xref:pdb        # has PDB structure
ft_signal:*     # has signal peptide
```

Common Filters


| Filter | Description |
| --- | --- |
| `reviewed:true` | Swiss-Prot entries only (manually curated) |
| `organism_id:9606` | Human proteins (NCBI taxonomy ID) |
| `organism_id:10090` | Mouse proteins |
| `length:[100 TO 500]` | Sequence length range |
| `xref:pdb` | Has experimental structure |
| `cc_disease:*` | Has disease association |
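These filters compose with boolean operators into a single query string. A minimal sketch (the `build_query` helper is illustrative, not part of any UniProt client library):

```python
def build_query(*clauses):
    """Join individual filter clauses with AND (illustrative helper)."""
    return " AND ".join(clauses)

# Reviewed human kinase-related entries with a PDB structure,
# restricted to a sequence-length window.
query = build_query(
    "kinase",
    "organism_id:9606",
    "reviewed:true",
    "length:[100 TO 500]",
    "xref:pdb",
)
print(query)
```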

Fetching Individual Entries


Access specific proteins using their accession numbers:
```python
import requests

acc = "P53_HUMAN"  # or "P04637"
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
resp = requests.get(url)
print(resp.text)
```

Supported Formats


| Format | Extension | Use Case |
| --- | --- | --- |
| FASTA | `.fasta` | Sequence analysis, alignments |
| JSON | `.json` | Parsing in code |
| TSV | `.tsv` | Spreadsheets, data frames |
| XML | `.xml` | Structured data exchange |
| TXT | `.txt` | Human-readable flat file |
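Because the format is selected purely by the extension on the entry URL, switching formats is a one-string change. A small sketch (`entry_url` is an illustrative helper, not an official client function):

```python
def entry_url(accession, fmt="json"):
    """Build a UniProtKB entry URL for the given output format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"

for fmt in ("fasta", "json", "tsv", "xml", "txt"):
    print(entry_url("P04637", fmt))
```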

Custom Fields (TSV)


Request only the fields you need to minimize bandwidth:
```python
import requests

params = {
    "query": "gene:TP53 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,gene_names,organism_name,length,sequence,cc_function"
}

resp = requests.get("https://rest.uniprot.org/uniprotkb/search", params=params)
print(resp.text)
```

Common field sets:

Minimal identification

```
accession,id,protein_name,gene_names,organism_name
```

Sequence analysis

```
accession,sequence,length,mass,xref_pdb,xref_alphafolddb
```

Functional profiling

```
accession,protein_name,cc_function,cc_catalytic_activity,go,cc_pathway
```

Clinical applications

```
accession,gene_names,cc_disease,xref_omim,ft_variant
```

See `references/api-reference.md` for the complete field catalog.

Identifier Mapping


Translate identifiers between database systems:
```python
import requests
import time

def map_identifiers(ids, source_db, target_db):
    """Map identifiers from one database to another."""
    # Submit mapping job
    submit_resp = requests.post(
        "https://rest.uniprot.org/idmapping/run",
        data={
            "from": source_db,
            "to": target_db,
            "ids": ",".join(ids)
        }
    )
    job_id = submit_resp.json()["jobId"]

    # Poll until complete
    status_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"
    while True:
        status_resp = requests.get(status_url)
        status_data = status_resp.json()
        if "results" in status_data or "failedIds" in status_data:
            break
        time.sleep(2)

    # Fetch results
    results_resp = requests.get(
        f"https://rest.uniprot.org/idmapping/results/{job_id}"
    )
    return results_resp.json()
```

Examples

UniProt to PDB

```python
mapping = map_identifiers(["P04637", "P00533"], "UniProtKB_AC-ID", "PDB")
```

Gene symbols to UniProt

```python
mapping = map_identifiers(["TP53", "EGFR"], "Gene_Name", "UniProtKB")
```

UniProt to Ensembl

```python
mapping = map_identifiers(["P00533"], "UniProtKB_AC-ID", "Ensembl")
```

Common Database Pairs


| From | To | Use Case |
| --- | --- | --- |
| UniProtKB_AC-ID | PDB | Find structures |
| UniProtKB_AC-ID | Ensembl | Link to genomics |
| Gene_Name | UniProtKB | Gene symbol lookup |
| RefSeq_Protein | UniProtKB | NCBI to UniProt |
| UniProtKB_AC-ID | GO | Get GO annotations |
| UniProtKB_AC-ID | ChEMBL | Drug target info |

See `references/api-reference.md` for all 200+ supported databases.

Constraints:
  • Maximum 100,000 identifiers per request
  • Results persist for 7 days
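For ID lists larger than the 100,000-per-request cap, one approach is to split the input and submit one mapping job per chunk, merging the results afterwards. A sketch (the `chunked` helper and chunk size are illustrative; `map_identifiers` refers to the function defined in the section above):

```python
def chunked(ids, size=100_000):
    """Yield successive slices of ids, each at most `size` long."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# Usage sketch, assuming map_identifiers() from the section above:
# results = []
# for chunk in chunked(big_id_list):
#     results.extend(map_identifiers(chunk, "UniProtKB_AC-ID", "PDB")["results"])
```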

Streaming Large Datasets


For complete proteomes or large result sets, use streaming to bypass pagination:
```python
import requests

params = {
    "query": "organism_id:9606 AND reviewed:true",
    "format": "fasta"
}

resp = requests.get(
    "https://rest.uniprot.org/uniprotkb/stream",
    params=params,
    stream=True
)

with open("human_proteome.fasta", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
```

Batch Operations


Rate-Limited Client


Respect server resources when processing many requests:
```python
import requests
import time

class UniProtClient:
    BASE = "https://rest.uniprot.org"

    def __init__(self, delay=0.5):
        self.delay = delay
        self.last_call = 0

    def _throttle(self):
        elapsed = time.time() - self.last_call
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_call = time.time()

    def get_proteins(self, accessions, batch_size=100):
        """Fetch metadata for multiple accessions."""
        results = []

        for i in range(0, len(accessions), batch_size):
            batch = accessions[i:i+batch_size]
            query = " OR ".join(f"accession:{a}" for a in batch)

            self._throttle()
            resp = requests.get(
                f"{self.BASE}/uniprotkb/search",
                params={"query": query, "format": "json", "size": batch_size}
            )

            if resp.ok:
                results.extend(resp.json().get("results", []))

        return results
```

Usage

```python
client = UniProtClient(delay=0.3)
proteins = client.get_proteins(["P04637", "P00533", "Q07817", "P38398"])
```

Paginated Retrieval


For queries with many results:
```python
import requests

def fetch_all(query, fields=None, max_results=None):
    """Retrieve all results with automatic pagination."""
    url = "https://rest.uniprot.org/uniprotkb/search"
    collected = []

    params = {
        "query": query,
        "format": "json",
        "size": 500
    }
    if fields:
        params["fields"] = ",".join(fields)

    while url:
        resp = requests.get(url, params=params)
        data = resp.json()
        collected.extend(data["results"])

        if max_results and len(collected) >= max_results:
            return collected[:max_results]

        url = resp.links.get("next", {}).get("url")
        params = None  # The "next" URL already contains all parameters

    return collected
```

Example: all human phosphatases

```python
entries = fetch_all(
    "protein_name:phosphatase AND organism_id:9606 AND reviewed:true",
    fields=["accession", "gene_names", "protein_name"]
)
```

Working with Results


Parse JSON Response


```python
import requests

resp = requests.get(
    "https://rest.uniprot.org/uniprotkb/search",
    params={
        "query": "gene:BRCA1 AND reviewed:true",
        "format": "json",
        "size": 1
    }
)

entry = resp.json()["results"][0]
```

Extract common fields

```python
accession = entry["primaryAccession"]
gene_name = entry["genes"][0]["geneName"]["value"]
organism = entry["organism"]["scientificName"]
sequence = entry["sequence"]["value"]
length = entry["sequence"]["length"]
```

Function annotation

```python
if "comments" in entry:
    for comment in entry["comments"]:
        if comment["commentType"] == "FUNCTION":
            print(f"Function: {comment['texts'][0]['value']}")
```

Build a Protein Dataset


```python
import requests

def build_dataset(query, output_path, fields):
    """Export search results to a TSV file."""
    resp = requests.get(
        "https://rest.uniprot.org/uniprotkb/stream",
        params={
            "query": query,
            "format": "tsv",
            "fields": ",".join(fields)
        }
    )

    with open(output_path, "w") as f:
        f.write(resp.text)
```

Create dataset of human kinases

```python
build_dataset(
    query="family:kinase AND organism_id:9606 AND reviewed:true",
    output_path="human_kinases.tsv",
    fields=["accession", "gene_names", "protein_name", "length", "sequence"]
)
```

Key Terminology


Swiss-Prot vs TrEMBL: Swiss-Prot entries (`reviewed:true`) are manually curated by experts. TrEMBL entries (`reviewed:false`) are computationally predicted. Always prefer Swiss-Prot for high-confidence data.

Accession Number: Stable identifier for a protein entry (e.g., P04637). Entry names like "P53_HUMAN" may change.

Entity Types: UniProt covers UniProtKB (proteins), UniRef (clustered sequences), UniParc (archive), and Proteomes (complete sets).

Annotation Score: Quality indicator from 1 (basic) to 5 (comprehensive). Higher scores indicate more complete annotations.
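The annotation score is itself a searchable field, so a query can require the most completely annotated entries up front. A sketch (the `annotation_score` field follows UniProt's query syntax; the helper name is illustrative):

```python
def with_min_annotation(query, score=5):
    """Append an annotation_score clause (1 = basic, 5 = comprehensive)."""
    return f"({query}) AND annotation_score:{score}"

print(with_min_annotation("gene:TP53 AND reviewed:true"))
```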

Best Practices


| Recommendation | Rationale |
| --- | --- |
| Add `reviewed:true` to queries | Swiss-Prot entries are manually curated |
| Request minimal fields | Reduces transfer size and response time |
| Use streaming for large sets | Avoids pagination complexity |
| Implement rate limiting | Respects server resources (0.3-0.5 s delay) |
| Cache repeated queries | Minimizes redundant API calls |
| Handle errors gracefully | Network issues, rate limits, missing entries |
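Graceful error handling can be as simple as retrying transient failures with exponential backoff. A sketch (the retry count, delays, and which status codes count as transient are arbitrary choices here, not UniProt requirements):

```python
import time
import requests

def get_with_retry(url, params=None, retries=3, backoff=1.0):
    """GET with exponential backoff on transient network/HTTP errors."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code in (429, 500, 502, 503):  # treat as transient
                raise requests.HTTPError(f"HTTP {resp.status_code}")
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
```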

References


See `references/api-reference.md` for:
  • Complete field listing for query customization
  • All searchable attributes and operators
  • Database pairs for identifier translation
  • Working code examples in curl, R, and JavaScript
  • Rate limiting and error handling strategies

External Documentation
