uniprot-database

UniProt Database

UniProt is the authoritative resource for protein sequence data and functional annotations. This skill enables programmatic access: search proteins by various criteria, retrieve FASTA sequences, translate identifiers between biological databases, and query both manually curated (Swiss-Prot) and automatically annotated, unreviewed (TrEMBL) protein records.

Use Cases

Retrieve protein sequences in FASTA format for downstream analysis
Query proteins by name, gene symbol, organism, or functional terms
Convert identifiers between UniProt, Ensembl, RefSeq, PDB, and 100+ databases
Access functional annotations including GO terms, domains, and pathways
Download curated datasets for machine learning or comparative studies
Build protein datasets filtered by organism, size, or annotation quality

Installation

No UniProt-specific package is required. UniProt provides a REST API accessed via HTTP requests; the examples here use the Python `requests` library:

python

import requests

# Test connectivity
resp = requests.get("https://rest.uniprot.org/uniprotkb/P04637.json")
print(resp.json()["primaryAccession"])  # P04637

Searching the Database

Basic Text Search

Find proteins by keywords, names, or descriptions:

python

import requests

endpoint = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "hemoglobin AND organism_id:9606 AND reviewed:true",
    "format": "json",
    "size": 10
}

resp = requests.get(endpoint, params=params)
results = resp.json()

for entry in results["results"]:
    acc = entry["primaryAccession"]
    name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
    print(f"{acc}: {name}")

Query Syntax

UniProt uses a powerful query language with field prefixes and boolean operators:

# Boolean combinations
hemoglobin AND organism_id:9606
(kinase OR phosphatase) AND reviewed:true
receptor NOT bacteria

# Field-specific queries
gene:TP53
accession:P00533
organism_name:"Homo sapiens"

# Numeric ranges
length:[100 TO 500]
mass:[20000 TO 50000]

# Wildcards
gene:IL*
protein_name:transport*

# Existence checks
cc_function:*   # has function annotation
xref:pdb        # has PDB structure
ft_signal:*     # has signal peptide

Common Filters

Filter	Description
`reviewed:true`	Swiss-Prot entries only (manually curated)
`organism_id:9606`	Human proteins (NCBI taxonomy ID)
`organism_id:10090`	Mouse proteins
`length:[100 TO 500]`	Sequence length range
`xref:pdb`	Has experimental structure
`cc_disease:*`	Has disease association
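
The filters above are combined with `AND` in most of the examples in this document. A minimal sketch of a helper that assembles such a query string is shown here; `build_query` is a hypothetical name for illustration, not part of any UniProt client, and it only covers the simple cases (booleans, numeric ranges, quoted multi-word values).

```python
def build_query(*terms, **filters):
    """Join free-text terms and field:value filters with AND.

    Hypothetical helper: it only formats strings and does not
    validate field names against UniProt.
    """
    parts = [term for term in terms if term]
    for field, value in filters.items():
        if isinstance(value, bool):
            value = str(value).lower()        # True -> "true"
        elif isinstance(value, tuple):
            low, high = value
            value = f"[{low} TO {high}]"      # numeric range syntax
        elif isinstance(value, str) and " " in value:
            value = f'"{value}"'              # quote multi-word values
        parts.append(f"{field}:{value}")
    return " AND ".join(parts)


query = build_query("hemoglobin", organism_id=9606, reviewed=True, length=(100, 500))
print(query)
# hemoglobin AND organism_id:9606 AND reviewed:true AND length:[100 TO 500]
```

Terms that contain `OR` should be passed already parenthesized, for example `"(kinase OR phosphatase)"`, so that the surrounding `AND` binds as intended.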

Fetching Individual Entries

Access specific proteins using their accession numbers:

python

import requests

acc = "P04637"  # human p53; prefer the stable accession over the entry name "P53_HUMAN"
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
resp = requests.get(url)
print(resp.text)
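
The FASTA text returned above usually needs to be split into a header and a sequence before further analysis. The following is a minimal sketch that works offline on a sample string; `parse_fasta` and `split_header` are hypothetical helper names, and the sample record is abbreviated and truncated for illustration (it is not a full p53 sequence).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'db|accession|entry_name description' into its parts."""
    ids, _, description = header.partition(" ")
    db, accession, entry_name = ids.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Abbreviated header and truncated sequence, for illustration only
sample = """>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53
MEEPQSDPSV
EPPLSQETFS
"""

records = parse_fasta(sample)
info = split_header(records[0][0])
print(info["accession"], info["entry_name"], len(records[0][1]))
```

In practice `sample` would be replaced by `resp.text` from the request above. For larger pipelines an established parser such as Biopython's `SeqIO` may be a better fit than this hand-rolled version.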

Supported Formats

Format	Extension	Use Case
FASTA	`.fasta`	Sequence analysis, alignments
JSON	`.json`	Parsing in code
TSV	`.tsv`	Spreadsheets, data frames
XML	`.xml`	Structured data exchange
TXT	`.txt`	Human-readable flat file
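
Since the format is selected purely by the URL extension, a small helper can build entry URLs and reject formats that are not in the table. This is a sketch only; `entry_url` and `FORMATS` are hypothetical names, and the set of formats is simply copied from the table above.

```python
# Extensions taken from the table above
FORMATS = {"fasta", "json", "tsv", "xml", "txt"}


def entry_url(accession, fmt="json"):
    """Build the URL for a single UniProtKB entry in the given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"


print(entry_url("P04637", "fasta"))
# https://rest.uniprot.org/uniprotkb/P04637.fasta
```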

Custom Fields (TSV)

Request only the fields you need to minimize bandwidth:

python

import requests

params = {
    "query": "gene:TP53 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,gene_names,organism_name,length,sequence,cc_function"
}

resp = requests.get("https://rest.uniprot.org/uniprotkb/search", params=params)
print(resp.text)

Common field sets:

# Minimal identification
accession,id,protein_name,gene_names,organism_name

# Sequence analysis
accession,sequence,length,mass,xref_pdb,xref_alphafolddb

# Functional profiling
accession,protein_name,cc_function,cc_catalytic_activity,go,cc_pathway

# Clinical applications
accession,gene_names,cc_disease,xref_omim,ft_variant

See `references/api-reference.md` for the complete field catalog.
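
A TSV response is plain text, so it can be turned into rows with the standard library. The sketch below parses a small sample string rather than calling the API; `parse_tsv` is a hypothetical helper, and the column labels in the sample are an assumption about how UniProt names the `accession`, `gene_names`, and `length` columns, so check the header line of a real response before relying on them.

```python
import csv
import io


def parse_tsv(text):
    """Turn a TSV response body into a list of dicts keyed by column header."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # annotation text may contain quote characters
    )
    return list(reader)


# Illustrative sample; column labels are assumed, not fetched from the API
sample = "Entry\tGene Names\tLength\nP04637\tTP53 P53\t393\n"

rows = parse_tsv(sample)
lengths = {row["Entry"]: int(row["Length"]) for row in rows}
print(lengths)
# {'P04637': 393}
```

With a live request, `sample` would be `resp.text` from the TSV query above.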

Identifier Mapping

Translate identifiers between database systems:

python

import requests
import time

def map_identifiers(ids, source_db, target_db):
    """Map identifiers from one database to another."""
    # Submit mapping job
    submit_resp = requests.post(
        "https://rest.uniprot.org/idmapping/run",
        data={
            "from": source_db,
            "to": target_db,
            "ids": ",".join(ids)
        }
    )
    job_id = submit_resp.json()["jobId"]

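    # Poll until complete, but give up after 10 minutes rather than loop forever
    status_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"
    deadline = time.time() + 600
    while True:
        status_resp = requests.get(status_url)
        status_data = status_resp.json()
        if "results" in status_data or "failedIds" in status_data:
            break
        if time.time() > deadline:
            raise TimeoutError(f"ID mapping job {job_id} did not finish in time")
        time.sleep(2)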

    # Fetch results
    results_resp = requests.get(
        f"https://rest.uniprot.org/idmapping/results/{job_id}"
    )
    return results_resp.json()

# Examples

# UniProt to PDB
mapping = map_identifiers(["P04637", "P00533"], "UniProtKB_AC-ID", "PDB")

# Gene symbols to UniProt
mapping = map_identifiers(["TP53", "EGFR"], "Gene_Name", "UniProtKB")

# UniProt to Ensembl
mapping = map_identifiers(["P00533"], "UniProtKB_AC-ID", "Ensembl")

Common Database Pairs

From	To	Use Case
`UniProtKB_AC-ID`	`PDB`	Find structures
`UniProtKB_AC-ID`	`Ensembl`	Link to genomics
`Gene_Name`	`UniProtKB`	Gene symbol lookup
`RefSeq_Protein`	`UniProtKB`	NCBI to UniProt
`UniProtKB_AC-ID`	`GO`	Get GO annotations
`UniProtKB_AC-ID`	`ChEMBL`	Drug target info

See `references/api-reference.md` for all 200+ supported databases.

Constraints:

Maximum 100,000 identifiers per request
Results persist for 7 days
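
To stay under the per-request limit and to make the returned payload easier to use, the ID list can be chunked before submission and the results grouped by source ID afterwards. This is a sketch under assumptions: `chunk_ids` and `index_mapping` are hypothetical helpers, and the payload shape (a `results` list of `from`/`to` pairs plus an optional `failedIds` list, with `to` being either a plain string or an object carrying `primaryAccession`) is assumed from the polling code above rather than confirmed here.

```python
from collections import defaultdict

MAX_IDS_PER_JOB = 100_000  # limit stated in the constraints above


def chunk_ids(ids, size=MAX_IDS_PER_JOB):
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def index_mapping(payload):
    """Group mapped targets by source ID and return (mapped, failed_ids)."""
    mapped = defaultdict(list)
    for row in payload.get("results", []):
        target = row["to"]
        if isinstance(target, dict):
            # Assumed shape when the target database is UniProtKB
            target = target.get("primaryAccession")
        mapped[row["from"]].append(target)
    return dict(mapped), payload.get("failedIds", [])


# Illustrative payload, written by hand and not fetched from the API
payload = {
    "results": [
        {"from": "P04637", "to": "1TUP"},
        {"from": "P04637", "to": "1TSR"},
    ],
    "failedIds": ["XXXXXX"],
}

mapped, failed = index_mapping(payload)
print(mapped, failed)
```

Each chunk from `chunk_ids` would be passed to `map_identifiers` as its own job, and the per-job results merged afterwards.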

Streaming Large Datasets

For complete proteomes or large result sets, use streaming to bypass pagination:

python

import requests

params = {
    "query": "organism_id:9606 AND reviewed:true",
    "format": "fasta"
}

resp = requests.get(
    "https://rest.uniprot.org/uniprotkb/stream",
    params=params,
    stream=True
)

with open("human_proteome.fasta", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)

Batch Operations

Rate-Limited Client

Respect server resources when processing many requests:

python

import requests
import time

class UniProtClient:
    BASE = "https://rest.uniprot.org"

    def __init__(self, delay=0.5):
        self.delay = delay
        self.last_call = 0

    def _throttle(self):
        elapsed = time.time() - self.last_call
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_call = time.time()

    def get_proteins(self, accessions, batch_size=100):
        """Fetch metadata for multiple accessions."""
        results = []

        for i in range(0, len(accessions), batch_size):
            batch = accessions[i:i+batch_size]
            query = " OR ".join(f"accession:{a}" for a in batch)

            self._throttle()
            resp = requests.get(
                f"{self.BASE}/uniprotkb/search",
                params={"query": query, "format": "json", "size": batch_size}
            )

            if resp.ok:
                results.extend(resp.json().get("results", []))

        return results

# Usage
client = UniProtClient(delay=0.3)
proteins = client.get_proteins(["P04637", "P00533", "Q07817", "P38398"])

Paginated Retrieval

For queries with many results:

python

import requests

def fetch_all(query, fields=None, max_results=None):
    """Retrieve all results with automatic pagination."""
    url = "https://rest.uniprot.org/uniprotkb/search"
    collected = []

    params = {
        "query": query,
        "format": "json",
        "size": 500
    }
    if fields:
        params["fields"] = ",".join(fields)

    while url:
        resp = requests.get(url, params=params)  # params is None after the first page
        data = resp.json()
        collected.extend(data["results"])

        if max_results and len(collected) >= max_results:
            return collected[:max_results]

        url = resp.links.get("next", {}).get("url")
        params = None  # Next URL contains all params

    return collected

# Example: all human phosphatases
entries = fetch_all(
    "protein_name:phosphatase AND organism_id:9606 AND reviewed:true",
    fields=["accession", "gene_names", "protein_name"]
)
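
The pagination loop in `fetch_all` depends on `resp.links`, which `requests` derives from the HTTP `Link` response header. To make that step concrete, here is a small standard-library sketch that extracts the `rel="next"` URL from such a header; `next_page_url` is a hypothetical helper and the cursor value in the sample header is made up.

```python
import re

# Matches one <url>; rel="next" item inside a Link header
_NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


# Hand-written header with a made-up cursor value
header = (
    '<https://rest.uniprot.org/uniprotkb/search?query=x&cursor=abc123&size=500>; '
    'rel="next"'
)

print(next_page_url(header))
print(next_page_url(None))  # None: no further pages
```

When `requests` is available, `resp.links.get("next", {}).get("url")` does the same job and is the simpler choice.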

Working with Results

Parse JSON Response

python

import requests

resp = requests.get(
    "https://rest.uniprot.org/uniprotkb/search",
    params={
        "query": "gene:BRCA1 AND reviewed:true",
        "format": "json",
        "size": 1
    }
)

entry = resp.json()["results"][0]

# Extract common fields
accession = entry["primaryAccession"]
gene_name = entry["genes"][0]["geneName"]["value"]
organism = entry["organism"]["scientificName"]
sequence = entry["sequence"]["value"]
length = entry["sequence"]["length"]

# Function annotation
if "comments" in entry:
    for comment in entry["comments"]:
        if comment["commentType"] == "FUNCTION":
            print(f"Function: {comment['texts'][0]['value']}")
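
Direct indexing as above raises `KeyError` or `IndexError` when an entry lacks a gene name, a recommended name, or comments, which is plausible for sparsely annotated records. A defensive variant might look like the sketch below. `summarize_entry` is a hypothetical helper, the `submissionNames` fallback is an assumption about how unreviewed entries are structured, and the sample dictionaries are hand-written rather than real API output.

```python
def summarize_entry(entry):
    """Extract common fields from an entry dict, tolerating missing keys."""
    description = entry.get("proteinDescription", {})
    name = description.get("recommendedName", {}).get("fullName", {}).get("value")
    if name is None:
        # Assumed fallback key for entries without a recommended name
        submitted = description.get("submissionNames") or [{}]
        name = submitted[0].get("fullName", {}).get("value")

    genes = entry.get("genes") or [{}]
    gene = genes[0].get("geneName", {}).get("value")

    # Collect every FUNCTION comment text rather than only the first
    functions = [
        text["value"]
        for comment in entry.get("comments", [])
        if comment.get("commentType") == "FUNCTION"
        for text in comment.get("texts", [])
    ]

    return {
        "accession": entry.get("primaryAccession"),
        "name": name,
        "gene": gene,
        "organism": entry.get("organism", {}).get("scientificName"),
        "length": entry.get("sequence", {}).get("length"),
        "functions": functions,
    }


# Hand-written sparse entry with a placeholder accession
sparse = {"primaryAccession": "EXAMPLE1", "sequence": {"length": 120}}
summary = summarize_entry(sparse)
print(summary["gene"], summary["length"], summary["functions"])
```

Missing values come back as `None` or an empty list, so downstream code can decide whether to skip or flag the record.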

Build a Protein Dataset

python

import requests

def build_dataset(query, output_path, fields):
    """Export search results to a TSV file."""
    resp = requests.get(
        "https://rest.uniprot.org/uniprotkb/stream",
        params={
            "query": query,
            "format": "tsv",
            "fields": ",".join(fields)
        }
    )

    with open(output_path, "w") as f:
        f.write(resp.text)

# Create dataset of human kinases
build_dataset(
    query="family:kinase AND organism_id:9606 AND reviewed:true",
    output_path="human_kinases.tsv",
    fields=["accession", "gene_names", "protein_name", "length", "sequence"]
)

Key Terminology

Swiss-Prot vs TrEMBL: Swiss-Prot entries (`reviewed:true`) are manually curated by experts. TrEMBL entries (`reviewed:false`) are annotated computationally and have not been manually reviewed. Prefer Swiss-Prot when you need high-confidence data.

Accession Number: Stable identifier for a protein entry (e.g., P04637). Entry names like "P53_HUMAN" may change.

Entity Types: UniProt covers UniProtKB (proteins), UniRef (clustered sequences), UniParc (archive), and Proteomes (complete sets).

Annotation Score: Quality indicator from 1 (basic) to 5 (comprehensive). Higher scores indicate more complete annotations.
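
Because accessions and entry names look different, a quick pattern check can tell which kind of identifier a script has been given before it builds a URL or query. The sketch below uses the accession pattern published in UniProt's help pages; the entry-name pattern is a looser approximation and should be treated as an assumption, and `classify_identifier` is a hypothetical helper name.

```python
import re

# Accession pattern as documented by UniProt
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Approximate entry-name pattern (assumption): mnemonic, underscore, species code
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(value):
    """Return 'accession', 'entry_name', or 'unknown' for a given string."""
    if ACCESSION_RE.fullmatch(value):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(value):
        return "entry_name"
    return "unknown"


for identifier in ["P04637", "P53_HUMAN", "TP53"]:
    print(identifier, classify_identifier(identifier))
# P04637 accession
# P53_HUMAN entry_name
# TP53 unknown
```

A gene symbol such as `TP53` matches neither pattern, which is a useful hint that it should go through a `gene:` query or the ID mapping service instead.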

Best Practices

Recommendation	Rationale
Add `reviewed:true` to queries	Swiss-Prot entries are manually curated
Request minimal fields	Reduces transfer size and response time
Use streaming for large sets	Avoids pagination complexity
Implement rate limiting	Respects server resources (0.3-0.5s delay)
Cache repeated queries	Minimizes redundant API calls
Handle errors gracefully	Network issues, rate limits, missing entries
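
As one way to act on the last two rows, the sketch below wraps a request in a retry loop with exponential backoff. It is an illustration under assumptions, not a prescribed UniProt policy: `with_retries`, `RETRYABLE`, and `FakeResponse` are hypothetical names, the set of retryable status codes is a common convention rather than something stated by UniProt, and the demo uses stand-in responses so it runs without network access.

```python
import time

# Status codes treated as transient (a conventional choice, assumed here)
RETRYABLE = {429, 500, 502, 503, 504}


def with_retries(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a response with a non-retryable status.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, e.g. lambda: requests.get(url, params=params).
    Non-retryable errors such as 404 are returned to the caller unchanged.
    """
    for attempt in range(max_attempts):
        try:
            resp = fetch()
        except OSError:
            # Network errors raised by requests derive from OSError
            resp = None
        if resp is not None and resp.status_code not in RETRYABLE:
            return resp
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"gave up after {max_attempts} attempts")


class FakeResponse:
    """Stand-in for a requests.Response, used only for this offline demo."""

    def __init__(self, status_code):
        self.status_code = status_code


outcomes = iter([FakeResponse(429), FakeResponse(503), FakeResponse(200)])
delays = []
resp = with_retries(lambda: next(outcomes), sleep=delays.append)
print(resp.status_code, delays)
# 200 [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the helper testable; in real use the default `time.sleep` applies and the caller still checks `resp.status_code` for missing entries.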

References

See `references/api-reference.md` for:

Complete field listing for query customization
All searchable attributes and operators
Database pairs for identifier translation
Working code examples in curl, R, and JavaScript
Rate limiting and error handling strategies

External Documentation

REST API: https://www.uniprot.org/help/api
Query Fields: https://www.uniprot.org/help/query-fields
ID Mapping: https://www.uniprot.org/help/id_mapping
Programmatic Access: https://www.uniprot.org/help/programmatic_access
