uniprot-database
UniProt Database
UniProt serves as the authoritative resource for protein sequence data and functional annotations. This skill enables programmatic access to search proteins by various criteria, retrieve FASTA sequences, translate identifiers between biological databases, and query both manually curated (Swiss-Prot) and computationally predicted (TrEMBL) protein records.
Use Cases
- Retrieve protein sequences in FASTA format for downstream analysis
- Query proteins by name, gene symbol, organism, or functional terms
- Convert identifiers between UniProt, Ensembl, RefSeq, PDB, and 100+ databases
- Access functional annotations including GO terms, domains, and pathways
- Download curated datasets for machine learning or comparative studies
- Build protein datasets filtered by organism, size, or annotation quality
Installation
No UniProt-specific package is required. UniProt provides a REST API accessed via HTTP requests; the examples here use the `requests` library:
```python
import requests

# Test connectivity by fetching one entry (human p53)
resp = requests.get("https://rest.uniprot.org/uniprotkb/P04637.json")
resp.raise_for_status()
print(resp.json()["primaryAccession"])  # P04637
```
Searching the Database
Basic Text Search
Find proteins by keywords, names, or descriptions:
```python
import requests

endpoint = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "hemoglobin AND organism_id:9606 AND reviewed:true",
    "format": "json",
    "size": 10,
}
resp = requests.get(endpoint, params=params)
resp.raise_for_status()
results = resp.json()

# Print the accession and recommended name of each hit
for entry in results["results"]:
    acc = entry["primaryAccession"]
    name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
    print(f"{acc}: {name}")
```
Query Syntax
UniProt uses a powerful query language with field prefixes and boolean operators:
```
# Boolean combinations
hemoglobin AND organism_id:9606
(kinase OR phosphatase) AND reviewed:true
receptor NOT bacteria

# Field-specific queries
gene:TP53
accession:P00533
organism_name:"Homo sapiens"

# Numeric ranges
length:[100 TO 500]
mass:[20000 TO 50000]

# Wildcards
gene:IL*
protein_name:transport*

# Existence checks
cc_function:*    # has function annotation
xref:pdb         # has PDB structure
ft_signal:*      # has signal peptide
```
Common Filters
| Filter | Description |
|---|---|
| `reviewed:true` | Swiss-Prot entries only (manually curated) |
| `organism_id:9606` | Human proteins (NCBI taxonomy ID) |
| `organism_id:10090` | Mouse proteins |
| `length:[100 TO 500]` | Sequence length range |
| `xref:pdb` | Has experimental structure |
| `cc_disease:*` | Has disease association |
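Because these filters are plain strings joined by boolean operators, a query can be assembled in code. The following is a minimal sketch: `build_query` and its keyword arguments are hypothetical names introduced here for illustration, and the helper only formats a string without validating field names against the API.

```python
def build_query(*terms, reviewed=None, organism_id=None, length_range=None):
    """Join free-text or field terms and common filters with AND.

    Hypothetical helper: it formats a query string and nothing more.
    """
    parts = []
    for term in terms:
        # Parenthesize OR-groups so the surrounding ANDs apply to the whole group
        parts.append(f"({term})" if " OR " in term else term)
    if reviewed is not None:
        parts.append(f"reviewed:{str(reviewed).lower()}")
    if organism_id is not None:
        parts.append(f"organism_id:{organism_id}")
    if length_range is not None:
        low, high = length_range
        parts.append(f"length:[{low} TO {high}]")
    return " AND ".join(parts)


query = build_query(
    "kinase OR phosphatase",
    reviewed=True,
    organism_id=9606,
    length_range=(100, 500),
)
print(query)
```

The resulting string can be passed as the `query` parameter of a search request.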
Fetching Individual Entries
Access specific proteins using their accession numbers:
```python
import requests

acc = "P04637"  # accession of human p53 (entry name P53_HUMAN)
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
resp = requests.get(url)
resp.raise_for_status()
print(resp.text)
```
Supported Formats
| Format | Extension | Use Case |
|---|---|---|
| FASTA | `.fasta` | Sequence analysis, alignments |
| JSON | `.json` | Parsing in code |
| TSV | `.tsv` | Spreadsheets, data frames |
| XML | `.xml` | Structured data exchange |
| TXT | `.txt` | Human-readable flat file |
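As a rough illustration of working with these formats, the sketch below builds a per-entry URL from an accession and a format, and splits FASTA text into records. `entry_url` and `parse_fasta` are hypothetical helpers, and the sample record uses a made-up accession and sequence rather than real UniProt data.

```python
FORMATS = {"fasta", "json", "tsv", "xml", "txt"}


def entry_url(accession, fmt="json"):
    """Build the URL of a single entry in the requested format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only (not a real UniProt entry)
sample = """>sp|X00001|DEMO_HUMAN Demo protein
MEEPQSDPSV
EPPLSQETFS
"""
records = parse_fasta(sample)
accession = records[0][0].split("|")[1]
print(entry_url("P04637", "fasta"))
print(accession, len(records[0][1]))
```

For real analyses a dedicated parser such as Biopython's `SeqIO` is usually a better choice; the hand-written parser only keeps the example dependency-free.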
Custom Fields (TSV)
Request only the fields you need to minimize bandwidth:
```python
import requests

params = {
    "query": "gene:TP53 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,gene_names,organism_name,length,sequence,cc_function",
}
resp = requests.get("https://rest.uniprot.org/uniprotkb/search", params=params)
print(resp.text)
```
Common field sets:
```
# Minimal identification
accession,id,protein_name,gene_names,organism_name

# Sequence analysis
accession,sequence,length,mass,xref_pdb,xref_alphafolddb

# Functional profiling
accession,protein_name,cc_function,cc_catalytic_activity,go,cc_pathway

# Clinical applications
accession,gene_names,cc_disease,xref_omim,ft_variant
```
See `references/api-reference.md` for the complete field catalog.
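A TSV response is tab-separated text with a header row, so it can be turned into dictionaries with the standard library. This is a minimal sketch: `parse_tsv` is a hypothetical helper, the sample text is an illustrative stand-in for a response body, and disabling quote handling assumes the TSV output does not quote fields.

```python
import csv
import io


def parse_tsv(text):
    """Parse a TSV body into a list of dicts keyed by column header."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # assumption: fields are not quoted
    )
    return list(reader)


# Illustrative stand-in for resp.text from a TSV request
sample = "Entry\tGene Names\tLength\nP04637\tTP53 P53\t393\n"
rows = parse_tsv(sample)
for row in rows:
    print(row["Entry"], row["Gene Names"].split()[0], int(row["Length"]))
```

All values arrive as strings, so numeric columns such as the length need an explicit conversion.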
Identifier Mapping
Translate identifiers between database systems:
```python
import requests
import time


def map_identifiers(ids, source_db, target_db):
    """Map identifiers from one database to another."""
    # Submit mapping job
    submit_resp = requests.post(
        "https://rest.uniprot.org/idmapping/run",
        data={
            "from": source_db,
            "to": target_db,
            "ids": ",".join(ids),
        },
    )
    submit_resp.raise_for_status()
    job_id = submit_resp.json()["jobId"]

    # Poll until complete
    status_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"
    while True:
        status_resp = requests.get(status_url)
        status_resp.raise_for_status()
        status_data = status_resp.json()
        if "results" in status_data or "failedIds" in status_data:
            break
        time.sleep(2)

    # Fetch results
    results_resp = requests.get(
        f"https://rest.uniprot.org/idmapping/results/{job_id}"
    )
    results_resp.raise_for_status()
    return results_resp.json()


# Examples

# UniProt to PDB
mapping = map_identifiers(["P04637", "P00533"], "UniProtKB_AC-ID", "PDB")

# Gene symbols to UniProt
mapping = map_identifiers(["TP53", "EGFR"], "Gene_Name", "UniProtKB")

# UniProt to Ensembl
mapping = map_identifiers(["P00533"], "UniProtKB_AC-ID", "Ensembl")
```
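The JSON returned by a mapping job is easier to use once grouped by source identifier. The sketch below assumes the payload has a `results` list of `from`/`to` pairs plus an optional `failedIds` list, and that for UniProtKB targets `to` may be an entry object carrying `primaryAccession`; both sample payloads are hand-written for illustration, and `group_mapping` is a hypothetical helper.

```python
from collections import defaultdict


def group_mapping(payload):
    """Return ({source_id: [target, ...]}, failed_ids) from a mapping payload."""
    grouped = defaultdict(list)
    for item in payload.get("results", []):
        target = item["to"]
        # Assumption: UniProtKB targets come back as entry objects, not strings
        if isinstance(target, dict):
            target = target.get("primaryAccession")
        grouped[item["from"]].append(target)
    return dict(grouped), payload.get("failedIds", [])


# Hand-written payloads for illustration (not real API output)
pdb_payload = {
    "results": [
        {"from": "P04637", "to": "1A1U"},
        {"from": "P04637", "to": "1AIE"},
    ],
    "failedIds": ["NOT_AN_ID"],
}
gene_payload = {
    "results": [{"from": "TP53", "to": {"primaryAccession": "P04637"}}],
}

pdb_map, pdb_failed = group_mapping(pdb_payload)
gene_map, gene_failed = group_mapping(gene_payload)
print(pdb_map, pdb_failed)
print(gene_map, gene_failed)
```

Keeping the failed identifiers alongside the grouped results makes it simple to report which inputs could not be mapped.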
Common Database Pairs
| From | To | Use Case |
|---|---|---|
| `UniProtKB_AC-ID` | `PDB` | Find structures |
| `UniProtKB_AC-ID` | `Ensembl` | Link to genomics |
| `Gene_Name` | `UniProtKB` | Gene symbol lookup |
| | | NCBI to UniProt |
| | | Get GO annotations |
| | | Drug target info |

See `references/api-reference.md` for all 200+ supported databases.
Constraints:
- Maximum 100,000 identifiers per request
- Results persist for 7 days
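Given the per-request identifier limit stated above, a long list has to be split into several jobs. A minimal sketch, where `chunked` is a hypothetical helper and the identifiers are synthetic placeholders:

```python
def chunked(ids, max_per_job=100_000):
    """Yield successive slices of ids, each at most max_per_job long."""
    if max_per_job < 1:
        raise ValueError("max_per_job must be at least 1")
    for start in range(0, len(ids), max_per_job):
        yield ids[start:start + max_per_job]


# Synthetic placeholder identifiers, only to show the batch sizes
ids = [f"ID{i}" for i in range(250_000)]
sizes = [len(batch) for batch in chunked(ids)]
print(sizes)  # [100000, 100000, 50000]
```

Each batch can then be submitted as its own mapping job and the results merged afterwards.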
Streaming Large Datasets
For complete proteomes or large result sets, use streaming to bypass pagination:
```python
import requests

params = {
    "query": "organism_id:9606 AND reviewed:true",
    "format": "fasta",
}
resp = requests.get(
    "https://rest.uniprot.org/uniprotkb/stream",
    params=params,
    stream=True,
)
resp.raise_for_status()

# Write the response to disk in chunks instead of holding it in memory
with open("human_proteome.fasta", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
```
Batch Operations
Rate-Limited Client
Respect server resources when processing many requests:
```python
import requests
import time


class UniProtClient:
    BASE = "https://rest.uniprot.org"

    def __init__(self, delay=0.5):
        self.delay = delay
        self.last_call = 0

    def _throttle(self):
        # Sleep just long enough to keep `delay` seconds between calls
        elapsed = time.time() - self.last_call
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_call = time.time()

    def get_proteins(self, accessions, batch_size=100):
        """Fetch metadata for multiple accessions."""
        results = []
        for i in range(0, len(accessions), batch_size):
            batch = accessions[i:i+batch_size]
            query = " OR ".join(f"accession:{a}" for a in batch)
            self._throttle()
            resp = requests.get(
                f"{self.BASE}/uniprotkb/search",
                params={"query": query, "format": "json", "size": batch_size}
            )
            if resp.ok:
                results.extend(resp.json().get("results", []))
        return results


# Usage
client = UniProtClient(delay=0.3)
proteins = client.get_proteins(["P04637", "P00533", "Q07817", "P38398"])
```
Paginated Retrieval
For queries with many results, follow the `next` link that each response exposes through `resp.links`:
```python
import requests


def fetch_all(query, fields=None, max_results=None):
    """Retrieve all results with automatic pagination."""
    url = "https://rest.uniprot.org/uniprotkb/search"
    collected = []
    params = {
        "query": query,
        "format": "json",
        "size": 500,
    }
    if fields:
        params["fields"] = ",".join(fields)

    while url:
        # params is sent with the first request only; afterwards it is None
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        data = resp.json()
        collected.extend(data["results"])
        if max_results and len(collected) >= max_results:
            return collected[:max_results]
        url = resp.links.get("next", {}).get("url")
        params = None  # Next URL contains all params
    return collected


# Example: all human phosphatases
entries = fetch_all(
    "protein_name:phosphatase AND organism_id:9606 AND reviewed:true",
    fields=["accession", "gene_names", "protein_name"]
)
```
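Pagination of this kind relies on a `Link` response header whose `rel="next"` entry holds the URL of the following page; `requests` exposes it as `resp.links`. To show what that lookup amounts to, here is a standard-library sketch. `next_page_url` is a hypothetical helper, and the header value, including its cursor, is made up for illustration.

```python
import re

# Matches entries of the form <URL>; rel="name"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="?([^";,]+)"?')


def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


# Made-up header value for illustration
header = '<https://rest.uniprot.org/uniprotkb/search?cursor=abc123&size=500>; rel="next"'
print(next_page_url(header))
print(next_page_url(None))
```

When no `next` entry is present the function returns `None`, which is the condition that ends a pagination loop.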
Working with Results
Parse JSON Response
```python
import requests

resp = requests.get(
    "https://rest.uniprot.org/uniprotkb/search",
    params={
        "query": "gene:BRCA1 AND reviewed:true",
        "format": "json",
        "size": 1
    }
)
entry = resp.json()["results"][0]

# Extract common fields
accession = entry["primaryAccession"]
gene_name = entry["genes"][0]["geneName"]["value"]
organism = entry["organism"]["scientificName"]
sequence = entry["sequence"]["value"]
length = entry["sequence"]["length"]

# Function annotation
if "comments" in entry:
    for comment in entry["comments"]:
        if comment["commentType"] == "FUNCTION":
            print(f"Function: {comment['texts'][0]['value']}")
```
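Direct indexing into an entry raises `KeyError` or `IndexError` when a field is absent, which is more likely for unreviewed entries. The sketch below reads the same fields defensively. The helper names are hypothetical, the fallback to `submissionNames` is an assumption about how unreviewed entries carry their name, and both sample entries are hand-written fragments rather than real API output.

```python
def protein_name(entry):
    """Return the recommended name, else the first submitted name, else None."""
    desc = entry.get("proteinDescription", {})
    recommended = desc.get("recommendedName")
    if recommended:
        return recommended["fullName"]["value"]
    # Assumption: unreviewed entries list their names under "submissionNames"
    submitted = desc.get("submissionNames") or []
    if submitted:
        return submitted[0]["fullName"]["value"]
    return None


def first_gene(entry):
    """Return the first primary gene name, or None if there is none."""
    for gene in entry.get("genes", []):
        name = gene.get("geneName")
        if name:
            return name["value"]
    return None


def function_texts(entry):
    """Collect the text of every FUNCTION comment."""
    return [
        text["value"]
        for comment in entry.get("comments", [])
        if comment.get("commentType") == "FUNCTION"
        for text in comment.get("texts", [])
    ]


# Hand-written fragments for illustration (not real API output)
curated = {
    "proteinDescription": {"recommendedName": {"fullName": {"value": "Demo kinase"}}},
    "genes": [{"geneName": {"value": "DEMO1"}}],
    "comments": [{"commentType": "FUNCTION", "texts": [{"value": "Does a thing."}]}],
}
sparse = {
    "proteinDescription": {"submissionNames": [{"fullName": {"value": "Uncharacterized protein"}}]},
}

print(protein_name(curated), first_gene(curated), function_texts(curated))
print(protein_name(sparse), first_gene(sparse), function_texts(sparse))
```

Returning `None` or an empty list for missing data lets calling code decide whether to skip the entry or substitute a default.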
Build a Protein Dataset
```python
import requests


def build_dataset(query, output_path, fields):
    """Export search results to a TSV file."""
    resp = requests.get(
        "https://rest.uniprot.org/uniprotkb/stream",
        params={
            "query": query,
            "format": "tsv",
            "fields": ",".join(fields),
        },
    )
    resp.raise_for_status()
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(resp.text)


# Create dataset of human kinases
build_dataset(
    query="family:kinase AND organism_id:9606 AND reviewed:true",
    output_path="human_kinases.tsv",
    fields=["accession", "gene_names", "protein_name", "length", "sequence"]
)
```
Key Terminology
Swiss-Prot vs TrEMBL: Swiss-Prot entries (`reviewed:true`) are manually curated by experts. TrEMBL entries (`reviewed:false`) are computationally predicted. Always prefer Swiss-Prot for high-confidence data.
Accession Number: Stable identifier for a protein entry (e.g., P04637). Entry names like "P53_HUMAN" may change.
Entity Types: UniProt covers UniProtKB (proteins), UniRef (clustered sequences), UniParc (archive), and Proteomes (complete sets).
Annotation Score: Quality indicator from 1 (basic) to 5 (comprehensive). Higher scores indicate more complete annotations.
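These two quality signals can also be checked on entries that have already been downloaded. The sketch below assumes each JSON entry carries an `entryType` string that mentions Swiss-Prot for reviewed entries and a numeric `annotationScore`; the helper names and the sample entries, including their placeholder accessions, are hypothetical.

```python
def is_reviewed(entry):
    """True when the entry type marks the record as Swiss-Prot (assumed field)."""
    return "Swiss-Prot" in entry.get("entryType", "")


def high_confidence(entries, min_score=4):
    """Keep reviewed entries whose annotation score is at least min_score."""
    return [
        e for e in entries
        if is_reviewed(e) and e.get("annotationScore", 0) >= min_score
    ]


# Hand-written entries with placeholder accessions, for illustration only
entries = [
    {"primaryAccession": "DEMO1", "entryType": "UniProtKB reviewed (Swiss-Prot)", "annotationScore": 5.0},
    {"primaryAccession": "DEMO2", "entryType": "UniProtKB reviewed (Swiss-Prot)", "annotationScore": 2.0},
    {"primaryAccession": "DEMO3", "entryType": "UniProtKB unreviewed (TrEMBL)", "annotationScore": 4.0},
]
kept = [e["primaryAccession"] for e in high_confidence(entries)]
print(kept)  # ['DEMO1']
```

Filtering in the query itself with `reviewed:true` avoids downloading entries that would be discarded; a local check like this is mainly useful for data obtained earlier or from mixed sources.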
Best Practices
| Recommendation | Rationale |
|---|---|
| Add `reviewed:true` to queries | Swiss-Prot entries are manually curated |
| Request minimal fields | Reduces transfer size and response time |
| Use streaming for large sets | Avoids pagination complexity |
| Implement rate limiting | Respects server resources (0.3-0.5s delay) |
| Cache repeated queries | Minimizes redundant API calls |
| Handle errors gracefully | Network issues, rate limits, missing entries |
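Caching and error handling can be combined in one small wrapper. The sketch below is one possible arrangement, not a prescribed one: `cached_with_retries` is a hypothetical helper that wraps any fetch function with an in-memory cache and exponential backoff. It retries on `OSError`, which is the base class of the exceptions `requests` raises for network failures, and the demo uses a simulated fetch function so that it runs without network access.

```python
import time


def cached_with_retries(fetch, retries=3, base_delay=0.5, sleep=time.sleep):
    """Wrap fetch(key) with an in-memory cache and exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    cache = {}

    def wrapped(key):
        if key in cache:
            return cache[key]
        for attempt in range(retries):
            try:
                value = fetch(key)
            except OSError:
                if attempt == retries - 1:
                    raise  # out of attempts: surface the error
                sleep(base_delay * 2 ** attempt)
            else:
                cache[key] = value
                return value

    return wrapped


# Simulated fetch that fails twice before succeeding (no network involved)
calls = []


def flaky_fetch(accession):
    calls.append(accession)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return {"primaryAccession": accession}


delays = []
fetch = cached_with_retries(flaky_fetch, retries=3, base_delay=0.5, sleep=delays.append)
first = fetch("P04637")   # succeeds on the third attempt
second = fetch("P04637")  # served from the cache
print(first, len(calls), delays)
```

Passing the sleep function as a parameter keeps the wrapper testable; in real use the default `time.sleep` applies the delays.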
References
See `references/api-reference.md` for:
- Complete field listing for query customization
- All searchable attributes and operators
- Database pairs for identifier translation
- Working code examples in curl, R, and JavaScript
- Rate limiting and error handling strategies
External Documentation
- REST API: https://www.uniprot.org/help/api
- Query Fields: https://www.uniprot.org/help/query-fields
- ID Mapping: https://www.uniprot.org/help/id_mapping
- Programmatic Access: https://www.uniprot.org/help/programmatic_access