pdb-database
RCSB Protein Data Bank
The Protein Data Bank hosts over 200,000 experimentally determined macromolecular structures plus computed models from AlphaFold and ModelArchive. This skill provides programmatic access to search, download, and analyze structural data.
Applicable Scenarios
| Task | Examples |
|---|---|
| Structure Retrieval | Download coordinates for a known PDB ID |
| Similarity Search | Find structures similar by sequence or 3D fold |
| Metadata Access | Get resolution, method, organism, ligands |
| Dataset Building | Compile structures for ML training or analysis |
| Drug Discovery | Identify ligand-bound structures for a target |
| Quality Filtering | Select high-resolution, well-refined structures |
Setup
```bash
pip install rcsb-api requests
```
The `rcsb-api` package (v1.5.0+) provides:
- `rcsbapi.search` - Query construction and execution
- `rcsbapi.data` - DataQuery for batch retrieval
Quick Reference
Search Queries
```python
from rcsbapi.search import TextQuery, AttributeQuery, SeqSimilarityQuery, StructSimilarityQuery

# Text search
results = list(TextQuery("kinase inhibitor")())

# Filter by organism (use string attribute paths)
human = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)

# Filter by resolution
high_res = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)

# Filter by experimental method
xray = AttributeQuery(
    attribute="exptl.method",
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)

# Combine queries: & (AND), | (OR), ~ (NOT)
results = list((TextQuery("kinase") & human & high_res)())

# Sequence similarity (MMseqs2) - minimum 25 residues required
seq_query = SeqSimilarityQuery(
    value="VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    evalue_cutoff=1e-5,
    identity_cutoff=0.7
)

# Structure similarity (3D fold)
struct_query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB"
)
```
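The sequence search above states a 25-residue minimum. A minimal sketch of a pre-flight check follows; `clean_query_sequence` is a hypothetical helper (not part of `rcsb-api`), and it assumes that removing whitespace and uppercasing is acceptable for your input.

```python
MIN_QUERY_LENGTH = 25  # minimum stated for sequence similarity queries


def clean_query_sequence(raw: str, min_length: int = MIN_QUERY_LENGTH) -> str:
    """Normalize a one-letter sequence and enforce the minimum length."""
    # Drop all whitespace (e.g. line wrapping copied from a FASTA file) and uppercase.
    seq = "".join(raw.split()).upper()
    if not seq.isalpha():
        raise ValueError("sequence must contain only letters")
    if len(seq) < min_length:
        raise ValueError(
            f"sequence has {len(seq)} residues; at least {min_length} are required"
        )
    return seq
```

The cleaned string can then be passed as `value` to `SeqSimilarityQuery`, so that a too-short sequence fails locally instead of in a remote request.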
Data Retrieval (REST API)
The `rcsb-api` package's data module has limited functionality. Use the REST API directly for metadata:
```python
import requests

def fetch_entry(pdb_id: str) -> dict:
    """Fetch entry metadata from RCSB REST API."""
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
    resp.raise_for_status()
    return resp.json()

# Example usage
data = fetch_entry("4HHB")
print(data["struct"]["title"])
print(data["rcsb_entry_info"]["resolution_combined"])

# Polymer entity (chain info + sequence)
def fetch_polymer_entity(pdb_id: str, entity_id: int = 1) -> dict:
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}")
    resp.raise_for_status()
    return resp.json()

entity = fetch_polymer_entity("4HHB", 1)
sequence = entity["entity_poly"]["pdbx_seq_one_letter_code"]
```
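The example above fetches entity 1 only. As a rough sketch of how the remaining entities might be enumerated, the helper below builds one polymer entity URL per entity listed in an entry record. It assumes the entry JSON lists its entities under `rcsb_entry_container_identifiers.polymer_entity_ids`; that key and the helper name `polymer_entity_urls` are not taken from this document, so check them against a real response before relying on them.

```python
POLYMER_ENTITY_BASE = "https://data.rcsb.org/rest/v1/core/polymer_entity"


def polymer_entity_urls(pdb_id: str, entry: dict) -> list[str]:
    """Build one polymer_entity REST URL per entity listed in an entry record."""
    # Assumption: entity IDs live under this key; a missing key yields an empty list.
    identifiers = entry.get("rcsb_entry_container_identifiers", {})
    entity_ids = identifiers.get("polymer_entity_ids", [])
    return [f"{POLYMER_ENTITY_BASE}/{pdb_id}/{entity_id}" for entity_id in entity_ids]
```

Each URL has the same shape that `fetch_polymer_entity` requests, so the result of `fetch_entry` can drive a loop over all entities.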
File Downloads
| Format | URL Pattern |
|---|---|
| PDB | |
| mmCIF | |
| Assembly | |
| FASTA | |
```python
import requests
from pathlib import Path

def download_structure(pdb_id: str, fmt: str = "cif", outdir: str = ".") -> Path:
    url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
    resp = requests.get(url)
    resp.raise_for_status()
    outpath = Path(outdir) / f"{pdb_id}.{fmt}"
    outpath.write_text(resp.text)
    return outpath
```
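Because released structures rarely change, repeated downloads can usually be served from disk. The sketch below shows one way to do that. `cached_download` and its `fetch` hook are hypothetical names introduced here, and the cache directory default is an arbitrary choice.

```python
from pathlib import Path
from typing import Callable


def cached_download(
    pdb_id: str,
    fetch: Callable[[str], str],
    cache_dir: str = "pdb_cache",
    fmt: str = "cif",
) -> Path:
    """Return the cached file for pdb_id, calling fetch only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = pdb_id.upper()  # one cache file per ID regardless of input case
    path = cache / f"{key}.{fmt}"
    if not path.exists():
        # fetch is any callable that returns the file text for a PDB ID.
        path.write_text(fetch(key))
    return path
```

Passing the network call in as `fetch` keeps the caching logic separate from the HTTP code, so the same function works with a wrapper around the `requests.get` call shown above or with a stub in tests.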
Common Workflows
Find High-Quality Human Structures
```python
from rcsbapi.search import TextQuery, AttributeQuery

query = (
    TextQuery("receptor") &
    AttributeQuery(
        attribute="rcsb_entity_source_organism.scientific_name",
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.5
    ) &
    AttributeQuery(
        attribute="exptl.method",
        operator="exact_match",
        value="X-RAY DIFFRACTION"
    )
)
results = list(query())
```
Batch Metadata Retrieval
```python
import requests
import time

def fetch_batch(pdb_ids: list, delay: float = 0.3) -> dict:
    """Fetch metadata with rate limiting."""
    results = {}
    for pdb_id in pdb_ids:
        try:
            resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
            resp.raise_for_status()
            data = resp.json()
            results[pdb_id] = {
                "title": data["struct"]["title"],
                "resolution": data.get("rcsb_entry_info", {}).get("resolution_combined"),
                "method": data.get("exptl", [{}])[0].get("method"),
            }
        except Exception as e:
            results[pdb_id] = {"error": str(e)}
        time.sleep(delay)
    return results
```
Find Drug-Bound Structures
```python
from rcsbapi.search import AttributeQuery

# Find structures containing imatinib (ligand ID: STI)
query = AttributeQuery(
    attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
    operator="exact_match",
    value="STI"
)
drug_complexes = list(query())
```
GraphQL for Complex Queries
```python
import requests

query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
    }
    polymer_entities {
      rcsb_polymer_entity { pdbx_description }
      entity_poly { pdbx_seq_one_letter_code }
    }
  }
}
"""
response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query}
)
result = response.json()["data"]["entry"]
```
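The query above targets a single entry. A sketch of how several entries might be requested in one call is shown below. It assumes the schema exposes an `entries(entry_ids: ...)` root field and an `rcsb_id` field on each entry; confirm both in the GraphQL Explorer. `build_entries_payload` is a hypothetical helper name.

```python
ENTRIES_QUERY = """
query($ids: [String!]!) {
  entries(entry_ids: $ids) {
    rcsb_id
    struct { title }
    rcsb_entry_info { resolution_combined }
  }
}
"""


def build_entries_payload(pdb_ids: list[str]) -> dict:
    """Build a GraphQL request body that asks for several entries at once."""
    # Uppercase and de-duplicate while preserving the input order.
    ids = list(dict.fromkeys(pdb_id.upper() for pdb_id in pdb_ids))
    # IDs travel as a GraphQL variable rather than being pasted into the query text.
    return {"query": ENTRIES_QUERY, "variables": {"ids": ids}}
```

The returned dictionary can be sent with `requests.post("https://data.rcsb.org/graphql", json=build_entries_payload(["4HHB", "1TUP"]))`, replacing one REST request per entry with a single round trip.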
Key Concepts
| Term | Definition |
|---|---|
| PDB ID | 4-character alphanumeric code (e.g., "4HHB"). AlphaFold uses "AF_" prefix |
| Entity | Distinct molecular species. A homodimer has one entity appearing twice |
| Resolution | Quality metric in angstroms. Lower is better; <2.0 Å is high quality |
| Biological Assembly | Functional oligomeric state (may differ from asymmetric unit) |
| mmCIF | Modern format replacing legacy PDB; required for large structures |
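The PDB ID definition in the table can be checked before any request is sent. The sketch below is a minimal validator under that definition (four alphanumeric characters); `normalize_pdb_id` is a hypothetical helper, and it deliberately rejects computed-model identifiers such as those with the "AF_" prefix.

```python
import re

# Four alphanumeric characters, following the PDB ID definition in the table above.
_PDB_ID_RE = re.compile(r"[A-Za-z0-9]{4}")


def normalize_pdb_id(raw: str) -> str:
    """Return the uppercase PDB ID, or raise ValueError if the format is wrong."""
    candidate = raw.strip()
    if not _PDB_ID_RE.fullmatch(candidate):
        raise ValueError(f"not a 4-character PDB ID: {raw!r}")
    return candidate.upper()
```

Normalizing IDs once at the boundary avoids treating "4hhb" and "4HHB" as different keys in result dictionaries or caches.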
Common Attribute Paths
Use these string paths with `AttributeQuery`:
| Attribute | Description |
|---|---|
| `rcsb_entity_source_organism.scientific_name` | Source organism (e.g., "Homo sapiens") |
| `rcsb_entry_info.resolution_combined` | Resolution in angstroms |
| `exptl.method` | Experimental method (X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
| `rcsb_nonpolymer_entity_instance_container_identifiers.comp_id` | Ligand/small molecule ID |
| `struct.title` | Structure title |
| | Deposition date |
Best Practices
| Practice | Rationale |
|---|---|
| Use mmCIF format | PDB format has atom count limits |
| Filter by resolution | <2.5 Å for most analyses; <2.0 Å for detailed work |
| Check experimental method | X-ray vs cryo-EM vs NMR have different quality metrics |
| Rate limit requests | 2-3 req/s to avoid 429 errors |
| Cache downloads | Structures rarely change after release |
| Prefer GraphQL | Reduces requests for complex data needs |
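The rate-limit row can be enforced with a small helper instead of a fixed `time.sleep` after every request. The sketch below spaces calls at least `1 / rate_per_second` seconds apart. `RateLimiter` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Block so that successive wait() calls are at least 1/rate seconds apart."""

    def __init__(
        self,
        rate_per_second: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / rate_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each request only sleeps when requests would otherwise arrive too quickly, so slow responses do not add extra delay.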
Troubleshooting
| Issue | Resolution |
|---|---|
| 404 on entry fetch | Entry may be obsoleted; check RCSB website for superseding ID |
| 429 Too Many Requests | Implement exponential backoff; reduce request rate |
| Empty search results | Check query syntax; use |
| Large structure fails | Use mmCIF format instead of PDB |
| Missing sequence data | Query polymer entity endpoint, not entry |
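The exponential backoff suggested for 429 responses could look like the sketch below. `retry_on_429` is a hypothetical helper, the attempt count and base delay are arbitrary defaults, and the sketch omits jitter. It only assumes that `send()` returns an object with a `status_code` attribute, as `requests` responses do.

```python
import time
from typing import Any, Callable


def retry_on_429(
    send: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until the response status is not 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Delay doubles on each retry: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the final attempt; the caller decides what to do.
    return response
```

A call such as `retry_on_429(lambda: requests.get(url, timeout=30))` wraps any single request without changing how the response is handled afterwards.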
References
See `references/api-reference.md` for:
- Complete REST endpoint documentation
- All searchable attributes and operators
- Advanced query patterns
- Rate limiting strategies
External Links
- RCSB PDB: https://www.rcsb.org
- API Overview: https://www.rcsb.org/docs/programmatic-access/web-apis-overview
- Python Package: https://rcsbapi.readthedocs.io/
- GraphQL Explorer: https://data.rcsb.org/graphql