pdb-database

RCSB Protein Data Bank

The Protein Data Bank hosts over 200,000 experimentally determined macromolecular structures plus computed models from AlphaFold and ModelArchive. This skill provides programmatic access to search, download, and analyze structural data.

Applicable Scenarios

| Task | Examples |
|------|----------|
| Structure Retrieval | Download coordinates for a known PDB ID |
| Similarity Search | Find structures similar by sequence or 3D fold |
| Metadata Access | Get resolution, method, organism, ligands |
| Dataset Building | Compile structures for ML training or analysis |
| Drug Discovery | Identify ligand-bound structures for a target |
| Quality Filtering | Select high-resolution, well-refined structures |

Setup

bash
pip install rcsb-api requests

The rcsb-api package (v1.5.0+) provides:
  • rcsbapi.search: query construction and execution
  • rcsbapi.data: DataQuery for batch retrieval

Quick Reference

Search Queries

python
from rcsbapi.search import TextQuery, AttributeQuery, SeqSimilarityQuery, StructSimilarityQuery

# Text search
results = list(TextQuery("kinase inhibitor")())

# Filter by organism (use string attribute paths)
human = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)

# Filter by resolution
high_res = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)

# Filter by experimental method
xray = AttributeQuery(
    attribute="exptl.method",
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)

# Combine queries: & (AND), | (OR), ~ (NOT)
results = list((TextQuery("kinase") & human & high_res)())

# Sequence similarity (MMseqs2) - minimum 25 residues required
seq_query = SeqSimilarityQuery(
    value="VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    evalue_cutoff=1e-5,
    identity_cutoff=0.7
)

# Structure similarity (3D fold)
struct_query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB"
)
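
When a combined query returns nothing, it can help to see roughly what the `&` operator produces as a Search API request body. The sketch below builds such a body by hand using only the standard library. The helper names `terminal`, `and_group`, and `build_request` are hypothetical, and the JSON layout is an assumption based on the RCSB Search API v2 request format, so compare it against the output of `query.to_dict()` before relying on it.

```python
import json

def terminal(attribute: str, operator: str, value) -> dict:
    """Build one attribute filter node (hypothetical helper, not part of rcsb-api)."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

def and_group(*nodes: dict) -> dict:
    """Combine nodes with a logical AND, mirroring the & operator."""
    return {"type": "group", "logical_operator": "and", "nodes": list(nodes)}

def build_request(query: dict, return_type: str = "entry") -> dict:
    """Wrap a query node in a request body that asks for entry IDs."""
    return {"query": query, "return_type": return_type}

human = terminal("rcsb_entity_source_organism.scientific_name", "exact_match", "Homo sapiens")
high_res = terminal("rcsb_entry_info.resolution_combined", "less", 2.0)
request_body = build_request(and_group(human, high_res))
print(json.dumps(request_body, indent=2))
```

Printing the hand-built body next to the library's own dictionary makes it easier to spot a misspelled attribute path or an operator that does not fit the attribute's type.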

Data Retrieval (REST API)

The rcsb-api package's data module has limited functionality. Use the REST API directly for metadata:

python
import requests

def fetch_entry(pdb_id: str) -> dict:
    """Fetch entry metadata from RCSB REST API."""
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
    resp.raise_for_status()
    return resp.json()

# Example usage
data = fetch_entry("4HHB")
print(data["struct"]["title"])
print(data["rcsb_entry_info"]["resolution_combined"])

# Polymer entity (chain info + sequence)
def fetch_polymer_entity(pdb_id: str, entity_id: int = 1) -> dict:
    """Fetch one polymer entity record from RCSB REST API."""
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}")
    resp.raise_for_status()
    return resp.json()

entity = fetch_polymer_entity("4HHB", 1)
sequence = entity["entity_poly"]["pdbx_seq_one_letter_code"]

File Downloads

| Format | URL Pattern |
|--------|-------------|
| PDB | https://files.rcsb.org/download/{ID}.pdb |
| mmCIF | https://files.rcsb.org/download/{ID}.cif |
| Assembly | https://files.rcsb.org/download/{ID}.pdb1 |
| FASTA | https://www.rcsb.org/fasta/entry/{ID} |

python
import requests
from pathlib import Path

def download_structure(pdb_id: str, fmt: str = "cif", outdir: str = ".") -> Path:
    url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
    resp = requests.get(url)
    resp.raise_for_status()
    outpath = Path(outdir) / f"{pdb_id}.{fmt}"
    outpath.write_text(resp.text)
    return outpath
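
The `download_structure` helper covers only the `files.rcsb.org/download/{ID}.{fmt}` pattern, while the FASTA URL sits on a different host. A minimal sketch of a lookup that covers every row of the URL pattern table is shown here. The function name `structure_url` and the dictionary keys are hypothetical, and upper-casing the ID is a convenience assumption rather than a documented requirement.

```python
# URL templates copied from the format table; the keys are illustrative names.
URL_TEMPLATES = {
    "pdb": "https://files.rcsb.org/download/{ID}.pdb",
    "cif": "https://files.rcsb.org/download/{ID}.cif",
    "assembly": "https://files.rcsb.org/download/{ID}.pdb1",
    "fasta": "https://www.rcsb.org/fasta/entry/{ID}",
}

def structure_url(pdb_id: str, kind: str = "cif") -> str:
    """Return the download URL for pdb_id, or raise ValueError for an unknown kind."""
    try:
        template = URL_TEMPLATES[kind]
    except KeyError:
        raise ValueError(
            f"unknown kind {kind!r}; expected one of {sorted(URL_TEMPLATES)}"
        ) from None
    # Assumption: normalizing to upper case keeps file names and cache keys consistent.
    return template.format(ID=pdb_id.strip().upper())

print(structure_url("4hhb"))           # https://files.rcsb.org/download/4HHB.cif
print(structure_url("4HHB", "fasta"))  # https://www.rcsb.org/fasta/entry/4HHB
```

Keeping the templates in one dictionary means a new format only needs a new entry, and an unsupported format fails early instead of producing a 404.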

Common Workflows

Find High-Quality Human Structures

python
from rcsbapi.search import TextQuery, AttributeQuery

query = (
    TextQuery("receptor") &
    AttributeQuery(
        attribute="rcsb_entity_source_organism.scientific_name",
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.5
    ) &
    AttributeQuery(
        attribute="exptl.method",
        operator="exact_match",
        value="X-RAY DIFFRACTION"
    )
)
results = list(query())

Batch Metadata Retrieval

python
import requests
import time

def fetch_batch(pdb_ids: list, delay: float = 0.3) -> dict:
    """Fetch metadata with rate limiting."""
    results = {}
    for pdb_id in pdb_ids:
        try:
            resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}")
            resp.raise_for_status()
            data = resp.json()
            results[pdb_id] = {
                "title": data["struct"]["title"],
                "resolution": data.get("rcsb_entry_info", {}).get("resolution_combined"),
                "method": data.get("exptl", [{}])[0].get("method"),
            }
        except Exception as e:
            results[pdb_id] = {"error": str(e)}
        time.sleep(delay)
    return results
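
The records returned by `fetch_batch` can then be filtered locally. The sketch below assumes that `resolution_combined` arrives as a list of floats (which is how the REST API appears to return it) and may be missing for methods such as NMR. `best_resolution` and `filter_by_resolution` are hypothetical helper names, and the sample records use placeholder IDs and made-up values.

```python
from typing import Optional

def best_resolution(record: dict) -> Optional[float]:
    """Return the smallest reported resolution in angstroms, or None if unavailable."""
    value = record.get("resolution")
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    values = [float(v) for v in value if v is not None]
    return min(values) if values else None

def filter_by_resolution(records: dict, cutoff: float = 2.5) -> dict:
    """Keep entries whose known resolution is strictly below the cutoff."""
    kept = {}
    for pdb_id, record in records.items():
        if "error" in record:
            continue  # skip entries that failed to download
        res = best_resolution(record)
        if res is not None and res < cutoff:
            kept[pdb_id] = record
    return kept

# Sample data shaped like fetch_batch output (placeholder IDs, made-up values).
sample = {
    "AAAA": {"title": "example A", "resolution": [1.8], "method": "X-RAY DIFFRACTION"},
    "BBBB": {"title": "example B", "resolution": [3.2], "method": "ELECTRON MICROSCOPY"},
    "CCCC": {"title": "example C", "resolution": None, "method": "SOLUTION NMR"},
    "DDDD": {"error": "404 Client Error"},
}
print(sorted(filter_by_resolution(sample)))  # ['AAAA']
```

Entries without a resolution are dropped rather than treated as zero, so NMR structures never pass the filter by accident.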

Find Drug-Bound Structures

python
from rcsbapi.search import AttributeQuery

# Find structures containing imatinib (ligand ID: STI)
query = AttributeQuery(
    attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
    operator="exact_match",
    value="STI"
)
drug_complexes = list(query())

GraphQL for Complex Queries

python
import requests

query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
    }
    polymer_entities {
      rcsb_polymer_entity { pdbx_description }
      entity_poly { pdbx_seq_one_letter_code }
    }
  }
}
"""

response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query}
)
result = response.json()["data"]["entry"]
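
The same endpoint can return several entries in one request, which is where GraphQL saves round trips. The sketch below only builds the payload. It assumes the schema exposes an `entries(entry_ids: [...])` root field and an `rcsb_id` field on each entry, and `build_entries_payload` is a hypothetical helper name.

```python
import json

def build_entries_payload(pdb_ids: list) -> dict:
    """Build a GraphQL payload requesting title and resolution for several entries."""
    # A JSON-encoded list of strings is also valid GraphQL list syntax.
    ids = json.dumps([pdb_id.upper() for pdb_id in pdb_ids])
    query = f"""
{{
  entries(entry_ids: {ids}) {{
    rcsb_id
    struct {{ title }}
    rcsb_entry_info {{ resolution_combined }}
  }}
}}
"""
    return {"query": query}

payload = build_entries_payload(["4HHB", "1CRN"])
print(payload["query"])
```

The payload can be posted to `https://data.rcsb.org/graphql` with `requests.post(..., json=payload)` in the same way as the single-entry query; under the assumption above, the results would sit under `response.json()["data"]["entries"]`.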

Key Concepts

| Term | Definition |
|------|------------|
| PDB ID | 4-character alphanumeric code (e.g., "4HHB"). AlphaFold models use an "AF_" prefix |
| Entity | Distinct molecular species. A homodimer has one entity appearing twice |
| Resolution | Quality metric in angstroms. Lower is better; <2.0 Å is high quality |
| Biological Assembly | Functional oligomeric state (may differ from asymmetric unit) |
| mmCIF | Modern format replacing legacy PDB; required for large structures |
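
Because experimental entries and computed models share one search space, it can be handy to tell the two ID styles apart before choosing an endpoint. This is a minimal sketch based only on the definitions in the table: four alphanumeric characters for experimental entries and an "AF_" prefix for AlphaFold models. `classify_id` is a hypothetical helper, and the AlphaFold-style ID in the example is illustrative.

```python
import re

# Assumption: experimental IDs are exactly four alphanumeric characters,
# and computed AlphaFold models carry the "AF_" prefix.
_PDB_ID = re.compile(r"[0-9A-Za-z]{4}")

def classify_id(identifier: str) -> str:
    """Return 'computed', 'experimental', or 'unknown' for an identifier string."""
    identifier = identifier.strip()
    if identifier.upper().startswith("AF_"):
        return "computed"
    if _PDB_ID.fullmatch(identifier):
        return "experimental"
    return "unknown"

for example in ["4HHB", "AF_AFP69905F1", "hemoglobin"]:
    print(example, classify_id(example))
```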

Common Attribute Paths

Use these string paths with AttributeQuery:

| Attribute | Description |
|-----------|-------------|
| rcsb_entity_source_organism.scientific_name | Source organism (e.g., "Homo sapiens") |
| rcsb_entry_info.resolution_combined | Resolution in angstroms |
| exptl.method | Experimental method (X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
| rcsb_nonpolymer_entity_instance_container_identifiers.comp_id | Ligand/small molecule ID |
| struct.title | Structure title |
| rcsb_accession_info.deposit_date | Deposition date |

Best Practices

| Practice | Rationale |
|----------|-----------|
| Use mmCIF format | PDB format has atom count limits |
| Filter by resolution | <2.5 Å for most analyses; <2.0 Å for detailed work |
| Check experimental method | X-ray vs cryo-EM vs NMR have different quality metrics |
| Rate limit requests | 2-3 req/s to avoid 429 errors |
| Cache downloads | Structures rarely change after release |
| Prefer GraphQL | Reduces requests for complex data needs |
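
The "Cache downloads" practice can be sketched as a thin wrapper that touches the network only on a cache miss. The download function is passed in as a parameter so the caching logic can be tried without network access. `cached_fetch` and `fake_fetch` are hypothetical names, and the on-disk layout (one file per upper-cased ID) is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable

def cached_fetch(pdb_id: str, cache_dir: Path,
                 fetch: Callable[[str], str], fmt: str = "cif") -> Path:
    """Return the cached file for pdb_id, calling fetch only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{pdb_id.upper()}.{fmt}"
    if not path.exists():
        path.write_text(fetch(pdb_id))
    return path

# Stand-in for a real download function; it records how often it is called.
calls = []
def fake_fetch(pdb_id: str) -> str:
    calls.append(pdb_id)
    return f"data_{pdb_id}\n"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("4hhb", Path(tmp), fake_fetch)
    second = cached_fetch("4hhb", Path(tmp), fake_fetch)
    print(first.name, len(calls))  # 4HHB.cif 1
```

In real use, `fetch` would be a function that downloads the file text for one ID, for example a small wrapper around `requests.get` on the mmCIF URL.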

Troubleshooting

| Issue | Resolution |
|-------|------------|
| 404 on entry fetch | Entry may be obsoleted; check RCSB website for superseding ID |
| 429 Too Many Requests | Implement exponential backoff; reduce request rate |
| Empty search results | Check query syntax; use query.to_dict() to debug |
| Large structure fails | Use mmCIF format instead of PDB |
| Missing sequence data | Query polymer entity endpoint, not entry |
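
The exponential backoff suggested for 429 responses can be sketched as a small retry wrapper. The wrapped call signals rate limiting by raising a custom exception, and the sleep function is injectable so the delay schedule can be checked without waiting. `RateLimited` and `with_backoff` are hypothetical names, and the default delays are illustrative rather than values recommended by RCSB.

```python
import time
from typing import Callable

class RateLimited(Exception):
    """Raised by the caller-supplied function when the server answers 429."""

def with_backoff(call: Callable[[], object], max_retries: int = 5,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep):
    """Run call(), retrying on RateLimited with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)

# Demonstration with a function that is rate limited twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("429 Too Many Requests")
    return "ok"

delays = []
result = with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

With `requests`, the wrapped function would raise `RateLimited` when `resp.status_code == 429` and return `resp.json()` otherwise.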

References

See references/api-reference.md for:
  • Complete REST endpoint documentation
  • All searchable attributes and operators
  • Advanced query patterns
  • Rate limiting strategies

External Links
