pdb-database

RCSB Protein Data Bank skill

Prerequisites

  1. uv: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.rcsb.org/pages/usage-policy, then (2) create the file recording the notification text and timestamp.
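The notification step can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the function name `ensure_license_notification` and the exact notice wording are assumptions, and only the file name and URL come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path

USAGE_POLICY_URL = "https://www.rcsb.org/pages/usage-policy"


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify once per skill directory; return True if a notice was issued now."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified on an earlier run
    # Hypothetical notice wording; adapt as needed.
    text = f"NOTICE: Please review the RCSB PDB usage policy at {USAGE_POLICY_URL}"
    print(text)  # step (1): surface the notice to the user
    timestamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and when it was shown
    marker.write_text(f"{timestamp}\n{text}\n", encoding="utf-8")
    return True
```

Calling the function a second time with the same directory returns `False` and prints nothing, which matches the "if it does not already exist" condition.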

Core Rules

  • Always prefer the provided scripts. Use `curl`, `urllib`, raw HTTP requests, or any other method to access PDB APIs only as a last resort. The scripts automatically enforce required rate limits.
  • Always redirect output to a file. Parse output with e.g. `jq`, `grep`, or a short Python snippet. Do NOT print large API responses to stdout, where they may be truncated.
  • Notification: If this skill is used, ensure this is mentioned in the output.
  • Explain your queries: On completing a task that used PDB JSON/GraphQL queries, explain in clear language what your query did so the user can correct any bad assumptions.
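As an illustration of the "short Python snippet" option, the sketch below summarizes a saved search result instead of printing it in full. The layout of the output file is an assumption: it is taken to mirror a search response with a `total_count` field and a `result_set` list of objects carrying an `identifier`. Check the actual file written by the script before relying on these keys.

```python
import json
from pathlib import Path


def summarize_results(path: Path, preview: int = 5) -> dict:
    """Return a compact summary of a saved search response."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumed layout: {"total_count": int, "result_set": [{"identifier": ...}, ...]}
    hits = data.get("result_set", [])
    return {
        "total_count": data.get("total_count", len(hits)),
        "first_ids": [hit["identifier"] for hit in hits[:preview]],
    }
```

Printing `summarize_results(Path("results.json"))` then shows only the hit count and the first few identifiers, which keeps stdout small.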

Attribute-based search workflow

  1. Fetch the relevant schema to discover searchable attribute names. For structure attributes:
    uv run scripts/fetch_schema.py --api search_structure --output schema_structure.txt
    For chemical attributes:
    uv run scripts/fetch_schema.py --api search_chemical --output schema_chemical.txt
  2. Grep the schema to find relevant attributes. Grep one keyword at a time and examine many lines — there are lots of similar attributes and you must choose the best match for the user's intent.
  3. Compose and run a JSON search query using the discovered attributes:
    uv run scripts/search_pdb.py --query '<JSON>' --return_type <RETURN_TYPE> --output results.json
    Pass the `--count_only` flag to get just the number of matching entries.
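Step 2 can also be done with a few lines of Python rather than `grep`. The sketch below treats the fetched schema as plain text, which is an assumption about the file's format, and returns each matching line with some surrounding context so that similar attributes can be compared. The helper name `grep_schema` is hypothetical.

```python
from pathlib import Path


def grep_schema(path: Path, keyword: str, context: int = 2) -> list[str]:
    """Return lines containing `keyword` (case-insensitive) plus nearby context."""
    lines = path.read_text(encoding="utf-8").splitlines()
    needle = keyword.lower()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if needle in line.lower():
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    # Prefix each kept line with its 1-based line number, in file order.
    return [f"{i + 1}: {lines[i]}" for i in sorted(keep)]
```

Search one keyword per call (for example `grep_schema(Path("schema_structure.txt"), "resolution")`) and read the whole result before settling on an attribute.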

For step 2: some basic PDB concepts (helpful for attribute choice)

  • Entity: A unique molecule found in a structure.
  • Instance / Chain: A particular copy of an entity. E.g. if a structure contains two protein chains with the same sequence, they are the same entity but different instances / chains.
  • Assembly: A biologically relevant collection of instances / chains. This may be the same as the deposited structure, a subset, or multiple copies.
  • Label vs Auth: Polymer instances get letter labels ("A", "B", "AA") and their monomers are numbered. There are author-assigned ("auth") and PDB-internal ("label") schemes. The label scheme is more consistent and is always used in scripts and APIs. However, users and papers may refer to the author scheme (clarify which scheme is being used if necessary).
  • Chemical component: A small molecule / monomer, with a short uppercase alphanumeric ID such as `CA` or `SO4`, typically matching `[A-Z0-9]{1,3}`.
  • Primary citation: The main publication about a structure. Prefer `primary_citation` attributes over `citation` attributes.
  • Resolution: Frequently used measure of structure quality (lower is better). Usually prefer `rcsb_entry_info.resolution_combined`, which accounts for different experimental methods.
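The entry / entity / instance / assembly distinction also shows up in identifier shapes. The sketch below classifies an identifier by its form, assuming the usual RCSB conventions for classic 4-character entries: `4HHB` (entry), `4HHB_1` (entity), `4HHB.A` (instance, using the label scheme) and `4HHB-1` (assembly). The patterns and the function name are illustrative assumptions and do not cover every identifier the APIs accept.

```python
import re

# Assumed identifier shapes for classic 4-character entry IDs.
PATTERNS = {
    "entry": re.compile(r"^[0-9][A-Z0-9]{3}$"),
    "entity": re.compile(r"^[0-9][A-Z0-9]{3}_[0-9]+$"),
    "instance": re.compile(r"^[0-9][A-Z0-9]{3}\.[A-Za-z0-9]+$"),
    "assembly": re.compile(r"^[0-9][A-Z0-9]{3}-[0-9]+$"),
}


def classify_identifier(identifier: str) -> str:
    """Return 'entry', 'entity', 'instance', 'assembly', or 'unknown'."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(identifier):
            return kind
    return "unknown"
```

A check like this can help pick a matching `--return_type` or metadata object before composing a query.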

For step 3: Example queries

```bash
# Non-human proteins published in Nature, newest first
uv run scripts/search_pdb.py --query '{ "type": "group", "logical_operator": "and", "nodes": [ { "type": "terminal", "service": "text", "parameters": { "operator": "exact_match", "negation": true, "value": "Homo sapiens", "attribute": "rcsb_entity_source_organism.taxonomy_lineage.name" } }, { "type": "terminal", "service": "text", "parameters": { "operator": "exact_match", "value": "Nature", "attribute": "rcsb_primary_citation.rcsb_journal_abbrev" } } ] }' --return_type entry --sort_by rcsb_accession_info.initial_release_date --sort_direction desc --page_start 0 --rows 100 --output results.json

# Structures containing the chemical component CA (Ca2+ ion)
uv run scripts/search_pdb.py --query '{ "type": "terminal", "service": "text_chem", "parameters": { "operator": "exact_match", "value": "CA", "attribute": "rcsb_chem_comp_container_identifiers.comp_id" } }' --return_type entry --output results.json

# Number of entries with disulfide bonds
uv run scripts/search_pdb.py --query '{ "type": "terminal", "service": "text", "parameters": { "operator": "exact_match", "value": "disulfide bridge", "attribute": "rcsb_polymer_struct_conn.connect_type" } }' --return_type entry --count_only --output count.json
```

Common operators: `exact_match`, `equals`, `exists`, `contains_phrase`,
`contains_words`, `in`, `greater`, `less`
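Hand-writing nested JSON inside shell quotes is error-prone, so one option is to build the query as Python data and serialize it. The sketch below uses two hypothetical helpers, `terminal` and `group`, that emit the node shapes seen in the attribute queries; the attribute names and flags are the ones used in this section.

```python
import json
import shlex


def terminal(attribute: str, operator: str, value=None, *,
             negation: bool = False, service: str = "text") -> dict:
    """Build a terminal node for an attribute search."""
    params = {"attribute": attribute, "operator": operator}
    if value is not None:  # an operator such as `exists` takes no value
        params["value"] = value
    if negation:
        params["negation"] = True
    return {"type": "terminal", "service": service, "parameters": params}


def group(logical_operator: str, *nodes: dict) -> dict:
    """Combine nodes with "and" / "or"."""
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


query = group(
    "and",
    terminal("rcsb_entity_source_organism.taxonomy_lineage.name",
             "exact_match", "Homo sapiens", negation=True),
    terminal("rcsb_primary_citation.rcsb_journal_abbrev", "exact_match", "Nature"),
)

command = ["uv", "run", "scripts/search_pdb.py", "--query", json.dumps(query),
           "--return_type", "entry", "--output", "results.json"]
print(shlex.join(command))  # shell-safe command line, ready to paste or run
```

Passing `command` as a list to `subprocess.run` avoids shell quoting altogether; `shlex.join` is only needed when the command is shown or pasted into a terminal.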

Similarity-based search workflow

Similarity searches do not require a schema fetch. Basic examples:
```bash
# Sequence similarity
uv run scripts/search_pdb.py --query '{ "query": { "type": "terminal", "service": "sequence", "parameters": { "evalue_cutoff": 1, "identity_cutoff": 0.9, "sequence_type": "protein", "value": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQ" } }, "request_options": { "scoring_strategy": "sequence" } }' --return_type polymer_entity --output results.json

# Structure similarity
uv run scripts/search_pdb.py --query '{ "type": "terminal", "service": "structure", "parameters": { "value": {"entry_id": "6LU7", "asym_id": "A"}, "number_of_candidates": 2000 } }' --return_type polymer_entity --output results.json

# Sequence motif match
uv run scripts/search_pdb.py --query '{ "type": "terminal", "service": "seqmotif", "parameters": { "value": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.", "pattern_type": "prosite", "sequence_type": "protein" } }' --return_type polymer_entity --output results.json

# Chemical descriptor match
uv run scripts/search_pdb.py --query '{ "type": "terminal", "service": "chemical", "parameters": { "value": "InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)", "type": "descriptor", "descriptor_type": "InChI", "match_type": "graph-strict" } }' --return_type mol_definition --output results.json
```

See https://search.rcsb.org/#search-services for more details.
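When the query sequence comes from a FASTA file, it can be extracted and wrapped in the sequence-search payload programmatically. This is a rough sketch: `read_fasta_sequence` and `sequence_query` are hypothetical helpers, and the payload simply mirrors the sequence similarity example in this section with the same default cutoffs.

```python
import json


def read_fasta_sequence(text: str) -> str:
    """Return the first record's sequence from FASTA text, uppercased."""
    sequence_lines: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if sequence_lines:
                break  # a second record starts here; keep only the first
            continue  # skip the first header line
        if line:
            sequence_lines.append(line)
    return "".join(sequence_lines).upper()


def sequence_query(sequence: str, identity_cutoff: float = 0.9,
                   evalue_cutoff: float = 1) -> dict:
    """Build a protein sequence similarity payload."""
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "evalue_cutoff": evalue_cutoff,
                "identity_cutoff": identity_cutoff,
                "sequence_type": "protein",
                "value": sequence,
            },
        },
        "request_options": {"scoring_strategy": "sequence"},
    }


fasta = ">example\nMTEYKLVVVGAGGVGKSALTIQ\nLIQNHFVDEYDPTIEDSYRKQ\n"
print(json.dumps(sequence_query(read_fasta_sequence(fasta))))
```

The printed JSON can be passed as the `--query` value with `--return_type polymer_entity`.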

Full text search workflow

Searches all text associated with an entry. Example:
```bash
uv run scripts/search_pdb.py --query '{ "type": "terminal", "service": "full_text", "parameters": { "value": "isopeptide + ( collagen | fibrinogen )" } }' --return_type entry --output results.json
```
Important: use `full_text` search as a last resort when there's no more precise attribute search available. Consider using the `struct.title` or `rcsb_pubmed_abstract_text` attributes instead.
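A more targeted alternative might look like the sketch below, which searches the two attributes named above for a single term and combines them with "or". The choice of `contains_phrase` and the helper name `text_attribute_query` are assumptions; confirm the attribute names and the operators they support against the fetched schema first.

```python
import json


def text_attribute_query(term: str) -> dict:
    """Search the title and abstract attributes for `term` instead of full text."""
    def node(attribute: str) -> dict:
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": attribute,
                "operator": "contains_phrase",
                "value": term,
            },
        }

    return {
        "type": "group",
        "logical_operator": "or",
        "nodes": [node("struct.title"), node("rcsb_pubmed_abstract_text")],
    }


print(json.dumps(text_attribute_query("isopeptide")))
```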

File download workflow

To download full PDB entries, use the `download_coordinate_files.py` script. Use this when you need access to atomic coordinates, when asked for a PDB / mmCIF file, or when asked to fetch a PDB code without further detail. Example:
```bash
uv run scripts/download_coordinate_files.py --ids "4HHB,6BEA" --format "mmcif" --output_dir <OUTPUT_DIR>
```
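User-supplied codes often arrive in mixed case or with stray whitespace. The sketch below normalizes them into the comma-separated form used by `--ids`. It assumes classic 4-character PDB IDs (a digit followed by three alphanumerics); the helper name `ids_argument` is hypothetical, and whether the script itself performs similar validation is not known here.

```python
import re

# Assumption: classic 4-character PDB IDs, e.g. 4HHB.
ENTRY_ID = re.compile(r"^[0-9][A-Z0-9]{3}$")


def ids_argument(raw_ids: list[str]) -> str:
    """Uppercase, validate, and de-duplicate IDs, then join them with commas."""
    cleaned: list[str] = []
    for raw in raw_ids:
        candidate = raw.strip().upper()
        if not ENTRY_ID.match(candidate):
            raise ValueError(f"not a 4-character PDB ID: {raw!r}")
        if candidate not in cleaned:  # keep first occurrence, preserve order
            cleaned.append(candidate)
    return ",".join(cleaned)
```

For example, `ids_argument([" 4hhb", "6BEA", "4HHB"])` yields `4HHB,6BEA`, while a free-text name such as `hemoglobin` raises `ValueError` so it can be resolved with a search first.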

Metadata query workflow

This flow is significantly more efficient than downloading full coordinate files when you only need a few pieces of metadata about each entry / entity.
  1. Fetch the schema for the relevant object type. E.g.
    uv run scripts/fetch_schema.py --api data_entry --output schema_entry.txt
  2. Grep the schema for relevant fields (one keyword at a time, many lines).
  3. Compose and run a GraphQL metadata query:
    uv run scripts/fetch_pdb_metadata.py --query '<GraphQL>' --output results.json
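For step 3, the GraphQL text can be assembled from a list of IDs rather than typed by hand. A minimal sketch, assuming an `entries(entry_ids: [...])` root field and a field selection chosen only for illustration; `entries_query` is a hypothetical helper.

```python
import json


def entries_query(entry_ids: list[str], fields: str) -> str:
    """Build a GraphQL query selecting `fields` for the given entry IDs."""
    # A JSON array of strings is also valid GraphQL list syntax.
    ids = json.dumps(entry_ids)
    return f"{{ entries(entry_ids: {ids}) {{ {fields} }} }}"


query = entries_query(["1STP", "2JEF", "1CDG"],
                      "rcsb_id struct { title } exptl { method }")
print(query)
```

The resulting string is what would be passed to `--query`; take the field names from the fetched schema rather than guessing them.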

For step 3: Example queries

```bash
# Fetch structure titles and experimental methods
uv run scripts/fetch_pdb_metadata.py --query '{ entries(entry_ids: ["1STP", "2JEF", "1CDG"]) { rcsb_id struct { title } exptl { method } } }' --output results.json

# Fetch polymer entity taxonomy and cluster membership
uv run scripts/fetch_pdb_metadata.py --query '{ polymer_entities(entity_ids:["2CPK_1","3WHM_1","2D5Z_1"]) { rcsb_id rcsb_entity_source_organism { ncbi_taxonomy_id ncbi_scientific_name } rcsb_cluster_membership { cluster_id identity } } }' --output results.json

# Fetch polymer entity external sequence database accessions
uv run scripts/fetch_pdb_metadata.py --query '{ entries(entry_ids:["7NHM", "5L2G"]){ polymer_entities { rcsb_id rcsb_polymer_entity_container_identifiers { reference_sequence_identifiers { database_accession database_name } } } } }' --output results.json
```
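Once a metadata result is saved, a short snippet can flatten it into rows. The sketch below handles the titles-and-methods query and assumes the output file holds a standard GraphQL envelope of the form `{"data": {"entries": [...]}}`; that layout and the helper name `title_and_methods` are assumptions to verify against the real file.

```python
def title_and_methods(response: dict) -> list[tuple[str, str, list[str]]]:
    """Flatten an entries response into (entry ID, title, methods) rows."""
    rows = []
    for entry in response["data"]["entries"]:
        # `exptl` is a list because an entry may report several methods.
        methods = [item["method"] for item in entry.get("exptl") or []]
        rows.append((entry["rcsb_id"], entry["struct"]["title"], methods))
    return rows
```

Load the file with `json.load` and pass the resulting dictionary to the function; printing the rows keeps the output compact compared with dumping the raw response.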