AlphaFold Database: Fetch and Analyze

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://alphafold.ebi.ac.uk/, then (2) create the file recording the notification text and timestamp.
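The notification step can be sketched as a small idempotent check. This is a minimal illustration, not part of the skill itself: the function name `ensure_license_notification`, the `NOTICE` wording, and the one-line timestamp layout are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical wording; the skill only requires pointing the user at the terms URL.
NOTICE = (
    "Please check the AlphaFold Database terms at "
    "https://alphafold.ebi.ac.uk/ before using downloaded data."
)


def ensure_license_notification(skill_dir: Path, notify=print) -> bool:
    """Notify once, then record the notice. Returns True if a notice was shown."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # already notified on an earlier run
    notify(NOTICE)  # step (1): show the notice to the user
    stamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and timestamp
    marker.write_text(f"{stamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling the function a second time in the same directory does nothing, which matches the "does not already exist" condition above.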

Overview

Downloads AlphaFold predicted structures (mmCIF) and Predicted Aligned Error (PAE) matrices from the AlphaFold Database for a given UniProt ID, then performs automated heuristic analysis on structural confidence (pLDDT), intrinsically disordered regions, rigid domain boundaries, and inter-domain flexibility.
Do NOT use when:
  • The user only has a protein name, gene name, or amino acid sequence (no UniProt ID) — ask them to look up the ID on UniProt.
  • The user wants to search for structural homologs (use Foldseek).
  • The user wants to run AlphaFold predictions on a custom sequence.
  • The user needs experimental PDB structures (use RCSB PDB).
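Before running anything, it can help to check that the input actually looks like a UniProt accession rather than a gene name or a sequence. The sketch below uses the accession pattern published in the UniProt help pages; the helper name `looks_like_uniprot_accession` is hypothetical, and a match only means the format is plausible, not that the entry exists in UniProt or in the AlphaFold Database.

```python
import re

# Accession format as documented by UniProt (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(text: str) -> bool:
    """Return True if the whole string has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(text.strip().upper()) is not None
```

If the check fails, the guidance above applies: ask the user to look up the ID on UniProt instead of guessing one.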

Core Rules

  • Use the Wrapper: ALWAYS execute the provided helper scripts to query the database rather than accessing the database directly. The scripts enforce the required rate limit automatically.
  • Do not attempt to calculate domain boundaries or assess structural disorder yourself; always rely on the output provided by the script.
  • If this skill is used, ensure this is mentioned in the output.

Utility Scripts

1. Fetch Structure Files
Downloads the .cif structure file, _predicted_aligned_error.json, and API metadata JSON (-metadata.json) for a UniProt ID. Handles fragment fallback for very large proteins.
Examples:
```bash
uv run scripts/fetch_structure.py P00520 -o /path/to/output/
uv run scripts/fetch_structure.py P04637 -o /path/to/custom_results/
```
Always specify -o with an absolute path or a path relative to the user's project root, never a path relative to the skill directory.
2. Analyze pLDDT Confidence
Reads pLDDT confidence metrics from a saved AFDB metadata JSON file (produced by fetch_structure.py) and prints a heuristic confidence assessment (structured, disordered, mixed).
Example:
```bash
uv run scripts/analyze_plddt.py ./data/AF-P00520-F1-metadata.json
```
3. Analyze PAE / Domain Boundaries
Reads a downloaded PAE JSON file and detects rigid domain boundaries using a sliding-window PAE heuristic.
Example:
```bash
uv run scripts/analyze_pae.py ./data/AF-P00520-F1-predicted_aligned_error_v6.json
```
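The three scripts are usually run in sequence, so a thin driver can make the hand-off between them explicit. The following is a sketch under stated assumptions: the helper names (`build_command`, `run_script`, `find_outputs`, `warning_lines`) and the `skill_dir` parameter are hypothetical, and the glob patterns are inferred only from the example filenames above (the `.cif` pattern in particular is a guess). Files are located by pattern rather than by a hard-coded version suffix, and the `[!] WARNING` marker is the one described under Interpreting the Output.

```python
import subprocess
from pathlib import Path


def build_command(script: str, *args: str) -> list[str]:
    """Return the argv for running one helper script through uv."""
    return ["uv", "run", f"scripts/{script}", *args]


def run_script(skill_dir: Path, script: str, *args: str) -> str:
    """Run a helper script from the skill directory and return its stdout."""
    result = subprocess.run(
        build_command(script, *args),
        cwd=skill_dir,          # scripts/ is resolved relative to the skill
        capture_output=True,
        text=True,
        check=True,             # raise if the script exits with an error
    )
    return result.stdout


def find_outputs(out_dir: Path, uniprot_id: str) -> dict[str, list[Path]]:
    """Locate downloaded files by pattern (patterns are assumptions)."""
    prefix = f"AF-{uniprot_id}-"
    return {
        "metadata": sorted(out_dir.glob(f"{prefix}*-metadata.json")),
        "pae": sorted(out_dir.glob(f"{prefix}*predicted_aligned_error*.json")),
        "cif": sorted(out_dir.glob(f"{prefix}*.cif")),
    }


def warning_lines(stdout: str) -> list[str]:
    """Collect the lines a script flagged with the [!] WARNING marker."""
    return [ln.strip() for ln in stdout.splitlines() if "[!] WARNING" in ln]
```

Because `run_script` changes the working directory to the skill directory, any `-o` value passed through it should be an absolute path, in line with the rule above. If the fetch script falls back to an isoform, the filenames may not start with the requested ID, so an empty result from `find_outputs` is worth checking against the script's own output.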

Interpreting the Output

The script prints analysis to stdout. Read it carefully and synthesize the results for the user:
  1. Isoform / Large Protein Warning (MANDATORY): Check the script output for any [!] WARNING lines. If the script reports that no canonical entry was found and an isoform was used, or if the protein is very large (>2700 AAs), you MUST prominently relay this warning to the user. Do not omit this warning.
  2. Synthesize the Structural Analysis: Combine the "pLDDT Conclusion" and the "PAE Structural Conclusion" into a single, cohesive overall summary. Describe the protein's overall folding confidence, the presence of disordered regions, and its rigid domain layout.
  3. Highlight the supporting metrics:
    • Overall Global pLDDT and the breakdown of fraction confidence (especially Very Low vs. Very High).
    • Domain Boundary Analysis (number of distinct global domains and their specific residue ranges).
  4. Explicit Disorder Warning: If the analysis concludes that the protein is highly intrinsically disordered (e.g., high fraction of <50 pLDDT or lack of rigid domains), issue a separate, prominent warning. Advise the user against proceeding with whole-protein downstream structural analysis (like Foldseek or docking). If small ordered domains exist amidst the disorder, advise the user to restrict any future analysis strictly to those specific residue boundaries.
  5. Remind the user that per-residue pLDDT is embedded in the B-factor column of the downloaded mmCIF file.
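To make the last point concrete, the sketch below reads per-residue pLDDT out of the B-factor column of an mmCIF file using only the standard library. It assumes the `_atom_site` table is written as a `loop_` with simple whitespace-separated rows and no quoted fields, which is an assumption about AlphaFold Database files rather than a general mmCIF parser; the function name `plddt_per_residue` is hypothetical. It only extracts values and does not replace the scripts' own confidence assessment.

```python
def plddt_per_residue(cif_text: str) -> dict[int, float]:
    """Map residue number -> pLDDT, read from the B-factor of each CA atom."""
    columns: list[str] = []
    scores: dict[int, float] = {}
    in_atom_site = False
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Header line of the loop: remember the column name and order.
            columns.append(line.split(".", 1)[1])
            in_atom_site = True
            continue
        if not in_atom_site:
            continue
        if not line or line.startswith(("#", "loop_", "_", "data_")):
            break  # the atom_site table has ended
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip rows this simple splitter cannot handle
        row = dict(zip(columns, fields))
        # One value per residue: take the alpha carbon of polymer atoms.
        if row.get("group_PDB") == "ATOM" and row.get("label_atom_id") == "CA":
            scores[int(row["label_seq_id"])] = float(row["B_iso_or_equiv"])
    return scores
```

Typical use would be `plddt_per_residue(Path(cif_path).read_text())`, with the resulting dictionary used for plotting or for trimming a structure to the residue ranges the scripts report.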