foldseek-structural-search

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://search.foldseek.com/search and https://github.com/steineggerlab/foldseek, then (2) create the file recording the notification text and timestamp.
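The one-time notification in step 2 could be handled along the following lines. This is a minimal sketch, not part of the skill itself: the function name `ensure_license_notification`, the wording of the notice, and the timestamp format are assumptions made for illustration. Only the file name and the two URLs come from the step above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical notice wording; only the two URLs are taken from the skill text.
NOTICE = (
    "Before using Foldseek, please check the terms at "
    "https://search.foldseek.com/search and "
    "https://github.com/steineggerlab/foldseek."
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify the user once; return True if a new notification was issued."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        # The user was already notified in an earlier run.
        return False
    # (1) Prominently notify the user.
    print(f"*** NOTICE ***\n{NOTICE}")
    # (2) Record the notification text and a UTC timestamp.
    timestamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{timestamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling the function a second time with the same directory prints nothing and returns False, because the marker file already exists.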

Goal

Submit a user-provided 3D protein structure file (.cif, .mmcif, or .pdb) to the Foldseek web server API to find structurally similar proteins. Report the top structural hits, interpret key alignment metrics, summarize the inferred protein functions, save the Markdown-formatted table to a .md file, and save the full detailed results to a local JSON file.

Core Rules

  • File Requirement: This tool absolutely cannot search by sequence, name, or accession ID. It strictly requires a path to a .pdb, .cif, or .mmcif file.
  • Strict Validation: Never bypass the input validation or the database allowlist check.
  • Do Not Parse the JSON: Rely entirely on the generated .md file for your immediate summary. The JSON is saved purely for subsequent, specialized tool use.
  • No Raw Parsing: Do not attempt to parse or read the raw 3D coordinates yourself; always pass the file to the script.
  • Notification: If this skill is used, ensure this is mentioned in the output.

Instructions

  1. Strict Input Validation: Verify that the user has explicitly provided a valid path to a .cif, .mmcif, or .pdb file in their workspace.
    • If the user provided a protein name, an amino acid sequence, or an accession ID (e.g., a UniProt ID) but NO downloaded structure file, halt immediately. Do not run the script.
    • Inform the user that Foldseek requires a physical 3D coordinate file, and suggest downloading the structure first (e.g., using the AlphaFold fetch tool).
  2. Database Validation: Check if the user requested specific databases to search.
    • Allowed List: afdb50, afdb-swissprot, pdb100, BFVD, mgnify_esm30, cath50, gmgcl_id, bfmd, afdb-proteome.
    • If the user requests a database NOT on this list, halt immediately. Do not run the script. Inform the user that the database is unsupported and provide them with the allowed list.
  3. Generate File Names: Generate descriptive output file names for both the JSON data and the Markdown table based on the input file (e.g., proteinA_foldseek_results.json and proteinA_foldseek_results.md).
  4. Execute the Python script based on the user's request, redirecting the standard output into your generated .md file:
    • Default (No databases specified):
      uv run scripts/search.py <path-to-file> -o <generated-filename.json> > <generated-filename.md>
    • Custom (Valid databases specified):
      uv run scripts/search.py <path-to-file> -o <generated-filename.json> --databases <db1,db2,db3> > <generated-filename.md>
  5. The script will query the databases, save the full JSON payload, and write a Markdown-formatted table to your specified .md file.
  6. Read the Results: Open and read the newly generated .md file carefully to view the Markdown table.
  7. Interpret the Metrics: Summarize for the user the top 3 to 5 structural matches that have meaningful annotations. When reporting, assess the match quality using these specific fields:
    • Prob (Probability): Values approaching 1.0 (100%) indicate very high confidence that the hit is a true structural homologue.
    • Q-Cov (Query Coverage): High percentages mean the match covers the majority of the query protein's overall shape, rather than just a small local motif.
    • E-value & Seq Identity: Use these to provide additional evolutionary context.
  8. Perform Functional Analysis: Analyze the text descriptions embedded within the Target ID column for the reported matches.
    • Explicitly report the specific protein names/functions of the top structural homologues.
    • Provide a synthesized overview summarizing the entire variety of different functions, domains, or protein families found across the whole list of homologues (e.g., "Most hits are portal proteins, but there is also a distinct cluster of viral capsid matches...").
  9. Explicitly inform the user of both newly created files (.json and .md) and their locations so they can be seamlessly used in subsequent analysis steps.
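Steps 1 to 4 above (input validation, database allowlist check, file naming, and command construction) can be sketched as a single pre-flight function. This is an illustrative sketch only: the helper name `plan_search` is hypothetical, the case-insensitive extension check and exact-match database comparison are assumptions, and the command layout simply mirrors the two invocations shown in step 4.

```python
from pathlib import Path
from typing import List, Optional, Tuple

ALLOWED_EXTENSIONS = {".cif", ".mmcif", ".pdb"}
ALLOWED_DATABASES = {
    "afdb50", "afdb-swissprot", "pdb100", "BFVD", "mgnify_esm30",
    "cath50", "gmgcl_id", "bfmd", "afdb-proteome",
}


def plan_search(
    structure_path: str, databases: Optional[List[str]] = None
) -> Tuple[List[str], str, str]:
    """Validate inputs and return (command, json_name, md_name).

    Raises instead of returning a command whenever a check fails,
    so the script is never run on invalid input.
    """
    path = Path(structure_path)
    # Step 1: names, sequences, and accession IDs have no allowed extension.
    # Assumption: the extension is compared case-insensitively.
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(
            f"{structure_path!r} is not a .cif, .mmcif, or .pdb file; "
            "Foldseek needs a 3D coordinate file."
        )
    if not path.is_file():
        raise FileNotFoundError(f"Structure file not found: {structure_path}")
    # Step 2: every requested database must be on the allowlist.
    # Assumption: names are matched exactly, including case.
    unsupported = [db for db in (databases or []) if db not in ALLOWED_DATABASES]
    if unsupported:
        raise ValueError(
            f"Unsupported database(s): {', '.join(unsupported)}. "
            f"Allowed: {', '.join(sorted(ALLOWED_DATABASES))}"
        )
    # Step 3: derive descriptive output names from the input file name.
    json_name = f"{path.stem}_foldseek_results.json"
    md_name = f"{path.stem}_foldseek_results.md"
    # Step 4: build the command; stdout is redirected to md_name by the caller.
    command = ["uv", "run", "scripts/search.py", str(path), "-o", json_name]
    if databases:
        command += ["--databases", ",".join(databases)]
    return command, json_name, md_name
```

Raising an exception on any failed check keeps the "halt immediately" behavior explicit: there is no code path that produces a command for an unsupported database or a non-structure input.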

* If the API returns an error or the file is missing, inform the user clearly and ask them to verify the file path.
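One way to turn failures into a clear user-facing message is to wrap the script invocation, as in the sketch below. The function name `run_and_report` is hypothetical, and the sketch assumes that scripts/search.py signals an API error through a non-zero exit code and a message on standard error, which the skill text does not state.

```python
import subprocess
from pathlib import Path
from typing import List


def run_and_report(structure_path: str, command: List[str], md_path: str) -> str:
    """Run the search command, write its stdout to md_path, return a message."""
    # A missing input file is reported before anything is executed.
    if not Path(structure_path).is_file():
        return (
            f"Error: file not found: {structure_path}. "
            "Please verify the file path."
        )
    try:
        # Equivalent to the shell redirection `command > md_path`.
        with open(md_path, "w", encoding="utf-8") as md_file:
            result = subprocess.run(
                command, stdout=md_file, stderr=subprocess.PIPE, text=True
            )
    except OSError as exc:
        # Raised, for example, when the executable is not on PATH.
        return f"Error: could not start the search: {exc}"
    if result.returncode != 0:
        # Assumption: the script exits non-zero when the API returns an error.
        # Note that md_path may be left empty or partial in this case.
        return (
            f"Error: the search failed (exit code {result.returncode}): "
            f"{result.stderr.strip()} Please verify the file path."
        )
    return f"Results table written to {md_path}."
```

The returned string can be relayed to the user as-is, so every failure mode ends in an explicit message rather than a silent empty .md file.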