foldseek-structural-search

Prerequisites

uv: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
User Notification: If `LICENSE_NOTIFICATION.txt` does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://search.foldseek.com/search and https://github.com/steineggerlab/foldseek, and (2) create the file, recording the notification text and a timestamp.
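
The notify-once behavior described above could be sketched as follows. This is only an illustration: the function name `ensure_license_notification`, the wording of `NOTICE_TEXT`, and the layout of the recorded file are assumptions, not part of the skill.

```python
from datetime import datetime, timezone
from pathlib import Path

NOTICE_FILE = "LICENSE_NOTIFICATION.txt"
# Hypothetical wording; the skill only requires that the user is pointed at these URLs.
NOTICE_TEXT = (
    "Please check the terms at https://search.foldseek.com/search and "
    "https://github.com/steineggerlab/foldseek before using this skill."
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify the user once; return True if a notification was issued on this call."""
    record = skill_dir / NOTICE_FILE
    if record.exists():
        # A previous run already notified the user and recorded it.
        return False
    print(NOTICE_TEXT)  # step (1): prominently notify the user
    stamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and a timestamp
    record.write_text(f"{stamp}\n{NOTICE_TEXT}\n", encoding="utf-8")
    return True
```

Calling the function a second time with the same directory does nothing, because the record file already exists.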

Goal

Submit a user-provided 3D protein structure file (`.cif`, `.mmcif`, or `.pdb`) to the Foldseek web server API to find structurally similar proteins. Report the top structural hits, interpret key alignment metrics, summarize the inferred protein functions, save the Markdown-formatted table to a `.md` file, and save the full detailed results to a local JSON file.

Core Rules

File Requirement: This tool absolutely cannot search by sequence, name, or accession ID. It strictly requires a `.pdb`, `.cif`, or `.mmcif` file path.
Strict Validation: Never bypass the input validation or the database allowlist check.
Do Not Parse the JSON: Rely entirely on the generated `.md` file for your immediate summary. The JSON is saved purely for subsequent, specialized tool use.
No Raw Parsing: Do not attempt to parse or read the raw 3D coordinates yourself; always pass the file to the script.
Notification: If this skill is used, ensure this is mentioned in the output.

Instructions

Strict Input Validation: Verify that the user has explicitly provided a valid path to a `.cif`, `.mmcif`, or `.pdb` file in their workspace.
- If the user provided a protein name, an amino acid sequence, or an accession ID (e.g., a UniProt ID) but NO downloaded structure file, halt immediately. Do not run the script.
- Inform the user that Foldseek requires a physical 3D coordinate file, and suggest downloading the structure first (e.g., using the AlphaFold fetch tool).
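
As a rough illustration of the input check above, the sketch below accepts only an existing file with one of the three extensions. The helper name `validate_structure_path` is hypothetical, and treating the extension case-insensitively is an assumption; the actual validation in `scripts/search.py` may differ.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".pdb", ".cif", ".mmcif"}


def validate_structure_path(raw_path: str) -> Path:
    """Return the path if it names an existing structure file, otherwise raise."""
    path = Path(raw_path).expanduser()
    # Assumption: extensions are compared case-insensitively.
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(
            f"{raw_path!r} is not a .pdb, .cif, or .mmcif file; Foldseek needs "
            "a 3D coordinate file, not a name, sequence, or accession ID."
        )
    if not path.is_file():
        raise FileNotFoundError(f"Structure file not found: {path}")
    return path
```

An accession ID such as `P69905` has no recognized extension, so it is rejected before any search is attempted.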
Database Validation: Check if the user requested specific databases to search.
- Allowed List: `afdb50`, `afdb-swissprot`, `pdb100`, `BFVD`, `mgnify_esm30`, `cath50`, `gmgcl_id`, `bfmd`, `afdb-proteome`.
- If the user requests a database NOT on this list, halt immediately. Do not run the script. Inform the user that the database is unsupported and provide them with the allowed list.
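
The allowlist check might look like the sketch below. The function name `validate_databases` is hypothetical, and exact, case-sensitive matching is an assumption based on how the names are written in the list.

```python
ALLOWED_DATABASES = (
    "afdb50", "afdb-swissprot", "pdb100", "BFVD", "mgnify_esm30",
    "cath50", "gmgcl_id", "bfmd", "afdb-proteome",
)


def validate_databases(requested: list[str]) -> str:
    """Return the comma-separated value for --databases, or raise if any name is unsupported."""
    # Assumption: names must match the allowlist exactly, including case.
    unsupported = [db for db in requested if db not in ALLOWED_DATABASES]
    if unsupported:
        raise ValueError(
            f"Unsupported database(s): {', '.join(unsupported)}. "
            f"Allowed: {', '.join(ALLOWED_DATABASES)}"
        )
    return ",".join(requested)
```

The error message carries the full allowed list, so it can be passed on to the user as-is.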
Generate File Names: Generate descriptive output file names for both the JSON data and the Markdown table based on the input file (e.g., `proteinA_foldseek_results.json` and `proteinA_foldseek_results.md`).
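
One simple way to derive both names from the input file is sketched here. The helper name `output_names` is hypothetical; the `_foldseek_results` suffix follows the example names above.

```python
from pathlib import Path


def output_names(structure_path: str) -> tuple[str, str]:
    """Derive the JSON and Markdown output names from the input file's stem."""
    stem = Path(structure_path).stem  # "structures/proteinA.pdb" -> "proteinA"
    base = f"{stem}_foldseek_results"
    return f"{base}.json", f"{base}.md"


json_name, md_name = output_names("structures/proteinA.pdb")
# json_name == "proteinA_foldseek_results.json"
# md_name == "proteinA_foldseek_results.md"
```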

Execute the Python script based on the user's request, redirecting the standard output into your generated `.md` file:

Default (No databases specified):

uv run scripts/search.py <path-to-file> -o <generated-filename.json> > <generated-filename.md>

Custom (Valid databases specified):

uv run scripts/search.py <path-to-file> -o <generated-filename.json> --databases <db1,db2,db3> > <generated-filename.md>

The script will query the databases, save the full JSON payload, and write a Markdown-formatted table to your specified `.md` file.
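
If the command is launched from Python rather than a shell, the same invocation and redirection could be sketched as below. The helper names `build_command` and `run_search` are hypothetical; the `-o` and `--databases` flags are the ones shown in the commands above.

```python
import subprocess
from typing import Optional


def build_command(structure: str, json_out: str, databases: Optional[str] = None) -> list[str]:
    """Assemble the argument list for the search script."""
    cmd = ["uv", "run", "scripts/search.py", structure, "-o", json_out]
    if databases:
        # Comma-separated names that already passed the allowlist check.
        cmd += ["--databases", databases]
    return cmd


def run_search(structure: str, json_out: str, md_out: str,
               databases: Optional[str] = None) -> int:
    """Run the search with stdout redirected into the Markdown file; return the exit code."""
    with open(md_out, "w", encoding="utf-8") as md_file:
        result = subprocess.run(
            build_command(structure, json_out, databases),
            stdout=md_file,  # equivalent of `> <generated-filename.md>`
            check=False,
        )
    return result.returncode
```

Passing the arguments as a list avoids shell quoting problems when the file path contains spaces. A nonzero return code is a cue to report the failure to the user instead of reading the `.md` file.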
Read the Results: Open and read the newly generated `.md` file carefully to view the Markdown table.
Interpret the Metrics: Summarize the top 3 to 5 structural matches that have meaningful annotations for the user. When reporting, assess the match quality using these specific fields:
- Prob (Probability): Values approaching 1.0 (100%) indicate high confidence that the match is a true structural homologue.
- Q-Cov (Query Coverage): High percentages mean the match covers the majority of the query protein's overall shape, rather than just a small local motif.
- E-value & Seq Identity: Use these to provide additional evolutionary context.
Perform Functional Analysis: Analyze the text descriptions embedded within the `Target ID` column for the reported matches.
- Explicitly report the specific protein names/functions of the top structural homologues.
- Provide a synthesized overview summarizing the entire variety of different functions, domains, or protein families found across the whole list of homologues (e.g., "Most hits are portal proteins, but there is also a distinct cluster of viral capsid matches...").
Explicitly inform the user of both newly created files (`.json` and `.md`) and their locations so they can be seamlessly used in subsequent analysis steps.

* If the API returns an error or the file is missing, inform the user clearly and ask them to verify the file path.
