protein-sequence-similarity-search


Prerequisites

- uv: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com, then (2) create the file, recording the notification text and a timestamp.
- `.env` file: Make sure a `.env` file exists in your home directory. Create one if it does not exist.
- USER_EMAIL (optional but recommended): The EBI recommends a contact email for BLAST job tracking, but the skill works without it. If the variable is missing from `.env`, do NOT ask the user to paste it into the chat (this would leak the value into the agent's context). Instead, give the user this command, substituting ENV_FILE with the resolved literal path to the `.env` file:

```bash
printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."
```

The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
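The User Notification prerequisite above can be sketched as a small check-then-record helper. This is a minimal illustration, not part of the skill's scripts: the function name `ensure_license_notification`, the `NOTICE` wording, and the file layout (timestamp on the first line) are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

# Hypothetical wording; the skill only requires that the two URLs are shown.
NOTICE = (
    "Please check the terms of use at "
    "https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com "
    "before using this skill."
)


def ensure_license_notification(
    skill_dir: Path, notify: Callable[[str], None] = print
) -> bool:
    """Notify once per skill directory; return True if a notice was shown."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified on an earlier run
    notify(NOTICE)  # step (1): show the notice prominently
    timestamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and when it was shown
    marker.write_text(f"{timestamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling the helper a second time with the same directory does nothing, which matches the "if the file does not already exist" condition.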

Goal

Take a user-provided amino acid sequence (or a path to a `.fasta` file), search for sequence homologues using the fastest available method, generate a Markdown-formatted table of the top hits, interpret key alignment metrics, summarize the inferred protein functions, and save results locally for future programmatic analysis.

Core Rules

- Strict Validation: For BLAST, only use database codes listed in the Available BLAST Databases table below.
- No Hallucinations: If a script throws an error or returns no hits, inform the user clearly. Do NOT invent sequence homologues.
- Do Not Parse Output Files: Do not parse the JSON, a3m, or any other raw output files. Rely on the generated `.md` file for your summary. The JSON and other outputs are for subsequent tool use only.
- Always State the Method: Every report must clearly state whether the search used the quick MMseqs2 (ColabFold API) method or the slower EBI BLAST method.
- Notification: Whenever this skill is used, say so in the output. Explicitly state which program (MMseqs2 or EBI BLAST) and which sequence databases were used.

Search Method Selection

Choose the search method based on the user's request:

- If the user says "quick search" or "fast search", requests no specific method (a general homologue search), or if you are unsure: run MMseqs2 (fast, default) using `mmseqs2_search.py`.
- If MMseqs2 fails (exit code 2: rate limit or API error), or the user explicitly requests "BLAST" or a specific BLAST database (e.g. `uniprotkb_swissprot`, `pdb`, `uniprotkb_human`): run BLAST using `uniprot_blast.py`.
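The selection rule above can be written down as a small decision function. This is only a sketch under stated assumptions: the function name, the word-level matching on the request text, and the short trigger list (the three example databases from the rule, not the full table) are illustrative and not taken from the skill's scripts.

```python
import re
from typing import Optional

# Only the example codes named in the rule above; a fuller version could
# check every code in the Available BLAST Databases table.
BLAST_TRIGGER_CODES = {"uniprotkb_swissprot", "pdb", "uniprotkb_human"}


def choose_search_method(request: str, mmseqs2_exit_code: Optional[int] = None) -> str:
    """Return "blast" or "mmseqs2" for a free-text user request."""
    if mmseqs2_exit_code == 2:
        return "blast"  # MMseqs2 already failed (rate limit or API error)
    # Compare whole words so that e.g. "pdb" does not match inside another word.
    words = set(re.findall(r"[a-z0-9_]+", request.lower()))
    if "blast" in words or words & BLAST_TRIGGER_CODES:
        return "blast"
    return "mmseqs2"  # quick search, no method named, or unsure
```

MMseqs2 is the fall-through result, which mirrors the "if you are unsure" clause: anything that does not clearly ask for BLAST takes the fast path.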

Instructions

Identify the query from the user. It can be a raw sequence string (e.g., "MKVLY...") or a path to a local file (e.g., "./data/sequence.fasta").
Determine the search method using the list above.

Path A: MMseqs2 Search (Default)

- Generate File Names: Generate descriptive output file names based on the input (e.g., `proteinA_mmseqs2.json` and `proteinA_mmseqs2.md`).
- Execute the MMseqs2 script:
  - Default: `uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>`
  - With MGnify: `uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --include-mgnify`
- The script queries the ColabFold MMseqs2 API and polls for completion. This is typically fast (under 2 minutes).
- If the script exits with code 2 (API failure or rate limit), automatically fall back to BLAST (Path B below). Inform the user: "MMseqs2 search failed, falling back to BLAST."
- Read the Results: Open and read the generated `.md` file.
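The Path A steps, including the exit-code-2 fallback, can be sketched as below. The command lines and file-name patterns come from this document; the helper names (`output_names`, `mmseqs2_command`, `blast_command`, `search_with_fallback`) and the injectable `run` parameter are assumptions made for illustration.

```python
import subprocess
from typing import Callable, List, Tuple


def output_names(label: str, method: str) -> Tuple[str, str]:
    """Build descriptive .md and .json names, e.g. proteinA_mmseqs2.md."""
    stem = f"{label}_{method}"
    return f"{stem}.md", f"{stem}.json"


def mmseqs2_command(query: str, label: str, include_mgnify: bool = False) -> List[str]:
    md, js = output_names(label, "mmseqs2")
    cmd = ["uv", "run", "scripts/mmseqs2_search.py", query, "-o", md, "-j", js]
    if include_mgnify:
        cmd.append("--include-mgnify")
    return cmd


def blast_command(query: str, label: str) -> List[str]:
    md, js = output_names(label, "ebi_blast")
    return ["uv", "run", "scripts/uniprot_blast.py", query, "-o", md, "-j", js]


def run_command(cmd: List[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def search_with_fallback(
    query: str, label: str, run: Callable[[List[str]], int] = run_command
) -> Tuple[str, int]:
    """Try MMseqs2 first; on exit code 2 fall back to BLAST.

    Returns the method to state in the report and the final exit code.
    Any other non-zero code is returned as-is so the caller can inform the user.
    """
    code = run(mmseqs2_command(query, label))
    if code == 2:
        print("MMseqs2 search failed, falling back to BLAST.")
        return "EBI BLAST", run(blast_command(query, label))
    return "MMseqs2 (ColabFold API)", code
```

Returning the method name alongside the exit code keeps the "Always State the Method" rule easy to satisfy, since the report can quote the first element directly.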

Path B: BLAST Search (Explicit or Fallback)

Database Selection & Validation: Determine the most appropriate database(s) based on the user's prompt.
- Consult the Available BLAST Databases table below.
- If the user specifies a taxonomic group (e.g., "Find homologues in microbes"), select the corresponding `Database Code` (e.g., `uniprotkb_bacteria`).
- If the user explicitly requests curated hits, use `uniprotkb_swissprot`.
- If no specific database is requested, do not specify `--databases`.
- Validation: Ensure the database code exactly matches an entry in the table. If the user requests a database not on the list, do not proceed; provide the allowed list instead.

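Generate File Names: (e.g., `proteinA_ebi_blast.json` and `proteinA_ebi_blast.md`).

The EBI API takes the user's email address in the request header, which the script reads from the USER_EMAIL environment variable. Per Prerequisites, setting the variable is recommended, but the skill works without it.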

Execute the BLAST script:
- Default (uniprotkb): `uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>`
- Custom database: `uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --databases <db1,db2>`

The script queries the EBI BLAST API and polls the server. Note: This can take up to 15 minutes; wait patiently.

Read the Results: Open and read the generated `.md` file.
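The validation rule in the database selection step (exact match against the table, otherwise stop and show the allowed list) can be sketched as follows. The codes are copied from the Available BLAST Databases table in this document; the function name `databases_argument` and the use of `ValueError` are assumptions for illustration.

```python
from typing import List

# Database codes from the Available BLAST Databases table.
ALLOWED_DATABASES = frozenset({
    "uniprotkb", "uniprotkb_swissprot", "uniprotkb_swissprotsv",
    "uniprotkb_reference_proteomes", "uniprotkb_trembl",
    "uniprotkb_refprotswissprot", "uniprotkb_archaea", "uniprotkb_arthropoda",
    "uniprotkb_bacteria", "uniprotkb_complete_microbial_proteomes",
    "uniprotkb_eukaryota", "uniprotkb_fungi", "uniprotkb_human",
    "uniprotkb_mammals", "uniprotkb_nematoda", "uniprotkb_rodents",
    "uniprotkb_vertebrates", "uniprotkb_viridiplantae", "uniprotkb_viruses",
    "uniprotkb_enzyme", "uniprotkb_covid19",
    "uniref100", "uniref90", "uniref50", "pdb",
})


def databases_argument(requested: List[str]) -> List[str]:
    """Return the extra CLI arguments for the requested databases.

    An empty request yields no arguments, so the script's default applies.
    Any code that is not an exact match raises ValueError listing the
    allowed codes, so the search does not proceed.
    """
    if not requested:
        return []
    unknown = [db for db in requested if db not in ALLOWED_DATABASES]
    if unknown:
        allowed = ", ".join(sorted(ALLOWED_DATABASES))
        raise ValueError(f"Unknown database(s) {unknown}. Allowed: {allowed}")
    return ["--databases", ",".join(requested)]
```

The returned list can be appended to the `uv run scripts/uniprot_blast.py ...` command shown above; matching is case-sensitive on purpose, since the rule asks for an exact match.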

Common Steps (Both Methods)

Interpret the Metrics: Summarize the top 3 to 5 sequence homologues. Assess match quality using:
- Q-Cov (Query Coverage): The percentage of the query covered by the alignment. High values mean the match spans most of the query sequence rather than a single region.
- E-value: The number of hits of equal or better score expected by chance in a database of that size. Lower is better; a value such as `1e-50` indicates extremely high statistical significance.
- Seq Identity: The percentage of identical residues in the alignment. It provides evolutionary context (highly conserved vs. distant homologue).

Perform Functional Analysis:
- If the results table includes protein descriptions, analyze them directly: report specific protein names/functions of the top homologues and summarize the variety of functions, domains, or protein families found.
- If the results contain only UniProt accession IDs without descriptions (common with MMseqs2), look up the protein names and functions for the top 3–5 hits using the uniprot-database skill or other appropriate methods before summarizing.

Inform the user of both newly created files (`.json` and `.md`) and their locations.
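To make the three metrics concrete, here is a small sketch that ranks hits by E-value and phrases a one-line assessment. Everything in it is illustrative: the `Hit` structure, the placeholder accessions, and especially the 70% coverage and identity cutoffs are assumptions, not thresholds defined by this skill. The values would be the ones read from the generated `.md` table, not parsed from the raw output files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    accession: str         # placeholder IDs in the example below
    evalue: float          # lower means more significant
    query_coverage: float  # percent of the query covered by the alignment
    identity: float        # percent identical residues


def top_hits(hits: List[Hit], n: int = 5) -> List[Hit]:
    """Return the n most significant hits (smallest E-value first)."""
    return sorted(hits, key=lambda h: h.evalue)[:n]


def describe(hit: Hit, min_coverage: float = 70.0, close_identity: float = 70.0) -> str:
    """One-line assessment using assumed, adjustable cutoffs."""
    span = "full-length" if hit.query_coverage >= min_coverage else "partial"
    kinship = "close" if hit.identity >= close_identity else "distant"
    return f"{hit.accession}: E-value {hit.evalue:.0e}, {span} match, {kinship} homologue"


hits = [
    Hit("HIT_A", 1e-50, 98.0, 85.0),
    Hit("HIT_B", 1e-5, 40.0, 28.0),
    Hit("HIT_C", 1e-120, 100.0, 99.0),
]
for hit in top_hits(hits, 3):
    print(describe(hit))
```

A hit like `HIT_B` shows why the metrics are read together: its E-value is still below 1, but low coverage and low identity suggest a shared domain rather than a full-length relative.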

Available BLAST Databases

| Database Code | Name | Description |
| --- | --- | --- |
| `uniprotkb` | UniProt Knowledgebase (includes UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) | The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. Search UniProtKB to retrieve "everything that is known" about a particular sequence. |
| `uniprotkb_swissprot` | UniProtKB/Swiss-Prot (the manually annotated section of UniProtKB) | The manually curated subsection of the UniProt Knowledgebase. |
| `uniprotkb_swissprotsv` | UniProtKB/Swiss-Prot isoforms (the manually annotated isoforms of UniProtKB/Swiss-Prot) | The isoform sequences for the manually curated subsection of the UniProt Knowledgebase. |
| `uniprotkb_reference_proteomes` | UniProtKB Reference Proteomes | Taxonomic subset of the UniProtKB Reference Proteomes. |
| `uniprotkb_trembl` | UniProtKB/TrEMBL (the automatically annotated section of UniProtKB) | Subsection of the UniProt Knowledgebase derived from ENA Sequence (formerly EMBL-Bank) coding sequence translations, with annotation produced by an automated process. |
| `uniprotkb_refprotswissprot` | UniProtKB Reference Proteomes plus Swiss-Prot | UniProtKB Reference Proteomes plus Swiss-Prot. |
| `uniprotkb_archaea` | UniProtKB Archaea | Taxonomic subset of the UniProt Knowledgebase for archaea. |
| `uniprotkb_arthropoda` | UniProtKB Arthropoda | Taxonomic subset of the UniProt Knowledgebase for arthropoda. |
| `uniprotkb_bacteria` | UniProtKB Bacteria | Taxonomic subset of the UniProt Knowledgebase for bacteria. |
| `uniprotkb_complete_microbial_proteomes` | UniProtKB Complete Microbial Proteomes | Taxonomic subset of the UniProt Knowledgebase for complete microbial proteomes. |
| `uniprotkb_eukaryota` | UniProtKB Eukaryota | Taxonomic subset of the UniProt Knowledgebase for eukaryota. |
| `uniprotkb_fungi` | UniProtKB Fungi | Taxonomic subset of the UniProt Knowledgebase for fungi. |
| `uniprotkb_human` | UniProtKB Human | Taxonomic subset of the UniProt Knowledgebase for human. |
| `uniprotkb_mammals` | UniProtKB Mammals | Taxonomic subset of the UniProt Knowledgebase for mammals. |
| `uniprotkb_nematoda` | UniProtKB Nematoda | Taxonomic subset of the UniProt Knowledgebase for nematoda. |
| `uniprotkb_rodents` | UniProtKB Rodents | Taxonomic subset of the UniProt Knowledgebase for rodents. |
| `uniprotkb_vertebrates` | UniProtKB Vertebrates | Taxonomic subset of the UniProt Knowledgebase for vertebrates. |
| `uniprotkb_viridiplantae` | UniProtKB Viridiplantae | Taxonomic subset of the UniProt Knowledgebase for viridiplantae. |
| `uniprotkb_viruses` | UniProtKB Viruses | Taxonomic subset of the UniProt Knowledgebase for viruses. |
| `uniprotkb_enzyme` | UniProtKB Enzyme | Taxonomic subset of the UniProt Knowledgebase for enzymes. |
| `uniprotkb_covid19` | UniProtKB COVID-19 | Taxonomic subset of the UniProt Knowledgebase for COVID-19. |
| `uniref100` | UniProt Clusters 100% (UniRef100) | The UniProt Reference Clusters (UniRef) containing sequences which are 100% identical. |
| `uniref90` | UniProt Clusters 90% (UniRef90) | The UniProt Reference Clusters (UniRef) containing sequences which are 90% identical. |
| `uniref50` | UniProt Clusters 50% (UniRef50) | The UniProt Reference Clusters (UniRef) containing sequences which are 50% identical. |
| `pdb` | Protein Structure Sequences (PDBe protein structure sequences) | Protein sequences from structures described in the Brookhaven Protein Data Bank (PDB). |
