protein-sequence-similarity-search

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com, then (2) create the file recording the notification text and timestamp.
  3. .env file: Make sure the .env file exists in your home directory. Create one if it does not exist.
  4. USER_EMAIL (optional but recommended): Recommended by the EBI for BLAST job tracking, but the skill works without it. If the variable is missing from .env, do NOT ask the user to paste it into the chat (this would leak the value into the agent's context). Instead, give the user this bash command, substituting ENV_FILE with the resolved literal path to the .env file:
    printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."
    The scripts load credentials automatically via dotenv. NEVER read, print, or inspect the .env file or its variables (e.g. no cat, grep, echo, printenv, or os.environ.get on keys). Credentials must stay out of the agent's context.
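
The one-time user notification in step 2 can be sketched as follows. This is a minimal illustration, not part of the skill's scripts: the function names `pending_notice` and `record_notice`, the wording of `NOTICE_TEXT`, and the file layout (timestamp on the first line, then the text) are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

NOTICE_FILE = "LICENSE_NOTIFICATION.txt"
# Hypothetical wording; only the two URLs come from the instructions above.
NOTICE_TEXT = (
    "Before using this skill, please check the terms at "
    "https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com"
)


def pending_notice(skill_dir: Path) -> Optional[str]:
    """Return the notice to show, or None if it was already recorded."""
    if (skill_dir / NOTICE_FILE).exists():
        return None
    return NOTICE_TEXT


def record_notice(skill_dir: Path, text: str) -> Path:
    """Write a UTC timestamp and the notification text to the marker file."""
    marker = skill_dir / NOTICE_FILE
    stamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{stamp}\n{text}\n", encoding="utf-8")
    return marker
```

A caller would first show the text returned by `pending_notice` to the user, and only then call `record_notice`, matching the order (1) notify, (2) create the file.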

Goal

Take a user-provided amino acid sequence (or a path to a .fasta file), search for sequence homologues using the fastest available method, generate a Markdown-formatted table of the top hits, interpret key alignment metrics, summarize the inferred protein functions, and save results locally for future programmatic analysis.

Core Rules

  • Strict Validation: For BLAST, only use database codes listed in the table below.
  • No Hallucinations: If a script throws an error or returns no hits, inform the user clearly. Do NOT invent sequence homologues.
  • Do Not Parse Output Files: Do not parse the JSON, a3m, or any other raw output files. Rely on the generated .md file for your summary. The JSON and other outputs are for subsequent tool use only.
  • Always State the Method: Every report must clearly state whether the search used the quick MMseqs2 (ColabFold API) or the slower EBI BLAST method.
  • Notification: Whenever this skill is used, say so in the output, and explicitly state which program (MMseqs2 or EBI BLAST) and which sequence databases were used.

Search Method Selection

Choose the search method based on the user's request:
  • If the user says "quick search" or "fast search", asks for a general homologue search without naming a method, or if you are unsure: run MMseqs2 (fast, default) using mmseqs2_search.py.
  • If MMseqs2 fails (exit code 2: RATELIMIT or API error), or the user explicitly requests "BLAST" or a specific BLAST database (e.g. uniprotkb_swissprot, pdb, uniprotkb_human): run BLAST using uniprot_blast.py.
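
The selection rule above can be expressed as a small decision function. This is only a sketch: `choose_method` and the keyword matching are hypothetical, and `EXAMPLE_DATABASES` holds just the three database codes quoted above rather than the full list.

```python
import re
from typing import Optional

# Only the example codes named in the rule above; the full list is longer.
EXAMPLE_DATABASES = {"uniprotkb_swissprot", "pdb", "uniprotkb_human"}


def choose_method(request: str, mmseqs2_exit_code: Optional[int] = None) -> str:
    """Return "blast" or "mmseqs2" for a free-text user request."""
    # A failed MMseqs2 run (exit code 2) always falls back to BLAST.
    if mmseqs2_exit_code == 2:
        return "blast"
    # Split into lowercase word tokens so "pdb" matches only as a whole word.
    tokens = set(re.findall(r"[a-z0-9_]+", request.lower()))
    if "blast" in tokens or tokens & EXAMPLE_DATABASES:
        return "blast"
    # Quick/fast, unspecified, or uncertain requests use the default.
    return "mmseqs2"
```

For instance, `choose_method("search pdb for this sequence")` selects BLAST, while a plain "find homologues" request selects MMseqs2.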

Instructions

  1. Identify the query from the user. It can be a raw sequence string (e.g., "MKVLY...") or a path to a local file (e.g., "./data/sequence.fasta").
  2. Determine the search method using the list above.

Path A: MMseqs2 Search (Default)

  1. Generate File Names: Generate descriptive output file names based on the input (e.g., proteinA_mmseqs2.json and proteinA_mmseqs2.md).
  2. Execute the MMseqs2 script:
    • Default:
    uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>
    • With mgnify:
    uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --include-mgnify
  3. The script will query the ColabFold MMseqs2 API and poll for completion. This is typically fast (under 2 minutes).
  4. If the script exits with code 2 (API failure, rate limit), automatically fall back to BLAST (Path B below). Inform the user: "MMseqs2 search failed, falling back to BLAST."
  5. Read the Results: Open and read the generated .md file.
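
Path A can be sketched as a thin Python wrapper around the documented command line. The helper names (`mmseqs2_command`, `run_mmseqs2`), the `label` argument used to build file names, and the handling of exit codes other than 0 and 2 are assumptions; only the script path, the `-o`, `-j`, and `--include-mgnify` flags, and the meaning of exit code 2 come from the steps above.

```python
import subprocess
from typing import List, Optional


def mmseqs2_command(query: str, label: str, include_mgnify: bool = False) -> List[str]:
    """Build the documented `uv run` command with descriptive output names."""
    cmd = [
        "uv", "run", "scripts/mmseqs2_search.py", query,
        "-o", f"{label}_mmseqs2.md",
        "-j", f"{label}_mmseqs2.json",
    ]
    if include_mgnify:
        cmd.append("--include-mgnify")
    return cmd


def run_mmseqs2(query: str, label: str, include_mgnify: bool = False) -> Optional[str]:
    """Run the search; return the .md path, or None if BLAST should be used."""
    result = subprocess.run(mmseqs2_command(query, label, include_mgnify))
    if result.returncode == 2:
        # Exit code 2 signals an API failure or rate limit.
        print("MMseqs2 search failed, falling back to BLAST.")
        return None
    if result.returncode != 0:
        # Assumption: any other failure is reported rather than retried.
        raise RuntimeError(f"mmseqs2_search.py exited with code {result.returncode}")
    return f"{label}_mmseqs2.md"
```

Passing the command as a list avoids shell quoting problems when the query is a file path containing spaces.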

Path B: BLAST Search (Explicit or Fallback)

  1. Database Selection & Validation: Determine the most appropriate database(s) based on the user's prompt.
    • Consult the Available BLAST Databases table below.
    • If the user specifies a taxonomic group (e.g., "Find homologues in microbes"), select the corresponding Database Code (e.g., uniprotkb_bacteria).
    • If the user explicitly requests curated hits, use uniprotkb_swissprot.
    • If no specific database is requested, do not specify --databases.
    • Validation: Ensure the database code exactly matches an entry in the table. If the user requests a database not on the list, do not proceed and provide the allowed list.
  2. Generate File Names: (e.g., proteinA_ebi_blast.json and proteinA_ebi_blast.md).
  3. When USER_EMAIL is set in .env, the script includes the address in the request header sent to the EBI API. As noted under Prerequisites, setting it is recommended, but the skill works without it.
  4. Execute the BLAST script:
    • Default (uniprotkb):
    uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>
    • Custom database:
    uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --databases <db1,db2>
  5. The script will query the EBI BLAST API and poll the server. Note: This can take up to 15 minutes; wait patiently.
  6. Read the Results: Open and read the generated .md file.
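
The validation rule in step 1 and the command in step 4 can be sketched together. The function names `validate_databases` and `blast_command` and the `label` argument are hypothetical; the database codes are copied from the Available BLAST Databases list in this document, and the comma-separated `--databases` value follows the `<db1,db2>` form shown above.

```python
from typing import List, Optional

ALLOWED_DATABASES = frozenset({
    "uniprotkb", "uniprotkb_swissprot", "uniprotkb_swissprotsv",
    "uniprotkb_reference_proteomes", "uniprotkb_trembl",
    "uniprotkb_refprotswissprot", "uniprotkb_archaea", "uniprotkb_arthropoda",
    "uniprotkb_bacteria", "uniprotkb_complete_microbial_proteomes",
    "uniprotkb_eukaryota", "uniprotkb_fungi", "uniprotkb_human",
    "uniprotkb_mammals", "uniprotkb_nematoda", "uniprotkb_rodents",
    "uniprotkb_vertebrates", "uniprotkb_viridiplantae", "uniprotkb_viruses",
    "uniprotkb_enzyme", "uniprotkb_covid19",
    "uniref100", "uniref90", "uniref50", "pdb",
})


def validate_databases(requested: List[str]) -> List[str]:
    """Reject any code that is not an exact match for an allowed entry."""
    unknown = [db for db in requested if db not in ALLOWED_DATABASES]
    if unknown:
        allowed = ", ".join(sorted(ALLOWED_DATABASES))
        raise ValueError(f"Unknown database(s) {unknown}; allowed: {allowed}")
    return requested


def blast_command(query: str, label: str,
                  databases: Optional[List[str]] = None) -> List[str]:
    """Build the documented BLAST command; omit --databases when none is given."""
    cmd = [
        "uv", "run", "scripts/uniprot_blast.py", query,
        "-o", f"{label}_ebi_blast.md",
        "-j", f"{label}_ebi_blast.json",
    ]
    if databases:
        cmd += ["--databases", ",".join(validate_databases(databases))]
    return cmd
```

Raising on an unknown code, with the allowed list in the message, mirrors the instruction to stop and show the user the permitted databases.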

Common Steps (Both Methods)

  1. Interpret the Metrics: Summarize the top 3 to 5 sequence homologues. Assess match quality using:
    • Q-Cov (Query Coverage): High percentages mean the match covers most of the query sequence.
    • E-value: Lower E-values indicate stronger statistical significance; a value such as 1e-50 means a match of that quality is extremely unlikely to occur by chance.
    • Seq Identity: Provides evolutionary context (highly conserved vs. distant homologue).
  2. Perform Functional Analysis:
    • If the results table includes protein descriptions, analyze them directly: report specific protein names/functions of the top homologues and summarize the variety of functions, domains, or protein families found.
    • If the results contain only UniProt accession IDs without descriptions (common with MMseqs2), look up the protein names and functions for the top 3–5 hits using the uniprot-database skill or other appropriate methods before summarizing.
  3. Inform the user of both newly created files (.json and .md) and their locations.
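
To make the metric interpretation in step 1 concrete, here is a small sketch that turns the three values of one hit, as read from the generated .md table, into a one-line description. The function `describe_hit` and every cut-off in it (80% coverage, E-values of 1e-50 and 1e-5, 70% identity) are arbitrary illustrative assumptions, not thresholds defined by this skill.

```python
def describe_hit(q_cov: float, e_value: float, identity: float) -> str:
    """Summarize one hit; q_cov and identity are percentages (0-100)."""
    # Query coverage: how much of the query the alignment spans.
    if q_cov >= 80:
        coverage = "covers most of the query"
    else:
        coverage = "covers only part of the query"

    # E-value: lower means the match is less likely to arise by chance.
    if e_value <= 1e-50:
        significance = "highly significant"
    elif e_value <= 1e-5:
        significance = "significant"
    else:
        significance = "weak"

    # Sequence identity: rough evolutionary context.
    relation = "highly conserved" if identity >= 70 else "more distant homologue"

    return f"{significance} match, {coverage}, {relation}"


print(describe_hit(95.0, 1e-80, 88.0))
```

The three metrics are reported side by side rather than combined into a single score, since a low E-value with low coverage (for example, a single shared domain) reads very differently from a full-length match.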

Available BLAST Databases

  • uniprotkb – UniProt Knowledgebase (The UniProt Knowledgebase includes UniProtKB/Swiss-Prot and UniProtKB/TrEMBL): The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. Search UniProtKB to retrieve "everything that is known" about a particular sequence.
  • uniprotkb_swissprot – UniProtKB/Swiss-Prot (The manually annotated section of UniProtKB): The manually curated subsection of the UniProt Knowledgebase.
  • uniprotkb_swissprotsv – UniProtKB/Swiss-Prot isoforms (The manually annotated isoforms of UniProtKB/Swiss-Prot): The isoform sequences for the manually curated subsection of the UniProt Knowledgebase.
  • uniprotkb_reference_proteomes – UniProtKB Reference Proteomes: Taxonomic subset of the UniProtKB Reference Proteomes.
  • uniprotkb_trembl – UniProtKB/TrEMBL (The automatically annotated section of UniProtKB): Subsection of the UniProt Knowledgebase derived from ENA Sequence (formerly EMBL-Bank) coding sequence translations with annotation produced by an automated process.
  • uniprotkb_refprotswissprot – UniProtKB Reference Proteomes plus Swiss-Prot: UniProtKB Reference Proteomes plus Swiss-Prot.
  • uniprotkb_archaea – UniProtKB Archaea: Taxonomic subset of the UniProt Knowledgebase for archaea.
  • uniprotkb_arthropoda – UniProtKB Arthropoda: Taxonomic subset of the UniProt Knowledgebase for arthropoda.
  • uniprotkb_bacteria – UniProtKB Bacteria: Taxonomic subset of the UniProt Knowledgebase for bacteria.
  • uniprotkb_complete_microbial_proteomes – UniProtKB Complete Microbial Proteomes: Taxonomic subset of the UniProt Knowledgebase for complete microbial proteomes.
  • uniprotkb_eukaryota – UniProtKB Eukaryota: Taxonomic subset of the UniProt Knowledgebase for eukaryota.
  • uniprotkb_fungi – UniProtKB Fungi: Taxonomic subset of the UniProt Knowledgebase for fungi.
  • uniprotkb_human – UniProtKB Human: Taxonomic subset of the UniProt Knowledgebase for human.
  • uniprotkb_mammals – UniProtKB Mammals: Taxonomic subset of the UniProt Knowledgebase for mammals.
  • uniprotkb_nematoda – UniProtKB Nematoda: Taxonomic subset of the UniProt Knowledgebase for nematoda.
  • uniprotkb_rodents – UniProtKB Rodents: Taxonomic subset of the UniProt Knowledgebase for rodents.
  • uniprotkb_vertebrates – UniProtKB Vertebrates: Taxonomic subset of the UniProt Knowledgebase for vertebrates.
  • uniprotkb_viridiplantae – UniProtKB Viridiplantae: Taxonomic subset of the UniProt Knowledgebase for viridiplantae.
  • uniprotkb_viruses – UniProtKB Viruses: Taxonomic subset of the UniProt Knowledgebase for viruses.
  • uniprotkb_enzyme – UniProtKB Enzyme: Taxonomic subset of the UniProt Knowledgebase for enzymes.
  • uniprotkb_covid19 – UniProtKB COVID-19: Taxonomic subset of the UniProt Knowledgebase for COVID-19.
  • uniref100 – UniProt Clusters 100% (UniRef100): The UniProt Reference Clusters (UniRef) containing sequences which are 100% identical.
  • uniref90 – UniProt Clusters 90% (UniRef90): The UniProt Reference Clusters (UniRef) containing sequences which are 90% identical.
  • uniref50 – UniProt Clusters 50% (UniRef50): The UniProt Reference Clusters (UniRef) containing sequences which are 50% identical.
  • pdb – Protein Structure Sequences (PDBe protein structure sequences): Protein sequences from structures described in the Brookhaven Protein Data Bank (PDB).