protein-sequence-similarity-search
Prerequisites
- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com, then (2) create the file recording the notification text and timestamp.
- `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
- `USER_EMAIL` (optional but recommended): Recommended by the EBI for BLAST job tracking, but the skill works without it. If the variable is missing from `.env`, do NOT ask the user to paste it into the chat (this would leak the value into the agent's context). Instead, give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:
  `printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."`
  The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
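The user-notification prerequisite can be sketched in Python as follows. This is a minimal illustration and not part of the skill's own scripts: the function name `ensure_license_notification`, the `skill_dir` and `notify` parameters, and the exact layout of the file (notice text followed by a UTC timestamp) are assumptions. It deliberately never touches `.env`.

```python
from datetime import datetime, timezone
from pathlib import Path

# Notice text (wording is an assumption); the two URLs come from the prerequisites.
NOTICE = (
    "Please check the terms of use at "
    "https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com"
)


def ensure_license_notification(skill_dir, notify=print):
    """Notify once per skill directory; return True if a notification was sent."""
    marker = Path(skill_dir) / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified on an earlier run
    notify(NOTICE)  # step (1): show the notice prominently
    stamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and timestamp
    marker.write_text(f"{NOTICE}\nNotified at: {stamp}\n", encoding="utf-8")
    return True
```

Calling the function a second time for the same directory returns `False` and sends nothing, which matches the "does not already exist" condition.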
Goal
Take a user-provided amino acid sequence (or a path to a `.fasta` file), search
for sequence homologues using the fastest available method, generate a
Markdown-formatted table of the top hits, interpret key alignment metrics,
summarize the inferred protein functions, and save results locally for future
programmatic analysis.
Core Rules
- Strict Validation: For BLAST, only use database codes listed in the table below.
- No Hallucinations: If a script throws an error or returns no hits, inform the user clearly. Do NOT invent sequence homologues.
- Do Not Parse Output Files: Do not parse the JSON, a3m, or any other raw output files. Rely on the generated `.md` file for your summary. The JSON and other outputs are for subsequent tool use only.
- Always State the Method: Every report must clearly state whether the search used the quick MMseqs2 (ColabFold API) or the slower EBI BLAST method.
- Notification: If this skill is used, say so in the output. Explicitly state which program (MMseqs2 or EBI BLAST) and which sequence databases were used.
Search Method Selection
Choose the search method based on the user's request:
- If the user says "quick search" or "fast search", requests no specific method (a general homologue search), or if you are unsure: run MMseqs2 (fast, default) using `mmseqs2_search.py`.
- If MMseqs2 fails (exit code 2: RATELIMIT or API error), or the user explicitly requests "BLAST" or a specific BLAST database (e.g. `uniprotkb_swissprot`, `pdb`, `uniprotkb_human`): run BLAST using `uniprot_blast.py`.
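The selection rule can be written as a small decision function. This is only a sketch of the logic described here: the name `choose_method`, the keyword matching on the request text, and the use of just the three example database codes are assumptions made for illustration.

```python
import re
from typing import Optional

# Only the three example codes named in the rule; the full list is longer.
BLAST_DATABASE_EXAMPLES = {"uniprotkb_swissprot", "pdb", "uniprotkb_human"}


def choose_method(request: str, mmseqs2_exit_code: Optional[int] = None) -> str:
    """Return "mmseqs2" (the default) or "blast"."""
    if mmseqs2_exit_code == 2:
        return "blast"  # MMseqs2 already failed: rate limit or API error
    words = set(re.findall(r"[a-z0-9_]+", request.lower()))
    if "blast" in words or words & BLAST_DATABASE_EXAMPLES:
        return "blast"  # explicit BLAST request or a BLAST database code
    return "mmseqs2"  # quick/fast, unspecified, or unsure
```

Anything that does not clearly ask for BLAST falls through to MMseqs2, which mirrors the "if you are unsure" default.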
Instructions
- Identify the query from the user. It can be a raw sequence string (e.g., "MKVLY...") or a path to a local file (e.g., "./data/sequence.fasta").
- Determine the search method using the list above.
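The first step, telling a raw sequence from a file path, might look like the sketch below. The search scripts accept either form directly, so this is purely illustrative. The function name `classify_query` and the accepted alphabet (the 20 standard residues plus `X` for unknown) are assumptions, and FASTA text pasted with a `>` header line is not handled.

```python
from pathlib import Path

# 20 standard amino acid codes plus X (unknown); an assumed alphabet.
RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def classify_query(query: str) -> str:
    """Return "sequence" or "file"; raise ValueError if the query is neither."""
    compact = "".join(query.split()).upper()  # drop whitespace and line breaks
    if compact and set(compact) <= RESIDUES:
        return "sequence"
    try:
        if Path(query.strip()).is_file():
            return "file"
    except OSError:
        pass  # e.g. a string too long to be a valid file name
    raise ValueError("Query is neither a residue string nor an existing file")
```

The residue check runs first so that a long sequence string is never passed to the file system as if it were a path.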
Path A: MMseqs2 Search (Default)
- Generate File Names: Generate descriptive output file names based on the input (e.g., `proteinA_mmseqs2.json` and `proteinA_mmseqs2.md`).
- Execute the MMseqs2 script:
  - Default: `uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>`
  - With mgnify: `uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --include-mgnify`
- The script will query the ColabFold MMseqs2 API and poll for completion. This is typically fast (under 2 minutes).
- If the script exits with code 2 (API failure, rate limit), automatically fall back to BLAST (Path B below). Inform the user: "MMseqs2 search failed, falling back to BLAST."
- Read the Results: Open and read the generated `.md` file.
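The Path A steps can be sketched as a thin Python wrapper around the documented command line. The helper names (`mmseqs2_command`, `run_mmseqs2`), the `label` argument used to build file names, and the injectable `runner` are assumptions. Only the script path, the `-o`/`-j`/`--include-mgnify` flags, and the meaning of exit code 2 come from the steps above.

```python
import subprocess


def mmseqs2_command(query, label, include_mgnify=False):
    """Build the documented `uv run` command for a query and a file-name label."""
    cmd = [
        "uv", "run", "scripts/mmseqs2_search.py", query,
        "-o", f"{label}_mmseqs2.md",
        "-j", f"{label}_mmseqs2.json",
    ]
    if include_mgnify:
        cmd.append("--include-mgnify")
    return cmd


def run_mmseqs2(query, label, include_mgnify=False, runner=subprocess.run):
    """Run the search; return "ok", "fallback_to_blast", or "error"."""
    result = runner(mmseqs2_command(query, label, include_mgnify))
    if result.returncode == 0:
        return "ok"  # next step: read the generated .md file
    if result.returncode == 2:
        print("MMseqs2 search failed, falling back to BLAST.")
        return "fallback_to_blast"
    return "error"  # report the failure to the user; never invent hits
```

Passing a stub as `runner` makes the fallback logic testable without contacting the ColabFold API.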
Path B: BLAST Search (Explicit or Fallback)
- Database Selection & Validation: Determine the most appropriate database(s) based on the user's prompt.
  - Consult the Available BLAST Databases table below.
  - If the user specifies a taxonomic group (e.g., "Find homologues in microbes"), select the corresponding database code (e.g., `uniprotkb_bacteria`).
  - If the user explicitly requests curated hits, use `uniprotkb_swissprot`.
  - If no specific database is requested, do not specify `--databases`.
  - Validation: Ensure the database code exactly matches an entry in the table. If the user requests a database not on the list, do not proceed and provide the allowed list.
- Generate File Names: Generate descriptive output file names (e.g., `proteinA_ebi_blast.json` and `proteinA_ebi_blast.md`).
- This API requires the user email address to be set in the USER_EMAIL environment variable for inclusion in the request header.
- Execute the BLAST script:
  - Default (uniprotkb): `uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>`
  - Custom database: `uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --databases <db1,db2>`
- The script will query the EBI BLAST API and poll the server. Note: This can take up to 15 minutes; wait patiently.
- Read the Results: Open and read the generated `.md` file.
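The validation and command-building steps of Path B can be sketched as follows. The function name `blast_command` and the `label` argument are assumptions. The database codes are copied from the Available BLAST Databases section, and the flags are the ones shown in the commands above.

```python
# Database codes copied from the Available BLAST Databases section.
ALLOWED_DATABASES = frozenset({
    "uniprotkb", "uniprotkb_swissprot", "uniprotkb_swissprotsv",
    "uniprotkb_reference_proteomes", "uniprotkb_trembl",
    "uniprotkb_refprotswissprot", "uniprotkb_archaea", "uniprotkb_arthropoda",
    "uniprotkb_bacteria", "uniprotkb_complete_microbial_proteomes",
    "uniprotkb_eukaryota", "uniprotkb_fungi", "uniprotkb_human",
    "uniprotkb_mammals", "uniprotkb_nematoda", "uniprotkb_rodents",
    "uniprotkb_vertebrates", "uniprotkb_viridiplantae", "uniprotkb_viruses",
    "uniprotkb_enzyme", "uniprotkb_covid19",
    "uniref100", "uniref90", "uniref50", "pdb",
})


def blast_command(query, label, databases=None):
    """Build the `uv run` command; raise ValueError for unlisted databases."""
    cmd = [
        "uv", "run", "scripts/uniprot_blast.py", query,
        "-o", f"{label}_ebi_blast.md",
        "-j", f"{label}_ebi_blast.json",
    ]
    if databases:  # no database requested: leave --databases out entirely
        unknown = [db for db in databases if db not in ALLOWED_DATABASES]
        if unknown:
            allowed = ", ".join(sorted(ALLOWED_DATABASES))
            raise ValueError(
                f"Unsupported database(s): {', '.join(unknown)}. Allowed: {allowed}"
            )
        cmd += ["--databases", ",".join(databases)]
    return cmd
```

The error message carries the full allowed list, so it can be shown to the user as-is when a request names an unsupported database.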
Common Steps (Both Methods)
- Interpret the Metrics: Summarize the top 3 to 5 sequence homologues. Assess match quality using:
  - Q-Cov (Query Coverage): High percentages mean the match covers most of the query sequence.
  - E-value: Lower E-values (e.g., `1e-50`) indicate extreme statistical significance.
  - Seq Identity: Provides evolutionary context (highly conserved vs. distant homologue).
- Perform Functional Analysis:
  - If the results table includes protein descriptions, analyze them directly: report specific protein names/functions of the top homologues and summarize the variety of functions, domains, or protein families found.
  - If the results contain only UniProt accession IDs without descriptions (common with MMseqs2), look up the protein names and functions for the top 3–5 hits using the uniprot-database skill or other appropriate methods before summarizing.
- Inform the user of both newly created files (`.json` and `.md`) and their locations.
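How the three metrics combine into a one-line assessment can be illustrated with a short sketch. The `Hit` class, the function name `describe_hit`, and the coverage and identity cutoffs (80% and 70%) are assumptions chosen for illustration, not thresholds defined by this skill. Only the `1e-50` example comes from the text. The values are assumed to have been read from the generated `.md` report, not parsed from the raw output files.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str   # placeholder ID, not a real accession
    q_cov: float     # query coverage, percent (0-100)
    e_value: float   # expected number of chance matches
    identity: float  # sequence identity, percent (0-100)


def describe_hit(hit, e_cutoff=1e-50, cov_cutoff=80.0, identity_cutoff=70.0):
    """Summarize match quality; all cutoffs are illustrative assumptions."""
    notes = [
        "covers most of the query" if hit.q_cov >= cov_cutoff else "partial coverage",
        "extremely significant" if hit.e_value <= e_cutoff else "weaker statistical support",
        "highly conserved" if hit.identity >= identity_cutoff else "more distant homologue",
    ]
    return f"{hit.accession}: " + ", ".join(notes)
```

For example, `describe_hit(Hit("HIT_1", 95.0, 1e-80, 88.0))` returns `"HIT_1: covers most of the query, extremely significant, highly conserved"`.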
Available BLAST Databases
- `uniprotkb` – UniProt Knowledgebase (The UniProt Knowledgebase includes UniProtKB/Swiss-Prot and UniProtKB/TrEMBL): The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. Search UniProtKB to retrieve "everything that is known" about a particular sequence
- `uniprotkb_swissprot` – UniProtKB/Swiss-Prot (The manually annotated section of UniProtKB): The manually curated subsection of the UniProt Knowledgebase
- `uniprotkb_swissprotsv` – UniProtKB/Swiss-Prot isoforms (The manually annotated isoforms of UniProtKB/Swiss-Prot): The isoform sequences for the manually curated subsection of the UniProt Knowledgebase
- `uniprotkb_reference_proteomes` – UniProtKB Reference Proteomes: Taxonomic subset of the UniProtKB Reference Proteomes
- `uniprotkb_trembl` – UniProtKB/TrEMBL (The automatically annotated section of UniProtKB): Subsection of the UniProt Knowledgebase derived from ENA Sequence (formerly EMBL-Bank) coding sequence translations with annotation produced by an automated process
- `uniprotkb_refprotswissprot` – UniProtKB Reference Proteomes plus Swiss-Prot: UniProtKB Reference Proteomes plus Swiss-Prot
- `uniprotkb_archaea` – UniProtKB Archaea: Taxonomic subset of the UniProt Knowledgebase for archaea
- `uniprotkb_arthropoda` – UniProtKB Arthropoda: Taxonomic subset of the UniProt Knowledgebase for arthropoda
- `uniprotkb_bacteria` – UniProtKB Bacteria: Taxonomic subset of the UniProt Knowledgebase for bacteria
- `uniprotkb_complete_microbial_proteomes` – UniProtKB Complete Microbial Proteomes: Taxonomic subset of the UniProt Knowledgebase for complete microbial proteomes
- `uniprotkb_eukaryota` – UniProtKB Eukaryota: Taxonomic subset of the UniProt Knowledgebase for eukaryota
- `uniprotkb_fungi` – UniProtKB Fungi: Taxonomic subset of the UniProt Knowledgebase for fungi
- `uniprotkb_human` – UniProtKB Human: Taxonomic subset of the UniProt Knowledgebase for human
- `uniprotkb_mammals` – UniProtKB Mammals: Taxonomic subset of the UniProt Knowledgebase for mammals
- `uniprotkb_nematoda` – UniProtKB Nematoda: Taxonomic subset of the UniProt Knowledgebase for nematoda
- `uniprotkb_rodents` – UniProtKB Rodents: Taxonomic subset of the UniProt Knowledgebase for rodents
- `uniprotkb_vertebrates` – UniProtKB Vertebrates: Taxonomic subset of the UniProt Knowledgebase for vertebrates
- `uniprotkb_viridiplantae` – UniProtKB Viridiplantae: Taxonomic subset of the UniProt Knowledgebase for viridiplantae
- `uniprotkb_viruses` – UniProtKB Viruses: Taxonomic subset of the UniProt Knowledgebase for viruses
- `uniprotkb_enzyme` – UniProtKB Enzyme: Taxonomic subset of the UniProt Knowledgebase for enzymes
- `uniprotkb_covid19` – UniProtKB COVID-19: Taxonomic subset of the UniProt Knowledgebase for COVID-19
- `uniref100` – UniProt Clusters 100% (UniRef100): The UniProt Reference Clusters (UniRef) containing sequences which are 100% identical.
- `uniref90` – UniProt Clusters 90% (UniRef90): The UniProt Reference Clusters (UniRef) containing sequences which are 90% identical.
- `uniref50` – UniProt Clusters 50% (UniRef50): The UniProt Reference Clusters (UniRef) containing sequences which are 50% identical.
- `pdb` – Protein Structure Sequences (PDBe protein structure sequences): Protein sequences from structures described in the Brookhaven Protein Data Bank (PDB)