protein-sequence-msa
Prerequisites
- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/jdispatcher/msa/clustalo and https://www.ebi.ac.uk/about/terms-of-use/, then (2) create the file recording the notification text and timestamp.
- `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
- `USER_EMAIL` (optional but recommended): Recommended by the EBI for Clustal Omega job tracking, but the skill works without it. If the variable is missing from `.env`, do NOT ask the user to paste it into the chat (this would leak the value into the agent's context). Instead, give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:

  ```bash
  printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."
  ```

  The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
Core Rules
- Use the Wrapper: ALWAYS execute the alignment using `scripts/msa_align.py` rather than writing your own curl or custom Python requests. The script automatically enforces the required rate limit to respect EBI's Terms of Use.
- Notification: If this skill is used, ensure this is mentioned in the output.
- Always state the method: Every report must clearly state that the alignment was performed using EBI Clustal Omega.
- No Hallucinations: Do NOT invent alignments or conservation metrics. Report only what is present in the alignment file.
Goal
Take a file containing multiple protein sequences in FASTA format, perform
multiple sequence alignment using the EBI Clustal Omega API, save the resulting
alignment locally for future programmatic analysis, and interpret the results
towards addressing the user's specific research objective (e.g., assessing
similarity, identifying conserved domains, or analyzing key residues).
Instructions
- Prepare Input File: The input must be a plain text file containing two or more protein sequences in FASTA format. Each sequence header must start with a `>` symbol. Example:

  ```
  >Sequence_1_Name
  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ
  QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
  >Sequence_2_Name
  MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ
  QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
  ```

- Execute Alignment: Run the alignment script:

  ```bash
  uv run scripts/msa_align.py <INPUT_FASTA> -o <OUTPUT_FILE>
  ```

  Always specify the output file with `-o` or `--output`.

- Interpret and Report Results: Analyze the Clustal Omega alignment by selecting metrics and mapping strategies aligned with the research objective. Note that while Clustal Omega produces a global alignment, pairwise metrics can be extracted to evaluate specific relationships within the set:
  - Identity Metric Options: The choice of denominator determines how insertions/deletions (gaps) affect the final percentage. Select the most appropriate calculation based on the biological context:
    - Pairwise - Sequence Coverage: `(Identical Residue Matches) / (Length of Shorter Sequence)`. Use when determining if a specific domain or fragment is fully preserved within a larger protein. This ignores gaps in the longer sequence, focusing purely on the "content" of the shorter one.
    - Pairwise - Global Identity: `(Identical Residue Matches) / (Total Alignment Columns)`. Use when comparing full-length sequences of similar expected length. This is the most conservative metric; it penalizes for all gaps (indels) introduced by any sequence in the MSA.
    - Pairwise - Overlap Identity: `(Identical Residue Matches) / (Total Alignment Columns - Terminal Gaps)`. Use when comparing a fragment to a full-length protein or when sequences have long unaligned "tails." This focuses on similarity only where the sequences physically overlap.
    - Multisequence - Conservation Index: `(Fully Conserved Columns) / (Total Alignment Columns)`. Use for quantifying the percentage of residues that are 100% identical across the entire alignment set. This identifies the core evolutionary signature of the protein family.
  - Feature Mapping: Leverage known biological data from specific sequences to ground the analysis:
    - Knowledge Gathering: Identify relevant known sites or regions (e.g., catalytic residues, binding motifs) from your input or via external tools.
    - Coordinate Projection: Map these features onto the corresponding Column Indices of the alignment.
    - Targeted Discussion: Use these columns to drive the assessment:
      - Local Conservation: Analyze if the known functional residues are invariant across the set.
      - Region-Specific Metrics: Calculate identity/similarity specifically within the mapped functional regions rather than the whole sequence.
      - Goal Contribution: Discuss how this data contributes to your goal, e.g. using conservation to corroborate a prediction or divergence to reject a functional hypothesis.
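The identity metrics listed under Identity Metric Options can be sketched in a few lines of Python. This is an illustrative sketch, not part of `scripts/msa_align.py`: the function names (`pairwise_identity`, `conservation_index`) are hypothetical, the aligned rows are assumed to be already loaded as equal-length strings with `-` as the gap character, and "terminal gaps" is interpreted here as the columns outside the span where both rows have started and neither has ended.

```python
GAP = "-"


def _residue_span(row):
    """Return (first, last) column indices that hold a residue in an aligned row."""
    cols = [i for i, ch in enumerate(row) if ch != GAP]
    if not cols:
        raise ValueError("aligned row contains no residues")
    return cols[0], cols[-1]


def pairwise_identity(a, b):
    """Compute the three pairwise identity variants for two rows of the same MSA."""
    if len(a) != len(b) or not a:
        raise ValueError("rows must be non-empty and of equal aligned length")
    (a_first, a_last), (b_first, b_last) = _residue_span(a), _residue_span(b)
    # A match is a column where both rows carry the same residue (gap vs gap never counts).
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    shorter_len = min(len(a.replace(GAP, "")), len(b.replace(GAP, "")))
    # Assumption: columns outside the shared span are the "terminal gaps".
    overlap_cols = min(a_last, b_last) - max(a_first, b_first) + 1
    return {
        "sequence_coverage": matches / shorter_len,
        "global_identity": matches / len(a),
        "overlap_identity": matches / overlap_cols if overlap_cols > 0 else 0.0,
    }


def conservation_index(rows):
    """Fraction of alignment columns in which every row has the same residue."""
    rows = list(rows)
    if not rows or not rows[0] or len({len(r) for r in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal aligned length")
    conserved = sum(
        1 for col in zip(*rows) if col[0] != GAP and len(set(col)) == 1
    )
    return conserved / len(rows[0])


# Toy alignment for illustration only; not real Clustal Omega output.
alignment = {
    "seq1": "--MKTAYIAKQR",
    "seq2": "GSMKTAYLAK--",
    "seq3": "--MKSAYIAKQ-",
}
metrics = pairwise_identity(alignment["seq1"], alignment["seq2"])
index = conservation_index(alignment.values())
```

Whichever variant is reported, state the denominator alongside the percentage so the reader can tell how gaps were treated.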
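Coordinate Projection and Local Conservation from the Feature Mapping step can be illustrated the same way. The sketch below assumes the alignment is held as a mapping of sequence name to gapped row, that known features are given as 1-based positions in the ungapped reference sequence, and that alignment columns are counted from 0; all helper names are hypothetical and the example rows are invented.

```python
GAP = "-"


def residue_to_column(aligned_row, position):
    """Map a 1-based ungapped residue position to a 0-based alignment column."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    seen = 0
    for col, ch in enumerate(aligned_row):
        if ch != GAP:
            seen += 1
            if seen == position:
                return col
    raise ValueError("position exceeds the ungapped sequence length")


def column_residues(alignment, col):
    """Return the character each sequence carries in one alignment column."""
    return {name: row[col] for name, row in alignment.items()}


def is_invariant(alignment, col):
    """True when every sequence has the same residue (no gaps) in the column."""
    residues = set(column_residues(alignment, col).values())
    return len(residues) == 1 and GAP not in residues


# Toy alignment for illustration only; not real Clustal Omega output.
alignment = {
    "seq1": "--MKTAYIAKQR",
    "seq2": "GSMKTAYLAK--",
    "seq3": "--MKSAYIAKQ-",
}

# Project residue 5 of seq2 (a hypothetical site of interest) onto the alignment.
site_col = residue_to_column(alignment["seq2"], 5)
site_view = column_residues(alignment, site_col)
site_conserved = is_invariant(alignment, site_col)
```

Report the reference sequence and numbering convention used for each mapped site, since the same alignment column corresponds to different residue numbers in different sequences.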
References
- Multiple Sequence Alignment: https://www.ebi.ac.uk/jdispatcher/msa/clustalo