protein-sequence-msa

Prerequisites

  1. uv: Read the uv skill and follow its Setup instructions to ensure uv is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://www.ebi.ac.uk/jdispatcher/msa/clustalo and https://www.ebi.ac.uk/about/terms-of-use/, then (2) create the file recording the notification text and timestamp.
  3. .env file: Make sure the .env file exists in your home directory. Create one if it does not exist.
  4. USER_EMAIL (optional but recommended): Recommended by the EBI for Clustal Omega job tracking, but the skill works without it. If the variable is missing from .env, do NOT ask the user to paste it into the chat (this would leak the value into the agent's context). Instead, give the user this bash command, substituting ENV_FILE with the resolved literal path to the .env file:
    printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."
    The scripts load credentials automatically via dotenv. NEVER read, print, or inspect the .env file or its variables (e.g. no cat, grep, echo, printenv, or os.environ.get on keys). Credentials must stay out of the agent's context.
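
The User Notification step (item 2) can be sketched as a small helper. This is a minimal sketch, not part of the skill: the function name `ensure_license_notification`, the wording of `NOTICE`, and the layout of the marker file are assumptions; only the file name LICENSE_NOTIFICATION.txt and the two URLs come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical notice wording; only the two URLs are taken from the skill text.
NOTICE = (
    "Please check the terms of use before running this skill:\n"
    "https://www.ebi.ac.uk/jdispatcher/msa/clustalo\n"
    "https://www.ebi.ac.uk/about/terms-of-use/"
)


def ensure_license_notification(skill_dir: Path) -> bool:
    """Notify once per skill directory; return True if a notice was issued now."""
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        # The user was already notified on an earlier run.
        return False
    # (1) Show the notice to the user.
    print(NOTICE)
    # (2) Record the notification text and a UTC timestamp.
    stamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{NOTICE}\nNotified at: {stamp}\n", encoding="utf-8")
    return True
```

Calling the helper a second time on the same directory returns False and prints nothing, so the notice appears only on first use.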

Core Rules

  • Use the Wrapper: ALWAYS execute the alignment using scripts/msa_align.py rather than writing your own curl or custom Python requests. The script automatically enforces the required rate limit to respect EBI's Terms of Use.
  • Notification: Whenever this skill is used, state in the output that it was used.
  • Always state the method: Every report must clearly state that the alignment was performed using EBI Clustal Omega.
  • No Hallucinations: Do NOT invent alignments or conservation metrics. Report only what is present in the alignment file.

Goal

Take a file containing multiple protein sequences in FASTA format, perform multiple sequence alignment using the EBI Clustal Omega API, save the resulting alignment locally for future programmatic analysis, and interpret the results in light of the user's specific research objective (e.g., assessing similarity, identifying conserved domains, or analyzing key residues).
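
The first stage of this goal, a file holding two or more protein sequences in FASTA format, can be sanity-checked locally before the wrapper script is called. The following is a minimal sketch under stated assumptions: `read_fasta` and `check_msa_input` are hypothetical helpers, not part of the skill, and whatever validation scripts/msa_align.py performs itself is not described in this document.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into an ordered list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_msa_input(path):
    """Raise ValueError unless the file holds at least two non-empty records."""
    records = read_fasta(path)
    if len(records) < 2:
        raise ValueError("an alignment needs two or more sequences")
    empty = [name for name, seq in records if not seq]
    if empty:
        raise ValueError(f"records without sequence data: {empty}")
    return records
```

The check only reads the FASTA input file, so it never touches the .env file or its variables.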

Instructions

  1. Prepare Input File: The input must be a plain text file containing two or more protein sequences in FASTA format. Each sequence header must start with a > symbol. Example:
    >Sequence_1_Name
    MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ
    QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
    >Sequence_2_Name
    MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ
    QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
  2. Execute Alignment: Run the alignment script (bash):
    uv run scripts/msa_align.py <INPUT_FASTA> -o <OUTPUT_FILE>
    Always specify the output file with -o or --output.
  3. Interpret and Report Results: Analyze the Clustal Omega alignment by selecting metrics and mapping strategies aligned with the research objective. Note that while Clustal Omega produces a Global Alignment, pairwise metrics can be extracted to evaluate specific relationships within the set:
    • Identity Metric Options: The choice of denominator determines how insertions/deletions (gaps) affect the final percentage. Select the most appropriate calculation based on the biological context:
      • Pairwise - Sequence Coverage: (Identical Residue Matches) / (Length of Shorter Sequence). Use when determining if a specific domain or fragment is fully preserved within a larger protein. This ignores gaps in the longer sequence, focusing purely on the "content" of the shorter one.
      • Pairwise - Global Identity: (Identical Residue Matches) / (Total Alignment Columns). Use when comparing full-length sequences of similar expected length. This is the most conservative metric; it penalizes all gaps (indels) introduced by any sequence in the MSA.
      • Pairwise - Overlap Identity: (Identical Residue Matches) / (Total Alignment Columns - Terminal Gaps). Use when comparing a fragment to a full-length protein or when sequences have long unaligned "tails." This focuses on similarity only where the sequences physically overlap.
      • Multisequence - Conservation Index: (Fully Conserved Columns) / (Total Alignment Columns). Use for quantifying the fraction of alignment columns that are 100% identical across the entire alignment set. This identifies the core evolutionary signature of the protein family.
    • Feature Mapping: Leverage known biological data from specific sequences to ground the analysis:
      • Knowledge Gathering: Identify relevant known sites or regions (e.g., catalytic residues, binding motifs) from your input or via external tools.
      • Coordinate Projection: Map these features onto the corresponding Column Indices of the alignment.
      • Targeted Discussion: Use these columns to drive the assessment:
        • Local Conservation: Analyze if the known functional residues are invariant across the set.
        • Region-Specific Metrics: Calculate identity/similarity specifically within the mapped functional regions rather than the whole sequence.
        • Goal Contribution: Discuss how this data contributes to your goal, e.g. using conservation to corroborate a prediction or divergence to reject a functional hypothesis.
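
The identity metrics and the coordinate projection described in step 3 can be sketched as follows. This is an illustrative sketch, not the skill's own code: it assumes the aligned rows have already been loaded as equal-length strings with "-" as the gap character (the output format of scripts/msa_align.py is not specified here), and the function names are hypothetical. Two further assumptions are that a column where both rows hold a gap is not counted as a match, and that a column containing any gap is not counted as fully conserved.

```python
GAP = "-"


def pairwise_identity(a: str, b: str) -> dict:
    """Return the three pairwise identity variants for two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # Identical residue matches; gap-gap columns are not matches (assumption).
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    shorter = min(len(a) - a.count(GAP), len(b) - b.count(GAP))
    # Overlap region: columns left after trimming terminal gaps of either row.
    start = max(len(a) - len(a.lstrip(GAP)), len(b) - len(b.lstrip(GAP)))
    end = min(len(a.rstrip(GAP)), len(b.rstrip(GAP)))
    overlap = max(end - start, 0)
    return {
        "coverage": matches / shorter if shorter else 0.0,
        "global": matches / len(a) if a else 0.0,
        "overlap": matches / overlap if overlap else 0.0,
    }


def conservation_index(rows) -> float:
    """Fraction of columns holding one identical residue in every row."""
    ncols = len(rows[0])
    conserved = sum(
        1 for col in zip(*rows) if col[0] != GAP and len(set(col)) == 1
    )
    return conserved / ncols if ncols else 0.0


def residue_to_column(row: str, position: int) -> int:
    """Map a 1-based residue position of the ungapped sequence to a 0-based column."""
    seen = 0
    for col, ch in enumerate(row):
        if ch != GAP:
            seen += 1
            if seen == position:
                return col
    raise IndexError("position lies beyond the end of the sequence")


# Toy alignment (made-up rows, not real data): a fragment against a longer row.
full = "MKTAYIAKQR"
frag = "--TAYLAK--"
scores = pairwise_identity(full, frag)   # 5 matches; denominators 6, 10 and 6
col = residue_to_column(frag, 4)         # 4th residue of frag sits in column 5
```

On the toy rows the coverage and overlap variants agree (5/6) while the global variant drops to 5/10, which shows how the choice of denominator changes the reported percentage. Region-specific metrics follow the same pattern: project the feature boundaries with `residue_to_column`, slice every row to that column range, and apply the same functions to the slices.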

References
