protein-assembly

Protein Assembly Skill

This skill provides structured guidance for designing fusion protein gBlock sequences that combine multiple protein components (antibody fragments, fluorescent proteins, enzyme domains) into a single optimized DNA construct.

When to Use This Skill

This skill applies to tasks that involve:

- Designing fusion proteins from multiple sources (PDB, plasmids, protein databases)
- Creating gBlock sequences with specific linker requirements
- Optimizing codons to satisfy GC content constraints
- Combining fluorescent proteins with specific excitation/emission wavelengths
- Assembling multi-domain proteins with N-terminal methionine removal

Structured Approach

Phase 1: Information Gathering and Cataloging

Objective: Collect ALL required sequence data before any design work begins.

- Inventory input files completely
  - Read ALL input files in their entirety (avoid truncated reads)
  - For GenBank (.gb) files, parse the complete file to extract CDS/protein sequences
  - For FASTA files, extract all sequences with their identifiers
  - For PDB ID lists, note all IDs for batch retrieval
- Fetch external sequences systematically
  - Query the PDB API for each protein ID to retrieve amino acid sequences
  - Query relevant protein databases (e.g., FPbase for fluorescent proteins)
  - Document each retrieved sequence with its source and identifier
- Create a sequence catalog
  - List all available protein sequences with clear labels
  - Note the source of each sequence (PDB ID, plasmid CDS, database)
  - Identify any missing sequences before proceeding
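
The cataloging step above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the `CatalogEntry` class, the `parse_fasta` and `missing_labels` helpers, and the toy records are all hypothetical names and data introduced here. It covers only FASTA input; GenBank parsing and database queries would need a dedicated library or API client.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    label: str     # identifier taken from the FASTA header
    source: str    # where the sequence came from (file name, PDB ID, database)
    sequence: str  # amino acid sequence in one-letter code


def parse_fasta(text, source):
    """Parse every record in a FASTA string; no record is skipped or truncated."""
    entries = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                entries.append(CatalogEntry(label, source, "".join(chunks)))
            header = line[1:].split()
            label = header[0] if header else ""
            chunks = []
        elif label is not None:
            chunks.append(line.upper())
    if label is not None:
        entries.append(CatalogEntry(label, source, "".join(chunks)))
    return entries


def missing_labels(catalog, required):
    """Return required labels that are absent or have an empty sequence."""
    have = {entry.label for entry in catalog if entry.sequence}
    return [name for name in required if name not in have]


# Toy input (hypothetical records, not real proteins).
fasta = ">partA demo\nMKT\nAYI\n>partB\nMGS\n"
catalog = parse_fasta(fasta, source="inputs.fasta")
print(missing_labels(catalog, ["partA", "partB", "partC"]))  # ['partC']
```

Running the missing-label check before Phase 2 makes gaps in the catalog explicit instead of surfacing them during assembly.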

Phase 2: Protein Identification and Selection

Objective: Match proteins to task requirements using specific criteria.

- Wavelength matching for fluorescent proteins
  - Search for proteins whose peaks match the required wavelengths exactly, not approximately
  - Verify both excitation AND emission peaks against requirements
  - Document the selected donor and acceptor proteins with rationale
- Binding domain identification
  - Identify proteins that bind specific molecules (substrates, ligands)
  - Cross-reference PDB entries with known binding partners
  - Verify binding capability through database annotations
- Target protein identification
  - For antibody-related tasks, identify the target antigen
  - Use sequence homology or database lookups as needed
  - Document the identification method and confidence
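
The exact-match rule for fluorescent proteins can be illustrated with a small filter. This is a sketch under assumptions: the record layout (`name`, `ex_nm`, `em_nm`), the function name, and the candidate values are placeholders invented for the example, not real spectral data. It also assumes peaks are stored as integer nanometres.

```python
def exact_wavelength_matches(candidates, excitation_nm, emission_nm):
    """Keep only candidates whose excitation AND emission peaks both match exactly."""
    return [
        c["name"]
        for c in candidates
        if c["ex_nm"] == excitation_nm and c["em_nm"] == emission_nm
    ]


# Placeholder records, NOT real spectral data.
candidates = [
    {"name": "candidate_A", "ex_nm": 500, "em_nm": 520},
    {"name": "candidate_B", "ex_nm": 500, "em_nm": 525},  # emission off by 5 nm
    {"name": "candidate_C", "ex_nm": 501, "em_nm": 520},  # excitation off by 1 nm
]
print(exact_wavelength_matches(candidates, excitation_nm=500, emission_nm=520))
# ['candidate_A']
```

Both near misses are rejected, which is the behavior the "not approximate" requirement asks for.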

Phase 3: Sequence Processing

Objective: Prepare individual protein sequences for fusion.

- N-terminal methionine handling
  - Remove the N-terminal methionine from ALL internal proteins (every protein except the first)
  - Keep only the first protein's N-terminal methionine (if required)
  - Document which sequences were modified
- Sequence validation
  - Verify each sequence is complete and valid
  - Check for unusual amino acids or sequence artifacts
  - Confirm sequences match expected lengths
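
A minimal sketch of the methionine handling and validation steps follows. The helper names and the toy peptides are hypothetical; "unusual amino acids" is interpreted here as anything outside the 20 standard one-letter codes, which is an assumption.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return (position, residue) pairs for anything outside the 20 standard residues."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence, start=1)
        if residue not in STANDARD_AMINO_ACIDS
    ]


def strip_internal_met(sequences, keep_first_met=True):
    """Remove one leading M from every sequence except (optionally) the first.

    Returns the processed sequences and the indices of those that were modified.
    """
    processed, modified = [], []
    for index, seq in enumerate(sequences):
        is_first = index == 0
        if seq.startswith("M") and not (is_first and keep_first_met):
            seq = seq[1:]
            modified.append(index)
        processed.append(seq)
    return processed, modified


parts = ["MKTAYI", "MGSSH", "AVLXK"]  # toy peptides in fusion order
processed, modified = strip_internal_met(parts)
print(processed)                    # ['MKTAYI', 'GSSH', 'AVLXK']
print(modified)                     # [1]
print(invalid_residues(parts[2]))   # [(4, 'X')]
```

Returning the modified indices gives a ready-made record of which sequences were changed.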

Phase 4: Fusion Protein Assembly

Objective: Construct the complete fusion protein sequence.

- Follow the specified protein order exactly
  - Do not deviate from the required arrangement
  - Document the order: [Protein1]-[Linker]-[Protein2]-[Linker]-...
- Design appropriate linkers
  - Use GS (Glycine-Serine) linkers of specified length
  - Common patterns: (GGGGS)n or (GS)n, where the repeat count n sets the linker length (5n or 2n residues)
  - Ensure linkers fall within length constraints (e.g., 5-20 amino acids)
- Assemble the complete protein sequence
  - Concatenate proteins with linkers in correct order
  - Verify the assembled sequence is continuous and valid
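
Linker construction and assembly can be sketched as below. The function names and toy peptides are hypothetical. Truncating the final repeat when the requested length is not a multiple of the unit is a choice made for this example; a task may instead require whole repeats only.

```python
def gs_linker(length, unit="GGGGS"):
    """Build a Gly/Ser linker of exactly `length` residues by repeating `unit`."""
    if length <= 0:
        raise ValueError("linker length must be positive")
    repeats = -(-length // len(unit))  # ceiling division
    return (unit * repeats)[:length]


def assemble_fusion(proteins, linkers):
    """Join proteins in the given order with one linker between each neighbouring pair."""
    if len(linkers) != len(proteins) - 1:
        raise ValueError("need exactly one linker per junction")
    pieces = [proteins[0]]
    for linker, protein in zip(linkers, proteins[1:]):
        pieces.extend([linker, protein])
    return "".join(pieces)


proteins = ["MKTAYI", "GSSH", "AVLK"]               # toy peptides, already Met-processed
linkers = [gs_linker(10), gs_linker(6, unit="GS")]  # (GGGGS)2 and (GS)3
fusion = assemble_fusion(proteins, linkers)
print(fusion)  # MKTAYIGGGGSGGGGSGSSHGSGSGSAVLK
```

Passing the proteins as an ordered list keeps the required arrangement explicit and easy to check against the task statement.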

Phase 5: Codon Optimization and DNA Generation

Objective: Convert protein to optimized DNA sequence.

- Initial codon translation
  - Convert each amino acid to a codon
  - Use a standard codon table for the target organism
- GC content optimization
  - Calculate GC content in sliding windows (e.g., 50 nucleotides)
  - Identify windows outside the acceptable range (e.g., 30-70%)
  - Swap synonymous codons to bring GC content within range
  - Re-verify all windows after each swap
- Length verification
  - Confirm the DNA sequence meets length constraints (e.g., ≤3000 nt)
  - If too long, review design choices (linker lengths, protein selections)
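
The back-translation and window check can be sketched as follows. The codon table is the standard genetic code. The codon choice rule (pick the synonymous codon that keeps the running GC fraction closest to 50%) is a simple heuristic assumed for this example, not an organism-specific codon usage scheme. The window size, GC range, and length limit are the example values from the steps above, and all function names are hypothetical.

```python
SYNONYMOUS_CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"], "C": ["TGT", "TGC"],
    "D": ["GAT", "GAC"], "E": ["GAA", "GAG"], "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"], "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"], "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"], "M": ["ATG"],
    "N": ["AAT", "AAC"], "P": ["CCT", "CCC", "CCA", "CCG"],
    "Q": ["CAA", "CAG"], "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"], "V": ["GTT", "GTC", "GTA", "GTG"],
    "W": ["TGG"], "Y": ["TAT", "TAC"],
}


def gc_count(dna):
    """Number of G or C bases in a DNA string."""
    return sum(base in "GC" for base in dna)


def gc_fraction(dna):
    """GC fraction of a non-empty DNA string."""
    return gc_count(dna) / len(dna)


def back_translate(protein):
    """Choose, per residue, the codon that keeps the running GC fraction nearest 50%."""
    codons, running_gc = [], 0
    for residue in protein:
        length_after = 3 * (len(codons) + 1)
        best = min(
            SYNONYMOUS_CODONS[residue],
            key=lambda codon: abs((running_gc + gc_count(codon)) / length_after - 0.5),
        )
        codons.append(best)
        running_gc += gc_count(best)
    return "".join(codons)


def windows_out_of_range(dna, window=50, low=0.30, high=0.70):
    """Return (start, gc_fraction) for every sliding window outside [low, high]."""
    if not dna:
        return []
    bad = []
    for start in range(max(len(dna) - window, 0) + 1):
        gc = gc_fraction(dna[start:start + window])
        if not low <= gc <= high:
            bad.append((start, gc))
    return bad


dna = back_translate("MKTAYIGGGGSGGGGSW")  # toy peptide
print(len(dna) <= 3000)                    # length constraint check
print(windows_out_of_range(dna))           # windows that still need codon swaps
print(windows_out_of_range("A" * 60)[:1])  # [(0, 0.0)]
```

A running balance does not guarantee that every window is in range, so the window check still has to be repeated after any manual codon swap.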

Phase 6: Output Generation

Objective: Create the required output file(s).

- Write output immediately after assembly
  - Do not delay output file creation
  - Write to the exact path specified in requirements
- Include appropriate formatting
  - Follow any specified format (plain text, FASTA, etc.)
  - Include headers or metadata if required
- Verify output file exists
  - Confirm the file was created successfully
  - Verify file contents match the designed sequence
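
Writing and verifying the output might look like the sketch below. It assumes a plain-text format with the sequence on a single line. The function name and the temporary demo path are hypothetical stand-ins for the path given in the task requirements.

```python
import os
import tempfile


def write_and_verify(path, sequence):
    """Write the sequence to `path`, then read it back and compare."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(sequence + "\n")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    with open(path, encoding="ascii") as handle:
        written = handle.read().strip()
    if written != sequence:
        raise ValueError("file contents differ from the designed sequence")
    return True


# Demo path in a temporary directory; a real task would use the required path.
demo_path = os.path.join(tempfile.mkdtemp(), "gblock.txt")
print(write_and_verify(demo_path, "ATGAAAACC"))  # True
```

Reading the file back catches both a missing file and a mismatch between the file and the designed sequence.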

Verification Checkpoints

After Phase 1:

- All input files read completely (no truncation)
- All external sequences retrieved
- Sequence catalog is complete

After Phase 2:

- All required proteins identified
- Wavelength/binding requirements verified
- Selection rationale documented

After Phase 3:

- N-terminal methionines handled correctly
- All sequences validated

After Phase 4:

- Protein order matches requirements
- Linkers meet length constraints
- Complete fusion sequence assembled

After Phase 5:

- GC content within range in ALL windows
- DNA length within constraints

After Phase 6:

- Output file exists at specified path
- File contents are correct

Common Pitfalls

- Incomplete file reading
  - GenBank files may be large; ensure complete parsing
  - Extract CDS translations, not just raw sequences
- Approximate wavelength matching
  - Use exact values, not "close enough" matches
  - Verify both excitation AND emission, not just one
- Forgetting to remove N-terminal methionines
  - Internal proteins in fusions should have Met removed
  - Only the first protein retains its N-terminal Met
- Ignoring GC content windows
  - Check ALL sliding windows, not just overall GC%
  - Optimize problematic regions with synonymous codons
- Delayed output generation
  - Create the output file as soon as the sequence is ready
  - Do not continue gathering information after the design is complete
- Information gathering loops
  - Set a clear stopping point for research
  - Move on to execution even when some information is still missing
  - A partial solution is better than no solution

Output-First Strategy

If time or resources are constrained:

- Create the output file early, even with placeholders
- Update the file as each component is determined
- Ensure a valid (if imperfect) output exists at task end

This guarantees that the primary deliverable exists and can be refined as more information becomes available.
