protein-assembly
Protein Assembly Skill
This skill provides structured guidance for designing fusion protein gBlock sequences that combine multiple protein components (antibody fragments, fluorescent proteins, enzyme domains) into a single optimized DNA construct.
When to Use This Skill
This skill applies to tasks that involve:
- Designing fusion proteins from multiple sources (PDB, plasmids, protein databases)
- Creating gBlock sequences with specific linker requirements
- Codon optimization for GC content constraints
- Combining fluorescent proteins with specific excitation/emission wavelengths
- Assembling multi-domain proteins with N-terminal methionine removal
Structured Approach
Phase 1: Information Gathering and Cataloging
Objective: Collect ALL required sequence data before any design work begins.
- Inventory input files completely
  - Read ALL input files in their entirety (avoid truncated reads)
  - For GenBank (.gb) files, parse the complete file to extract CDS/protein sequences
  - For FASTA files, extract all sequences with their identifiers
  - For PDB ID lists, note all IDs for batch retrieval
- Fetch external sequences systematically
  - Query the PDB API for each protein ID to retrieve amino acid sequences
  - Query relevant protein databases (e.g., FPbase for fluorescent proteins)
  - Document each retrieved sequence with its source and identifier
- Create a sequence catalog
  - List all available protein sequences with clear labels
  - Note the source of each sequence (PDB ID, plasmid CDS, database)
  - Identify any missing sequences before proceeding
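The cataloging step can be sketched as follows. This is a minimal illustration that covers FASTA input only; GenBank parsing and PDB or database retrieval are not shown. The names `CatalogEntry`, `parse_fasta`, and `missing_labels` are hypothetical helpers introduced here, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    label: str     # identifier used in the design notes
    source: str    # where the sequence came from (file name, PDB ID, database)
    sequence: str  # amino acid sequence in one-letter code


def parse_fasta(text: str, source: str) -> list:
    """Parse every record in a FASTA string into catalog entries."""
    entries = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                entries.append(CatalogEntry(label, source, "".join(chunks)))
            parts = line[1:].split()
            label, chunks = (parts[0] if parts else "unnamed"), []
        elif label is not None:
            chunks.append(line.upper())
    if label is not None:
        entries.append(CatalogEntry(label, source, "".join(chunks)))
    return entries


def missing_labels(catalog: list, required: list) -> list:
    """Return required labels that have no non-empty sequence in the catalog."""
    have = {entry.label for entry in catalog if entry.sequence}
    return [name for name in required if name not in have]
```

Calling `missing_labels` with the list of proteins the task requires gives an explicit answer to "is anything still missing?" before design work starts.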
Phase 2: Protein Identification and Selection
Objective: Match proteins to task requirements using specific criteria.
- Wavelength matching for fluorescent proteins
  - Search for proteins with exact wavelength matches (not approximate)
  - Verify both excitation AND emission peaks against requirements
  - Document the selected donor and acceptor proteins with rationale
- Binding domain identification
  - Identify proteins that bind specific molecules (substrates, ligands)
  - Cross-reference PDB entries with known binding partners
  - Verify binding capability through database annotations
- Target protein identification
  - For antibody-related tasks, identify the target antigen
  - Use sequence homology or database lookups as needed
  - Document the identification method and confidence
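The exact-match rule for fluorescent proteins can be illustrated with a small filter. This is only a sketch: the record fields `name`, `ex_max`, and `em_max` are assumed names, and the candidate values below are invented placeholders, not real database entries.

```python
def select_exact(candidates, ex_nm, em_nm):
    """Return names whose excitation AND emission maxima both equal the request."""
    return [
        c["name"]
        for c in candidates
        if c["ex_max"] == ex_nm and c["em_max"] == em_nm
    ]


# Invented placeholder records (not real proteins or measured values).
candidates = [
    {"name": "FP_A", "ex_max": 488, "em_max": 507},
    {"name": "FP_B", "ex_max": 488, "em_max": 510},  # emission differs
    {"name": "FP_C", "ex_max": 490, "em_max": 507},  # excitation differs
]

matches = select_exact(candidates, ex_nm=488, em_nm=507)
```

Here only `FP_A` is returned. `FP_B` and `FP_C` each match one peak but not both, which is the "close enough" case the skill asks to reject.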
Phase 3: Sequence Processing
Objective: Prepare individual protein sequences for fusion.
- N-terminal methionine handling
  - Remove N-terminal methionines from ALL internal proteins
  - Keep only the first protein's N-terminal methionine (if required)
  - Document which sequences were modified
- Sequence validation
  - Verify each sequence is complete and valid
  - Check for unusual amino acids or sequence artifacts
  - Confirm sequences match expected lengths
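A minimal sketch of the methionine handling and residue check, assuming sequences are uppercase one-letter strings listed in fusion order. The function names are hypothetical.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def strip_internal_met(proteins, keep_first_met=True):
    """Drop a leading 'M' from every protein except (optionally) the first."""
    processed = []
    for index, seq in enumerate(proteins):
        keep = index == 0 and keep_first_met
        if seq.startswith("M") and not keep:
            seq = seq[1:]
        processed.append(seq)
    return processed


def invalid_residues(seq):
    """Return (position, character) pairs that are not standard amino acids."""
    return [(i, aa) for i, aa in enumerate(seq) if aa not in STANDARD_AA]
```

Comparing the input and output lists shows which sequences were modified, which covers the documentation step.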
Phase 4: Fusion Protein Assembly
Objective: Construct the complete fusion protein sequence.
- Follow the specified protein order exactly
  - Do not deviate from the required arrangement
  - Document the order: [Protein1]-[Linker]-[Protein2]-[Linker]-...
- Design appropriate linkers
  - Use GS (Glycine-Serine) linkers of specified length
  - Common patterns: (GGGGS)n or (GS)n where n provides required length
  - Ensure linkers fall within length constraints (e.g., 5-20 amino acids)
- Assemble the complete protein sequence
  - Concatenate proteins with linkers in correct order
  - Verify the assembled sequence is continuous and valid
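Assembly with a fixed-length GS linker might look like the sketch below. It assumes the same linker is used at every junction and that the 5-20 residue bounds from the example above apply; both are assumptions to adjust per task. `gs_linker` and `assemble` are hypothetical names.

```python
def gs_linker(length, unit="GGGGS"):
    """Repeat the unit and trim so the linker is exactly `length` residues."""
    if length <= 0:
        raise ValueError("linker length must be positive")
    repeats = -(-length // len(unit))  # ceiling division
    return (unit * repeats)[:length]


def assemble(proteins, linker_length, min_len=5, max_len=20):
    """Join proteins in the given order with identical GS linkers between them."""
    if not min_len <= linker_length <= max_len:
        raise ValueError("linker length outside allowed range")
    return gs_linker(linker_length).join(proteins)
```

Because `str.join` preserves list order, the caller controls the protein arrangement simply by the order of the `proteins` list.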
Phase 5: Codon Optimization and DNA Generation
Objective: Convert protein to optimized DNA sequence.
- Initial codon translation
  - Convert each amino acid to a codon
  - Use a standard codon table for the target organism
- GC content optimization
  - Calculate GC content in sliding windows (e.g., 50 nucleotides)
  - Identify windows outside acceptable range (e.g., 30-70%)
  - Swap synonymous codons to bring GC content within range
  - Re-verify after each swap
- Length verification
  - Confirm DNA sequence meets length constraints (e.g., ≤3000 nt)
  - If too long, review design choices (linker lengths, protein selections)
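The window check and synonymous-swap loop can be sketched as below. This is a simple greedy illustration, not a production optimizer: the initial codon for each residue is just the first entry in the table, which is NOT an organism-specific usage ranking, and the loop can fail on sequences where no swap helps. `bad_windows` and `optimize_gc` are hypothetical names; the 50 nt window and 30-70% range are the example values from the list above.

```python
# Standard genetic code grouped by amino acid. Codon order here is arbitrary
# (an assumption for illustration), not a codon-usage preference.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"], "C": ["TGT", "TGC"],
    "D": ["GAT", "GAC"], "E": ["GAA", "GAG"], "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"], "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"], "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"], "M": ["ATG"],
    "N": ["AAT", "AAC"], "P": ["CCT", "CCC", "CCA", "CCG"],
    "Q": ["CAA", "CAG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"], "V": ["GTT", "GTC", "GTA", "GTG"],
    "W": ["TGG"], "Y": ["TAT", "TAC"],
}


def gc_count(seq):
    return seq.count("G") + seq.count("C")


def bad_windows(dna, window=50, lo=0.30, hi=0.70):
    """Start indices of all sliding windows whose GC fraction is outside [lo, hi]."""
    if not dna:
        return []
    size = min(window, len(dna))  # short sequences are checked as one window
    return [
        i for i in range(len(dna) - size + 1)
        if not lo <= gc_count(dna[i:i + size]) / size <= hi
    ]


def optimize_gc(protein, window=50, lo=0.30, hi=0.70, max_rounds=10_000):
    """Greedy sketch: fix the first bad window one codon swap at a time."""
    codons = [CODONS[aa][0] for aa in protein]
    for _ in range(max_rounds):
        dna = "".join(codons)
        bad = bad_windows(dna, window, lo, hi)  # re-verify after each swap
        if not bad:
            return dna
        start, size = bad[0], min(window, len(dna))
        too_high = gc_count(dna[start:start + size]) / size > hi
        pick = min if too_high else max
        # Try codons overlapping the window until one swap changes GC.
        for idx in range(start // 3, (start + size - 1) // 3 + 1):
            best = pick(CODONS[protein[idx]], key=gc_count)
            if gc_count(best) != gc_count(codons[idx]):
                codons[idx] = best
                break
        else:
            raise RuntimeError(f"no synonymous swap can fix window at {start}")
    raise RuntimeError("gave up after max_rounds")
```

A length check is then a plain comparison such as `len(dna) <= 3000`. For real constructs, the starting codons would normally come from the target organism's codon usage table rather than the arbitrary first entry used here.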
Phase 6: Output Generation
Objective: Create the required output file(s).
- Write output immediately after assembly
  - Do not delay output file creation
  - Write to the exact path specified in requirements
- Include appropriate formatting
  - Follow any specified format (plain text, FASTA, etc.)
  - Include headers or metadata if required
- Verify output file exists
  - Confirm the file was created successfully
  - Verify file contents match the designed sequence
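Writing and then re-reading the file can be sketched as follows. The helper name `write_and_verify` and the optional FASTA-style header are assumptions for illustration; the actual format should follow whatever the task specifies.

```python
from pathlib import Path
from typing import Optional


def write_and_verify(sequence: str, path: str, header: Optional[str] = None) -> Path:
    """Write the sequence to `path`, then read it back to confirm the contents."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Optional FASTA-style header line; plain text output if header is None.
    text = (f">{header}\n" if header else "") + sequence + "\n"
    out.write_text(text)
    if not out.is_file():
        raise FileNotFoundError(f"output was not created: {out}")
    if out.read_text() != text:
        raise IOError(f"output contents do not match the designed sequence: {out}")
    return out
```

Calling this as soon as the DNA sequence is final covers both the "write immediately" and the "verify" steps in one place.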
Verification Checkpoints
After Phase 1:
- All input files read completely (no truncation)
- All external sequences retrieved
- Sequence catalog is complete
After Phase 2:
- All required proteins identified
- Wavelength/binding requirements verified
- Selection rationale documented
After Phase 3:
- N-terminal methionines handled correctly
- All sequences validated
After Phase 4:
- Protein order matches requirements
- Linkers meet length constraints
- Complete fusion sequence assembled
After Phase 5:
- GC content within range in ALL windows
- DNA length within constraints
After Phase 6:
- Output file exists at specified path
- File contents are correct
Common Pitfalls
- Incomplete file reading
  - GenBank files may be large; ensure complete parsing
  - Extract CDS translations, not just raw sequences
- Approximate wavelength matching
  - Use exact values, not "close enough" matches
  - Verify both excitation AND emission, not just one
- Forgetting N-terminal methionines
  - Internal proteins in fusions should have Met removed
  - Only the first protein retains its N-terminal Met
- Ignoring GC content windows
  - Check ALL sliding windows, not just overall GC%
  - Optimize problematic regions with synonymous codons
- Delayed output generation
  - Create output file as soon as sequence is ready
  - Do not continue gathering information after design is complete
- Information gathering loops
  - Set a clear stopping point for research
  - Progress to execution even with incomplete information
  - A partial solution is better than no solution
Output-First Strategy
If time or resources are constrained:
- Create the output file early, even with placeholders
- Update the file as each component is determined
- Ensure a valid (if imperfect) output exists at task end
This ensures the primary deliverable exists, which can be refined with additional information.