biopython

Biopython: Computational Molecular Biology in Python

Overview

Biopython is a comprehensive set of freely available Python tools for biological computation. It provides functionality for sequence manipulation, file I/O, database access, structural bioinformatics, phylogenetics, and many other bioinformatics tasks. The current version is Biopython 1.85 (released January 2025), which supports Python 3 and requires NumPy.

When to Use This Skill

Use this skill when:

Working with biological sequences (DNA, RNA, or protein)
Reading, writing, or converting biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, etc.)
Accessing NCBI databases (GenBank, PubMed, Protein, Gene, etc.) via Entrez
Running BLAST searches or parsing BLAST results
Performing sequence alignments (pairwise or multiple sequence alignments)
Analyzing protein structures from PDB files
Creating, manipulating, or visualizing phylogenetic trees
Finding sequence motifs or analyzing motif patterns
Calculating sequence statistics (GC content, molecular weight, melting temperature, etc.)
Performing structural bioinformatics tasks
Working with population genetics data
Any other computational molecular biology task

Core Capabilities

Biopython is organized into modular sub-packages, each addressing specific bioinformatics domains:

Sequence Handling - Bio.Seq and Bio.SeqIO for sequence manipulation and file I/O
Alignment Analysis - Bio.Align and Bio.AlignIO for pairwise and multiple sequence alignments
Database Access - Bio.Entrez for programmatic access to NCBI databases
BLAST Operations - Bio.Blast for running and parsing BLAST searches
Structural Bioinformatics - Bio.PDB for working with 3D protein structures
Phylogenetics - Bio.Phylo for phylogenetic tree manipulation and visualization
Advanced Features - Motifs, population genetics, sequence utilities, and more

Installation and Setup

Install Biopython using pip, here invoked through uv (requires Python 3 and NumPy):

bash

uv pip install biopython

For NCBI database access, always set your email address (required by NCBI):

python

from Bio import Entrez
Entrez.email = "your.email@example.com"

# Optional: API key for higher rate limits (10 req/s instead of 3 req/s)
Entrez.api_key = "your_api_key_here"

Using This Skill

This skill provides comprehensive documentation organized by functionality area. When working on a task, consult the relevant reference documentation:

1. Sequence Handling (Bio.Seq & Bio.SeqIO)

Reference:

references/sequence_io.md

Use for:

Creating and manipulating biological sequences
Reading and writing sequence files (FASTA, GenBank, FASTQ, etc.)
Converting between file formats
Extracting sequences from large files
Sequence translation, transcription, and reverse complement
Working with SeqRecord objects

Quick example:

python

from Bio import SeqIO

# Read sequences from FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")

# Convert GenBank to FASTA
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
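
The translation, transcription, and reverse-complement operations listed for this module can be sketched on an in-memory sequence. This is a minimal illustration that assumes the standard genetic code; the example sequence and the variable names are arbitrary choices, not taken from the reference file.

```python
from Bio.Seq import Seq

# Arbitrary example coding sequence (an assumption, chosen for illustration)
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Reverse complement of the DNA strand
template_dna = coding_dna.reverse_complement()

# Transcription replaces T with U
messenger_rna = coding_dna.transcribe()

# Translation with the standard table; to_stop=True ends at the first stop codon
protein = coding_dna.translate(to_stop=True)

print(template_dna)
print(messenger_rna)
print(protein)
```

Each call returns a new Seq object, so the original sequence is left unchanged.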

2. Alignment Analysis (Bio.Align & Bio.AlignIO)

Reference:

references/alignment.md

Use for:

Pairwise sequence alignment (global and local)
Reading and writing multiple sequence alignments
Using substitution matrices (BLOSUM, PAM)
Calculating alignment statistics
Customizing alignment parameters

Quick example:

python

from Bio import Align

# Pairwise alignment
aligner = Align.PairwiseAligner()
aligner.mode = 'global'
alignments = aligner.align("ACCGGT", "ACGGT")
print(alignments[0])
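
Customizing alignment parameters usually means setting the scoring attributes on the aligner before calling it. The sketch below reuses the two sequences from the quick example; the scoring values are illustrative assumptions, not recommendations.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
# Illustrative scoring values (assumptions, not recommendations)
aligner.match_score = 1.0
aligner.mismatch_score = -1.0
aligner.open_gap_score = -2.0
aligner.extend_gap_score = -0.5

alignments = aligner.align("ACCGGT", "ACGGT")
best = alignments[0]

# Five matches and one single-position gap: 5 * 1.0 - 2.0
print(best.score)
```

Setting the scores explicitly keeps results comparable across Biopython versions, whatever the defaults happen to be.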

3. Database Access (Bio.Entrez)

Reference:

references/databases.md

Use for:

Searching NCBI databases (PubMed, GenBank, Protein, Gene, etc.)
Downloading sequences and records
Fetching publication information
Finding related records across databases
Batch downloading with proper rate limiting

Quick example:

python

from Bio import Entrez
Entrez.email = "your.email@example.com"

# Search PubMed
handle = Entrez.esearch(db="pubmed", term="biopython", retmax=10)
results = Entrez.read(handle)
handle.close()
print(f"Found {results['Count']} results")
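
Batch downloading with rate limiting can be sketched without touching the network. The helper names `chunked` and `fetch_in_batches`, the batch size, and the 0.34 s pause (roughly 3 requests per second, the limit quoted above for use without an API key) are assumptions made for illustration; `fake_fetch` is a stand-in for a real Entrez call.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(
    ids: Sequence[str],
    fetch_batch: Callable[[List[str]], str],
    batch_size: int = 100,
    min_interval: float = 0.34,  # roughly 3 requests per second
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `fetch_batch` once per batch, pausing between consecutive calls."""
    results = []
    for index, batch in enumerate(chunked(ids, batch_size)):
        if index > 0:
            sleep(min_interval)
        results.append(fetch_batch(batch))
    return results


# Stand-in fetcher so the sketch runs offline; a real one might wrap
# Entrez.efetch(db="nucleotide", id=",".join(batch), rettype="fasta", retmode="text")
def fake_fetch(batch: List[str]) -> str:
    return ",".join(batch)


pauses = []
output = fetch_in_batches(["A1", "A2", "A3", "A4", "A5"], fake_fetch,
                          batch_size=2, sleep=pauses.append)
print(output)
```

Passing `sleep` as a parameter makes the pacing testable; in real use the default `time.sleep` applies.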

4. BLAST Operations (Bio.Blast)

Reference:

references/blast.md

Use for:

Running BLAST searches via NCBI web services
Running local BLAST searches
Parsing BLAST XML output
Filtering results by E-value or identity
Extracting hit sequences

Quick example:

python

from Bio.Blast import NCBIWWW, NCBIXML

# Run BLAST search
result_handle = NCBIWWW.qblast("blastn", "nt", "ATCGATCGATCG")
blast_record = NCBIXML.read(result_handle)

# Display top hits
for alignment in blast_record.alignments[:5]:
    print(f"{alignment.title}: E-value={alignment.hsps[0].expect}")
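
Filtering results by E-value or identity might look like the sketch below. It relies only on the record attributes used in the quick example plus the HSP fields `identities` and `align_length`; the function name `filter_hits`, the cutoffs, and the mock record are hypothetical, so the code runs without a BLAST search.

```python
from types import SimpleNamespace


def filter_hits(blast_record, max_evalue=1e-5, min_identity=0.9):
    """Return (title, e-value, identity) for alignments whose first HSP passes both cutoffs."""
    kept = []
    for alignment in blast_record.alignments:
        hsp = alignment.hsps[0]
        identity = hsp.identities / hsp.align_length
        if hsp.expect <= max_evalue and identity >= min_identity:
            kept.append((alignment.title, hsp.expect, identity))
    return kept


# Stand-in record (hypothetical data) mimicking the attributes used above
def _hit(title, expect, identities, align_length):
    hsp = SimpleNamespace(expect=expect, identities=identities, align_length=align_length)
    return SimpleNamespace(title=title, hsps=[hsp])


mock_record = SimpleNamespace(alignments=[
    _hit("hit_1", 1e-30, 98, 100),
    _hit("hit_2", 0.5, 99, 100),
    _hit("hit_3", 1e-10, 70, 100),
])
hits = filter_hits(mock_record)
print(hits)
```

With a real search, the same function could be applied to the `blast_record` returned by `NCBIXML.read`.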

5. Structural Bioinformatics (Bio.PDB)

Reference:

references/structure.md

Use for:

Parsing PDB and mmCIF structure files
Navigating protein structure hierarchy (SMCRA: Structure/Model/Chain/Residue/Atom)
Calculating distances, angles, and dihedrals
Secondary structure assignment (DSSP)
Structure superimposition and RMSD calculation
Extracting sequences from structures

Quick example:

python

from Bio.PDB import PDBParser

# Parse structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("1crn", "1crn.pdb")

# Calculate distance between alpha carbons
chain = structure[0]["A"]
distance = chain[10]["CA"] - chain[20]["CA"]
print(f"Distance: {distance:.2f} Å")
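
The Structure/Model/Chain/Residue/Atom navigation can be tried without a downloaded file by parsing a tiny in-memory structure. The sketch below is an assumption-laden toy: the helper `atom_line` and the two made-up alpha carbons are hypothetical and serve only to produce valid fixed-column ATOM records.

```python
from io import StringIO
from Bio.PDB import PDBParser


def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-column PDB ATOM record (occupancy 1.00, B-factor 0.00)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )


# Two made-up alpha carbons placed 5 Å apart (a 3-4-5 triangle)
pdb_text = "\n".join([
    atom_line(1, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0, "C"),
    atom_line(2, "CA", "GLY", "A", 2, 3.0, 4.0, 0.0, "C"),
    "END",
]) + "\n"

parser = PDBParser(QUIET=True)
structure = parser.get_structure("toy", StringIO(pdb_text))

# Model 0, chain A, then residues by sequence number
chain = structure[0]["A"]
# Subtracting two Atom objects returns their Euclidean distance
distance = chain[1]["CA"] - chain[2]["CA"]
residue_names = [residue.get_resname() for residue in chain]
print(f"Distance: {distance:.2f} Å")
```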

6. Phylogenetics (Bio.Phylo)

Reference:

references/phylogenetics.md

Use for:

Reading and writing phylogenetic trees (Newick, NEXUS, phyloXML)
Building trees from distance matrices or alignments
Tree manipulation (pruning, rerooting, ladderizing)
Calculating phylogenetic distances
Creating consensus trees
Visualizing trees

Quick example:

python

from Bio import Phylo

# Read and visualize tree
tree = Phylo.read("tree.nwk", "newick")
Phylo.draw_ascii(tree)

# Calculate distance
distance = tree.distance("Species_A", "Species_B")
print(f"Distance: {distance:.3f}")
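
To see what the phylogenetic distance measures, the same calls can be run on a small Newick string held in memory. The tree below is made up for illustration, and its taxon names are placeholders.

```python
from io import StringIO
from Bio import Phylo

# Made-up Newick tree with branch lengths (taxon names are placeholders)
newick = "((Species_A:0.1,Species_B:0.2):0.05,Species_C:0.3);"
tree = Phylo.read(StringIO(newick), "newick")

# Sum of branch lengths on the path between two tips
dist_ab = tree.distance("Species_A", "Species_B")  # 0.1 + 0.2
dist_ac = tree.distance("Species_A", "Species_C")  # 0.1 + 0.05 + 0.3

leaf_names = [leaf.name for leaf in tree.get_terminals()]
print(leaf_names, round(dist_ab, 3), round(dist_ac, 3))
```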

7. Advanced Features

Reference:

references/advanced.md

Use for:

Sequence motifs (Bio.motifs) - Finding and analyzing motif patterns
Population genetics (Bio.PopGen) - GenePop files, Fst calculations, Hardy-Weinberg tests
Sequence utilities (Bio.SeqUtils) - GC content, melting temperature, molecular weight, protein analysis
Restriction analysis (Bio.Restriction) - Finding restriction enzyme sites
Clustering (Bio.Cluster) - K-means and hierarchical clustering
Genome diagrams (GenomeDiagram) - Visualizing genomic features

Quick example:

python

from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.Seq import Seq

seq = Seq("ATCGATCGATCG")
print(f"GC content: {gc_fraction(seq):.2%}")
print(f"Molecular weight: {molecular_weight(seq, seq_type='DNA'):.2f} g/mol")
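
Melting temperature is listed among the sequence utilities but not shown in the example. A minimal sketch, assuming the simple Wallace rule is adequate for a short oligo (more accurate nearest-neighbor methods exist in the same module); the variable names are arbitrary:

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
from Bio.SeqUtils import MeltingTemp as mt

primer = Seq("ATCGATCGATCG")

gc = gc_fraction(primer)
# Wallace rule: Tm = 4 * (G + C) + 2 * (A + T), a rough estimate for short oligos
tm_wallace = mt.Tm_Wallace(primer)
print(f"GC: {gc:.0%}, Tm (Wallace): {tm_wallace:.1f} °C")
```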

General Workflow Guidelines

Reading Documentation

When a user asks about a specific Biopython task:

Identify the relevant module based on the task description
Read the appropriate reference file using the Read tool
Extract relevant code patterns and adapt them to the user's specific needs
Combine multiple modules when the task requires it

Example search patterns for reference files:

bash

# Find information about specific functions
grep -n "SeqIO.parse" references/sequence_io.md

# Find examples of specific tasks
grep -n "BLAST" references/blast.md

# Find information about specific concepts
grep -n "alignment" references/alignment.md

Writing Biopython Code

Follow these principles when writing Biopython code:

Import modules explicitly

python

from Bio import SeqIO, Entrez
from Bio.Seq import Seq

Set Entrez email when using NCBI databases

python

Entrez.email = "your.email@example.com"

Use appropriate file formats - Check which format best suits the task

python

# Common formats: "fasta", "genbank", "fastq", "clustal", "phylip"

Handle files properly - Close handles after use or use context managers

python

with open("file.fasta") as handle:
    # SeqIO.parse is lazy, so consume the records before the handle closes
    records = list(SeqIO.parse(handle, "fasta"))

Use iterators for large files - Avoid loading everything into memory

python

for record in SeqIO.parse("large_file.fasta", "fasta"):
    # Process one record at a time
    print(record.id, len(record.seq))
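
The iterator approach also works for writing: a generator can filter records lazily and hand them straight to the writer. A minimal sketch, assuming a length filter as the processing step; the function name `long_records` and the in-memory FASTA text are made up so the example runs without a file.

```python
from io import StringIO
from Bio import SeqIO

# Small in-memory FASTA standing in for a large file (made-up records)
fasta_text = ">seq1\nATGC\n>seq2\nATGCATGCAT\n>seq3\nAT\n"


def long_records(handle, min_length):
    """Lazily yield records whose sequence has at least `min_length` letters."""
    for record in SeqIO.parse(handle, "fasta"):
        if len(record.seq) >= min_length:
            yield record


output = StringIO()
# SeqIO.write consumes the generator one record at a time and returns the count
written = SeqIO.write(long_records(StringIO(fasta_text), 4), output, "fasta")
print(written)
```

Because neither side builds a list, memory use stays roughly constant regardless of input size.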

Handle errors gracefully - Network operations and file parsing can fail

python

from urllib.error import HTTPError

try:
    # `accession` is the ID to fetch, defined elsewhere
    handle = Entrez.efetch(db="nucleotide", id=accession)
except HTTPError as e:
    print(f"Error: {e}")
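
For transient network failures, one common approach is to retry with an increasing delay instead of failing on the first error. The sketch below is not part of Biopython: `with_retries`, its parameters, and the stand-in `flaky_fetch` are hypothetical and only illustrate the pattern.

```python
import time
from urllib.error import HTTPError, URLError


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTPError/URLError wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (HTTPError, URLError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Offline demonstration with a stand-in that fails twice, then succeeds
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise URLError("temporary failure")
    return "record text"


delays = []
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

In real use, `fetch` could be a small function or lambda wrapping the Entrez call and reading the handle.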

Common Patterns

Pattern 1: Fetch Sequence from GenBank

python

from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

# Fetch sequence
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"Description: {record.description}")
print(f"Sequence length: {len(record.seq)}")

Pattern 2: Sequence Analysis Pipeline

python

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

for record in SeqIO.parse("sequences.fasta", "fasta"):
    # Calculate statistics
    gc = gc_fraction(record.seq)
    length = len(record.seq)

    # Find ORFs, translate, etc.
    protein = record.seq.translate()

    print(f"{record.id}: {length} bp, GC={gc:.2%}")
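
The pipeline above translates with the default (standard) genetic code. For mitochondrial or other non-standard sequences the table matters, as this small sketch suggests; the nine-base sequence is an arbitrary example chosen because its codons differ between NCBI tables 1 and 2.

```python
from Bio.Seq import Seq

# Arbitrary codons chosen because they differ between the two tables
seq = Seq("ATGATATGA")

standard = seq.translate(table=1)   # NCBI table 1: Standard
vert_mito = seq.translate(table=2)  # NCBI table 2: Vertebrate Mitochondrial
print(standard, vert_mito)
```

In the standard code ATA encodes isoleucine and TGA is a stop codon, while in the vertebrate mitochondrial code they encode methionine and tryptophan.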

Pattern 3: BLAST and Fetch Top Hits

python

from Bio.Blast import NCBIWWW, NCBIXML
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

# Run BLAST
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
blast_record = NCBIXML.read(result_handle)

# Get top hit accessions
accessions = [aln.accession for aln in blast_record.alignments[:5]]

# Fetch sequences
for acc in accessions:
    handle = Entrez.efetch(db="nucleotide", id=acc, rettype="fasta", retmode="text")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    print(f">{record.description}")

Pattern 4: Build Phylogenetic Tree from Sequences

python

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read alignment
alignment = AlignIO.read("alignment.fasta", "fasta")

# Calculate distances
calculator = DistanceCalculator("identity")
dm = calculator.get_distance(alignment)

# Build tree
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)

# Visualize
Phylo.draw_ascii(tree)

Best Practices

Always read relevant reference documentation before writing code
Use grep to search reference files for specific functions or examples
Validate file formats before parsing
Handle missing data gracefully - Not all records have all fields
Cache downloaded data - Don't repeatedly download the same sequences
Respect NCBI rate limits - Use API keys and proper delays
Test with small datasets before processing large files
Keep Biopython updated to get latest features and bug fixes
Use appropriate genetic code tables for translation
Document analysis parameters for reproducibility
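
The advice to cache downloaded data can be sketched as a small file-backed helper. Everything here is an assumption made for illustration: the name `cached_fetch`, the one-file-per-accession layout, and the stand-in `fake_download`, which takes the place of a real Entrez call so the example runs offline.

```python
import tempfile
from pathlib import Path


def cached_fetch(accession, fetch, cache_dir):
    """Return the record text for `accession`, calling `fetch` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{accession}.txt"
    if path.exists():
        return path.read_text()
    text = fetch(accession)
    path.write_text(text)
    return text


# Stand-in downloader that records how often it is called
download_calls = []


def fake_download(accession):
    download_calls.append(accession)
    return f">{accession}\nATGC\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("EU490707", fake_download, tmp)
    second = cached_fetch("EU490707", fake_download, tmp)

print(len(download_calls))
```

The second call is served from disk, so the downloader runs only once per accession.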

Troubleshooting Common Issues

Issue: "No handlers could be found for logger 'Bio.Entrez'"

Solution: This is just a warning. Set Entrez.email to suppress it.

Issue: "HTTP Error 400" from NCBI

Solution: Check that IDs/accessions are valid and properly formatted.

Issue: "ValueError: EOF" when parsing files

Solution: Verify file format matches the specified format string.

Issue: Alignment fails with "sequences are not the same length"

Solution: Ensure sequences are aligned before using AlignIO or MultipleSeqAlignment.

Issue: BLAST searches are slow

Solution: Use local BLAST for large-scale searches, or cache results.

Issue: PDB parser warnings

Solution: Use PDBParser(QUIET=True) to suppress warnings, or investigate structure quality.

Additional Resources

Official Documentation: https://biopython.org/docs/latest/
Tutorial: https://biopython.org/docs/latest/Tutorial/
Cookbook: https://biopython.org/docs/latest/Tutorial/ (advanced examples)
GitHub: https://github.com/biopython/biopython
Mailing List: biopython@biopython.org

Quick Reference

To locate information in reference files, use these search patterns:

bash

# Search for specific functions
grep -n "function_name" references/*.md

# Find examples of specific tasks
grep -n "example" references/sequence_io.md

# Find all occurrences of a module
grep -n "Bio.Seq" references/*.md

Summary

Biopython provides comprehensive tools for computational molecular biology. When using this skill:

Identify the task domain (sequences, alignments, databases, BLAST, structures, phylogenetics, or advanced)
Consult the appropriate reference file in the references/ directory
Adapt code examples to the specific use case
Combine multiple modules when needed for complex workflows
Follow best practices for file handling, error checking, and data management

The modular reference documentation ensures detailed, searchable information for every major Biopython capability.
