
Biopython - Bioinformatics Library


Industry-standard Python library for computational biology and bioinformatics workflows.

When to Use


  • Parsing and manipulating biological sequences (DNA, RNA, protein)
  • Reading and writing sequence files (FASTA, FASTQ, GenBank, EMBL, SwissProt)
  • Performing sequence alignments (pairwise and multiple)
  • Running and parsing BLAST searches
  • Analyzing protein structures from PDB files
  • Calculating sequence statistics and molecular weights
  • Translating DNA to protein sequences
  • Finding restriction enzyme sites
  • Building and analyzing phylogenetic trees
  • Accessing NCBI databases (Entrez, PubMed)
  • Computing sequence motifs and patterns
  • Analyzing next-generation sequencing data

Reference Documentation


Official docs: https://biopython.org/
Tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html
Search patterns: SeqIO.parse, Seq, AlignIO, NCBIWWW.qblast, PDBParser

Core Principles


Use Biopython For


| Task | Module | Example |
|------|--------|---------|
| Create sequences | Seq | Seq("ATCG") |
| Read sequence files | SeqIO | SeqIO.parse("file.fasta", "fasta") |
| Pairwise alignment | pairwise2 | pairwise2.align.globalxx(s1, s2) |
| Multiple alignment | AlignIO | AlignIO.read("align.fasta", "fasta") |
| BLAST searches | NCBIWWW | NCBIWWW.qblast("blastn", "nr", seq) |
| PDB structures | PDB.PDBParser | PDBParser().get_structure() |
| Phylogenetic trees | Phylo | Phylo.read("tree.xml", "phyloxml") |
| NCBI databases | Entrez | Entrez.esearch(db="nucleotide") |

Do NOT Use For


  • High-performance genome assembly (use SPAdes, Canu)
  • Variant calling from BAM files (use GATK, BCFtools)
  • RNA-seq differential expression (use DESeq2, edgeR)
  • Protein structure prediction (use AlphaFold, RoseTTAFold)
  • Large-scale metagenomics (use specialized pipelines)

Quick Reference


Installation


```bash
# pip (recommended)
pip install biopython

# With optional dependencies
pip install biopython[extra]

# conda
conda install -c conda-forge biopython
```

Development version
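Whichever route you install by, a one-line check confirms the package is importable and shows which version you got:

```python
# Print the installed Biopython version to confirm the install worked
import Bio
print(Bio.__version__)
```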

Standard Imports


```python
# Core sequence handling
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO, AlignIO

# Sequence alignment
from Bio import pairwise2
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalwCommandline

# BLAST
from Bio.Blast import NCBIWWW, NCBIXML

# Structure analysis
from Bio.PDB import PDBParser, PDBIO, Select
from Bio.PDB.DSSP import DSSP
from Bio.PDB.Polypeptide import PPBuilder

# Phylogenetics
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# NCBI Entrez
from Bio import Entrez

# Additional tools
# Note: GC is deprecated in favor of gc_fraction in recent Biopython releases
from Bio.SeqUtils import GC, molecular_weight
from Bio.Restriction import *
```

Basic Pattern - Sequence Creation


```python
from Bio.Seq import Seq

# Create DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGC")

# Transcribe to RNA
rna = dna.transcribe()

# Translate to protein
protein = dna.translate()

# Reverse complement
rev_comp = dna.reverse_complement()

print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"RevComp: {rev_comp}")
```

Basic Pattern - File Reading


```python
from Bio import SeqIO

# Read FASTA file (iterator - memory efficient)
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Sequence: {record.seq[:50]}...")

# Read single sequence
record = SeqIO.read("single.fasta", "fasta")
```

Basic Pattern - Sequence Alignment


```python
from Bio import pairwise2
from Bio.Seq import Seq

seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")

# Global alignment
alignments = pairwise2.align.globalxx(seq1, seq2)

# Print best alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
```

Critical Rules


✅ DO


  • Use iterators for large files - SeqIO.parse() not SeqIO.to_dict()
  • Validate sequences - Check alphabet and length before operations
  • Handle file formats correctly - Match parser to actual file format
  • Check alignment quality - Verify gaps and identity percentages
  • Use appropriate genetic code - Specify table for translation
  • Close file handles - Use context managers or explicit close
  • Filter FASTQ by quality - Don't trust all reads equally
  • Verify PDB structure - Check for missing atoms/residues
  • Set Entrez email - Required for NCBI API usage
  • Handle translation frames - Consider all three reading frames

❌ DON'T


  • Load entire FASTQ into memory - Use streaming
  • Ignore sequence type - DNA, RNA, and protein need different handling
  • Skip quality filtering - FASTQ quality scores matter
  • Use wrong genetic code - Different organisms use different tables
  • Forget stop codons - Handle them explicitly in translation
  • Mix alphabets - Don't compare DNA with protein sequences
  • Trust all BLAST hits - Filter by e-value and identity
  • Ignore chain breaks - PDB structures may have gaps
  • Hammer NCBI servers - Use rate limiting (3 requests/sec without API key)
  • Compare raw sequences - Align first, then compare

Anti-Patterns (NEVER)


```python
from Bio import SeqIO, Entrez, pairwise2

# ❌ BAD: Loading entire file into memory
records = list(SeqIO.parse("huge.fastq", "fastq"))
for record in records:
    process(record)  # OOM for large files!

# ✅ GOOD: Stream processing
for record in SeqIO.parse("huge.fastq", "fastq"):
    if min(record.letter_annotations["phred_quality"]) >= 20:
        process(record)

# ❌ BAD: Translating without checking frame
protein = dna_seq.translate()  # May include stop codons or wrong frame!

# ✅ GOOD: Specify frame and stop codon handling
protein = dna_seq.translate(to_stop=True, table=1)

# ❌ BAD: No email for Entrez
Entrez.esearch(db="nucleotide", term="human")  # NCBI will block you!

# ✅ GOOD: Always set email
Entrez.email = "your.email@example.com"
handle = Entrez.esearch(db="nucleotide", term="human")

# ❌ BAD: Comparing sequences directly
if str(seq1) == str(seq2):
    print("Same")  # Ignores gaps, misalignments!

# ✅ GOOD: Align then compare
alignments = pairwise2.align.globalxx(seq1, seq2)
score = alignments[0][2]  # Alignment score

# ❌ BAD: Wrong file format
records = SeqIO.parse("file.gb", "fasta")  # Wrong!

# ✅ GOOD: Match format to file
records = SeqIO.parse("file.gb", "genbank")
```

Sequence Objects


Creating and Manipulating Sequences


```python
from Bio.Seq import Seq
# Note: Bio.Alphabet (and IUPAC) was removed from Biopython in 1.78; do not import it

# DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Basic operations
length = len(dna)
gc_count = dna.count("G") + dna.count("C")
gc_content = (gc_count / length) * 100

# Find subsequence
position = dna.find("ATG")  # Returns index or -1

# Slicing
first_codon = dna[:3]
last_10 = dna[-10:]

print(f"Length: {length}")
print(f"GC content: {gc_content:.2f}%")
print(f"ATG at position: {position}")
```

Transcription and Translation


```python
from Bio.Seq import Seq

# DNA to RNA transcription
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
rna = dna.transcribe()

# RNA to protein translation
protein = rna.translate()

# DNA to protein (direct)
protein_direct = dna.translate()

# Back-transcription
dna_back = rna.back_transcribe()

# Reverse complement
rev_comp = dna.reverse_complement()

print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"Rev Comp: {rev_comp}")
```

Translation with Different Genetic Codes


```python
from Bio.Seq import Seq

dna = Seq("ATGGGCTAG")

# Standard genetic code (table 1)
protein_standard = dna.translate(table=1)

# Mitochondrial genetic code (table 2)
protein_mito = dna.translate(table=2)

# Stop at first stop codon
protein_to_stop = dna.translate(to_stop=True)

# All three reading frames
for i in range(3):
    frame = dna[i:]
    protein = frame.translate(to_stop=True)
    print(f"Frame {i}: {protein}")
```

Sequence Utilities


```python
from Bio.Seq import Seq
from Bio.SeqUtils import GC, molecular_weight  # recent Biopython: use gc_fraction instead of GC

seq = Seq("ATGGCCATTGTAATGGGCCGC")

# GC content
gc = GC(seq)

# Molecular weight
mw = molecular_weight(seq, seq_type='DNA')

# GC skew (for origin of replication analysis)
from Bio.SeqUtils import GC_skew
skew = GC_skew(seq, window=100)

print(f"GC content: {gc:.2f}%")
print(f"Molecular weight: {mw:.2f} Da")
```
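Finding restriction enzyme sites is listed under "When to Use" but not demonstrated; a minimal sketch with Bio.Restriction, using a made-up sequence that contains one EcoRI (GAATTC) and one BamHI (GGATCC) site:

```python
from Bio.Seq import Seq
from Bio.Restriction import EcoRI, BamHI, RestrictionBatch

# Toy sequence with one EcoRI and one BamHI recognition site
seq = Seq("ATGAATTCGCGGATCCGG")

# Single enzyme: 1-based cut positions (empty list = no site found)
ecori_sites = EcoRI.search(seq)

# Several enzymes at once: dict mapping enzyme -> list of cut positions
batch = RestrictionBatch([EcoRI, BamHI])
results = batch.search(seq)

for enzyme, sites in sorted(results.items(), key=str):
    print(f"{enzyme}: {sites}")
```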

SeqRecord Objects


Creating SeqRecord Objects


```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create with minimal info
seq = Seq("ATGGCCATTG")
record = SeqRecord(seq, id="seq1", description="My sequence")

# With full annotation
record = SeqRecord(
    Seq("ATGGCCATTG"),
    id="gene1",
    name="GeneName",
    description="Gene encoding protein X",
    annotations={
        "organism": "Homo sapiens",
        "molecule_type": "DNA"
    }
)

# Add features
from Bio.SeqFeature import SeqFeature, FeatureLocation
feature = SeqFeature(
    FeatureLocation(0, 10),
    type="CDS",
    qualifiers={"product": "hypothetical protein"}
)
record.features.append(feature)

print(record)
```

SeqRecord Attributes


```python
from Bio import SeqIO

record = SeqIO.read("sequence.gb", "genbank")

# Access components
sequence = record.seq
identifier = record.id
name = record.name
description = record.description

# Annotations (dictionary)
organism = record.annotations.get("organism", "Unknown")

# Features (list)
for feature in record.features:
    if feature.type == "CDS":
        product = feature.qualifiers.get("product", ["Unknown"])[0]
        location = feature.location
        print(f"{product} at {location}")

# Letter annotations (quality scores for FASTQ)
if "phred_quality" in record.letter_annotations:
    avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
    print(f"Average quality: {avg_quality:.2f}")
```

File Input/Output


Reading FASTA Files


```python
from Bio import SeqIO

# Single sequence
record = SeqIO.read("single.fasta", "fasta")
print(f"{record.id}: {record.seq}")

# Multiple sequences (iterator)
for record in SeqIO.parse("multiple.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")

# Load into dictionary (use sparingly!)
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
seq1 = sequences["seq1"]
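When a file is too big for SeqIO.to_dict() but you still need random access by ID, SeqIO.index() gives a dict-like view that parses records lazily from disk. The demo.fasta file below is a throwaway example created just for this sketch:

```python
from Bio import SeqIO

# Create a small throwaway FASTA for the demo
with open("demo.fasta", "w") as f:
    f.write(">seq1\nATGGCC\n>seq2\nGGCCTT\n")

# Dict-like, lazy random access: only the record you ask for is parsed
index = SeqIO.index("demo.fasta", "fasta")
record = index["seq2"]
print(f"{record.id}: {record.seq}")
index.close()
```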

Writing FASTA Files


```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create records
records = []
for i in range(10):
    seq = Seq("ATCG" * (i + 1))
    record = SeqRecord(seq, id=f"seq_{i}", description=f"Sequence {i}")
    records.append(record)

# Write to file
SeqIO.write(records, "output.fasta", "fasta")

# Append to existing file
with open("output.fasta", "a") as f:
    SeqIO.write(record, f, "fasta")
```

Reading FASTQ Files


```python
from Bio import SeqIO

# Read FASTQ with quality scores
for record in SeqIO.parse("reads.fastq", "fastq"):
    seq_id = record.id
    sequence = record.seq
    quality = record.letter_annotations["phred_quality"]

    avg_q = sum(quality) / len(quality)
    min_q = min(quality)

    print(f"{seq_id}: avg Q={avg_q:.1f}, min Q={min_q}")
```

Quality Filtering FASTQ


```python
from Bio import SeqIO

def filter_fastq(input_file, output_file, min_quality=20, min_length=50):
    """Filter FASTQ by quality and length."""
    good_reads = 0

    with open(output_file, "w") as out_handle:
        for record in SeqIO.parse(input_file, "fastq"):
            # Check length
            if len(record.seq) < min_length:
                continue

            # Check quality
            qualities = record.letter_annotations["phred_quality"]
            if min(qualities) >= min_quality:
                SeqIO.write(record, out_handle, "fastq")
                good_reads += 1

    return good_reads

# Filter reads
n_passed = filter_fastq("raw.fastq", "filtered.fastq")
print(f"Passed: {n_passed} reads")
```

Reading GenBank Files


```python
from Bio import SeqIO

# GenBank files have rich annotations
for record in SeqIO.parse("sequence.gb", "genbank"):
    print(f"ID: {record.id}")
    print(f"Organism: {record.annotations['organism']}")
    print(f"Sequence length: {len(record.seq)}")

    # Extract features
    for feature in record.features:
        if feature.type == "CDS":
            product = feature.qualifiers.get("product", ["Unknown"])[0]
            start = feature.location.start
            end = feature.location.end
            print(f"  CDS: {product} ({start}-{end})")
```

Converting File Formats


```python
from Bio import SeqIO

# Convert GenBank to FASTA
records = SeqIO.parse("input.gb", "genbank")
SeqIO.write(records, "output.fasta", "fasta")

# Convert FASTQ to FASTA (loses quality scores)
records = SeqIO.parse("reads.fastq", "fastq")
SeqIO.write(records, "sequences.fasta", "fasta")

# Multiple format conversions in one go
def convert_format(input_file, input_format, output_file, output_format):
    """Generic format converter."""
    count = SeqIO.convert(input_file, input_format, output_file, output_format)
    print(f"Converted {count} records")

convert_format("seq.gb", "genbank", "seq.fasta", "fasta")
```

Sequence Alignment


Pairwise Alignment


```python
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.pairwise2 import format_alignment

seq1 = Seq("ACCGGT")
seq2 = Seq("ACGT")

# Global alignment (Needleman-Wunsch)
alignments = pairwise2.align.globalxx(seq1, seq2)

# Print best alignment
best = alignments[0]
print(format_alignment(*best))

# With match/mismatch and gap scoring (globalms takes these positionally)
alignments = pairwise2.align.globalms(
    seq1, seq2,
    2,     # match score
    -1,    # mismatch penalty
    -0.5,  # gap open penalty
    -0.1,  # gap extension penalty
)
```
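Note that pairwise2 is deprecated in recent Biopython releases; Bio.Align.PairwiseAligner is the supported replacement and covers the same ground. A sketch of the equivalent global alignment with the scoring parameters above:

```python
from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "global"          # "local" gives Smith-Waterman behaviour
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -0.5
aligner.extend_gap_score = -0.1

# Plain strings or Seq objects both work
alignments = aligner.align("ACCGGT", "ACGT")
best = alignments[0]             # alignments come back sorted by score
print(best.score)
print(best)
```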

Custom Scoring


```python
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.Align import substitution_matrices

seq1 = Seq("KEVLA")
seq2 = Seq("KELVA")

# Use BLOSUM62 matrix (globalds: matrix scoring plus open/extend gap penalties)
matrix = substitution_matrices.load("BLOSUM62")
alignments = pairwise2.align.globalds(seq1, seq2, matrix, -10, -0.5)

print(pairwise2.format_alignment(*alignments[0]))
```

Local Alignment


```python
from Bio import pairwise2
from Bio.Seq import Seq

# Local alignment (Smith-Waterman)
seq1 = Seq("GCATGCTAGATGCTA")
seq2 = Seq("ATGCTA")

alignments = pairwise2.align.localxx(seq1, seq2)

# Best local alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
```

Multiple Sequence Alignment


```python
from Bio import AlignIO
from Bio.Align.Applications import ClustalwCommandline

# Run ClustalW (requires a local clustalw2 installation)
cline = ClustalwCommandline("clustalw2", infile="sequences.fasta")
stdout, stderr = cline()

# Read alignment
alignment = AlignIO.read("sequences.aln", "clustal")
print(f"Alignment length: {alignment.get_alignment_length()}")
print(f"Number of sequences: {len(alignment)}")

# Print alignment
print(alignment)

# Access sequences
for record in alignment:
    print(f"{record.id}: {record.seq}")
```

Alignment Statistics


```python
from Bio import AlignIO

alignment = AlignIO.read("alignment.fasta", "fasta")

# Calculate identity
def calculate_identity(alignment):
    """Calculate pairwise identity values."""
    n = len(alignment)
    identities = []

    for i in range(n):
        for j in range(i + 1, n):
            seq1 = alignment[i].seq
            seq2 = alignment[j].seq

            matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
            aligned_length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')

            identity = matches / aligned_length * 100 if aligned_length > 0 else 0
            identities.append(identity)

    return identities

identities = calculate_identity(alignment)
print(f"Average identity: {sum(identities)/len(identities):.2f}%")
```
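The Phylo and TreeConstruction imports listed earlier pick up from an alignment like this: a MultipleSeqAlignment yields a distance matrix, which yields a neighbor-joining tree. The four short sequences are invented purely for illustration:

```python
from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Tiny made-up alignment (all rows must be the same length)
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG"), id="A"),
    SeqRecord(Seq("ACTGCTAGCTAG"), id="B"),
    SeqRecord(Seq("ACTGCTTGCTAG"), id="C"),
    SeqRecord(Seq("ATTGCTAACTAG"), id="D"),
])

# Distance matrix from fraction of mismatching columns
calculator = DistanceCalculator("identity")
dm = calculator.get_distance(alignment)

# Neighbor-joining tree (constructor.upgma(dm) is the alternative)
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)

Phylo.draw_ascii(tree)
```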

BLAST Searches


Running BLAST Online


```python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO

# Read query sequence
record = SeqIO.read("query.fasta", "fasta")

# Run BLAST
result_handle = NCBIWWW.qblast(
    program="blastn",         # blastn, blastp, blastx, tblastn, tblastx
    database="nt",            # nt, nr, refseq_rna, etc.
    sequence=str(record.seq),
    hitlist_size=10
)

# Save results
with open("blast_results.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()
```

Parsing BLAST Results


```python
from Bio.Blast import NCBIXML

# Parse BLAST XML output
with open("blast_results.xml") as result_handle:
    blast_records = NCBIXML.parse(result_handle)

    for blast_record in blast_records:
        print(f"Query: {blast_record.query}")
        print(f"Number of hits: {len(blast_record.alignments)}")

        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 0.001:  # E-value threshold
                    print(f"\n  Hit: {alignment.title}")
                    print(f"  Length: {alignment.length}")
                    print(f"  E-value: {hsp.expect}")
                    print(f"  Identity: {hsp.identities}/{hsp.align_length} "
                          f"({hsp.identities/hsp.align_length*100:.1f}%)")
                    print(f"  Query:   {hsp.query}")
                    print(f"  Subject: {hsp.sbjct}")
```

BLAST with Custom Parameters


```python
from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast(
    program="blastp",
    database="nr",
    sequence=protein_seq,
    expect=0.001,           # E-value threshold
    word_size=3,            # Word size
    matrix_name="BLOSUM62", # Substitution matrix
    gapcosts="11 1",        # Gap costs (open, extend)
    hitlist_size=50,        # Number of hits
    filter="L"              # Low complexity filter
)
```

Batch BLAST


```python
from io import StringIO

from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
import time

def batch_blast(fasta_file, output_file):
    """Run BLAST for multiple sequences, saving the raw XML output."""
    results = []

    with open(output_file, "w") as out:
        for record in SeqIO.parse(fasta_file, "fasta"):
            print(f"BLASTing {record.id}...")

            result_handle = NCBIWWW.qblast(
                "blastn", "nt",
                str(record.seq),
                hitlist_size=5
            )

            # Keep a copy of the raw XML so it can be re-parsed later
            xml_data = result_handle.read()
            out.write(xml_data)

            blast_record = NCBIXML.read(StringIO(xml_data))
            results.append((record.id, blast_record))

            # Be nice to NCBI servers
            time.sleep(1)

    return results
```

Run batch BLAST

```python
results = batch_blast("queries.fasta", "batch_results.xml")
```

Protein Structure Analysis

Loading PDB Files

```python
from Bio.PDB import PDBParser

# Create parser
parser = PDBParser(QUIET=True)

# Load structure
structure = parser.get_structure("protein", "protein.pdb")

# Access hierarchy: Structure → Model → Chain → Residue → Atom
model = structure[0]
chain = model['A']

print(f"Structure ID: {structure.id}")
print(f"Number of models: {len(structure)}")
print(f"Number of chains: {len(model)}")
print(f"Number of residues in chain A: {len(chain)}")
```

Extracting Information

```python
from Bio.PDB import PDBParser

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Iterate through structure
for model in structure:
    for chain in model:
        print(f"Chain {chain.id}")
        for residue in chain:
            # Skip hetero atoms (water, ligands)
            if residue.id[0] == ' ':
                resname = residue.resname
                resid = residue.id[1]

                # Get atoms
                for atom in residue:
                    atom_name = atom.name
                    coord = atom.coord
                    print(f"  {resname}{resid} {atom_name}: {coord}")
```

Calculating Distances

```python
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Get two atoms
chain = structure[0]['A']
atom1 = chain[10]['CA']  # CA atom of residue 10
atom2 = chain[20]['CA']  # CA atom of residue 20

# Calculate distance (Bio.PDB overloads the subtraction operator)
distance = atom1 - atom2
print(f"Distance: {distance:.2f} Å")

# Manual calculation
coord1 = atom1.coord
coord2 = atom2.coord
dist = np.linalg.norm(coord1 - coord2)
print(f"Distance (manual): {dist:.2f} Å")
```
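The same `np.linalg.norm` idea extends to an all-against-all CA distance matrix, the usual starting point for a contact map. A minimal numpy-only sketch with made-up coordinates (in practice the array would be stacked from `atom.coord` of each CA atom):

```python
import numpy as np

# Hypothetical CA coordinates, shape (n_residues, 3)
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [3.8, 3.8, 0.0],
])

# Pairwise distances via broadcasting: dist_matrix[i, j] = |coords[i] - coords[j]|
diff = coords[:, None, :] - coords[None, :, :]
dist_matrix = np.linalg.norm(diff, axis=-1)

# Contact map: residue pairs closer than a cutoff (e.g. 8 Å)
contacts = dist_matrix < 8.0
print(dist_matrix.round(2))
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be inspected for contacts.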

Center of Mass

```python
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

def calculate_center_of_mass(entity):
    """Calculate center of mass for a Structure/Model/Chain."""
    coords = []
    masses = []

    for atom in entity.get_atoms():
        coords.append(atom.coord)
        masses.append(atom.mass)

    coords = np.array(coords)
    masses = np.array(masses)

    com = np.average(coords, axis=0, weights=masses)
    return com
```

Calculate for whole structure

```python
com = calculate_center_of_mass(structure)
print(f"Center of mass: {com}")
```

Selecting Atoms

```python
from Bio.PDB import PDBParser, Selection

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Select all atoms
all_atoms = Selection.unfold_entities(structure, 'A')
print(f"Total atoms: {len(all_atoms)}")

# Select specific atoms
model = structure[0]

# All CA atoms
ca_atoms = [atom for atom in model.get_atoms() if atom.name == 'CA']
print(f"CA atoms: {len(ca_atoms)}")

# Specific residue range
chain_a = model['A']
residues_10_to_20 = [chain_a[i] for i in range(10, 21) if i in chain_a]
```

Secondary Structure with DSSP

```python
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
model = structure[0]

# Run DSSP (requires the dssp/mkdssp executable on PATH)
dssp = DSSP(model, "protein.pdb")

# Extract secondary structure
for key in dssp:
    residue = dssp[key]
    chain_id = key[0]
    res_id = key[1][1]
    ss = residue[2]   # Secondary structure: H=helix, E=strand, '-'=loop (also G, I, B, T, S)
    acc = residue[3]  # Relative solvent accessibility
    print(f"{chain_id} {res_id}: {ss} (ASA={acc:.2f})")
```
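A common next step is tallying the per-residue codes into helix/strand/other fractions. A minimal pure-Python sketch over a plain string of DSSP codes (the string here is hypothetical; in practice it would be collected from the loop above):

```python
from collections import Counter

def ss_fractions(ss_codes):
    """Fraction of residues in helix (H/G/I), strand (E/B), and everything else."""
    helix = set("HGI")
    strand = set("EB")
    counts = Counter(ss_codes)
    n = sum(counts.values())
    h = sum(counts[c] for c in helix) / n
    e = sum(counts[c] for c in strand) / n
    return {"helix": h, "strand": e, "other": 1.0 - h - e}

print(ss_fractions("HHHHEE--EE"))
```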

Extracting Sequence from PDB

```python
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Build polypeptides
ppb = PPBuilder()
for chain in structure[0]:
    for pp in ppb.build_peptides(chain):
        sequence = pp.get_sequence()
        print(f"Chain {chain.id}: {sequence}")
```

Writing PDB Files

```python
from Bio.PDB import PDBParser, PDBIO, Select

# Load structure
parser = PDBParser()
structure = parser.get_structure("protein", "input.pdb")

# Select what to save
class ChainSelect(Select):
    def accept_chain(self, chain):
        return chain.id == 'A'  # Only save chain A

# Write structure
io = PDBIO()
io.set_structure(structure)
io.save("chain_A.pdb", ChainSelect())
```

Phylogenetic Analysis

Building Trees from Alignment

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read alignment
alignment = AlignIO.read("alignment.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

# Construct tree (Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

# Save tree
Phylo.write(tree, "tree.xml", "phyloxml")
print(tree)
```

Tree Visualization

```python
from Bio import Phylo
import matplotlib.pyplot as plt

# Load tree
tree = Phylo.read("tree.xml", "phyloxml")

# Draw tree (ASCII)
Phylo.draw_ascii(tree)

# Draw tree (graphical - requires matplotlib)
Phylo.draw(tree)
plt.savefig("tree.png")

# Interactive view
tree.ladderize()  # Sort clades for a tidier layout
Phylo.draw(tree, do_show=True)
```

Tree Analysis

```python
from Bio import Phylo

tree = Phylo.read("tree.xml", "phyloxml")

# Get all terminals (leaves)
terminals = tree.get_terminals()
print(f"Number of leaves: {len(terminals)}")

# Get all non-terminals (internal nodes)
nonterminals = tree.get_nonterminals()
print(f"Number of internal nodes: {len(nonterminals)}")

# Find common ancestor
clade1 = tree.find_any(name="Species1")
clade2 = tree.find_any(name="Species2")
common = tree.common_ancestor(clade1, clade2)

# Calculate distances
dist = tree.distance(clade1, clade2)
print(f"Distance between Species1 and Species2: {dist:.4f}")

# Get path between two nodes (get_path only goes root-to-target)
path = tree.trace(clade1, clade2)
```

Creating Trees Programmatically

```python
from Bio import Phylo
from Bio.Phylo.BaseTree import Tree, Clade

# Create clades (leaf nodes)
clade_a = Clade(branch_length=0.5, name="A")
clade_b = Clade(branch_length=0.3, name="B")
clade_c = Clade(branch_length=0.4, name="C")

# Create internal node
internal = Clade(branch_length=0.2)
internal.clades = [clade_a, clade_b]

# Create root
root = Clade()
root.clades = [internal, clade_c]

# Create tree from the root clade
tree = Tree(root=root)

# Visualize
Phylo.draw_ascii(tree)
```
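The same `((A,B),C)` topology can also be written textually in Newick format. A pure-Python sketch (no Biopython required, hypothetical helper name) that serializes a nested-tuple tree, where leaves are `(name, branch_length)` and internal nodes are `(children_list, branch_length)`:

```python
def to_newick(node):
    """Serialize a (payload, branch_length) tree into a Newick string."""
    payload, length = node
    if isinstance(payload, str):  # leaf: payload is the taxon name
        label = payload
    else:                          # internal node: payload is a list of children
        label = "(" + ",".join(to_newick(c) for c in payload) + ")"
    return f"{label}:{length}" if length is not None else label

# Same shape as the Clade example above
tree = ([([("A", 0.5), ("B", 0.3)], 0.2), ("C", 0.4)], None)
print(to_newick(tree) + ";")  # -> ((A:0.5,B:0.3):0.2,C:0.4);
```

A string like this can be round-tripped through `Phylo.read` with the `"newick"` format.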

NCBI Entrez

Searching Databases

```python
from Bio import Entrez

# ALWAYS set your email (NCBI requires it)
Entrez.email = "your.email@example.com"

# Search nucleotide database
handle = Entrez.esearch(
    db="nucleotide",
    term="Homo sapiens[Organism] AND COX1",
    retmax=10
)
record = Entrez.read(handle)
handle.close()

print(f"Found {record['Count']} records")
print(f"IDs: {record['IdList']}")
```

Fetching Records

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

# Fetch by ID
handle = Entrez.efetch(
    db="nucleotide",
    id="NC_000001",
    rettype="gb",
    retmode="text"
)
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"ID: {record.id}")
print(f"Description: {record.description}")
print(f"Length: {len(record.seq)}")
```

Batch Downloads

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

def download_sequences(id_list, output_file):
    """Download multiple sequences by ID in a single request."""
    # Fetch records
    handle = Entrez.efetch(
        db="nucleotide",
        id=id_list,
        rettype="fasta",
        retmode="text"
    )

    records = SeqIO.parse(handle, "fasta")
    SeqIO.write(records, output_file, "fasta")
    handle.close()
```

Download

```python
ids = ["NM_001301717", "NM_001301718", "NM_001301719"]
download_sequences(ids, "downloaded.fasta")
```

Entrez with History

```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

# Search with history
search_handle = Entrez.esearch(
    db="nucleotide",
    term="human[organism] AND COX1",
    usehistory="y",
    retmax=1000
)
search_results = Entrez.read(search_handle)
search_handle.close()

webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
count = int(search_results["Count"])
print(f"Found {count} records")

# Fetch in batches
batch_size = 100
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=webenv,
        query_key=query_key
    )
    # Process batch
    data = fetch_handle.read()
    fetch_handle.close()
```

Advanced Workflows

Complete Gene Analysis Pipeline

```python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import GC  # Note: Biopython >=1.80 replaces GC with gc_fraction

Entrez.email = "your.email@example.com"

def analyze_gene(gene_id):
    """Complete analysis of a gene."""
    # 1. Fetch sequence
    handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # 2. Basic stats
    length = len(record.seq)
    gc_content = GC(record.seq)

    # 3. Extract CDS
    cds_list = []
    for feature in record.features:
        if feature.type == "CDS":
            cds_seq = feature.extract(record.seq)
            product = feature.qualifiers.get("product", ["Unknown"])[0]
            cds_list.append((product, cds_seq))

    # 4. Translate CDS
    proteins = []
    for product, cds in cds_list:
        try:
            protein = cds.translate(to_stop=True)
            proteins.append((product, protein))
        except Exception:
            proteins.append((product, None))

    return {
        'id': record.id,
        'length': length,
        'gc': gc_content,
        'n_cds': len(cds_list),
        'proteins': proteins
    }
```

Analyze gene

```python
result = analyze_gene("NM_000518")
print(f"Gene: {result['id']}")
print(f"Length: {result['length']} bp")
print(f"GC: {result['gc']:.2f}%")
print(f"CDS: {result['n_cds']}")
```

RNA-Seq Read Processing

```python
from Bio import SeqIO
import numpy as np

def process_rnaseq_reads(fastq_file, min_quality=30, min_length=50):
    """Process RNA-seq reads with quality filtering."""
    stats = {
        'total': 0,
        'passed_quality': 0,
        'passed_length': 0,
        'final': 0
    }

    passed_reads = []

    for record in SeqIO.parse(fastq_file, "fastq"):
        stats['total'] += 1

        # Quality filter: mean Phred score
        qualities = record.letter_annotations["phred_quality"]
        if np.mean(qualities) < min_quality:
            continue
        stats['passed_quality'] += 1

        # Length filter
        if len(record.seq) < min_length:
            continue
        stats['passed_length'] += 1

        # Adapter trimming omitted here; in practice use a
        # specialized tool such as Cutadapt

        passed_reads.append(record)
        stats['final'] += 1

    return passed_reads, stats
```

Process reads

```python
reads, stats = process_rnaseq_reads("rnaseq.fastq")
print(f"Total: {stats['total']}")
print(f"Passed quality: {stats['passed_quality']}")
print(f"Passed length: {stats['passed_length']}")
print(f"Final: {stats['final']}")
```
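The gate logic itself is easy to unit-test in isolation. A pure-Python sketch of the same quality/length predicate (a hypothetical helper, not part of Biopython), operating on a plain list of Phred scores:

```python
def passes_filters(qualities, seq_length, min_quality=30, min_length=50):
    """Mirror the mean-quality and length filters on plain values."""
    mean_q = sum(qualities) / len(qualities)
    return mean_q >= min_quality and seq_length >= min_length

print(passes_filters([40, 38, 35, 30], seq_length=75))  # True: mean 35.75, long enough
print(passes_filters([20, 18, 22, 25], seq_length=75))  # False: mean Phred too low
```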

Restriction Enzyme Analysis

```python
from Bio import SeqIO
from Bio.Restriction import RestrictionBatch, EcoRI, BamHI, PstI, HindIII

# Load sequence
record = SeqIO.read("plasmid.gb", "genbank")
seq = record.seq

# Create restriction batch
rb = RestrictionBatch([EcoRI, BamHI, PstI, HindIII])

# Find restriction sites
analysis = rb.search(seq)

# Print results
for enzyme, sites in analysis.items():
    if sites:
        print(f"{enzyme}: {len(sites)} site(s) at {sites}")
    else:
        print(f"{enzyme}: No sites found")

# Find enzymes that cut once (good for cloning)
single_cutters = [enz for enz, sites in analysis.items() if len(sites) == 1]
print(f"\nSingle cutters: {[str(e) for e in single_cutters]}")
```
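Conceptually, site finding is string matching against each enzyme's recognition sequence. A pure-Python sketch of the idea for EcoRI's GAATTC site, for illustration only (`Bio.Restriction` additionally handles ambiguity codes and reports cleavage positions rather than recognition-site starts):

```python
def find_sites(sequence, recognition="GAATTC"):
    """Return 1-based start positions of a recognition sequence."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(recognition)
    while start != -1:
        positions.append(start + 1)  # 1-based coordinates
        start = sequence.find(recognition, start + 1)
    return positions

print(find_sites("AAGAATTCTTGAATTCAA"))  # -> [3, 11]
```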

Codon Usage Analysis

```python
from Bio import SeqIO
from collections import Counter

def analyze_codon_usage(genbank_file):
    """Analyze codon usage in CDS features."""
    codon_counts = Counter()
    total_codons = 0

    for record in SeqIO.parse(genbank_file, "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                # Extract CDS sequence
                cds = feature.extract(record.seq)

                # Count codons (the range step guarantees full triplets)
                for i in range(0, len(cds) - 2, 3):
                    codon_counts[str(cds[i:i + 3])] += 1
                    total_codons += 1

    # Calculate frequencies
    codon_freq = {codon: count / total_codons
                  for codon, count in codon_counts.items()}

    return codon_counts, codon_freq
```

Analyze

```python
counts, freqs = analyze_codon_usage("genome.gb")

# Print most common codons
for codon, freq in sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"{codon}: {freq:.4f}")
```
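The counting logic can be verified on a plain string without any files. A minimal sketch, using a short hypothetical in-frame sequence:

```python
from collections import Counter

def codon_frequencies(cds):
    """Codon frequencies for an in-frame coding sequence string."""
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

freqs = codon_frequencies("ATGGCCATGGCCTAA")
print(freqs)  # -> {'ATG': 0.4, 'GCC': 0.4, 'TAA': 0.2}
```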

Performance Optimization

Memory-Efficient File Processing

```python
from Bio import SeqIO

def process_large_fasta(filename, process_func):
    """Process a large FASTA file without loading it into memory."""
    count = 0

    for record in SeqIO.parse(filename, "fasta"):
        process_func(record)
        count += 1

        # Progress report
        if count % 10000 == 0:
            print(f"Processed {count} sequences")

    return count
```

Example processor

```python
from Bio.SeqUtils import GC

def calculate_gc(record):
    gc = GC(record.seq)
    if gc > 60:
        print(f"{record.id}: High GC ({gc:.1f}%)")
```

Process

```python
n = process_large_fasta("large_genome.fasta", calculate_gc)
```

Parallel Processing

```python
from Bio import SeqIO
from Bio.SeqUtils import GC
from multiprocessing import Pool
from functools import partial

def process_sequence(record, min_length=100):
    """Process a single sequence."""
    if len(record.seq) >= min_length:
        return (record.id, len(record.seq), GC(record.seq))
    return None

def parallel_process_fasta(filename, n_processes=4, min_length=100):
    """Process a FASTA file in parallel."""
    # Load sequences (parallelism trades memory for speed here)
    records = list(SeqIO.parse(filename, "fasta"))

    # Process in parallel, binding min_length with partial
    worker = partial(process_sequence, min_length=min_length)
    with Pool(n_processes) as pool:
        results = pool.map(worker, records)

    # Filter None results
    return [r for r in results if r is not None]
```

Process

```python
results = parallel_process_fasta("sequences.fasta", n_processes=4)
for seq_id, length, gc in results[:10]:
    print(f"{seq_id}: {length} bp, GC={gc:.2f}%")
```

Index Files for Random Access

```python
from Bio import SeqIO

# Create index (fast random access, records parsed lazily)
record_dict = SeqIO.index("large.fasta", "fasta")

# Access a specific sequence instantly
record = record_dict["seq_12345"]
print(record.seq)

# Iterate (still memory efficient)
for key in record_dict:
    record = record_dict[key]
    # ... process record here ...

# Close when done
record_dict.close()

# SQLite-backed index (for very large files)
record_dict = SeqIO.index_db("large.idx", "huge.fasta", "fasta")
```

Common Pitfalls and Solutions

Translation Frame Errors

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGC")

# Problem: the wrong reading frame may contain premature stop codons
protein = dna.translate()

# Solution: check all three forward frames
for i in range(3):
    end = len(dna) - (len(dna) - i) % 3  # trim to whole codons
    frame = dna[i:end]
    protein = frame.translate(to_stop=True, table=1)
    if len(protein) > 10:  # Keep only frames giving a reasonable length
        print(f"Frame {i}: {protein}")
```
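The frame choice can be reasoned about with stop codons alone, no translation table needed. A pure-Python sketch (no Biopython) that reports, for each forward frame, how many codons precede the first in-frame stop:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_before_stop(dna, frame):
    """Number of codons before the first in-frame stop (or to the end)."""
    count = 0
    for i in range(frame, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return count
        count += 1
    return count

dna = "ATGGCCATTGTAATGGGCCGC"
for frame in range(3):
    print(f"Frame {frame}: {codons_before_stop(dna, frame)} codons before stop")
# Frames 0, 1, 2 give 7, 3, 6 codons respectively; frame 1 hits a TAA
```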

File Format Mismatches

```python
from Bio import SeqIO
import os

# Problem: wrong format specified (parse is lazy, so force iteration)
try:
    records = list(SeqIO.parse("file.fasta", "genbank"))  # Wrong!
except ValueError:
    print("Format mismatch")

# Solution: guess the format from the file extension
def guess_format(filename):
    ext = os.path.splitext(filename)[1].lower()
    format_map = {
        '.fasta': 'fasta', '.fa': 'fasta', '.fna': 'fasta',
        '.fastq': 'fastq', '.fq': 'fastq',
        '.gb': 'genbank', '.gbk': 'genbank'
    }
    return format_map.get(ext, 'fasta')

fmt = guess_format("file.fasta")  # avoid shadowing the built-in format()
records = SeqIO.parse("file.fasta", fmt)
```

NCBI Rate Limiting

```python
from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"

# Problem: too many requests in a tight loop
def bad_batch_download(id_list):
    for gene_id in id_list:
        handle = Entrez.efetch(db="nucleotide", id=gene_id)
        # No delay - this will get you blocked!

# Solution: add delays between requests
def good_batch_download(id_list):
    results = []
    for gene_id in id_list:
        handle = Entrez.efetch(db="nucleotide", id=gene_id,
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        results.append(record)
        time.sleep(0.34)  # ~3 requests/sec
    return results
```

Better: Use Entrez history for large batches

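With the history approach, the only non-trivial bookkeeping is splitting the total record count into `(retstart, retmax)` windows. A pure-Python helper (hypothetical name, for illustration) mirroring the batching loop from the Entrez-history example earlier:

```python
def batch_windows(count, batch_size=100):
    """(retstart, retmax) pairs covering `count` records in batches."""
    windows = []
    for start in range(0, count, batch_size):
        windows.append((start, min(batch_size, count - start)))
    return windows

print(batch_windows(250, 100))  # -> [(0, 100), (100, 100), (200, 50)]
```

Each pair feeds straight into `Entrez.efetch(..., retstart=start, retmax=size, webenv=webenv, query_key=query_key)`.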

Memory Issues with Large Alignments

```python
from Bio import SeqIO, AlignIO

# Problem: loading a huge alignment all at once
# alignment = AlignIO.read("huge_alignment.fasta", "fasta")  # OOM!

# Solution: process the sequences incrementally
def process_alignment_chunks(filename, process_chunk, chunk_size=1000):
    """Stream sequences and hand them to process_chunk in batches."""
    records = SeqIO.parse(filename, "fasta")

    chunk = []
    for record in records:
        chunk.append(record)

        if len(chunk) >= chunk_size:
            # Process chunk
            process_chunk(chunk)
            chunk = []

    # Process remaining records
    if chunk:
        process_chunk(chunk)
```
This comprehensive Biopython guide covers 50+ examples across all major bioinformatics workflows!