# Biopython - Bioinformatics Library

Industry-standard Python library for computational biology and bioinformatics workflows.
## When to Use
- Parsing and manipulating biological sequences (DNA, RNA, protein)
- Reading and writing sequence files (FASTA, FASTQ, GenBank, EMBL, SwissProt)
- Performing sequence alignments (pairwise and multiple)
- Running and parsing BLAST searches
- Analyzing protein structures from PDB files
- Calculating sequence statistics and molecular weights
- Translating DNA to protein sequences
- Finding restriction enzyme sites
- Building and analyzing phylogenetic trees
- Accessing NCBI databases (Entrez, PubMed)
- Computing sequence motifs and patterns
- Analyzing next-generation sequencing data
## Reference Documentation

Official docs: https://biopython.org/
Tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html
Search patterns: `SeqIO.parse`, `Seq`, `AlignIO`, `NCBIWWW.qblast`, `PDBParser`

## Core Principles
### Use Biopython For
| Task | Module | Example |
|---|---|---|
| Create sequences | `Bio.Seq` | `Seq("ATGGCC")` |
| Read sequence files | `Bio.SeqIO` | `SeqIO.parse("f.fasta", "fasta")` |
| Pairwise alignment | `Bio.pairwise2` | `pairwise2.align.globalxx(s1, s2)` |
| Multiple alignment | `Bio.AlignIO` | `AlignIO.read("f.aln", "clustal")` |
| BLAST searches | `Bio.Blast` | `NCBIWWW.qblast("blastn", "nt", seq)` |
| PDB structures | `Bio.PDB` | `PDBParser().get_structure("x", "x.pdb")` |
| Phylogenetic trees | `Bio.Phylo` | `Phylo.read("tree.nwk", "newick")` |
| NCBI databases | `Bio.Entrez` | `Entrez.efetch(db="nucleotide")` |
### Do NOT Use For
- High-performance genome assembly (use SPAdes, Canu)
- Variant calling from BAM files (use GATK, BCFtools)
- RNA-seq differential expression (use DESeq2, edgeR)
- Protein structure prediction (use AlphaFold, RoseTTAFold)
- Large-scale metagenomics (use specialized pipelines)
## Quick Reference
### Installation
```bash
# pip (recommended)
pip install biopython

# With optional dependencies
pip install biopython[extra]

# conda
conda install -c conda-forge biopython

# Development version
pip install git+https://github.com/biopython/biopython.git
```
### Standard Imports
```python
# Core sequence handling
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO, AlignIO

# Sequence alignment
from Bio import pairwise2
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalwCommandline

# BLAST
from Bio.Blast import NCBIWWW, NCBIXML

# Structure analysis
from Bio.PDB import PDBParser, PDBIO, Select
from Bio.PDB.DSSP import DSSP
from Bio.PDB.Polypeptide import PPBuilder

# Phylogenetics
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# NCBI Entrez
from Bio import Entrez

# Additional tools
from Bio.SeqUtils import GC, molecular_weight
from Bio.Restriction import *
```
### Basic Pattern - Sequence Creation
```python
from Bio.Seq import Seq

# Create DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGC")

# Transcribe to RNA
rna = dna.transcribe()

# Translate to protein
protein = dna.translate()

# Reverse complement
rev_comp = dna.reverse_complement()

print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"RevComp: {rev_comp}")
```
### Basic Pattern - File Reading
```python
from Bio import SeqIO

# Read FASTA file (iterator - memory efficient)
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Sequence: {record.seq[:50]}...")

# Read single sequence
record = SeqIO.read("single.fasta", "fasta")
```
### Basic Pattern - Sequence Alignment
```python
from Bio import pairwise2
from Bio.Seq import Seq

seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")

# Global alignment
alignments = pairwise2.align.globalxx(seq1, seq2)

# Print best alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
```
## Critical Rules
### ✅ DO
- Use iterators for large files - Prefer `SeqIO.parse()` over `SeqIO.to_dict()`
- Validate sequences - Check alphabet and length before operations
- Handle file formats correctly - Match parser to actual file format
- Check alignment quality - Verify gaps and identity percentages
- Use appropriate genetic code - Specify table for translation
- Close file handles - Use context managers or explicit close
- Filter FASTQ by quality - Don't trust all reads equally
- Verify PDB structure - Check for missing atoms/residues
- Set Entrez email - Required for NCBI API usage
- Handle translation frames - Consider all three reading frames
### ❌ DON'T
- Load entire FASTQ into memory - Use streaming
- Ignore sequence type - DNA, RNA, and protein need different handling
- Skip quality filtering - FASTQ quality scores matter
- Use wrong genetic code - Different organisms use different tables
- Forget stop codons - Handle them explicitly in translation
- Mix alphabets - Don't compare DNA with protein sequences
- Trust all BLAST hits - Filter by e-value and identity
- Ignore chain breaks - PDB structures may have gaps
- Hammer NCBI servers - Use rate limiting (3 requests/sec without API key)
- Compare raw sequences - Align first, then compare
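The Entrez etiquette rules above (set an email, stay under the rate limit) can be sketched as a small fetch helper. This is a minimal sketch, not part of the original examples: the email address is a placeholder, `fetch_genbank` is a hypothetical helper name, and the commented-out `Entrez.api_key` line is optional.

```python
import time

from Bio import Entrez, SeqIO

# Required: identify yourself to NCBI (placeholder address - use your own)
Entrez.email = "your.email@example.com"
# Optional: an API key raises the limit from 3 to 10 requests/sec
# Entrez.api_key = "MY_NCBI_API_KEY"

def fetch_genbank(accessions, delay=0.34):
    """Fetch GenBank records sequentially, staying under 3 requests/sec."""
    records = []
    for acc in accessions:
        handle = Entrez.efetch(db="nucleotide", id=acc,
                               rettype="gb", retmode="text")
        records.append(SeqIO.read(handle, "genbank"))
        handle.close()
        time.sleep(delay)  # Pause between requests
    return records
```

When fetching many records, passing a comma-separated `id` string to a single `efetch` call is kinder to NCBI than looping.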
### Anti-Patterns (NEVER)
```python
# ❌ BAD: Loading entire file into memory
records = list(SeqIO.parse("huge.fastq", "fastq"))
for record in records:
    process(record)  # OOM for large files!

# ✅ GOOD: Stream processing
for record in SeqIO.parse("huge.fastq", "fastq"):
    if min(record.letter_annotations["phred_quality"]) >= 20:
        process(record)

# ❌ BAD: Translating without checking frame
protein = dna_seq.translate()  # May include stop codons or wrong frame!

# ✅ GOOD: Specify frame and stop codon handling
protein = dna_seq.translate(to_stop=True, table=1)

# ❌ BAD: No email for Entrez
Entrez.esearch(db="nucleotide", term="human")  # NCBI will block you!

# ✅ GOOD: Always set email
Entrez.email = "your.email@example.com"
handle = Entrez.esearch(db="nucleotide", term="human")

# ❌ BAD: Comparing sequences directly
if str(seq1) == str(seq2):
    print("Same")  # Ignores gaps, misalignments!

# ✅ GOOD: Align then compare
from Bio import pairwise2
alignments = pairwise2.align.globalxx(seq1, seq2)
score = alignments[0][2]  # Alignment score

# ❌ BAD: Wrong file format
records = SeqIO.parse("file.gb", "fasta")  # Wrong!

# ✅ GOOD: Match format to file
records = SeqIO.parse("file.gb", "genbank")
```
## Sequence Objects
### Creating and Manipulating Sequences
```python
from Bio.Seq import Seq

# DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Basic operations
length = len(dna)
gc_count = dna.count("G") + dna.count("C")
gc_content = (gc_count / length) * 100

# Find subsequence
position = dna.find("ATG")  # Returns index or -1

# Slicing
first_codon = dna[:3]
last_10 = dna[-10:]

print(f"Length: {length}")
print(f"GC content: {gc_content:.2f}%")
print(f"ATG at position: {position}")
```
### Transcription and Translation
```python
from Bio.Seq import Seq

# DNA to RNA transcription
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
rna = dna.transcribe()

# RNA to protein translation
protein = rna.translate()

# DNA to protein (direct)
protein_direct = dna.translate()

# Back-transcription
dna_back = rna.back_transcribe()

# Reverse complement
rev_comp = dna.reverse_complement()

print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"Rev Comp: {rev_comp}")
```
### Translation with Different Genetic Codes
```python
from Bio.Seq import Seq

dna = Seq("ATGGGCTAG")

# Standard genetic code (table 1)
protein_standard = dna.translate(table=1)

# Mitochondrial genetic code (table 2)
protein_mito = dna.translate(table=2)

# Stop at first stop codon
protein_to_stop = dna.translate(to_stop=True)

# All three reading frames
for i in range(3):
    frame = dna[i:]
    frame = frame[:len(frame) - len(frame) % 3]  # Trim partial codon to avoid a warning
    protein = frame.translate(to_stop=True)
    print(f"Frame {i}: {protein}")
```
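The forward-frame loop above extends naturally to all six reading frames (three per strand). A minimal sketch; `six_frame_translate` is a hypothetical helper name, not a Biopython function:

```python
from Bio.Seq import Seq

def six_frame_translate(dna):
    """Translate a DNA Seq in all six reading frames (three per strand)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", dna.reverse_complement())):
        for i in range(3):
            frame = seq[i:]
            frame = frame[:len(frame) - len(frame) % 3]  # Trim partial codon
            frames[f"{strand}{i + 1}"] = frame.translate(to_stop=True)
    return frames

proteins = six_frame_translate(Seq("ATGGGCTAG"))
print(proteins["+1"])  # → MG
```

Keying frames as `+1`..`+3` and `-1`..`-3` matches the convention most ORF-finding tools use.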
### Sequence Utilities
```python
from Bio.Seq import Seq
from Bio.SeqUtils import GC, molecular_weight

# Note: recent Biopython releases replace GC with gc_fraction
# (which returns a fraction rather than a percentage)

seq = Seq("ATGGCCATTGTAATGGGCCGC")

# GC content
gc = GC(seq)

# Molecular weight
mw = molecular_weight(seq, seq_type='DNA')

# GC skew (for origin of replication analysis)
from Bio.SeqUtils import GC_skew
skew = GC_skew(seq, window=100)

print(f"GC content: {gc:.2f}%")
print(f"Molecular weight: {mw:.2f} Da")
```
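`Bio.Restriction` (imported in Standard Imports above) covers the restriction-site use case listed under "When to Use", but has no example in this doc. A minimal sketch; the toy sequence is chosen to contain one EcoRI and one BamHI site:

```python
from Bio.Seq import Seq
from Bio.Restriction import EcoRI, RestrictionBatch

seq = Seq("AAGAATTCTTGGATCCAA")

# Single enzyme: returns a list of 1-based cut positions (empty if no site)
print(f"EcoRI cuts at: {EcoRI.search(seq)}")

# Several enzymes at once
batch = RestrictionBatch(["EcoRI", "BamHI"])
for enzyme, sites in batch.search(seq).items():
    print(f"{enzyme}: {sites}")
```

`RestrictionBatch` accepts enzyme names or enzyme objects; its `search()` returns a dict mapping each enzyme to its cut positions.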
## SeqRecord Objects
### Creating SeqRecord Objects
```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create with minimal info
seq = Seq("ATGGCCATTG")
record = SeqRecord(seq, id="seq1", description="My sequence")

# With full annotation
record = SeqRecord(
    Seq("ATGGCCATTG"),
    id="gene1",
    name="GeneName",
    description="Gene encoding protein X",
    annotations={
        "organism": "Homo sapiens",
        "molecule_type": "DNA"
    }
)

# Add features
from Bio.SeqFeature import SeqFeature, FeatureLocation

feature = SeqFeature(
    FeatureLocation(0, 10),
    type="CDS",
    qualifiers={"product": "hypothetical protein"}
)
record.features.append(feature)

print(record)
```
### SeqRecord Attributes
```python
from Bio import SeqIO

record = SeqIO.read("sequence.gb", "genbank")

# Access components
sequence = record.seq
identifier = record.id
name = record.name
description = record.description

# Annotations (dictionary)
organism = record.annotations.get("organism", "Unknown")

# Features (list)
for feature in record.features:
    if feature.type == "CDS":
        product = feature.qualifiers.get("product", ["Unknown"])[0]
        location = feature.location
        print(f"{product} at {location}")

# Letter annotations (quality scores for FASTQ)
if "phred_quality" in record.letter_annotations:
    avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
    print(f"Average quality: {avg_quality:.2f}")
```
## File Input/Output
### Reading FASTA Files
```python
from Bio import SeqIO

# Single sequence
record = SeqIO.read("single.fasta", "fasta")
print(f"{record.id}: {record.seq}")

# Multiple sequences (iterator)
for record in SeqIO.parse("multiple.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")

# Load into dictionary (use sparingly!)
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
seq1 = sequences["seq1"]
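For large files that still need random access, `SeqIO.index()` is a middle ground between `parse()` and `to_dict()`: it behaves like a dictionary but parses records lazily from disk. A sketch (it writes a small demo file first so the snippet is self-contained; the filename is arbitrary):

```python
from Bio import SeqIO

# Create a tiny demo file so the example runs anywhere
with open("demo.fasta", "w") as f:
    f.write(">seq1\nACGTACGT\n>seq2\nTTTTCCCC\n")

# Dict-like, lazily parsed - memory stays flat even for huge files
index = SeqIO.index("demo.fasta", "fasta")
record = index["seq1"]  # Parsed only on access
print(f"{record.id}: {len(record.seq)} bp")
index.close()
```

Unlike `to_dict()`, only the record you look up is held in memory at any time.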
### Writing FASTA Files
```python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create records
records = []
for i in range(10):
    seq = Seq("ATCG" * (i + 1))
    record = SeqRecord(seq, id=f"seq_{i}", description=f"Sequence {i}")
    records.append(record)

# Write to file
SeqIO.write(records, "output.fasta", "fasta")

# Append to existing file
with open("output.fasta", "a") as f:
    SeqIO.write(record, f, "fasta")
```
### Reading FASTQ Files
```python
from Bio import SeqIO

# Read FASTQ with quality scores
for record in SeqIO.parse("reads.fastq", "fastq"):
    seq_id = record.id
    sequence = record.seq
    quality = record.letter_annotations["phred_quality"]
    avg_q = sum(quality) / len(quality)
    min_q = min(quality)
    print(f"{seq_id}: avg Q={avg_q:.1f}, min Q={min_q}")
```

### Quality Filtering FASTQ
```python
from Bio import SeqIO

def filter_fastq(input_file, output_file, min_quality=20, min_length=50):
    """Filter FASTQ by quality and length."""
    good_reads = 0
    with open(output_file, "w") as out_handle:
        for record in SeqIO.parse(input_file, "fastq"):
            # Check length
            if len(record.seq) < min_length:
                continue
            # Check quality
            qualities = record.letter_annotations["phred_quality"]
            if min(qualities) >= min_quality:
                SeqIO.write(record, out_handle, "fastq")
                good_reads += 1
    return good_reads

# Filter reads
n_passed = filter_fastq("raw.fastq", "filtered.fastq")
print(f"Passed: {n_passed} reads")
```
### Reading GenBank Files
```python
from Bio import SeqIO

# GenBank files have rich annotations
for record in SeqIO.parse("sequence.gb", "genbank"):
    print(f"ID: {record.id}")
    print(f"Organism: {record.annotations['organism']}")
    print(f"Sequence length: {len(record.seq)}")
    # Extract features
    for feature in record.features:
        if feature.type == "CDS":
            product = feature.qualifiers.get("product", ["Unknown"])[0]
            start = feature.location.start
            end = feature.location.end
            print(f"  CDS: {product} ({start}-{end})")
```

### Converting File Formats
```python
from Bio import SeqIO

# Convert GenBank to FASTA
records = SeqIO.parse("input.gb", "genbank")
SeqIO.write(records, "output.fasta", "fasta")

# Convert FASTQ to FASTA (loses quality scores)
records = SeqIO.parse("reads.fastq", "fastq")
SeqIO.write(records, "sequences.fasta", "fasta")

# Generic format converter
def convert_format(input_file, input_format, output_file, output_format):
    """Generic format converter."""
    count = SeqIO.convert(input_file, input_format, output_file, output_format)
    print(f"Converted {count} records")

convert_format("seq.gb", "genbank", "seq.fasta", "fasta")
```

## Sequence Alignment
### Pairwise Alignment
```python
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.pairwise2 import format_alignment

# Note: pairwise2 is deprecated in recent Biopython releases;
# Bio.Align.PairwiseAligner is its successor

seq1 = Seq("ACCGGT")
seq2 = Seq("ACGT")

# Global alignment (Needleman-Wunsch)
alignments = pairwise2.align.globalxx(seq1, seq2)

# Print best alignment
best = alignments[0]
print(format_alignment(*best))

# With explicit scoring: match, mismatch, gap open, gap extend
alignments = pairwise2.align.globalms(
    seq1, seq2,
    2,     # Match score
    -1,    # Mismatch penalty
    -0.5,  # Gap open penalty
    -0.1   # Gap extension penalty
)
```
### Custom Scoring
```python
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.Align import substitution_matrices

seq1 = Seq("KEVLA")
seq2 = Seq("KELVA")

# Use BLOSUM62 matrix (globalds: dictionary/matrix scores plus gap penalties;
# globaldx would ignore gap penalties entirely)
matrix = substitution_matrices.load("BLOSUM62")
alignments = pairwise2.align.globalds(
    seq1, seq2,
    matrix,
    -10,   # Gap open penalty
    -0.5   # Gap extension penalty
)
print(pairwise2.format_alignment(*alignments[0]))
```
### Local Alignment
```python
from Bio import pairwise2
from Bio.Seq import Seq

# Local alignment (Smith-Waterman)
seq1 = Seq("GCATGCTAGATGCTA")
seq2 = Seq("ATGCTA")
alignments = pairwise2.align.localxx(seq1, seq2)

# Best local alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
```
### Multiple Sequence Alignment
```python
from Bio import AlignIO
from Bio.Align.Applications import ClustalwCommandline

# Run ClustalW (requires a separate ClustalW installation)
cline = ClustalwCommandline("clustalw2", infile="sequences.fasta")
stdout, stderr = cline()

# Read alignment
alignment = AlignIO.read("sequences.aln", "clustal")
print(f"Alignment length: {alignment.get_alignment_length()}")
print(f"Number of sequences: {len(alignment)}")

# Print alignment
print(alignment)

# Access sequences
for record in alignment:
    print(f"{record.id}: {record.seq}")
```
### Alignment Statistics
```python
from Bio import AlignIO

alignment = AlignIO.read("alignment.fasta", "fasta")

# Calculate identity
def calculate_identity(alignment):
    """Calculate pairwise identity for every sequence pair."""
    n = len(alignment)
    identities = []
    for i in range(n):
        for j in range(i + 1, n):
            seq1 = alignment[i].seq
            seq2 = alignment[j].seq
            matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
            aligned_length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')
            identity = matches / aligned_length * 100 if aligned_length > 0 else 0
            identities.append(identity)
    return identities

identities = calculate_identity(alignment)
print(f"Average identity: {sum(identities)/len(identities):.2f}%")
```
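The phylogenetics imports listed earlier (`Bio.Phylo`, `DistanceCalculator`, `DistanceTreeConstructor`) have no example in this doc. A minimal sketch building a neighbor-joining tree from a toy in-memory alignment (the sequences and IDs are invented for illustration):

```python
from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy alignment - all sequences must share the same length
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGT"), id="seq_a"),
    SeqRecord(Seq("ACGTACGA"), id="seq_b"),
    SeqRecord(Seq("ACGAACGA"), id="seq_c"),
])

# Identity-based distance matrix, then neighbor-joining
dm = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)

# Quick ASCII rendering of the tree
Phylo.draw_ascii(tree)
```

Swap `.nj(dm)` for `.upgma(dm)` to get a UPGMA tree from the same distance matrix.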
## BLAST Searches
### Running BLAST Online
```python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO

# Read query sequence
record = SeqIO.read("query.fasta", "fasta")

# Run BLAST
result_handle = NCBIWWW.qblast(
    program="blastn",      # blastn, blastp, blastx, tblastn, tblastx
    database="nt",         # nt, nr, refseq_rna, etc.
    sequence=str(record.seq),
    hitlist_size=10
)

# Save results
with open("blast_results.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()
```
### Parsing BLAST Results
```python
from Bio.Blast import NCBIXML

# Parse BLAST XML output
with open("blast_results.xml") as result_handle:
    blast_records = NCBIXML.parse(result_handle)
    for blast_record in blast_records:
        print(f"Query: {blast_record.query}")
        print(f"Number of hits: {len(blast_record.alignments)}")
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 0.001:  # E-value threshold
                    print(f"\n  Hit: {alignment.title}")
                    print(f"  Length: {alignment.length}")
                    print(f"  E-value: {hsp.expect}")
                    print(f"  Identity: {hsp.identities}/{hsp.align_length} "
                          f"({hsp.identities/hsp.align_length*100:.1f}%)")
                    print(f"  Query: {hsp.query}")
                    print(f"  Subject: {hsp.sbjct}")
```

### BLAST with Custom Parameters
```python
from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast(
    program="blastp",
    database="nr",
    sequence=protein_seq,
    expect=0.001,            # E-value threshold
    word_size=3,             # Word size
    matrix_name="BLOSUM62",  # Substitution matrix
    gapcosts="11 1",         # Gap costs (open, extend)
    hitlist_size=50,         # Number of hits
    filter="L"               # Low complexity filter
)
```

### Batch BLAST
```python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
import time

def batch_blast(fasta_file, output_file):
    """Run BLAST for multiple sequences."""
    results = []
    for record in SeqIO.parse(fasta_file, "fasta"):
        print(f"BLASTing {record.id}...")
        result_handle = NCBIWWW.qblast(
            "blastn", "nt",
            str(record.seq),
            hitlist_size=5
        )
        blast_record = NCBIXML.read(result_handle)
        results.append((record.id, blast_record))
        # Be nice to NCBI servers
        time.sleep(1)
    return results

# Run batch BLAST
results = batch_blast("queries.fasta", "batch_results.xml")
```
## Protein Structure Analysis
### Loading PDB Files
```python
from Bio.PDB import PDBParser

# Create parser
parser = PDBParser(QUIET=True)

# Load structure
structure = parser.get_structure("protein", "protein.pdb")

# Access hierarchy: Structure → Model → Chain → Residue → Atom
model = structure[0]
chain = model['A']

print(f"Structure ID: {structure.id}")
print(f"Number of models: {len(structure)}")
print(f"Number of chains: {len(model)}")
print(f"Number of residues in chain A: {len(chain)}")
```
### Extracting Information
```python
from Bio.PDB import PDBParser

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Iterate through structure
for model in structure:
    for chain in model:
        print(f"Chain {chain.id}")
        for residue in chain:
            # Skip hetero atoms (water, ligands)
            if residue.id[0] == ' ':
                resname = residue.resname
                resid = residue.id[1]
                # Get atoms
                for atom in residue:
                    atom_name = atom.name
                    coord = atom.coord
                    print(f"  {resname}{resid} {atom_name}: {coord}")
```

### Calculating Distances
```python
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Get two atoms
chain = structure[0]['A']
atom1 = chain[10]['CA']  # CA atom of residue 10
atom2 = chain[20]['CA']  # CA atom of residue 20

# Calculate distance (the subtraction operator is overloaded)
distance = atom1 - atom2
print(f"Distance: {distance:.2f} Å")

# Manual calculation
coord1 = atom1.coord
coord2 = atom2.coord
dist = np.linalg.norm(coord1 - coord2)
print(f"Distance (manual): {dist:.2f} Å")
```
### Center of Mass
```python
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

def calculate_center_of_mass(entity):
    """Calculate the mass-weighted center of mass of a Structure/Model/Chain."""
    coords = []
    masses = []
    for atom in entity.get_atoms():
        coords.append(atom.coord)
        masses.append(atom.mass)
    coords = np.array(coords)
    masses = np.array(masses)
    return np.average(coords, axis=0, weights=masses)

# Calculate for the whole structure
com = calculate_center_of_mass(structure)
print(f"Center of mass: {com}")
```
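Closely related to these coordinate manipulations is structural superposition. A minimal sketch using Bio.PDB's `Superimposer` on hand-built `Atom` objects — the coordinates below are invented for illustration, and on real data you would pass matched atom lists from two parsed structures:

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB import Superimposer

def make_atom(name, coord, serial):
    # Positional args: name, coord, bfactor, occupancy, altloc, fullname, serial
    return Atom(name, np.array(coord, dtype=float), 0.0, 1.0, " ",
                f" {name} ", serial, element="C")

# A "fixed" set of four CA positions, and a "moving" copy shifted by (1, 1, 1)
fixed = [make_atom("CA", c, i) for i, c in
         enumerate([(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)])]
moving = [make_atom("CA", np.array(a.coord) + 1.0, i) for i, a in enumerate(fixed)]

# Find the rotation/translation that best maps moving onto fixed
sup = Superimposer()
sup.set_atoms(fixed, moving)
print(f"RMSD after superposition: {sup.rms:.4f}")  # ~0 for a pure translation
sup.apply(moving)  # update the moving atoms' coordinates in place
```

Because the "moving" set differs from the "fixed" set only by a translation, the residual RMSD is essentially zero.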
### Selecting Atoms
```python
from Bio.PDB import PDBParser, Selection

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Select all atoms
all_atoms = Selection.unfold_entities(structure, 'A')
print(f"Total atoms: {len(all_atoms)}")

# All CA atoms in the first model
model = structure[0]
ca_atoms = [atom for atom in model.get_atoms() if atom.name == 'CA']
print(f"CA atoms: {len(ca_atoms)}")

# A specific residue range
chain_a = model['A']
residues_10_to_20 = [chain_a[i] for i in range(10, 21) if i in chain_a]
```
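Distance-based selection is common enough that Bio.PDB ships a KD-tree helper, `NeighborSearch`. A self-contained sketch on hand-built atoms (the coordinates are invented for illustration; on real data you would pass `Selection.unfold_entities(structure, 'A')`):

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB import NeighborSearch

def make_atom(name, coord, serial):
    # Positional args: name, coord, bfactor, occupancy, altloc, fullname, serial
    return Atom(name, np.array(coord, dtype=float), 0.0, 1.0, " ",
                f" {name} ", serial, element="C")

atoms = [
    make_atom("CA", (0.0, 0.0, 0.0), 1),
    make_atom("CA", (1.5, 0.0, 0.0), 2),
    make_atom("CA", (9.0, 0.0, 0.0), 3),
]

# Build the search tree once, then query as often as needed
ns = NeighborSearch(atoms)
near = ns.search(np.array([0.0, 0.0, 0.0]), 2.0)  # atoms within 2 Å of the origin
print(f"Atoms within 2 Å: {len(near)}")  # 2
```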
### Secondary Structure with DSSP
```python
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
model = structure[0]

# Run DSSP (requires the dssp/mkdssp executable on your PATH)
dssp = DSSP(model, "protein.pdb")

# Extract secondary structure per residue
for key in dssp:
    residue = dssp[key]
    chain_id = key[0]
    res_id = key[1][1]
    ss = residue[2]   # Secondary structure code: H=helix, E=strand, '-'=coil
    acc = residue[3]  # Relative accessible surface area
    print(f"{chain_id} {res_id}: {ss} (ASA={acc:.1f})")
```
### Extracting Sequence from PDB
```python
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Build polypeptides and report each chain's sequence
ppb = PPBuilder()
for chain in structure[0]:
    for pp in ppb.build_peptides(chain):
        sequence = pp.get_sequence()
        print(f"Chain {chain.id}: {sequence}")
```
### Writing PDB Files
```python
from Bio.PDB import PDBParser, PDBIO, Select

# Load structure
parser = PDBParser()
structure = parser.get_structure("protein", "input.pdb")

# Select what to save by subclassing Select
class ChainSelect(Select):
    def accept_chain(self, chain):
        return chain.id == 'A'  # Only save chain A

# Write structure
io = PDBIO()
io.set_structure(structure)
io.save("chain_A.pdb", ChainSelect())
```
## Phylogenetic Analysis

### Building Trees from Alignment
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read alignment
alignment = AlignIO.read("alignment.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

# Construct tree (neighbor-joining)
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

# Save tree
Phylo.write(tree, "tree.xml", "phyloxml")
print(tree)
```
### Tree Visualization
```python
from Bio import Phylo
import matplotlib.pyplot as plt

# Load tree
tree = Phylo.read("tree.xml", "phyloxml")

# Draw tree (ASCII)
Phylo.draw_ascii(tree)

# Draw tree graphically (requires matplotlib)
Phylo.draw(tree)
plt.savefig("tree.png")

# Ladderize (sort clades by size) for a cleaner layout, then view
tree.ladderize()
Phylo.draw(tree, do_show=True)
```
### Tree Analysis
```python
from Bio import Phylo

tree = Phylo.read("tree.xml", "phyloxml")

# Get all terminals (leaves)
terminals = tree.get_terminals()
print(f"Number of leaves: {len(terminals)}")

# Get all non-terminals (internal nodes)
nonterminals = tree.get_nonterminals()
print(f"Number of internal nodes: {len(nonterminals)}")

# Find the common ancestor of two taxa
clade1 = tree.find_any(name="Species1")
clade2 = tree.find_any(name="Species2")
common = tree.common_ancestor(clade1, clade2)

# Calculate the branch-length distance between taxa
dist = tree.distance(clade1, clade2)
print(f"Distance between Species1 and Species2: {dist:.4f}")

# Get the clades along the path between the two taxa
# (get_path only walks from the root to a single target)
path = tree.trace(clade1, clade2)
```
### Creating Trees Programmatically
```python
from Bio import Phylo
from Bio.Phylo.BaseTree import Tree, Clade

# Create clades (leaf nodes)
clade_a = Clade(branch_length=0.5, name="A")
clade_b = Clade(branch_length=0.3, name="B")
clade_c = Clade(branch_length=0.4, name="C")

# Create an internal node grouping A and B
internal = Clade(branch_length=0.2)
internal.clades = [clade_a, clade_b]

# Create the root and assemble the tree
root = Clade()
root.clades = [internal, clade_c]
tree = Tree(root=root)

# Visualize
Phylo.draw_ascii(tree)
```
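Once a tree object exists, the same `TreeMixin` helpers used above work on it. A self-contained sketch that rebuilds the small ((A,B),C) tree, summarizes it, and round-trips it through Newick text in memory:

```python
import io
from Bio import Phylo
from Bio.Phylo.BaseTree import Tree, Clade

# Rebuild the ((A,B),C) tree from above
clade_a = Clade(branch_length=0.5, name="A")
clade_b = Clade(branch_length=0.3, name="B")
internal = Clade(branch_length=0.2, clades=[clade_a, clade_b])
root = Clade(clades=[internal, Clade(branch_length=0.4, name="C")])
tree = Tree(root=root)

# Summary statistics
total = tree.total_branch_length()
names = sorted(t.name for t in tree.get_terminals())
print(f"Total branch length: {total:.1f}")  # 1.4
print(f"Leaves: {names}")                   # ['A', 'B', 'C']

# Round-trip through Newick text
buf = io.StringIO()
Phylo.write(tree, buf, "newick")
newick = buf.getvalue().strip()
tree2 = Phylo.read(io.StringIO(newick), "newick")
print(sorted(t.name for t in tree2.get_terminals()))  # ['A', 'B', 'C']
```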
## NCBI Entrez

### Searching Databases
```python
from Bio import Entrez

# ALWAYS set your email (NCBI requires it and may block you otherwise)
Entrez.email = "your.email@example.com"

# Search the nucleotide database
handle = Entrez.esearch(
    db="nucleotide",
    term="Homo sapiens[Organism] AND COX1",
    retmax=10
)
record = Entrez.read(handle)
handle.close()

print(f"Found {record['Count']} records")
print(f"IDs: {record['IdList']}")
```
### Fetching Records
```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

# Fetch a record by ID
handle = Entrez.efetch(
    db="nucleotide",
    id="NC_000001",
    rettype="gb",
    retmode="text"
)
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"ID: {record.id}")
print(f"Description: {record.description}")
print(f"Length: {len(record.seq)}")
```
### Batch Downloads
```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

def download_sequences(id_list, output_file):
    """Download multiple sequences by ID in a single request."""
    handle = Entrez.efetch(
        db="nucleotide",
        id=id_list,
        rettype="fasta",
        retmode="text"
    )
    records = SeqIO.parse(handle, "fasta")
    SeqIO.write(records, output_file, "fasta")
    handle.close()

# Download
ids = ["NM_001301717", "NM_001301718", "NM_001301719"]
download_sequences(ids, "downloaded.fasta")
```
### Entrez with History
```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

# Search with history so the result set is stored server-side
search_handle = Entrez.esearch(
    db="nucleotide",
    term="human[organism] AND COX1",
    usehistory="y",
    retmax=1000
)
search_results = Entrez.read(search_handle)
search_handle.close()

webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
count = int(search_results["Count"])
print(f"Found {count} records")

# Fetch in batches
batch_size = 100
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=webenv,
        query_key=query_key
    )
    # Process batch
    data = fetch_handle.read()
    fetch_handle.close()
```
## Advanced Workflows

### Complete Gene Analysis Pipeline
```python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import GC

Entrez.email = "your.email@example.com"

def analyze_gene(gene_id):
    """Fetch a GenBank record and run a basic end-to-end analysis."""
    # 1. Fetch the sequence
    handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # 2. Basic stats
    length = len(record.seq)
    gc_content = GC(record.seq)

    # 3. Extract CDS features
    cds_list = []
    for feature in record.features:
        if feature.type == "CDS":
            cds_seq = feature.extract(record.seq)
            product = feature.qualifiers.get("product", ["Unknown"])[0]
            cds_list.append((product, cds_seq))

    # 4. Translate each CDS
    proteins = []
    for product, cds in cds_list:
        try:
            protein = cds.translate(to_stop=True)
            proteins.append((product, protein))
        except Exception:
            proteins.append((product, None))

    return {
        'id': record.id,
        'length': length,
        'gc': gc_content,
        'n_cds': len(cds_list),
        'proteins': proteins
    }

# Analyze a gene (NM_000518 is the human HBB mRNA)
result = analyze_gene("NM_000518")
print(f"Gene: {result['id']}")
print(f"Length: {result['length']} bp")
print(f"GC: {result['gc']:.2f}%")
print(f"CDS: {result['n_cds']}")
```

Note: in Biopython 1.80 and later, `Bio.SeqUtils.GC` has been replaced by `gc_fraction`, which returns a fraction (0–1) rather than a percentage.
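The extract-and-translate core of the pipeline can be exercised offline by building a toy record by hand; the 21 bp sequence and the feature below are invented for illustration:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Toy record: a 21 bp sequence with one forward-strand CDS feature
record = SeqRecord(Seq("ATGGCCATTGTAATGGGCCGC"), id="toy")
cds = SeqFeature(FeatureLocation(0, 21, strand=1), type="CDS",
                 qualifiers={"product": ["toy protein"]})
record.features.append(cds)

# The same extract-and-translate steps as the pipeline above
cds_seq = cds.extract(record.seq)
protein = cds_seq.translate()
print(protein)  # MAIVMGR
```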
### RNA-Seq Read Processing
```python
from Bio import SeqIO
import numpy as np

def process_rnaseq_reads(fastq_file, min_quality=30, min_length=50):
    """Filter RNA-seq reads by mean quality and length."""
    stats = {
        'total': 0,
        'passed_quality': 0,
        'passed_length': 0,
        'final': 0
    }
    passed_reads = []

    for record in SeqIO.parse(fastq_file, "fastq"):
        stats['total'] += 1

        # Quality filter (mean Phred score)
        qualities = record.letter_annotations["phred_quality"]
        if np.mean(qualities) < min_quality:
            continue
        stats['passed_quality'] += 1

        # Length filter
        if len(record.seq) < min_length:
            continue
        stats['passed_length'] += 1

        # Adapter trimming would go here; in practice, use a
        # specialized tool such as Cutadapt
        passed_reads.append(record)
        stats['final'] += 1

    return passed_reads, stats

# Process reads
reads, stats = process_rnaseq_reads("rnaseq.fastq")
print(f"Total: {stats['total']}")
print(f"Passed quality: {stats['passed_quality']}")
print(f"Passed length: {stats['passed_length']}")
print(f"Final: {stats['final']}")
```
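The mean-quality logic can be sanity-checked on a tiny in-memory FASTQ. The two reads below are invented: in Sanger FASTQ encoding, `I` is Phred 40 and `!` is Phred 0:

```python
import io
import statistics
from Bio import SeqIO

fastq = io.StringIO(
    "@good_read\nACGTACGT\n+\nIIIIIIII\n"
    "@bad_read\nACGTACGT\n+\n!!!!!!!!\n"
)

# Biopython decodes the quality line into per-base Phred scores
mean_quals = {}
for record in SeqIO.parse(fastq, "fastq"):
    quals = record.letter_annotations["phred_quality"]
    mean_quals[record.id] = statistics.mean(quals)

print(mean_quals)  # good_read -> 40, bad_read -> 0
```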
### Restriction Enzyme Analysis
```python
from Bio import SeqIO
from Bio.Restriction import RestrictionBatch, EcoRI, BamHI, PstI, HindIII

# Load sequence
record = SeqIO.read("plasmid.gb", "genbank")
seq = record.seq

# Create a restriction batch
rb = RestrictionBatch([EcoRI, BamHI, PstI, HindIII])

# Find restriction sites
analysis = rb.search(seq)

# Print results
for enzyme, sites in analysis.items():
    if sites:
        print(f"{enzyme}: {len(sites)} site(s) at {sites}")
    else:
        print(f"{enzyme}: no sites found")

# Find enzymes that cut exactly once (useful for cloning)
single_cutters = [enz for enz, sites in analysis.items() if len(sites) == 1]
print(f"\nSingle cutters: {[str(e) for e in single_cutters]}")
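Enzyme objects can also be used directly, which makes the behavior easy to verify on a toy sequence; the 16 bp string below is invented and contains exactly one EcoRI recognition site (GAATTC):

```python
from Bio.Seq import Seq
from Bio.Restriction import EcoRI

# One EcoRI site (GAATTC) in the middle of the sequence
seq = Seq("AAAAAGAATTCAAAAA")

sites = EcoRI.search(seq)
print(f"EcoRI cut positions (1-based): {sites}")

# catalyse() returns the fragments a digest would produce
fragments = EcoRI.catalyse(seq)
print(f"Fragments: {len(fragments)}")  # one cut -> 2 fragments
```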
### Codon Usage Analysis
```python
from Bio import SeqIO
from collections import Counter

def analyze_codon_usage(genbank_file):
    """Count codon usage across all CDS features in a GenBank file."""
    codon_counts = Counter()
    total_codons = 0

    for record in SeqIO.parse(genbank_file, "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                # Extract the CDS sequence
                cds = feature.extract(record.seq)
                # Count codons
                for i in range(0, len(cds) - 2, 3):
                    codon = str(cds[i:i+3])
                    codon_counts[codon] += 1
                    total_codons += 1

    # Calculate frequencies
    codon_freq = {codon: count / total_codons
                  for codon, count in codon_counts.items()}
    return codon_counts, codon_freq

# Analyze
counts, freqs = analyze_codon_usage("genome.gb")

# Print the ten most common codons
for codon, freq in sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"{codon}: {freq:.4f}")
```
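Codon-position nucleotide bias is a common companion metric: GC content at the third (wobble) position needs only string slicing. A minimal sketch; the 21 bp CDS below is an invented example:

```python
def gc3(cds):
    """GC percentage at third codon positions of an in-frame CDS string."""
    thirds = cds[2::3]  # every third base, starting from position 3
    return 100 * sum(1 for base in thirds if base in "GC") / len(thirds)

cds = "ATGGCCATTGTAATGGGCCGC"  # 7 codons
print(f"GC3: {gc3(cds):.2f}%")  # 71.43%
```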
## Performance Optimization

### Memory-Efficient File Processing
```python
from Bio import SeqIO
from Bio.SeqUtils import GC

def process_large_fasta(filename, process_func):
    """Stream a large FASTA file without loading it into memory."""
    count = 0
    for record in SeqIO.parse(filename, "fasta"):
        process_func(record)
        count += 1
        # Progress report
        if count % 10000 == 0:
            print(f"Processed {count} sequences")
    return count

# Example processor
def calculate_gc(record):
    gc = GC(record.seq)
    if gc > 60:
        print(f"{record.id}: high GC ({gc:.1f}%)")

# Process
n = process_large_fasta("large_genome.fasta", calculate_gc)
```
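Streaming works for writing too: `SeqIO.write` accepts a generator, so a filter can run end to end in constant memory. A self-contained sketch using in-memory handles (the two records are invented):

```python
import io
from Bio import SeqIO

src = io.StringIO(">short\nATGC\n>long\nATGCATGCATGC\n")
dst = io.StringIO()

# The generator is consumed lazily, one record at a time
keep = (r for r in SeqIO.parse(src, "fasta") if len(r.seq) >= 10)
n = SeqIO.write(keep, dst, "fasta")
print(f"Wrote {n} record(s)")  # 1
```

With real files, replace the `StringIO` handles with filenames; nothing else changes.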
### Parallel Processing
```python
from multiprocessing import Pool

from Bio import SeqIO
from Bio.SeqUtils import GC

def process_sequence(record, min_length=100):
    """Process a single sequence; returns None for short sequences."""
    if len(record.seq) >= min_length:
        return (record.id, len(record.seq), GC(record.seq))
    return None

def parallel_process_fasta(filename, n_processes=4):
    """Process a FASTA file across multiple worker processes."""
    # Load sequences (they must fit in memory to be distributed)
    records = list(SeqIO.parse(filename, "fasta"))

    # Process in parallel
    with Pool(n_processes) as pool:
        results = pool.map(process_sequence, records)

    # Filter out None results
    return [r for r in results if r is not None]

# The __main__ guard is required for multiprocessing on some platforms
if __name__ == "__main__":
    results = parallel_process_fasta("sequences.fasta", n_processes=4)
    for seq_id, length, gc in results[:10]:
        print(f"{seq_id}: {length} bp, GC={gc:.2f}%")
```
### Index Files for Random Access
```python
from Bio import SeqIO

# Create an index (fast random access; sequences stay on disk)
record_dict = SeqIO.index("large.fasta", "fasta")

# Access a specific sequence instantly
record = record_dict["seq_12345"]
print(record.seq)

# Iterate (still memory efficient)
for key in record_dict:
    record = record_dict[key]  # process each record here

# Close when done
record_dict.close()

# SQLite-backed index for very large files; the index persists on disk
record_dict = SeqIO.index_db("large.idx", "huge.fasta", "fasta")
```
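For files small enough to hold in memory, `SeqIO.to_dict` offers the same key-based access as `SeqIO.index` but returns a plain dict. A self-contained sketch with an in-memory handle (the records are invented):

```python
import io
from Bio import SeqIO

fasta = io.StringIO(">seq1\nATGC\n>seq2\nGGCC\n")

# Unlike SeqIO.index, everything is parsed up front into a dict
record_dict = SeqIO.to_dict(SeqIO.parse(fasta, "fasta"))
print(sorted(record_dict))      # ['seq1', 'seq2']
print(record_dict["seq2"].seq)  # GGCC
```

`to_dict` raises a `ValueError` on duplicate IDs, which `index` also rejects; that check is a cheap way to catch malformed files.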
## Common Pitfalls and Solutions

### Translation Frame Errors
```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGC")

# Problem: translating in the wrong reading frame can produce
# internal stop codons or nonsense protein
protein = dna.translate()

# Solution: check all three forward frames
for i in range(3):
    frame = dna[i:]
    protein = frame.translate(to_stop=True, table=1)
    if len(protein) > 10:  # Keep frames that give a reasonable length
        print(f"Frame {i}: {protein}")
```
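On real data the ORF can also sit on the reverse strand, so it is common to scan all six frames. A self-contained sketch; each frame is trimmed to a codon multiple to avoid the partial-codon warning:

```python
from Bio.Seq import Seq

def six_frame_translations(dna):
    """Translate a nucleotide sequence in all six reading frames."""
    frames = {}
    for strand_name, strand in (("+", dna), ("-", dna.reverse_complement())):
        for offset in range(3):
            frame = strand[offset:]
            frame = frame[: len(frame) - len(frame) % 3]  # trim partial codon
            frames[(strand_name, offset)] = frame.translate()
    return frames

dna = Seq("ATGGCCATTGTAATGGGCCGC")
frames = six_frame_translations(dna)
for (strand, offset), protein in sorted(frames.items()):
    print(f"{strand}{offset}: {protein}")
```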
### File Format Mismatches
```python
import os
from Bio import SeqIO

# Problem: the wrong format often fails only when the lazy
# iterator is actually consumed, or silently yields nothing
try:
    records = list(SeqIO.parse("file.fasta", "genbank"))  # Wrong!
    if not records:
        print("No records parsed - likely a format mismatch")
except ValueError:
    print("Format mismatch")

# Solution: infer the format from the file extension
def guess_format(filename):
    ext = os.path.splitext(filename)[1].lower()
    format_map = {
        '.fasta': 'fasta',
        '.fa': 'fasta',
        '.fna': 'fasta',
        '.fastq': 'fastq',
        '.fq': 'fastq',
        '.gb': 'genbank',
        '.gbk': 'genbank'
    }
    return format_map.get(ext, 'fasta')

fmt = guess_format("file.fasta")  # avoid shadowing the builtin "format"
records = SeqIO.parse("file.fasta", fmt)
```
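Extensions can lie, so content sniffing is a slightly more robust complement: the first line of the file distinguishes the common text formats. A minimal sketch, not a full detector:

```python
import io
from Bio import SeqIO

def sniff_format(handle):
    """Guess fasta/fastq/genbank from the first line; None if unknown."""
    first = handle.readline()
    handle.seek(0)  # rewind so the handle can still be parsed
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        return "fastq"
    if first.startswith("LOCUS"):
        return "genbank"
    return None

handle = io.StringIO(">x\nATGC\n")
fmt = sniff_format(handle)
print(fmt)  # fasta
records = list(SeqIO.parse(handle, fmt))
print(len(records))  # 1
```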
### NCBI Rate Limiting
```python
import time
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

# Problem: firing requests in a tight loop
def bad_batch_download(id_list):
    for gene_id in id_list:
        handle = Entrez.efetch(db="nucleotide", id=gene_id)
        # This will get you blocked!

# Solution: throttle to at most ~3 requests per second
def good_batch_download(id_list):
    results = []
    for gene_id in id_list:
        handle = Entrez.efetch(db="nucleotide", id=gene_id,
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        results.append(record)
        time.sleep(0.34)  # ~3 requests/sec (the limit without an API key)
    return results

# Better still: use Entrez history (WebEnv/query_key) for large batches
```
### Memory Issues with Large Alignments
```python
from Bio import SeqIO

# Problem: AlignIO.read("huge_alignment.fasta", "fasta") loads the whole
# alignment into memory at once and can exhaust RAM on large files.

# Solution: process the records incrementally in chunks
def process_alignment_chunks(filename, process_chunk, chunk_size=1000):
    records = SeqIO.parse(filename, "fasta")
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            process_chunk(chunk)
            chunk = []
    # Process any remaining records
    if chunk:
        process_chunk(chunk)
```

This comprehensive Biopython guide covers 50+ examples across all major bioinformatics workflows.