Biopython - Bioinformatics Library
Industry-standard Python library for computational biology and bioinformatics workflows.
- Parsing and manipulating biological sequences (DNA, RNA, protein)
- Reading and writing sequence files (FASTA, FASTQ, GenBank, EMBL, SwissProt)
- Performing sequence alignments (pairwise and multiple)
- Running and parsing BLAST searches
- Analyzing protein structures from PDB files
- Calculating sequence statistics and molecular weights
- Translating DNA to protein sequences
- Finding restriction enzyme sites
- Building and analyzing phylogenetic trees
- Accessing NCBI databases (Entrez, PubMed)
- Computing sequence motifs and patterns
- Analyzing next-generation sequencing data
Reference Documentation
Use Biopython For
| Task | Module | Example |
|---|---|---|
| Create sequences | Bio.Seq | Seq("ATGGCC") |
| Read sequence files | Bio.SeqIO | SeqIO.parse("file.fasta", "fasta") |
| Pairwise alignment | Bio.pairwise2 | pairwise2.align.globalxx(s1, s2) |
| Multiple alignment | Bio.AlignIO | AlignIO.read("align.fasta", "fasta") |
| BLAST searches | Bio.Blast | NCBIWWW.qblast("blastn", "nr", seq) |
| PDB structures | Bio.PDB | PDBParser().get_structure() |
| Phylogenetic trees | Bio.Phylo | Phylo.read("tree.xml", "phyloxml") |
| NCBI databases | Bio.Entrez | Entrez.esearch(db="nucleotide") |
Do NOT Use For
- High-performance genome assembly (use SPAdes, Canu)
- Variant calling from BAM files (use GATK, BCFtools)
- RNA-seq differential expression (use DESeq2, edgeR)
- Protein structure prediction (use AlphaFold, RoseTTAFold)
- Large-scale metagenomics (use specialized pipelines)
With optional dependencies
pip install biopython[extra]
conda install -c conda-forge biopython
Core sequence handling
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO, AlignIO
from Bio import pairwise2
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalwCommandline
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.PDB import PDBParser, PDBIO, Select
from Bio.PDB.DSSP import DSSP
from Bio.PDB.Polypeptide import PPBuilder
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
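The tree-construction imports can be exercised end-to-end on a toy alignment; a minimal sketch (the three-sequence alignment is invented for illustration):

```python
from io import StringIO

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Toy three-sequence alignment, inlined so the example is self-contained
aln = AlignIO.read(StringIO(">a\nACGTACGT\n>b\nACGTACGA\n>c\nACGAACGA\n"), "fasta")

calculator = DistanceCalculator("identity")  # distance = 1 - fraction identical
dm = calculator.get_distance(aln)
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)                    # neighbor-joining; .upgma(dm) also works
Phylo.draw_ascii(tree)
```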
from Bio.SeqUtils import GC, molecular_weight
from Bio.Restriction import *
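Bio.Restriction can report cut sites directly; a small sketch with two common enzymes (the sequence is made up for the demo):

```python
from Bio.Seq import Seq
from Bio.Restriction import BamHI, EcoRI, RestrictionBatch

# Contains one BamHI site (GGATCC) and one EcoRI site (GAATTC)
seq = Seq("GGATCCAAGAATTC")

# Single enzyme: list of 1-based cut positions
ecori_sites = EcoRI.search(seq)

# Several enzymes at once: dict mapping enzyme -> cut positions
batch = RestrictionBatch([EcoRI, BamHI])
results = batch.search(seq)
for enzyme, sites in results.items():
    print(enzyme, sites)
```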
Basic Pattern - Sequence Creation
python
from Bio.Seq import Seq
Create DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGC")
Transcribe to RNA, translate to protein
rna = dna.transcribe()
protein = dna.translate()
rev_comp = dna.reverse_complement()
print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"RevComp: {rev_comp}")
Basic Pattern - File Reading
python
from Bio import SeqIO
Read FASTA file (iterator - memory efficient)
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Sequence: {record.seq[:50]}...")
Read single sequence
record = SeqIO.read("single.fasta", "fasta")
Basic Pattern - Sequence Alignment
python
from Bio import pairwise2
from Bio.Seq import Seq
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")
alignments = pairwise2.align.globalxx(seq1, seq2)
Print best alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
- Use iterators for large files - Stream with SeqIO.parse(), not list(SeqIO.parse())
- Validate sequences - Check alphabet and length before operations
- Handle file formats correctly - Match parser to actual file format
- Check alignment quality - Verify gaps and identity percentages
- Use appropriate genetic code - Specify table for translation
- Close file handles - Use context managers or explicit close
- Filter FASTQ by quality - Don't trust all reads equally
- Verify PDB structure - Check for missing atoms/residues
- Set Entrez email - Required for NCBI API usage
- Handle translation frames - Consider all three reading frames
- Load entire FASTQ into memory - Use streaming
- Ignore sequence type - DNA, RNA, and protein need different handling
- Skip quality filtering - FASTQ quality scores matter
- Use wrong genetic code - Different organisms use different tables
- Forget stop codons - Handle them explicitly in translation
- Mix alphabets - Don't compare DNA with protein sequences
- Trust all BLAST hits - Filter by e-value and identity
- Ignore chain breaks - PDB structures may have gaps
- Hammer NCBI servers - Use rate limiting (3 requests/sec without API key)
- Compare raw sequences - Align first, then compare
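The Entrez points above (set a contact email, throttle requests) look like this in practice. A sketch only: the address is a placeholder and the fetch itself needs network access:

```python
import time

from Bio import Entrez

# NCBI requires a contact address; replace the placeholder with your own
Entrez.email = "your.email@example.com"

def fetch_genbank(accession):
    """Download one GenBank record by accession (network call)."""
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    data = handle.read()
    handle.close()
    time.sleep(0.34)  # stay under ~3 requests/sec without an API key
    return data

# Usage (requires network): text = fetch_genbank("<accession>")
```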
Anti-Patterns (NEVER)
❌ BAD: Loading entire file into memory
records = list(SeqIO.parse("huge.fastq", "fastq"))
for record in records:
    process(record)  # OOM for large files!
✅ GOOD: Stream processing
for record in SeqIO.parse("huge.fastq", "fastq"):
    if min(record.letter_annotations["phred_quality"]) >= 20:
        process(record)
❌ BAD: Translating without checking frame
protein = dna_seq.translate()
May include stop codons or the wrong frame!
✅ GOOD: Specify frame and stop codon handling
protein = dna_seq.translate(to_stop=True, table=1)
❌ BAD: No email for Entrez
Entrez.esearch(db="nucleotide", term="human")
NCBI will block you!
✅ GOOD: Always set email
Entrez.email = "your.email@example.com"
handle = Entrez.esearch(db="nucleotide", term="human")
❌ BAD: Comparing sequences directly
if str(seq1) == str(seq2):
    print("Same")  # Ignores gaps, misalignments!
✅ GOOD: Align then compare
from Bio import pairwise2
alignments = pairwise2.align.globalxx(seq1, seq2)
score = alignments[0].score  # Alignment score
❌ BAD: Wrong file format
records = SeqIO.parse("file.gb", "fasta")  # Wrong!
✅ GOOD: Match format to file
records = SeqIO.parse("file.gb", "genbank")
Creating and Manipulating Sequences
python
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
length = len(dna)
gc_count = dna.count("G") + dna.count("C")
gc_content = (gc_count / length) * 100
position = dna.find("ATG")  # Returns index or -1
first_codon = dna[:3]
last_10 = dna[-10:]
print(f"Length: {length}")
print(f"GC content: {gc_content:.2f}%")
print(f"ATG at position: {position}")
Transcription and Translation
python
from Bio.Seq import Seq
DNA to RNA transcription
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
rna = dna.transcribe()
RNA to protein translation
protein = rna.translate()
DNA to protein (direct)
protein_direct = dna.translate()
RNA back to DNA
dna_back = rna.back_transcribe()
Reverse complement
rev_comp = dna.reverse_complement()
print(f"DNA: {dna}")
print(f"RNA: {rna}")
print(f"Protein: {protein}")
print(f"Rev Comp: {rev_comp}")
Translation with Different Genetic Codes
python
from Bio.Seq import Seq
dna = Seq("ATGGGCTAG")
Standard genetic code (table 1)
protein_standard = dna.translate(table=1)
Vertebrate mitochondrial genetic code (table 2)
protein_mito = dna.translate(table=2)
Stop at first stop codon
protein_to_stop = dna.translate(to_stop=True)
All three forward reading frames
for i in range(3):
    frame = dna[i:]
    protein = frame.translate(to_stop=True)
    print(f"Frame {i}: {protein}")
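The forward-frame loop generalizes to all six frames by also translating the reverse complement; a sketch (the helper name is ours):

```python
from Bio.Seq import Seq

def six_frame_translations(dna):
    """Translate the three forward and three reverse reading frames."""
    frames = {}
    rc = dna.reverse_complement()
    for i in range(3):
        # Trim to a multiple of three to avoid partial-codon warnings
        fwd = dna[i:len(dna) - (len(dna) - i) % 3]
        rev = rc[i:len(rc) - (len(rc) - i) % 3]
        frames[f"+{i + 1}"] = fwd.translate()
        frames[f"-{i + 1}"] = rev.translate()
    return frames

for frame, protein in six_frame_translations(Seq("ATGGCCATTGTAATGGGCCGC")).items():
    print(frame, protein)
```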
python
from Bio.Seq import Seq
from Bio.SeqUtils import GC, molecular_weight
seq = Seq("ATGGCCATTGTAATGGGCCGC")
gc = GC(seq)
mw = molecular_weight(seq, seq_type='DNA')
GC skew (for origin of replication analysis)
from Bio.SeqUtils import GC_skew
skew = GC_skew(seq, window=100)
print(f"GC content: {gc:.2f}%")
print(f"Molecular weight: {mw:.2f} Da")
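Note that Biopython 1.80+ deprecates `GC` in favour of `gc_fraction`, which returns a 0-1 fraction rather than a percentage; a drop-in sketch if you are on a recent release:

```python
from Bio.SeqUtils import gc_fraction

seq = "ATGGCCATTGTAATGGGCCGC"
gc_percent = gc_fraction(seq) * 100  # gc_fraction returns 0.0-1.0
print(f"GC content: {gc_percent:.2f}%")
```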
SeqRecord Objects
Creating SeqRecord Objects
python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
Create with minimal info
seq = Seq("ATGGCCATTG")
record = SeqRecord(seq, id="seq1", description="My sequence")
With full annotation
record = SeqRecord(
    Seq("ATGGCCATTG"),
    id="gene1",
    name="GeneName",
    description="Gene encoding protein X",
    annotations={
        "organism": "Homo sapiens",
        "molecule_type": "DNA"
    }
)
Add a feature
from Bio.SeqFeature import SeqFeature, FeatureLocation
feature = SeqFeature(
    FeatureLocation(0, 10),
    type="CDS",
    qualifiers={"product": "hypothetical protein"}
)
record.features.append(feature)
print(record)
SeqRecord Attributes
python
from Bio import SeqIO
record = SeqIO.read("sequence.gb", "genbank")
sequence = record.seq
identifier = record.id
name = record.name
description = record.description
Annotations (dictionary)
organism = record.annotations.get("organism", "Unknown")
for feature in record.features:
    if feature.type == "CDS":
        product = feature.qualifiers.get("product", ["Unknown"])[0]
        location = feature.location
        print(f"{product} at {location}")
Letter annotations (quality scores for FASTQ)
if "phred_quality" in record.letter_annotations:
    avg_quality = sum(record.letter_annotations["phred_quality"]) / len(record)
    print(f"Average quality: {avg_quality:.2f}")
Reading FASTA Files
python
from Bio import SeqIO
Single sequence
record = SeqIO.read("single.fasta", "fasta")
print(f"{record.id}: {record.seq}")
Multiple sequences (iterator)
for record in SeqIO.parse("multiple.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
Load into dictionary (use sparingly!)
sequences = SeqIO.to_dict(SeqIO.parse("sequences.fasta", "fasta"))
seq1 = sequences["seq1"]
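For random access into large files, `SeqIO.index` keeps only file offsets in memory, unlike `SeqIO.to_dict`; a self-contained sketch (a tiny two-record FASTA is written to a temp file just for the demo):

```python
import os
import tempfile

from Bio import SeqIO

# Write a tiny FASTA so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(path, "w") as fh:
    fh.write(">seq1\nATGC\n>seq2\nGGCC\n")

# Dict-like and lazy: records are parsed from disk on access
index = SeqIO.index(path, "fasta")
print(len(index), str(index["seq2"].seq))
index.close()
```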
Writing FASTA Files
python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
records = []
for i in range(10):
    seq = Seq("ATCG" * (i + 1))
    record = SeqRecord(seq, id=f"seq_{i}", description=f"Sequence {i}")
    records.append(record)
SeqIO.write(records, "output.fasta", "fasta")
Append to existing file
with open("output.fasta", "a") as f:
    SeqIO.write(record, f, "fasta")
Reading FASTQ Files
python
from Bio import SeqIO
Read FASTQ with quality scores
for record in SeqIO.parse("reads.fastq", "fastq"):
    seq_id = record.id
    sequence = record.seq
    quality = record.letter_annotations["phred_quality"]
    avg_q = sum(quality) / len(quality)
    min_q = min(quality)
    print(f"{seq_id}: avg Q={avg_q:.1f}, min Q={min_q}")
Quality Filtering FASTQ
python
from Bio import SeqIO

def filter_fastq(input_file, output_file, min_quality=20, min_length=50):
    """Filter FASTQ by quality and length."""
    good_reads = 0
    with open(output_file, "w") as out_handle:
        for record in SeqIO.parse(input_file, "fastq"):
            # Check length
            if len(record.seq) < min_length:
                continue
            # Check quality
            qualities = record.letter_annotations["phred_quality"]
            if min(qualities) >= min_quality:
                SeqIO.write(record, out_handle, "fastq")
                good_reads += 1
    return good_reads

n_passed = filter_fastq("raw.fastq", "filtered.fastq")
print(f"Passed: {n_passed} reads")
Reading GenBank Files
python
from Bio import SeqIO
GenBank files have rich annotations
for record in SeqIO.parse("sequence.gb", "genbank"):
    print(f"ID: {record.id}")
    print(f"Organism: {record.annotations['organism']}")
    print(f"Sequence length: {len(record.seq)}")
    # Extract features
    for feature in record.features:
        if feature.type == "CDS":
            product = feature.qualifiers.get("product", ["Unknown"])[0]
            start = feature.location.start
            end = feature.location.end
            print(f"  CDS: {product} ({start}-{end})")
Converting File Formats
python
from Bio import SeqIO
Convert GenBank to FASTA
records = SeqIO.parse("input.gb", "genbank")
SeqIO.write(records, "output.fasta", "fasta")
Convert FASTQ to FASTA (loses quality scores)
records = SeqIO.parse("reads.fastq", "fastq")
SeqIO.write(records, "sequences.fasta", "fasta")
Generic conversion in one call
def convert_format(input_file, input_format, output_file, output_format):
    """Generic format converter."""
    count = SeqIO.convert(input_file, input_format, output_file, output_format)
    print(f"Converted {count} records")

convert_format("seq.gb", "genbank", "seq.fasta", "fasta")
python
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.pairwise2 import format_alignment
seq1 = Seq("ACCGGT")
seq2 = Seq("ACGT")
Global alignment (Needleman-Wunsch)
alignments = pairwise2.align.globalxx(seq1, seq2)
Print best alignment
best = alignments[0]
print(format_alignment(*best))
With match/mismatch scores and gap penalties (positional arguments)
alignments = pairwise2.align.globalms(
    seq1, seq2,
    2,     # Match score
    -1,    # Mismatch penalty
    -0.5,  # Gap open penalty
    -0.1,  # Gap extension penalty
)
python
from Bio import pairwise2
from Bio.Seq import Seq
from Bio.Align import substitution_matrices
seq1 = Seq("KEVLA")
seq2 = Seq("KELVA")
Use the BLOSUM62 matrix (globalds: dictionary scores, same gap penalties for both sequences)
matrix = substitution_matrices.load("BLOSUM62")
alignments = pairwise2.align.globalds(
    seq1, seq2,
    matrix,
    -10,   # Gap open penalty
    -0.5,  # Gap extension penalty
)
print(pairwise2.format_alignment(*alignments[0]))
python
from Bio import pairwise2
from Bio.Seq import Seq
Local alignment (Smith-Waterman)
seq1 = Seq("GCATGCTAGATGCTA")
seq2 = Seq("ATGCTA")
alignments = pairwise2.align.localxx(seq1, seq2)
Best local alignment
best = alignments[0]
print(pairwise2.format_alignment(*best))
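Recent Biopython releases deprecate `pairwise2` in favour of `Bio.Align.PairwiseAligner`; the local alignment above translates roughly to:

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"      # Smith-Waterman; "global" gives Needleman-Wunsch
aligner.match_score = 2
aligner.mismatch_score = -1

alignments = aligner.align("GCATGCTAGATGCTA", "ATGCTA")
best = alignments[0]
print(best.score)
print(best)
```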
Multiple Sequence Alignment
python
from Bio import AlignIO
from Bio.Align.Applications import ClustalwCommandline
Run ClustalW (requires the clustalw2 executable)
cline = ClustalwCommandline("clustalw2", infile="sequences.fasta")
stdout, stderr = cline()
Read the resulting alignment
alignment = AlignIO.read("sequences.aln", "clustal")
print(f"Alignment length: {alignment.get_alignment_length()}")
print(f"Number of sequences: {len(alignment)}")
for record in alignment:
    print(f"{record.id}: {record.seq}")
python
from Bio import AlignIO

alignment = AlignIO.read("alignment.fasta", "fasta")

def calculate_identity(alignment):
    """Calculate percent identity for every pair of sequences."""
    n = len(alignment)
    identities = []
    for i in range(n):
        for j in range(i + 1, n):
            seq1 = alignment[i].seq
            seq2 = alignment[j].seq
            matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
            aligned_length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')
            identity = matches / aligned_length * 100 if aligned_length > 0 else 0
            identities.append(identity)
    return identities

identities = calculate_identity(alignment)
print(f"Average identity: {sum(identities)/len(identities):.2f}%")
Running BLAST Online
python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
Read query sequence
record = SeqIO.read("query.fasta", "fasta")
Submit the search (can take minutes)
result_handle = NCBIWWW.qblast(
    program="blastn",         # blastn, blastp, blastx, tblastn, tblastx
    database="nt",            # nt, nr, refseq_rna, etc.
    sequence=str(record.seq),
    hitlist_size=10
)
Save results so NCBI is queried only once
with open("blast_results.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
result_handle.close()
Parsing BLAST Results
python
from Bio.Blast import NCBIXML
Parse BLAST XML output
with open("blast_results.xml") as result_handle:
    blast_records = NCBIXML.parse(result_handle)
    for blast_record in blast_records:
        print(f"Query: {blast_record.query}")
        print(f"Number of hits: {len(blast_record.alignments)}")
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 0.001:  # E-value threshold
                    print(f"\n  Hit: {alignment.title}")
                    print(f"  Length: {alignment.length}")
                    print(f"  E-value: {hsp.expect}")
                    print(f"  Identity: {hsp.identities}/{hsp.align_length} "
                          f"({hsp.identities/hsp.align_length*100:.1f}%)")
                    print(f"  Query: {hsp.query}")
                    print(f"  Subject: {hsp.sbjct}")
BLAST with Custom Parameters
python
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast(
    program="blastp",
    database="nr",
    sequence=protein_seq,
    expect=0.001,            # E-value threshold
    word_size=3,             # Word size
    matrix_name="BLOSUM62",  # Substitution matrix
    gapcosts="11 1",         # Gap costs (open, extend)
    hitlist_size=50,         # Number of hits
    filter="L"               # Low-complexity filter
)
python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO
import time

def batch_blast(fasta_file, output_file):
    """Run BLAST for multiple sequences."""
    results = []
    for record in SeqIO.parse(fasta_file, "fasta"):
        print(f"BLASTing {record.id}...")
        result_handle = NCBIWWW.qblast(
            "blastn", "nt",
            str(record.seq),
            hitlist_size=5
        )
        blast_record = NCBIXML.read(result_handle)
        results.append((record.id, blast_record))
        # Be nice to NCBI servers
        time.sleep(1)
    return results

results = batch_blast("queries.fasta", "batch_results.xml")
Protein Structure Analysis
python
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")
Access hierarchy: Structure → Model → Chain → Residue → Atom
model = structure[0]
chain = model['A']
print(f"Structure ID: {structure.id}")
print(f"Number of models: {len(structure)}")
print(f"Number of chains: {len(model)}")
print(f"Number of residues in chain A: {len(chain)}")
Extracting Information
python
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")
Iterate through structure
for model in structure:
    for chain in model:
        print(f"Chain {chain.id}")
        for residue in chain:
            # Skip hetero atoms (water, ligands)
            if residue.id[0] == ' ':
                resname = residue.resname
                resid = residue.id[1]
                # Get atoms
                for atom in residue:
                    atom_name = atom.name
                    coord = atom.coord
                    print(f"  {resname}{resid} {atom_name}: {coord}")
Calculating Distances
python
from Bio.PDB import PDBParser
import numpy as np
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")
chain = structure[0]['A']
atom1 = chain[10]['CA']  # CA atom of residue 10
atom2 = chain[20]['CA']  # CA atom of residue 20
Subtraction is overloaded to return the distance in Å
distance = atom1 - atom2
print(f"Distance: {distance:.2f} Å")
Equivalent manual calculation
coord1 = atom1.coord
coord2 = atom2.coord
dist = np.linalg.norm(coord1 - coord2)
print(f"Distance (manual): {dist:.2f} Å")
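For contact-style queries, `Bio.PDB.NeighborSearch` finds all atoms within a radius via a KD-tree; a sketch with hand-built atoms so it runs without a PDB file (normally you would pass `list(structure.get_atoms())`):

```python
import numpy as np

from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom

# Three CA atoms on a line at x = 0, 3 and 9 Å, hand-built for the demo
atoms = [
    Atom("CA", np.array([x, 0.0, 0.0], dtype="f"), 0.0, 1.0, " ", " CA ", i, "C")
    for i, x in enumerate([0.0, 3.0, 9.0], start=1)
]

ns = NeighborSearch(atoms)
hits = ns.search(atoms[0].coord, 5.0)  # atoms within 5 Å of the first atom
print(len(hits))  # the query atom itself plus the one 3 Å away
```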
python
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")

def calculate_center_of_mass(entity):
    """Calculate center of mass for a Structure/Model/Chain."""
    coords = []
    masses = []
    for atom in entity.get_atoms():
        coords.append(atom.coord)
        masses.append(atom.mass)
    coords = np.array(coords)
    masses = np.array(masses)
    com = np.average(coords, axis=0, weights=masses)
    return com

Calculate for whole structure
com = calculate_center_of_mass(structure)
print(f"Center of mass: {com}")
python
from Bio.PDB import PDBParser, Selection
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "protein.pdb")
model = structure[0]
All atoms ('A' = atom level)
all_atoms = Selection.unfold_entities(structure, 'A')
print(f"Total atoms: {len(all_atoms)}")
Select specific atoms
ca_atoms = [atom for atom in model.get_atoms() if atom.name == 'CA']
print(f"CA atoms: {len(ca_atoms)}")
Specific residue range
chain_a = model['A']
residues_10_to_20 = [chain_a[i] for i in range(10, 21) if i in chain_a]
Secondary Structure with DSSP
```python
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")
model = structure[0]

# Run DSSP (requires the dssp/mkdssp executable on PATH)
dssp = DSSP(model, "protein.pdb")

# Extract secondary structure per residue
for key in dssp:
    residue = dssp[key]
    chain_id = key[0]
    res_id = key[1][1]
    ss = residue[2]   # Secondary structure: H=helix, E=strand, '-'=coil/other
    acc = residue[3]  # Relative solvent accessibility (fraction of max ASA)
    print(f"{chain_id} {res_id}: {ss} (RSA={acc:.2f})")
```
Extracting Sequence from PDB
```python
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import PPBuilder

parser = PDBParser()
structure = parser.get_structure("protein", "protein.pdb")

# Build polypeptides and extract their sequences
ppb = PPBuilder()
for chain in structure[0]:
    for pp in ppb.build_peptides(chain):
        sequence = pp.get_sequence()
        print(f"Chain {chain.id}: {sequence}")
```
```python
from Bio.PDB import PDBParser, PDBIO, Select

parser = PDBParser()
structure = parser.get_structure("protein", "input.pdb")

# Select what to save
class ChainSelect(Select):
    def accept_chain(self, chain):
        return chain.id == 'A'  # Only save chain A

io = PDBIO()
io.set_structure(structure)
io.save("chain_A.pdb", ChainSelect())
```
Phylogenetic Analysis
Building Trees from Alignment
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("alignment.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

# Construct tree (Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

Phylo.write(tree, "tree.xml", "phyloxml")
print(tree)
```
```python
from Bio import Phylo

tree = Phylo.read("tree.xml", "phyloxml")

# Draw tree (ASCII)
Phylo.draw_ascii(tree)

# Draw tree (graphical - requires matplotlib)
import matplotlib.pyplot as plt
Phylo.draw(tree, do_show=False)
plt.savefig("tree.png")

# Flip branches for a cleaner layout, then display
tree.ladderize()
Phylo.draw(tree, do_show=True)
```
```python
from Bio import Phylo

tree = Phylo.read("tree.xml", "phyloxml")

# Get all terminals (leaves)
terminals = tree.get_terminals()
print(f"Number of leaves: {len(terminals)}")

# Get all non-terminals (internal nodes)
nonterminals = tree.get_nonterminals()
print(f"Number of internal nodes: {len(nonterminals)}")

# Find the common ancestor of two named leaves
clade1 = tree.find_any(name="Species1")
clade2 = tree.find_any(name="Species2")
common = tree.common_ancestor(clade1, clade2)

# Branch-length distance between the two leaves
dist = tree.distance(clade1, clade2)
print(f"Distance between Species1 and Species2: {dist:.4f}")

# Get path between nodes
path = tree.get_path(clade1, clade2)
```
Creating Trees Programmatically
```python
from Bio.Phylo.BaseTree import Tree, Clade

# Create clades (leaf nodes)
clade_a = Clade(branch_length=0.5, name="A")
clade_b = Clade(branch_length=0.3, name="B")
clade_c = Clade(branch_length=0.4, name="C")

# Create an internal node grouping A and B
internal = Clade(branch_length=0.2)
internal.clades = [clade_a, clade_b]

# Build the tree from a root clade
root = Clade()
root.clades = [internal, clade_c]
tree = Tree(root=root, rooted=True)
```
```python
from Bio import Entrez

# ALWAYS set your email (NCBI requires it)
Entrez.email = "your.email@example.com"

# Search nucleotide database
handle = Entrez.esearch(
    db="nucleotide",
    term="Homo sapiens[Organism] AND COX1",
    retmax=10
)
record = Entrez.read(handle)
handle.close()
print(f"Found {record['Count']} records")
print(f"IDs: {record['IdList']}")
```
```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

# Fetch a single GenBank record
handle = Entrez.efetch(
    db="nucleotide",
    id="NC_000001",
    rettype="gb",
    retmode="text"
)
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"ID: {record.id}")
print(f"Description: {record.description}")
print(f"Length: {len(record.seq)}")
```
```python
from Bio import Entrez, SeqIO

Entrez.email = "your.email@example.com"

def download_sequences(id_list, output_file):
    """Download multiple sequences by ID in a single request."""
    handle = Entrez.efetch(
        db="nucleotide",
        id=id_list,
        rettype="fasta",
        retmode="text"
    )
    records = SeqIO.parse(handle, "fasta")
    SeqIO.write(records, output_file, "fasta")
    handle.close()

ids = ["NM_001301717", "NM_001301718", "NM_001301719"]
download_sequences(ids, "downloaded.fasta")
```
Entrez with History
```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

# Search with history
search_handle = Entrez.esearch(
    db="nucleotide",
    term="human[organism] AND COX1",
    usehistory="y",
    retmax=1000
)
search_results = Entrez.read(search_handle)
search_handle.close()

webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
count = int(search_results["Count"])
print(f"Found {count} records")

# Fetch in batches using the server-side history
batch_size = 100
for start in range(0, count, batch_size):
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=webenv,
        query_key=query_key
    )
    # Process batch
    data = fetch_handle.read()
    fetch_handle.close()
```
Complete Gene Analysis Pipeline
```python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import GC  # deprecated in Biopython 1.80+; newer code uses gc_fraction

Entrez.email = "your.email@example.com"

def analyze_gene(gene_id):
    """Complete analysis of a gene record."""
    # 1. Fetch sequence
    handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # 2. Basic stats
    length = len(record.seq)
    gc_content = GC(record.seq)

    # 3. Extract CDS features
    cds_list = []
    for feature in record.features:
        if feature.type == "CDS":
            cds_seq = feature.extract(record.seq)
            product = feature.qualifiers.get("product", ["Unknown"])[0]
            cds_list.append((product, cds_seq))

    # 4. Translate each CDS
    proteins = []
    for product, cds in cds_list:
        try:
            protein = cds.translate(to_stop=True)
            proteins.append((product, protein))
        except Exception:
            proteins.append((product, None))

    return {
        'id': record.id,
        'length': length,
        'gc': gc_content,
        'n_cds': len(cds_list),
        'proteins': proteins
    }

result = analyze_gene("NM_000518")
print(f"Gene: {result['id']}")
print(f"Length: {result['length']} bp")
print(f"GC: {result['gc']:.2f}%")
print(f"CDS: {result['n_cds']}")
```
RNA-Seq Read Processing
```python
from Bio import SeqIO
import numpy as np

def process_rnaseq_reads(fastq_file, min_quality=30, min_length=50):
    """Filter RNA-seq reads by mean quality and length."""
    stats = {
        'total': 0,
        'passed_quality': 0,
        'passed_length': 0,
        'final': 0
    }
    passed_reads = []
    for record in SeqIO.parse(fastq_file, "fastq"):
        stats['total'] += 1
        # Quality filter on the mean Phred score
        qualities = record.letter_annotations["phred_quality"]
        if np.mean(qualities) < min_quality:
            continue
        stats['passed_quality'] += 1
        # Length filter
        if len(record.seq) < min_length:
            continue
        stats['passed_length'] += 1
        # Adapter trimming would go here; in practice,
        # use a specialized tool such as Cutadapt
        passed_reads.append(record)
        stats['final'] += 1
    return passed_reads, stats

reads, stats = process_rnaseq_reads("rnaseq.fastq")
print(f"Total: {stats['total']}")
print(f"Passed quality: {stats['passed_quality']}")
print(f"Passed length: {stats['passed_length']}")
print(f"Final: {stats['final']}")
```
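The adapter step above is deliberately left as a comment. Purely for illustration, a naive 3' adapter trim can be sketched in plain Python (`AGATCGGAAGAGC` is the common Illumina TruSeq adapter prefix; real data should go through Cutadapt or fastp, which handle mismatches and quality-aware trimming):

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove a 3' adapter, allowing a partial adapter prefix at the read end."""
    # Full adapter somewhere in the read: cut at its first occurrence
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Partial adapter prefix hanging off the 3' end of the read
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read

print(trim_3prime_adapter("ACGTAGATCGGAAGAGCTT", "AGATCGGAAGAGC"))  # ACGT
print(trim_3prime_adapter("ACGTACGTAGATCGGAAG", "AGATCGGAAGAGC"))   # ACGTACGT
```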
Restriction Enzyme Analysis
```python
from Bio import SeqIO
from Bio.Restriction import RestrictionBatch, EcoRI, BamHI, PstI, HindIII

record = SeqIO.read("plasmid.gb", "genbank")
seq = record.seq

# Create restriction batch
rb = RestrictionBatch([EcoRI, BamHI, PstI, HindIII])

# Find restriction sites
analysis = rb.search(seq)
for enzyme, sites in analysis.items():
    if sites:
        print(f"{enzyme}: {len(sites)} site(s) at {sites}")
    else:
        print(f"{enzyme}: no sites found")

# Find enzymes that cut exactly once (good for cloning)
single_cutters = [enz for enz, sites in analysis.items() if len(sites) == 1]
print(f"\nSingle cutters: {[str(e) for e in single_cutters]}")
```
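As a library-free illustration of what `RestrictionBatch.search` does, the recognition sites of a few common enzymes (EcoRI GAATTC, BamHI GGATCC, HindIII AAGCTT) can be located by plain string scanning. Note this sketch reports 0-based match starts, while Bio.Restriction reports 1-based cut positions and also handles degenerate sites and circular sequences:

```python
# Recognition sequences for a few common (palindromic) enzymes
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(seq: str, site: str) -> list:
    """Return 0-based start positions of every occurrence of `site`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

dna = "TTGAATTCAAGGATCCTTGAATTC"
for name, site in SITES.items():
    print(name, find_sites(dna, site))
```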
Codon Usage Analysis
```python
from Bio import SeqIO
from collections import Counter

def analyze_codon_usage(genbank_file):
    """Count codon usage across all CDS features."""
    codon_counts = Counter()
    total_codons = 0
    for record in SeqIO.parse(genbank_file, "genbank"):
        for feature in record.features:
            if feature.type == "CDS":
                # Extract the (strand-aware) CDS sequence
                cds = feature.extract(record.seq)
                # Count codons in frame 0
                for i in range(0, len(cds) - 2, 3):
                    codon = str(cds[i:i+3])
                    codon_counts[codon] += 1
                    total_codons += 1
    # Convert counts to frequencies
    codon_freq = {codon: count / total_codons
                  for codon, count in codon_counts.items()}
    return codon_counts, codon_freq

counts, freqs = analyze_codon_usage("genome.gb")

# Print the ten most common codons
for codon, freq in sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"{codon}: {freq:.4f}")
```
Performance Optimization
Memory-Efficient File Processing
```python
from Bio import SeqIO
from Bio.SeqUtils import GC  # deprecated in Biopython 1.80+; newer code uses gc_fraction

def process_large_fasta(filename, process_func):
    """Stream a large FASTA file without loading it into memory."""
    count = 0
    for record in SeqIO.parse(filename, "fasta"):
        process_func(record)
        count += 1
        # Progress report
        if count % 10000 == 0:
            print(f"Processed {count} sequences")
    return count

def calculate_gc(record):
    gc = GC(record.seq)
    if gc > 60:
        print(f"{record.id}: high GC ({gc:.1f}%)")

n = process_large_fasta("large_genome.fasta", calculate_gc)
```
```python
from Bio import SeqIO
from Bio.SeqUtils import GC
from multiprocessing import Pool
from functools import partial

def process_sequence(record, min_length=100):
    """Return (id, length, GC%) for one sequence, or None if too short."""
    if len(record.seq) >= min_length:
        return (record.id, len(record.seq), GC(record.seq))
    return None

def parallel_process_fasta(filename, n_processes=4, min_length=100):
    """Process a FASTA file across multiple worker processes."""
    # Load sequences (records must be picklable to cross process boundaries)
    records = list(SeqIO.parse(filename, "fasta"))
    # Process in parallel, binding the length cutoff with partial
    worker = partial(process_sequence, min_length=min_length)
    with Pool(n_processes) as pool:
        results = pool.map(worker, records)
    # Drop sequences that failed the length filter
    return [r for r in results if r is not None]

results = parallel_process_fasta("sequences.fasta", n_processes=4)
for seq_id, length, gc in results[:10]:
    print(f"{seq_id}: {length} bp, GC={gc:.2f}%")
```
Index Files for Random Access
```python
from Bio import SeqIO

# Create an index for fast random access (lazy; low memory)
record_dict = SeqIO.index("large.fasta", "fasta")

# Access a specific sequence instantly by ID
record = record_dict["seq_12345"]
print(record.seq)

# Iterate (records are parsed one at a time, still memory efficient)
for key in record_dict:
    record = record_dict[key]
    print(record.id)  # ...or any per-record processing

# SQLite-backed index for very large files (persists across runs)
record_dict = SeqIO.index_db("large.idx", "huge.fasta", "fasta")
```
Common Pitfalls and Solutions
Translation Frame Errors
```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGC")

# Problem: the wrong reading frame may contain internal stop codons
protein = dna.translate()

# Solution: check all three forward frames
for i in range(3):
    frame = dna[i:]
    protein = frame.translate(to_stop=True, table=1)
    if len(protein) > 10:  # Keep only frames of reasonable length
        print(f"Frame {i}: {protein}")
```
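The loop above covers only the three forward frames; an ORF can just as well sit on the reverse strand. A stdlib-only sketch of generating all six frames (translation of each frame can then be delegated to `Seq.translate`):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq: str) -> list:
    """Return the six reading frames: three forward, three reverse."""
    rc = reverse_complement(seq)
    return [seq[i:] for i in range(3)] + [rc[i:] for i in range(3)]

frames = six_frames("ATGGCCATT")
print(frames[0])  # ATGGCCATT
print(frames[3])  # AATGGCCAT (reverse complement, frame 0)
```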
File Format Mismatches
```python
from Bio import SeqIO
import os

# Problem: the wrong format string yields zero or garbled records
try:
    records = list(SeqIO.parse("file.fasta", "genbank"))  # Wrong!
except Exception:
    print("Format mismatch")

# Solution: derive the format from the file extension
def guess_format(filename):
    ext = os.path.splitext(filename)[1].lower()
    format_map = {
        '.fasta': 'fasta',
        '.fa': 'fasta',
        '.fna': 'fasta',
        '.fastq': 'fastq',
        '.fq': 'fastq',
        '.gb': 'genbank',
        '.gbk': 'genbank'
    }
    return format_map.get(ext, 'fasta')

fmt = guess_format("file.fasta")  # avoid shadowing the built-in format()
records = SeqIO.parse("file.fasta", fmt)
```
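Extension-based guessing fails on misnamed files; the first line of the file is a more reliable signal. A small sketch (`sniff_format` is a hypothetical helper, not part of Biopython):

```python
def sniff_format(first_line: str) -> str:
    """Guess a sequence file format from its first line:
    '>' = FASTA, '@' = FASTQ, 'LOCUS' = GenBank, 'ID' = EMBL."""
    line = first_line.lstrip()
    if line.startswith(">"):
        return "fasta"
    if line.startswith("@"):
        return "fastq"
    if line.startswith("LOCUS"):
        return "genbank"
    if line.startswith("ID"):
        return "embl"
    raise ValueError("unrecognized format")

# Usage: read one line, then hand the verdict to SeqIO
# with open("file.fasta") as fh:
#     fmt = sniff_format(fh.readline())
print(sniff_format(">seq1 some description"))  # fasta
```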
NCBI Rate Limiting
```python
from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"

# Problem: firing requests in a tight loop
def bad_batch_download(id_list):
    for gene_id in id_list:
        handle = Entrez.efetch(db="nucleotide", id=gene_id)
        # No delay between requests -- NCBI will block you!

# Solution: throttle to at most ~3 requests per second
def good_batch_download(id_list):
    results = []
    for gene_id in id_list:
        handle = Entrez.efetch(db="nucleotide", id=gene_id,
                               rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        results.append(record)
        time.sleep(0.34)  # ~3 requests/sec
    return results

# Better still: use the Entrez history server for large batches
```
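Two further points: registering an NCBI API key (assigned to `Entrez.api_key`) raises the allowance from 3 to about 10 requests per second, and the fixed `time.sleep` can be generalized into a reusable rate limiter. A stdlib-only sketch (`Throttle` is illustrative, not a Biopython class):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""
    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last = 0.0

    def wait(self):
        now = time.monotonic()
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()

throttle = Throttle(max_per_second=3)
for _ in range(3):
    throttle.wait()
    # place each Entrez.efetch(...) call here
```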
Memory Issues with Large Alignments
```python
from Bio import AlignIO, SeqIO

# Problem: loading a huge alignment in one go can exhaust memory
alignment = AlignIO.read("huge_alignment.fasta", "fasta")  # May OOM!

# Solution: stream the records and process them in chunks
def process_alignment_chunks(filename, process_chunk, chunk_size=1000):
    records = SeqIO.parse(filename, "fasta")
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            process_chunk(chunk)
            chunk = []
    # Process any remaining records
    if chunk:
        process_chunk(chunk)
```
This comprehensive Biopython guide covers 50+ examples across all major bioinformatics workflows!