scikit-bio
scikit-bio - Bioinformatics and Ecology
scikit-bio provides the data structures and statistical methods needed for rigorous biological analysis. It excels in calculating alpha/beta diversity, performing ordination (PCoA), and handling complex phylogenetic trees.
When to Use
- Analyzing microbiome data (taxonomic composition, community structure).
- Calculating ecological diversity metrics (Shannon, Simpson, UniFrac).
- Performing ordination for visualization (PCoA, CA, CCA, RDA).
- High-level sequence manipulation (DNA, RNA, Protein with metadata).
- Reading and writing phylogenetic trees (Newick format).
- Pairwise and multiple sequence alignment analysis.
- Statistical testing of community differences (PERMANOVA, ANOSIM).
Reference Documentation
Official docs: http://scikit-bio.org/
Tutorials: http://scikit-bio.org/docs/latest/
Search patterns: skbio.sequence, skbio.stats.distance, skbio.diversity, skbio.stats.ordination
Core Principles
Grammar of Biological Sequences
Instead of using raw strings, scikit-bio uses typed objects (DNA, RNA, Protein). These objects know their alphabet, can handle quality scores (Phred), and support biological operations (transcription, translation).
Distance Matrices
Microbiome research often boils down to comparing samples. The DistanceMatrix object is central, allowing for easy indexing, sub-setting, and statistical testing (e.g., PERMANOVA).
Diversity Metrics
Provides standardized implementations of many alpha and beta diversity metrics used in ecology, which helps keep results reproducible across studies.
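To make concrete what two of the most common alpha metrics measure, here is a plain-Python sketch of the textbook definitions of the Shannon and Simpson indices. It is meant only to illustrate the formulas; the function names are hypothetical, and for real analyses the library implementations should be preferred (their defaults, such as the logarithm base, can differ between releases).

```python
import math

def shannon_index(counts, base=2):
    """Textbook Shannon entropy: H = -sum(p_i * log(p_i)) over non-zero taxa."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)

def simpson_index(counts):
    """Textbook Simpson index: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

counts = [10, 0, 5, 2, 20]  # species abundances in one sample
print(f"Shannon (base 2): {shannon_index(counts):.3f}")
print(f"Simpson: {simpson_index(counts):.3f}")
```

Both indices grow as abundance is spread more evenly over more taxa; a sample dominated by one taxon scores close to zero.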
Quick Reference
Installation
bash
pip install scikit-bio
Standard Imports
python
import skbio
import numpy as np
import pandas as pd
from skbio import DNA, RNA, Protein, Sequence
from skbio.stats.distance import DistanceMatrix
from skbio.diversity import alpha, beta
from skbio.stats.ordination import pcoa
Basic Pattern - Sequence Manipulation
python
from skbio import DNA
# 1. Create a sequence with metadata
seq = DNA("ACC--GTT", metadata={'id': 'sample1', 'desc': 'gene A'})
# 2. Biological operations
rc = seq.reverse_complement()
degapped = seq.degap()
# 3. Validation: the constructor checks the DNA alphabet by default
try:
    DNA("ACGX")  # 'X' is not a DNA character
except ValueError as err:
    print(f"Is valid? False ({err})")
Critical Rules
✅ DO
- Use Typed Sequences - Always use DNA, RNA, or Protein instead of generic Sequence to enable alphabet-specific methods.
- Set Metadata - Use the metadata and positional_metadata (for quality scores) attributes to keep data self-contained.
- Check Alphabet - When importing data from untrusted sources, keep constructor validation on (the default in current releases) to catch non-IUPAC characters; invalid input raises ValueError.
- Project Distance Matrices - Use pcoa() to visualize high-dimensional community data.
- Specify Reference for UniFrac - When calculating UniFrac diversity, ensure your phylogenetic tree contains all taxa present in your abundance table.
- Validate Trees - Use skbio.tree.TreeNode for tree manipulations as it provides robust traversal methods.
❌ DON'T
- Mix Sequence Types - Don't try to align DNA with RNA objects.
- Ignore Quality Scores - If you have FASTQ data, store Phred scores in positional_metadata.
- Manually Calculate Distance - Don't use raw NumPy for biological distances; use skbio.diversity.beta_diversity to access UniFrac or Bray-Curtis.
- Forget ID Matching - When performing ordination or PERMANOVA, ensure IDs in your distance matrix and metadata mapping match exactly.
Anti-Patterns (NEVER)
python
from skbio import DNA
seq_str = "ACCGTT"  # example input
seq = DNA(seq_str)
# ❌ BAD: Manual reverse complement with string logic
# rc_str = seq_str[::-1].replace('A', 't').replace('T', 'a')...  # Fragile
# ✅ GOOD: Built-in validated method
rc_seq = DNA(seq_str).reverse_complement()
# ❌ BAD: Using raw lists for community analysis
# data = [[1, 0, 5], [2, 1, 0]]
# dist = manual_bray_curtis(data)
# ✅ GOOD: Using DistanceMatrix
from skbio.stats.distance import DistanceMatrix
matrix_data = [[0, 0.3, 0.7], [0.3, 0, 0.5], [0.7, 0.5, 0]]  # illustrative values
dm = DistanceMatrix(matrix_data, ids=['S1', 'S2', 'S3'])
# ❌ BAD: Stripping metadata to run scikit-learn
# vals = seq.values  # Metadata lost!
# ✅ GOOD: Process within skbio or use standard IO
seq.write('output.fasta', format='fasta')
Sequence Analysis (skbio.sequence)
Advanced Sequence Operations
python
from skbio import DNA, Protein
# Sequence with quality scores
seq = DNA("ACGT", positional_metadata={'quality': [30, 35, 40, 20]})
# Slicing preserves metadata
sub = seq[1:3]
print(sub.positional_metadata)  # DataFrame holding the quality values 35 and 40
# K-mer frequencies
freqs = seq.kmer_frequencies(k=2)
# Translation (DNA -> Protein)
protein = DNA("ATGCGA").translate()
Diversity Analysis (skbio.diversity)
Alpha Diversity (Within a sample)
python
from skbio.diversity import alpha
counts = [10, 0, 5, 2, 20]  # Species abundances
otus = ['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5']
shannon = alpha.shannon(counts)
simpson = alpha.simpson(counts)
# observed_otus has been deprecated; check your version's docs for the current name
observed_otus = alpha.observed_otus(counts)
print(f"Shannon Index: {shannon:.3f}")
Beta Diversity (Between samples)
python
from skbio.diversity import beta_diversity
import numpy as np
# Abundance table (samples x taxa)
data = np.array([[10, 20, 0],
                 [5, 15, 2],
                 [0, 1, 30]])
ids = ['Sample1', 'Sample2', 'Sample3']
# Bray-Curtis distance (beta.pw_distances is no longer available in current releases)
bc_dm = beta_diversity('braycurtis', data, ids=ids)
# Weighted UniFrac (requires a tree; otu_ids and tree are assumed to be defined elsewhere,
# and newer releases name the otu_ids keyword taxa)
unifrac_dm = beta_diversity('weighted_unifrac', data, ids=ids, otu_ids=otu_ids, tree=tree)
Phylogenetics (skbio.tree)
Handling Trees
python
from skbio import TreeNode
from io import StringIO
# Load Newick tree
tree_str = "((A:0.1, B:0.2)C:0.3, D:0.4)E;"
tree = TreeNode.read(StringIO(tree_str))
# Traverse
for node in tree.tips():
    print(f"Leaf: {node.name}, dist: {node.length}")
# Find Lowest Common Ancestor
lca = tree.lowest_common_ancestor(['A', 'B'])
print(f"LCA of A and B is {lca.name}")
# Rooting: root_at returns the rerooted tree; an internal node is used here
# because some releases refuse to root at a tip
rooted = tree.root_at('C')
Statistics and Ordination
PCoA (Principal Coordinates Analysis)
python
from skbio.stats.ordination import pcoa
from skbio.stats.distance import DistanceMatrix
# Create DM
dm = DistanceMatrix([[0, 0.5, 0.8], [0.5, 0, 0.2], [0.8, 0.2, 0]],
                    ids=['A', 'B', 'C'])
# Perform PCoA
results = pcoa(dm)
# View proportion of variance explained
print(results.proportion_explained)
# Access coordinates for plotting
coords = results.samples
Community Testing (PERMANOVA)
python
import pandas as pd
from skbio.stats.distance import permanova
# Metadata linking IDs to groups
metadata = pd.DataFrame({
    'BodySite': ['Gut', 'Gut', 'Skin', 'Skin']},
    index=['S1', 'S2', 'S3', 'S4'])
# Assuming dm is a DistanceMatrix of S1..S4
results = permanova(dm, metadata, column='BodySite', permutations=999)
print(results['p-value'])
Practical Workflows
1. Microbiome Distance Visualization
python
def plot_microbiome_pcoa(abundance_df, metadata_df, group_col):
    """Full workflow: Abundance -> Distance -> PCoA -> Table."""
    from skbio.diversity import beta_diversity
    from skbio.stats.ordination import pcoa
    # 1. Calculate Bray-Curtis distance (counts must be non-negative integers)
    dm = beta_diversity('braycurtis', abundance_df.values, ids=abundance_df.index)
    # 2. PCoA
    pc = pcoa(dm)
    # 3. Merge with metadata for plotting
    plot_data = pc.samples[['PC1', 'PC2']].join(metadata_df)
    return plot_data  # Ready for Seaborn/Plotly
2. Sequence QC and Filtering
python
import numpy as np
def filter_low_quality_sequences(sequences, min_avg_qual=30):
    """Filters DNA sequences based on Phred scores."""
    valid_seqs = []
    for s in sequences:
        # Assuming quality is in positional_metadata
        avg_q = np.mean(s.positional_metadata['quality'])
        if avg_q >= min_avg_qual:
            valid_seqs.append(s)
    return valid_seqs
3. Phylogenetic Distance Calculation
python
def get_tip_distances(tree):
    """Returns a distance matrix of all tips in a tree."""
    dm = tree.tip_tip_distances()
    return dm
Performance Optimization
Vectorized Diversity
When calculating diversity for many samples, pass the entire 2D array to skbio.diversity.beta_diversity (beta.pw_distances in older releases) instead of looping through pairs.
Tree Traversal
Use tree.preorder() or tree.postorder() instead of manual recursion; on large (10,000+ tip) trees the built-in iterators perform better and sidestep Python's recursion limit.
Common Pitfalls and Solutions
The "Missing ID" in Distance Matrix
PERMANOVA and PCoA will fail if your metadata index and DistanceMatrix IDs don't match.
python
# ✅ Solution: Ensure alignment
common_ids = [i for i in dm.ids if i in metadata.index]  # keeps DistanceMatrix order
dm_sub = dm.filter(common_ids)
metadata_sub = metadata.loc[common_ids]
UniFrac Memory Usage
UniFrac calculation can be memory-intensive. For massive datasets, consider using the optimized versions such as skbio.diversity.beta.unweighted_unifrac, or external tools like Unifrac-Binaries.
IUPAC Ambiguity
Standard DNA objects don't allow arbitrary characters.
python
# ❌ Problem: DNA("ACGN") raises error if 'N' isn't handled
# ✅ Solution: scikit-bio DNA supports IUPAC (N, R, Y, etc.) by default.
# But if you have non-standard characters:
from skbio import Sequence
custom = Sequence("ACG-X")  # Generic sequence permits anything
scikit-bio is the mathematical heart of modern microbiome and evolutionary research. By enforcing strict biological typing and providing validated ecological metrics, it ensures that your biological insights are grounded in statistical rigor.