scikit-bio

scikit-bio - Bioinformatics and Ecology


scikit-bio provides the data structures and statistical methods needed for rigorous biological analysis. It excels in calculating alpha/beta diversity, performing ordination (PCoA), and handling complex phylogenetic trees.

When to Use


  • Analyzing microbiome data (taxonomic composition, community structure).
  • Calculating ecological diversity metrics (Shannon, Simpson, UniFrac).
  • Performing ordination for visualization (PCoA, CA, CCA, RDA).
  • High-level sequence manipulation (DNA, RNA, Protein with metadata).
  • Reading and writing phylogenetic trees (Newick format).
  • Pairwise and multiple sequence alignment analysis.
  • Statistical testing of community differences (PERMANOVA, ANOSIM).

Reference Documentation


Official docs: http://scikit-bio.org/
Tutorials: http://scikit-bio.org/docs/latest/
Search patterns: skbio.sequence, skbio.stats.distance, skbio.diversity, skbio.stats.ordination

Core Principles


Grammar of Biological Sequences


Instead of using raw strings, scikit-bio uses typed objects (DNA, RNA, Protein). These objects know their alphabet, can handle quality scores (Phred), and support biological operations (transcription, translation).

Distance Matrices


Microbiome research often boils down to comparing samples. The DistanceMatrix object is central, allowing for easy indexing, sub-setting, and statistical testing (e.g., PERMANOVA).

Diversity Metrics


Provides standardized implementations of many diversity metrics used in ecology (alpha metrics such as Shannon and Simpson, beta metrics such as Bray-Curtis and UniFrac), ensuring reproducibility across studies.
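To make concrete what two of these metrics measure, here is a plain-Python sketch of the textbook Shannon and Simpson formulas. It is meant for checking intuition, not as a replacement for the library functions. The helper names `shannon` and `simpson` are defined here for illustration, and the logarithm base is passed explicitly because implementations differ in their default.

```python
import math

def shannon(counts, base=2):
    """Shannon entropy H = -sum(p_i * log(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)

def simpson(counts):
    """Simpson index 1 - sum(p_i ** 2): chance that two random draws differ."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

even = [1, 1, 1, 1]      # four equally abundant taxa
single = [5]             # a single taxon
print(shannon(even), simpson(even), shannon(single), simpson(single))
```

Evenly distributed communities score high on both indices, while a community dominated by one taxon scores near zero.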

Quick Reference


Installation


bash
pip install scikit-bio

Standard Imports


python
import skbio
import numpy as np
import pandas as pd
from skbio import DNA, RNA, Protein, Sequence
from skbio.stats.distance import DistanceMatrix
from skbio.diversity import alpha, beta
from skbio.stats.ordination import pcoa

Basic Pattern - Sequence Manipulation

python
from skbio import DNA

# 1. Create a sequence with metadata
seq = DNA("ACC--GTT", metadata={'id': 'sample1', 'desc': 'gene A'})

# 2. Biological operations
rc = seq.reverse_complement()
degapped = seq.degap()

# 3. Validation: the DNA alphabet is checked when the object is constructed,
#    so an invalid character raises ValueError instead of needing a later check
try:
    DNA("ACGX")
except ValueError as err:
    print(f"Is valid? False ({err})")

Critical Rules


✅ DO


  • Use Typed Sequences - Always use DNA, RNA, or Protein instead of generic Sequence to enable alphabet-specific methods.
  • Set Metadata - Use the metadata and positional_metadata (for quality scores) attributes to keep data self-contained.
  • Check Alphabet - Keep validation enabled (the default) when constructing DNA, RNA, or Protein objects from untrusted sources; non-IUPAC characters then raise a ValueError.
  • Project Distance Matrices - Use pcoa() to visualize high-dimensional community data.
  • Specify Reference for UniFrac - When calculating UniFrac diversity, ensure your phylogenetic tree contains all taxa present in your abundance table.
  • Validate Trees - Use skbio.tree.TreeNode for tree manipulations as it provides robust traversal methods.

❌ DON'T


  • Mix Sequence Types - Don't try to align DNA with RNA objects.
  • Ignore Quality Scores - If you have FASTQ data, store Phred scores in positional_metadata.
  • Manually Calculate Distance - Don't use raw NumPy for biological distances; use skbio.diversity.beta_diversity to access UniFrac or Bray-Curtis.
  • Forget ID Matching - When performing ordination or PERMANOVA, ensure IDs in your distance matrix and metadata mapping match exactly.

Anti-Patterns (NEVER)

python
from skbio import DNA

# ❌ BAD: Manual reverse complement with string logic
# rc_str = seq_str[::-1].replace('A', 't').replace('T', 'a')...  # Fragile

# ✅ GOOD: Built-in validated method
rc_seq = DNA(seq_str).reverse_complement()

# ❌ BAD: Using raw lists for community analysis
# data = [[1, 0, 5], [2, 1, 0]]
# dist = manual_bray_curtis(data)

# ✅ GOOD: Using DistanceMatrix
from skbio.stats.distance import DistanceMatrix
dm = DistanceMatrix(matrix_data, ids=['S1', 'S2', 'S3'])

# ❌ BAD: Stripping metadata to run scikit-learn
# vals = seq.values  # Metadata lost!

# ✅ GOOD: Process within skbio or use standard IO
seq.write('output.fasta', format='fasta')

Sequence Analysis (skbio.sequence)


Advanced Sequence Operations

python
from skbio import DNA, Protein

# Sequence with quality scores
seq = DNA("ACGT", positional_metadata={'quality': [30, 35, 40, 20]})

# Slicing preserves metadata (positional_metadata is a pandas DataFrame)
sub = seq[1:3]
print(sub.positional_metadata['quality'].tolist())  # [35, 40]

# K-mer frequencies
freqs = seq.kmer_frequencies(k=2)

# Translation (DNA -> Protein)
protein = DNA("ATGCGA").translate()

Diversity Analysis (skbio.diversity)


Alpha Diversity (Within a sample)


python
from skbio.diversity import alpha

counts = [10, 0, 5, 2, 20] # Species abundances
otus = ['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5']

shannon = alpha.shannon(counts)
simpson = alpha.simpson(counts)
observed_otus = alpha.observed_otus(counts)

print(f"Shannon Index: {shannon:.3f}")

Beta Diversity (Between samples)

python
from skbio.diversity import beta_diversity
import numpy as np

# Abundance table (samples x taxa)
data = np.array([[10, 20, 0], [5, 15, 2], [0, 1, 30]])
ids = ['Sample1', 'Sample2', 'Sample3']

# Bray-Curtis distance (beta_diversity replaces pw_distances from early releases)
bc_dm = beta_diversity('braycurtis', data, ids=ids)

# Weighted UniFrac (Requires a tree and one taxon ID per column of data)
# unifrac_dm = beta_diversity('weighted_unifrac', data, ids=ids, otu_ids=otu_ids, tree=tree)
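To see what the Bray-Curtis numbers mean, the formula can be worked by hand for two of the samples above: the sum of absolute differences divided by the sum of all counts. This plain-Python helper (`bray_curtis`, defined here for illustration) is for understanding and spot-checking only; the library call remains the recommended route.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator

sample1 = [10, 20, 0]
sample2 = [5, 15, 2]
sample3 = [0, 1, 30]

# Sample1 vs Sample2: (5 + 5 + 2) / (30 + 22) = 12 / 52
d12 = bray_curtis(sample1, sample2)
# Sample1 vs Sample3: (10 + 19 + 30) / (30 + 31) = 59 / 61
d13 = bray_curtis(sample1, sample3)
print(round(d12, 4), round(d13, 4))
```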

Phylogenetics (skbio.tree)


Handling Trees

python
from skbio import TreeNode
from io import StringIO

# Load Newick tree
tree_str = "((A:0.1, B:0.2)C:0.3, D:0.4)E;"
tree = TreeNode.read(StringIO(tree_str))

# Traverse
for node in tree.tips():
    print(f"Leaf: {node.name}, dist: {node.length}")

# Find Lowest Common Ancestor
lca = tree.lca(['A', 'B'])
print(f"LCA of A and B is {lca.name}")

# Rooting (root_at returns a new tree, so keep the result)
rerooted = tree.root_at('C')

Statistics and Ordination


PCoA (Principal Coordinates Analysis)

python
from skbio.stats.ordination import pcoa
from skbio.stats.distance import DistanceMatrix

# Create DM
dm = DistanceMatrix([[0, 0.5, 0.8], [0.5, 0, 0.2], [0.8, 0.2, 0]], ids=['A', 'B', 'C'])

# Perform PCoA
results = pcoa(dm)

# View proportion of variance explained
print(results.proportion_explained)

# Access coordinates for plotting
coords = results.samples
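For readers who want to see what PCoA does under the hood, here is a NumPy sketch of classical PCoA: double-center the squared distances, then take the eigendecomposition. It is a teaching aid under the assumption of a Euclidean-embeddable matrix, not a description of how scikit-bio implements `pcoa`; `classical_pcoa` and the example distances are made up for illustration.

```python
import numpy as np

def classical_pcoa(dist):
    """Classical PCoA via double-centering and eigendecomposition."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering      # Gower-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]                   # largest axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)              # drop numerical negatives
    coords = eigvecs * np.sqrt(positive)                # scale axes by sqrt(eigenvalue)
    proportion = positive / positive.sum()
    return coords, proportion

# Three points on a line at positions 0, 3 and 7.
d = np.array([[0.0, 3.0, 7.0],
              [3.0, 0.0, 4.0],
              [7.0, 4.0, 0.0]])
coords, proportion = classical_pcoa(d)

# The coordinates reproduce the original distances.
rebuilt = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(rebuilt, d), proportion.round(3))
```

Because the three points are collinear, the first axis carries essentially all of the variance.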

Community Testing (PERMANOVA)

python
import pandas as pd
from skbio.stats.distance import permanova

# metadata linking IDs to groups
metadata = pd.DataFrame({'BodySite': ['Gut', 'Gut', 'Skin', 'Skin']},
                        index=['S1', 'S2', 'S3', 'S4'])

# Assuming dm is a DistanceMatrix of S1..S4
# results = permanova(dm, metadata, column='BodySite', permutations=999)
# print(results['p-value'])

Practical Workflows


1. Microbiome Distance Visualization


python
def plot_microbiome_pcoa(abundance_df, metadata_df, group_col):
    """Full workflow: Abundance -> Distance -> PCoA -> Table."""
    from skbio.diversity import beta_diversity
    from skbio.stats.ordination import pcoa

    # 1. Calculate Bray-Curtis distance (rows of abundance_df are samples)
    dm = beta_diversity('braycurtis', abundance_df.values, ids=list(abundance_df.index))

    # 2. PCoA
    pc = pcoa(dm)

    # 3. Merge with metadata for plotting (group_col is the column to color by)
    plot_data = pc.samples[['PC1', 'PC2']].join(metadata_df)

    return plot_data # Ready for Seaborn/Plotly

2. Sequence QC and Filtering


python
import numpy as np

def filter_low_quality_sequences(sequences, min_avg_qual=30):
    """Filters DNA sequences based on Phred scores."""
    valid_seqs = []
    for s in sequences:
        # Assuming quality is in positional_metadata
        avg_q = np.mean(s.positional_metadata['quality'])
        if avg_q >= min_avg_qual:
            valid_seqs.append(s)
    return valid_seqs

3. Phylogenetic Distance Calculation


python
def get_tip_distances(tree):
    """Returns a distance matrix of all tips in a tree."""
    dm = tree.tip_tip_distances()
    return dm

Performance Optimization


Vectorized Diversity


When calculating diversity for many samples, pass the entire 2D array to skbio.diversity.beta_diversity instead of looping through pairs.

Tree Traversal


Use tree.preorder() or tree.postorder() instead of manual recursion for much better performance on large (10,000+ tip) trees.

Common Pitfalls and Solutions


The "Missing ID" in Distance Matrix

PERMANOVA raises an error when the DistanceMatrix contains IDs that are missing from your metadata index, and joining PCoA coordinates to mismatched metadata produces missing values.

python
# ✅ Solution: Ensure alignment (keep the DistanceMatrix order)
common_ids = [i for i in dm.ids if i in metadata.index]
dm_sub = dm.filter(common_ids)
metadata_sub = metadata.loc[common_ids]

UniFrac Memory Usage


UniFrac calculation can be memory-intensive. For massive datasets, consider using optimized versions of skbio.diversity.beta.unweighted_unifrac or external tools like Unifrac-Binaries.

IUPAC Ambiguity


Standard DNA objects don't allow arbitrary characters.

python
# ❌ Problem: DNA("ACGX") raises ValueError because 'X' is not in the DNA alphabet
# ✅ Solution: scikit-bio DNA supports IUPAC (N, R, Y, etc.) by default, so DNA("ACGN") is valid.
# But if you have non-standard characters:
from skbio import Sequence
custom = Sequence("ACG-X")  # Generic sequence permits anything

scikit-bio is the mathematical heart of modern microbiome and evolutionary research. By enforcing strict biological typing and providing validated ecological metrics, it ensures that your biological insights are grounded in statistical rigor.