scikit-bio

A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.


When to Apply

Use this skill when a user needs help with any of the following:

| Task category | Examples |
|---------------|----------|
| Sequence work | DNA/RNA/protein manipulation, motif finding, translation |
| File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O |
| Alignments | Pairwise or multiple sequence alignment |
| Phylogenetics | Tree construction, manipulation, distance calculations |
| Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) |
| Ordination | PCoA, CCA, RDA for dimensionality reduction |
| Statistical tests | PERMANOVA, ANOSIM, Mantel tests |
| Microbiome analysis | Feature tables, rarefaction, community comparisons |

Installation

```bash
uv pip install scikit-bio
```

Sequences

Work with biological sequences through the specialized `DNA`, `RNA`, and `Protein` classes.

```python
import skbio

# Load from file
seq = skbio.DNA.read('gene.fasta')

# Common operations
complement = seq.reverse_complement()  # reverse complement of the strand
messenger = seq.transcribe()           # DNA -> RNA
peptide = messenger.translate()        # RNA -> Protein

# Pattern search (yields one slice per match)
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')

# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()
```

**Metadata types:**
- Sequence-level: ID, description, source organism
- Positional: Per-base quality scores (from FASTQ)
- Interval: Feature annotations, gene boundaries
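To make the operations above concrete, here is a minimal standard-library sketch of what reverse complementing, transcription, and regex motif search compute. It does not call scikit-bio; the helper names (`reverse_complement`, `transcribe`, `find_motif`) and the example sequence are illustrative assumptions, and the real classes additionally validate alphabets and carry metadata.

```python
import re

# Complement table for unambiguous DNA bases only (an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Rewrite the coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


def find_motif(dna: str, pattern: str) -> list:
    """Return (start, stop) positions of non-overlapping regex matches."""
    return [match.span() for match in re.finditer(pattern, dna)]


# Hypothetical example: start codon, six arbitrary bases, then a TAA stop.
example = "CCATGAAACCCTAAGG"
motif_hits = find_motif(example, "ATG[ACGT]{6}TAA")
```

The half-open `(start, stop)` pairs mirror Python slicing, so `example[start:stop]` recovers the matched region.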

Sequence Alignment

Pairwise alignment using dynamic programming, plus loading and summarizing existing multiple sequence alignments.

```python
import skbio
from skbio.alignment import local_pairwise_align_ssw, TabularMSA

# Local alignment (Smith-Waterman); query_seq and target_seq are sequence objects.
# Returns the aligned pair, the alignment score, and the start/end positions.
aligned, score, positions = local_pairwise_align_ssw(query_seq, target_seq)

# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)

# Derive consensus
consensus_seq = alignment.consensus()
```

**Notes:**
- `local_pairwise_align_ssw` provides fast SSW-based local alignment
- `StripedSmithWaterman` handles protein sequences with substitution matrices
- Affine gap penalties (separate costs for opening and extending a gap) usually suit biological sequences better than a flat per-gap cost
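The dynamic-programming idea behind local alignment can be sketched in a few lines. This is a plain Smith-Waterman score computation with a linear gap penalty, chosen for brevity; it is not the SSW implementation scikit-bio wraps, it does not use affine gaps, and the scoring values are arbitrary assumptions.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -3, gap: int = -5) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the score
    matrix, so memory grows with len(b) rather than len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical reads sharing the core "ACGT".
shared_core_score = smith_waterman_score("TTACGTAA", "GGACGTCC")
```

Flooring each cell at zero is what makes the alignment local: a poorly matching prefix cannot penalize a strong match later in the sequences.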

Phylogenetic Trees

Construct and analyze evolutionary trees.

```python
from skbio import TreeNode
from skbio.tree import nj, upgma

# Build from distances
nj_tree = nj(distance_matrix)

# Load existing tree
phylogeny = TreeNode.read('species.nwk')

# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])

# Enumerate leaf nodes
leaves = list(phylogeny.tips())

# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])

# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))

# Pairwise distances for all tips (returns a DistanceMatrix)
pairwise_dm = phylogeny.tip_tip_distances()

# Topology comparison (Robinson-Foulds distance)
rf_diff = phylogeny.compare_rfd(other_tree)
```

**Tree construction methods:**

| Method | Use case |
|--------|----------|
| `nj()` | Standard neighbor-joining |
| `upgma()` | Assumes molecular clock |
| `bme()` | Scalable for large datasets |
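The common-ancestor and tip-to-tip distance queries above reduce to walking from each tip toward the root. The sketch below shows that logic on a hypothetical four-node tree stored as a parent map; the taxa, branch lengths, and helper names are illustrative assumptions, and it does not use `TreeNode`.

```python
# child -> (parent, branch length to parent); the root has no entry.
TREE = {
    "mouse": ("rodents", 0.1),
    "rat": ("rodents", 0.2),
    "rodents": ("root", 0.5),
    "human": ("root", 0.7),
}


def path_to_root(node: str) -> dict:
    """Map each ancestor of node (including itself) to its distance from node."""
    distance = 0.0
    path = {node: 0.0}
    while node in TREE:
        parent, length = TREE[node]
        distance += length
        path[parent] = distance
        node = parent
    return path


def lowest_common_ancestor(a: str, b: str) -> str:
    """Return the first node on b's path to the root that is also above a."""
    ancestors_of_a = path_to_root(a)
    for node in path_to_root(b):  # dicts keep insertion order: b, parent, ...
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def tip_distance(a: str, b: str) -> float:
    """Sum the branch lengths from each tip up to their common ancestor."""
    lca = lowest_common_ancestor(a, b)
    return path_to_root(a)[lca] + path_to_root(b)[lca]
```

For example, `tip_distance("mouse", "human")` adds the mouse-to-root path (0.1 + 0.5) to the human-to-root branch (0.7).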

Diversity Analysis

Calculate ecological diversity metrics.

Alpha Diversity (within-sample)

```python
import numpy as np
from skbio.diversity import alpha_diversity

# Sample abundance matrix (rows are samples, columns are features)
abundances = np.array([
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']

# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)

# Phylogenetic diversity (requires tree); feature_names must hold one ID
# per column of abundances, each matching a tip name in the tree
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
                             tree=phylogeny, otu_ids=feature_names)
```
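As a reference for what these indices measure, the following from-scratch sketch computes Shannon entropy and the Simpson index (one minus the sum of squared proportions) for a single sample. It is not scikit-bio's code, and the logarithm base is exposed as an explicit parameter here because the library's default is not stated in this guide.

```python
import math


def shannon(counts, base: float = 2.0) -> float:
    """Shannon entropy of a count vector; zero counts contribute nothing."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def simpson(counts) -> float:
    """Simpson index: probability that two random draws differ in feature."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# The first sample from the abundance matrix above.
gut_1 = [45, 12, 0, 8]
gut_1_shannon = shannon(gut_1)
gut_1_simpson = simpson(gut_1)
```

A perfectly even sample of four features gives the maximum Shannon value for that richness (2 bits in base 2), while a sample dominated by one feature approaches zero on both indices.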

Beta Diversity (between-sample)

```python
from skbio.diversity import beta_diversity

# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
                            tree=phylogeny, otu_ids=feature_names)
```

**Key points:**
- Input must be integer counts, not proportions
- Phylogenetic metrics require a tree matching feature IDs
- `partial_beta_diversity()` computes specific sample pairs efficiently
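For orientation, Bray-Curtis dissimilarity between two samples is the sum of absolute count differences divided by the sum of all counts in both samples. The sketch below implements that definition directly and builds a full pairwise matrix as nested lists; it is an illustration only and does not return a `DistanceMatrix`.

```python
def bray_curtis(u, v) -> float:
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def pairwise(rows, metric):
    """Apply metric to every pair of rows, returning a square nested list."""
    return [[metric(r1, r2) for r2 in rows] for r1 in rows]


# Same counts as the abundance matrix used above.
abundance_rows = [
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10],
]
bray_matrix = pairwise(abundance_rows, bray_curtis)
```

The result is 0 for identical samples and 1 for samples that share no features, which is why the diagonal of `bray_matrix` is all zeros.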

Ordination

Project high-dimensional data into a low-dimensional space that can be visualized.

```python
from skbio.stats.ordination import pcoa, cca

# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained

# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)
```

**Methods:**

| Function | Input | Purpose |
|----------|-------|---------|
| `pcoa()` | Distance matrix | Unconstrained ordination |
| `cca()` | Abundance + environment | Constrained ordination (unimodal) |
| `rda()` | Abundance + environment | Constrained ordination (linear) |
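Classical PCoA starts by converting distances into a centered inner-product matrix (Gower double-centering) and then takes its eigenvectors as coordinates. The sketch below covers only that first centering step in plain Python; the eigendecomposition is omitted, the function name is an assumption, and nothing here reproduces scikit-bio's implementation.

```python
def double_center(dm):
    """Return B = -1/2 * J * D^2 * J for a square distance matrix dm.

    J is the centering matrix, so every row and column of B sums to zero.
    """
    n = len(dm)
    a = [[-0.5 * d * d for d in row] for row in dm]
    row_means = [sum(row) / n for row in a]
    col_means = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [a[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical three-sample distance matrix.
distances = [
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]
centered = double_center(distances)
```

The eigenvalues of the centered matrix are what the proportion of variance explained by each axis is derived from.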

Statistical Tests

Hypothesis testing for ecological data.

```python
from skbio.stats.distance import permanova, anosim, mantel

# Group comparison: one label per sample, in the same order as the IDs
# in the distance matrix (bray_dm is assumed to hold four samples here)
treatment_groups = ['control', 'control', 'treated', 'treated']
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")

# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)

# Matrix correlation (n is the number of objects compared)
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")
```

**Test overview:**

| Test | Purpose | Key output |
|------|---------|------------|
| PERMANOVA | Group differences | Pseudo-F statistic, p-value |
| ANOSIM | Group differences (alternative) | R-statistic, p-value |
| PERMDISP | Dispersion homogeneity (checks a PERMANOVA assumption) | F-statistic, p-value |
| Mantel | Matrix correlation | Correlation coefficient, p-value |
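All of these tests obtain their p-values by permutation. As an illustration of the mechanism, here is a small Mantel-style sketch: it correlates the upper triangles of two distance matrices, then shuffles the sample order of one matrix to build a null distribution. It uses Pearson correlation and a two-sided `(hits + 1) / (permutations + 1)` p-value, which is a common convention assumed here rather than a description of scikit-bio's internals; the matrices are made-up values.

```python
import math
import random


def upper_triangle(matrix, order):
    """Flatten the upper triangle after reordering rows/columns by order."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mantel_sketch(x, y, permutations: int = 999, seed: int = 0):
    """Return (observed correlation, permutation p-value)."""
    rng = random.Random(seed)
    ids = list(range(len(x)))
    y_flat = upper_triangle(y, ids)
    observed = pearson(upper_triangle(x, ids), y_flat)
    hits = 0
    for _ in range(permutations):
        shuffled = ids[:]
        rng.shuffle(shuffled)  # relabel samples in x only
        permuted = pearson(upper_triangle(x, shuffled), y_flat)
        if abs(permuted) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical genetic and geographic distances for four populations.
genetic = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
geographic = [[0, 2, 7, 9], [2, 0, 6, 8], [7, 6, 0, 3], [9, 8, 3, 0]]
r_obs, p_val = mantel_sketch(genetic, geographic, permutations=99)
```

Shuffling whole samples, rather than individual distances, preserves the dependence structure inside each matrix, which is the reason an ordinary correlation test is not appropriate for this kind of data.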

File I/O

Read and write 19+ biological formats.

```python
import skbio

# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')

# Memory-efficient iteration (FASTQ needs an explicit quality encoding)
for record in skbio.io.read('reads.fastq', format='fastq',
                            variant='illumina1.8', constructor=skbio.DNA):
    if record.positional_metadata['quality'].mean() > 30:
        process(record)  # process() is a placeholder for your own handler

# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq',
                        variant='illumina1.8', constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')
```

**Supported formats:**

| Category | Formats |
|----------|---------|
| Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq |
| Alignments | Clustal, PHYLIP, Stockholm |
| Trees | Newick |
| Tables | BIOM (HDF5/JSON) |
| Distances | Delimited matrices |
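The memory-efficient pattern above relies on generators that hold one record at a time. The sketch below shows the same idea with a hand-written FASTQ reader, assuming strict four-line records and Phred+33 quality encoding; the function names and the in-memory sample data are hypothetical, and real files are better parsed with the library reader.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, qualities) one record at a time.

    Assumes four lines per record and Phred+33 encoded quality strings.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


def high_quality(records, min_mean: float = 30.0):
    """Lazily keep records whose mean quality exceeds min_mean."""
    for name, sequence, qualities in records:
        if sum(qualities) / len(qualities) > min_mean:
            yield name, sequence


# Two made-up reads: 'I' encodes quality 40, '!' encodes quality 0.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = list(high_quality(read_fastq(io.StringIO(fastq_text))))
```

Because both functions are generators, chaining them filters a file of any size without ever building the full list of reads; only the final `list(...)` call materializes results.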

Distance Matrices

Store and manipulate pairwise distances.

```python
from skbio import DistanceMatrix
import numpy as np

# Create from array
distances = np.array([
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])

# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']

# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])
```
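A distance matrix is expected to be square, symmetric, and hollow (zeros on the diagonal), with one unique ID per row. The following sketch spells out those checks and an ID-based lookup in plain Python; the function names and tolerance are assumptions made for illustration, not the `DistanceMatrix` internals.

```python
def validate_distance_matrix(matrix, ids, tol: float = 1e-12) -> None:
    """Raise ValueError unless matrix is square, hollow, and symmetric."""
    n = len(ids)
    if len(set(ids)) != n:
        raise ValueError("IDs must be unique")
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the number of IDs")
    for i in range(n):
        if abs(matrix[i][i]) > tol:
            raise ValueError("diagonal must be zero (hollow matrix)")
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                raise ValueError("matrix must be symmetric")


def lookup(matrix, ids, a, b) -> float:
    """Return the distance between the samples labeled a and b."""
    return matrix[ids.index(a)][ids.index(b)]


species = ["sp_A", "sp_B", "sp_C"]
raw = [
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]
validate_distance_matrix(raw, species)
a_to_b = lookup(raw, species, "sp_A", "sp_B")
```

Checking these properties up front catches common input mistakes, such as passing a rectangular feature table where a distance matrix is expected.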

Feature Tables (BIOM)

Handle OTU/ASV abundance tables.

```python
from skbio import Table

# Load table
tbl = Table.read('features.biom')

# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')

# Filter by abundance: keep samples with more than 500 total counts.
# inplace=False returns a new table and leaves tbl unchanged.
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500,
                      axis='sample', inplace=False)

# Convert to pandas
df = tbl.to_dataframe()
```

Protein Embeddings

Bridge language model outputs with scikit-bio analysis.

```python
from skbio.embedding import ProteinEmbedding

# Calls are shown as given in this guide; confirm names and signatures
# in references/api-reference.md for your installed version.

# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)

# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')

# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')
```

Typical Workflows

**Microbiome diversity study:**

1. Load BIOM table and phylogenetic tree
2. Calculate alpha diversity per sample
3. Compute beta diversity (UniFrac)
4. Ordinate with PCoA
5. Test group differences with PERMANOVA

**Phylogenetic inference:**

1. Read sequences from FASTA
2. Perform multiple alignment
3. Calculate pairwise distances
4. Construct tree with neighbor-joining
5. Analyze clade relationships

**Sequence processing:**

1. Read FASTQ with quality scores
2. Filter low-quality reads
3. Search for motifs
4. Translate to protein
5. Export as FASTA

Performance Tips

- Use generators for large sequence files
- Prefer BIOM HDF5 over JSON for big tables
- Apply `partial_beta_diversity()` when computing only specific pairs
- Choose BME for very large phylogenies

Ecosystem Integration

| Library | Integration |
|---------|-------------|
| pandas | DataFrames from distance matrices, diversity results |
| numpy | Array conversions throughout |
| matplotlib/seaborn | Plot ordination results, heatmaps |
| scikit-learn | Distance matrices as input |
| QIIME 2 | Native BIOM, tree, distance matrix compatibility |

Reference Files

| File | Contents |
|------|----------|
| `references/api-reference.md` | Complete method signatures, parameters, extended examples, and troubleshooting |
