scikit-bio

A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.

When to Apply

Use this skill when users need to:

| Task Category | Examples |
|---------------|----------|
| Sequence work | DNA/RNA/protein manipulation, motif finding, translation |
| File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O |
| Alignments | Pairwise or multiple sequence alignment |
| Phylogenetics | Tree construction, manipulation, distance calculations |
| Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) |
| Ordination | PCoA, CCA, RDA for dimensionality reduction |
| Statistical tests | PERMANOVA, ANOSIM, Mantel tests |
| Microbiome analysis | Feature tables, rarefaction, community comparisons |

Installation

```bash
uv pip install scikit-bio
```

Sequences

Work with biological sequences through specialized `DNA`, `RNA`, and `Protein` classes.

```python
import skbio

# Load from file
seq = skbio.DNA.read('gene.fasta')

# Common operations
rev_comp = seq.reverse_complement()
messenger = seq.transcribe()
peptide = messenger.translate()

# Pattern search (yields slices for each match)
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')

# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()
```

**Metadata types:**
- Sequence-level: ID, description, source organism
- Positional: Per-base quality scores (from FASTQ)
- Interval: Feature annotations, gene boundaries
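To make the three common operations concrete, here is a minimal pure-Python sketch of what reverse complementing, transcribing, and translating a short DNA string involve. It does not use scikit-bio and is not its implementation; the helper names and the example sequence are illustrative assumptions, and only the standard genetic code with unambiguous bases is handled.

```python
# Illustrative helpers only; scikit-bio's DNA/RNA/Protein classes do far more
# (validation, degenerate characters, alternative genetic codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Rewrite the coding strand as RNA (T becomes U)."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate complete codons; '*' marks a stop codon."""
    dna = rna.replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


gene = "ATGGCCTAA"  # hypothetical example sequence
print(reverse_complement(gene))     # TTAGGCCAT
print(transcribe(gene))             # AUGGCCUAA
print(translate(transcribe(gene)))  # MA*
```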

Sequence Alignment

Align sequence pairs with dynamic programming and work with multiple sequence alignments.

```python
import skbio
from skbio.alignment import local_pairwise_align_ssw, TabularMSA

# Local alignment (Smith-Waterman)
result = local_pairwise_align_ssw(query_seq, target_seq)

# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)

# Derive consensus
consensus_seq = alignment.consensus()
```

**Notes:**
- `local_pairwise_align_ssw` provides fast SSW-based local alignment
- `StripedSmithWaterman` handles protein sequences with substitution matrices
- Affine gap penalties suit biological sequences best
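As background for the local alignment call, the following is a minimal pure-Python sketch of the Smith-Waterman scoring recurrence. It is not the SSW implementation: it uses a simple linear gap penalty rather than affine gaps, returns only the best score (no traceback), and the function name and scoring values are illustrative assumptions.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local alignment score under a linear gap penalty."""
    previous = [0] * (len(target) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            up = previous[j] + gap        # gap in the target
            left = current[j - 1] + gap   # gap in the query
            # Flooring at zero is what makes the alignment local.
            cell = max(0, diagonal, up, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman_score("ACACACTA", "AGCACACA"))  # 12
```

Only two rows of the scoring matrix are kept in memory, which is enough when the alignment itself is not needed.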

Phylogenetic Trees

Construct and analyze evolutionary trees.

```python
from skbio import TreeNode
from skbio.tree import nj, upgma

# Build from distances
phylogeny = nj(distance_matrix)

# Load existing tree
phylogeny = TreeNode.read('species.nwk')

# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])

# Enumerate leaf nodes
leaves = list(phylogeny.tips())

# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])

# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))

# Pairwise distances for all tips
pairwise_dm = phylogeny.tip_tip_distances()

# Topology comparison (Robinson-Foulds distance)
rf_diff = phylogeny.compare_rfd(other_tree)
```

**Tree construction methods:**

| Method | Use case |
|--------|----------|
| `nj()` | Standard neighbor-joining |
| `upgma()` | Assumes molecular clock |
| `bme()` | Scalable for large datasets |
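The common-ancestor and branch-length queries can be illustrated without scikit-bio. The sketch below stores a tiny hypothetical tree as a child-to-parent map and computes the lowest common ancestor and the tip-to-tip (patristic) distance by walking toward the root. The tree, its branch lengths, and the helper names are all assumptions made for illustration.

```python
# child -> (parent, branch length); a hypothetical three-taxon tree.
PARENTS = {
    "mouse": ("rodents", 0.10),
    "rat": ("rodents", 0.12),
    "rodents": ("root", 0.30),
    "human": ("root", 0.50),
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENTS:
        path.append(PARENTS[path[-1]][0])
    return path


def lowest_common_ancestor(a, b):
    """First node on a's path to the root that is also an ancestor of b."""
    ancestors_of_b = set(path_to_root(b))
    return next(n for n in path_to_root(a) if n in ancestors_of_b)


def distance_to_ancestor(node, ancestor):
    """Sum branch lengths while climbing from `node` to `ancestor`."""
    total = 0.0
    while node != ancestor:
        parent, length = PARENTS[node]
        total += length
        node = parent
    return total


def patristic_distance(a, b):
    """Path length between two tips through their common ancestor."""
    lca = lowest_common_ancestor(a, b)
    return distance_to_ancestor(a, lca) + distance_to_ancestor(b, lca)


print(lowest_common_ancestor("mouse", "rat"))          # rodents
print(round(patristic_distance("mouse", "human"), 2))  # 0.9
```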

Diversity Analysis

Calculate ecological diversity metrics.

Alpha Diversity (within-sample)

```python
import numpy as np
from skbio.diversity import alpha_diversity

# Sample abundance matrix (rows = samples, columns = features)
abundances = np.array([
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']

# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)

# Phylogenetic diversity (requires tree)
# feature_names: one ID per column, matching tip names in the tree
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
                             tree=phylogeny, otu_ids=feature_names)
```
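For intuition about what the Shannon and Simpson metrics measure, here is a small pure-Python sketch of both formulas applied to a single sample. It is independent of scikit-bio; the function names are illustrative, and the logarithm base is exposed as a parameter because the choice of base changes the Shannon value.

```python
import math


def shannon(counts, base=2):
    """Shannon entropy of the relative abundances (zero counts are skipped)."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def simpson(counts):
    """Simpson's index: 1 minus the sum of squared relative abundances."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


even_sample = [10, 10, 10, 10]   # four equally abundant features
dominated_sample = [37, 1, 1, 1]  # one feature dominates

print(round(shannon(even_sample), 6))  # 2.0
print(simpson(even_sample))            # 0.75
print(shannon(dominated_sample) < shannon(even_sample))  # True
```

An even community maximizes both indices for a given number of features; dominance by one feature pulls them toward zero.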

Beta Diversity (between-sample)

```python
from skbio.diversity import beta_diversity

# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
                            tree=phylogeny, otu_ids=feature_names)
```

**Key points:**
- Input must be integer counts, not proportions
- Phylogenetic metrics require a tree matching feature IDs
- `partial_beta_diversity()` computes specific sample pairs efficiently
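As a worked example of a between-sample metric, the sketch below computes Bray-Curtis dissimilarity by hand for two of the count vectors used in the alpha diversity example. It is plain Python rather than a scikit-bio call, and the function name is an illustrative assumption.

```python
def bray_curtis(u, v):
    """Sum of absolute count differences divided by the total count."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


gut_1 = [45, 12, 0, 8]
gut_2 = [5, 0, 33, 17]

# |45-5| + |12-0| + |0-33| + |8-17| = 94, total counts = 120
print(round(bray_curtis(gut_1, gut_2), 4))  # 0.7833
print(bray_curtis(gut_1, gut_1))            # 0.0
```

The value ranges from 0 (identical composition) to 1 (no shared features).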

Ordination

Project high-dimensional data into low-dimensional spaces that can be visualized.

```python
from skbio.stats.ordination import pcoa, cca

# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained

# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)
```

**Methods:**

| Function | Input | Purpose |
|----------|-------|---------|
| `pcoa()` | Distance matrix | Unconstrained ordination |
| `cca()` | Abundance + environment | Constrained ordination (unimodal) |
| `rda()` | Abundance + environment | Constrained ordination (linear) |

Statistical Tests

Hypothesis testing for ecological data.

```python
from skbio.stats.distance import permanova, anosim, mantel

# Group comparison (one group label per sample in the distance matrix)
treatment_groups = ['control', 'control', 'treated', 'treated']
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")

# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)

# Matrix correlation
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")
```

**Test overview:**

| Test | Purpose | Key output |
|------|---------|------------|
| PERMANOVA | Group differences | F-statistic, p-value |
| ANOSIM | Group differences (alternative) | R-statistic, p-value |
| PERMDISP | Dispersion homogeneity | Tests PERMANOVA assumption |
| Mantel | Matrix correlation | Correlation coefficient, p-value |
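All of these tests obtain their p-values by permutation. As a rough illustration of that idea, here is a pure-Python sketch of a two-sided Mantel-style test: correlate the upper triangles of two distance matrices, then repeatedly shuffle the sample order of one matrix to see how often a correlation at least as strong arises by chance. It uses Pearson correlation (not the Spearman option shown above), and the function names, matrices, and seed are illustrative assumptions rather than scikit-bio's implementation.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel_sketch(dm_a, dm_b, permutations=999, seed=0):
    """Return (correlation, two-sided permutation p-value)."""
    rng = random.Random(seed)
    flat_b = upper_triangle(dm_b)
    observed = pearson(upper_triangle(dm_a), flat_b)
    order = list(range(len(dm_a)))
    as_extreme = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns together so the matrix stays symmetric.
        shuffled = [[dm_a[i][j] for j in order] for i in order]
        if abs(pearson(upper_triangle(shuffled), flat_b)) >= abs(observed):
            as_extreme += 1
    return observed, (as_extreme + 1) / (permutations + 1)


# Hypothetical distance matrices for four samples.
genetic = [
    [0.0, 0.2, 0.6, 0.9],
    [0.2, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
geographic = [[2 * d for d in row] for row in genetic]

r, p = mantel_sketch(genetic, geographic)
print(f"r = {r:.3f}, p = {p:.4f}")
```

With only four samples there are just 24 distinct orderings, so the p-value cannot become very small; real analyses need more samples for the permutation test to have power.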

File I/O

Read and write 19+ biological formats.

```python
import skbio

# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')

# Memory-efficient iteration
# FASTQ needs a quality encoding; 'illumina1.8' is assumed here
for record in skbio.io.read('reads.fastq', format='fastq',
                            variant='illumina1.8', constructor=skbio.DNA):
    if record.positional_metadata['quality'].mean() > 30:
        process(record)

# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq',
                        variant='illumina1.8', constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')
```

**Supported formats:**

| Category | Formats |
|----------|---------|
| Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq |
| Alignments | Clustal, PHYLIP, Stockholm |
| Trees | Newick |
| Tables | BIOM (HDF5/JSON) |
| Distances | Delimited matrices |
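The quality filter and FASTQ-to-FASTA conversion can be sketched in plain Python to show what the per-base quality values are. The example below assumes well-formed four-line records and Phred+33 encoding (the Illumina 1.8+ convention); the reader function, the sample records, and the threshold are illustrative assumptions, not scikit-bio's parser.

```python
import io


def read_fastq(handle, phred_offset=33):
    """Yield (identifier, sequence, quality scores) from 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality_line = handle.readline().strip()
        # Each quality character encodes one Phred score.
        scores = [ord(ch) - phred_offset for ch in quality_line]
        yield header[1:], sequence, scores


def mean_quality(scores):
    return sum(scores) / len(scores)


# Two hypothetical reads: 'I' decodes to Q40, '!' decodes to Q0.
FASTQ_TEXT = "@read_1\nACGT\n+\nIIII\n@read_2\nACGT\n+\n!!!!\n"

kept = [
    (name, seq)
    for name, seq, scores in read_fastq(io.StringIO(FASTQ_TEXT))
    if mean_quality(scores) > 30
]
fasta = "".join(f">{name}\n{seq}\n" for name, seq in kept)
print(fasta, end="")
```

Because `read_fastq` is a generator, only one record is held in memory at a time, which mirrors the streaming pattern recommended for large files.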

Distance Matrices

Store and manipulate pairwise distances.

```python
from skbio import DistanceMatrix
import numpy as np

# Create from array
distances = np.array([
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])

# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']

# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])
```
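A distance matrix is expected to be square, symmetric, and hollow (zeros on the diagonal). The following pure-Python sketch spells out those three checks on a nested list; the function name and tolerance are illustrative assumptions and this is not scikit-bio's own validation code.

```python
def check_distance_matrix(matrix, tolerance=1e-12):
    """Raise ValueError unless the matrix is square, hollow, and symmetric."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    for i in range(n):
        if abs(matrix[i][i]) > tolerance:
            raise ValueError("diagonal must be zero (hollow)")
        for j in range(i + 1, n):
            if abs(matrix[i][j] - matrix[j][i]) > tolerance:
                raise ValueError("matrix must be symmetric")
    return True


distances = [
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]
print(check_distance_matrix(distances))  # True
```

Running a check like this before constructing the matrix object can make data-entry mistakes, such as a transposed or half-filled table, easier to spot.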

Feature Tables (BIOM)

Handle OTU/ASV abundance tables.

```python
from skbio import Table

# Load table
tbl = Table.read('features.biom')

# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')

# Filter by abundance: keep samples with more than 500 total counts
# (inplace=False returns a new table instead of modifying tbl)
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500,
                      axis='sample', inplace=False)

# Convert to pandas
df = tbl.to_dataframe()
```

Protein Embeddings

Bridge language model outputs with scikit-bio analysis.

```python
from skbio.embedding import ProteinEmbedding

# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)

# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')

# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')
```

Typical Workflows

Microbiome diversity study:
  1. Load BIOM table and phylogenetic tree
  2. Calculate alpha diversity per sample
  3. Compute beta diversity (UniFrac)
  4. Ordinate with PCoA
  5. Test group differences with PERMANOVA
Phylogenetic inference:
  1. Read sequences from FASTA
  2. Perform multiple alignment
  3. Calculate pairwise distances
  4. Construct tree with neighbor-joining
  5. Analyze clade relationships
Sequence processing:
  1. Read FASTQ with quality scores
  2. Filter low-quality reads
  3. Search for motifs
  4. Translate to protein
  5. Export as FASTA

Performance Tips

- Use generators for large sequence files
- Prefer BIOM HDF5 over JSON for big tables
- Apply `partial_beta_diversity()` when computing only specific pairs
- Choose BME for very large phylogenies

Ecosystem Integration

| Library | Integration |
|---------|-------------|
| pandas | DataFrames from distance matrices, diversity results |
| numpy | Array conversions throughout |
| matplotlib/seaborn | Plot ordination results, heatmaps |
| scikit-learn | Distance matrices as input |
| QIIME 2 | Native BIOM, tree, distance matrix compatibility |

Reference Files

| File | Contents |
|------|----------|
| `references/api-reference.md` | Complete method signatures, parameters, extended examples, and troubleshooting |