scikit-bio
A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics.
When to Apply
Use this skill when a user's task falls into one of these categories:
| Task Category | Examples |
|---|---|
| Sequence work | DNA/RNA/protein manipulation, motif finding, translation |
| File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O |
| Alignments | Pairwise or multiple sequence alignment |
| Phylogenetics | Tree construction, manipulation, distance calculations |
| Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) |
| Ordination | PCoA, CCA, RDA for dimensionality reduction |
| Statistical tests | PERMANOVA, ANOSIM, Mantel tests |
| Microbiome analysis | Feature tables, rarefaction, community comparisons |
Installation
```bash
uv pip install scikit-bio
```
Sequences
Work with biological sequences through the specialized `DNA`, `RNA`, and `Protein` classes.
```python
import skbio

# Load from file
seq = skbio.DNA.read('gene.fasta')

# Common operations
complement = seq.reverse_complement()
messenger = seq.transcribe()
peptide = messenger.translate()

# Pattern search
hits = seq.find_with_regex('ATG[ACGT]{6}TAA')

# Properties
contains_ambiguous = seq.has_degenerates()
clean_seq = seq.degap()
```
**Metadata types:**
- Sequence-level: ID, description, source organism
- Positional: Per-base quality scores (from FASTQ)
- Interval: Feature annotations, gene boundaries
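To make the sequence operations concrete, here is a minimal pure-Python sketch of what reverse complement, transcription, and a regex motif search compute on an unambiguous DNA string. It uses only the standard library. The helper names (`reverse_complement`, `transcribe`, `find_motifs`) and the example sequence are illustrative assumptions, not scikit-bio's implementation.

```python
import re

# Watson-Crick pairs for unambiguous DNA (degenerate codes are out of scope here).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Rewrite the coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


def find_motifs(dna: str, pattern: str) -> list:
    """Return (start, stop) index pairs for every non-overlapping regex match."""
    return [match.span() for match in re.finditer(pattern, dna)]


# Hypothetical sequence: a start codon, six bases, and a stop codon, flanked by CC/GG.
example = "CCATGAAACCCTAAGG"
print(reverse_complement(example))
print(transcribe(example))
print(find_motifs(example, "ATG[ACGT]{6}TAA"))
```

The sketch works on plain strings, so it skips the alphabet validation and metadata handling that the `DNA` class provides.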
Sequence Alignment
Pairwise and multiple alignment using dynamic programming.
```python
import skbio
from skbio.alignment import local_pairwise_align_ssw, TabularMSA

# Local alignment (Smith-Waterman)
result = local_pairwise_align_ssw(query_seq, target_seq)

# Load existing alignment
alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA)

# Derive consensus
consensus_seq = alignment.consensus()
```
**Notes:**
- `local_pairwise_align_ssw` provides fast SSW-based local alignment
- `StripedSmithWaterman` handles protein sequences with substitution matrices
- Affine gap penalties (separate gap-open and gap-extend costs) usually model biological insertions and deletions better than a single linear penalty
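As a rough illustration of the dynamic programming behind local alignment, the sketch below computes a Smith-Waterman score in pure Python. It is deliberately simplified: it uses a linear gap penalty rather than affine gaps, returns only the best score, and the scoring values (`match=2`, `mismatch=-3`, `gap=-5`) are assumptions chosen for the example. It is not the striped SSW algorithm that scikit-bio wraps.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -3, gap: int = -5) -> int:
    """Best local alignment score of a and b under a linear gap penalty."""
    previous = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]               # column 0 is always 0 for local alignment
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution, # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Two hypothetical reads that share the core "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Keeping only two rows of the table is enough when just the score is needed; recovering the aligned positions would require storing the full table or traceback pointers.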
Phylogenetic Trees
Construct and analyze evolutionary trees.
```python
from skbio import TreeNode
from skbio.tree import nj, upgma

# Build from distances
phylogeny = nj(distance_matrix)

# Load existing tree
phylogeny = TreeNode.read('species.nwk')

# Extract subset
clade = phylogeny.shear(['mouse', 'rat', 'human'])

# Enumerate leaf nodes
leaves = list(phylogeny.tips())

# Common ancestor
ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat'])

# Branch length between taxa
branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat'))

# Pairwise distances for all tips
pairwise_dm = phylogeny.tip_tip_distances()

# Topology comparison (Robinson-Foulds distance)
rf_diff = phylogeny.compare_rfd(other_tree)
```
**Tree construction methods:**
| Method | Use case |
|--------|----------|
| `nj()` | Standard neighbor-joining |
| `upgma()` | Assumes a molecular clock (constant evolutionary rate) |
| `bme()` | Balanced minimum evolution; scales to large datasets |
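The topology comparison can be pictured as counting the clades that two trees do not share. The sketch below is a pure-Python illustration of that idea on rooted trees written as nested tuples. The tuple representation and the function names are assumptions made for this example, and the values may differ from scikit-bio's Robinson-Foulds result, which can treat trees as unrooted.

```python
def collect_clades(node, found: set) -> frozenset:
    """Return the tip names under `node`, recording each internal node's tip set."""
    if isinstance(node, str):            # a tip is represented by its name
        return frozenset([node])
    tips = frozenset()
    for child in node:
        tips |= collect_clades(child, found)
    found.add(tips)                      # one clade per internal node
    return tips


def rf_distance(tree_a, tree_b) -> int:
    """Count clades present in exactly one of the two rooted trees."""
    clades_a, clades_b = set(), set()
    collect_clades(tree_a, clades_a)
    collect_clades(tree_b, clades_b)
    return len(clades_a ^ clades_b)      # symmetric difference


# Hypothetical four-taxon trees with different groupings.
reference = (("mouse", "rat"), ("human", "chimp"))
alternative = (("mouse", "human"), ("rat", "chimp"))
print(rf_distance(reference, reference))
print(rf_distance(reference, alternative))
```

Because clades are stored as sets of tip names, the order in which children are listed does not affect the distance.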
Diversity Analysis
Calculate ecological diversity metrics.
Alpha Diversity (within-sample)
```python
import numpy as np
from skbio.diversity import alpha_diversity

# Sample abundance matrix (rows are samples, columns are features)
abundances = np.array([
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10]
])
samples = ['gut_1', 'gut_2', 'gut_3']

# Richness and evenness metrics
shannon_vals = alpha_diversity('shannon', abundances, ids=samples)
simpson_vals = alpha_diversity('simpson', abundances, ids=samples)

# Phylogenetic diversity (requires tree)
faith_vals = alpha_diversity('faith_pd', abundances, ids=samples,
                             tree=phylogeny, otu_ids=feature_names)
```
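For intuition about what the Shannon and Simpson metrics measure, the following pure-Python sketch computes both from a single vector of counts. The function names are assumptions for this example. The logarithm base is left as a parameter because conventions differ, so it is worth checking which base your scikit-bio version uses before comparing numbers. Simpson is computed here as one minus the sum of squared proportions.

```python
import math


def shannon_index(counts, base: float = 2.0) -> float:
    """Shannon entropy of the proportions; zero counts contribute nothing."""
    total = sum(counts)
    proportions = [count / total for count in counts if count > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def simpson_index(counts) -> float:
    """One minus the probability that two random draws hit the same feature."""
    total = sum(counts)
    return 1.0 - sum((count / total) ** 2 for count in counts)


# Same three samples as the abundance matrix in this section.
abundance_rows = {
    "gut_1": [45, 12, 0, 8],
    "gut_2": [5, 0, 33, 17],
    "gut_3": [20, 20, 15, 10],
}
for sample_id, counts in abundance_rows.items():
    print(sample_id, round(shannon_index(counts), 3), round(simpson_index(counts), 3))
```

Both indices rise as counts are spread more evenly across more features, which is why the most even sample scores highest on both.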
Beta Diversity (between-sample)
```python
from skbio.diversity import beta_diversity

# Distance matrices
bray_dm = beta_diversity('braycurtis', abundances, ids=samples)
unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples,
                            tree=phylogeny, otu_ids=feature_names)
```
**Key points:**
- Input must be integer counts, not proportions
- Phylogenetic metrics require a tree matching feature IDs
- `partial_beta_diversity()` computes specific sample pairs efficiently
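To show what a Bray-Curtis distance matrix contains, here is a small pure-Python sketch that computes the dissimilarity for each pair of samples. The function names are assumptions for this example, and the code is a plain restatement of the formula (sum of absolute differences divided by the sum of all counts), not scikit-bio's implementation.

```python
def bray_curtis(u, v) -> float:
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def pairwise_matrix(rows, metric):
    """Square matrix of metric(row_i, row_j) for every pair of rows."""
    return [[metric(row_i, row_j) for row_j in rows] for row_i in rows]


# Same counts as the abundance matrix used for alpha diversity.
counts = [
    [45, 12, 0, 8],
    [5, 0, 33, 17],
    [20, 20, 15, 10],
]
for row in pairwise_matrix(counts, bray_curtis):
    print([round(value, 3) for value in row])
```

The result is symmetric with zeros on the diagonal, which is the shape a distance matrix needs before ordination or permutation tests.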
Ordination
Project high-dimensional data into a low-dimensional space for visualization.
```python
from skbio.stats.ordination import pcoa, cca

# PCoA from distance matrix
coords = pcoa(bray_dm)
axis1 = coords.samples['PC1']
axis2 = coords.samples['PC2']
variance_explained = coords.proportion_explained

# CCA with environmental predictors
constrained = cca(species_abundances, environmental_vars)
```
**Methods:**
| Function | Input | Purpose |
|----------|-------|---------|
| `pcoa()` | Distance matrix | Unconstrained ordination |
| `cca()` | Abundance + environment | Constrained ordination (unimodal) |
| `rda()` | Abundance + environment | Constrained ordination (linear) |
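Classical PCoA starts by turning the distance matrix into a centered matrix whose eigenvectors give the sample coordinates. The sketch below shows only that first step (Gower double-centering) in pure Python; the eigendecomposition that follows is omitted because it needs a linear algebra library. The function name and the small example matrix are assumptions for illustration.

```python
def gower_center(distances):
    """Double-center -0.5 * d**2 so that every row and column sums to zero."""
    n = len(distances)
    squared = [[-0.5 * d * d for d in row] for row in distances]
    row_means = [sum(row) / n for row in squared]
    col_means = [sum(squared[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [squared[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical three-sample distance matrix.
example_dm = [
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]
for row in gower_center(example_dm):
    print([round(value, 4) for value in row])
```

The eigenvalues of this centered matrix determine how much variation each axis explains, which is the quantity summarized by `proportion_explained`.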
Statistical Tests
Hypothesis testing for ecological data.
```python
from skbio.stats.distance import permanova, anosim, mantel

# Group comparison (needs one label per sample in the distance matrix, in matching order)
treatment_groups = ['control', 'control', 'treated', 'treated']
perm_result = permanova(bray_dm, treatment_groups, permutations=999)
print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}")

# Alternative group test
anos_result = anosim(bray_dm, treatment_groups, permutations=999)

# Matrix correlation
r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999)
print(f"r = {r:.3f}, p = {pval:.4f}")
```
**Test overview:**
| Test | Purpose | Key output |
|------|---------|------------|
| PERMANOVA | Group differences | Pseudo-F statistic, p-value |
| ANOSIM | Group differences (alternative) | R-statistic, p-value |
| PERMDISP | Dispersion homogeneity (checks a PERMANOVA assumption) | F-statistic, p-value |
| Mantel | Matrix correlation | Correlation coefficient, p-value |
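All of these tests obtain their p-values by permutation: the sample labels are shuffled many times and the observed statistic is compared with the shuffled ones. The pure-Python sketch below applies that idea to a Mantel-style test with a Pearson correlation and a one-sided p-value. The function names, the fixed seed, and the toy matrices are assumptions for this example; scikit-bio's `mantel` offers more options and may report a two-sided p-value by default.

```python
import random
from statistics import mean


def upper_triangle(matrix, order):
    """Flatten the above-diagonal entries after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length lists."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def mantel_sketch(dm_x, dm_y, permutations: int = 999, seed: int = 0):
    """Return (correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    ids = list(range(len(dm_x)))
    y_flat = upper_triangle(dm_y, ids)
    observed = pearson(upper_triangle(dm_x, ids), y_flat)
    at_least_as_extreme = 0
    for _ in range(permutations):
        shuffled = ids[:]
        rng.shuffle(shuffled)            # permute the sample order of one matrix only
        if pearson(upper_triangle(dm_x, shuffled), y_flat) >= observed:
            at_least_as_extreme += 1
    return observed, (at_least_as_extreme + 1) / (permutations + 1)


# Hypothetical sites along a transect; "genetic" distance grows with geography.
positions = [0, 1, 3, 6, 10]
geographic = [[abs(p - q) for q in positions] for p in positions]
genetic = [[d ** 0.5 for d in row] for row in geographic]
r_value, p_value = mantel_sketch(genetic, geographic)
print(round(r_value, 3), p_value)
```

Adding one to both the numerator and the denominator counts the observed arrangement itself, so the p-value can never be exactly zero.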
File I/O
Read and write 19+ biological formats.
```python
import skbio

# Automatic format detection
tree = skbio.TreeNode.read('phylogeny.nwk')

# Memory-efficient iteration
# (FASTQ reading needs a quality encoding; 'illumina1.8' is an assumption about the data)
for record in skbio.io.read('reads.fastq', format='fastq', variant='illumina1.8',
                            constructor=skbio.DNA):
    if record.positional_metadata['quality'].mean() > 30:
        process(record)

# Format conversion
records = skbio.io.read('sequences.fastq', format='fastq', variant='illumina1.8',
                        constructor=skbio.DNA)
skbio.io.write(records, format='fasta', into='sequences.fasta')
```
**Supported formats:**
| Category | Formats |
|----------|---------|
| Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq |
| Alignments | Clustal, PHYLIP, Stockholm |
| Trees | Newick |
| Tables | BIOM (HDF5/JSON) |
| Distances | Delimited matrices |
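The quality filter in the FASTQ example depends on how quality characters map to Phred scores. As a rough illustration, the sketch below parses four-line FASTQ records from an in-memory string and decodes qualities with an offset of 33, which is an assumption matching the Sanger and Illumina 1.8+ encodings. The function names and the two example reads are hypothetical, and the parser ignores multi-line records and validation that a real reader performs.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:                   # empty string means end of input
            return
        sequence = handle.readline().rstrip()
        handle.readline()                # skip the '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33):
    """Decode each quality character into its integer Phred score."""
    return [ord(char) - offset for char in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


# Two hypothetical reads: one uniformly high quality, one with low-quality bases.
FASTQ_TEXT = "@read_1\nACGT\n+\nIIII\n@read_2\nACGT\n+\n!!II\n"
kept = [
    name
    for name, sequence, quality in read_fastq(io.StringIO(FASTQ_TEXT))
    if mean_quality(quality) > 30
]
print(kept)
```

Because `read_fastq` is a generator, only one record is held in memory at a time, which mirrors the memory-efficient iteration pattern shown for large files.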
Distance Matrices
Store and manipulate pairwise distances.
```python
from skbio import DistanceMatrix
import numpy as np

# Create from array
distances = np.array([
    [0.0, 0.3, 0.7],
    [0.3, 0.0, 0.5],
    [0.7, 0.5, 0.0]
])
dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C'])

# Access elements
pair_dist = dm['sp_A', 'sp_B']
all_from_a = dm['sp_A']

# Subset
subset_dm = dm.filter(['sp_A', 'sp_C'])
```
Feature Tables (BIOM)
Handle OTU/ASV abundance tables.
```python
from skbio import Table

# Load table
tbl = Table.read('features.biom')

# Inspect structure
sample_names = tbl.ids(axis='sample')
feature_names = tbl.ids(axis='observation')

# Filter samples by total abundance
filtered = tbl.filter(lambda row, id_, md: row.sum() > 500, axis='sample')

# Convert to pandas
df = tbl.to_dataframe()
```
Protein Embeddings
Bridge language model outputs with scikit-bio analysis.
```python
from skbio.embedding import ProteinEmbedding

# Load embeddings (from ESM, ProtTrans, etc.)
emb = ProteinEmbedding(embedding_matrix, protein_ids)

# Create distance matrix for downstream analysis
emb_dm = emb.to_distances(metric='cosine')

# Ordination visualization
emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa')
```
Typical Workflows
Microbiome diversity study:
- Load BIOM table and phylogenetic tree
- Calculate alpha diversity per sample
- Compute beta diversity (UniFrac)
- Ordinate with PCoA
- Test group differences with PERMANOVA
Phylogenetic inference:
- Read sequences from FASTA
- Perform multiple alignment
- Calculate pairwise distances
- Construct tree with neighbor-joining
- Analyze clade relationships
Sequence processing:
- Read FASTQ with quality scores
- Filter low-quality reads
- Search for motifs
- Translate to protein
- Export as FASTA
Performance Tips
- Use generators for large sequence files
- Prefer BIOM HDF5 over JSON for big tables
- Apply `partial_beta_diversity()` when computing only specific pairs
- Choose BME for very large phylogenies
Ecosystem Integration
| Library | Integration |
|---|---|
| pandas | DataFrames from distance matrices, diversity results |
| numpy | Array conversions throughout |
| matplotlib/seaborn | Plot ordination results, heatmaps |
| scikit-learn | Distance matrices as input |
| QIIME 2 | Native BIOM, tree, distance matrix compatibility |
Reference Files
| File | Contents |
|---|---|
| references/api-reference.md | Complete method signatures, parameters, extended examples, and troubleshooting |