Single-cell clustering and batch correction with omicverse

Overview

This skill distills the single-cell tutorials `t_cluster.ipynb` and `t_single_batch.ipynb`. Use it when a user wants to preprocess an `AnnData` object, explore clustering alternatives (Leiden, Louvain, scICE, GMM, topic/cNMF models), and evaluate or harmonise batches with omicverse utilities.

Instructions

Import libraries and set plotting defaults
- Load `omicverse as ov`, `scanpy as sc`, and plotting helpers (`scvelo as scv` when using the dentate gyrus demo data).
- Apply `ov.plot_set()` or `ov.utils.ov_plot_set()` so figures adopt omicverse styling before embedding plots.
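
Which of the two styling entry points is available can depend on the installed omicverse release, so a small fallback can try one and then the other. This is a minimal sketch, not part of omicverse: `apply_plot_style` is a hypothetical helper, and it assumes only the two attribute paths named above. A stand-in module is used so the snippet runs without omicverse installed.

```python
from types import SimpleNamespace


def apply_plot_style(ov_module):
    """Call the first available styling function and return its dotted name.

    Hypothetical helper: it assumes only that the module exposes either
    `plot_set` or `utils.ov_plot_set`, as described in the step above.
    """
    utils = getattr(ov_module, "utils", None)
    candidates = (
        ("plot_set", getattr(ov_module, "plot_set", None)),
        ("utils.ov_plot_set", getattr(utils, "ov_plot_set", None)),
    )
    for name, func in candidates:
        if callable(func):
            func()
            return name
    raise AttributeError("no omicverse plot-styling function found")


# Stand-in for the omicverse module (assumption: only the older path exists).
calls = []
fake_ov = SimpleNamespace(
    utils=SimpleNamespace(ov_plot_set=lambda: calls.append("styled"))
)
chosen = apply_plot_style(fake_ov)
print(chosen)
```

In a real session the call would be `apply_plot_style(ov)` after `import omicverse as ov`.
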
Load data and annotate batches
- For demo clustering, fetch `scv.datasets.dentategyrus()`; for integration, read the provided `.h5ad` files via `ov.read()` and set `adata.obs['batch']` identifiers for each cohort.
- Confirm inputs are sparse numeric matrices; convert with `adata.X = adata.X.astype(np.int64)` when required for QC steps.
Run quality control
- Execute `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, batch_key='batch')` to drop low-quality cells and inspect summary statistics per batch. The keyword really is spelled `tresh`; do not "correct" it to `thresh`.
- Save intermediate filtered objects (`adata.write_h5ad(...)`) so users can resume from clean checkpoints.
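
To make the effect of the three thresholds concrete, the sketch below applies them to a few invented cells in plain Python. It is an illustration only, not the omicverse implementation: `passes_qc` is a hypothetical helper, the cell values are made up, and the comparison directions (mitochondrial fraction below the cutoff, counts above the minimums) are an assumption about how the filter behaves.

```python
def passes_qc(cell, tresh):
    """Return True when a cell clears all three thresholds.

    Assumed semantics: the mitochondrial fraction must stay below
    `mito_perc`, while UMI and detected-gene counts must exceed their minimums.
    """
    return (
        cell["mito_perc"] < tresh["mito_perc"]
        and cell["nUMIs"] > tresh["nUMIs"]
        and cell["detected_genes"] > tresh["detected_genes"]
    )


tresh = {"mito_perc": 0.2, "nUMIs": 500, "detected_genes": 250}

# Toy cells invented for illustration.
cells = [
    {"id": "c1", "mito_perc": 0.05, "nUMIs": 4200, "detected_genes": 1800},  # passes
    {"id": "c2", "mito_perc": 0.35, "nUMIs": 3900, "detected_genes": 1500},  # high mito
    {"id": "c3", "mito_perc": 0.08, "nUMIs": 310, "detected_genes": 190},    # low depth
]

kept = [cell["id"] for cell in cells if passes_qc(cell, tresh)]
print(kept)  # ['c1']
```
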
Preprocess and select features
- Call `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=3000, batch_key=None)` to normalise, log-transform, and flag highly variable genes; assign `adata.raw = adata` and subset to `adata.var.highly_variable_features` for downstream modelling.
- Scale expression (`ov.pp.scale(adata)`) and compute PCA scores with `ov.pp.pca(adata, layer='scaled', n_pcs=50)`. Encourage reviewing variance explained via `ov.utils.plot_pca_variance_ratio(adata)`.
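
Note that `mode='shiftlog|pearson'` is one literal string, not a choice between two values. Assuming it follows a `<normalisation>|<HVG method>` pattern (an assumption; check the omicverse documentation for the accepted combinations), a hypothetical helper such as `split_mode` can validate it before a long preprocessing run starts:

```python
def split_mode(mode):
    """Split a preprocess mode string into (normalisation, hvg_method).

    Hypothetical helper; it checks only the shape of the string, not
    whether omicverse supports the named methods.
    """
    parts = mode.split("|")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<normalisation>|<HVG method>', got {mode!r}")
    return parts[0], parts[1]


normalisation, hvg_method = split_mode("shiftlog|pearson")
print(normalisation, hvg_method)  # shiftlog pearson
```
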

Construct neighbourhood graph and baseline clustering
- Build the neighbour graph using `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` or `ov.pp.neighbors(...)`.
- Generate Leiden or Louvain labels through `ov.utils.cluster(adata, method='leiden', resolution=1)` (or `method='louvain'`), `ov.single.leiden(adata, resolution=1.0)`, or `ov.pp.leiden(adata, resolution=1)`; remind users that `resolution` tunes granularity (higher values yield more, smaller clusters).

IMPORTANT - Dependency checks: Always verify prerequisites before clustering or plotting:

```python
# Before clustering: check neighbors graph exists
if 'neighbors' not in adata.uns:
    if 'X_pca' in adata.obsm:
        ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
    else:
        raise ValueError("PCA must be computed before neighbors graph")

# Before plotting by cluster: check clustering was performed
if 'leiden' not in adata.obs:
    ov.single.leiden(adata, resolution=1.0)
```

Visualise embeddings with `ov.pl.embedding(adata, basis='X_umap', color=['clusters','leiden'], frameon='small', wspace=0.5)` and confirm cluster separation. Always check that the columns passed to the `color=` parameter exist in `adata.obs` before plotting.
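
The column check described above can be wrapped in a small pre-flight function. This is a sketch under stated assumptions: `check_plot_inputs` is a hypothetical helper, it treats every entry of `color` as an `adata.obs` column, and it uses a stand-in object so the snippet runs without AnnData. The same `in` tests work on a real `AnnData`, where `adata.obs` is a DataFrame and `adata.obsm` is a mapping.

```python
from types import SimpleNamespace


def check_plot_inputs(adata, basis, color):
    """Return a list of problems; an empty list means the plot can proceed.

    Assumption: every entry in `color` is meant to be an obs column.
    """
    problems = []
    if basis not in adata.obsm:
        problems.append(f"embedding {basis!r} missing from adata.obsm")
    for column in color:
        if column not in adata.obs:
            problems.append(f"column {column!r} missing from adata.obs")
    return problems


# Stand-in for an AnnData object that has a UMAP but no Leiden labels yet.
stub = SimpleNamespace(
    obs={"clusters": ["Granule", "Granule", "Astrocyte"]},
    obsm={"X_umap": [[0.0, 1.0], [0.1, 0.9], [5.0, 5.2]]},
)

problems = check_plot_inputs(stub, basis="X_umap", color=["clusters", "leiden"])
print(problems)  # ["column 'leiden' missing from adata.obs"]
```

An empty result signals that the embedding call is safe; otherwise run the missing step (for example Leiden clustering) first.
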

Explore advanced clustering strategies
- scICE consensus: instantiate `model = ov.utils.cluster(adata, method='scICE', use_rep='scaled|original|X_pca', resolution_range=(4,20), n_boot=50, n_steps=11)` and inspect stability via `model.plot_ic(figsize=(6,4))` before selecting `model.best_k` groups.
- Gaussian mixtures: run `ov.utils.cluster(..., method='GMM', n_components=21, covariance_type='full', tol=1e-9, max_iter=1000)` for model-based assignments.
- Topic modelling: fit `LDA_obj = ov.utils.LDA_topic(...)`, review `LDA_obj.plot_topic_contributions(6)`, derive cluster calls with `LDA_obj.predicted(k)`, and optionally refine using `LDA_obj.get_results_rfc(...)`.
- cNMF programs: initialise `cnmf_obj = ov.single.cNMF(... components=np.arange(5,11), n_iter=20, num_highvar_genes=2000, output_dir=...)`, factorise (`factorize`, `combine`), select K via `k_selection_plot`, and propagate usage scores back with `cnmf_obj.get_results(...)` and `cnmf_obj.get_results_rfc(...)`.

Evaluate clustering quality
- Compare predicted labels against known references with `adjusted_rand_score(adata.obs['clusters'], adata.obs['leiden'])` (from `sklearn.metrics`) and report metrics for each method (Leiden, Louvain, GMM, LDA variants, cNMF models) to justify chosen parameters.
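
For intuition about what the score measures, the adjusted Rand index can be computed directly from the contingency table of two labelings. The pure-Python version below is only an illustration with invented labels. In practice keep using scikit-learn's `adjusted_rand_score`; `adjusted_rand_index` here is a hypothetical stand-in written for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index from the contingency table of two labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree on

    # Pairs of cells that share a cluster in both, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate but identical partitions (all-in-one or all singletons)
    return (sum_cells - expected) / (max_index - expected)


# Toy labels invented for illustration.
reference = ["A", "A", "A", "B", "B", "B"]
candidates = {
    "leiden": ["0", "0", "0", "1", "1", "1"],  # same partition, different names
    "gmm": ["0", "0", "1", "1", "2", "2"],     # splits both reference groups
}

scores = {name: adjusted_rand_index(reference, labels) for name, labels in candidates.items()}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.3f}")
```

The `leiden` labels reproduce the reference partition under different names and therefore score 1.0, which shows that the index ignores label names and only compares groupings.
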
Embed with multiple layouts
- Use `ov.utils.mde(...)` to create MDE projections from different latent spaces (`adata.obsm["scaled|original|X_pca"]`, harmonised embeddings, topic compositions) and plot via `ov.utils.embedding(..., color=['batch','cell_type'])` or `ov.pl.embedding` for consistent review of cluster/batch mixing.
Perform batch correction and integration
- Apply `ov.single.batch_correction(adata, batch_key='batch', methods=..., n_pcs=50, ...)` once per method, setting `methods` to `'harmony'`, `'combat'`, `'scanorama'`, `'scVI'`, or `'CellANOVA'` in turn, to generate harmonised embeddings stored in `adata.obsm` (`X_harmony`, `X_combat`, `X_scanorama`, `X_scVI`, `X_cellanova`). For `scVI`, mention latent size (`n_latent=30`) and `gene_likelihood="nb"`; for CellANOVA define control pools via `control_dict`.
- After each correction, project to 2D with `ov.utils.mde` and visualise `batch` vs `cell_type` to check mixing and conservation.
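
Running the methods one after another is easier to audit when a loop records whether each expected `adata.obsm` key actually appeared. The sketch below is hypothetical scaffolding, not an omicverse API: `run_corrections` takes the correction function as an argument so the loop can be exercised with a stub, and in a real session that argument would be `ov.single.batch_correction`. The method-to-key mapping is taken from the list above.

```python
from types import SimpleNamespace

# Method name -> obsm key, as listed in the step above.
EXPECTED_KEYS = {
    "harmony": "X_harmony",
    "combat": "X_combat",
    "scanorama": "X_scanorama",
    "scVI": "X_scVI",
    "CellANOVA": "X_cellanova",
}


def run_corrections(adata, methods, correct_fn, method_kwargs=None):
    """Run each correction in turn and report which embeddings appeared."""
    method_kwargs = method_kwargs or {}
    produced = {}
    for method in methods:
        extra = method_kwargs.get(method, {})
        correct_fn(adata, batch_key="batch", methods=method, n_pcs=50, **extra)
        produced[method] = EXPECTED_KEYS[method] in adata.obsm
    return produced


def fake_correct(adata, batch_key, methods, n_pcs, **extra):
    """Stub standing in for the real correction call; 'scanorama' silently fails."""
    if methods != "scanorama":
        adata.obsm[EXPECTED_KEYS[methods]] = "embedding placeholder"


stub = SimpleNamespace(obsm={})
report = run_corrections(
    stub,
    ["harmony", "scanorama", "scVI"],
    fake_correct,
    method_kwargs={"scVI": {"n_latent": 30, "gene_likelihood": "nb"}},
)
print(report)  # {'harmony': True, 'scanorama': False, 'scVI': True}
```

Any method reported as `False` should be rerun or investigated before its embedding is projected or benchmarked.
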

Benchmark integration performance
- Persist the final object (`adata.write_h5ad('neurips2021_batch_all.h5ad', compression='gzip')`) and reload when necessary.
- Use `scib_metrics.benchmark.Benchmarker` with the embeddings list (`["X_pca", "X_combat", "X_harmony", "X_cellanova", "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI"]`) to compute batch-vs-biology trade-offs via `bm.benchmark()` and summarise with `bm.plot_results_table(min_max_scale=False)`.
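
The benchmark will fail if the embeddings list names a key that was never computed, so it helps to filter the list against `adata.obsm` first. This is a minimal sketch with assumptions: `available_embeddings` and `run_benchmark` are hypothetical helpers, the `cell_type` label column is assumed from the plotting step above, and `scib_metrics` is imported lazily so the filtering part runs without it.

```python
from types import SimpleNamespace

CANDIDATES = [
    "X_pca", "X_combat", "X_harmony", "X_cellanova",
    "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI",
]


def available_embeddings(adata, candidates=CANDIDATES):
    """Keep only the candidate keys present in adata.obsm, preserving order."""
    return [key for key in candidates if key in adata.obsm]


def run_benchmark(adata, batch_key="batch", label_key="cell_type"):
    """Benchmark whichever candidate embeddings exist (requires scib-metrics)."""
    from scib_metrics.benchmark import Benchmarker  # lazy import on purpose

    bm = Benchmarker(
        adata,
        batch_key=batch_key,
        label_key=label_key,
        embedding_obsm_keys=available_embeddings(adata),
    )
    bm.benchmark()
    return bm


# Stand-in object: only three of the eight embeddings were computed.
stub = SimpleNamespace(obsm={"X_pca": None, "X_harmony": None, "X_scVI": None})
keys = available_embeddings(stub)
print(keys)  # ['X_pca', 'X_harmony', 'X_scVI']
```

With a real `AnnData`, `bm = run_benchmark(adata)` followed by `bm.plot_results_table(min_max_scale=False)` mirrors the step above.
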

General troubleshooting
- Ensure `adata.raw` captures the unscaled log-normalised matrix before subsetting to HVGs.
- Confirm that the key passed to `use_rep` exists in `adata.obsm` prior to clustering; note that `'scaled|original|X_pca'` is a single literal key, not a list of alternatives. Rerun preprocessing if it is missing.
- Monitor memory when running cNMF or scVI; adjust `n_iter`, `components`, or latent dimensions for smaller datasets.
- Pipeline dependency errors: When you encounter errors like "Could not find 'leiden' in adata.obs", always check and add prerequisites:
  - Before leiden/louvain clustering → ensure `'neighbors' in adata.uns`
  - Before plotting by clustering → ensure the cluster column exists in `adata.obs`
  - Before UMAP/embedding → ensure PCA or another dimensionality reduction is complete
- Code generation pattern: When generating multi-step code, use defensive checks rather than assuming prior steps completed successfully. This prevents cascading failures when users run steps out of order or in separate sessions.
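
As one instance of the defensive pattern, the representation handed to `use_rep` can be resolved from a preference list instead of being hard-coded. The sketch is hypothetical: `pick_representation` is not an omicverse function, and the candidate order (the omicverse-style key first, then plain `X_pca`) is an assumption.

```python
from types import SimpleNamespace


def pick_representation(adata, candidates=("scaled|original|X_pca", "X_pca")):
    """Return the first candidate key found in adata.obsm, else raise KeyError."""
    for key in candidates:
        if key in adata.obsm:
            return key
    raise KeyError(
        f"none of {list(candidates)} found in adata.obsm; rerun scaling and PCA first"
    )


# Stand-in object that only carries a plain PCA embedding.
stub = SimpleNamespace(obsm={"X_pca": None})
rep = pick_representation(stub)
print(rep)  # X_pca
```

The result can then be passed on, for example `sc.pp.neighbors(adata, n_neighbors=15, use_rep=pick_representation(adata))`.
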

Examples

- "Normalise dentate gyrus cells, compare Leiden, scICE, and GMM clusters, and report ARI scores versus provided `clusters`."
- "Batch-correct three NeurIPS datasets with Harmony and scVI, produce MDE embeddings coloured by `batch` and `cell_type`, and benchmark the embeddings."
- "Fit topic and cNMF models on a preprocessed AnnData object, retrieve classifier-refined cluster calls, and visualise the resulting programs on UMAP."

References

- Clustering walkthrough: `t_cluster.ipynb`
- Batch integration walkthrough: `t_single_batch.ipynb`
- Quick copy/paste commands: `reference.md`
