single-cell-clustering-and-batch-correction-with-omicverse

Single-cell clustering and batch correction with omicverse

Overview

This skill distills the single-cell tutorials `t_cluster.ipynb` and `t_single_batch.ipynb`. Use it when a user wants to preprocess an `AnnData` object, explore clustering alternatives (Leiden, Louvain, scICE, GMM, topic/cNMF models), and evaluate or harmonise batches with omicverse utilities.

Instructions

  1. Import libraries and set plotting defaults
    • Import `omicverse as ov`, `scanpy as sc`, and plotting helpers (`scvelo as scv` when using the dentate gyrus demo data).
    • Apply `ov.plot_set()` or `ov.utils.ov_plot_set()` before plotting embeddings so that figures adopt omicverse styling.
  2. Load data and annotate batches
    • For demo clustering, fetch `scv.datasets.dentategyrus()`; for integration, read the provided `.h5ad` files via `ov.read()` and set an `adata.obs['batch']` identifier for each cohort.
    • Confirm that inputs are sparse numeric matrices; when QC steps require integer counts, convert with `adata.X = adata.X.astype(np.int64)` (this needs `numpy` imported as `np`).
  3. Run quality control
    • Execute `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, batch_key='batch')` to drop low-quality cells and inspect summary statistics per batch.
    • Save intermediate filtered objects (`adata.write_h5ad(...)`) so users can resume from clean checkpoints.
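To make the three thresholds concrete, here is a minimal pure-Python sketch of the filtering rule they imply. It does not call omicverse: the cell records are invented, and the direction and strictness of each comparison (mitochondrial fraction below its cap, UMI and gene counts above their floors) are assumptions about how `ov.pp.qc` applies `tresh`.

```python
# Hypothetical per-cell QC metrics; in a real run these would be columns of adata.obs.
cells = [
    {"id": "c1", "mito_perc": 0.05, "nUMIs": 1200, "detected_genes": 800},
    {"id": "c2", "mito_perc": 0.35, "nUMIs": 3000, "detected_genes": 1500},
    {"id": "c3", "mito_perc": 0.10, "nUMIs": 300, "detected_genes": 200},
    {"id": "c4", "mito_perc": 0.15, "nUMIs": 900, "detected_genes": 240},
]

# Same thresholds as in the ov.pp.qc call; mito_perc is a fraction (0.2 = 20%).
tresh = {"mito_perc": 0.2, "nUMIs": 500, "detected_genes": 250}


def passes_qc(cell, tresh):
    """Assumed rule: mito fraction under its cap, UMIs and genes over their floors."""
    return (
        cell["mito_perc"] < tresh["mito_perc"]
        and cell["nUMIs"] > tresh["nUMIs"]
        and cell["detected_genes"] > tresh["detected_genes"]
    )


kept = [c["id"] for c in cells if passes_qc(c, tresh)]
dropped = [c["id"] for c in cells if not passes_qc(c, tresh)]
print("kept:", kept)
print("dropped:", dropped)
```

Running the same rule by hand on a few cells is a quick way to sanity-check how many cells a given set of thresholds would remove before committing to them.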
  4. Preprocess and select features
    • Call `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=3000, batch_key=None)` to normalise, log-transform, and flag highly variable genes; then assign `adata.raw = adata` and subset to the genes flagged in `adata.var.highly_variable_features` for downstream modelling.
    • Scale expression (`ov.pp.scale(adata)`) and compute PCA scores with `ov.pp.pca(adata, layer='scaled', n_pcs=50)`. Encourage users to review the variance explained via `ov.utils.plot_pca_variance_ratio(adata)`.
  5. Construct neighbourhood graph and baseline clustering
    • Build the neighbour graph using `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` or `ov.pp.neighbors(...)`.
    • Generate Leiden or Louvain labels through `ov.utils.cluster(adata, method=..., resolution=1)` with `method` set to `'leiden'` or `'louvain'`, through `ov.single.leiden(adata, resolution=1.0)`, or through `ov.pp.leiden(adata, resolution=1)`; remind users that `resolution` tunes granularity (higher values yield more, smaller clusters).
    • IMPORTANT - Dependency checks: Always verify prerequisites before clustering or plotting:
      ```python
      import omicverse as ov

      # `adata` is the AnnData object produced by the preceding steps.

      # Before clustering: check that the neighbors graph exists
      if 'neighbors' not in adata.uns:
          if 'X_pca' in adata.obsm:
              ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
          else:
              raise ValueError("PCA must be computed before neighbors graph")

      # Before plotting by cluster: check that clustering was performed
      if 'leiden' not in adata.obs:
          ov.single.leiden(adata, resolution=1.0)
      ```
    • Visualise embeddings with `ov.pl.embedding(adata, basis='X_umap', color=['clusters','leiden'], frameon='small', wspace=0.5)` and confirm cluster separation. Before plotting, always check that every column passed to `color=` exists in `adata.obs` and that `X_umap` exists in `adata.obsm`.
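The colour-column check described above can be sketched as a small helper. The function name `split_color_keys` is hypothetical, and a `SimpleNamespace` with a plain dict stands in for a real `AnnData` object so the snippet runs without omicverse; the `in` test behaves the same way on a pandas `DataFrame` such as `adata.obs`.

```python
from types import SimpleNamespace


def split_color_keys(adata, color):
    """Split requested colour keys into those present in adata.obs and those missing.

    Only adata.obs columns are checked; gene names are ignored in this sketch.
    """
    present = [key for key in color if key in adata.obs]
    missing = [key for key in color if key not in adata.obs]
    return present, missing


# Stand-in for an AnnData object that has reference labels but no Leiden result yet.
fake_adata = SimpleNamespace(obs={"clusters": ["a", "b"], "batch": ["s1", "s2"]})

present, missing = split_color_keys(fake_adata, ["clusters", "leiden"])
print("plot these:", present)
print("compute or drop these first:", missing)
```

Passing only `present` to `color=` (or running the missing clustering step first) avoids a failure inside the plotting call.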
  6. Explore advanced clustering strategies
    • scICE consensus: instantiate `model = ov.utils.cluster(adata, method='scICE', use_rep='scaled|original|X_pca', resolution_range=(4,20), n_boot=50, n_steps=11)` and inspect stability via `model.plot_ic(figsize=(6,4))` before selecting `model.best_k` groups.
    • Gaussian mixtures: run `ov.utils.cluster(..., method='GMM', n_components=21, covariance_type='full', tol=1e-9, max_iter=1000)` for model-based assignments.
    • Topic modelling: fit `LDA_obj = ov.utils.LDA_topic(...)`, review `LDA_obj.plot_topic_contributions(6)`, derive cluster calls with `LDA_obj.predicted(k)`, and optionally refine them using `LDA_obj.get_results_rfc(...)`.
    • cNMF programs: initialise `cnmf_obj = ov.single.cNMF(... components=np.arange(5,11), n_iter=20, num_highvar_genes=2000, output_dir=...)`, factorise (`factorize`, `combine`), select K via `k_selection_plot`, and propagate usage scores back with `cnmf_obj.get_results(...)` and `cnmf_obj.get_results_rfc(...)`.
  7. Evaluate clustering quality
    • Compare predicted labels against known references with `adjusted_rand_score(adata.obs['clusters'], adata.obs['leiden'])` (from `sklearn.metrics`) and report the score for each method (Leiden, Louvain, GMM, LDA variants, cNMF models) to justify the chosen parameters.
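For readers who want to see what the adjusted Rand index measures, the following is a from-scratch sketch of the standard pair-counting formula using only the standard library. It is an illustration, not a replacement for `sklearn.metrics.adjusted_rand_score`; the function name and the toy label lists are made up for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts of the contingency table."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treat the partitions as identical

    # Pairs of cells that share a cluster in both partitions, in each one alone.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (sum_cells - expected) / (max_index - expected)


reference = [0, 0, 1, 1, 2, 2]
relabelled = ["b", "b", "a", "a", "c", "c"]  # same partition, different names
print(adjusted_rand_index(reference, relabelled))
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

The score depends only on which cells are grouped together, not on the label names, which is why methods that number their clusters differently can still be compared against the same reference column.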
  8. Embed with multiple layouts
    • Use `ov.utils.mde(...)` to create MDE projections from different latent spaces (`adata.obsm["scaled|original|X_pca"]`, harmonised embeddings, topic compositions) and plot them via `ov.utils.embedding(..., color=['batch','cell_type'])` or `ov.pl.embedding` so that cluster and batch mixing are reviewed consistently.
  9. Perform batch correction and integration
    • Apply `ov.single.batch_correction(adata, batch_key='batch', methods=..., n_pcs=50, ...)` once per method, with `methods` set in turn to `'harmony'`, `'combat'`, `'scanorama'`, `'scVI'`, and `'CellANOVA'`, to generate harmonised embeddings stored in `adata.obsm` (`X_harmony`, `X_combat`, `X_scanorama`, `X_scVI`, `X_cellanova`). For scVI, mention the latent size (`n_latent=30`) and `gene_likelihood="nb"`; for CellANOVA, define control pools via `control_dict`.
    • After each correction, project to 2D with `ov.utils.mde` and colour by `batch` and by `cell_type` to check batch mixing and conservation of biological signal.
  10. Benchmark integration performance
    • Persist the final object (`adata.write_h5ad('neurips2021_batch_all.h5ad', compression='gzip')`) and reload it when necessary.
    • Use `scib_metrics.benchmark.Benchmarker` with the list of embeddings (`["X_pca", "X_combat", "X_harmony", "X_cellanova", "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI"]`) to compute batch-vs-biology trade-offs via `bm.benchmark()` and summarise them with `bm.plot_results_table(min_max_scale=False)`.
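Not every embedding in that list will exist unless the corresponding method was actually run, so it can help to filter the list first. The sketch below assumes a hypothetical helper `available_embeddings` and uses a `SimpleNamespace` with a dict in place of a real `AnnData` object; membership tests on `adata.obsm` work the same way.

```python
from types import SimpleNamespace

# Embedding keys listed for the benchmark step.
requested = [
    "X_pca", "X_combat", "X_harmony", "X_cellanova",
    "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI",
]


def available_embeddings(adata, requested):
    """Return (found, skipped) so only embeddings present in adata.obsm are benchmarked."""
    found = [key for key in requested if key in adata.obsm]
    skipped = [key for key in requested if key not in adata.obsm]
    return found, skipped


# Stand-in object where only PCA, Harmony, and scVI have been computed.
fake_adata = SimpleNamespace(obsm={"X_pca": None, "X_harmony": None, "X_scVI": None})

found, skipped = available_embeddings(fake_adata, requested)
print("benchmark:", found)
print("not computed yet:", skipped)
```

The `found` list can then be passed to `Benchmarker` as its list of embedding keys, and `skipped` reported to the user rather than silently ignored.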
  11. General troubleshooting
    • Ensure `adata.raw` captures the unscaled log-normalised matrix before subsetting to HVGs.
    • Confirm that the key passed as `use_rep` (for example `'scaled|original|X_pca'`) exists in `adata.obsm` prior to clustering; rerun preprocessing if it is missing.
    • Monitor memory when running cNMF or scVI; adjust `n_iter`, `components`, or latent dimensions for smaller datasets.
    • Pipeline dependency errors: When you encounter errors like "Could not find 'leiden' in adata.obs", always check and add prerequisites:
      • Before Leiden/Louvain clustering → ensure `'neighbors' in adata.uns`
      • Before plotting by clustering → ensure the cluster column exists in `adata.obs`
      • Before UMAP/embedding → ensure PCA or another dimensionality reduction is complete
    • Code generation pattern: When generating multi-step code, use defensive checks rather than assuming prior steps completed successfully. This prevents cascading failures when users run steps out of order or in separate sessions.
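One way to apply the defensive pattern is to keep the prerequisites in a single table and query it before each step. The table contents, step names, and the `missing_prerequisites` helper below are illustrative assumptions based on the checks listed above, and a `SimpleNamespace` with dicts stands in for an `AnnData` object.

```python
from types import SimpleNamespace

# Hypothetical prerequisite table: step name -> (AnnData slot, key) pairs that must exist.
PREREQUISITES = {
    "neighbors": [("obsm", "X_pca")],
    "leiden": [("uns", "neighbors")],
    "plot_by_cluster": [("obs", "leiden"), ("obsm", "X_umap")],
}


def missing_prerequisites(adata, step):
    """List the slots a step needs that are not yet present on the object."""
    return [
        f"{slot}['{key}']"
        for slot, key in PREREQUISITES[step]
        if key not in getattr(adata, slot)
    ]


# Stand-in object on which only PCA has been run.
fake_adata = SimpleNamespace(obs={}, obsm={"X_pca": None}, uns={})

for step in ("neighbors", "leiden", "plot_by_cluster"):
    print(step, "->", missing_prerequisites(fake_adata, step) or "ready")
```

An empty list means the step can run; a non-empty list names exactly what to compute first, which gives a clearer message than the error raised deep inside a clustering or plotting call.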

Examples

  • "Normalise dentate gyrus cells, compare Leiden, scICE, and GMM clusters, and report ARI scores versus the provided `clusters`."
  • "Batch-correct three NeurIPS datasets with Harmony and scVI, produce MDE embeddings coloured by `batch` and `cell_type`, and benchmark the embeddings."
  • "Fit topic and cNMF models on a preprocessed AnnData object, retrieve classifier-refined cluster calls, and visualise the resulting programs on UMAP."

References

  • Clustering walkthrough: `t_cluster.ipynb`
  • Batch integration walkthrough: `t_single_batch.ipynb`
  • Quick copy/paste commands: `reference.md`