single-cell-clustering-and-batch-correction-with-omicverse
Single-cell clustering and batch correction with omicverse
Overview
This skill distills the single-cell tutorials `t_cluster.ipynb` and `t_single_batch.ipynb`. Use it when a user wants to preprocess an `AnnData` object, explore clustering alternatives (Leiden, Louvain, scICE, GMM, topic/cNMF models), and evaluate or harmonise batches with omicverse utilities.
Instructions
- Import libraries and set plotting defaults
  - Load `omicverse as ov`, `scanpy as sc`, and plotting helpers (`scvelo as scv` when using dentate gyrus demo data).
  - Apply `ov.plot_set()` or `ov.utils.ov_plot_set()` so figures adopt omicverse styling before embedding plots.
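Before running the imports in this step, it can help to confirm that the packages are installed in the current environment. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of omicverse.

```python
from importlib.util import find_spec


def check_packages(names=("omicverse", "scanpy", "scvelo")):
    """Return {package_name: bool} telling whether each package is importable."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If `scvelo` is reported missing, only the dentate gyrus demo data is affected; the rest of the workflow relies on `omicverse` and `scanpy`.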
- Load data and annotate batches
  - For demo clustering, fetch `scv.datasets.dentategyrus()`; for integration, read provided `.h5ad` files via `ov.read()` and set `adata.obs['batch']` identifiers for each cohort.
  - Confirm inputs are sparse numeric matrices; convert with `adata.X = adata.X.astype(np.int64)` when required for QC steps.
- Run quality control
  - Execute `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, batch_key='batch')` to drop low-quality cells and inspect summary statistics per batch.
  - Save intermediate filtered objects (`adata.write_h5ad(...)`) so users can resume from clean checkpoints.
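To make the threshold dictionary concrete, here is a small illustration of how the three cut-offs might be applied to per-cell summaries. It is a sketch, not omicverse's implementation: the toy cells, the `passes_qc` helper, and the direction of each comparison (minimum UMIs and genes, maximum mitochondrial fraction) are assumptions.

```python
# Toy per-cell QC summaries; in a real run these come from the AnnData object.
cells = {
    "cell_a": {"nUMIs": 1200, "detected_genes": 640, "mito_perc": 0.05},
    "cell_b": {"nUMIs": 300, "detected_genes": 180, "mito_perc": 0.02},
    "cell_c": {"nUMIs": 2500, "detected_genes": 900, "mito_perc": 0.35},
}

# Same threshold dictionary as in the QC step.
tresh = {"mito_perc": 0.2, "nUMIs": 500, "detected_genes": 250}


def passes_qc(stats, tresh):
    """Assumed rule: enough UMIs and genes, mitochondrial fraction below the cap."""
    return (
        stats["nUMIs"] > tresh["nUMIs"]
        and stats["detected_genes"] > tresh["detected_genes"]
        and stats["mito_perc"] < tresh["mito_perc"]
    )


kept = [name for name, stats in cells.items() if passes_qc(stats, tresh)]
print(kept)
```

Here `cell_b` fails on depth and `cell_c` fails on mitochondrial content, so only `cell_a` is retained.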
- Preprocess and select features
  - Call `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=3000, batch_key=None)` to normalise, log-transform, and flag highly variable genes; assign `adata.raw = adata` and subset to `adata.var.highly_variable_features` for downstream modelling.
  - Scale expression (`ov.pp.scale(adata)`) and compute PCA scores with `ov.pp.pca(adata, layer='scaled', n_pcs=50)`. Encourage reviewing variance explained via `ov.utils.plot_pca_variance_ratio(adata)`.
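The normalise-then-log part of this step can be sketched for a single cell as follows. This assumes that "shiftlog" means scaling each cell to a common total followed by `log(1 + x)`; the function name and the `target_sum` value are illustrative choices, not omicverse defaults.

```python
import math


def shiftlog(counts, target_sum=10_000):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


cell = [0, 3, 7, 10]  # toy raw counts for four genes
normalised = shiftlog(cell)
print([round(v, 3) for v in normalised])
```

Zero counts remain zero after the transform, and the ordering of genes within a cell is preserved, which is why the log-normalised matrix is a reasonable thing to keep in `adata.raw`.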
- Construct neighbourhood graph and baseline clustering
  - Build the neighbour graph using `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` or `ov.pp.neighbors(...)`.
  - Generate Leiden or Louvain labels through `ov.utils.cluster(adata, method='leiden'|'louvain', resolution=1)`, `ov.single.leiden(adata, resolution=1.0)`, or `ov.pp.leiden(adata, resolution=1)`; remind users that resolution tunes granularity.
  - IMPORTANT - Dependency checks: Always verify prerequisites before clustering or plotting:
    ```python
    # Before clustering: check neighbors graph exists
    if 'neighbors' not in adata.uns:
        if 'X_pca' in adata.obsm:
            ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
        else:
            raise ValueError("PCA must be computed before neighbors graph")

    # Before plotting by cluster: check clustering was performed
    if 'leiden' not in adata.obs:
        ov.single.leiden(adata, resolution=1.0)
    ```
  - Visualise embeddings with `ov.pl.embedding(adata, basis='X_umap', color=['clusters','leiden'], frameon='small', wspace=0.5)` and confirm cluster separation. Always check that the columns passed to the `color=` parameter exist in `adata.obs` before plotting.
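The check on `color=` columns mentioned in this step can be sketched with a small helper. The function name `split_color_keys` is hypothetical, and the sketch only looks at `adata.obs` columns; gene names from `adata.var_names` may also be valid colour keys and are not handled here.

```python
def split_color_keys(obs_columns, requested):
    """Split requested `color=` keys into those present in obs and those missing."""
    available = set(obs_columns)
    present = [key for key in requested if key in available]
    missing = [key for key in requested if key not in available]
    return present, missing


# Stand-in for list(adata.obs.columns) in a real session.
obs_columns = ["clusters", "n_counts"]
present, missing = split_color_keys(obs_columns, ["clusters", "leiden"])
if missing:
    print(f"Missing obs columns, run clustering first or drop them: {missing}")
```

Passing only `present` to the plotting call avoids a failure when, for example, Leiden clustering has not been run yet in the current session.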
- Explore advanced clustering strategies
  - scICE consensus: instantiate `model = ov.utils.cluster(adata, method='scICE', use_rep='scaled|original|X_pca', resolution_range=(4,20), n_boot=50, n_steps=11)` and inspect stability via `model.plot_ic(figsize=(6,4))` before selecting `model.best_k` groups.
  - Gaussian mixtures: run `ov.utils.cluster(..., method='GMM', n_components=21, covariance_type='full', tol=1e-9, max_iter=1000)` for model-based assignments.
  - Topic modelling: fit `LDA_obj = ov.utils.LDA_topic(...)`, review `LDA_obj.plot_topic_contributions(6)`, derive cluster calls with `LDA_obj.predicted(k)` and optionally refine using `LDA_obj.get_results_rfc(...)`.
  - cNMF programs: initialise `cnmf_obj = ov.single.cNMF(... components=np.arange(5,11), n_iter=20, num_highvar_genes=2000, output_dir=...)`, factorise (`factorize`, `combine`), select K via `k_selection_plot`, and propagate usage scores back with `cnmf_obj.get_results(...)` and `cnmf_obj.get_results_rfc(...)`.
- Evaluate clustering quality
  - Compare predicted labels against known references with `adjusted_rand_score(adata.obs['clusters'], adata.obs['leiden'])` and report metrics for each method (Leiden, Louvain, GMM, LDA variants, cNMF models) to justify chosen parameters.
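To show what the adjusted Rand index measures, the sketch below computes it from the contingency table of two labelings using only the standard library. In practice the `adjusted_rand_score` call in this step is presumably scikit-learn's function; the `adjusted_rand_index` helper and the toy labels here are illustrative only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of cells that share a label in both, in the reference, in the prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


reference = ["a", "a", "b", "b", "c", "c"]
renamed = ["2", "2", "0", "0", "1", "1"]  # same partition, different names
merged = ["0", "0", "0", "0", "1", "1"]   # two reference groups merged

print(adjusted_rand_index(reference, renamed))
print(round(adjusted_rand_index(reference, merged), 3))
```

The score depends only on the partition, not on label names, so a renamed but identical clustering scores 1.0 while merging two reference groups lowers it.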
- Embed with multiple layouts
  - Use `ov.utils.mde(...)` to create MDE projections from different latent spaces (`adata.obsm["scaled|original|X_pca"]`, harmonised embeddings, topic compositions) and plot via `ov.utils.embedding(..., color=['batch','cell_type'])` or `ov.pl.embedding` for consistent review of cluster/batch mixing.
- Perform batch correction and integration
  - Apply `ov.single.batch_correction(adata, batch_key='batch', methods='harmony'|'combat'|'scanorama'|'scVI'|'CellANOVA', n_pcs=50, ...)` sequentially to generate harmonised embeddings stored in `adata.obsm` (`X_harmony`, `X_combat`, `X_scanorama`, `X_scVI`, `X_cellanova`). For `scVI`, mention latent size (`n_latent=30`) and `gene_likelihood="nb"`; for CellANOVA define control pools via `control_dict`.
  - After each correction, project to 2D with `ov.utils.mde` and visualise `batch` vs `cell_type` to check mixing and conservation.
- Benchmark integration performance
  - Persist final object (`adata.write_h5ad('neurips2021_batch_all.h5ad', compression='gzip')`) and reload when necessary.
  - Use `scib_metrics.benchmark.Benchmarker` with embeddings list (`["X_pca", "X_combat", "X_harmony", "X_cellanova", "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI"]`) to compute batch-vs-biology trade-offs via `bm.benchmark()` and summarise with `bm.plot_results_table(min_max_scale=False)`.
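When only some correction methods were run, the embeddings list for the benchmark should contain only keys that actually exist in `adata.obsm`. A minimal sketch of that filtering, assuming a plain list of keys as a stand-in for `adata.obsm.keys()`; the helper name `available_embeddings` is hypothetical.

```python
# Candidate embeddings named in the benchmarking step.
candidate_keys = [
    "X_pca", "X_combat", "X_harmony", "X_cellanova",
    "X_scanorama", "X_mira_topic", "X_mira_feature", "X_scVI",
]


def available_embeddings(obsm_keys, candidates):
    """Keep candidates that exist in obsm, preserving the candidate order."""
    present = set(obsm_keys)
    return [key for key in candidates if key in present]


# Stand-in for list(adata.obsm.keys()) after running only Harmony and scVI.
obsm_keys = ["X_pca", "X_harmony", "X_scVI", "X_mde"]
embeddings_to_benchmark = available_embeddings(obsm_keys, candidate_keys)
print(embeddings_to_benchmark)
```

The filtered list is what would then be handed to the benchmarker, so a missing embedding does not abort the whole comparison.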
- General troubleshooting
  - Ensure `adata.raw` captures the unscaled log-normalised matrix before subsetting to HVGs.
  - Confirm `use_rep='scaled|original|X_pca'` strings exist in `adata.obsm` prior to clustering; rerun preprocessing if missing.
  - Monitor memory when running cNMF or scVI; adjust `n_iter`, `components`, or latent dimensions for smaller datasets.
  - Pipeline dependency errors: When you encounter errors like "Could not find 'leiden' in adata.obs", always check and add prerequisites:
    - Before leiden/louvain clustering → ensure `'neighbors' in adata.uns`
    - Before plotting by clustering → ensure the cluster column exists in `adata.obs`
    - Before UMAP/embedding → ensure PCA or another dimensionality reduction is complete
  - Code generation pattern: When generating multi-step code, use defensive checks rather than assuming prior steps completed successfully. This prevents cascading failures when users run steps out of order or in separate sessions.
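One way to apply the defensive pattern to the `use_rep` check is to pick the first representation that is actually present and fail with a clear message otherwise. This is a sketch under stated assumptions: `pick_use_rep` is a hypothetical helper, and the fallback order (omicverse-style key first, plain `X_pca` second) is an illustrative choice.

```python
def pick_use_rep(obsm_keys, preferred=("scaled|original|X_pca", "X_pca")):
    """Return the first preferred representation present in obsm, else raise."""
    for key in preferred:
        if key in obsm_keys:
            return key
    raise KeyError(
        f"None of {preferred} found in adata.obsm; rerun scaling and PCA first"
    )


# Stand-in for adata.obsm.keys() in a real session.
use_rep = pick_use_rep(["X_pca", "X_umap"])
print(use_rep)
```

The returned key can then be passed as `use_rep` to the neighbour-graph or clustering call, so the same code works whether PCA was run through omicverse or scanpy.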
Examples
- "Normalise dentate gyrus cells, compare Leiden, scICE, and GMM clusters, and report ARI scores versus provided `clusters`."
- "Batch-correct three NeurIPS datasets with Harmony and scVI, produce MDE embeddings coloured by `batch` and `cell_type`, and benchmark the embeddings."
- "Fit topic and cNMF models on a preprocessed AnnData object, retrieve classifier-refined cluster calls, and visualise the resulting programs on UMAP."
References
- Clustering walkthrough: `t_cluster.ipynb`
- Batch integration walkthrough: `t_single_batch.ipynb`
- Quick copy/paste commands: `reference.md`