
Single-cell preprocessing with omicverse


Overview


Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, and `t_preprocess_gpu.ipynb`. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.

Instructions


  1. Set up the environment
    • Import `omicverse as ov` and `scanpy as sc`, then call `ov.plot_set(font_path='Arial')` (or `ov.ov_plot_set()` in legacy notebooks) to standardise figure styling.
    • Encourage `%load_ext autoreload` and `%autoreload 2` when iterating inside notebooks so code edits propagate without restarting the kernel.
  2. Prepare input data
    • Download the PBMC3k filtered matrix from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) and extract it under `data/filtered_gene_bc_matrices/hg19/`.
    • Load the matrix via `sc.read_10x_mtx(..., var_names='gene_symbols', cache=True)` and keep a writable folder like `write/` for exports.
  3. Perform quality control (QC)
    • Run `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet')` for the CPU and CPU–GPU pipelines (note that the parameter really is spelled `tresh`); omit `doublets_method` on pure GPU, where Scrublet is not yet supported.
    • Review the returned AnnData summary to confirm doublet rates and QC thresholds; advise adjusting cut-offs for different species or sequencing depths.
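The thresholds above amount to a simple per-cell predicate. A library-free sketch of that logic (illustrative only; `ov.pp.qc` computes the metrics and applies the filter itself):

```python
def passes_qc(cell, tresh=None):
    """Return True when a cell's QC metrics clear the tutorial thresholds.

    `cell` is a plain dict of precomputed metrics; this mirrors the cut-offs
    passed to ov.pp.qc, not its implementation.
    """
    tresh = tresh or {'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}
    return (cell['mito_perc'] < tresh['mito_perc']
            and cell['nUMIs'] > tresh['nUMIs']
            and cell['detected_genes'] > tresh['detected_genes'])

# A healthy cell passes; a high-mitochondria droplet does not
print(passes_qc({'mito_perc': 0.05, 'nUMIs': 1200, 'detected_genes': 800}))  # True
print(passes_qc({'mito_perc': 0.45, 'nUMIs': 1200, 'detected_genes': 800}))  # False
```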
  4. Store raw counts before transformations
    • Call `ov.utils.store_layers(adata, layers='counts')` immediately after QC so the original counts remain accessible for later recovery and comparison.
  5. Normalise and select HVGs
    • Use `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5)` to apply shift-log normalisation followed by Pearson-residual HVG detection (set `target_sum=None` on GPU, which keeps the defaults).
    • For CPU–GPU mixed runs, demonstrate `ov.pp.recover_counts(...)` to invert normalisation and store reconstructed counts in `adata.layers['recover_counts']`.
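For intuition, the "shiftlog" part corresponds to the standard recipe of scaling each cell's counts to `target_sum` and then applying `log1p`. A minimal sketch (not omicverse's actual implementation, which also handles sparse matrices and HVG selection):

```python
import math

def shiftlog_normalise(counts, target_sum=5e5):
    """Sketch of shift-log normalisation on a dense list-of-lists matrix:
    per-cell library-size scaling to target_sum, then log1p."""
    out = []
    for cell in counts:          # one row per cell, one entry per gene
        total = sum(cell)
        out.append([math.log1p(c / total * target_sum) for c in cell])
    return out

normed = shiftlog_normalise([[1, 2, 7], [0, 5, 5]], target_sum=100)
```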
  6. Manage `.raw` and layer recovery
    • Snapshot normalised data to `.raw` with `adata.raw = adata` (or `adata.raw = adata.copy()`), and show `ov.utils.retrieve_layers(adata_counts, layers='counts')` to compare normalised vs. raw intensities.
  7. Scale, reduce, and embed
    • Scale features using `ov.pp.scale(adata)` (the scaled matrix is stored in `adata.layers`), followed by `ov.pp.pca(adata, layer='scaled', n_pcs=50)`.
    • Construct neighbourhood graphs with:
      • `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` for the baseline notebook.
      • `ov.pp.neighbors(..., use_rep='scaled|original|X_pca')` on CPU–GPU to leverage accelerated routines.
      • `ov.pp.neighbors(..., method='cagra')` on GPU to call RAPIDS graph primitives.
    • Generate embeddings via `ov.utils.mde(...)`, `ov.pp.umap(adata)`, `ov.pp.mde(...)`, `ov.pp.tsne(...)`, or `ov.pp.sude(...)` depending on the notebook variant.
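Conceptually, the scaling step is a per-gene z-score. A library-free sketch of that operation (illustrative only; `ov.pp.scale` works on the AnnData object and stores its result in a layer):

```python
import statistics

def zscore_scale(matrix):
    """Per-gene (column-wise) z-scoring of a dense list-of-lists matrix:
    subtract each gene's mean and divide by its standard deviation."""
    genes = list(zip(*matrix))                     # transpose to gene-major
    scaled_genes = []
    for g in genes:
        mu = statistics.fmean(g)
        sd = statistics.pstdev(g) or 1.0           # guard zero-variance genes
        scaled_genes.append([(v - mu) / sd for v in g])
    return [list(row) for row in zip(*scaled_genes)]  # back to cell-major

scaled = zscore_scale([[1.0, 10.0], [3.0, 30.0]])
```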
  8. Cluster and annotate
    • Run `ov.pp.leiden(adata, resolution=1)` or `ov.single.leiden(adata, resolution=1.0)` after neighbour-graph construction; CPU–GPU pipelines also showcase `ov.pp.score_genes_cell_cycle` before clustering.
    • IMPORTANT - Defensive checks: when generating code that plots by clustering results (e.g., `color='leiden'`), always check that the clustering has been performed first:

      ```python
      # Check if leiden clustering exists; if not, run it
      if 'leiden' not in adata.obs:
          if 'neighbors' not in adata.uns:
              ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
          ov.single.leiden(adata, resolution=1.0)
      ```

    • Plot embeddings with `ov.pl.embedding(...)` or `ov.utils.embedding(...)`, colouring by `leiden` clusters and marker genes. Always verify that the column passed to the `color=` parameter exists in `adata.obs` before plotting.
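That `color=` check can be factored into a small guard. A hypothetical helper (the name and signature are illustrative, not an omicverse API) that mirrors the error message quoted in the troubleshooting tips:

```python
def check_color_key(obs_columns, var_names, color):
    """Raise KeyError early if `color` resolves to neither a cell annotation
    nor a gene, instead of failing inside the plotting call."""
    if color not in obs_columns and color not in var_names:
        raise KeyError(
            f"Could not find {color!r} in adata.obs or adata.var_names"
        )

# Usage sketch: check_color_key(adata.obs.columns, adata.var_names, 'leiden')
check_color_key(['leiden', 'n_genes'], ['CD3D', 'MS4A1'], 'leiden')   # passes
```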
  9. Document outputs
    • Encourage saving intermediate AnnData objects (`adata.write('write/pbmc3k_preprocessed.h5ad')`) and exporting figures with Matplotlib's `plt.savefig(...)` to preserve QC summaries and embeddings.
  10. Notebook-specific notes
    • Baseline (`t_preprocess.ipynb`): focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after the `retrieve_layers` demonstrations.
    • CPU–GPU mixed (`t_preprocess_cpu.ipynb`): highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (`%%time`) to showcase speedups and call out `doublets_method='scrublet'` support.
    • GPU (`t_preprocess_gpu.ipynb`): requires a CUDA-capable GPU, the RAPIDS 24.04 stack, and `rapids-singlecell`. Mention the `ov.pp.anndata_to_GPU`/`ov.pp.anndata_to_CPU` transfers and `method='cagra'` neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.
  11. Troubleshooting tips
    • If `sc.read_10x_mtx` fails, verify the extracted folder structure and ensure gene symbols are available via `var_names='gene_symbols'`.
    • Address GPU import errors by confirming that the conda environment matches the RAPIDS version for the installed CUDA driver (check with `nvidia-smi`).
    • For `ov.pp.preprocess` dimension mismatches, ensure QC filtered out empty barcodes so HVG selection does not encounter zero-variance features.
    • When embeddings lack expected fields (e.g., `scaled|original|X_pca` missing), re-run `ov.pp.scale` and `ov.pp.pca` to rebuild the cached layers.
    • Pipeline dependency errors: when encountering errors like "Could not find 'leiden' in adata.obs or adata.var_names":
      • Always check that required preprocessing steps (neighbors, PCA) exist before dependent operations.
      • Check that clustering results exist in `adata.obs` before trying to colour plots by them.
      • Use defensive checks in generated code to handle incomplete pipelines gracefully.
    • Code-generation best practice: generate robust code with conditional checks for prerequisites rather than assuming perfect sequential execution. Users may run steps in separate sessions or skip intermediate steps.
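The dependency checks above can be generalised into a prerequisite audit. A sketch using plain sets to stand in for the keys of `adata.obs`/`adata.uns`/`adata.obsm` (the step names and helper are hypothetical bookkeeping, not an omicverse API):

```python
# Each step maps to the step that must have run immediately before it
PREREQS = {
    'pca': 'scale',
    'neighbors': 'pca',
    'leiden': 'neighbors',
    'umap': 'neighbors',
}

def missing_prereqs(done, wanted):
    """Walk the prerequisite chain for `wanted` and report, in execution
    order, every upstream step not present in the `done` set."""
    missing, step = [], wanted
    while step in PREREQS:
        step = PREREQS[step]
        if step not in done:
            missing.append(step)
    return missing[::-1]

print(missing_prereqs({'scale', 'pca'}, 'leiden'))  # ['neighbors']
```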

Critical API Reference - Batch Column Handling

Batch Column Validation - REQUIRED Before Batch Operations

IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.

CORRECT usage:

```python
# Step 1: Check if the batch column exists; create a default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # Default single batch

# Step 2: Handle NaN/missing values - CRITICAL!
adata.obs['batch'] = adata.obs['batch'].fillna('unknown')

# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype('category')

# Now safe to use in batch-aware operations
ov.pp.combat(adata, batch='batch')  # or other batch correction methods
```

WRONG - DO NOT USE:

```python
# WRONG! Using the batch column without validation can cause NaN errors
ov.pp.combat(adata, batch='batch')  # May fail if batch has NaN values!

# WRONG! Assuming the batch column exists
adata.obs['batch'].unique()  # KeyError if the column doesn't exist!
```

Common Batch-Related Pitfalls

  1. NaN values in batch column: always use `fillna()` before batch operations
  2. Missing batch column: always check existence before use
  3. Non-categorical batch: convert to category for memory efficiency
  4. Mixed data types: ensure a consistent string type before categorisation

Complete defensive batch preparation pattern:

```python
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    adata.obs[batch_key] = adata.obs[batch_key].fillna('unknown')
    adata.obs[batch_key] = adata.obs[batch_key].astype(str).astype('category')
    return adata
```
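The same three steps can be exercised without AnnData or pandas. A library-free illustration of the pattern (the helper below is hypothetical and operates on a plain list of labels; with a real AnnData, use the pandas code above):

```python
def prepare_batch_labels(labels, n_cells, default='batch_1', missing='unknown'):
    """Mirror the defensive batch preparation on a plain list of labels:
    create a default single batch if the column is absent, fill missing
    values, and coerce everything to consistent strings."""
    if labels is None:                              # Step 1: column missing
        return [default] * n_cells
    return [missing if v is None else str(v) for v in labels]  # Steps 2-3

print(prepare_batch_labels(None, 3))          # ['batch_1', 'batch_1', 'batch_1']
print(prepare_batch_labels(['a', None, 1], 3))  # ['a', 'unknown', '1']
```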

Highly Variable Genes (HVG) - Small Dataset Handling

LOESS Failure with Small Batches

IMPORTANT: The `seurat_v3` HVG flavor uses LOESS regression, which fails on small datasets or small per-batch subsets (<500 cells per batch). This manifests as:

ValueError: Extrapolation not allowed with blending

CORRECT - use a try/except fallback pattern:

```python
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',
        n_top_genes=2000,
        batch_key='batch'  # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fallback to a simpler method for small datasets
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # Works with any size
            n_top_genes=2000
        )
    else:
        raise
```

Alternative - use the `cell_ranger` flavor for batch-aware HVG:

```python
# The cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # No LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
```

Best Practices for Batch-Aware HVG

  1. Check batch sizes before HVG: small batches (<500 cells) will cause LOESS to fail
  2. Prefer `seurat` or `cell_ranger` when batch sizes vary significantly
  3. Use `seurat_v3` only when all batches have >500 cells
  4. Always wrap in try/except when the dataset size is unknown

Safe batch-aware HVG pattern:

```python
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat_v3',
            n_top_genes=n_top_genes,
            batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',
            n_top_genes=n_top_genes
        )
```
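The fallback above is pure control flow, so it can be tested without scanpy by abstracting the two calls as callables. A sketch of that dispatch (the function names are illustrative):

```python
def hvg_with_fallback(run_seurat_v3, run_seurat):
    """Run the seurat_v3 path first; on a LOESS/extrapolation ValueError,
    fall back to the plain seurat flavor. Any other error propagates."""
    try:
        return run_seurat_v3()
    except ValueError as e:
        if 'Extrapolation' in str(e) or 'LOESS' in str(e):
            return run_seurat()
        raise

def failing_v3():
    # Stand-in for sc.pp.highly_variable_genes raising on a small batch
    raise ValueError('Extrapolation not allowed with blending')

print(hvg_with_fallback(failing_v3, lambda: 'seurat fallback'))  # seurat fallback
```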

Examples

  • "Download PBMC3k counts, run QC with Scrublet, normalise with `shiftlog|pearson`, and compute MDE + UMAP embeddings on CPU."
  • "Set up the mixed CPU–GPU workflow in a fresh conda env, recover raw counts after normalisation, and score cell-cycle phases before Leiden clustering."
  • "Provision a RAPIDS environment, transfer AnnData to GPU, run `method='cagra'` neighbours, and return embeddings to CPU for plotting."

References

  • Detailed walkthrough notebooks: `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, `t_preprocess_gpu.ipynb`
  • Quick copy/paste commands: `reference.md`