
Single-cell preprocessing with omicverse


Overview


Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, and `t_preprocess_gpu.ipynb`. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.

Instructions


  1. Set up the environment
    • Import `omicverse as ov` and `scanpy as sc`, then call `ov.plot_set(font_path='Arial')` (or `ov.ov_plot_set()` in legacy notebooks) to standardise figure styling.
    • Encourage `%load_ext autoreload` and `%autoreload 2` when iterating inside notebooks so code edits propagate without restarting the kernel.
  2. Prepare input data
    • Download the PBMC3k filtered matrix from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) and extract it under `data/filtered_gene_bc_matrices/hg19/`.
    • Load the matrix via `sc.read_10x_mtx(..., var_names='gene_symbols', cache=True)` and keep a writable folder like `write/` for exports.
  3. Perform quality control (QC)
    • Run `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet')` for the CPU and CPU–GPU pipelines (note that the parameter really is spelled `tresh`); omit `doublets_method` on pure GPU, where Scrublet is not yet supported.
    • Review the returned AnnData summary to confirm doublet rates and QC thresholds; advise adjusting cut-offs for different species or sequencing depths.
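The thresholds above amount to a simple per-cell predicate. A library-free sketch of that logic (illustrative only; `ov.pp.qc` computes the metrics and applies the filter itself):

```python
def passes_qc(cell, tresh=None):
    """Return True when a cell's QC metrics clear the tutorial thresholds.

    `cell` is a plain dict of precomputed metrics; this mirrors the cut-offs
    passed to ov.pp.qc, not its implementation.
    """
    tresh = tresh or {'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}
    return (cell['mito_perc'] < tresh['mito_perc']
            and cell['nUMIs'] > tresh['nUMIs']
            and cell['detected_genes'] > tresh['detected_genes'])

# A healthy cell passes; a high-mitochondria droplet does not
print(passes_qc({'mito_perc': 0.05, 'nUMIs': 1200, 'detected_genes': 800}))  # True
print(passes_qc({'mito_perc': 0.45, 'nUMIs': 1200, 'detected_genes': 800}))  # False
```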
  4. Store raw counts before transformations
    • Call `ov.utils.store_layers(adata, layers='counts')` immediately after QC so the original counts remain accessible for later recovery and comparison.
  5. Normalise and select HVGs
    • Use `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5)` to apply shift-log normalisation followed by Pearson-residual HVG detection (set `target_sum=None` on GPU, which keeps the defaults).
    • For CPU–GPU mixed runs, demonstrate `ov.pp.recover_counts(...)` to invert normalisation and store reconstructed counts in `adata.layers['recover_counts']`.
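For intuition, the "shiftlog" part corresponds to the standard recipe of scaling each cell's counts to `target_sum` and then applying `log1p`. A minimal sketch (not omicverse's actual implementation, which also handles sparse matrices and HVG selection):

```python
import math

def shiftlog_normalise(counts, target_sum=5e5):
    """Sketch of shift-log normalisation on a dense list-of-lists matrix:
    per-cell library-size scaling to target_sum, then log1p."""
    out = []
    for cell in counts:          # one row per cell, one entry per gene
        total = sum(cell)
        out.append([math.log1p(c / total * target_sum) for c in cell])
    return out

normed = shiftlog_normalise([[1, 2, 7], [0, 5, 5]], target_sum=100)
```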
  6. Manage `.raw` and layer recovery
    • Snapshot normalised data to `.raw` with `adata.raw = adata` (or `adata.raw = adata.copy()`), and show `ov.utils.retrieve_layers(adata_counts, layers='counts')` to compare normalised vs. raw intensities.
  7. Scale, reduce, and embed
    • Scale features using `ov.pp.scale(adata)` (the scaled matrix is stored in `adata.layers`), followed by `ov.pp.pca(adata, layer='scaled', n_pcs=50)`.
    • Construct neighbourhood graphs with:
      • `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` for the baseline notebook.
      • `ov.pp.neighbors(..., use_rep='scaled|original|X_pca')` on CPU–GPU to leverage accelerated routines.
      • `ov.pp.neighbors(..., method='cagra')` on GPU to call RAPIDS graph primitives.
    • Generate embeddings via `ov.utils.mde(...)`, `ov.pp.umap(adata)`, `ov.pp.mde(...)`, `ov.pp.tsne(...)`, or `ov.pp.sude(...)` depending on the notebook variant.
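Conceptually, the scaling step is a per-gene z-score. A library-free sketch of that operation (illustrative only; `ov.pp.scale` works on the AnnData object and stores its result in a layer):

```python
import statistics

def zscore_scale(matrix):
    """Per-gene (column-wise) z-scoring of a dense list-of-lists matrix:
    subtract each gene's mean and divide by its standard deviation."""
    genes = list(zip(*matrix))                     # transpose to gene-major
    scaled_genes = []
    for g in genes:
        mu = statistics.fmean(g)
        sd = statistics.pstdev(g) or 1.0           # guard zero-variance genes
        scaled_genes.append([(v - mu) / sd for v in g])
    return [list(row) for row in zip(*scaled_genes)]  # back to cell-major

scaled = zscore_scale([[1.0, 10.0], [3.0, 30.0]])
```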
  8. Cluster and annotate
    • Run `ov.pp.leiden(adata, resolution=1)` or `ov.single.leiden(adata, resolution=1.0)` after neighbour-graph construction; CPU–GPU pipelines also showcase `ov.pp.score_genes_cell_cycle` before clustering.
    • IMPORTANT - Defensive checks: when generating code that plots by clustering results (e.g., `color='leiden'`), always check that the clustering has been performed first:

      ```python
      # Check if leiden clustering exists; if not, run it
      if 'leiden' not in adata.obs:
          if 'neighbors' not in adata.uns:
              ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
          ov.single.leiden(adata, resolution=1.0)
      ```

    • Plot embeddings with `ov.pl.embedding(...)` or `ov.utils.embedding(...)`, colouring by `leiden` clusters and marker genes. Always verify that the column passed to the `color=` parameter exists in `adata.obs` before plotting.
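That `color=` check can be factored into a small guard. A hypothetical helper (the name and signature are illustrative, not an omicverse API) that mirrors the error message quoted in the troubleshooting tips:

```python
def check_color_key(obs_columns, var_names, color):
    """Raise KeyError early if `color` resolves to neither a cell annotation
    nor a gene, instead of failing inside the plotting call."""
    if color not in obs_columns and color not in var_names:
        raise KeyError(
            f"Could not find {color!r} in adata.obs or adata.var_names"
        )

# Usage sketch: check_color_key(adata.obs.columns, adata.var_names, 'leiden')
check_color_key(['leiden', 'n_genes'], ['CD3D', 'MS4A1'], 'leiden')   # passes
```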
  9. Document outputs
    • Encourage saving intermediate AnnData objects (`adata.write('write/pbmc3k_preprocessed.h5ad')`) and exporting figures with Matplotlib's `plt.savefig(...)` to preserve QC summaries and embeddings.
  10. Notebook-specific notes
    • Baseline (`t_preprocess.ipynb`): focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after the `retrieve_layers` demonstrations.
    • CPU–GPU mixed (`t_preprocess_cpu.ipynb`): highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (`%%time`) to showcase speedups and call out `doublets_method='scrublet'` support.
    • GPU (`t_preprocess_gpu.ipynb`): requires a CUDA-capable GPU, the RAPIDS 24.04 stack, and `rapids-singlecell`. Mention the `ov.pp.anndata_to_GPU`/`ov.pp.anndata_to_CPU` transfers and `method='cagra'` neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.
  11. Troubleshooting tips
    • If `sc.read_10x_mtx` fails, verify the extracted folder structure and ensure gene symbols are available via `var_names='gene_symbols'`.
    • Address GPU import errors by confirming that the conda environment matches the RAPIDS version for the installed CUDA driver (check with `nvidia-smi`).
    • For `ov.pp.preprocess` dimension mismatches, ensure QC filtered out empty barcodes so HVG selection does not encounter zero-variance features.
    • When embeddings lack expected fields (e.g., `scaled|original|X_pca` missing), re-run `ov.pp.scale` and `ov.pp.pca` to rebuild the cached layers.
    • Pipeline dependency errors: when encountering errors like "Could not find 'leiden' in adata.obs or adata.var_names":
      • Always check that required preprocessing steps (neighbors, PCA) exist before dependent operations.
      • Check that clustering results exist in `adata.obs` before trying to colour plots by them.
      • Use defensive checks in generated code to handle incomplete pipelines gracefully.
    • Code-generation best practice: generate robust code with conditional checks for prerequisites rather than assuming perfect sequential execution. Users may run steps in separate sessions or skip intermediate steps.
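The dependency checks above can be generalised into a prerequisite audit. A sketch using plain sets to stand in for the keys of `adata.obs`/`adata.uns`/`adata.obsm` (the step names and helper are hypothetical bookkeeping, not an omicverse API):

```python
# Each step maps to the step that must have run immediately before it
PREREQS = {
    'pca': 'scale',
    'neighbors': 'pca',
    'leiden': 'neighbors',
    'umap': 'neighbors',
}

def missing_prereqs(done, wanted):
    """Walk the prerequisite chain for `wanted` and report, in execution
    order, every upstream step not present in the `done` set."""
    missing, step = [], wanted
    while step in PREREQS:
        step = PREREQS[step]
        if step not in done:
            missing.append(step)
    return missing[::-1]

print(missing_prereqs({'scale', 'pca'}, 'leiden'))  # ['neighbors']
```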

Critical API Reference - Batch Column Handling

Batch Column Validation - REQUIRED Before Batch Operations

IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.

CORRECT usage:

```python
# Step 1: Check if the batch column exists; create a default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # Default single batch

# Step 2: Handle NaN/missing values - CRITICAL!
adata.obs['batch'] = adata.obs['batch'].fillna('unknown')

# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype('category')

# Now safe to use in batch-aware operations
ov.pp.combat(adata, batch='batch')  # or other batch correction methods
```

WRONG - DO NOT USE:

```python
# WRONG! Using the batch column without validation can cause NaN errors
ov.pp.combat(adata, batch='batch')  # May fail if batch has NaN values!

# WRONG! Assuming the batch column exists
adata.obs['batch'].unique()  # KeyError if the column doesn't exist!
```

Common Batch-Related Pitfalls

  1. NaN values in batch column: always use `fillna()` before batch operations
  2. Missing batch column: always check existence before use
  3. Non-categorical batch: convert to category for memory efficiency
  4. Mixed data types: ensure a consistent string type before categorisation

Complete defensive batch preparation pattern:

```python
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    adata.obs[batch_key] = adata.obs[batch_key].fillna('unknown')
    adata.obs[batch_key] = adata.obs[batch_key].astype(str).astype('category')
    return adata
```
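The same three steps can be exercised without AnnData or pandas. A library-free illustration of the pattern (the helper below is hypothetical and operates on a plain list of labels; with a real AnnData, use the pandas code above):

```python
def prepare_batch_labels(labels, n_cells, default='batch_1', missing='unknown'):
    """Mirror the defensive batch preparation on a plain list of labels:
    create a default single batch if the column is absent, fill missing
    values, and coerce everything to consistent strings."""
    if labels is None:                              # Step 1: column missing
        return [default] * n_cells
    return [missing if v is None else str(v) for v in labels]  # Steps 2-3

print(prepare_batch_labels(None, 3))          # ['batch_1', 'batch_1', 'batch_1']
print(prepare_batch_labels(['a', None, 1], 3))  # ['a', 'unknown', '1']
```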

Highly Variable Genes (HVG) - Small Dataset Handling

LOESS Failure with Small Batches

IMPORTANT: The `seurat_v3` HVG flavor uses LOESS regression, which fails on small datasets or small per-batch subsets (<500 cells per batch). This manifests as:

ValueError: Extrapolation not allowed with blending

CORRECT - use a try/except fallback pattern:

```python
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',
        n_top_genes=2000,
        batch_key='batch'  # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fallback to a simpler method for small datasets
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # Works with any size
            n_top_genes=2000
        )
    else:
        raise
```

Alternative - use the `cell_ranger` flavor for batch-aware HVG:

```python
# The cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # No LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
```

Best Practices for Batch-Aware HVG

  1. Check batch sizes before HVG: small batches (<500 cells) will cause LOESS to fail
  2. Prefer `seurat` or `cell_ranger` when batch sizes vary significantly
  3. Use `seurat_v3` only when all batches have >500 cells
  4. Always wrap in try/except when the dataset size is unknown

Safe batch-aware HVG pattern:

```python
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat_v3',
            n_top_genes=n_top_genes,
            batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',
            n_top_genes=n_top_genes
        )
```
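The fallback above is pure control flow, so it can be tested without scanpy by abstracting the two calls as callables. A sketch of that dispatch (the function names are illustrative):

```python
def hvg_with_fallback(run_seurat_v3, run_seurat):
    """Run the seurat_v3 path first; on a LOESS/extrapolation ValueError,
    fall back to the plain seurat flavor. Any other error propagates."""
    try:
        return run_seurat_v3()
    except ValueError as e:
        if 'Extrapolation' in str(e) or 'LOESS' in str(e):
            return run_seurat()
        raise

def failing_v3():
    # Stand-in for sc.pp.highly_variable_genes raising on a small batch
    raise ValueError('Extrapolation not allowed with blending')

print(hvg_with_fallback(failing_v3, lambda: 'seurat fallback'))  # seurat fallback
```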

Examples

  • "Download PBMC3k counts, run QC with Scrublet, normalise with `shiftlog|pearson`, and compute MDE + UMAP embeddings on CPU."
  • "Set up the mixed CPU–GPU workflow in a fresh conda env, recover raw counts after normalisation, and score cell-cycle phases before Leiden clustering."
  • "Provision a RAPIDS environment, transfer AnnData to GPU, run `method='cagra'` neighbours, and return embeddings to CPU for plotting."

References

  • Detailed walkthrough notebooks: `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, `t_preprocess_gpu.ipynb`
  • Quick copy/paste commands: `reference.md`