Single-cell preprocessing with omicverse
Overview
Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, and `t_preprocess_gpu.ipynb`. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.

Instructions
- Set up the environment
  - Import `omicverse as ov` and `scanpy as sc`, then call `ov.plot_set(font_path='Arial')` (or `ov.ov_plot_set()` in legacy notebooks) to standardise figure styling.
  - Encourage `%load_ext autoreload` and `%autoreload 2` when iterating inside notebooks so code edits propagate without restarting the kernel.
- Prepare input data
  - Download the PBMC3k filtered matrix from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) and extract it under `data/filtered_gene_bc_matrices/hg19/`.
  - Load the matrix via `sc.read_10x_mtx(..., var_names='gene_symbols', cache=True)` and keep a writable folder like `write/` for exports.
- Perform quality control (QC)
  - Run `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet')` for the CPU/CPU–GPU pipelines; omit `doublets_method` on pure GPU where Scrublet is not yet supported.
  - Review the returned AnnData summary to confirm doublet rates and QC thresholds; advise adjusting cut-offs for different species or sequencing depths.
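The `tresh` dictionary encodes three per-cell cut-offs. As a rough illustration of what those thresholds mean, here is a pure-Python sketch (not omicverse's actual filtering code, just the logic the thresholds imply):

```python
# Illustrative sketch of the per-cell filters implied by the `tresh` dict
# passed to ov.pp.qc -- an assumption-level illustration, not the library code.
def passes_qc(cell, tresh=None):
    """Return True when a cell's QC metrics clear every threshold."""
    if tresh is None:
        tresh = {'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}
    return (cell['mito_perc'] < tresh['mito_perc']                 # mito fraction below cap
            and cell['nUMIs'] > tresh['nUMIs']                     # enough total UMIs
            and cell['detected_genes'] > tresh['detected_genes'])  # enough genes detected

cells = [
    {'mito_perc': 0.05, 'nUMIs': 1200, 'detected_genes': 800},  # healthy cell
    {'mito_perc': 0.35, 'nUMIs': 1500, 'detected_genes': 900},  # likely dying (high mito)
    {'mito_perc': 0.04, 'nUMIs': 300, 'detected_genes': 150},   # likely empty droplet
]
kept = [c for c in cells if passes_qc(c)]
print(len(kept))  # 1
```

Raising `mito_perc` or lowering `nUMIs` loosens the filter, which is the kind of adjustment suggested above for other species or shallower sequencing.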
- Store raw counts before transformations
  - Call `ov.utils.store_layers(adata, layers='counts')` immediately after QC so the original counts remain accessible for later recovery and comparison.
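The store-then-retrieve pattern can be pictured with a minimal stand-in object. The `MiniAdata` class and helpers below are hypothetical illustrations of the idea only; the real calls are `ov.utils.store_layers` and `ov.utils.retrieve_layers` on an AnnData object:

```python
# Hypothetical stand-in showing the store/retrieve idea: snapshot the matrix
# into a named layer before normalisation, restore it later for comparison.
class MiniAdata:
    def __init__(self, X):
        self.X = X        # "expression matrix" (a plain list here)
        self.layers = {}  # named snapshots, like adata.layers

def store_layer(adata, name='counts'):
    adata.layers[name] = list(adata.X)   # copy, so later edits don't touch it

def retrieve_layer(adata, name='counts'):
    adata.X = list(adata.layers[name])   # restore the snapshot

a = MiniAdata([3, 0, 7])
store_layer(a)               # keep raw counts
a.X = [0.5, 0.0, 1.2]        # pretend normalisation happened
retrieve_layer(a)            # back to raw counts
```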
- Normalise and select HVGs
  - Use `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5)` to apply shift-log normalisation followed by Pearson residual HVG detection (set `target_sum=None` on GPU, which keeps defaults).
  - For CPU–GPU mixed runs, demonstrate `ov.pp.recover_counts(...)` to invert normalisation and store reconstructed counts in `adata.layers['recover_counts']`.
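For intuition: the shifted-log half of `shiftlog|pearson` scales each cell to `target_sum` and applies `log1p`; keeping the cell's original total makes the transform invertible, which is the idea behind count recovery. A minimal numeric sketch (an assumed simplification of the transform, not the library implementation):

```python
import math

# Per-cell shifted-log: scale counts to target_sum, then log1p.
# Returning the original total lets us invert the transform later,
# mirroring what count recovery needs from stored metadata.
def shiftlog(counts, target_sum=5e5):
    total = sum(counts)
    return [math.log1p(c / total * target_sum) for c in counts], total

def recover(values, total, target_sum=5e5):
    # Inverse: expm1 undoes log1p, then rescale back to the original total.
    return [math.expm1(v) * total / target_sum for v in values]

norm, total = shiftlog([10, 0, 40, 50], target_sum=100)
roundtrip = recover(norm, total, target_sum=100)   # ~[10, 0, 40, 50]
```

The round trip is exact up to floating point, which is why storing raw counts (or the totals needed to rebuild them) before normalisation is worth the extra layer.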
- Manage `.raw` and layer recovery
  - Snapshot normalised data to `.raw` with `adata.raw = adata` (or `adata.raw = adata.copy()`), and show `ov.utils.retrieve_layers(adata_counts, layers='counts')` to compare normalised vs. raw intensities.
- Scale, reduce, and embed
  - Scale features using `ov.pp.scale(adata)` (layers hold scaled matrices) followed by `ov.pp.pca(adata, layer='scaled', n_pcs=50)`.
  - Construct neighbourhood graphs with:
    - `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` for the baseline notebook.
    - `ov.pp.neighbors(..., use_rep='scaled|original|X_pca')` on CPU–GPU to leverage accelerated routines.
    - `ov.pp.neighbors(..., method='cagra')` on GPU to call RAPIDS graph primitives.
  - Generate embeddings via `ov.utils.mde(...)`, `ov.pp.umap(adata)`, `ov.pp.mde(...)`, `ov.pp.tsne(...)`, or `ov.pp.sude(...)` depending on the notebook variant.
- Cluster and annotate
  - Run `ov.pp.leiden(adata, resolution=1)` or `ov.single.leiden(adata, resolution=1.0)` after neighbour graph construction; CPU–GPU pipelines also showcase `ov.pp.score_genes_cell_cycle` before clustering.
  - IMPORTANT - Defensive checks: When generating code that plots by clustering results (e.g., `color='leiden'`), always check if the clustering has been performed first:

    ```python
    # Check if leiden clustering exists; if not, run it
    if 'leiden' not in adata.obs:
        if 'neighbors' not in adata.uns:
            ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
        ov.single.leiden(adata, resolution=1.0)
    ```

  - Plot embeddings with `ov.pl.embedding(...)` or `ov.utils.embedding(...)`, colouring by `leiden` clusters and marker genes. Always verify that the column specified in `color=` exists in `adata.obs` before plotting.
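The same defensive idea can be factored into a small helper. The function below is hypothetical and operates on plain key lists standing in for `adata.obs`, `adata.obsm`, and `adata.uns`, so the ordering logic is visible without an AnnData dependency:

```python
# Hypothetical helper sketching the defensive-check idea: given the keys
# present on an AnnData-like object, report which pipeline steps must still
# run (in order) before plotting with color='leiden'.
def missing_prerequisites(obs_keys, obsm_keys, uns_keys):
    needed = []
    if 'X_pca' not in obsm_keys:      # PCA feeds the neighbour graph
        needed.append('pca')
    if 'neighbors' not in uns_keys:   # graph feeds clustering
        needed.append('neighbors')
    if 'leiden' not in obs_keys:      # clustering feeds the plot
        needed.append('leiden')
    return needed

print(missing_prerequisites([], [], []))  # ['pca', 'neighbors', 'leiden']
```

Generated code can walk the returned list and run the matching `ov.pp.pca` / `ov.pp.neighbors` / `ov.single.leiden` calls before plotting.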
- Document outputs
  - Encourage saving intermediate AnnData objects (`adata.write('write/pbmc3k_preprocessed.h5ad')`) and figure exports using Matplotlib's `plt.savefig(...)` to preserve QC summaries and embeddings.
- Notebook-specific notes
  - Baseline (`t_preprocess.ipynb`): Focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after `retrieve_layers` demonstrations.
  - CPU–GPU mixed (`t_preprocess_cpu.ipynb`): Highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (`%%time`) to showcase speedups and call out `doublets_method='scrublet'` support.
  - GPU (`t_preprocess_gpu.ipynb`): Requires a CUDA-capable GPU, RAPIDS 24.04 stack, and `rapids-singlecell`. Mention the `ov.pp.anndata_to_GPU`/`ov.pp.anndata_to_CPU` transfers and `method='cagra'` neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.
- Troubleshooting tips
  - If `sc.read_10x_mtx` fails, verify the extracted folder structure and ensure gene symbols are available via `var_names='gene_symbols'`.
  - Address GPU import errors by confirming the conda environment matches the RAPIDS version for the installed CUDA driver (`nvidia-smi`).
  - For dimension mismatches, ensure QC filtered out empty barcodes so that HVG selection in `ov.pp.preprocess` does not encounter zero-variance features.
  - When embeddings lack expected fields (e.g., `scaled|original|X_pca` missing), re-run `ov.pp.scale` and `ov.pp.pca` to rebuild the cached layers.
  - Pipeline dependency errors: when encountering errors like "Could not find 'leiden' in adata.obs or adata.var_names":
    - Always check that required preprocessing steps (neighbors, PCA) have run before dependent operations.
    - Check whether clustering results exist in `adata.obs` before trying to colour plots by them.
    - Use defensive checks in generated code to handle incomplete pipelines gracefully.
  - Code generation best practice: generate robust code with conditional checks for prerequisites rather than assuming perfect sequential execution. Users may run steps in separate sessions or skip intermediate steps.
Critical API Reference - Batch Column Handling
Batch Column Validation - REQUIRED Before Batch Operations
IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.

CORRECT usage:

```python
# Step 1: Check if batch column exists, create default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # Default single batch

# Step 2: Handle NaN/missing values - CRITICAL!
adata.obs['batch'] = adata.obs['batch'].fillna('unknown')

# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype('category')

# Now safe to use in batch-aware operations
ov.pp.combat(adata, batch='batch')  # or other batch correction methods
```

**WRONG - DO NOT USE:**

```python
# WRONG! Using batch column without validation can cause NaN errors
ov.pp.combat(adata, batch='batch')  # May fail if batch has NaN values!

# WRONG! Assuming batch column exists
adata.obs['batch'].unique()  # KeyError if column doesn't exist!
```

Common Batch-Related Pitfalls
- NaN values in batch column: Always use `fillna()` before batch operations
- Missing batch column: Always check existence before use
- Non-categorical batch: Convert to category for memory efficiency
- Mixed data types: Ensure consistent string type before categorisation

Complete defensive batch preparation pattern:

```python
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    adata.obs[batch_key] = adata.obs[batch_key].fillna('unknown')
    adata.obs[batch_key] = adata.obs[batch_key].astype(str).astype('category')
    return adata
```
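To make the sequencing of the three steps explicit without a pandas dependency, here is a pure-Python analogue (hypothetical names; `None` stands in for both a missing column and NaN entries):

```python
# Pure-Python analogue of the defensive batch pattern above:
# create if missing -> fill missing values -> normalise to one string type.
def prepare_batch_labels(labels, n_cells, default_batch='batch_1'):
    if labels is None:                 # "column" missing: create a default
        labels = [default_batch] * n_cells
    # fill None (NaN stand-in) and force a consistent string type
    return ['unknown' if v is None else str(v) for v in labels]

print(prepare_batch_labels(['a', None], 2))  # ['a', 'unknown']
```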
Highly Variable Genes (HVG) - Small Dataset Handling
LOESS Failure with Small Batches

IMPORTANT: The `seurat_v3` HVG flavor uses LOESS regression, which fails on small datasets or small per-batch subsets (<500 cells per batch). This manifests as `ValueError: Extrapolation not allowed with blending`.

CORRECT - Use try/except fallback pattern:

```python
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',
        n_top_genes=2000,
        batch_key='batch'  # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fallback to simpler method for small datasets
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # Works with any size
            n_top_genes=2000
        )
    else:
        raise
```

**Alternative - Use cell_ranger flavor for batch-aware HVG:**

```python
# cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # No LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
```

Best Practices for Batch-Aware HVG
- Check batch sizes before HVG: Small batches (<500 cells) will cause LOESS to fail
- Prefer `seurat` or `cell_ranger` when batch sizes vary significantly
- Use `seurat_v3` only when all batches have >500 cells
- Always wrap in try/except when dataset size is unknown

```python
# Safe batch-aware HVG pattern
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata, flavor='seurat_v3', n_top_genes=n_top_genes, batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches
        sc.pp.highly_variable_genes(
            adata, flavor='seurat', n_top_genes=n_top_genes
        )
```
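The "check batch sizes first" advice can also be applied up front instead of via exceptions. The helper below is a hypothetical sketch that picks a flavor from the smallest batch size, using this document's 500-cell rule of thumb:

```python
from collections import Counter

# Hypothetical pre-check: choose an HVG flavor from the smallest batch size
# (500-cell threshold per the LOESS guidance above).
def choose_hvg_flavor(batch_labels, min_cells=500):
    smallest = min(Counter(batch_labels).values())
    return 'seurat_v3' if smallest >= min_cells else 'seurat'

print(choose_hvg_flavor(['a'] * 600 + ['b'] * 10))  # seurat
```

The returned string can be passed straight to `sc.pp.highly_variable_genes(..., flavor=...)`, avoiding the ValueError entirely when batch sizes are known in advance.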
Examples
- "Download PBMC3k counts, run QC with Scrublet, normalise with , and compute MDE + UMAP embeddings on CPU."
shiftlog|pearson - "Set up the mixed CPU–GPU workflow in a fresh conda env, recover raw counts after normalisation, and score cell cycle phases before Leiden clustering."
- "Provision a RAPIDS environment, transfer AnnData to GPU, run neighbours, and return embeddings to CPU for plotting."
method='cagra'
- "下载PBMC3k计数数据,使用Scrublet进行QC,通过归一化,并在CPU上计算MDE + UMAP嵌入结果。"
shiftlog|pearson - "在新的conda环境中搭建CPU-GPU混合流水线,归一化后恢复原始计数,并在Leiden聚类前对细胞周期阶段打分。"
- "配置RAPIDS环境,将AnnData转移到GPU,运行邻接图,并将嵌入结果转回CPU用于绘图。"
method='cagra'
References
- Detailed walkthrough notebooks: `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, `t_preprocess_gpu.ipynb`
- Quick copy/paste commands: `reference.md`