bulk-rna-seq-deconvolution-with-bulk2single

Bulk RNA-seq deconvolution with Bulk2Single

Overview

Use this skill when a user wants to reconstruct single-cell profiles from bulk RNA-seq together with a matched reference scRNA-seq atlas. It follows `t_bulk2single.ipynb`, which demonstrates how to harmonise PDAC bulk replicates, train the beta-VAE generator, and benchmark the output cells against dentate gyrus scRNA-seq.

Instructions

Load libraries and data
- Import `omicverse as ov`, `scanpy as sc`, `scvelo as scv`, `anndata`, and `matplotlib.pyplot as plt`, then call `ov.plot_set()` to match omicverse styling.
- Read the bulk counts table with `ov.read(...)` / `ov.utils.read(...)` and harmonise gene identifiers via `ov.bulk.Matrix_ID_mapping(<df>, 'genesets/pair_GRCm39.tsv')`.
- Load the reference scRNA-seq AnnData (e.g., `scv.datasets.dentategyrus()`) and confirm the cluster labels (stored in `adata.obs['clusters']`).
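
The loading step can be sketched as a single helper. This is an illustration under stated assumptions, not the tutorial's exact code: the function name `load_inputs` is hypothetical, and `index_col=0` assumes the gene identifiers sit in the first column of the counts table. The omicverse and scvelo calls are the ones named in the bullets above.

```python
def load_inputs(bulk_path, mapping_path="genesets/pair_GRCm39.tsv", celltype_key="clusters"):
    """Load the bulk table and the reference atlas for Bulk2Single (sketch)."""
    # Heavy imports are deferred so that defining this helper has no side effects.
    import omicverse as ov
    import scvelo as scv

    ov.plot_set()  # match omicverse plotting style

    # Assumption: gene identifiers are stored in the first column of the table.
    bulk_df = ov.read(bulk_path, index_col=0)
    # Harmonise gene identifiers using the pair file named in the step above.
    bulk_df = ov.bulk.Matrix_ID_mapping(bulk_df, mapping_path)

    # Reference scRNA-seq atlas; the cluster labels must be present in .obs.
    adata = scv.datasets.dentategyrus()
    if celltype_key not in adata.obs.columns:
        raise KeyError(f"{celltype_key!r} not found in adata.obs")
    return bulk_df, adata
```

Failing early on a missing `clusters` column is a design choice here: it surfaces a mislabeled reference before any model is built.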

Initialise the Bulk2Single model
- Instantiate `ov.bulk2single.Bulk2Single(bulk_data=bulk_df, single_data=adata, celltype_key='clusters', bulk_group=['dg_d_1', 'dg_d_2', 'dg_d_3'], top_marker_num=200, ratio_num=1, gpu=0)`.
- Explain GPU selection (`gpu=-1` forces CPU) and how `bulk_group` names align with column IDs in the bulk matrix.
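
A minimal sketch of how the two checks above might be automated before instantiation. The helper names `missing_bulk_groups`, `pick_gpu`, and `build_model` are hypothetical, and using `torch.cuda.is_available()` to choose between `gpu=0` and `gpu=-1` is an assumption about a sensible default rather than something the tutorial prescribes.

```python
def missing_bulk_groups(bulk_columns, bulk_group):
    """Return the bulk_group names that are absent from the bulk matrix columns."""
    available = set(bulk_columns)
    return [name for name in bulk_group if name not in available]


def pick_gpu():
    """Return 0 when PyTorch reports a CUDA device, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_model(bulk_df, adata, bulk_group, gpu=None):
    """Instantiate Bulk2Single after checking that sample names line up."""
    missing = missing_bulk_groups(bulk_df.columns, bulk_group)
    if missing:
        raise ValueError(f"bulk_group names not in bulk matrix: {missing}")

    import omicverse as ov  # deferred import

    return ov.bulk2single.Bulk2Single(
        bulk_data=bulk_df,
        single_data=adata,
        celltype_key="clusters",
        bulk_group=bulk_group,
        top_marker_num=200,
        ratio_num=1,
        gpu=pick_gpu() if gpu is None else gpu,
    )
```

For example, `build_model(bulk_df, adata, ['dg_d_1', 'dg_d_2', 'dg_d_3'])` raises a `ValueError` listing any sample name that is not a column of `bulk_df`.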

Estimate cell fractions
- Call `model.predicted_fraction()` to run the integrated TAPE estimator, then plot stacked bar charts per sample to validate proportions.
- Encourage saving the fraction table for downstream reporting (`df.to_csv(...)`).
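
The fraction step might look like the sketch below. It assumes `model.predicted_fraction()` returns a pandas DataFrame with one row per bulk sample and one column per cell type, and that each row should sum to roughly 1; both are assumptions to verify against the installed version. The helper names are hypothetical.

```python
def samples_off_unity(row_sums, tol=1e-3):
    """Return sample names whose predicted fractions do not sum to about 1."""
    return [name for name, total in row_sums.items() if abs(total - 1.0) > tol]


def report_fractions(model, csv_path):
    """Predict fractions, draw one stacked bar per sample, and save the table."""
    # Assumed shape: samples (rows) x cell types (columns).
    fractions = model.predicted_fraction()

    off = samples_off_unity(fractions.sum(axis=1).to_dict())
    if off:
        print(f"Fractions do not sum to 1 for: {off}")

    # pandas plotting: one stacked bar per sample.
    ax = fractions.plot(kind="bar", stacked=True, figsize=(8, 4))
    ax.set_ylabel("Cell fraction")
    ax.legend(bbox_to_anchor=(1.02, 1), loc="upper left")

    # Persist the table for downstream reporting.
    fractions.to_csv(csv_path)
    return fractions
```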

Preprocess for beta-VAE
- Execute `model.bulk_preprocess_lazy()`, `model.single_preprocess_lazy()`, and `model.prepare_input()` to produce matched feature spaces.
- Clarify that the lazy preprocessing expects raw counts; skip it if the user has already log-normalised the data, and instead provide aligned matrices manually.
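
A small sketch of the preprocessing chain, with a heuristic guard for the raw-counts requirement. The check `looks_like_raw_counts` is a hypothetical helper, not part of omicverse: it only tests that values are non-negative whole numbers, which log-normalised data usually fails.

```python
def looks_like_raw_counts(values):
    """Heuristic: raw counts are non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for v in values)


def preprocess_for_vae(model):
    """Run the lazy preprocessing chain in the order given above."""
    model.bulk_preprocess_lazy()
    model.single_preprocess_lazy()
    model.prepare_input()
    return model
```

One way to use the guard is to pass it a sample of the bulk values (for instance the first few thousand entries of the flattened matrix) and only call `preprocess_for_vae(model)` when it returns `True`.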

Train or load the beta-VAE
- Train with `model.train(batch_size=512, learning_rate=1e-4, hidden_size=256, epoch_num=3500, vae_save_dir='...', vae_save_name='dg_vae', generate_save_dir='...', generate_save_name='dg')`.
- Mention early stopping via `patience` and how to resume by reloading weights with `model.load('.../dg_vae.pth')`.
- Use `model.plot_loss()` to monitor convergence.
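
The sketch below illustrates two ideas from this step. `stopping_epoch` is a generic, library-independent illustration of patience-based early stopping (it is not omicverse's internal implementation). `train_or_load` is a hypothetical wrapper that reloads weights when a file exists and trains otherwise; it assumes the weights are written as `<vae_dir>/<vae_name>.pth`, and the directories are supplied by the caller because the source leaves them elided.

```python
import os


def stopping_epoch(losses, patience):
    """Return the epoch index at which patience-based early stopping halts, or None."""
    best = float("inf")
    stalled = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, stalled = loss, 0  # improvement resets the counter
        else:
            stalled += 1
            if stalled >= patience:
                return epoch
    return None


def train_or_load(model, vae_dir, out_dir, vae_name="dg_vae", out_name="dg"):
    """Reload saved weights when present, otherwise train from scratch."""
    # Assumption: weights are saved as <vae_dir>/<vae_name>.pth.
    weights = os.path.join(vae_dir, vae_name + ".pth")
    if os.path.exists(weights):
        model.load(weights)
        return "loaded"

    model.train(
        batch_size=512,
        learning_rate=1e-4,
        hidden_size=256,
        epoch_num=3500,
        vae_save_dir=vae_dir,
        vae_save_name=vae_name,
        generate_save_dir=out_dir,
        generate_save_name=out_name,
    )
    model.plot_loss()  # inspect the loss curve for convergence
    return "trained"
```

With `patience=3`, the loss sequence `[5, 4, 3, 3.1, 3.2, 3.3, 2.0]` stops at epoch index 5, because three consecutive epochs fail to beat the best loss of 3.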

Generate and filter synthetic cells
- Produce an AnnData using `model.generate()` and reduce noise through `model.filtered(generate_adata, leiden_size=25)`.
- Store the filtered AnnData (`.write_h5ad`) for reuse, noting it contains PCA embeddings in `obsm['X_pca']`.
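
Generation, filtering, and saving can be chained as in this sketch. The helper name `generate_and_save` and the `out_path` argument are hypothetical; the method calls are the ones listed above, and the check for `X_pca` simply enforces the expectation stated in the second bullet.

```python
def generate_and_save(model, out_path, leiden_size=25):
    """Generate synthetic cells, filter noisy clusters, and write the result."""
    generate_adata = model.generate()
    filtered = model.filtered(generate_adata, leiden_size=leiden_size)

    # The later embedding step relies on the PCA coordinates being present.
    if "X_pca" not in filtered.obsm:
        raise KeyError("expected obsm['X_pca'] in the filtered AnnData")

    filtered.write_h5ad(out_path)
    return filtered
```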

Benchmark against the reference atlas
- Plot cell-type compositions with `ov.bulk2single.bulk2single_plot_cellprop(...)` for both generated and reference data.
- Assess correlation using `ov.bulk2single.bulk2single_plot_correlation(single_data, generate_adata, celltype_key='clusters')`.
- Embed with `generate_adata.obsm['X_mde'] = ov.utils.mde(generate_adata.obsm['X_pca'])` and visualise via `ov.utils.embedding(..., color=['clusters'], palette=ov.utils.pyomic_palette())`.
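
The benchmarking calls might be combined as below. Several details are assumptions because the source elides them: passing `(adata, celltype_key=...)` to `bulk2single_plot_cellprop`, and passing the AnnData with `basis='X_mde'` to `ov.utils.embedding`. The optional `ov` argument is a hypothetical hook that lets a stand-in module be injected; by default the real package is imported.

```python
def benchmark(single_data, generate_adata, celltype_key="clusters", ov=None):
    """Compare generated cells with the reference atlas (sketch)."""
    if ov is None:
        import omicverse as ov  # deferred import of the real package

    # Cell-type composition for generated and reference data (arguments assumed).
    ov.bulk2single.bulk2single_plot_cellprop(generate_adata, celltype_key=celltype_key)
    ov.bulk2single.bulk2single_plot_cellprop(single_data, celltype_key=celltype_key)

    # Correlation between reference and generated cell types.
    ov.bulk2single.bulk2single_plot_correlation(
        single_data, generate_adata, celltype_key=celltype_key
    )

    # 2D embedding computed from the stored PCA coordinates.
    generate_adata.obsm["X_mde"] = ov.utils.mde(generate_adata.obsm["X_pca"])
    ov.utils.embedding(
        generate_adata,
        basis="X_mde",  # assumption: basis names the obsm key to plot
        color=[celltype_key],
        palette=ov.utils.pyomic_palette(),
    )
    return generate_adata
```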

Troubleshooting tips
- If marker selection fails, increase `top_marker_num` or provide a curated marker list.
- Alignment errors typically stem from mismatched `bulk_group` names; double-check column IDs in the bulk matrix.
- Training on CPU can take several hours; advise switching `gpu` to an available CUDA device for speed.

Examples

"Estimate cell fractions for PDAC bulk replicates and generate synthetic scRNA-seq using Bulk2Single."
"Load a pre-trained Bulk2Single model, regenerate cells, and compare cluster proportions to the dentate gyrus atlas."
"Plot correlation heatmaps between generated cells and reference clusters after filtering noisy synthetic cells."

References

- Tutorial notebook: `t_bulk2single.ipynb`
- Example data and weights: `omicverse_guide/docs/Tutorials-bulk2single/data/`
- Quick copy/paste commands: `reference.md`
