bulk-rna-seq-deseq2-analysis-with-omicverse

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Bulk RNA-seq DESeq2 analysis with omicverse

基于 omicverse 的 Bulk RNA-seq DESeq2 分析

Overview

概述

Use this skill when a user wants to reproduce the DESeq2 workflow showcased in

t_deseq2.ipynb

. It covers loading raw featureCounts matrices, mapping Ensembl IDs to symbols, running PyDESeq2 via

ov.bulk.pyDEG

, and exploring downstream enrichment plots.

当用户希望复现

t_deseq2.ipynb

中展示的 DESeq2 工作流时，可使用此技能。内容涵盖加载原始 featureCounts 矩阵、将 Ensembl ID 映射为基因符号、通过

ov.bulk.pyDEG

运行 PyDESeq2，以及探索下游富集分析图。

Instructions

操作步骤

Import and format the expression matrix
- Call
```
import omicverse as ov
```
  and
```
ov.utils.ov_plot_set()
```
  to standardise visuals.
- Read tab-separated count data from featureCounts using
```
ov.utils.read(..., index_col=0, header=1)
```
  .
- Strip trailing
```
.bam
```
  from column names with
```
[c.split('/')[-1].replace('.bam', '') for c in data.columns]
```
  .

Map gene identifiers

Ensure the appropriate mapping pair exists by running
```
ov.utils.download_geneid_annotation_pair()
```
.

Replace

gene_id

with gene symbols using

ov.bulk.Matrix_ID_mapping(data, 'genesets/pair_<GENOME>.tsv')

Initialise the DEG object
- Create
```
dds = ov.bulk.pyDEG(data)
```
  from the mapped counts.
- Resolve duplicate gene names with
```
dds.drop_duplicates_index()
```
  and confirm success in logs.
Define contrasts and run DESeq2
- Collect sample labels into
```
treatment_groups
```
  and
```
control_groups
```
  lists that match column names exactly.
- Execute
```
dds.deg_analysis(treatment_groups, control_groups, method='DEseq2')
```
  to invoke PyDESeq2.
Filter and tune thresholds
- Inspect result shape (
```
dds.result.shape
```
  ) and optionally filter low-expression genes, e.g.
```
dds.result.loc[dds.result['log2(BaseMean)'] > 1]
```
  .
- Set thresholds via
```
dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)
```
  to auto-pick fold-change cutoffs.

Visualise differential genes

Draw volcano plots with
```
dds.plot_volcano(...)
```
and summarise key genes.

Produce per-gene boxplots:

dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=(2, 3))

Run enrichment analyses (optional)

Download enrichment libraries using

ov.utils.download_pathway_database()

and load them through

ov.utils.geneset_prepare

Rank genes for GSEA with
```
rnk = dds.ranking2gsea()
```
.

Instantiate

gsea_obj = ov.bulk.pyGSEA(rnk, pathway_dict)

and call

gsea_obj.enrichment()

to compute terms.

Plot enrichment bubble charts via

gsea_obj.plot_enrichment(...)

and GSEA curves with

gsea_obj.plot_gsea(term_num=..., ...)

Troubleshooting
- If PyDESeq2 raises errors about size factors, remind users to provide raw counts (not log-transformed data).
- ```
gene_id
```
  mapping depends on species; direct them to download the correct genome pair when results look sparse.
- Large pathway libraries may require raising recursion limits or filtering to the top N terms before plotting.

导入并格式化表达矩阵
- 调用
```
import omicverse as ov
```
  和
```
ov.utils.ov_plot_set()
```
  来统一可视化样式。
- 使用
```
ov.utils.read(..., index_col=0, header=1)
```
  读取 featureCounts 生成的制表符分隔的计数数据。
- 通过
```
[c.split('/')[-1].replace('.bam', '') for c in data.columns]
```
  去除列名末尾的
```
.bam
```
  后缀。

基因标识符映射

运行

ov.utils.download_geneid_annotation_pair()

确保存在合适的映射对。

使用

ov.bulk.Matrix_ID_mapping(data, 'genesets/pair_<GENOME>.tsv')

将

gene_id

替换为基因符号。

初始化差异表达基因（DEG）对象
- 从映射后的计数矩阵创建
```
dds = ov.bulk.pyDEG(data)
```
  。
- 调用
```
dds.drop_duplicates_index()
```
  去除重复基因名，并通过日志确认操作成功。
定义对比组并运行 DESeq2
- 将样本标签收集到
```
treatment_groups
```
  和
```
control_groups
```
  列表中，确保与列名完全匹配。
- 执行
```
dds.deg_analysis(treatment_groups, control_groups, method='DEseq2')
```
  调用 PyDESeq2。
过滤并调整阈值
- 查看结果的形状（
```
dds.result.shape
```
  ），可选择性过滤低表达基因，例如
```
dds.result.loc[dds.result['log2(BaseMean)'] > 1]
```
  。
- 通过
```
dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)
```
  设置阈值，自动选择倍数变化的截断值。

差异基因可视化

使用
```
dds.plot_volcano(...)
```
绘制火山图并总结关键基因。

生成单基因箱线图：

dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=(2, 3))

。

运行富集分析（可选）
- 使用
```
ov.utils.download_pathway_database()
```
  下载富集分析数据库，并通过
```
ov.utils.geneset_prepare
```
  加载。
- 通过
```
rnk = dds.ranking2gsea()
```
  生成用于 GSEA 的基因排名列表。
- 实例化
```
gsea_obj = ov.bulk.pyGSEA(rnk, pathway_dict)
```
  并调用
```
gsea_obj.enrichment()
```
  计算富集条目。
- 通过
```
gsea_obj.plot_enrichment(...)
```
  绘制富集气泡图，使用
```
gsea_obj.plot_gsea(term_num=..., ...)
```
  绘制 GSEA 曲线。
故障排除
- 如果 PyDESeq2 因大小因子问题报错，提醒用户提供原始计数数据（而非对数转换后的数据）。
- ```
gene_id
```
  映射依赖于物种；当结果稀疏时，指导用户下载对应基因组的映射对。
- 大型通路数据库可能需要提高递归限制，或在绘图前过滤出前 N 个条目。

Examples

示例

"Run PyDESeq2 on treated vs control replicates and highlight the top enriched WikiPathways terms."
"Filter DEGs to genes with log2(BaseMean) > 1, auto-select fold-change cutoffs, and create volcano and boxplots."
"Generate the ranked gene list for GSEA and plot the enrichment curve for the top pathway."

"在处理组与对照组重复样本上运行 PyDESeq2，并高亮显示富集程度最高的 WikiPathways 条目。"
"将差异表达基因过滤为 log2(BaseMean) > 1 的基因，自动选择倍数变化截断值，并生成火山图和箱线图。"
"生成用于 GSEA 的基因排名列表，并绘制排名第一的通路的富集曲线。"

References

参考资料

Tutorial notebook:
```
t_deseq2.ipynb
```
Sample featureCounts matrix:
```
sample/counts.txt
```
Quick copy/paste commands:
```
reference.md
```

教程笔记本：
```
t_deseq2.ipynb
```
示例 featureCounts 矩阵：
```
sample/counts.txt
```
快速复制命令：
```
reference.md
```