bulk-rna-seq-deseq2-analysis-with-omicverse
Bulk RNA-seq DESeq2 analysis with omicverse
Overview
Use this skill when a user wants to reproduce the DESeq2 workflow showcased in `t_deseq2.ipynb`. It covers loading raw featureCounts matrices, mapping Ensembl IDs to symbols, running PyDESeq2 via `ov.bulk.pyDEG`, and exploring downstream enrichment plots.
Instructions
- Import and format the expression matrix
  - Run `import omicverse as ov` and `ov.utils.ov_plot_set()` to standardise visuals.
  - Read tab-separated count data from featureCounts using `ov.utils.read(..., index_col=0, header=1)`.
  - Strip trailing `.bam` from column names with `[c.split('/')[-1].replace('.bam', '') for c in data.columns]`.
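The loading step above can be sketched as follows. This is a minimal sketch, assuming omicverse is installed and that `counts_path` points to a featureCounts output file; the helper names `sample_name_from_bam_path` and `load_counts` are hypothetical and not part of omicverse. The omicverse import is deferred into the function so the pure-Python helper can be checked on its own.

```python
def sample_name_from_bam_path(column: str) -> str:
    """Reduce a featureCounts column label such as 'aln/s1.bam' to 's1'."""
    return column.split('/')[-1].replace('.bam', '')


def load_counts(counts_path: str):
    """Load a featureCounts matrix and tidy its sample column names.

    Hypothetical helper: only the ov.utils calls come from the workflow text.
    """
    import omicverse as ov  # deferred because it is a heavy dependency

    ov.utils.ov_plot_set()  # standardise plotting style
    # featureCounts writes a comment line first, so the header sits on row 1
    data = ov.utils.read(counts_path, index_col=0, header=1)
    data.columns = [sample_name_from_bam_path(c) for c in data.columns]
    return data
```

Calling `load_counts('sample/counts.txt')` would then return the count table with clean sample names, provided that file is available locally.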
- Map gene identifiers
  - Ensure the appropriate mapping pair exists by running `ov.utils.download_geneid_annotation_pair()`.
  - Replace `gene_id` with gene symbols using `ov.bulk.Matrix_ID_mapping(data, 'genesets/pair_<GENOME>.tsv')`.
- Initialise the DEG object
  - Create `dds = ov.bulk.pyDEG(data)` from the mapped counts.
  - Resolve duplicate gene names with `dds.drop_duplicates_index()` and confirm success in logs.
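Identifier mapping and DEG-object initialisation might be combined as below. This is a sketch only: `pair_file_for` and `build_deg_object` are hypothetical helper names, and the genome label passed in (for example `'GRCm39'`) is an assumption that must match a pair file actually present under `genesets/`.

```python
def pair_file_for(genome: str) -> str:
    """Build the path of the ID-to-symbol pair file for a genome label."""
    return f'genesets/pair_{genome}.tsv'


def build_deg_object(data, genome: str):
    """Map gene IDs to symbols, then wrap the counts in a pyDEG object.

    `data` is the count table from the loading step; `genome` is an
    assumed label such as 'GRCm39' that names a downloaded pair file.
    """
    import omicverse as ov  # deferred because it is a heavy dependency

    ov.utils.download_geneid_annotation_pair()  # fetch the pair files
    data = ov.bulk.Matrix_ID_mapping(data, pair_file_for(genome))
    dds = ov.bulk.pyDEG(data)
    dds.drop_duplicates_index()  # resolve duplicated gene symbols
    return dds
```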
- Define contrasts and run DESeq2
  - Collect sample labels into `treatment_groups` and `control_groups` lists that match column names exactly.
  - Execute `dds.deg_analysis(treatment_groups, control_groups, method='DEseq2')` to invoke PyDESeq2.
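Because the group labels must match the column names exactly, it can help to check them before starting the analysis. The sketch below assumes `dds` was created from the same `data` table; `validate_groups` and `run_deseq2` are hypothetical helpers, and the disjointness check is an added safeguard rather than something the workflow requires.

```python
def validate_groups(columns, treatment_groups, control_groups):
    """Raise ValueError unless every label is a column and groups are disjoint."""
    known = set(columns)
    missing = [s for s in [*treatment_groups, *control_groups] if s not in known]
    if missing:
        raise ValueError(f"labels not found among columns: {missing}")
    overlap = sorted(set(treatment_groups) & set(control_groups))
    if overlap:
        raise ValueError(f"samples assigned to both groups: {overlap}")


def run_deseq2(dds, data, treatment_groups, control_groups):
    """Check the contrast against `data`, then run PyDESeq2 through `dds`."""
    validate_groups(data.columns, treatment_groups, control_groups)
    dds.deg_analysis(treatment_groups, control_groups, method='DEseq2')
    return dds.result  # result table stored on the pyDEG object
```

A label that differs only in case or carries a leftover `.bam` suffix is reported up front instead of surfacing later inside the analysis.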
- Filter and tune thresholds
  - Inspect result shape (`dds.result.shape`) and optionally filter low-expression genes, e.g. `dds.result.loc[dds.result['log2(BaseMean)'] > 1]`.
  - Set thresholds via `dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)` to auto-pick fold-change cutoffs.
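The filtering and threshold step could be wrapped as two small functions. This is a sketch under assumptions: the helper names are hypothetical, and writing the filtered table back to `dds.result` is an assumption about how the filter is meant to be applied. Note that `log2(BaseMean) > 1` corresponds to a base mean above 2.

```python
def filter_low_expression(dds, min_log2_basemean=1):
    """Keep rows whose log2(BaseMean) exceeds the cutoff; return the new shape.

    Assumption: the filtered table is assigned back to `dds.result`.
    """
    keep = dds.result['log2(BaseMean)'] > min_log2_basemean
    dds.result = dds.result.loc[keep]
    return dds.result.shape


def set_auto_thresholds(dds, pval_threshold=0.05, logp_max=6):
    """Set thresholds with fc_threshold=-1 so the fold-change cutoff is auto-picked."""
    dds.foldchange_set(
        fc_threshold=-1,
        pval_threshold=pval_threshold,
        logp_max=logp_max,
    )
```

Comparing `dds.result.shape` before and after `filter_low_expression(dds)` shows how many low-expression genes were dropped.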
- Visualise differential genes
  - Draw volcano plots with `dds.plot_volcano(...)` and summarise key genes.
  - Produce per-gene boxplots: `dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=(2, 3))`.
- Run enrichment analyses (optional)
  - Download enrichment libraries using `ov.utils.download_pathway_database()` and load them through `ov.utils.geneset_prepare`.
  - Rank genes for GSEA with `rnk = dds.ranking2gsea()`.
  - Instantiate `gsea_obj = ov.bulk.pyGSEA(rnk, pathway_dict)` and call `gsea_obj.enrichment()` to compute terms.
  - Plot enrichment bubble charts via `gsea_obj.plot_enrichment(...)` and GSEA curves with `gsea_obj.plot_gsea(term_num=..., ...)`.
- Troubleshooting
  - If PyDESeq2 raises errors about size factors, remind users to provide raw counts (not log-transformed data).
  - `gene_id` mapping depends on species; direct them to download the correct genome pair when results look sparse.
  - Large pathway libraries may require raising recursion limits or filtering to the top N terms before plotting.
Examples
- "Run PyDESeq2 on treated vs control replicates and highlight the top enriched WikiPathways terms."
- "Filter DEGs to genes with log2(BaseMean) > 1, auto-select fold-change cutoffs, and create volcano and boxplots."
- "Generate the ranked gene list for GSEA and plot the enrichment curve for the top pathway."
References
- Tutorial notebook: `t_deseq2.ipynb`
- Sample featureCounts matrix: `sample/counts.txt`
- Quick copy/paste commands: `reference.md`