tooluniverse-rnaseq-deseq2

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

RNA-seq Differential Expression Analysis (DESeq2)

RNA-seq差异表达分析(DESeq2)

Comprehensive differential expression analysis of RNA-seq count data using PyDESeq2, with integrated enrichment analysis (gseapy) and gene annotation via ToolUniverse.
BixBench Coverage: Validated on 53 BixBench questions across 15 computational biology projects covering RNA-seq, miRNA-seq, and differential expression analysis tasks.

使用PyDESeq2对RNA-seq计数数据进行全面的差异表达分析,并集成了富集分析(gseapy)和通过ToolUniverse进行的基因注释功能。
BixBench覆盖范围:已在15个计算生物学项目的53个BixBench问题上验证,涵盖RNA-seq、miRNA-seq和差异表达分析任务。

Core Principles

核心原则

  1. Data-first approach - Load and validate count data and metadata BEFORE any analysis
  2. Statistical rigor - Always use proper normalization, dispersion estimation, and multiple testing correction
  3. Flexible design - Support single-factor, multi-factor, and interaction designs
  4. Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean)
  5. Reproducible - Set random seeds, document all parameters, output complete results
  6. Question-driven - Parse what the user is actually asking and extract the specific answer
  7. Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested
  8. English-first queries - Use English gene/pathway names in all tool calls

  1. 数据优先方法 - 在任何分析之前加载并验证计数数据和元数据
  2. 统计严谨性 - 始终使用正确的标准化、离散度估计和多重检验校正
  3. 灵活设计 - 支持单因素、多因素和交互作用设计
  4. 阈值感知 - 严格应用用户指定的阈值(padj、log2FC、baseMean)
  5. 可复现性 - 设置随机种子、记录所有参数、输出完整结果
  6. 问题驱动 - 解析用户实际需求并提取特定答案
  7. 富集分析集成 - 按需将DESeq2结果关联到通路/GO富集分析
  8. 英文优先查询 - 在所有工具调用中使用英文基因/通路名称

When to Use This Skill

适用场景

Apply when users:
  • Have RNA-seq count matrices and want differential expression analysis
  • Ask about DESeq2, DEGs, differential expression, padj, log2FC
  • Need dispersion estimates or diagnostics
  • Want enrichment analysis (GO, KEGG, Reactome) on DEGs
  • Ask about specific gene expression changes between conditions
  • Need to compare multiple strains/conditions/treatments
  • Ask about batch effect correction in RNA-seq
  • Questions mention "count data", "count matrix", "RNA-seq", "transcriptomics"

当用户有以下需求时适用:
  • 拥有RNA-seq计数矩阵并需要进行差异表达分析
  • 询问关于DESeq2、DEGs、差异表达、padj、log2FC的相关问题
  • 需要离散度估计或诊断分析
  • 希望对DEGs进行富集分析(GO、KEGG、Reactome)
  • 询问不同条件间特定基因的表达变化
  • 需要比较多个菌株/条件/处理组
  • 询问RNA-seq中的批次效应校正方法
  • 问题中提及“计数数据”、“计数矩阵”、“RNA-seq”、“转录组学”

Required Packages

所需包

python
undefined
python
undefined

Core (MUST be installed)

核心包(必须安装)

import pandas as pd import numpy as np from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats
import pandas as pd import numpy as np from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats

Enrichment (optional, for GO/KEGG/Reactome)

富集分析包(可选,用于GO/KEGG/Reactome分析)

import gseapy as gp
import gseapy as gp

ToolUniverse (optional, for gene annotation)

ToolUniverse(可选,用于基因注释)

from tooluniverse import ToolUniverse

**Installation**:
```bash
pip install pydeseq2 gseapy pandas numpy scipy anndata

from tooluniverse import ToolUniverse

**安装命令**:
```bash
pip install pydeseq2 gseapy pandas numpy scipy anndata

Analysis Workflow

分析流程

Step 1: Question Parsing

步骤1:问题解析

CRITICAL FIRST STEP: Before writing ANY code, parse the question to identify:
  • Data files: Look for
    *counts*.csv
    ,
    *metadata*.csv
    ,
    *.h5ad
  • Thresholds: Extract padj (default 0.05), log2FC (default 0), baseMean (default 0)
  • Design: Identify factors mentioned ("strain", "condition", "batch")
  • Contrast: Determine comparison ("A vs B", "mutant vs wildtype")
  • Direction: Check if "upregulated", "downregulated", or both
  • Enrichment: Look for "GO", "KEGG", "Reactome", "pathway"
  • Specific genes: Check if asking about individual genes
See references/question_parsing.md for detailed parsing patterns.
关键第一步:在编写任何代码之前,先解析问题以明确:
  • 数据文件:查找
    *counts*.csv
    *metadata*.csv
    *.h5ad
    这类文件
  • 阈值:提取padj(默认0.05)、log2FC(默认0)、baseMean(默认0)
  • 实验设计:识别提及的因素(如“菌株”、“条件”、“批次”)
  • 对比组:确定比较对象(如“A vs B”、“突变体 vs 野生型”)
  • 方向:检查是否要求“上调”、“下调”或两者都要
  • 富集分析:查找是否提及“GO”、“KEGG”、“Reactome”、“通路”
  • 特定基因:检查是否询问单个基因的相关信息
详细解析规则请参考references/question_parsing.md

Step 1.5: Design Formula Decision Tree ⚠️ CRITICAL

步骤1.5:设计公式决策树 ⚠️ 关键

ALWAYS inspect metadata for ALL variables, not just what the question mentions!
Many experiments have hidden batch effects (media conditions, sequencing batches, time points) that MUST be included as covariates. Failing to account for these reduces statistical power and can lead to incorrect results.
Decision process:
  1. List ALL metadata columns (don't skip any!)
  2. Categorize each column:
    • Biological interest: The factor you're testing (strain, treatment, genotype, condition)
    • Batch/Block: Systematic covariates (media, batch, sequencing_run, time, plate)
    • Irrelevant: Sample IDs, notes, file names
  3. Design formula:
    • Single factor:
      ~condition
      (only if no batch variables exist)
    • With covariates:
      ~batch1 + batch2 + condition
      (covariates first!)
    • Interaction:
      ~batch + factor1 + factor2 + factor1:factor2
Example (critical real-world case):
Metadata columns: [Strain, Media, Replicate]
Question asks: "strain effects"

❌ WRONG: design="~Strain" (ignores Media!)
✅ CORRECT: design="~Media + Strain" (accounts for media variation)

Why: Even though question doesn't mention media, it systematically
affects expression. DESeq2 must remove media effects to properly test strain.
Rule of thumb: If a column has 2+ levels and represents an experimental condition (not sample ID), include it in design.
务必检查元数据中的所有变量,而不仅仅是问题提及的内容!
许多实验存在隐藏的批次效应(如培养基条件、测序批次、时间点),这些必须作为协变量纳入分析。忽略这些因素会降低统计效力,甚至导致错误结果。
决策流程:
  1. 列出所有元数据列(不要遗漏任何列!)
  2. 对每列进行分类:
    • 生物学关注因素:你要检验的因素(菌株、处理、基因型、条件)
    • 批次/区组:系统性协变量(培养基、批次、测序_run、时间、培养板)
    • 无关项:样本ID、备注、文件名
  3. 设计公式:
    • 单因素:
      ~condition
      (仅当不存在批次变量时使用)
    • 含协变量:
      ~batch1 + batch2 + condition
      (协变量在前!)
    • 交互作用:
      ~batch + factor1 + factor2 + factor1:factor2
示例(关键实际案例):
元数据列: [Strain, Media, Replicate]
用户问题: "菌株效应"

❌ 错误: design="~Strain"(忽略了Media!)
✅ 正确: design="~Media + Strain"(考虑了培养基变异)

原因: 尽管问题未提及培养基,但它会系统性影响基因表达。DESeq2必须先消除培养基效应,才能正确检验菌株的影响。
经验法则:如果某列有2个及以上水平,且代表实验条件(而非样本ID),则应将其纳入设计公式。

Step 2: Data Loading & Validation

步骤2:数据加载与验证

Load count matrix and metadata, then validate alignment:
python
import pandas as pd
from scripts.load_count_matrix import load_count_matrix, validate_inputs
加载计数矩阵和元数据,然后验证对齐情况:
python
import pandas as pd
from scripts.load_count_matrix import load_count_matrix, validate_inputs

Load data

加载数据

counts = load_count_matrix("counts.csv") metadata = pd.read_csv("metadata.csv", index_col=0)
counts = load_count_matrix("counts.csv") metadata = pd.read_csv("metadata.csv", index_col=0)

Validate and align

验证并对齐数据

counts, metadata, issues = validate_inputs(counts, metadata)

**Key considerations**:
- Ensure samples as rows, genes as columns (PyDESeq2 requirement)
- Verify integer counts (round if needed)
- Align sample names between counts and metadata
- Remove zero-count genes

See [references/data_loading.md](references/data_loading.md) for detailed data handling.
counts, metadata, issues = validate_inputs(counts, metadata)

**关键注意事项**:
- 确保计数矩阵方向正确(样本为行,基因为列,符合PyDESeq2要求)
- 验证计数为整数(必要时取整)
- 确保计数矩阵与元数据的样本名称对齐
- 移除零计数基因

详细数据处理方法请参考[references/data_loading.md](references/data_loading.md)。

Step 2.5: Inspect Metadata Structure ⚠️ REQUIRED

步骤2.5:检查元数据结构 ⚠️ 必须执行

Before choosing design formula, ALWAYS inspect metadata to identify all experimental factors:
python
undefined
在选择设计公式之前,务必检查元数据以识别所有实验因素:
python
undefined

Print metadata structure

打印元数据结构

print("Metadata columns and levels:") for col in metadata.columns: unique_vals = metadata[col].unique() print(f" {col}: {len(unique_vals)} levels → {list(unique_vals)[:5]}")
print("元数据列及水平:") for col in metadata.columns: unique_vals = metadata[col].unique() print(f" {col}: {len(unique_vals)}个水平 → {list(unique_vals)[:5]}")

Example output:

示例输出:

Strain: 4 levels → ['1', '97', '98', '99']

Strain: 4个水平 → ['1', '97', '98', '99']

Media: 3 levels → ['MMGluFeMinus', 'MMGluFePlus', 'Succinate']

Media: 3个水平 → ['MMGluFeMinus', 'MMGluFePlus', 'Succinate']

Replicate: 3 levels → ['A', 'B', 'C']

Replicate: 3个水平 → ['A', 'B', 'C']

Decision:

决策:

- Strain: Biological factor (testing)

- Strain: 生物学关注因素(待检验)

- Media: Batch/covariate (MUST include!)

- Media: 批次/协变量(必须纳入!)

- Replicate: Biological replicate (don't include as factor)

- Replicate: 生物学重复(无需作为因素纳入)

Design: ~Media + Strain

设计公式: ~Media + Strain


This step prevents missing hidden batch effects that could invalidate your analysis.

此步骤可避免遗漏可能导致分析结果无效的隐藏批次效应。

Step 3: Run PyDESeq2

步骤3:运行PyDESeq2

Execute differential expression analysis:
python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
执行差异表达分析:
python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

Setup design (set reference level first)

设置设计(先设置参考水平)

metadata['condition'] = pd.Categorical( metadata['condition'], categories=['control', 'treatment'] # First = reference )
metadata['condition'] = pd.Categorical( metadata['condition'], categories=['control', 'treatment'] # 第一个为参考组 )

Run DESeq2

运行DESeq2

dds = DeseqDataSet( counts=counts, metadata=metadata, design="~condition", quiet=True ) dds.deseq2()
dds = DeseqDataSet( counts=counts, metadata=metadata, design="~condition", quiet=True ) dds.deseq2()

Extract results

提取结果

stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control'], quiet=True) stat_res.run_wald_test() stat_res.summary()
stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control'], quiet=True) stat_res.run_wald_test() stat_res.summary()

Apply LFC shrinkage (if needed)

应用LFC收缩(如有需要)

stat_res.lfc_shrink(coeff='condition[T.treatment]')
results = stat_res.results_df
undefined
stat_res.lfc_shrink(coeff='condition[T.treatment]')
results = stat_res.results_df
undefined

Multi-Factor Design (Common Real-World Case)

多因素设计(常见实际场景)

When metadata has multiple experimental variables (e.g., strain + media conditions), ALWAYS include covariates:
python
undefined
当元数据包含多个实验变量(如菌株 + 培养基条件)时,务必纳入协变量:
python
undefined

Inspect metadata (from Step 2.5) showed:

步骤2.5的元数据检查显示:

- Strain: 4 levels (factor of interest)

- Strain: 4个水平(关注因素)

- Media: 3 levels (batch effect - MUST include!)

- Media: 3个水平(批次效应 - 必须纳入!)

Set reference levels for BOTH factors

为两个因素设置参考水平

metadata['media'] = pd.Categorical( metadata['media'], categories=['MMGluFeMinus', 'MMGluFePlus', 'Succinate'] # First = reference ) metadata['strain'] = pd.Categorical( metadata['strain'], categories=['1', '97', '98', '99'] # First = reference (JBX1) )
metadata['media'] = pd.Categorical( metadata['media'], categories=['MMGluFeMinus', 'MMGluFePlus', 'Succinate'] # 第一个为参考组 ) metadata['strain'] = pd.Categorical( metadata['strain'], categories=['1', '97', '98', '99'] # 第一个为参考组(JBX1) )

Include covariate in design formula

在设计公式中纳入协变量

dds = DeseqDataSet( counts=counts, metadata=metadata, design="~media + strain", # Covariate first, then factor! quiet=True ) dds.deseq2()
dds = DeseqDataSet( counts=counts, metadata=metadata, design="~media + strain", # 协变量在前,关注因素在后! quiet=True ) dds.deseq2()

Extract strain effect (controlling for media)

提取菌株效应(控制培养基影响)

stat_res = DeseqStats(dds, contrast=['strain', '98', '1'], quiet=True) stat_res.run_wald_test() stat_res.summary() results = stat_res.results_df

**Why this matters**: DESeq2 will model media effects separately, removing media-driven variance before testing strain differences. This increases statistical power and prevents false positives/negatives.

**When to use what**:
- **Python (PyDESeq2)**: Use for ALL DESeq2 analysis (normalization, testing, filtering)
- **ToolUniverse**: Use ONLY for gene annotation (ID conversion, pathway context)
- **gseapy**: Use for enrichment analysis (GO/KEGG/Reactome)

See [references/pydeseq2_workflow.md](references/pydeseq2_workflow.md) for complete PyDESeq2 patterns including batch effects, interaction terms, and complex designs.
stat_res = DeseqStats(dds, contrast=['strain', '98', '1'], quiet=True) stat_res.run_wald_test() stat_res.summary() results = stat_res.results_df

**重要性**:DESeq2会单独建模培养基效应,在检验菌株差异之前消除培养基驱动的变异。这能提高统计效力,避免假阳性/假阴性结果。

**工具选择规则**:
- **Python (PyDESeq2)**:用于所有DESeq2分析(标准化、检验、过滤)
- **ToolUniverse**:仅用于基因注释(ID转换、通路上下文)
- **gseapy**:用于富集分析(GO/KEGG/Reactome)

完整PyDESeq2流程(包括批次效应、交互项、复杂设计)请参考[references/pydeseq2_workflow.md](references/pydeseq2_workflow.md)。

Step 4: Filter Results

步骤4:结果过滤

Apply thresholds from the question:
python
undefined
应用问题中指定的阈值:
python
undefined

Filter DEGs

筛选差异表达基因

sig_genes = results[ (results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5) & (results['baseMean'] > 10) ]
sig_genes = results[ (results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5) & (results['baseMean'] > 10) ]

Direction-specific filtering

按方向筛选

up_genes = sig_genes[sig_genes['log2FoldChange'] > 0] down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]

**Common filtering patterns**:
- Basic DEG count: `len(sig_genes)`
- Specific gene value: `results.loc['GENE_NAME', 'log2FoldChange']`
- Set operations: Use Python sets for "unique", "shared", "overlap"

See [references/result_filtering.md](references/result_filtering.md) for advanced filtering.
up_genes = sig_genes[sig_genes['log2FoldChange'] > 0] down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]

**常见过滤场景**:
- 基础差异表达基因计数: `len(sig_genes)`
- 特定基因的数值: `results.loc['GENE_NAME', 'log2FoldChange']`
- 集合操作: 使用Python集合进行“唯一”、“共有”、“重叠”分析

高级过滤方法请参考[references/result_filtering.md](references/result_filtering.md)。

Step 5: Dispersion Analysis (if asked)

步骤5:离散度分析(如有需求)

Access dispersion estimates:
python
undefined
获取离散度估计值:
python
undefined

Get dispersion data

获取离散度数据

disp_data = dds.var # Dispersions stored here
disp_data = dds.var # 离散度存储于此

Common question: "genes below threshold prior to fitting"

常见问题: "拟合前离散度低于阈值的基因数量"

genewise = dds.var['genewise_dispersions'] count_below = (genewise < 1e-5).sum()

**Dispersion column mapping**:
- "prior to fitting" → `genewise_dispersions`
- "fitted dispersions" → `fitted_dispersions`
- "after shrinkage" / "MAP" → `MAP_dispersions`
- "final" → `dispersions`

See [references/dispersion_analysis.md](references/dispersion_analysis.md) for diagnostics.
genewise = dds.var['genewise_dispersions'] count_below = (genewise < 1e-5).sum()

**离散度列映射**:
- "拟合前" → `genewise_dispersions`
- "拟合后离散度" → `fitted_dispersions`
- "收缩后" / "MAP" → `MAP_dispersions`
- "最终离散度" → `dispersions`

离散度诊断分析请参考[references/dispersion_analysis.md](references/dispersion_analysis.md)。

Step 6: Enrichment Analysis (optional)

步骤6:富集分析(可选)

Run pathway enrichment on DEGs:
python
import gseapy as gp
对差异表达基因进行通路富集分析:
python
import gseapy as gp

Prepare gene list

准备基因列表

gene_list = sig_genes.index.tolist()
gene_list = sig_genes.index.tolist()

Run enrichment

运行富集分析

enr = gp.enrich( gene_list=gene_list, gene_sets='GO_Biological_Process_2023', # or KEGG, Reactome background=None, # or specify background outdir=None, cutoff=0.05, no_plot=True, verbose=False )
enr = gp.enrich( gene_list=gene_list, gene_sets='GO_Biological_Process_2023', # 或KEGG、Reactome background=None, # 或指定背景基因集 outdir=None, cutoff=0.05, no_plot=True, verbose=False )

Extract results

提取结果

top_pathways = enr.results.head(10)

**Library selection**:
- Human GO: `GO_Biological_Process_2023`
- Mouse KEGG: `KEGG_2019_Mouse`
- Human KEGG: `KEGG_2021_Human`
- Reactome: `Reactome_2022`

See [references/enrichment_analysis.md](references/enrichment_analysis.md) for complete enrichment workflows.
top_pathways = enr.results.head(10)

**数据库选择**:
- 人类GO: `GO_Biological_Process_2023`
- 小鼠KEGG: `KEGG_2019_Mouse`
- 人类KEGG: `KEGG_2021_Human`
- Reactome: `Reactome_2022`

完整富集分析流程请参考[references/enrichment_analysis.md](references/enrichment_analysis.md)。

Step 7: Gene Annotation with ToolUniverse (optional)

步骤7:使用ToolUniverse进行基因注释(可选)

Use ToolUniverse ONLY for gene annotation, not analysis:
python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()
仅使用ToolUniverse进行基因注释,不用于分析:
python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

Gene ID conversion

基因ID转换

result = tu.tools.MyGene_query_genes(query="TP53")
result = tu.tools.MyGene_query_genes(query="TP53")

Gene details

基因详细信息查询

result = tu.tools.ensembl_lookup_gene( gene_id="ENSG00000141510", species="homo_sapiens" )

**Do NOT use ToolUniverse for**:
- Differential expression (use PyDESeq2)
- Statistical testing (use scipy.stats)
- Enrichment analysis (use gseapy)

---
result = tu.tools.ensembl_lookup_gene( gene_id="ENSG00000141510", species="homo_sapiens" )

**请勿使用ToolUniverse进行以下操作**:
- 差异表达分析(请使用PyDESeq2)
- 统计检验(请使用scipy.stats)
- 富集分析(请使用gseapy)

---

Output Formatting

输出格式

Match the question's requested format:
python
undefined
匹配问题要求的格式:
python
undefined

Numeric precision

数值精度

round(value, 2) # "2 decimal points" f"{value:.2E}" # "scientific notation"
round(value, 2) # "保留2位小数" f"{value:.2E}" # "科学计数法"

Percentages

百分比

f"{value * 100:.1f}%" # "as percentage"
f"{value * 100:.1f}%" # "百分比格式"

Counts (no decimals)

计数(无小数)

int(len(sig_genes)) # "how many genes"

See [references/output_formatting.md](references/output_formatting.md) for all format patterns.

---
int(len(sig_genes)) # "基因数量"

所有格式规则请参考[references/output_formatting.md](references/output_formatting.md)。

---

Common BixBench Patterns

常见BixBench模式

Pattern 1: Basic DEG Count

模式1:基础差异表达基因计数

Question: "How many genes show significant DE (padj < 0.05, |log2FC| > 0.5)?"
python
degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5)]
answer = len(degs)
问题: "有多少基因显示显著差异表达(padj < 0.05,|log2FC| > 0.5)?"
python
degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5)]
answer = len(degs)

Pattern 2: Specific Gene Value

模式2:特定基因数值查询

Question: "What is the log2FC of gene X?"
python
answer = round(results.loc['GENE_X', 'log2FoldChange'], 2)
问题: "基因X的log2FC是多少?"
python
answer = round(results.loc['GENE_X', 'log2FoldChange'], 2)

Pattern 3: Direction-Specific

模式3:按方向筛选

Question: "How many genes are upregulated?"
python
up_degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'] > 0)]
answer = len(up_degs)
问题: "有多少基因上调?"
python
up_degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'] > 0)]
answer = len(up_degs)

Pattern 4: Set Operations

模式4:集合操作

Question: "How many genes are uniquely DE in condition A?"
python
degs_A = set(results_A[results_A['padj'] < 0.05].index)
degs_B = set(results_B[results_B['padj'] < 0.05].index)
unique_A = degs_A - degs_B
answer = len(unique_A)
问题: "条件A中独有的差异表达基因有多少?"
python
degs_A = set(results_A[results_A['padj'] < 0.05].index)
degs_B = set(results_B[results_B['padj'] < 0.05].index)
unique_A = degs_A - degs_B
answer = len(unique_A)

Pattern 5: Dispersion Count

模式5:离散度计数

Question: "How many genes have dispersion below 1e-5 prior to fitting?"
python
genewise = dds.var['genewise_dispersions']
answer = (genewise < 1e-5).sum()
See references/bixbench_examples.md for all 10 patterns with examples.

问题: "拟合前离散度低于1e-5的基因有多少?"
python
genewise = dds.var['genewise_dispersions']
answer = (genewise < 1e-5).sum()
所有10种模式及示例请参考references/bixbench_examples.md

Error Handling

错误处理

ErrorSolution
"No matching samples"Check if counts need transposing; strip whitespace
"Dispersion trend did not converge"Use
fit_type='mean'
"Contrast not found"Check
metadata['factor'].unique()
for exact names
"Non-integer counts"Round to integers OR use t-test for normalized data
"NaN in padj"Independent filtering removed genes; exclude from counts
See references/troubleshooting.md for complete debugging guide.

错误类型解决方案
"无匹配样本"检查计数矩阵是否需要转置;移除样本名称中的空格
"离散度趋势未收敛"使用
fit_type='mean'
参数
"未找到对比组"检查
metadata['factor'].unique()
中的名称是否完全匹配
"非整数计数"取整为整数,或对标准化数据使用t检验
"padj中存在NaN"独立过滤已移除部分基因;将这些基因从计数矩阵中排除
完整调试指南请参考references/troubleshooting.md

Validation Checklist

验证清单

Every analysis MUST include:
Data Loading:
  • Count matrix oriented correctly (samples as rows, genes as columns)
  • Metadata aligned with counts
  • Integer counts validated
DESeq2 Analysis:
  • Design formula matches question
  • Reference level set correctly (first in Categorical)
  • Correct contrast extracted
Results:
  • Thresholds match question exactly
  • Direction filter applied if specified
  • Answer formatted correctly (decimal places, notation)
Quality Checks:
  • DEG count is reasonable
  • P-values between 0 and 1
  • Log2FC values are finite

每次分析必须包含以下检查:
数据加载:
  • 计数矩阵方向正确(样本为行,基因为列)
  • 元数据与计数矩阵样本对齐
  • 已验证计数为整数
DESeq2分析:
  • 设计公式与问题需求匹配
  • 参考水平设置正确(Categorical中的第一个值)
  • 提取了正确的对比组
结果:
  • 阈值与问题要求完全一致
  • 按需应用了方向筛选
  • 答案格式正确(小数位数、计数法)
质量检查:
  • 差异表达基因数量合理
  • P值在0到1之间
  • Log2FC值为有限数值

Known Limitations

已知限制

PyDESeq2 vs R DESeq2 Differences

PyDESeq2与R DESeq2的差异

This skill uses PyDESeq2 (Python implementation) for differential expression analysis. While PyDESeq2 faithfully implements the DESeq2 algorithm, numerical differences exist between Python and R implementations:
Dispersion Estimation:
  • Dispersion estimates may differ from R DESeq2, especially for very low dispersion genes (< 1e-05)
  • This is due to different numerical optimization methods between Python and R statistical libraries
  • Results are still statistically valid and biologically meaningful
  • If you need exact R DESeq2 dispersions for benchmark reproducibility, consider using R DESeq2 directly via rpy2
For Most Analyses: PyDESeq2 provides accurate, publication-quality results. The differences are in numerical precision, not statistical validity.
本技能使用PyDESeq2(Python实现)进行差异表达分析。尽管PyDESeq2忠实实现了DESeq2算法,但Python与R版本之间存在数值差异:
离散度估计:
  • 离散度估计值可能与R DESeq2不同,尤其是离散度极低的基因(< 1e-05)
  • 这是由于Python和R统计库使用的数值优化方法不同
  • 结果仍然具有统计有效性和生物学意义
  • 如果需要与R DESeq2完全一致的离散度估计(用于基准复现),可考虑通过rpy2直接使用R DESeq2
对于大多数分析:PyDESeq2可提供准确的、可用于发表的结果。差异仅存在于数值精度层面,不影响统计有效性。

GO/KEGG Enrichment (gseapy vs clusterProfiler)

GO/KEGG富集分析(gseapy vs clusterProfiler)

Enrichment analysis uses Python gseapy by default. Results may differ from R clusterProfiler due to:
  • Different enrichment algorithms and statistical tests
  • Different GO/KEGG database versions
  • Different term simplification/redundancy removal methods
For R clusterProfiler Compatibility: If you need results matching R clusterProfiler exactly (e.g., for benchmark reproducibility):
python
undefined
默认使用Python gseapy进行富集分析。结果可能与R clusterProfiler不同,原因如下:
  • 使用的富集算法和统计检验不同
  • GO/KEGG数据库版本不同
  • 术语简化/去冗余方法不同
与R clusterProfiler兼容: 如果需要与R clusterProfiler完全一致的结果(如用于基准复现):
python
undefined

Install rpy2 and R packages first:

先安装rpy2和R包:

pip install rpy2

pip install rpy2

In R: install.packages(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))

在R中: install.packages(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))

from rpy2.robjects.packages import importr from rpy2.robjects import pandas2ri, vectors pandas2ri.activate()
from rpy2.robjects.packages import importr from rpy2.robjects import pandas2ri, vectors pandas2ri.activate()

Load R packages

加载R包

clusterProfiler = importr('clusterProfiler') orgdb = importr('org.Hs.eg.db')
clusterProfiler = importr('clusterProfiler') orgdb = importr('org.Hs.eg.db')

Convert Python gene list to R vector

将Python基因列表转换为R向量

gene_list = sig_genes.index.tolist() r_genes = vectors.StrVector(gene_list)
gene_list = sig_genes.index.tolist() r_genes = vectors.StrVector(gene_list)

Run enrichGO (R's clusterProfiler)

运行enrichGO(R的clusterProfiler)

enrich_result = clusterProfiler.enrichGO( gene=r_genes, OrgDb=orgdb.org_Hs_eg_db, ont='BP', # Biological Process pAdjustMethod='BH', pvalueCutoff=0.05, qvalueCutoff=0.2 )
enrich_result = clusterProfiler.enrichGO( gene=r_genes, OrgDb=orgdb.org_Hs_eg_db, ont='BP', # 生物过程 pAdjustMethod='BH', pvalueCutoff=0.05, qvalueCutoff=0.2 )

Apply simplify to remove redundant terms

应用简化以移除冗余术语

simplified = clusterProfiler.simplify(enrich_result, cutoff=0.7, by='p.adjust')
simplified = clusterProfiler.simplify(enrich_result, cutoff=0.7, by='p.adjust')

Convert R result back to pandas

将R结果转换回pandas DataFrame

results_df = pandas2ri.rpy2py(simplified)

**Default behavior**: Uses gseapy (no R dependencies required). This is sufficient for most analyses and provides valid enrichment results.

**When to use R clusterProfiler**:
- When exact reproducibility with R-based benchmarks is required
- When collaborating with R users who need identical results
- When specific clusterProfiler features (GSEA, compareCluster) are needed

---
results_df = pandas2ri.rpy2py(simplified)

**默认行为**:使用gseapy(无需依赖R)。这足以满足大多数分析需求,并能提供有效的富集结果。

**何时使用R clusterProfiler**:
- 需要与基于R的基准结果完全复现
- 与R用户协作,需要完全一致的结果
- 需要使用clusterProfiler的特定功能(如GSEA、compareCluster)

---

References

参考文档

Detailed documentation for advanced topics:
  • question_parsing.md - Extract parameters from questions
  • data_loading.md - Data loading and validation patterns
  • pydeseq2_workflow.md - Complete PyDESeq2 code examples
  • result_filtering.md - Advanced filtering and extraction
  • dispersion_analysis.md - Dispersion diagnostics
  • enrichment_analysis.md - GO/KEGG/Reactome workflows
  • output_formatting.md - Format answers correctly
  • bixbench_examples.md - All 10 question patterns
  • troubleshooting.md - Common issues and debugging
高级主题的详细文档:
  • question_parsing.md - 从问题中提取参数
  • data_loading.md - 数据加载与验证模式
  • pydeseq2_workflow.md - 完整PyDESeq2代码示例
  • result_filtering.md - 高级过滤与提取方法
  • dispersion_analysis.md - 离散度诊断分析
  • enrichment_analysis.md - GO/KEGG/Reactome分析流程
  • output_formatting.md - 结果格式规范
  • bixbench_examples.md - 所有10种问题模式
  • troubleshooting.md - 常见问题与调试指南

Utility Scripts

实用脚本

Helper scripts for common tasks:
  • format_deseq2_output.py - Output formatters
  • load_count_matrix.py - Data loading utilities
常见任务的辅助脚本:
  • format_deseq2_output.py - 结果格式转换工具
  • load_count_matrix.py - 数据加载工具