tooluniverse-rnaseq-deseq2
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseRNA-seq Differential Expression Analysis (DESeq2)
RNA-seq差异表达分析(DESeq2)
Comprehensive differential expression analysis of RNA-seq count data using PyDESeq2, with integrated enrichment analysis (gseapy) and gene annotation via ToolUniverse.
BixBench Coverage: Validated on 53 BixBench questions across 15 computational biology projects covering RNA-seq, miRNA-seq, and differential expression analysis tasks.
使用PyDESeq2对RNA-seq计数数据进行全面的差异表达分析,并集成了富集分析(gseapy)和通过ToolUniverse进行的基因注释功能。
BixBench覆盖范围:已在15个计算生物学项目的53个BixBench问题上验证,涵盖RNA-seq、miRNA-seq和差异表达分析任务。
Core Principles
核心原则
- Data-first approach - Load and validate count data and metadata BEFORE any analysis
- Statistical rigor - Always use proper normalization, dispersion estimation, and multiple testing correction
- Flexible design - Support single-factor, multi-factor, and interaction designs
- Threshold awareness - Apply user-specified thresholds exactly (padj, log2FC, baseMean)
- Reproducible - Set random seeds, document all parameters, output complete results
- Question-driven - Parse what the user is actually asking and extract the specific answer
- Enrichment integration - Chain DESeq2 results into pathway/GO enrichment when requested
- English-first queries - Use English gene/pathway names in all tool calls
- 数据优先方法 - 在任何分析之前加载并验证计数数据和元数据
- 统计严谨性 - 始终使用正确的标准化、离散度估计和多重检验校正
- 灵活设计 - 支持单因素、多因素和交互作用设计
- 阈值感知 - 严格应用用户指定的阈值(padj、log2FC、baseMean)
- 可复现性 - 设置随机种子、记录所有参数、输出完整结果
- 问题驱动 - 解析用户实际需求并提取特定答案
- 富集分析集成 - 按需将DESeq2结果关联到通路/GO富集分析
- 英文优先查询 - 在所有工具调用中使用英文基因/通路名称
When to Use This Skill
适用场景
Apply when users:
- Have RNA-seq count matrices and want differential expression analysis
- Ask about DESeq2, DEGs, differential expression, padj, log2FC
- Need dispersion estimates or diagnostics
- Want enrichment analysis (GO, KEGG, Reactome) on DEGs
- Ask about specific gene expression changes between conditions
- Need to compare multiple strains/conditions/treatments
- Ask about batch effect correction in RNA-seq
- Questions mention "count data", "count matrix", "RNA-seq", "transcriptomics"
当用户有以下需求时适用:
- 拥有RNA-seq计数矩阵并需要进行差异表达分析
- 询问关于DESeq2、DEGs、差异表达、padj、log2FC的相关问题
- 需要离散度估计或诊断分析
- 希望对DEGs进行富集分析(GO、KEGG、Reactome)
- 询问不同条件间特定基因的表达变化
- 需要比较多个菌株/条件/处理组
- 询问RNA-seq中的批次效应校正方法
- 问题中提及“计数数据”、“计数矩阵”、“RNA-seq”、“转录组学”
Required Packages
所需包
python
undefinedpython
undefinedCore (MUST be installed)
核心包(必须安装)
import pandas as pd
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import pandas as pd
import numpy as np
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
Enrichment (optional, for GO/KEGG/Reactome)
富集分析包(可选,用于GO/KEGG/Reactome分析)
import gseapy as gp
import gseapy as gp
ToolUniverse (optional, for gene annotation)
ToolUniverse(可选,用于基因注释)
from tooluniverse import ToolUniverse
**Installation**:
```bash
pip install pydeseq2 gseapy pandas numpy scipy anndatafrom tooluniverse import ToolUniverse
**安装命令**:
```bash
pip install pydeseq2 gseapy pandas numpy scipy anndataAnalysis Workflow
分析流程
Step 1: Question Parsing
步骤1:问题解析
CRITICAL FIRST STEP: Before writing ANY code, parse the question to identify:
- Data files: Look for ,
*counts*.csv,*metadata*.csv*.h5ad - Thresholds: Extract padj (default 0.05), log2FC (default 0), baseMean (default 0)
- Design: Identify factors mentioned ("strain", "condition", "batch")
- Contrast: Determine comparison ("A vs B", "mutant vs wildtype")
- Direction: Check if "upregulated", "downregulated", or both
- Enrichment: Look for "GO", "KEGG", "Reactome", "pathway"
- Specific genes: Check if asking about individual genes
See references/question_parsing.md for detailed parsing patterns.
关键第一步:在编写任何代码之前,先解析问题以明确:
- 数据文件:查找、
*counts*.csv、*metadata*.csv这类文件*.h5ad - 阈值:提取padj(默认0.05)、log2FC(默认0)、baseMean(默认0)
- 实验设计:识别提及的因素(如“菌株”、“条件”、“批次”)
- 对比组:确定比较对象(如“A vs B”、“突变体 vs 野生型”)
- 方向:检查是否要求“上调”、“下调”或两者都要
- 富集分析:查找是否提及“GO”、“KEGG”、“Reactome”、“通路”
- 特定基因:检查是否询问单个基因的相关信息
详细解析规则请参考references/question_parsing.md。
Step 1.5: Design Formula Decision Tree ⚠️ CRITICAL
步骤1.5:设计公式决策树 ⚠️ 关键
ALWAYS inspect metadata for ALL variables, not just what the question mentions!
Many experiments have hidden batch effects (media conditions, sequencing batches, time points) that MUST be included as covariates. Failing to account for these reduces statistical power and can lead to incorrect results.
Decision process:
- List ALL metadata columns (don't skip any!)
- Categorize each column:
- Biological interest: The factor you're testing (strain, treatment, genotype, condition)
- Batch/Block: Systematic covariates (media, batch, sequencing_run, time, plate)
- Irrelevant: Sample IDs, notes, file names
- Design formula:
- Single factor: (only if no batch variables exist)
~condition - With covariates: (covariates first!)
~batch1 + batch2 + condition - Interaction:
~batch + factor1 + factor2 + factor1:factor2
- Single factor:
Example (critical real-world case):
Metadata columns: [Strain, Media, Replicate]
Question asks: "strain effects"
❌ WRONG: design="~Strain" (ignores Media!)
✅ CORRECT: design="~Media + Strain" (accounts for media variation)
Why: Even though question doesn't mention media, it systematically
affects expression. DESeq2 must remove media effects to properly test strain.Rule of thumb: If a column has 2+ levels and represents an experimental condition (not sample ID), include it in design.
务必检查元数据中的所有变量,而不仅仅是问题提及的内容!
许多实验存在隐藏的批次效应(如培养基条件、测序批次、时间点),这些必须作为协变量纳入分析。忽略这些因素会降低统计效力,甚至导致错误结果。
决策流程:
- 列出所有元数据列(不要遗漏任何列!)
- 对每列进行分类:
- 生物学关注因素:你要检验的因素(菌株、处理、基因型、条件)
- 批次/区组:系统性协变量(培养基、批次、测序_run、时间、培养板)
- 无关项:样本ID、备注、文件名
- 设计公式:
- 单因素:(仅当不存在批次变量时使用)
~condition - 含协变量:(协变量在前!)
~batch1 + batch2 + condition - 交互作用:
~batch + factor1 + factor2 + factor1:factor2
- 单因素:
示例(关键实际案例):
元数据列: [Strain, Media, Replicate]
用户问题: "菌株效应"
❌ 错误: design="~Strain"(忽略了Media!)
✅ 正确: design="~Media + Strain"(考虑了培养基变异)
原因: 尽管问题未提及培养基,但它会系统性影响基因表达。DESeq2必须先消除培养基效应,才能正确检验菌株的影响。经验法则:如果某列有2个及以上水平,且代表实验条件(而非样本ID),则应将其纳入设计公式。
Step 2: Data Loading & Validation
步骤2:数据加载与验证
Load count matrix and metadata, then validate alignment:
python
import pandas as pd
from scripts.load_count_matrix import load_count_matrix, validate_inputs加载计数矩阵和元数据,然后验证对齐情况:
python
import pandas as pd
from scripts.load_count_matrix import load_count_matrix, validate_inputsLoad data
加载数据
counts = load_count_matrix("counts.csv")
metadata = pd.read_csv("metadata.csv", index_col=0)
counts = load_count_matrix("counts.csv")
metadata = pd.read_csv("metadata.csv", index_col=0)
Validate and align
验证并对齐数据
counts, metadata, issues = validate_inputs(counts, metadata)
**Key considerations**:
- Ensure samples as rows, genes as columns (PyDESeq2 requirement)
- Verify integer counts (round if needed)
- Align sample names between counts and metadata
- Remove zero-count genes
See [references/data_loading.md](references/data_loading.md) for detailed data handling.counts, metadata, issues = validate_inputs(counts, metadata)
**关键注意事项**:
- 确保计数矩阵方向正确(样本为行,基因为列,符合PyDESeq2要求)
- 验证计数为整数(必要时取整)
- 确保计数矩阵与元数据的样本名称对齐
- 移除零计数基因
详细数据处理方法请参考[references/data_loading.md](references/data_loading.md)。Step 2.5: Inspect Metadata Structure ⚠️ REQUIRED
步骤2.5:检查元数据结构 ⚠️ 必须执行
Before choosing design formula, ALWAYS inspect metadata to identify all experimental factors:
python
undefined在选择设计公式之前,务必检查元数据以识别所有实验因素:
python
undefinedPrint metadata structure
打印元数据结构
print("Metadata columns and levels:")
for col in metadata.columns:
unique_vals = metadata[col].unique()
print(f" {col}: {len(unique_vals)} levels → {list(unique_vals)[:5]}")
print("元数据列及水平:")
for col in metadata.columns:
unique_vals = metadata[col].unique()
print(f" {col}: {len(unique_vals)}个水平 → {list(unique_vals)[:5]}")
Example output:
示例输出:
Strain: 4 levels → ['1', '97', '98', '99']
Strain: 4个水平 → ['1', '97', '98', '99']
Media: 3 levels → ['MMGluFeMinus', 'MMGluFePlus', 'Succinate']
Media: 3个水平 → ['MMGluFeMinus', 'MMGluFePlus', 'Succinate']
Replicate: 3 levels → ['A', 'B', 'C']
Replicate: 3个水平 → ['A', 'B', 'C']
Decision:
决策:
- Strain: Biological factor (testing)
- Strain: 生物学关注因素(待检验)
- Media: Batch/covariate (MUST include!)
- Media: 批次/协变量(必须纳入!)
- Replicate: Biological replicate (don't include as factor)
- Replicate: 生物学重复(无需作为因素纳入)
Design: ~Media + Strain
设计公式: ~Media + Strain
This step prevents missing hidden batch effects that could invalidate your analysis.
此步骤可避免遗漏可能导致分析结果无效的隐藏批次效应。Step 3: Run PyDESeq2
步骤3:运行PyDESeq2
Execute differential expression analysis:
python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats执行差异表达分析:
python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStatsSetup design (set reference level first)
设置设计(先设置参考水平)
metadata['condition'] = pd.Categorical(
metadata['condition'],
categories=['control', 'treatment'] # First = reference
)
metadata['condition'] = pd.Categorical(
metadata['condition'],
categories=['control', 'treatment'] # 第一个为参考组
)
Run DESeq2
运行DESeq2
dds = DeseqDataSet(
counts=counts,
metadata=metadata,
design="~condition",
quiet=True
)
dds.deseq2()
dds = DeseqDataSet(
counts=counts,
metadata=metadata,
design="~condition",
quiet=True
)
dds.deseq2()
Extract results
提取结果
stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
Apply LFC shrinkage (if needed)
应用LFC收缩(如有需要)
stat_res.lfc_shrink(coeff='condition[T.treatment]')
results = stat_res.results_df
undefinedstat_res.lfc_shrink(coeff='condition[T.treatment]')
results = stat_res.results_df
undefinedMulti-Factor Design (Common Real-World Case)
多因素设计(常见实际场景)
When metadata has multiple experimental variables (e.g., strain + media conditions), ALWAYS include covariates:
python
undefined当元数据包含多个实验变量(如菌株 + 培养基条件)时,务必纳入协变量:
python
undefinedInspect metadata (from Step 2.5) showed:
步骤2.5的元数据检查显示:
- Strain: 4 levels (factor of interest)
- Strain: 4个水平(关注因素)
- Media: 3 levels (batch effect - MUST include!)
- Media: 3个水平(批次效应 - 必须纳入!)
Set reference levels for BOTH factors
为两个因素设置参考水平
metadata['media'] = pd.Categorical(
metadata['media'],
categories=['MMGluFeMinus', 'MMGluFePlus', 'Succinate'] # First = reference
)
metadata['strain'] = pd.Categorical(
metadata['strain'],
categories=['1', '97', '98', '99'] # First = reference (JBX1)
)
metadata['media'] = pd.Categorical(
metadata['media'],
categories=['MMGluFeMinus', 'MMGluFePlus', 'Succinate'] # 第一个为参考组
)
metadata['strain'] = pd.Categorical(
metadata['strain'],
categories=['1', '97', '98', '99'] # 第一个为参考组(JBX1)
)
Include covariate in design formula
在设计公式中纳入协变量
dds = DeseqDataSet(
counts=counts,
metadata=metadata,
design="~media + strain", # Covariate first, then factor!
quiet=True
)
dds.deseq2()
dds = DeseqDataSet(
counts=counts,
metadata=metadata,
design="~media + strain", # 协变量在前,关注因素在后!
quiet=True
)
dds.deseq2()
Extract strain effect (controlling for media)
提取菌株效应(控制培养基影响)
stat_res = DeseqStats(dds, contrast=['strain', '98', '1'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
results = stat_res.results_df
**Why this matters**: DESeq2 will model media effects separately, removing media-driven variance before testing strain differences. This increases statistical power and prevents false positives/negatives.
**When to use what**:
- **Python (PyDESeq2)**: Use for ALL DESeq2 analysis (normalization, testing, filtering)
- **ToolUniverse**: Use ONLY for gene annotation (ID conversion, pathway context)
- **gseapy**: Use for enrichment analysis (GO/KEGG/Reactome)
See [references/pydeseq2_workflow.md](references/pydeseq2_workflow.md) for complete PyDESeq2 patterns including batch effects, interaction terms, and complex designs.stat_res = DeseqStats(dds, contrast=['strain', '98', '1'], quiet=True)
stat_res.run_wald_test()
stat_res.summary()
results = stat_res.results_df
**重要性**:DESeq2会单独建模培养基效应,在检验菌株差异之前消除培养基驱动的变异。这能提高统计效力,避免假阳性/假阴性结果。
**工具选择规则**:
- **Python (PyDESeq2)**:用于所有DESeq2分析(标准化、检验、过滤)
- **ToolUniverse**:仅用于基因注释(ID转换、通路上下文)
- **gseapy**:用于富集分析(GO/KEGG/Reactome)
完整PyDESeq2流程(包括批次效应、交互项、复杂设计)请参考[references/pydeseq2_workflow.md](references/pydeseq2_workflow.md)。Step 4: Filter Results
步骤4:结果过滤
Apply thresholds from the question:
python
undefined应用问题中指定的阈值:
python
undefinedFilter DEGs
筛选差异表达基因
sig_genes = results[
(results['padj'] < 0.05) &
(results['log2FoldChange'].abs() > 0.5) &
(results['baseMean'] > 10)
]
sig_genes = results[
(results['padj'] < 0.05) &
(results['log2FoldChange'].abs() > 0.5) &
(results['baseMean'] > 10)
]
Direction-specific filtering
按方向筛选
up_genes = sig_genes[sig_genes['log2FoldChange'] > 0]
down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]
**Common filtering patterns**:
- Basic DEG count: `len(sig_genes)`
- Specific gene value: `results.loc['GENE_NAME', 'log2FoldChange']`
- Set operations: Use Python sets for "unique", "shared", "overlap"
See [references/result_filtering.md](references/result_filtering.md) for advanced filtering.up_genes = sig_genes[sig_genes['log2FoldChange'] > 0]
down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]
**常见过滤场景**:
- 基础差异表达基因计数: `len(sig_genes)`
- 特定基因的数值: `results.loc['GENE_NAME', 'log2FoldChange']`
- 集合操作: 使用Python集合进行“唯一”、“共有”、“重叠”分析
高级过滤方法请参考[references/result_filtering.md](references/result_filtering.md)。Step 5: Dispersion Analysis (if asked)
步骤5:离散度分析(如有需求)
Access dispersion estimates:
python
undefined获取离散度估计值:
python
undefinedGet dispersion data
获取离散度数据
disp_data = dds.var # Dispersions stored here
disp_data = dds.var # 离散度存储于此
Common question: "genes below threshold prior to fitting"
常见问题: "拟合前离散度低于阈值的基因数量"
genewise = dds.var['genewise_dispersions']
count_below = (genewise < 1e-5).sum()
**Dispersion column mapping**:
- "prior to fitting" → `genewise_dispersions`
- "fitted dispersions" → `fitted_dispersions`
- "after shrinkage" / "MAP" → `MAP_dispersions`
- "final" → `dispersions`
See [references/dispersion_analysis.md](references/dispersion_analysis.md) for diagnostics.genewise = dds.var['genewise_dispersions']
count_below = (genewise < 1e-5).sum()
**离散度列映射**:
- "拟合前" → `genewise_dispersions`
- "拟合后离散度" → `fitted_dispersions`
- "收缩后" / "MAP" → `MAP_dispersions`
- "最终离散度" → `dispersions`
离散度诊断分析请参考[references/dispersion_analysis.md](references/dispersion_analysis.md)。Step 6: Enrichment Analysis (optional)
步骤6:富集分析(可选)
Run pathway enrichment on DEGs:
python
import gseapy as gp对差异表达基因进行通路富集分析:
python
import gseapy as gpPrepare gene list
准备基因列表
gene_list = sig_genes.index.tolist()
gene_list = sig_genes.index.tolist()
Run enrichment
运行富集分析
enr = gp.enrich(
gene_list=gene_list,
gene_sets='GO_Biological_Process_2023', # or KEGG, Reactome
background=None, # or specify background
outdir=None,
cutoff=0.05,
no_plot=True,
verbose=False
)
enr = gp.enrich(
gene_list=gene_list,
gene_sets='GO_Biological_Process_2023', # 或KEGG、Reactome
background=None, # 或指定背景基因集
outdir=None,
cutoff=0.05,
no_plot=True,
verbose=False
)
Extract results
提取结果
top_pathways = enr.results.head(10)
**Library selection**:
- Human GO: `GO_Biological_Process_2023`
- Mouse KEGG: `KEGG_2019_Mouse`
- Human KEGG: `KEGG_2021_Human`
- Reactome: `Reactome_2022`
See [references/enrichment_analysis.md](references/enrichment_analysis.md) for complete enrichment workflows.top_pathways = enr.results.head(10)
**数据库选择**:
- 人类GO: `GO_Biological_Process_2023`
- 小鼠KEGG: `KEGG_2019_Mouse`
- 人类KEGG: `KEGG_2021_Human`
- Reactome: `Reactome_2022`
完整富集分析流程请参考[references/enrichment_analysis.md](references/enrichment_analysis.md)。Step 7: Gene Annotation with ToolUniverse (optional)
步骤7:使用ToolUniverse进行基因注释(可选)
Use ToolUniverse ONLY for gene annotation, not analysis:
python
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()仅使用ToolUniverse进行基因注释,不用于分析:
python
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()Gene ID conversion
基因ID转换
result = tu.tools.MyGene_query_genes(query="TP53")
result = tu.tools.MyGene_query_genes(query="TP53")
Gene details
基因详细信息查询
result = tu.tools.ensembl_lookup_gene(
gene_id="ENSG00000141510",
species="homo_sapiens"
)
**Do NOT use ToolUniverse for**:
- Differential expression (use PyDESeq2)
- Statistical testing (use scipy.stats)
- Enrichment analysis (use gseapy)
---result = tu.tools.ensembl_lookup_gene(
gene_id="ENSG00000141510",
species="homo_sapiens"
)
**请勿使用ToolUniverse进行以下操作**:
- 差异表达分析(请使用PyDESeq2)
- 统计检验(请使用scipy.stats)
- 富集分析(请使用gseapy)
---Output Formatting
输出格式
Match the question's requested format:
python
undefined匹配问题要求的格式:
python
undefinedNumeric precision
数值精度
round(value, 2) # "2 decimal points"
f"{value:.2E}" # "scientific notation"
round(value, 2) # "保留2位小数"
f"{value:.2E}" # "科学计数法"
Percentages
百分比
f"{value * 100:.1f}%" # "as percentage"
f"{value * 100:.1f}%" # "百分比格式"
Counts (no decimals)
计数(无小数)
int(len(sig_genes)) # "how many genes"
See [references/output_formatting.md](references/output_formatting.md) for all format patterns.
---int(len(sig_genes)) # "基因数量"
所有格式规则请参考[references/output_formatting.md](references/output_formatting.md)。
---Common BixBench Patterns
常见BixBench模式
Pattern 1: Basic DEG Count
模式1:基础差异表达基因计数
Question: "How many genes show significant DE (padj < 0.05, |log2FC| > 0.5)?"
python
degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5)]
answer = len(degs)问题: "有多少基因显示显著差异表达(padj < 0.05,|log2FC| > 0.5)?"
python
degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'].abs() > 0.5)]
answer = len(degs)Pattern 2: Specific Gene Value
模式2:特定基因数值查询
Question: "What is the log2FC of gene X?"
python
answer = round(results.loc['GENE_X', 'log2FoldChange'], 2)问题: "基因X的log2FC是多少?"
python
answer = round(results.loc['GENE_X', 'log2FoldChange'], 2)Pattern 3: Direction-Specific
模式3:按方向筛选
Question: "How many genes are upregulated?"
python
up_degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'] > 0)]
answer = len(up_degs)问题: "有多少基因上调?"
python
up_degs = results[(results['padj'] < 0.05) & (results['log2FoldChange'] > 0)]
answer = len(up_degs)Pattern 4: Set Operations
模式4:集合操作
Question: "How many genes are uniquely DE in condition A?"
python
degs_A = set(results_A[results_A['padj'] < 0.05].index)
degs_B = set(results_B[results_B['padj'] < 0.05].index)
unique_A = degs_A - degs_B
answer = len(unique_A)问题: "条件A中独有的差异表达基因有多少?"
python
degs_A = set(results_A[results_A['padj'] < 0.05].index)
degs_B = set(results_B[results_B['padj'] < 0.05].index)
unique_A = degs_A - degs_B
answer = len(unique_A)Pattern 5: Dispersion Count
模式5:离散度计数
Question: "How many genes have dispersion below 1e-5 prior to fitting?"
python
genewise = dds.var['genewise_dispersions']
answer = (genewise < 1e-5).sum()See references/bixbench_examples.md for all 10 patterns with examples.
问题: "拟合前离散度低于1e-5的基因有多少?"
python
genewise = dds.var['genewise_dispersions']
answer = (genewise < 1e-5).sum()所有10种模式及示例请参考references/bixbench_examples.md。
Error Handling
错误处理
| Error | Solution |
|---|---|
| "No matching samples" | Check if counts need transposing; strip whitespace |
| "Dispersion trend did not converge" | Use |
| "Contrast not found" | Check |
| "Non-integer counts" | Round to integers OR use t-test for normalized data |
| "NaN in padj" | Independent filtering removed genes; exclude from counts |
See references/troubleshooting.md for complete debugging guide.
| 错误类型 | 解决方案 |
|---|---|
| "无匹配样本" | 检查计数矩阵是否需要转置;移除样本名称中的空格 |
| "离散度趋势未收敛" | 使用 |
| "未找到对比组" | 检查 |
| "非整数计数" | 取整为整数,或对标准化数据使用t检验 |
| "padj中存在NaN" | 独立过滤已移除部分基因;将这些基因从计数矩阵中排除 |
完整调试指南请参考references/troubleshooting.md。
Validation Checklist
验证清单
Every analysis MUST include:
Data Loading:
- Count matrix oriented correctly (samples as rows, genes as columns)
- Metadata aligned with counts
- Integer counts validated
DESeq2 Analysis:
- Design formula matches question
- Reference level set correctly (first in Categorical)
- Correct contrast extracted
Results:
- Thresholds match question exactly
- Direction filter applied if specified
- Answer formatted correctly (decimal places, notation)
Quality Checks:
- DEG count is reasonable
- P-values between 0 and 1
- Log2FC values are finite
每次分析必须包含以下检查:
数据加载:
- 计数矩阵方向正确(样本为行,基因为列)
- 元数据与计数矩阵样本对齐
- 已验证计数为整数
DESeq2分析:
- 设计公式与问题需求匹配
- 参考水平设置正确(Categorical中的第一个值)
- 提取了正确的对比组
结果:
- 阈值与问题要求完全一致
- 按需应用了方向筛选
- 答案格式正确(小数位数、计数法)
质量检查:
- 差异表达基因数量合理
- P值在0到1之间
- Log2FC值为有限数值
Known Limitations
已知限制
PyDESeq2 vs R DESeq2 Differences
PyDESeq2与R DESeq2的差异
This skill uses PyDESeq2 (Python implementation) for differential expression analysis. While PyDESeq2 faithfully implements the DESeq2 algorithm, numerical differences exist between Python and R implementations:
Dispersion Estimation:
- Dispersion estimates may differ from R DESeq2, especially for very low dispersion genes (< 1e-05)
- This is due to different numerical optimization methods between Python and R statistical libraries
- Results are still statistically valid and biologically meaningful
- If you need exact R DESeq2 dispersions for benchmark reproducibility, consider using R DESeq2 directly via rpy2
For Most Analyses: PyDESeq2 provides accurate, publication-quality results. The differences are in numerical precision, not statistical validity.
本技能使用PyDESeq2(Python实现)进行差异表达分析。尽管PyDESeq2忠实实现了DESeq2算法,但Python与R版本之间存在数值差异:
离散度估计:
- 离散度估计值可能与R DESeq2不同,尤其是离散度极低的基因(< 1e-05)
- 这是由于Python和R统计库使用的数值优化方法不同
- 结果仍然具有统计有效性和生物学意义
- 如果需要与R DESeq2完全一致的离散度估计(用于基准复现),可考虑通过rpy2直接使用R DESeq2
对于大多数分析:PyDESeq2可提供准确的、可用于发表的结果。差异仅存在于数值精度层面,不影响统计有效性。
GO/KEGG Enrichment (gseapy vs clusterProfiler)
GO/KEGG富集分析(gseapy vs clusterProfiler)
Enrichment analysis uses Python gseapy by default. Results may differ from R clusterProfiler due to:
- Different enrichment algorithms and statistical tests
- Different GO/KEGG database versions
- Different term simplification/redundancy removal methods
For R clusterProfiler Compatibility:
If you need results matching R clusterProfiler exactly (e.g., for benchmark reproducibility):
python
undefined默认使用Python gseapy进行富集分析。结果可能与R clusterProfiler不同,原因如下:
- 使用的富集算法和统计检验不同
- GO/KEGG数据库版本不同
- 术语简化/去冗余方法不同
与R clusterProfiler兼容:
如果需要与R clusterProfiler完全一致的结果(如用于基准复现):
python
undefinedInstall rpy2 and R packages first:
先安装rpy2和R包:
pip install rpy2
pip install rpy2
In R: install.packages(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))
在R中: install.packages(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri, vectors
pandas2ri.activate()
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri, vectors
pandas2ri.activate()
Load R packages
加载R包
clusterProfiler = importr('clusterProfiler')
orgdb = importr('org.Hs.eg.db')
clusterProfiler = importr('clusterProfiler')
orgdb = importr('org.Hs.eg.db')
Convert Python gene list to R vector
将Python基因列表转换为R向量
gene_list = sig_genes.index.tolist()
r_genes = vectors.StrVector(gene_list)
gene_list = sig_genes.index.tolist()
r_genes = vectors.StrVector(gene_list)
Run enrichGO (R's clusterProfiler)
运行enrichGO(R的clusterProfiler)
enrich_result = clusterProfiler.enrichGO(
gene=r_genes,
OrgDb=orgdb.org_Hs_eg_db,
ont='BP', # Biological Process
pAdjustMethod='BH',
pvalueCutoff=0.05,
qvalueCutoff=0.2
)
enrich_result = clusterProfiler.enrichGO(
gene=r_genes,
OrgDb=orgdb.org_Hs_eg_db,
ont='BP', # 生物过程
pAdjustMethod='BH',
pvalueCutoff=0.05,
qvalueCutoff=0.2
)
Apply simplify to remove redundant terms
应用简化以移除冗余术语
simplified = clusterProfiler.simplify(enrich_result, cutoff=0.7, by='p.adjust')
simplified = clusterProfiler.simplify(enrich_result, cutoff=0.7, by='p.adjust')
Convert R result back to pandas
将R结果转换回pandas DataFrame
results_df = pandas2ri.rpy2py(simplified)
**Default behavior**: Uses gseapy (no R dependencies required). This is sufficient for most analyses and provides valid enrichment results.
**When to use R clusterProfiler**:
- When exact reproducibility with R-based benchmarks is required
- When collaborating with R users who need identical results
- When specific clusterProfiler features (GSEA, compareCluster) are needed
---results_df = pandas2ri.rpy2py(simplified)
**默认行为**:使用gseapy(无需依赖R)。这足以满足大多数分析需求,并能提供有效的富集结果。
**何时使用R clusterProfiler**:
- 需要与基于R的基准结果完全复现
- 与R用户协作,需要完全一致的结果
- 需要使用clusterProfiler的特定功能(如GSEA、compareCluster)
---References
参考文档
Detailed documentation for advanced topics:
- question_parsing.md - Extract parameters from questions
- data_loading.md - Data loading and validation patterns
- pydeseq2_workflow.md - Complete PyDESeq2 code examples
- result_filtering.md - Advanced filtering and extraction
- dispersion_analysis.md - Dispersion diagnostics
- enrichment_analysis.md - GO/KEGG/Reactome workflows
- output_formatting.md - Format answers correctly
- bixbench_examples.md - All 10 question patterns
- troubleshooting.md - Common issues and debugging
高级主题的详细文档:
- question_parsing.md - 从问题中提取参数
- data_loading.md - 数据加载与验证模式
- pydeseq2_workflow.md - 完整PyDESeq2代码示例
- result_filtering.md - 高级过滤与提取方法
- dispersion_analysis.md - 离散度诊断分析
- enrichment_analysis.md - GO/KEGG/Reactome分析流程
- output_formatting.md - 结果格式规范
- bixbench_examples.md - 所有10种问题模式
- troubleshooting.md - 常见问题与调试指南
Utility Scripts
实用脚本
Helper scripts for common tasks:
- format_deseq2_output.py - Output formatters
- load_count_matrix.py - Data loading utilities
常见任务的辅助脚本:
- format_deseq2_output.py - 结果格式转换工具
- load_count_matrix.py - 数据加载工具