# PyDESeq2

## Overview

PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.

## When to Use This Skill

This skill should be used when:

- Analyzing bulk RNA-seq count data for differential expression
- Comparing gene expression between experimental conditions (e.g., treated vs control)
- Performing multi-factor designs accounting for batch effects or covariates
- Converting R-based DESeq2 workflows to Python
- Integrating differential expression analysis into Python-based pipelines
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"

## Quick Start Workflow

For users who want to perform a standard differential expression analysis:

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# 1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T  # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)

# 2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

# 3. Initialize and fit DESeq2
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True,
)
dds.deseq2()

# 4. Perform statistical testing
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()

# 5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
```

## Core Workflow Steps

### Step 1: Data Preparation

**Input requirements:**

- **Count matrix**: Samples × genes DataFrame with non-negative integer read counts
- **Metadata**: Samples × variables DataFrame with experimental factors

**Common data loading patterns:**

```python
# From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)

# From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T

# From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
```

**Data filtering:**

```python
# Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

# Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
```
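On a toy samples × genes matrix, the low-count filter behaves like this (the gene and sample names are made up for illustration):

```python
import pandas as pd

# Toy samples × genes matrix with hypothetical names
counts_df = pd.DataFrame(
    {"geneA": [5, 6, 7], "geneB": [0, 1, 2], "geneC": [100, 80, 90]},
    index=["s1", "s2", "s3"],
)

# geneB totals only 3 reads across all samples, so it is dropped
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
print(list(counts_df.columns))  # → ['geneA', 'geneC']
```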

### Step 2: Design Specification

The design formula specifies how gene expression is modeled.

**Single-factor designs:**

```python
design = "~condition"  # Simple two-group comparison
```

**Multi-factor designs:**

```python
design = "~batch + condition"  # Control for batch effects
design = "~age + condition"    # Include a continuous covariate
design = "~group + condition + group:condition"  # Interaction effects
```

**Design formula guidelines:**

- Use Wilkinson formula notation (R-style)
- Put adjustment variables (e.g., batch) before the main variable of interest
- Ensure variables exist as columns in the metadata DataFrame
- Use appropriate data types (categorical for discrete variables)
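Discrete metadata columns can be made explicitly categorical before fitting, per the last guideline above; a minimal sketch with made-up sample names and factor levels:

```python
import pandas as pd

# Hypothetical metadata; the sample names and levels are illustrative
metadata = pd.DataFrame(
    {"condition": ["control", "treated", "control", "treated"],
     "batch": [1, 1, 2, 2]},
    index=["s1", "s2", "s3", "s4"],
)

# Encode discrete variables as categorical so they are treated as
# factors rather than numbers when the design matrix is built
metadata["condition"] = pd.Categorical(
    metadata["condition"], categories=["control", "treated"]
)
metadata["batch"] = metadata["batch"].astype("category")

print(metadata["condition"].cat.categories.tolist())  # → ['control', 'treated']
```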

### Step 3: DESeq2 Fitting

Initialize the DeseqDataSet and run the complete pipeline:

```python
from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True,  # Refit after removing outliers
    n_cpus=1           # Parallel processing (adjust as needed)
)

# Run the complete DESeq2 pipeline
dds.deseq2()
```

**What `deseq2()` does:**
1. Computes size factors (normalization)
2. Fits genewise dispersions
3. Fits the dispersion trend curve
4. Computes dispersion priors
5. Fits MAP dispersions (shrinkage)
6. Fits log fold changes
7. Calculates Cook's distances (outlier detection)
8. Refits if outliers are detected (optional)
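Size-factor computation (step 1 above) follows DESeq2's median-of-ratios scheme. A minimal numpy sketch on toy counts illustrates the idea; this is not the PyDESeq2 internals, just the underlying arithmetic:

```python
import numpy as np

# Toy counts: 3 samples × 4 genes; sample 2 was sequenced at half
# the depth of sample 1, and sample 3 at twice the depth
counts = np.array([
    [100, 200, 300, 400],
    [ 50, 100, 150, 200],
    [200, 400, 600, 800],
], dtype=float)

# 1. Per-gene geometric mean across samples (a pseudo-reference sample)
log_geo_means = np.mean(np.log(counts), axis=0)

# 2. Each sample's size factor is the median ratio of its counts
#    to the pseudo-reference
size_factors = np.exp(np.median(np.log(counts) - log_geo_means, axis=1))
print(size_factors)  # → [1.  0.5 2. ]
```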

### Step 4: Statistical Testing

Perform Wald tests to identify differentially expressed genes:

```python
from pydeseq2.ds import DeseqStats

ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],  # Test treated vs control
    alpha=0.05,                # Significance threshold
    cooks_filter=True,         # Filter outliers
    independent_filter=True    # Filter low-power tests
)

ds.summary()
```

**Contrast specification:**

- Format: `[variable, test_level, reference_level]`
- Example: `["condition", "treated", "control"]` tests treated vs control
- If `None`, the last coefficient in the design is used

**Result DataFrame columns:**

- `baseMean`: Mean normalized count across samples
- `log2FoldChange`: Log2 fold change between conditions
- `lfcSE`: Standard error of the LFC
- `stat`: Wald test statistic
- `pvalue`: Raw p-value
- `padj`: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)
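To make the `padj` column concrete, here is a small standalone sketch of the Benjamini-Hochberg adjustment (illustrative only; PyDESeq2 computes this internally):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment, the procedure behind `padj`."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(ranked, 0, 1)
    return adj

print(bh_adjust([0.01, 0.02, 0.03, 0.5]))  # → [0.04 0.04 0.04 0.5 ]
```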

### Step 5: Optional LFC Shrinkage

Apply shrinkage to reduce noise in fold change estimates:

```python
ds.lfc_shrink()  # Applies apeGLM shrinkage
```

**When to use LFC shrinkage:**

- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up experiments

**Important:** Shrinkage affects only the `log2FoldChange` values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.

### Step 6: Result Export

Save results and intermediate objects:

```python
import pickle

# Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")

# Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")

# Save the DeseqDataSet for later use
with open("dds_result.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
```

## Common Analysis Patterns

### Two-Group Comparison

Standard case-control comparison:

```python
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()

results = ds.results_df
significant = results[results.padj < 0.05]
```

### Multiple Comparisons

Testing multiple treatment groups against a control:

```python
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()

treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}

for treatment in treatments:
    ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
    ds.summary()
    all_results[treatment] = ds.results_df

    sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
    print(f"{treatment}: {sig_count} significant genes")
```
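The per-contrast tables collected this way can be stacked into one long DataFrame for joint filtering; a sketch with made-up result values standing in for the real `all_results` dict:

```python
import pandas as pd

# Hypothetical per-contrast result tables keyed by treatment name
all_results = {
    "treatment_A": pd.DataFrame(
        {"log2FoldChange": [1.2, -0.3], "padj": [0.01, 0.6]},
        index=["gene1", "gene2"],
    ),
    "treatment_B": pd.DataFrame(
        {"log2FoldChange": [0.1, 2.0], "padj": [0.9, 0.001]},
        index=["gene1", "gene2"],
    ),
}

# Stack into one table with a 'contrast' index level, then filter once
combined = pd.concat(all_results, names=["contrast", "gene"])
sig = combined[combined.padj < 0.05]
print(len(sig))  # → 2
```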

### Accounting for Batch Effects

Control for technical variation:

```python
# Include batch in the design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()

# Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
```

### Continuous Covariates

Include continuous variables like age or dosage:

```python
# Ensure the continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
```
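When a metadata column contains non-numeric placeholders, `pd.to_numeric` with `errors="coerce"` converts them to NaN so the affected samples can be identified and dropped; a small sketch with made-up values:

```python
import pandas as pd

# Hypothetical age column with a non-numeric placeholder
ages = pd.Series(["34", "51", "n/a", "42"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric_ages = pd.to_numeric(ages, errors="coerce")
print(int(numeric_ages.isna().sum()))  # → 1
```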

## Using the Analysis Script

This skill includes a complete command-line script for standard analyses:

```bash
# Basic usage
python scripts/run_deseq2_analysis.py \
    --counts counts.csv \
    --metadata metadata.csv \
    --design "~condition" \
    --contrast condition treated control \
    --output results/

# With additional options
python scripts/run_deseq2_analysis.py \
    --counts counts.csv \
    --metadata metadata.csv \
    --design "~batch + condition" \
    --contrast condition treated control \
    --output results/ \
    --min-counts 10 \
    --alpha 0.05 \
    --n-cpus 4 \
    --plots
```

**Script features:**
- Automatic data loading and validation
- Gene and sample filtering
- Complete DESeq2 pipeline execution
- Statistical testing with customizable parameters
- Result export (CSV, pickle)
- Optional visualization (volcano and MA plots)

Refer users to `scripts/run_deseq2_analysis.py` when they need a standalone analysis tool or want to batch process multiple datasets.

## Result Interpretation

### Identifying Significant Genes

```python
# Filter by adjusted p-value
significant = ds.results_df[ds.results_df.padj < 0.05]

# Filter by both significance and effect size
sig_and_large = ds.results_df[
    (ds.results_df.padj < 0.05) & (abs(ds.results_df.log2FoldChange) > 1)
]

# Separate up- and down-regulated genes
upregulated = significant[significant.log2FoldChange > 0]
downregulated = significant[significant.log2FoldChange < 0]

print(f"Upregulated: {len(upregulated)}")
print(f"Downregulated: {len(downregulated)}")
```

### Ranking and Sorting

```python
import numpy as np

# Sort by adjusted p-value
top_by_padj = ds.results_df.sort_values("padj").head(20)

# Sort by absolute fold change (use shrunk values)
ds.lfc_shrink()
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)

# Sort by a combined metric
ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
```

### Quality Metrics

```python
import matplotlib.pyplot as plt

# Check normalization (size factors should be close to 1)
print("Size factors:", dds.obsm["size_factors"])

# Examine dispersion estimates
plt.hist(dds.varm["dispersions"], bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.title("Dispersion Distribution")
plt.show()

# Check the p-value distribution (should be mostly flat with a peak near 0)
plt.hist(ds.results_df.pvalue.dropna(), bins=50)
plt.xlabel("P-value")
plt.ylabel("Frequency")
plt.title("P-value Distribution")
plt.show()
```

## Visualization Guidelines

### Volcano Plot

Visualize significance vs effect size:

```python
import matplotlib.pyplot as plt
import numpy as np

results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)

plt.figure(figsize=(10, 6))
significant = results.padj < 0.05

plt.scatter(
    results.loc[~significant, "log2FoldChange"],
    results.loc[~significant, "-log10(padj)"],
    alpha=0.3, s=10, c='gray', label='Not significant'
)
plt.scatter(
    results.loc[significant, "log2FoldChange"],
    results.loc[significant, "-log10(padj)"],
    alpha=0.6, s=10, c='red', label='padj < 0.05'
)

plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
```

### MA Plot

Show fold change vs mean expression:

```python
plt.figure(figsize=(10, 6))

plt.scatter(
    np.log10(results.loc[~significant, "baseMean"] + 1),
    results.loc[~significant, "log2FoldChange"],
    alpha=0.3, s=10, c='gray'
)
plt.scatter(
    np.log10(results.loc[significant, "baseMean"] + 1),
    results.loc[significant, "log2FoldChange"],
    alpha=0.6, s=10, c='red'
)

plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)
```

## Troubleshooting Common Issues

### Data Format Problems

**Issue:** "Index mismatch between counts and metadata"

**Solution:** Ensure sample names match exactly.

```python
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())

# Take the intersection if needed
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
```

**Issue:** "All genes have zero counts"

**Solution:** Check whether the data needs transposition.

```python
print(f"Counts shape: {counts_df.shape}")

# If rows (genes) outnumber columns (samples), a transpose is likely needed
if counts_df.shape[1] < counts_df.shape[0]:
    counts_df = counts_df.T
```

### Design Matrix Issues

**Issue:** "Design matrix is not full rank"

**Cause:** Confounded variables (e.g., all treated samples in one batch)

**Solution:** Remove the confounded variable or add an interaction term.

```python
# Check for confounding
print(pd.crosstab(metadata.condition, metadata.batch))

# Either simplify the design...
design = "~condition"  # Remove batch

# ...or model the interaction
design = "~condition + batch + condition:batch"
```
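A quick way to see why such a design fails: dummy-code a confounded toy metadata table and check the rank of the resulting matrix (a simplified sketch, not PyDESeq2's own design construction):

```python
import numpy as np
import pandas as pd

# Confounded toy design: every treated sample sits in batch b2
meta = pd.DataFrame({
    "condition": ["control", "control", "treated", "treated"],
    "batch": ["b1", "b1", "b2", "b2"],
})

# Build a simple dummy-coded design matrix with an intercept
X = pd.get_dummies(meta, drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)

# The condition and batch columns are identical, so the matrix
# has fewer independent columns than total columns
rank = np.linalg.matrix_rank(X.values)
print(rank, X.shape[1])  # → 2 3  (rank < columns: not full rank)
```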

### No Significant Genes

**Diagnostics:**

```python
# Check the dispersion distribution
plt.hist(dds.varm["dispersions"], bins=50)
plt.show()

# Check size factors
print(dds.obsm["size_factors"])

# Look at the top genes by raw p-value
print(ds.results_df.nsmallest(20, "pvalue"))
```

**Possible causes:**
- Small effect sizes
- High biological variability
- Insufficient sample size
- Technical issues (batch effects, outliers)

## Reference Documentation

For comprehensive details beyond this workflow-oriented guide:

- **API Reference** (`references/api_reference.md`): Complete documentation of PyDESeq2 classes, methods, and data structures. Use it when detailed parameter information or object attributes are needed.
- **Workflow Guide** (`references/workflow_guide.md`): In-depth coverage of complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use it for complex experimental designs or when issues arise.

Load these references into context when users need:

- Detailed API documentation: `Read references/api_reference.md`
- Comprehensive workflow examples: `Read references/workflow_guide.md`
- Troubleshooting guidance: `Read references/workflow_guide.md` (see the Troubleshooting section)

## Key Reminders

1. **Data orientation matters:** Count matrices typically load as genes × samples but need to be samples × genes. Transpose with `.T` if needed.
2. **Sample filtering:** Remove samples with missing metadata before analysis to avoid errors.
3. **Gene filtering:** Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computation time.
4. **Design formula order:** Put adjustment variables before the variable of interest (e.g., `"~batch + condition"`, not `"~condition + batch"`).
5. **LFC shrinkage timing:** Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
6. **Result interpretation:** Use `padj < 0.05` for significance, not raw p-values. The Benjamini-Hochberg procedure controls the false discovery rate.
7. **Contrast specification:** The format is `[variable, test_level, reference_level]`, where `test_level` is compared against `reference_level`.
8. **Save intermediate objects:** Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.
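A lightweight pre-flight check mirroring these reminders can catch the most common input mistakes before fitting; `validate_inputs` is a hypothetical helper, not part of PyDESeq2:

```python
import pandas as pd

def validate_inputs(counts_df, metadata):
    """Collect problems with the counts/metadata pair (illustrative helper)."""
    problems = []
    if not counts_df.index.equals(metadata.index):
        problems.append("sample indexes of counts and metadata differ")
    if (counts_df.values < 0).any():
        problems.append("counts contain negative values")
    if counts_df.shape[0] > counts_df.shape[1]:
        problems.append("more rows than columns: matrix may be genes × samples (transpose?)")
    return problems

# Toy inputs with made-up names: 3 samples × 3 genes, aligned indexes
counts = pd.DataFrame({"g1": [5, 0, 3], "g2": [1, 2, 8], "g3": [0, 0, 1]},
                      index=["s1", "s2", "s3"])
meta = pd.DataFrame({"condition": ["control", "treated", "treated"]},
                    index=["s1", "s2", "s3"])
print(validate_inputs(counts, meta))  # → []
```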

## Installation and Requirements

```bash
uv pip install pydeseq2
```

**System requirements:**

- Python 3.10-3.11
- pandas 1.4.3+
- numpy 1.23.0+
- scipy 1.11.0+
- scikit-learn 1.1.1+
- anndata 0.8.0+

**Optional for visualization:**

- matplotlib
- seaborn

## Additional Resources