pydeseq2
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChinesePyDESeq2
PyDESeq2
Overview
概述
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
PyDESeq2是DESeq2的Python实现版本,用于基于bulk RNA-seq数据的差异表达分析。可完成从数据加载到结果解读的完整工作流,包括单因素和多因素设计、带多重检验校正的Wald检验、可选的apeGLM收缩,以及与pandas和AnnData的集成。
When to Use This Skill
何时使用该工具
This skill should be used when:
- Analyzing bulk RNA-seq count data for differential expression
- Comparing gene expression between experimental conditions (e.g., treated vs control)
- Performing multi-factor designs accounting for batch effects or covariates
- Converting R-based DESeq2 workflows to Python
- Integrating differential expression analysis into Python-based pipelines
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"
当你遇到以下场景时,可使用本工具:
- 分析bulk RNA-seq计数数据以进行差异表达分析
- 比较不同实验条件下的基因表达(如处理组 vs 对照组)
- 构建多因素设计以控制批次效应或协变量
- 将基于R的DESeq2工作流转换为Python版本
- 将差异表达分析集成到Python流水线中
- 用户提及"DESeq2"、"差异表达"、"RNA-seq分析"或"PyDESeq2"时
Quick Start Workflow
快速开始工作流
For users who want to perform a standard differential expression analysis:
python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats对于想要执行标准差异表达分析的用户:
python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats1. Load data
1. Load data
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)
counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes
metadata = pd.read_csv("metadata.csv", index_col=0)
2. Filter low-count genes
2. Filter low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
3. Initialize and fit DESeq2
3. Initialize and fit DESeq2
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True
)
dds.deseq2()
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True
)
dds.deseq2()
4. Perform statistical testing
4. Perform statistical testing
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
5. Access results
5. Access results
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
undefinedresults = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
undefinedCore Workflow Steps
核心工作流步骤
Step 1: Data Preparation
步骤1:数据准备
Input requirements:
- Count matrix: Samples × genes DataFrame with non-negative integer read counts
- Metadata: Samples × variables DataFrame with experimental factors
Common data loading patterns:
python
undefined输入要求:
- 计数矩阵: 样本×基因的DataFrame,包含非负整数读数
- 元数据: 样本×变量的DataFrame,包含实验因素
常见数据加载方式:
python
undefinedFrom CSV (typical format: genes × samples, needs transpose)
From CSV (typical format: genes × samples, needs transpose)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
From TSV
From TSV
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
From AnnData
From AnnData
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
**Data filtering:**
```pythonimport anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
**数据过滤:**
```pythonRemove low-count genes
Remove low-count genes
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
Remove samples with missing metadata
Remove samples with missing metadata
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
undefinedsamples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
undefinedStep 2: Design Specification
步骤2:设计矩阵指定
The design formula specifies how gene expression is modeled.
Single-factor designs:
python
design = "~condition" # Simple two-group comparisonMulti-factor designs:
python
design = "~batch + condition" # Control for batch effects
design = "~age + condition" # Include continuous covariate
design = "~group + condition + group:condition" # Interaction effectsDesign formula guidelines:
- Use Wilkinson formula notation (R-style)
- Put adjustment variables (e.g., batch) before the main variable of interest
- Ensure variables exist as columns in the metadata DataFrame
- Use appropriate data types (categorical for discrete variables)
设计公式用于定义基因表达的建模方式。
单因素设计:
python
design = "~condition" # Simple two-group comparison多因素设计:
python
design = "~batch + condition" # Control for batch effects
design = "~age + condition" # Include continuous covariate
design = "~group + condition + group:condition" # Interaction effects设计公式指南:
- 使用Wilkinson公式表示法(R语言风格)
- 将校正变量(如批次)放在研究变量之前
- 确保变量在元数据DataFrame中存在对应的列
- 为离散变量使用合适的数据类型(分类类型)
Step 3: DESeq2 Fitting
步骤3:DESeq2模型拟合
Initialize the DeseqDataSet and run the complete pipeline:
python
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True, # Refit after removing outliers
n_cpus=1 # Parallel processing (adjust as needed)
)初始化DeseqDataSet并运行完整流程:
python
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True, # Refit after removing outliers
n_cpus=1 # Parallel processing (adjust as needed)
)Run the complete DESeq2 pipeline
Run the complete DESeq2 pipeline
dds.deseq2()
**What `deseq2()` does:**
1. Computes size factors (normalization)
2. Fits genewise dispersions
3. Fits dispersion trend curve
4. Computes dispersion priors
5. Fits MAP dispersions (shrinkage)
6. Fits log fold changes
7. Calculates Cook's distances (outlier detection)
8. Refits if outliers detected (optional)dds.deseq2()
**`deseq2()`函数执行的操作:**
1. 计算大小因子(归一化)
2. 拟合基因水平离散度
3. 拟合离散度趋势曲线
4. 计算离散度先验值
5. 拟合MAP离散度(收缩)
6. 拟合对数倍数变化
7. 计算Cook距离(异常值检测)
8. 若检测到异常值则重新拟合(可选)Step 4: Statistical Testing
步骤4:统计检验
Perform Wald tests to identify differentially expressed genes:
python
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"], # Test treated vs control
alpha=0.05, # Significance threshold
cooks_filter=True, # Filter outliers
independent_filter=True # Filter low-power tests
)
ds.summary()Contrast specification:
- Format:
[variable, test_level, reference_level] - Example: tests treated vs control
["condition", "treated", "control"] - If , uses the last coefficient in the design
None
Result DataFrame columns:
- : Mean normalized count across samples
baseMean - : Log2 fold change between conditions
log2FoldChange - : Standard error of LFC
lfcSE - : Wald test statistic
stat - : Raw p-value
pvalue - : Adjusted p-value (FDR-corrected via Benjamini-Hochberg)
padj
执行Wald检验以识别差异表达基因:
python
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"], # Test treated vs control
alpha=0.05, # Significance threshold
cooks_filter=True, # Filter outliers
independent_filter=True # Filter low-power tests
)
ds.summary()对比参数指定:
- 格式:
[变量名, 测试组水平, 对照组水平] - 示例:用于比较处理组与对照组
["condition", "treated", "control"] - 若设为,则使用设计矩阵中的最后一个系数
None
结果DataFrame列说明:
- :所有样本中的平均归一化计数
baseMean - :两组间的对数2倍数变化
log2FoldChange - :对数倍数变化的标准误
lfcSE - :Wald检验统计量
stat - :原始p值
pvalue - :校正后p值(通过Benjamini-Hochberg法进行FDR校正)
padj
Step 5: Optional LFC Shrinkage
步骤5:可选的对数倍数变化收缩
Apply shrinkage to reduce noise in fold change estimates:
python
ds.lfc_shrink() # Applies apeGLM shrinkageWhen to use LFC shrinkage:
- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up experiments
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
应用收缩方法以降低倍数变化估计中的噪声:
python
ds.lfc_shrink() # Applies apeGLM shrinkage何时使用对数倍数变化收缩:
- 用于可视化(火山图、热图)
- 按效应大小对基因进行排序
- 优先选择后续实验研究的基因时
重要提示: 收缩仅影响log2FoldChange值,不改变统计检验结果(p值保持不变)。可视化时使用收缩后的值,但报告显著性时使用未收缩的p值。
Step 6: Result Export
步骤6:结果导出
Save results and intermediate objects:
python
import pickle保存结果和中间对象:
python
import pickleExport results as CSV
Export results as CSV
ds.results_df.to_csv("deseq2_results.csv")
ds.results_df.to_csv("deseq2_results.csv")
Save significant genes only
Save significant genes only
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
Save DeseqDataSet for later use
Save DeseqDataSet for later use
with open("dds_result.pkl", "wb") as f:
pickle.dump(dds.to_picklable_anndata(), f)
undefinedwith open("dds_result.pkl", "wb") as f:
pickle.dump(dds.to_picklable_anndata(), f)
undefinedCommon Analysis Patterns
常见分析模式
Two-Group Comparison
两组比较
Standard case-control comparison:
python
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]标准病例-对照比较:
python
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]Multiple Comparisons
多组比较
Testing multiple treatment groups against control:
python
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
ds.summary()
all_results[treatment] = ds.results_df
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
print(f"{treatment}: {sig_count} significant genes")比较多个处理组与对照组:
python
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
ds.summary()
all_results[treatment] = ds.results_df
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
print(f"{treatment}: {sig_count} significant genes")Accounting for Batch Effects
批次效应校正
Control for technical variation:
python
undefined控制技术变异:
python
undefinedInclude batch in design
Include batch in design
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
Test condition while controlling for batch
Test condition while controlling for batch
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
undefinedds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
undefinedContinuous Covariates
连续协变量
Include continuous variables like age or dosage:
python
undefined纳入年龄或剂量等连续变量:
python
undefinedEnsure continuous variable is numeric
Ensure continuous variable is numeric
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
undefinedmetadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
undefinedUsing the Analysis Script
使用分析脚本
This skill includes a complete command-line script for standard analyses:
bash
undefined本工具包含一个完整的命令行脚本,用于标准分析:
bash
undefinedBasic usage
Basic usage
python scripts/run_deseq2_analysis.py
--counts counts.csv
--metadata metadata.csv
--design "~condition"
--contrast condition treated control
--output results/
--counts counts.csv
--metadata metadata.csv
--design "~condition"
--contrast condition treated control
--output results/
python scripts/run_deseq2_analysis.py
--counts counts.csv
--metadata metadata.csv
--design "~condition"
--contrast condition treated control
--output results/
--counts counts.csv
--metadata metadata.csv
--design "~condition"
--contrast condition treated control
--output results/
With additional options
With additional options
python scripts/run_deseq2_analysis.py
--counts counts.csv
--metadata metadata.csv
--design "~batch + condition"
--contrast condition treated control
--output results/
--min-counts 10
--alpha 0.05
--n-cpus 4
--plots
--counts counts.csv
--metadata metadata.csv
--design "~batch + condition"
--contrast condition treated control
--output results/
--min-counts 10
--alpha 0.05
--n-cpus 4
--plots
**Script features:**
- Automatic data loading and validation
- Gene and sample filtering
- Complete DESeq2 pipeline execution
- Statistical testing with customizable parameters
- Result export (CSV, pickle)
- Optional visualization (volcano and MA plots)
Refer users to `scripts/run_deseq2_analysis.py` when they need a standalone analysis tool or want to batch process multiple datasets.python scripts/run_deseq2_analysis.py
--counts counts.csv
--metadata metadata.csv
--design "~batch + condition"
--contrast condition treated control
--output results/
--min-counts 10
--alpha 0.05
--n-cpus 4
--plots
--counts counts.csv
--metadata metadata.csv
--design "~batch + condition"
--contrast condition treated control
--output results/
--min-counts 10
--alpha 0.05
--n-cpus 4
--plots
**脚本功能:**
- 自动数据加载与验证
- 基因与样本过滤
- 完整DESeq2流程执行
- 可自定义参数的统计检验
- 结果导出(CSV、pickle格式)
- 可选可视化(火山图和MA图)
当用户需要独立分析工具或批量处理多个数据集时,可引导其使用`scripts/run_deseq2_analysis.py`脚本。Result Interpretation
结果解读
Identifying Significant Genes
识别显著差异基因
python
undefinedpython
undefinedFilter by adjusted p-value
Filter by adjusted p-value
significant = ds.results_df[ds.results_df.padj < 0.05]
significant = ds.results_df[ds.results_df.padj < 0.05]
Filter by both significance and effect size
Filter by both significance and effect size
sig_and_large = ds.results_df[
(ds.results_df.padj < 0.05) &
(abs(ds.results_df.log2FoldChange) > 1)
]
sig_and_large = ds.results_df[
(ds.results_df.padj < 0.05) &
(abs(ds.results_df.log2FoldChange) > 1)
]
Separate up- and down-regulated
Separate up- and down-regulated
upregulated = significant[significant.log2FoldChange > 0]
downregulated = significant[significant.log2FoldChange < 0]
print(f"Upregulated: {len(upregulated)}")
print(f"Downregulated: {len(downregulated)}")
undefinedupregulated = significant[significant.log2FoldChange > 0]
downregulated = significant[significant.log2FoldChange < 0]
print(f"Upregulated: {len(upregulated)}")
print(f"Downregulated: {len(downregulated)}")
undefinedRanking and Sorting
基因排序
python
undefinedpython
undefinedSort by adjusted p-value
Sort by adjusted p-value
top_by_padj = ds.results_df.sort_values("padj").head(20)
top_by_padj = ds.results_df.sort_values("padj").head(20)
Sort by absolute fold change (use shrunk values)
Sort by absolute fold change (use shrunk values)
ds.lfc_shrink()
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
ds.lfc_shrink()
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
Sort by a combined metric
Sort by a combined metric
ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
undefinedds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
undefinedQuality Metrics
质量指标检查
python
undefinedpython
undefinedCheck normalization (size factors should be close to 1)
Check normalization (size factors should be close to 1)
print("Size factors:", dds.obsm["size_factors"])
print("Size factors:", dds.obsm["size_factors"])
Examine dispersion estimates
Examine dispersion estimates
import matplotlib.pyplot as plt
plt.hist(dds.varm["dispersions"], bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.title("Dispersion Distribution")
plt.show()
import matplotlib.pyplot as plt
plt.hist(dds.varm["dispersions"], bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.title("Dispersion Distribution")
plt.show()
Check p-value distribution (should be mostly flat with peak near 0)
Check p-value distribution (should be mostly flat with peak near 0)
plt.hist(ds.results_df.pvalue.dropna(), bins=50)
plt.xlabel("P-value")
plt.ylabel("Frequency")
plt.title("P-value Distribution")
plt.show()
undefinedplt.hist(ds.results_df.pvalue.dropna(), bins=50)
plt.xlabel("P-value")
plt.ylabel("Frequency")
plt.title("P-value Distribution")
plt.show()
undefinedVisualization Guidelines
可视化指南
Volcano Plot
火山图
Visualize significance vs effect size:
python
import matplotlib.pyplot as plt
import numpy as np
results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)
plt.figure(figsize=(10, 6))
significant = results.padj < 0.05
plt.scatter(
results.loc[~significant, "log2FoldChange"],
results.loc[~significant, "-log10(padj)"],
alpha=0.3, s=10, c='gray', label='Not significant'
)
plt.scatter(
results.loc[significant, "log2FoldChange"],
results.loc[significant, "-log10(padj)"],
alpha=0.6, s=10, c='red', label='padj < 0.05'
)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)可视化显著性与效应大小的关系:
python
import matplotlib.pyplot as plt
import numpy as np
results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)
plt.figure(figsize=(10, 6))
significant = results.padj < 0.05
plt.scatter(
results.loc[~significant, "log2FoldChange"],
results.loc[~significant, "-log10(padj)"],
alpha=0.3, s=10, c='gray', label='Not significant'
)
plt.scatter(
results.loc[significant, "log2FoldChange"],
results.loc[significant, "-log10(padj)"],
alpha=0.6, s=10, c='red', label='padj < 0.05'
)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)MA Plot
MA图
Show fold change vs mean expression:
python
plt.figure(figsize=(10, 6))
plt.scatter(
np.log10(results.loc[~significant, "baseMean"] + 1),
results.loc[~significant, "log2FoldChange"],
alpha=0.3, s=10, c='gray'
)
plt.scatter(
np.log10(results.loc[significant, "baseMean"] + 1),
results.loc[significant, "log2FoldChange"],
alpha=0.6, s=10, c='red'
)
plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)展示倍数变化与平均表达量的关系:
python
plt.figure(figsize=(10, 6))
plt.scatter(
np.log10(results.loc[~significant, "baseMean"] + 1),
results.loc[~significant, "log2FoldChange"],
alpha=0.3, s=10, c='gray'
)
plt.scatter(
np.log10(results.loc[significant, "baseMean"] + 1),
results.loc[significant, "log2FoldChange"],
alpha=0.6, s=10, c='red'
)
plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)Troubleshooting Common Issues
常见问题排查
Data Format Problems
数据格式问题
Issue: "Index mismatch between counts and metadata"
Solution: Ensure sample names match exactly
python
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())问题: "计数矩阵与元数据的索引不匹配"
解决方案: 确保样本名称完全一致
python
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())Take intersection if needed
Take intersection if needed
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
**Issue:** "All genes have zero counts"
**Solution:** Check if data needs transposition
```python
print(f"Counts shape: {counts_df.shape}")common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
**问题:** "所有基因的计数均为零"
**解决方案:** 检查数据是否需要转置
```python
print(f"Counts shape: {counts_df.shape}")If genes > samples, transpose is needed
If genes > samples, transpose is needed
if counts_df.shape[1] < counts_df.shape[0]:
counts_df = counts_df.T
undefinedif counts_df.shape[1] < counts_df.shape[0]:
counts_df = counts_df.T
undefinedDesign Matrix Issues
设计矩阵问题
Issue: "Design matrix is not full rank"
Cause: Confounded variables (e.g., all treated samples in one batch)
Solution: Remove confounded variable or add interaction term
python
undefined问题: "设计矩阵不满秩"
原因: 变量存在共线性(如所有处理组样本都属于同一批次)
解决方案: 移除共线变量或添加交互项
python
undefinedCheck confounding
Check confounding
print(pd.crosstab(metadata.condition, metadata.batch))
print(pd.crosstab(metadata.condition, metadata.batch))
Either simplify design or add interaction
Either simplify design or add interaction
design = "~condition" # Remove batch
design = "~condition" # Remove batch
OR
OR
design = "~condition + batch + condition:batch" # Model interaction
undefineddesign = "~condition + batch + condition:batch" # Model interaction
undefinedNo Significant Genes
无显著差异基因
Diagnostics:
python
undefined诊断步骤:
python
undefinedCheck dispersion distribution
Check dispersion distribution
plt.hist(dds.varm["dispersions"], bins=50)
plt.show()
plt.hist(dds.varm["dispersions"], bins=50)
plt.show()
Check size factors
Check size factors
print(dds.obsm["size_factors"])
print(dds.obsm["size_factors"])
Look at top genes by raw p-value
Look at top genes by raw p-value
print(ds.results_df.nsmallest(20, "pvalue"))
**Possible causes:**
- Small effect sizes
- High biological variability
- Insufficient sample size
- Technical issues (batch effects, outliers)print(ds.results_df.nsmallest(20, "pvalue"))
**可能原因:**
- 效应量过小
- 生物变异性高
- 样本量不足
- 技术问题(批次效应、异常值)Reference Documentation
参考文档
For comprehensive details beyond this workflow-oriented guide:
-
API Reference (): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.
references/api_reference.md -
Workflow Guide (): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.
references/workflow_guide.md
Load these references into context when users need:
- Detailed API documentation:
Read references/api_reference.md - Comprehensive workflow examples:
Read references/workflow_guide.md - Troubleshooting guidance: (see Troubleshooting section)
Read references/workflow_guide.md
如需本工作流指南之外的详细信息,请参考:
-
API参考 ():PyDESeq2类、方法和数据结构的完整文档。当需要详细参数信息或了解对象属性时使用。
references/api_reference.md -
工作流指南 ():涵盖完整分析工作流、数据加载模式、多因素设计、问题排查和最佳实践的深度指南。处理复杂实验设计或遇到问题时使用。
references/workflow_guide.md
当用户需要以下内容时,可引导其查阅对应文档:
- 详细API文档:
Read references/api_reference.md - 全面工作流示例:
Read references/workflow_guide.md - 问题排查指导:(查看问题排查章节)
Read references/workflow_guide.md
Key Reminders
关键提示
-
Data orientation matters: Count matrices typically load as genes × samples but need to be samples × genes. Always transpose withif needed.
.T -
Sample filtering: Remove samples with missing metadata before analysis to avoid errors.
-
Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.
-
Design formula order: Put adjustment variables before the variable of interest (e.g.,not
"~batch + condition")."~condition + batch" -
LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
-
Result interpretation: Usefor significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.
padj < 0.05 -
Contrast specification: The format iswhere test_level is compared against reference_level.
[variable, test_level, reference_level] -
Save intermediate objects: Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.
-
数据方向很重要: 计数矩阵通常以基因×样本的格式加载,但分析需要样本×基因的格式。必要时务必使用进行转置。
.T -
样本过滤: 分析前移除元数据缺失的样本,以避免错误。
-
基因过滤: 过滤低计数基因(如总读数<10)以提升分析效能并减少计算时间。
-
设计公式顺序: 将校正变量放在研究变量之前(如使用而非
"~batch + condition")。"~condition + batch" -
对数倍数变化收缩时机: 在统计检验后应用收缩,且仅用于可视化/排序。p值基于未收缩的估计值。
-
结果解读: 使用判断显著性,而非原始p值。Benjamini-Hochberg方法用于控制假发现率。
padj < 0.05 -
对比参数指定: 格式为,其中测试组水平与对照组水平进行比较。
[变量名, 测试组水平, 对照组水平] -
保存中间对象: 使用pickle保存DeseqDataSet对象,以便后续使用或进行额外分析,无需重新运行耗时的拟合步骤。
Installation and Requirements
安装与环境要求
bash
uv pip install pydeseq2System requirements:
- Python 3.10-3.11
- pandas 1.4.3+
- numpy 1.23.0+
- scipy 1.11.0+
- scikit-learn 1.1.1+
- anndata 0.8.0+
Optional for visualization:
- matplotlib
- seaborn
bash
uv pip install pydeseq2系统要求:
- Python 3.10-3.11
- pandas 1.4.3+
- numpy 1.23.0+
- scipy 1.11.0+
- scikit-learn 1.1.1+
- anndata 0.8.0+
可视化可选依赖:
- matplotlib
- seaborn
Additional Resources
额外资源
- Official Documentation: https://pydeseq2.readthedocs.io
- GitHub Repository: https://github.com/owkin/PyDESeq2
- Publication: Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547
- Original DESeq2 (R): Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8
- 官方文档: https://pydeseq2.readthedocs.io
- GitHub仓库: https://github.com/owkin/PyDESeq2
- 相关文献: Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547
- 原始R版DESeq2: Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8