tooluniverse-single-cell

Single-Cell Genomics and Expression Matrix Analysis

Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.

When to Use This Skill

Apply when users:
  • Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
  • Ask about cell type identification, clustering, or annotation
  • Need differential expression analysis by cell type or condition
  • Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
  • Ask about PCA, UMAP, t-SNE for expression data
  • Need Leiden/Louvain clustering on expression matrices
  • Want statistical comparisons between cell types (t-test, ANOVA, fold change)
  • Ask about marker genes, batch correction, trajectory, or cell-cell communication
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)

NOT for (use other skills instead):
  • Bulk RNA-seq DESeq2 only -> tooluniverse-rnaseq-deseq2
  • Gene enrichment only -> tooluniverse-gene-enrichment
  • VCF/variant analysis -> tooluniverse-variant-analysis

Core Principles

  1. Data-first - Load, inspect, validate before analysis
  2. AnnData-centric - All data flows through anndata objects
  3. Cell type awareness - Per-cell-type subsetting when needed
  4. Statistical rigor - Normalization, multiple testing correction, effect sizes
  5. Question-driven - Parse what the user is actually asking

Required Packages

```python
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp  # enrichment
import harmonypy     # batch correction (optional)
```

Install:

```bash
pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
```
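Before running an analysis it can help to confirm that the packages above are importable. The following is a minimal sketch using only the standard library; the `missing_modules` helper and the `REQUIRED`/`OPTIONAL` lists are illustrative names, not part of this skill. Note that import names differ from pip names for some packages (`scikit-learn` imports as `sklearn`, `umap-learn` as `umap`).

```python
from importlib.util import find_spec

# Import names, which are not always the pip package names.
REQUIRED = ["scanpy", "anndata", "leidenalg", "umap", "gseapy", "pandas",
            "numpy", "scipy", "sklearn", "statsmodels"]
OPTIONAL = ["harmonypy"]  # only needed for batch correction


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("Missing required modules:", missing_modules(REQUIRED))
print("Missing optional modules:", missing_modules(OPTIONAL))
```

An empty list for the required modules means the imports in this section should succeed.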

Workflow Decision Tree

START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
|   Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
|   See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
|   Most common BixBench pattern (bix-33)
|   See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
|   Pattern: Gene length vs expression (bix-22)
|   See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
|   See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
|   See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
    See: references/trajectory_analysis.md

Data format handling:
  • h5ad -> sc.read_h5ad()
  • 10X -> sc.read_10x_mtx() or sc.read_10x_h5()
  • CSV/TSV -> pd.read_csv() -> Convert to AnnData (check orientation!)

Data Loading

AnnData expects: cells/samples as rows (obs), genes as columns (var)

```python
adata = sc.read_h5ad("data.h5ad")  # h5ad already oriented

# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5:  # genes > samples by 5x => transpose
    df = df.T
adata = ad.AnnData(df)

# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
    adata.obs[col] = meta.loc[common, col]
```
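The shape-based orientation check is a heuristic and can misfire on small or unusually shaped matrices. When a metadata table is available, one alternative is to decide orientation by which axis of the count matrix matches the metadata IDs. The sketch below assumes the metadata index holds cell/sample IDs; `orient_by_metadata` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd


def orient_by_metadata(counts: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Return counts with cells/samples as rows, using metadata IDs to decide."""
    ids = set(meta.index)
    row_hits = sum(label in ids for label in counts.index)
    col_hits = sum(label in ids for label in counts.columns)
    if row_hits == 0 and col_hits == 0:
        raise ValueError("No overlap between count matrix labels and metadata index")
    # Transpose only when the metadata IDs sit on the columns.
    return counts if row_hits >= col_hits else counts.T


# Toy example: 3 genes x 2 samples, i.e. the layout that needs transposing.
counts = pd.DataFrame(
    [[5, 0], [1, 2], [0, 7]],
    index=["GeneA", "GeneB", "GeneC"],
    columns=["s1", "s2"],
)
meta = pd.DataFrame({"condition": ["ctrl", "treated"]}, index=["s1", "s2"])

oriented = orient_by_metadata(counts, meta)
print(oriented.shape)
```

The returned frame can then be passed to `ad.AnnData(...)` as in the block above.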

---

Quality Control

```python
adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_genes(adata, min_cells=3)
```

See: references/scanpy_workflow.md for details
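To make the QC columns less opaque, here is a small NumPy sketch of the arithmetic behind three of the per-cell metrics used above (`n_genes_by_counts`, `total_counts`, `pct_counts_mt`). It is an illustration on a made-up dense matrix, not scanpy's implementation, and the gene names are placeholders.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are made up.
counts = np.array([
    [1, 10, 8, 1],
    [0,  0, 1, 9],
    [0,  6, 4, 0],
])
gene_names = np.array(["MT-CO1", "ACTB", "GAPDH", "mt-Nd1"])

# Flag mitochondrial genes by prefix (human 'MT-' or mouse 'mt-').
is_mt = np.char.startswith(gene_names, "MT-") | np.char.startswith(gene_names, "mt-")

n_genes_by_counts = (counts > 0).sum(axis=1)       # genes detected per cell
total_counts = counts.sum(axis=1)                  # library size per cell
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

keep = pct_counts_mt < 20                          # same cutoff as the block above
print(n_genes_by_counts, pct_counts_mt, keep)
```

Inspecting the distribution of these values before choosing cutoffs is in line with the data-first principle; the thresholds of 200 genes and 20% are the ones this document uses, not universal constants.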

Complete Pipeline (Quick Reference)

```python
import scanpy as sc

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")

# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)

# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)

# Find markers + Annotate + Per-cell-type DE
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
```
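The normalization step can be easier to reason about as plain arithmetic. The sketch below reproduces, on a toy dense matrix, what library-size scaling to a target sum of 1e4 followed by a natural-log `log1p` transform does. It assumes the default behaviour (every gene contributes to the library size) and is not scanpy's own code.

```python
import numpy as np

# Two toy cells with identical composition but 5x different sequencing depth.
counts = np.array([
    [2.0,  6.0,  2.0],
    [10.0, 30.0, 10.0],
])
target_sum = 1e4

# Scale each cell so its counts sum to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# log(1 + x) compresses the dynamic range and keeps zeros at zero.
log_norm = np.log1p(normalized)
print(normalized)
```

After scaling, the two cells have identical profiles, which is the point of the step: differences in depth are removed so that downstream HVG selection and PCA reflect composition rather than library size.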

---

Differential Expression Decision Tree

Single-Cell DE (many cells per condition):
  Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
  Best for: Per-cell-type DE, marker gene finding

Pseudo-Bulk DE (aggregate counts by sample):
  Use: DESeq2 via PyDESeq2
  Best for: Sample-level comparisons with replicates

Statistical Tests Only:
  Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
  Best for: Correlation, ANOVA, t-tests on summaries
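
For the pseudo-bulk branch, the aggregation step itself needs only pandas. The sketch below sums raw counts per (sample, cell type) and drops groups with too few cells. The `pseudobulk` helper, the column names `sample` and `cell_type`, and the default `min_cells=3` (borrowed from the troubleshooting guidance in this document) are assumptions for illustration; the toy data is made up.

```python
import pandas as pd


def pseudobulk(counts, obs, keys=("sample", "cell_type"), min_cells=3):
    """Sum raw counts per group and drop groups with fewer than min_cells cells."""
    if not counts.index.equals(obs.index):
        raise ValueError("counts and obs must share the same cell index")
    grouped = counts.groupby([obs[k] for k in keys])
    n_cells = grouped.size()
    summed = grouped.sum()
    # size() and sum() share the same sorted group index, so a mask lines up.
    mask = (n_cells >= min_cells).to_numpy()
    return summed.loc[mask]


# Toy data: 7 cells x 2 genes.
counts = pd.DataFrame(
    [[1, 0], [2, 1], [0, 3], [4, 4], [1, 1], [5, 0], [7, 7]],
    columns=["GeneA", "GeneB"],
)
obs = pd.DataFrame({
    "sample":    ["s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "cell_type": ["T",  "T",  "T",  "T",  "T",  "T",  "B"],
})

pb = pseudobulk(counts, obs)
print(pb)
```

The resulting samples-by-genes table of summed raw counts is the kind of input a DESeq2-style tool expects; aggregation should be done on raw counts, not on log-normalized values.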

Statistical Tests (Quick Reference)

```python
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pearson/Spearman correlation
r, p = stats.pearsonr(gene_lengths, mean_expression)
rho, p_rho = stats.spearmanr(gene_lengths, mean_expression)

# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
```
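As a worked illustration of how these pieces fit together, the sketch below runs a per-gene Welch's t-test on two toy groups and then applies a hand-written Benjamini-Hochberg adjustment. The `bh_adjust` helper is written out only to show the arithmetic behind the `fdr_bh` method; in practice `multipletests` is the simpler choice. The data is simulated, with gene 0 deliberately shifted.

```python
import numpy as np
from scipy import stats


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a 1-D array of p-values."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted


rng = np.random.default_rng(0)
# Toy log-scale expression: 20 cells per group x 5 genes.
group1 = rng.normal(1.0, 0.5, size=(20, 5))
group2 = rng.normal(1.0, 0.5, size=(20, 5))
group2[:, 0] += 2.0  # gene 0 is truly different between groups

# axis=0 tests each gene (column) separately.
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False, axis=0)
p_adj = bh_adjust(p_val)
mean_diff = group2.mean(axis=0) - group1.mean(axis=0)  # effect size on the log scale
print(p_adj, mean_diff)
```

Reporting the adjusted p-value together with an effect size such as the mean difference follows the statistical-rigor principle stated earlier.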

---

Batch Correction (Harmony)

```python
import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
```

ToolUniverse Integration

Gene Annotation

  • HPA_search_genes_by_query: Cell-type marker gene search
  • MyGene_query_genes / MyGene_batch_query: Gene ID conversion
  • ensembl_lookup_gene: Ensembl gene details
  • UniProt_get_function_by_accession: Protein function

Cell-Cell Communication

  • OmniPath_get_ligand_receptor_interactions: L-R pairs (CellPhoneDB, CellChatDB)
  • OmniPath_get_signaling_interactions: Downstream signaling
  • OmniPath_get_complexes: Multi-subunit receptors

Enrichment (Post-DE)

  • PANTHER_enrichment: GO enrichment (BP, MF, CC)
  • STRING_functional_enrichment: Network-based enrichment
  • ReactomeAnalysis_pathway_enrichment: Reactome pathways

Scanpy vs Seurat Equivalents

| Operation | Seurat (R) | Scanpy (Python) |
|---|---|---|
| Load data | `Read10X()` | `sc.read_10x_mtx()` |
| Normalize | `NormalizeData()` | `sc.pp.normalize_total()` + `sc.pp.log1p()` |
| Find HVGs | `FindVariableFeatures()` | `sc.pp.highly_variable_genes()` |
| PCA | `RunPCA()` | `sc.tl.pca()` |
| Cluster | `FindClusters()` | `sc.tl.leiden()` |
| UMAP | `RunUMAP()` | `sc.tl.umap()` |
| Find markers | `FindMarkers()` | `sc.tl.rank_genes_groups()` |
| Batch correction | `RunHarmony()` | `harmonypy.run_harmony()` |

Troubleshooting

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: leidenalg` | `pip install leidenalg` |
| Sparse matrix errors | Convert to a dense array with `.toarray()`: `X = adata.X.toarray() if issparse(adata.X) else adata.X` (`issparse` comes from `scipy.sparse`) |
| Wrong matrix orientation | More genes than samples? Transpose |
| NaN in correlation | Filter: `valid = ~np.isnan(x) & ~np.isnan(y)` |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use `sc.pp.highly_variable_genes()` to reduce features |
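
Two of these fixes often appear together in a correlation analysis. The sketch below densifies a sparse matrix, masks NaN entries, and only then computes the correlation. The toy matrix stands in for `adata.X`, and the gene lengths are invented values with one missing entry.

```python
import numpy as np
from scipy import stats
from scipy.sparse import csr_matrix, issparse

# Toy 3 cells x 4 genes sparse matrix standing in for adata.X.
X = csr_matrix(np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 8.0],
    [2.0, 0.0, 2.0, 6.0],
]))

# Densify only if needed; NumPy reductions then return plain arrays.
X = X.toarray() if issparse(X) else X
mean_expression = X.mean(axis=0)

# Hypothetical gene lengths; one is missing.
gene_lengths = np.array([1000.0, np.nan, 1500.0, 3000.0])

# Keep only genes where both values are defined.
valid = ~np.isnan(gene_lengths) & ~np.isnan(mean_expression)
r, p = stats.pearsonr(gene_lengths[valid], mean_expression[valid])
print(int(valid.sum()), r)
```

Densifying a full single-cell matrix can exhaust memory, so on real data it is usually safer to subset to the genes or cells of interest first.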

Reference Documentation

Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)
Core Workflows:
  • references/scanpy_workflow.md - Complete scanpy pipeline
  • references/seurat_workflow.md - Seurat to Scanpy translation
  • references/clustering_guide.md - Clustering methods
  • references/marker_identification.md - Marker genes, annotation
  • references/trajectory_analysis.md - Pseudotime
  • references/cell_communication.md - OmniPath/CellPhoneDB workflow
  • references/troubleshooting.md - Detailed error solutions