tooluniverse-single-cell
Single-Cell Genomics and Expression Matrix Analysis
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
When to Use This Skill
Apply when users:
- Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
- Ask about cell type identification, clustering, or annotation
- Need differential expression analysis by cell type or condition
- Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
- Ask about PCA, UMAP, t-SNE for expression data
- Need Leiden/Louvain clustering on expression matrices
- Want statistical comparisons between cell types (t-test, ANOVA, fold change)
- Ask about marker genes, batch correction, trajectory, or cell-cell communication
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)
NOT for (use other skills instead):
- Bulk RNA-seq DESeq2 only -> tooluniverse-rnaseq-deseq2
- Gene enrichment only -> tooluniverse-gene-enrichment
- VCF/variant analysis -> tooluniverse-variant-analysis
Core Principles
- Data-first - Load, inspect, validate before analysis
- AnnData-centric - All data flows through anndata objects
- Cell type awareness - Per-cell-type subsetting when needed
- Statistical rigor - Normalization, multiple testing correction, effect sizes
- Question-driven - Parse what the user is actually asking
Required Packages
```python
import scanpy as sc, anndata as ad, pandas as pd, numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp  # enrichment
import harmonypy  # batch correction (optional)
```
Install:
```bash
pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
```
Workflow Decision Tree
START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
| Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
| See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
| Most common BixBench pattern (bix-33)
| See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
| Pattern: Gene length vs expression (bix-22)
| See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
| See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
| See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
  See: references/trajectory_analysis.md
Data format handling:
- h5ad -> sc.read_h5ad()
- 10X -> sc.read_10x_mtx() or sc.read_10x_h5()
- CSV/TSV -> pd.read_csv() -> Convert to AnnData (check orientation!)
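The format list above can be read as a small dispatch on the file name. A minimal sketch, assuming a hypothetical helper `pick_reader` that only returns the name of the suggested loader rather than calling it, so it runs without scanpy installed. The example file names are made up.

```python
from pathlib import Path

def pick_reader(path):
    """Map a path to the loader named in the list above (hypothetical helper)."""
    name = Path(path).name.lower()
    if name.endswith(".h5ad"):
        return "sc.read_h5ad"
    if name.endswith(".h5"):
        return "sc.read_10x_h5"
    if name.endswith((".mtx", ".mtx.gz")):
        # sc.read_10x_mtx is given the directory that holds the matrix file
        return "sc.read_10x_mtx"
    if name.endswith((".csv", ".tsv", ".csv.gz", ".tsv.gz")):
        # pd.read_csv needs sep="\t" for TSV; check orientation afterwards
        return "pd.read_csv"
    raise ValueError(f"Unrecognized format: {path}")

for p in ["data.h5ad", "filtered_feature_bc_matrix.h5", "run1/matrix.mtx.gz", "counts.tsv"]:
    print(p, "->", pick_reader(p))
```

Checking `.h5ad` before `.h5` keeps the two HDF5-based formats apart. Anything unrecognized raises instead of guessing.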
Data Loading
AnnData expects: cells/samples as rows (obs), genes as columns (var)
```python
adata = sc.read_h5ad("data.h5ad")  # h5ad already oriented

# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5:  # genes > samples by 5x => transpose
    df = df.T
adata = ad.AnnData(df)

# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
    adata.obs[col] = meta.loc[common, col]
```
---
Quality Control
```python
adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_genes(adata, min_cells=3)
```
See: references/scanpy_workflow.md for details
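To make the QC thresholds concrete, here is a minimal NumPy sketch of the two per-cell quantities the filters act on: genes detected per cell and the percentage of counts from mitochondrial genes. The toy matrix, the gene names, and the relaxed cutoff of 2 detected genes are made up for illustration. It mirrors the idea behind `sc.pp.calculate_qc_metrics` and is not scanpy's implementation.

```python
import numpy as np

# Toy counts: 4 cells (rows) x 5 genes (columns); names are made up
genes = np.array(["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E"])
counts = np.array([
    [ 1,  0, 10,  5,  4],   # low mitochondrial fraction
    [30, 20,  5,  3,  2],   # high mitochondrial fraction
    [ 0,  0,  7,  0,  0],   # only one gene detected
    [ 2,  1,  8,  6,  3],
])

# Case-insensitive match covers both 'MT-' (human) and 'mt-' (mouse) prefixes
is_mt = np.char.startswith(np.char.upper(genes), "MT-")

n_genes = (counts > 0).sum(axis=1)                    # genes detected per cell
total = counts.sum(axis=1)                            # total counts per cell
pct_mt = 100 * counts[:, is_mt].sum(axis=1) / total   # percent mitochondrial

# Toy cutoff of 2 genes stands in for min_genes=200 on real data
keep = (n_genes >= 2) & (pct_mt < 20)
print(n_genes, pct_mt.round(1), keep)
```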
Complete Pipeline (Quick Reference)
```python
import scanpy as sc
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
adata.var_names_make_unique()  # 10x gene symbols can repeat

# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)

# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)

# Find markers + Annotate + Per-cell-type DE
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
```
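For readers who want to see what the normalization step in the pipeline does numerically, here is a minimal NumPy sketch of library-size scaling followed by a log1p transform. It mirrors the intent of `sc.pp.normalize_total(adata, target_sum=1e4)` and `sc.pp.log1p(adata)` on a dense toy matrix. The matrix is made up and this is not scanpy's own code.

```python
import numpy as np

# Toy counts: 2 cells (rows) x 3 genes (columns)
counts = np.array([
    [10.0, 0.0, 90.0],   # 100 total counts
    [ 1.0, 1.0,  2.0],   # 4 total counts
])
target_sum = 1e4

# Scale each cell so that its counts sum to target_sum
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# Natural log of (1 + x); zeros stay zero
log_norm = np.log1p(normalized)

print(normalized.sum(axis=1))
print(log_norm.round(3))
```

After scaling, both cells have the same total, so remaining differences reflect relative expression rather than sequencing depth.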
---
Differential Expression Decision Tree
Single-Cell DE (many cells per condition):
Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
Best for: Per-cell-type DE, marker gene finding
Pseudo-Bulk DE (aggregate counts by sample):
Use: DESeq2 via PyDESeq2
Best for: Sample-level comparisons with replicates
Statistical Tests Only:
Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
Best for: Correlation, ANOVA, t-tests on summaries
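The pseudo-bulk branch aggregates raw counts per sample (and usually per cell type) before passing them to a bulk DE tool. A minimal pandas sketch of that aggregation step, assuming hypothetical metadata columns `sample` and `cell_type` and a made-up toy count table. It produces the kind of group-by-gene table a tool such as PyDESeq2 takes as input, but makes no DESeq2 call itself.

```python
import pandas as pd

# Toy raw counts: 6 cells x 2 genes, with per-cell metadata (all made up)
counts = pd.DataFrame(
    {"GeneA": [1, 2, 0, 4, 3, 5], "GeneB": [0, 1, 1, 2, 2, 0]},
    index=[f"cell{i}" for i in range(6)],
)
obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2", "s1", "s2"],
        "cell_type": ["T", "T", "T", "T", "B", "B"],
    },
    index=counts.index,
)

# Sum raw (not normalized) counts over cells sharing a (cell_type, sample) pair
pseudobulk = counts.groupby([obs["cell_type"], obs["sample"]]).sum()

# Track how many cells went into each pseudo-bulk profile
n_cells = obs.groupby(["cell_type", "sample"]).size()

print(pseudobulk)
print(n_cells)
```

Keeping `n_cells` alongside the sums makes it easy to drop groups built from very few cells before testing.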
Statistical Tests (Quick Reference)
```python
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pearson/Spearman correlation
r, p = stats.pearsonr(gene_lengths, mean_expression)
rho, p_rho = stats.spearmanr(gene_lengths, mean_expression)

# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)

# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
```
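The tests above can be combined into a per-gene comparison. The following is a minimal sketch on synthetic data: Welch's t-test gene by gene, a difference in means as the effect size, and the Benjamini-Hochberg adjustment written out in NumPy so each step is visible. The data, group sizes, and the helper name `bh_adjust` are illustrative assumptions. In practice `multipletests(pvals, method='fdr_bh')` returns the adjusted values directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes = 50
group1 = rng.normal(0.0, 1.0, size=(30, n_genes))   # 30 cells x 50 genes
group2 = rng.normal(0.0, 1.0, size=(30, n_genes))
group2[:, :5] += 3.0                                 # first 5 genes are truly shifted

# Welch's t-test per gene (axis=0 tests each column independently)
t_stat, pvals = stats.ttest_ind(group1, group2, axis=0, equal_var=False)

# Effect size: difference in group means for each gene
effect = group2.mean(axis=0) - group1.mean(axis=0)

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

padj = bh_adjust(pvals)
print("significant genes:", np.flatnonzero(padj < 0.05))
```

Reporting `effect` next to `padj` keeps the effect size visible alongside significance.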
---
Batch Correction (Harmony)
```python
import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
```
ToolUniverse Integration
Gene Annotation
- HPA_search_genes_by_query: Cell-type marker gene search
- MyGene_query_genes / MyGene_batch_query: Gene ID conversion
- ensembl_lookup_gene: Ensembl gene details
- UniProt_get_function_by_accession: Protein function
Cell-Cell Communication
- OmniPath_get_ligand_receptor_interactions: L-R pairs (CellPhoneDB, CellChatDB)
- OmniPath_get_signaling_interactions: Downstream signaling
- OmniPath_get_complexes: Multi-subunit receptors
Enrichment (Post-DE)
- PANTHER_enrichment: GO enrichment (BP, MF, CC)
- STRING_functional_enrichment: Network-based enrichment
- ReactomeAnalysis_pathway_enrichment: Reactome pathways
Scanpy vs Seurat Equivalents
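| Operation | Seurat (R) | Scanpy (Python) |
|---|---|---|
| Load data | `Read10X()` | `sc.read_10x_mtx()` |
| Normalize | `NormalizeData()` | `sc.pp.normalize_total()` + `sc.pp.log1p()` |
| Find HVGs | `FindVariableFeatures()` | `sc.pp.highly_variable_genes()` |
| PCA | `RunPCA()` | `sc.tl.pca()` |
| Cluster | `FindNeighbors()` + `FindClusters()` | `sc.pp.neighbors()` + `sc.tl.leiden()` |
| UMAP | `RunUMAP()` | `sc.tl.umap()` |
| Find markers | `FindAllMarkers()` | `sc.tl.rank_genes_groups()` |
| Batch correction | `RunHarmony()` | `harmonypy.run_harmony()` |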
Troubleshooting
| Issue | Solution |
|---|---|
| |
| Sparse matrix errors | Convert to a dense array |
| Wrong matrix orientation | Far more genes than samples? Transpose |
| NaN in correlation | Filter out invalid values first |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use |
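Two of the issues in the table can be sketched with SciPy and NumPy alone: densifying a sparse matrix before an operation that requires an array, and dropping non-finite pairs before computing a correlation. The toy values and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import sparse, stats

# Sparse matrix errors: convert to a dense array when a function needs one
X = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 3]]))
X_dense = X.toarray() if sparse.issparse(X) else np.asarray(X)
mean_expression = X_dense.mean(axis=0)

# NaN in correlation: keep only pairs where both values are finite
gene_lengths = np.array([1000.0, np.nan, 2500.0, 400.0, 1800.0])
expression = np.array([2.1, 3.0, 4.2, np.nan, 3.3])
ok = np.isfinite(gene_lengths) & np.isfinite(expression)
r, p = stats.pearsonr(gene_lengths[ok], expression[ok])

print(mean_expression, int(ok.sum()), round(float(r), 3))
```

Densifying a full single-cell matrix can use a lot of memory, so it is usually safer to subset to the genes or cells of interest first.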
Reference Documentation
Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)
Core Workflows:
- references/scanpy_workflow.md - Complete scanpy pipeline
- references/seurat_workflow.md - Seurat to Scanpy translation
- references/clustering_guide.md - Clustering methods
- references/marker_identification.md - Marker genes, annotation
- references/trajectory_analysis.md - Pseudotime
- references/cell_communication.md - OmniPath/CellPhoneDB workflow
- references/troubleshooting.md - Detailed error solutions