Scanpy: Single-Cell Analysis
Overview
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. Apply this skill for complete single-cell workflows including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis.
When to Use This Skill
This skill should be used when:
- Analyzing single-cell RNA-seq data (.h5ad, 10X, CSV formats)
- Performing quality control on scRNA-seq datasets
- Creating UMAP, t-SNE, or PCA visualizations
- Identifying cell clusters and finding marker genes
- Annotating cell types based on gene expression
- Conducting trajectory inference or pseudotime analysis
- Generating publication-quality single-cell plots
Quick Start
Basic Import and Setup
```python
import scanpy as sc
import pandas as pd
import numpy as np

# Configure settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')
sc.settings.figdir = './figures/'
```
Loading Data
```python
# From 10X Genomics
adata = sc.read_10x_mtx('path/to/data/')
adata = sc.read_10x_h5('path/to/data.h5')

# From h5ad (AnnData format)
adata = sc.read_h5ad('path/to/data.h5ad')

# From CSV
adata = sc.read_csv('path/to/data.csv')
```
Understanding AnnData Structure
The AnnData object is the core data structure in scanpy:
```python
adata.X          # Expression matrix (cells × genes)
adata.obs        # Cell metadata (DataFrame)
adata.var        # Gene metadata (DataFrame)
adata.uns        # Unstructured annotations (dict)
adata.obsm       # Multi-dimensional cell data (PCA, UMAP)
adata.raw        # Raw data backup

# Access cell and gene names
adata.obs_names  # Cell barcodes
adata.var_names  # Gene names
```
Standard Analysis Workflow
1. Quality Control
Identify and filter low-quality cells and genes:
```python
# Identify mitochondrial genes (human symbols start with 'MT-'; mouse uses 'mt-')
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)

# Visualize QC metrics
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

# Filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Remove high MT% cells; .copy() turns the view into a standalone object
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
```
**Use the QC script for automated analysis:**
```bash
python scripts/qc_analysis.py input_file.h5ad --output filtered.h5ad
```
2. Normalization and Preprocessing
```python
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)

# Log-transform
sc.pp.log1p(adata)

# Freeze the normalized, log-transformed data (all genes) in .raw for later
adata.raw = adata

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pl.highly_variable_genes(adata)

# Subset to highly variable genes (.copy() avoids working on a view)
adata = adata[:, adata.var.highly_variable].copy()

# Regress out unwanted variation
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])

# Scale data
sc.pp.scale(adata, max_value=10)
```
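The arithmetic behind the first two steps (scaling each cell to `target_sum`, then applying `log1p`) can be sketched with NumPy alone. This is an illustration on an invented 3-cell matrix, not scanpy's implementation; the helper name `normalize_total` below is a local stand-in and the counts are made up.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are invented.
counts = np.array([
    [10, 0, 5, 5],
    [100, 50, 25, 25],
    [1, 1, 1, 1],
], dtype=float)


def normalize_total(x, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    totals = x.sum(axis=1, keepdims=True)  # assumes no cell has zero counts
    return x / totals * target_sum


normalized = normalize_total(counts)
logged = np.log1p(normalized)  # natural log of (1 + x)

print(normalized.sum(axis=1))  # per-cell totals after scaling
print(logged.round(2))
```

After scaling, cells with very different sequencing depth (20, 200, and 4 counts here) become comparable, and zeros stay zero after `log1p`.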
3. Dimensionality Reduction
```python
# PCA
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)  # Check elbow plot

# Compute neighborhood graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# UMAP for visualization
sc.tl.umap(adata)
sc.pl.umap(adata)  # Pass color='leiden' once clustering (step 4) has run

# Alternative: t-SNE
sc.tl.tsne(adata)
```
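The variance ratio checked in the elbow plot can be reproduced with a plain SVD. The sketch below uses simulated data and a hypothetical 90% cumulative-variance rule of thumb to suggest `n_pcs`; both the data and the cutoff are assumptions, and reading the elbow by eye remains a valid alternative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: 200 cells x 50 genes, 3 strong latent directions plus noise.
latent = rng.normal(size=(200, 3)) * np.array([10.0, 6.0, 3.0])
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + rng.normal(scale=0.5, size=(200, 50))

Xc = X - X.mean(axis=0)                   # center each gene
s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
variance = s**2 / (Xc.shape[0] - 1)       # variance captured by each PC
variance_ratio = variance / variance.sum()
cumulative = np.cumsum(variance_ratio)

# Hypothetical heuristic: fewest PCs that explain at least 90% of the variance.
n_pcs = int(np.searchsorted(cumulative, 0.90) + 1)
print(variance_ratio[:5].round(3))
print(n_pcs)
```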
4. Clustering
```python
# Leiden clustering (recommended)
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color='leiden', legend_loc='on data')

# Try multiple resolutions to find optimal granularity
for res in [0.3, 0.5, 0.8, 1.0]:
    sc.tl.leiden(adata, resolution=res, key_added=f'leiden_{res}')
```
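One way to compare the resolutions is to count clusters and check the smallest cluster size per `leiden_*` column. The sketch below runs on an invented DataFrame standing in for `adata.obs`; the labels are made up and the summary itself is a suggestion, not a scanpy feature.

```python
import pandas as pd

# Toy stand-in for adata.obs after the resolution loop; labels are invented.
obs = pd.DataFrame({
    "leiden_0.3": ["0", "0", "0", "1", "1", "1"],
    "leiden_0.5": ["0", "0", "1", "1", "2", "2"],
    "leiden_1.0": ["0", "1", "2", "3", "4", "4"],
})

resolution_cols = [c for c in obs.columns if c.startswith("leiden_")]

# Number of clusters found at each resolution.
n_clusters = obs[resolution_cols].nunique().to_dict()

# Size of the smallest cluster; very small clusters may signal over-clustering.
smallest = {c: int(obs[c].value_counts().min()) for c in resolution_cols}

print(n_clusters)
print(smallest)
```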
5. Marker Gene Identification
```python
# Find marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Visualize results
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)

# Get results as DataFrame
markers = sc.get.rank_genes_groups_df(adata, group='0')
```
6. Cell Type Annotation
```python
# Define marker genes for known cell types
marker_genes = ['CD3D', 'CD14', 'MS4A1', 'NKG7', 'FCGR3A']

# Visualize markers
sc.pl.umap(adata, color=marker_genes, use_raw=True)
sc.pl.dotplot(adata, var_names=marker_genes, groupby='leiden')

# Manual annotation
cluster_to_celltype = {
    '0': 'CD4 T cells',
    '1': 'CD14+ Monocytes',
    '2': 'B cells',
    '3': 'CD8 T cells',
}
adata.obs['cell_type'] = adata.obs['leiden'].map(cluster_to_celltype)

# Visualize annotated types
sc.pl.umap(adata, color='cell_type', legend_loc='on data')
```
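Clusters missing from the mapping dictionary become `NaN` after `.map`, which is easy to overlook. The sketch below shows one way to detect and label them, using an invented Series in place of `adata.obs['leiden']`; the `"Unknown"` label is an assumption, not a scanpy convention.

```python
import pandas as pd

# Toy stand-in for adata.obs['leiden']; cluster '4' has no entry in the mapping.
leiden = pd.Series(["0", "1", "2", "3", "4", "0"], dtype="category")

cluster_to_celltype = {
    "0": "CD4 T cells",
    "1": "CD14+ Monocytes",
    "2": "B cells",
    "3": "CD8 T cells",
}

# Convert to plain strings so .map behaves the same regardless of dtype.
labels = leiden.astype(str)
cell_type = labels.map(cluster_to_celltype)

# Report clusters that the dictionary does not cover.
unmapped = sorted(set(labels) - set(cluster_to_celltype))
print(unmapped)

# Give uncovered clusters an explicit placeholder instead of leaving NaN.
cell_type = cell_type.fillna("Unknown").astype("category")
print(cell_type.value_counts().to_dict())
```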
7. Save Results
```python
# Save processed data
adata.write('results/processed_data.h5ad')

# Export metadata
adata.obs.to_csv('results/cell_metadata.csv')
adata.var.to_csv('results/gene_metadata.csv')
```
Common Tasks
Creating Publication-Quality Plots
```python
# Set high-quality defaults
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5))
sc.settings.file_format_figs = 'pdf'

# UMAP with custom styling
sc.pl.umap(adata, color='cell_type',
           palette='Set2',
           legend_loc='on data',
           legend_fontsize=12,
           legend_fontoutline=2,
           frameon=False,
           save='_publication.pdf')

# Heatmap of marker genes
genes = marker_genes  # Any list of gene names; here the markers from step 6
sc.pl.heatmap(adata, var_names=genes, groupby='cell_type',
              swap_axes=True, show_gene_labels=True,
              save='_markers.pdf')

# Dot plot
sc.pl.dotplot(adata, var_names=genes, groupby='cell_type',
              save='_dotplot.pdf')
```
Refer to `references/plotting_guide.md` for comprehensive visualization examples.
Trajectory Inference
```python
# PAGA (Partition-based graph abstraction)
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, color='leiden')

# Diffusion pseudotime, rooted at the first cell of cluster '0'
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden'] == '0')[0]
sc.tl.diffmap(adata)  # Diffusion map that DPT builds on
sc.tl.dpt(adata)
sc.pl.umap(adata, color='dpt_pseudotime')
```
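The root-cell expression above picks the integer position of the first cell in the chosen cluster. A small NumPy sketch with invented labels shows what it evaluates to, plus a guard for a cluster name that does not exist; the guard is an added suggestion, not part of the original workflow.

```python
import numpy as np

# Toy stand-in for adata.obs['leiden']; labels are invented.
leiden = np.array(["2", "0", "1", "0", "2"])
root_cluster = "0"

mask = leiden == root_cluster            # True for cells in the root cluster
root_candidates = np.flatnonzero(mask)   # integer positions of those cells

# Without this check, indexing an empty array raises an opaque IndexError.
if root_candidates.size == 0:
    raise ValueError(f"no cells found in cluster {root_cluster!r}")

iroot = int(root_candidates[0])          # first matching cell becomes the root
print(root_candidates, iroot)
```

Because the first cell in a cluster is arbitrary, choosing the root from a biologically motivated marker (for example a known progenitor gene) may give a more meaningful pseudotime.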
Differential Expression Between Conditions
```python
# Compare treated vs control within cell types
# .copy() lets rank_genes_groups store its results on a standalone object
adata_subset = adata[adata.obs['cell_type'] == 'T cells'].copy()
sc.tl.rank_genes_groups(adata_subset, groupby='condition',
                        groups=['treated'], reference='control')
sc.pl.rank_genes_groups(adata_subset, groups=['treated'])
```
Gene Set Scoring
```python
# Score cells for gene set expression
gene_set = ['CD3D', 'CD3E', 'CD3G']
sc.tl.score_genes(adata, gene_set, score_name='T_cell_score')
sc.pl.umap(adata, color='T_cell_score')
```
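Conceptually, a gene set score is the average expression of the set minus the average expression of a reference set. The sketch below is a deliberately simplified version that uses all remaining genes as the reference, whereas scanpy samples reference genes from expression-matched bins. The expression values are invented.

```python
import numpy as np

genes = np.array(["CD3D", "CD3E", "CD3G", "MS4A1", "CD14", "NKG7"])
# Toy log-normalized expression: 3 cells x 6 genes; values are invented.
X = np.array([
    [2.0, 1.5, 1.8, 0.0, 0.1, 0.2],  # T-cell-like
    [0.0, 0.1, 0.0, 2.2, 0.0, 0.1],  # B-cell-like
    [0.1, 0.0, 0.0, 0.0, 2.5, 0.3],  # monocyte-like
])

gene_set = ["CD3D", "CD3E", "CD3G"]
in_set = np.isin(genes, gene_set)

# Simplification: every gene outside the set serves as the reference.
score = X[:, in_set].mean(axis=1) - X[:, ~in_set].mean(axis=1)
print(score.round(3))
```

Only the T-cell-like cell gets a positive score here; the other two fall below zero because their reference genes are expressed more strongly than the gene set.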
Batch Correction
```python
# ComBat batch correction
sc.pp.combat(adata, key='batch')

# Alternative: use Harmony or scVI (separate packages)
```
Key Parameters to Adjust
Quality Control
- `min_genes`: Minimum genes per cell (typically 200-500)
- `min_cells`: Minimum cells per gene (typically 3-10)
- `pct_counts_mt`: Mitochondrial threshold (typically 5-20%)
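To make these thresholds concrete, the following NumPy-only sketch reproduces the arithmetic behind `n_genes_by_counts`, `total_counts`, and `pct_counts_mt` on an invented 3-cell matrix. It illustrates the quantities, not scanpy's implementation; the gene names, counts, and the toy `min_genes` value are assumptions chosen to fit a 5-gene example.

```python
import numpy as np

gene_names = ["MT-CO1", "MT-ND1", "CD3D", "MS4A1", "LYZ"]
# Toy raw counts: 3 cells x 5 genes; values are invented.
counts = np.array([
    [2, 1, 40, 0, 57],    # healthy-looking cell
    [30, 20, 10, 0, 40],  # high mitochondrial fraction
    [0, 0, 3, 0, 0],      # nearly empty droplet
])

is_mt = np.array([name.startswith("MT-") for name in gene_names])

n_genes_by_counts = (counts > 0).sum(axis=1)   # genes detected per cell
total_counts = counts.sum(axis=1)              # total counts per cell
pct_counts_mt = counts[:, is_mt].sum(axis=1) / total_counts * 100

# Toy thresholds; min_genes is far below realistic values because only
# 5 genes exist in this example.
min_genes, mt_threshold = 2, 5
keep = (n_genes_by_counts >= min_genes) & (pct_counts_mt < mt_threshold)

print(n_genes_by_counts, total_counts, pct_counts_mt, keep)
```

Only the first cell passes both filters: the second exceeds the mitochondrial threshold and the third has too few detected genes.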
Normalization
- `target_sum`: Target counts per cell (1e4 in this workflow; if omitted, scanpy normalizes to the median of total counts per cell)
Feature Selection
- `n_top_genes`: Number of HVGs (typically 2000-3000)
- `min_mean`, `max_mean`, `min_disp`: HVG selection parameters
Dimensionality Reduction
- `n_pcs`: Number of principal components (check variance ratio plot)
- `n_neighbors`: Number of neighbors (typically 10-30)
Clustering
- `resolution`: Clustering granularity (0.4-1.2, higher = more clusters)
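Keeping these parameters in one place makes them easy to record with the results. The sketch below is a hypothetical container using only the standard library; `AnalysisParams`, `max_pct_counts_mt`, and `out_of_range` are illustrative names, not part of scanpy, and the ranges are the typical values quoted above.

```python
from dataclasses import asdict, dataclass
import json


@dataclass
class AnalysisParams:
    """Hypothetical parameter container; not part of scanpy."""
    min_genes: int = 200
    min_cells: int = 3
    max_pct_counts_mt: float = 5.0
    target_sum: float = 1e4
    n_top_genes: int = 2000
    n_neighbors: int = 10
    n_pcs: int = 40
    resolution: float = 0.5

    def out_of_range(self):
        """List parameters that fall outside the typical ranges quoted above."""
        typical = {
            "min_genes": (200, 500),
            "min_cells": (3, 10),
            "max_pct_counts_mt": (5, 20),
            "n_top_genes": (2000, 3000),
            "n_neighbors": (10, 30),
            "resolution": (0.4, 1.2),
        }
        return [
            name for name, (low, high) in typical.items()
            if not low <= getattr(self, name) <= high
        ]


params = AnalysisParams(n_neighbors=15, resolution=1.5)
print(params.out_of_range())                  # parameters worth a second look
print(json.dumps(asdict(params), indent=2))   # record alongside the results
```

A value outside the typical range is not necessarily wrong; the check only flags settings that deserve a documented justification.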
Common Pitfalls and Best Practices
- Set `adata.raw = adata` before subsetting genes: This keeps every gene available for plotting and marker detection. In the workflow above `.raw` holds normalized, log-transformed values; to keep the true raw counts as well, copy them into a layer before normalizing (for example `adata.layers['counts'] = adata.X.copy()`)
- Check QC plots carefully: Adjust thresholds based on dataset quality
- Use Leiden over Louvain: Leiden guarantees well-connected clusters and is generally faster
- Try multiple clustering resolutions: Find optimal granularity
- Validate cell type annotations: Use multiple marker genes
- Use `use_raw=True` for gene expression plots: Shows the values stored in `adata.raw` instead of the scaled values in `adata.X`
- Check PCA variance ratio: Determine optimal number of PCs
- Save intermediate results: Long workflows can fail partway through
Bundled Resources
scripts/qc_analysis.py
Automated quality control script that calculates metrics, generates plots, and filters data:
```bash
python scripts/qc_analysis.py input.h5ad --output filtered.h5ad \
    --mt-threshold 5 --min-genes 200 --min-cells 3
```
references/standard_workflow.md
Complete step-by-step workflow with detailed explanations and code examples for:
- Data loading and setup
- Quality control with visualization
- Normalization and scaling
- Feature selection
- Dimensionality reduction (PCA, UMAP, t-SNE)
- Clustering (Leiden, Louvain)
- Marker gene identification
- Cell type annotation
- Trajectory inference
- Differential expression
Read this reference when performing a complete analysis from scratch.
references/api_reference.md
Quick reference guide for scanpy functions organized by module:
- Reading/writing data (`sc.read_*`, `adata.write_*`)
- Preprocessing (`sc.pp.*`)
- Tools (`sc.tl.*`)
- Plotting (`sc.pl.*`)
- AnnData structure and manipulation
- Settings and utilities
Use this for quick lookup of function signatures and common parameters.
references/plotting_guide.md
Comprehensive visualization guide including:
- Quality control plots
- Dimensionality reduction visualizations
- Clustering visualizations
- Marker gene plots (heatmaps, dot plots, violin plots)
- Trajectory and pseudotime plots
- Publication-quality customization
- Multi-panel figures
- Color palettes and styling
Consult this when creating publication-ready figures.
assets/analysis_template.py
Complete analysis template providing a full workflow from data loading through cell type annotation. Copy and customize this template for new analyses:
```bash
cp assets/analysis_template.py my_analysis.py
# Edit parameters and run
python my_analysis.py
```
The template includes all standard steps with configurable parameters and helpful comments.
Additional Resources
- Official scanpy documentation: https://scanpy.readthedocs.io/
- Scanpy tutorials: https://scanpy-tutorials.readthedocs.io/
- scverse ecosystem: https://scverse.org/ (related tools: squidpy, scvi-tools, cellrank)
- Best practices: Luecken & Theis (2019) "Current best practices in single-cell RNA-seq analysis: a tutorial"
Tips for Effective Analysis
- Start with the template: Use `assets/analysis_template.py` as a starting point
- Run QC script first: Use `scripts/qc_analysis.py` for initial filtering
- Consult references as needed: Load workflow and API references into context
- Iterate on clustering: Try multiple resolutions and visualization methods
- Validate biologically: Check marker genes match expected cell types
- Document parameters: Record QC thresholds and analysis settings
- Save checkpoints: Write intermediate results at key steps