single-cell-rna-qc

Single-Cell RNA-seq Quality Control

Automated QC workflow for single-cell RNA-seq data following scverse best practices.

When to Use This Skill

Use when users:
  • Request quality control or QC on single-cell RNA-seq data
  • Want to filter low-quality cells or assess data quality
  • Need QC visualizations or metrics
  • Ask to follow scverse/scanpy best practices
  • Request MAD-based filtering or outlier detection

Supported input formats:
  • `.h5ad` files (AnnData format from scanpy/Python workflows)
  • `.h5` files (10X Genomics Cell Ranger output)

Default recommendation: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.

Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)

For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:

```bash
python3 scripts/qc_analysis.py input.h5ad
```

Or, for 10X Genomics `.h5` files:

```bash
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```

The script automatically detects the file format and loads it appropriately.
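
How `qc_analysis.py` detects the format is not shown here. As a rough illustration only, one plausible approach is to dispatch on the file extension. The helper name `detect_input_format` and its return labels are hypothetical; the comments name the scanpy readers (`scanpy.read_h5ad`, `scanpy.read_10x_h5`) that would typically load each format.

```python
from pathlib import Path


def detect_input_format(path: str) -> str:
    """Classify an input file by extension.

    Hypothetical helper for illustration; not taken from qc_analysis.py.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".h5ad":
        return "h5ad"    # AnnData file; scanpy.read_h5ad can load it
    if suffix == ".h5":
        return "10x_h5"  # Cell Ranger output; scanpy.read_10x_h5 can load it
    raise ValueError(f"Unsupported input format: {path}")


print(detect_input_format("input.h5ad"))
print(detect_input_format("raw_feature_bc_matrix.h5"))
```

Extension-based dispatch is cheap but trusts the file name; a stricter check would inspect the HDF5 contents.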

**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- User wants the "just works" solution

**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy

**Parameters:**

Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Gene filtering threshold
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species

Use `--help` to see current default values.
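
For batch processing, the command line can be assembled programmatically. The sketch below only builds argument lists from flags named above; the helper `build_qc_command`, the sample file names, and the value `3` for `--mad-mt` are hypothetical placeholders, not the script's defaults.

```python
def build_qc_command(input_path, output_dir=None, mad_mt=None):
    """Build an argument list for scripts/qc_analysis.py (illustrative helper)."""
    cmd = ["python3", "scripts/qc_analysis.py", str(input_path)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if mad_mt is not None:
        cmd += ["--mad-mt", str(mad_mt)]
    return cmd


# Hypothetical sample files; replace with real dataset paths.
samples = ["sample_a.h5ad", "sample_b.h5"]
for sample in samples:
    cmd = build_qc_command(sample, output_dir=f"{sample}_qc", mad_mt=3)
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list (rather than one shell string) avoids quoting problems with paths that contain spaces.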

**Outputs:**

All files are saved to `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved

If copying outputs for user access, copy individual files (not the entire directory) so users can preview them directly.

Workflow Steps

The script performs the following steps:
  1. Calculate QC metrics - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
  2. Apply MAD-based filtering - Permissive outlier detection using MAD thresholds for counts/genes/MT%
  3. Filter genes - Remove genes detected in few cells
  4. Generate visualizations - Comprehensive before/after plots with threshold overlays
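
The MAD-based filtering in step 2 can be sketched in plain Python. This is a minimal illustration of the general technique (flag values more than `n_mads` median absolute deviations from the median), not the code in `qc_core.py`; whether the script log-transforms metrics or rescales the MAD first is not stated here, so neither is done below. The function name `mad_outliers` and the toy values are assumptions.

```python
from statistics import median


def mad_outliers(values, n_mads):
    """Return one boolean per value: True if it lies outside median ± n_mads * MAD."""
    med = median(values)
    # Median absolute deviation: median of absolute distances to the median.
    mad = median(abs(v - med) for v in values)
    lower = med - n_mads * mad
    upper = med + n_mads * mad
    return [v < lower or v > upper for v in values]


# Toy per-cell metric (for example, total counts); the last cell is extreme.
counts = [10, 11, 12, 13, 14, 100]
print(mad_outliers(counts, n_mads=5))
# → [False, False, False, False, False, True]
```

Because the median and MAD are barely affected by the extreme value itself, the threshold adapts to each dataset instead of relying on a fixed cutoff.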

Approach 2: Modular Building Blocks (For Custom Workflows)

For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:

```python
# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions  # Only if visualization needed

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here
```

**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support

**Available utility functions:**

From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics

From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots

**Example custom workflows:**

**Example 1: Only calculate metrics and visualize, don't filter yet**
```python
import anndata as ad
from qc_core import calculate_qc_metrics, print_qc_summary
from qc_plotting import plot_qc_distributions

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
```

**Example 2: Apply only MT% filtering, keep other metrics permissive**
```python
import anndata as ad
from qc_core import calculate_qc_metrics, apply_hard_threshold, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Only filter high MT% cells
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
```

**Example 3: Different thresholds for different subsets**
```python
import anndata as ad
from qc_core import calculate_qc_metrics, apply_hard_threshold

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Apply type-specific QC (assumes cell_type metadata exists)
neurons = adata.obs['cell_type'] == 'neuron'
other_cells = ~neurons

# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')
```
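
Example 3 stops once the two subset masks are computed. How they are merged back into a single per-cell decision is not shown, so the following is only a plain-Python sketch of the idea: look up a cutoff per cell type and flag each cell against its own cutoff. The function `flag_high_mt`, the `default` fallback, and the toy values are hypothetical and independent of `qc_core.py`.

```python
def flag_high_mt(cell_types, pct_mt, thresholds, default):
    """Flag each cell whose MT% exceeds the cutoff for its cell type.

    thresholds maps cell type -> MT% cutoff; default applies to unlisted types.
    """
    return [
        pct > thresholds.get(cell_type, default)
        for cell_type, pct in zip(cell_types, pct_mt)
    ]


cell_types = ["neuron", "neuron", "astrocyte", "astrocyte"]
pct_mt = [12.0, 18.0, 6.0, 9.0]

# Neurons get the looser cutoff (15), everything else the stricter one (8).
high_mt = flag_high_mt(cell_types, pct_mt, {"neuron": 15}, default=8)
print(high_mt)
# → [False, True, False, True]
```

With AnnData, the equivalent step would produce one boolean mask aligned to `adata.obs` so that it can be passed to `filter_cells`.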

Best Practices

  1. Be permissive with filtering - Default thresholds intentionally retain most cells to avoid losing rare populations
  2. Inspect visualizations - Always review before/after plots to ensure filtering makes biological sense
  3. Consider dataset-specific factors - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
  4. Check gene annotations - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
  5. Iterate if needed - QC parameters may need adjustment based on the specific experiment or tissue type
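
The species point in item 4 is easy to get wrong silently. A minimal sketch, assuming a simple case-sensitive prefix match (the actual matching rule behind `--mt-pattern` is not documented here; `flag_mito_genes` is a hypothetical helper):

```python
def flag_mito_genes(gene_names, prefix):
    """Return True for each gene whose name starts with the given prefix."""
    return [name.startswith(prefix) for name in gene_names]


human_genes = ["MT-CO1", "MT-ND1", "ACTB"]
mouse_genes = ["mt-Co1", "mt-Nd1", "Actb"]

print(flag_mito_genes(human_genes, "MT-"))  # → [True, True, False]
print(flag_mito_genes(mouse_genes, "mt-"))  # → [True, True, False]

# Using the human prefix on mouse data matches nothing, so MT% would be 0 for
# every cell and the mitochondrial filter would have no effect.
print(flag_mito_genes(mouse_genes, "MT-"))  # → [False, False, False]
```

A mitochondrial percentage of zero across all cells is therefore a sign that the pattern does not match the annotation, not that the data are clean.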

Reference Materials

For detailed QC methodology, parameter rationale, and troubleshooting guidance, see `references/scverse_qc_guidelines.md`. This reference provides:
  • Detailed explanations of each QC metric and why it matters
  • Rationale for MAD-based thresholds and why they're better than fixed cutoffs
  • Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
  • Species-specific considerations for gene annotations
  • When and how to adjust filtering parameters
  • Advanced QC considerations (ambient RNA correction, doublet detection)
Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.

Next Steps After QC

Typical downstream analysis steps:
  • Ambient RNA correction (SoupX, CellBender)
  • Doublet detection (scDblFinder)
  • Normalization (log-normalize, scran)
  • Feature selection and dimensionality reduction
  • Clustering and cell type annotation