single-cell-rna-qc

Single-Cell RNA-seq Quality Control

Automated QC workflow for single-cell RNA-seq data following scverse best practices.

When to Use This Skill

Use when users:

- Request quality control or QC on single-cell RNA-seq data
- Want to filter low-quality cells or assess data quality
- Need QC visualizations or metrics
- Ask to follow scverse/scanpy best practices
- Request MAD-based filtering or outlier detection

Supported input formats:

- `.h5ad` files (AnnData format from scanpy/Python workflows)
- `.h5` files (10X Genomics Cell Ranger output)

Default recommendation: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.

Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)

For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:

```bash
python3 scripts/qc_analysis.py input.h5ad
```

Or, for 10X Genomics `.h5` files:

```bash
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```


The script automatically detects the file format and loads it appropriately.
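
One plausible way to implement this kind of detection is to dispatch on the file extension. The sketch below is an assumption about how such a loader could look, not the script's actual code: `detect_format` and `load_any` are hypothetical names, while `scanpy.read_h5ad` and `scanpy.read_10x_h5` are the standard scanpy readers for the two formats.

```python
from pathlib import Path


def detect_format(path):
    """Classify an input file by its extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".h5ad":
        return "h5ad"
    if suffix == ".h5":
        return "10x_h5"
    raise ValueError(f"Unsupported input format: {path}")


def load_any(path):
    """Load either supported format into an AnnData object."""
    import scanpy as sc  # imported lazily so detect_format works without scanpy

    if detect_format(path) == "h5ad":
        return sc.read_h5ad(path)
    adata = sc.read_10x_h5(path)
    adata.var_names_make_unique()  # 10X gene symbols can repeat
    return adata
```

Checking the extension first means an unsupported file fails with a clear error before any data is read.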

**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- User wants the "just works" solution

**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy

**Parameters:**

Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Gene filtering threshold
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species

Use `--help` to see current default values.
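
For batch processing, the flags above can be assembled programmatically. This is a minimal sketch, assuming the script is invoked from the repository root; `build_qc_command` is a hypothetical helper, and the threshold values are illustrative only, not the script's defaults.

```python
import shlex


def build_qc_command(input_path, output_dir=None, **thresholds):
    """Assemble an argv list for scripts/qc_analysis.py.

    Keyword names map to flags, e.g. mad_mt=3 becomes "--mad-mt 3".
    """
    cmd = ["python3", "scripts/qc_analysis.py", str(input_path)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    for name, value in thresholds.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


# Illustrative values only; run the script with --help for the real defaults.
cmd = build_qc_command("input.h5ad", output_dir="qc_out", mad_mt=3, mt_threshold=8)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, once per dataset.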

**Outputs:**

All files are saved to the `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved

If copying outputs for user access, copy individual files (not the entire directory) so users can preview them directly.

Workflow Steps

The script performs the following steps:

1. **Calculate QC metrics** - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
2. **Apply MAD-based filtering** - Permissive outlier detection using MAD thresholds for counts/genes/MT%
3. **Filter genes** - Remove genes detected in few cells
4. **Generate visualizations** - Comprehensive before/after plots with threshold overlays
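
The core idea behind MAD-based filtering can be sketched in a few lines: a cell is an outlier on a metric when its value lies more than `n_mads` median absolute deviations from the median. The function below is a plain-Python illustration under that assumption; `mad_outliers` is a hypothetical name, and whether the script log-transforms metrics or applies one-sided cutoffs is not stated here.

```python
from statistics import median


def mad_outliers(values, n_mads):
    """Flag values more than n_mads MADs away from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [abs(v - center) > n_mads * mad for v in values]


# Toy mitochondrial percentages: four typical cells and one extreme cell.
pct_mt = [1.0, 2.0, 3.0, 4.0, 100.0]
mask = mad_outliers(pct_mt, n_mads=5)
```

Because both the center and the spread are medians, the single extreme cell does not inflate the threshold the way it would with a mean and standard deviation.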

Approach 2: Modular Building Blocks (For Custom Workflows)

For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:

```python
# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions  # Only if visualization needed

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here
```


**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support

**Available utility functions:**

From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics

From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots

**Example custom workflows:**

**Example 1: Only calculate metrics and visualize, don't filter yet**
```python
import anndata as ad
from qc_core import calculate_qc_metrics, print_qc_summary
from qc_plotting import plot_qc_distributions

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
```

**Example 2: Apply only MT% filtering, keep other metrics permissive**
```python
import anndata as ad
from qc_core import apply_hard_threshold, calculate_qc_metrics, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Only filter high MT% cells
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
```

**Example 3: Different thresholds for different subsets**
```python
import anndata as ad
from qc_core import apply_hard_threshold, calculate_qc_metrics

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Apply type-specific QC (assumes cell_type metadata exists)
neurons = adata.obs['cell_type'] == 'neuron'
other_cells = ~neurons

# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')
```
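
The subset-specific example yields one mask per subset, and filtering the full dataset requires a single per-cell decision. The sketch below shows that logic in plain Python, independent of the `qc_core` helpers; `high_mt_mask` is a hypothetical name, and with AnnData the inputs would come from `adata.obs['cell_type']` and `adata.obs['pct_counts_mt']`.

```python
def high_mt_mask(cell_types, pct_mt, thresholds, default):
    """Return True for cells whose MT% exceeds the cutoff for their type."""
    return [
        pct > thresholds.get(cell_type, default)
        for cell_type, pct in zip(cell_types, pct_mt)
    ]


cell_types = ["neuron", "neuron", "astrocyte", "astrocyte"]
pct_mt = [12.0, 18.0, 6.0, 9.0]

# Neurons use a 15% cutoff; every other type falls back to 8%.
remove = high_mt_mask(cell_types, pct_mt, thresholds={"neuron": 15}, default=8)
keep = [not flag for flag in remove]
```

Looking up the cutoff per cell keeps the mask aligned with the original cell order, so it can be applied to the unsplit dataset.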

Best Practices

1. **Be permissive with filtering** - Default thresholds intentionally retain most cells to avoid losing rare populations
2. **Inspect visualizations** - Always review before/after plots to ensure filtering makes biological sense
3. **Consider dataset-specific factors** - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
4. **Check gene annotations** - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
5. **Iterate if needed** - QC parameters may need adjustment based on the specific experiment or tissue type
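
The gene-annotation point is easy to miss because a wrong prefix fails silently: no genes match, so the mitochondrial percentage comes out as zero for every cell. A minimal illustration, assuming a case-sensitive prefix match (whether `--mt-pattern` is treated as a prefix or a regular expression is not specified here, and `flag_mito_genes` is a hypothetical helper):

```python
def flag_mito_genes(gene_names, prefix):
    """Mark genes whose symbol starts with the given prefix (case-sensitive)."""
    return [name.startswith(prefix) for name in gene_names]


human_genes = ["MT-CO1", "ACTB", "MT-ND1"]
mouse_genes = ["mt-Co1", "Actb", "mt-Nd1"]

human_hits = flag_mito_genes(human_genes, "MT-")
mouse_hits_wrong = flag_mito_genes(mouse_genes, "MT-")  # human prefix on mouse data
mouse_hits = flag_mito_genes(mouse_genes, "mt-")
```

A quick sanity check after computing metrics is to confirm that at least one gene was flagged as mitochondrial.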

Reference Materials

For detailed QC methodology, parameter rationale, and troubleshooting guidance, see `references/scverse_qc_guidelines.md`. This reference provides:

- Detailed explanations of each QC metric and why it matters
- Rationale for MAD-based thresholds and why they're better than fixed cutoffs
- Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
- Species-specific considerations for gene annotations
- When and how to adjust filtering parameters
- Advanced QC considerations (ambient RNA correction, doublet detection)

Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.

Next Steps After QC

Typical downstream analysis steps:

- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation
