single-cell-rna-qc
Single-Cell RNA-seq Quality Control
Automated QC workflow for single-cell RNA-seq data following scverse best practices.
When to Use This Skill
Use when users:
- Request quality control or QC on single-cell RNA-seq data
- Want to filter low-quality cells or assess data quality
- Need QC visualizations or metrics
- Ask to follow scverse/scanpy best practices
- Request MAD-based filtering or outlier detection
Supported input formats:
- `.h5ad` files (AnnData format from scanpy/Python workflows)
- `.h5` files (10X Genomics Cell Ranger output)
Default recommendation: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.
Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)
For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:
```bash
python3 scripts/qc_analysis.py input.h5ad
```
or for 10X Genomics .h5 files:
```bash
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```
The script automatically detects the file format and loads it appropriately.
**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- User wants the "just works" solution
**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy
**Parameters:**
Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Gene filtering threshold
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species
Use `--help` to see current default values.
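For batch processing, the flags above can be assembled programmatically. The following is a minimal sketch, not part of the skill itself: the helper name `build_qc_command`, the input file names, the output directory names, and the threshold values are all hypothetical and chosen only for illustration. Only flags listed above are used; check `--help` for the script's actual defaults.

```python
from pathlib import Path


def build_qc_command(input_path, output_dir=None, **thresholds):
    """Build an argv list for scripts/qc_analysis.py (hypothetical helper).

    Keyword arguments become flags, e.g. mad_mt=3 -> --mad-mt 3.
    """
    cmd = ["python3", "scripts/qc_analysis.py", str(input_path)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    for name, value in thresholds.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


# Illustrative inputs and values, not the script's defaults.
inputs = [Path("sample_a.h5ad"), Path("sample_b.h5")]
commands = [
    build_qc_command(p, output_dir=f"{p.stem}_qc", mad_mt=3, mt_threshold=8)
    for p in inputs
]
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)` to run the pipeline on that dataset.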
**Outputs:**
All files are saved to `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved
If copying outputs for user access, copy individual files (not the entire directory) so users can preview them directly.
Workflow Steps
The script performs the following steps:
- Calculate QC metrics - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
- Apply MAD-based filtering - Permissive outlier detection using MAD thresholds for counts/genes/MT%
- Filter genes - Remove genes detected in few cells
- Generate visualizations - Comprehensive before/after plots with threshold overlays
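The MAD-based filtering step can be illustrated with a small standalone sketch. This is not the script's own `detect_outliers_mad`, whose exact behavior may differ. It assumes the common rule of flagging a value as an outlier when it lies more than `n_mads` median absolute deviations from the median, and it assumes count depth is log-transformed first. The function name `mad_outliers` and the toy data are hypothetical.

```python
import numpy as np


def mad_outliers(values, n_mads):
    """Flag values more than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad


# Toy total counts per cell; the last cell is an obvious outlier.
total_counts = np.array([900, 1000, 1100, 950, 1050, 20000])

# Assumption: depth metrics are log-transformed before applying MAD.
mask = mad_outliers(np.log1p(total_counts), n_mads=5)
print(mask)
```

Because the threshold is derived from the data's own median and spread, it adapts to each dataset, whereas a fixed cutoff does not.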
Approach 2: Modular Building Blocks (For Custom Workflows)
For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:
```python
# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions  # Only if visualization needed

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here
```
**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support
**Available utility functions:**
From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics
From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots
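Several of these functions return boolean masks, which can be combined before filtering. The sketch below shows that pattern on plain NumPy arrays so it runs without the skill's scripts. `hard_threshold_mask` is a hypothetical stand-in that only mimics the documented `operator` argument of `apply_hard_threshold`; it is not the real implementation, and the values and cutoffs are illustrative.

```python
import operator as op

import numpy as np

# Map operator strings to comparison functions (hypothetical stand-in).
_COMPARATORS = {">": op.gt, ">=": op.ge, "<": op.lt, "<=": op.le}


def hard_threshold_mask(values, threshold, operator=">"):
    """Return True where `values <operator> threshold` holds."""
    return _COMPARATORS[operator](np.asarray(values), threshold)


# Toy per-cell metrics.
pct_mt = np.array([2.0, 12.5, 7.9, 30.1])
n_genes = np.array([1500, 2200, 150, 1800])

high_mt = hard_threshold_mask(pct_mt, 10, operator=">")
low_genes = hard_threshold_mask(n_genes, 200, operator="<")

# A cell is removed if any criterion flags it; keep is the complement.
remove = high_mt | low_genes
keep = ~remove
print(keep)
```

Keeping each criterion as its own mask makes it easy to report how many cells each filter removes before combining them.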
**Example custom workflows:**
**Example 1: Only calculate metrics and visualize, don't filter yet**
```python
import anndata as ad
from qc_core import calculate_qc_metrics, print_qc_summary
from qc_plotting import plot_qc_distributions

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
```
**Example 2: Apply only MT% filtering, keep other metrics permissive**
```python
import anndata as ad
from qc_core import apply_hard_threshold, calculate_qc_metrics, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Only filter high MT% cells
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
```
**Example 3: Different thresholds for different subsets**
```python
import anndata as ad
from qc_core import apply_hard_threshold, calculate_qc_metrics

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Apply type-specific QC (assumes cell_type metadata exists)
neurons = adata.obs['cell_type'] == 'neuron'
other_cells = ~neurons

# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')
```
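The subset example stops at two per-subset masks. One way to turn them into a single filter over all cells is sketched below on plain NumPy arrays, so it runs without the skill's scripts. The toy arrays stand in for `adata.obs` columns, and the sketch assumes each subset mask has one entry per cell in that subset, in the subset's order.

```python
import numpy as np

# Toy per-cell annotations standing in for adata.obs columns.
cell_type = np.array(["neuron", "astrocyte", "neuron", "microglia"])
pct_counts_mt = np.array([12.0, 9.0, 18.0, 5.0])

neurons = cell_type == "neuron"
other_cells = ~neurons

# Per-subset masks have one entry per cell in the subset.
neuron_qc = pct_counts_mt[neurons] > 15
other_qc = pct_counts_mt[other_cells] > 8

# Scatter the subset results back into a mask over all cells.
fail_qc = np.zeros(cell_type.shape[0], dtype=bool)
fail_qc[neurons] = neuron_qc
fail_qc[other_cells] = other_qc

keep = ~fail_qc
print(keep)
```

With the real objects, the final step would presumably be `filter_cells(adata, ~fail_qc)`, provided `apply_hard_threshold` returns a mask aligned to the subset it was given.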
Best Practices
- Be permissive with filtering - Default thresholds intentionally retain most cells to avoid losing rare populations
- Inspect visualizations - Always review before/after plots to ensure filtering makes biological sense
- Consider dataset-specific factors - Some tissues and cell types naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
- Check gene annotations - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
- Iterate if needed - QC parameters may need adjustment based on the specific experiment or tissue type
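The gene-annotation point can be made concrete with a short sketch of how a mitochondrial prefix match feeds into the MT% metric. This is an illustration on toy data, not the skill's `calculate_qc_metrics`. The gene list deliberately mixes human-style and mouse-style names to show the matching; a real dataset would contain one species.

```python
import numpy as np

gene_names = np.array(["MT-CO1", "ACTB", "mt-Nd1", "MTOR", "RPS6"])

# Rows are cells, columns are genes (toy counts).
counts = np.array([
    [10, 80, 0, 5, 5],
    [50, 30, 10, 5, 5],
])

# Case-insensitive prefix match covers both "MT-" (human) and "mt-" (mouse).
# The trailing hyphen keeps nuclear genes such as MTOR out of the set.
is_mt = np.char.startswith(np.char.upper(gene_names), "MT-")

# Percentage of each cell's counts that come from mitochondrial genes.
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / counts.sum(axis=1)
print(is_mt, pct_counts_mt)
```

If the pattern matches no genes, MT% is zero for every cell and the mitochondrial filter silently does nothing, so it is worth checking `is_mt.sum()` before trusting the metric.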
Reference Materials
For detailed QC methodology, parameter rationale, and troubleshooting guidance, see `references/scverse_qc_guidelines.md`. This reference provides:
- Detailed explanations of each QC metric and why it matters
- Rationale for MAD-based thresholds and why they're better than fixed cutoffs
- Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
- Species-specific considerations for gene annotations
- When and how to adjust filtering parameters
- Advanced QC considerations (ambient RNA correction, doublet detection)
Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.
Next Steps After QC
Typical downstream analysis steps:
- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation