Single-cell annotation skills with omicverse
Overview
Use this skill to reproduce and adapt the single-cell annotation playbook captured in omicverse tutorials: SCSA (`t_cellanno.ipynb`), MetaTiME (`t_metatime.ipynb`), CellVote (`t_cellvote.md` & `t_cellvote_pbmc3k.ipynb`), CellMatch (`t_cellmatch.ipynb`), GPTAnno (`t_gptanno.ipynb`), and label transfer (`t_anno_trans.ipynb`). Each section below highlights required inputs, training/inference steps, and how to read the outputs.
Instructions
- SCSA automated cluster annotation
  - Data requirements: PBMC3k raw counts from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) or the processed `sample/rna.h5ad`. Download instructions are embedded in the notebook; unpack to `data/filtered_gene_bc_matrices/hg19/`. Ensure an SCSA SQLite database is available (e.g. `pySCSA_2024_v1_plus.db` from the Figshare/Drive links listed in the tutorial) and point `model_path` to its location.
  - Preprocessing & model fit: Load with `sc.read_10x_mtx`, run QC (`ov.pp.qc`), normalization and HVG selection (`ov.pp.preprocess`), scaling (`ov.pp.scale`), PCA (`ov.pp.pca`), neighbors, Leiden clustering, and compute rank markers (`sc.tl.rank_genes_groups`). Instantiate `scsa = ov.single.pySCSA(...)` choosing `target='cellmarker'` or `'panglaodb'`, tissue scope, and thresholds (`foldchange`, `pvalue`).
  - Inference & interpretation: Call `scsa.cell_anno(clustertype='leiden', result_key='scsa_celltype_cellmarker')` or `scsa.cell_auto_anno` to append predictions to `adata.obs`. Compare to manual marker-based labels via `ov.utils.embedding` or `sc.pl.dotplot`, inspect marker dictionaries (`ov.single.get_celltype_marker`), and query supported tissues with `scsa.get_model_tissue()`. Use the ROI/ROE helpers (`ov.utils.roe`, `ov.utils.plot_cellproportion`) to validate abundance trends.
- MetaTiME tumour microenvironment states
  - Data requirements: Batched TME AnnData with an scVI latent embedding. The tutorial uses `TiME_adata_scvi.h5ad` from Figshare (`https://figshare.com/ndownloader/files/41440050`). If starting from counts, run scVI (`scvi.model.SCVI`) first to populate `adata.obsm['X_scVI']`.
  - Preprocessing & model fit: Optionally subset to non-malignant cells via `adata.obs['isTME']`. Rebuild neighbors on the latent representation (`sc.pp.neighbors(adata, use_rep="X_scVI")`) and embed with pymde (`adata.obsm['X_mde'] = ov.utils.mde(...)`). Initialise `TiME_object = ov.single.MetaTiME(adata, mode='table')` and, if finer granularity is desired, over-cluster with `TiME_object.overcluster(resolution=8, clustercol='overcluster')`.
  - Inference & interpretation: Run `TiME_object.predictTiME(save_obs_name='MetaTiME')` to assign minor states and `Major_MetaTiME`. Visualise using `TiME_object.plot` or `sc.pl.embedding`. Interpret the outputs by comparing cluster-level distributions and confirming that the MetaTiME and Major_MetaTiME columns align with expected niches.
- CellVote consensus labelling
  - Data requirements: A clustered AnnData (e.g. PBMC3k stored as the `CELLVOTE_PBMC3K` env var or `data/pbmc3k.h5ad`) plus at least two precomputed annotation columns (simulated in the tutorial as `scsa_annotation`, `gpt_celltype`, `gbi_celltype`). Prepare per-cluster marker genes via `sc.tl.rank_genes_groups`.
  - Preprocessing & model fit: After standard preprocessing (normalize, log1p, HVGs, PCA, neighbors, Leiden) build a marker dictionary with `marker_dict = top_markers_from_rgg(adata, 'leiden', topn=10)` or via `ov.single.get_celltype_marker`. Instantiate `cv = ov.single.CellVote(adata)`.
  - Inference & interpretation: Call `cv.vote(clusters_key='leiden', cluster_markers=marker_dict, celltype_keys=[...], species='human', organization='PBMC', provider='openai', model='gpt-4o-mini')`. Offline examples monkey-patch arbitration to avoid API calls; online voting requires valid credentials. Final consensus labels live in `adata.obs['CellVote_celltype']`. Compare each cluster's majority vote with the input sources (`adata.obs[['leiden', 'scsa_annotation', ...]]`) to justify decisions.
- CellMatch ontology mapping
  - Data requirements: Annotated AnnData such as `pertpy.dt.haber_2017_regions()` with `adata.obs['cell_label']`. Download Cell Ontology JSON (`cl.json`) via `ov.single.download_cl(...)` or manual links, and optionally Cell Taxonomy resources (`Cell_Taxonomy_resource.txt`). Ensure access to a SentenceTransformer model (`sentence-transformers/all-MiniLM-L6-v2`, `BAAI/bge-base-en-v1.5`, etc.), downloading to `local_model_dir` if offline.
  - Preprocessing & model fit: Create the mapper with `ov.single.CellOntologyMapper(cl_obo_file='new_ontology/cl.json', model_name='sentence-transformers/all-MiniLM-L6-v2', local_model_dir='./my_models')`. Run `mapper.map_adata(...)` to assign ontology-derived labels/IDs, optionally enabling taxonomy matching (`use_taxonomy=True` after calling `load_cell_taxonomy_resource`).
  - Inference & interpretation: Explore mapping summaries (`mapper.print_mapping_summary_taxonomy`) and inspect embeddings coloured by `cell_ontology`, `cell_ontology_cl_id`, or `enhanced_cell_ontology`. Use helper queries such as `mapper.find_similar_cells('T helper cell')`, `mapper.get_cell_info(...)`, and category browsing to validate ontology coverage.
- GPTAnno LLM-powered annotation
  - Data requirements: The same PBMC3k dataset (raw matrix or `.h5ad`) and cluster assignments. Access to an LLM endpoint: configure `AGI_API_KEY` for OpenAI-compatible providers (`provider='openai'`, `'qwen'`, `'kimi'`, etc.), or supply a local model path for `ov.single.gptcelltype_local`.
  - Preprocessing & model fit: Follow the QC, normalization, HVG, scaling, PCA, neighbor, Leiden, and marker discovery steps described above (reusing outputs from the SCSA workflow). Build the marker dictionary automatically with `ov.single.get_celltype_marker(adata, clustertype='leiden', rank=True, key='rank_genes_groups', foldchange=2, topgenenumber=5)`.
  - Inference & interpretation: Invoke `ov.single.gptcelltype(...)` specifying tissue/species context and desired provider/model. Post-process responses to keep clean labels (`result[key].split(': ')[-1]...`) and write them to `adata.obs['gpt_celltype']`. Compare embeddings (`ov.pl.embedding(..., color=['leiden','gpt_celltype'])`) to verify cluster identities. If operating offline, call `ov.single.gptcelltype_local` with a downloaded instruction-tuned checkpoint.
- Weighted KNN annotation transfer
  - Data requirements: Cross-modal GLUE outputs with aligned embeddings, e.g. `data/analysis_lymph/rna-emb.h5ad` (annotated RNA) and `data/analysis_lymph/atac-emb.h5ad` (query ATAC) where both contain `obsm['X_glue']`.
  - Preprocessing & model fit: Load both modalities, optionally concatenate for QC plots, and compute a shared low-dimensional embedding with `ov.utils.mde`. Train a neighbour model using `ov.utils.weighted_knn_trainer(train_adata=rna, train_adata_emb='X_glue', n_neighbors=15)`.
  - Inference & interpretation: Transfer labels via `labels, uncert = ov.utils.weighted_knn_transfer(query_adata=atac, query_adata_emb='X_glue', label_keys='major_celltype', knn_model=knn_transformer, ref_adata_obs=rna.obs)`. Store predictions in `atac.obs['transf_celltype']` and uncertainties in `atac.obs['transf_celltype_unc']`; copy to `major_celltype` if you want consistent naming. Visualise (`ov.utils.embedding`) and inspect uncertainty to flag ambiguous cells.
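
To make the weighted KNN transfer step more concrete, here is a minimal pure-Python sketch of the idea: each query cell takes the label with the largest inverse-distance-weighted vote among its nearest reference cells, and the uncertainty is one minus the winning share of the vote. This is an illustrative stand-in, not the implementation behind `ov.utils.weighted_knn_transfer`; the function name, the inverse-distance weighting, and the toy 2-D embeddings are all assumptions.

```python
import math
from collections import defaultdict


def weighted_knn_transfer_sketch(ref_emb, ref_labels, query_emb, k=3):
    """Hypothetical helper: return (labels, uncertainties) per query point."""
    labels, uncertainties = [], []
    for q in query_emb:
        # Keep the k reference cells closest to this query cell.
        nearest = sorted(
            (math.dist(q, r), lab) for r, lab in zip(ref_emb, ref_labels)
        )[:k]
        # Inverse-distance weights; the epsilon avoids division by zero.
        votes = defaultdict(float)
        for dist, lab in nearest:
            votes[lab] += 1.0 / (dist + 1e-8)
        best = max(votes, key=votes.get)
        labels.append(best)
        # Uncertainty = share of the vote that did NOT go to the winner.
        uncertainties.append(1.0 - votes[best] / sum(votes.values()))
    return labels, uncertainties


# Toy stand-ins for the shared embedding (e.g. X_glue) and reference labels.
ref_emb = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
query_emb = [(0.05, 0.1), (4.9, 5.2), (2.5, 2.5)]

labels, uncert = weighted_knn_transfer_sketch(ref_emb, ref_labels, query_emb)
```

The third query point sits halfway between the two reference groups, so its uncertainty is much higher than that of the first two; cells like this are the ones worth flagging as ambiguous.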
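
For the CellVote item, comparing each cluster's majority vote with the input annotation columns can be sketched without any API calls. The snippet below is a simplified, offline stand-in: it counts labels per cluster and reports how strongly the sources agree. It does not reproduce CellVote's LLM arbitration; the helper name and the toy labels (imagined as coming from `scsa_annotation`, `gpt_celltype`, and `gbi_celltype`) are assumptions.

```python
from collections import Counter


def majority_vote(cluster_annotations):
    """Hypothetical helper: cluster id -> (winning label, agreement fraction)."""
    consensus = {}
    for cluster, labels in cluster_annotations.items():
        label, n_votes = Counter(labels).most_common(1)[0]
        consensus[cluster] = (label, n_votes / len(labels))
    return consensus


# One label per annotation source for each Leiden cluster (toy values).
cluster_annotations = {
    "0": ["CD4 T cell", "CD4 T cell", "T cell"],
    "1": ["B cell", "B cell", "B cell"],
    "2": ["Monocyte", "Macrophage", "Monocyte"],
}

consensus = majority_vote(cluster_annotations)
# Clusters where the sources disagree deserve a closer look at their markers.
low_agreement = [c for c, (_, frac) in consensus.items() if frac < 1.0]
```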
Critical API Reference - EXACT Function Signatures

pySCSA - IMPORTANT: Parameter is `clustertype`, NOT `cluster`

**CORRECT usage:**
```python
import omicverse as ov

# Step 1: Initialize pySCSA
scsa = ov.single.pySCSA(
    adata,
    foldchange=1.5,
    pvalue=0.01,
    species='Human',
    tissue='All',
    target='cellmarker'  # or 'panglaodb'
)

# Step 2: Run annotation - NOTE: use clustertype='leiden', NOT cluster='leiden'!
anno_result = scsa.cell_anno(clustertype='leiden', cluster='all')

# Step 3: Add cell type labels to adata.obs
scsa.cell_auto_anno(adata, clustertype='leiden', key='scsa_celltype')

# Results are stored in adata.obs['scsa_celltype']
```

**WRONG - DO NOT USE:**
```python
# WRONG! 'cluster' is NOT a valid parameter for cell_auto_anno!
scsa.cell_auto_anno(adata, cluster='leiden')  # ERROR!
```
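
One way to catch the `clustertype`/`cluster` mix-up before calling pySCSA is a small pre-flight check. The sketch below is hypothetical and not part of omicverse: it takes the obs column names and the known cluster IDs as plain Python collections, and it assumes `cluster` may be `'all'`, a single ID, or a list of IDs.

```python
def check_scsa_args(obs_columns, cluster_ids, clustertype, cluster="all"):
    """Hypothetical pre-flight check for the two easily confused arguments."""
    # clustertype must name the obs COLUMN that holds the cluster labels.
    if clustertype not in obs_columns:
        raise KeyError(
            f"clustertype={clustertype!r} is not an obs column; "
            f"available columns: {sorted(obs_columns)}"
        )
    # cluster selects WHICH clusters to annotate within that column.
    if cluster != "all":
        requested = [cluster] if isinstance(cluster, str) else list(cluster)
        missing = [c for c in requested if c not in cluster_ids]
        if missing:
            raise ValueError(f"unknown cluster IDs: {missing}")
    return True


obs_columns = {"leiden", "n_genes", "pct_counts_mt"}
cluster_ids = {"0", "1", "2"}

ok = check_scsa_args(obs_columns, cluster_ids, clustertype="leiden")

# Passing a cluster ID where the column name belongs is rejected early.
try:
    check_scsa_args(obs_columns, cluster_ids, clustertype="0")
    swapped_rejected = False
except KeyError:
    swapped_rejected = True
```

With a real AnnData object, `obs_columns` would come from `adata.obs.columns` and `cluster_ids` from the unique values of the cluster column.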
COSG Marker Genes - Results stored in adata.uns, NOT adata.obs

**CORRECT usage:**
```python
# Step 1: Run COSG marker gene identification
ov.single.cosg(adata, groupby='leiden', n_genes_user=50)

# Step 2: Access results from adata.uns (NOT adata.obs!)
marker_names = adata.uns['rank_genes_groups']['names']  # DataFrame with cluster columns
marker_scores = adata.uns['rank_genes_groups']['scores']

# Step 3: Get top markers for specific cluster
cluster_0_markers = adata.uns['rank_genes_groups']['names']['0'][:10].tolist()

# Step 4: To create celltype column, manually map clusters to cell types
cluster_to_celltype = {
    '0': 'T cells',
    '1': 'B cells',
    '2': 'Monocytes',
}
adata.obs['cosg_celltype'] = adata.obs['leiden'].map(cluster_to_celltype)
```

**WRONG - DO NOT USE:**
```python
# WRONG! COSG does NOT create adata.obs columns directly!
adata.obs['cosg_celltype']  # This key does NOT exist after running COSG!
adata.uns['cosg_celltype']  # This key also does NOT exist!
```
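
The two manual steps in the COSG workflow, collecting top markers per cluster and mapping clusters to cell types, can be tried out on plain Python data before touching an AnnData object. In this sketch the `names` dictionary is a toy stand-in for `adata.uns['rank_genes_groups']['names']`, and both helper functions are hypothetical. Unlike a bare pandas `.map`, which leaves unmapped clusters as missing values, the mapper here fills them with an explicit placeholder so they are easy to spot.

```python
def top_markers(names, topn=10):
    """Hypothetical helper: cluster id -> list of its top-ranked marker genes."""
    return {cluster: list(genes[:topn]) for cluster, genes in names.items()}


def map_clusters(cluster_labels, cluster_to_celltype, default="Unknown"):
    """Hypothetical helper: translate per-cell cluster IDs into cell types."""
    return [cluster_to_celltype.get(c, default) for c in cluster_labels]


# Toy stand-in for per-cluster ranked gene names (best marker first).
names = {
    "0": ["CD3D", "CD3E", "IL7R"],
    "1": ["MS4A1", "CD79A", "CD79B"],
    "2": ["LYZ", "S100A8", "CD14"],
}
markers = top_markers(names, topn=2)

cluster_to_celltype = {"0": "T cells", "1": "B cells"}
per_cell_clusters = ["0", "1", "2", "0"]  # stand-in for adata.obs['leiden']
celltypes = map_clusters(per_cell_clusters, cluster_to_celltype)
unmapped = sorted(set(per_cell_clusters) - set(cluster_to_celltype))
```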
Common Pitfalls to Avoid
- pySCSA parameter confusion:
  - `clustertype` = which obs column contains cluster labels (e.g., 'leiden')
  - `cluster` = which specific clusters to annotate ('all' or specific cluster IDs)
  - These are DIFFERENT parameters!
- COSG result access:
  - COSG is a marker gene finder, NOT a cell type annotator
  - Results are per-cluster gene rankings stored in `adata.uns['rank_genes_groups']`
  - To assign cell types, you must manually map clusters to cell types based on markers
- Result storage patterns in OmicVerse:
  - Cell type annotations → `adata.obs['<key>']`
  - Marker gene results → `adata.uns['<key>']` (includes 'names', 'scores', 'logfoldchanges')
  - Differential expression → `adata.uns['rank_genes_groups']`
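
The storage patterns listed under the pitfalls can be turned into a quick lookup that reports where a result key actually lives. This is a hypothetical convenience function using plain collections as stand-ins for `adata.obs.columns` and `adata.uns.keys()`; it is not an omicverse API.

```python
def locate_result(obs_columns, uns_keys, key):
    """Hypothetical helper: report whether a result key is in obs, uns, or neither."""
    if key in obs_columns:
        return "obs"  # per-cell annotations such as cell type labels
    if key in uns_keys:
        return "uns"  # unstructured results such as marker gene rankings
    return None


# Toy stand-ins for the state after running SCSA annotation and COSG.
obs_columns = ["leiden", "scsa_celltype"]
uns_keys = ["rank_genes_groups"]

where_labels = locate_result(obs_columns, uns_keys, "scsa_celltype")
where_markers = locate_result(obs_columns, uns_keys, "rank_genes_groups")
where_cosg = locate_result(obs_columns, uns_keys, "cosg_celltype")
```

Here `where_cosg` is `None`, matching the pitfall above: COSG alone does not create a `cosg_celltype` entry anywhere.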
Examples
- "Run SCSA with both CellMarker and PanglaoDB references on PBMC3k, then benchmark against manual marker assignments before feeding the results into CellVote."
- "Annotate tumour microenvironment states in the MetaTiME Figshare dataset, highlight Major_MetaTiME classes, and export the label distribution per patient."
- "Download Cell Ontology resources, map `haber_2017_regions` clusters to ontology terms, and enrich ambiguous clusters using Cell Taxonomy hints."
- "Propagate RNA-derived `major_celltype` labels onto GLUE-integrated ATAC cells and report clusters with high transfer uncertainty."
References
- Tutorials and notebooks: `t_cellanno.ipynb`, `t_metatime.ipynb`, `t_cellvote.md`, `t_cellvote_pbmc3k.ipynb`, `t_cellmatch.ipynb`, `t_gptanno.ipynb`, `t_anno_trans.ipynb`.
- Sample data & assets: PBMC3k matrix from 10x Genomics, MetaTiME `TiME_adata_scvi.h5ad` (Figshare), SCSA database downloads, GLUE embeddings under `data/analysis_lymph/`, Cell Ontology `cl.json`, and Cell Taxonomy resource.
- Quick copy commands: `reference.md`.