immunomind
AIRR data analysis with immunomind ecosystem
Overview
This skill provides a practical workflow for Adaptive Immune Receptor Repertoire (AIRR / VDJ-seq) data analysis with immunarch (analysis + visualization) and immundata (data ingestion + transformation + schema handling).
Use it to:
- Load AIRR data from one file, many files, glob patterns, or metadata tables.
- Define receptor semantics (chain-agnostic, single-chain, paired-chain).
- Build reproducible immutable pipelines.
- Compute repertoire statistics, clonality, diversity, and public overlap.
- Move annotations between ImmunData and single-cell objects (e.g., Seurat/AnnData).
When to Use This Skill
Use this skill when the user asks to:
- Analyze bulk or single-cell AIRR/TCR/BCR data.
- Compare repertoires across sample groups (tissue, therapy, cluster, donor, timepoint).
- Compute clonality/diversity/publicity metrics.
- Define or change receptor schema (e.g., cdr3_aa + v_call, TRA-only, TRA+TRB).
- Filter receptors by patterns or sequence distance.
- Add/propagate labels between repertoire data and scRNA metadata.
- Convert old immunarch objects to the newer ImmunData pipeline.
Quick Start
Basic import and first look
    library(immunarch)

    # Demo data + basic grouping
    idata <- get_test_idata() |> agg_repertoires("Therapy")
    idata

    # Core analyses
    airr_stats_genes(idata, gene_col = "v_call") |> vis()
    airr_public_jaccard(idata) |> vis()
    airr_clonality_prop(idata)
    airr_diversity_pielou(idata) |> vis()
Optional: add clonality labels to Seurat metadata
    # sdata: an existing Seurat object whose cell barcodes match those in idata
    idata <- annotate_clonality_prop(idata)
    sdata <- annotate_seurat(idata, sdata, cols = "clonal_prop_bin")
    Seurat::DimPlot(sdata, reduction = "umap", group.by = "clonal_prop_bin", shuffle = TRUE)
Ingest AIRR files via immundata directly
    library(immundata)

    md_path <- system.file("extdata/tsv", "metadata.tsv", package = "immundata")
    files <- c(
      system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata"),
      system.file("extdata/tsv", "sample_1k_2k.tsv", package = "immundata")
    )

    md <- read_metadata(md_path)
    idata <- read_repertoires(
      path = files,
      schema = c("cdr3_aa", "v_call"),
      metadata = md
    )
Typical User Intake (what to ask/assume)
Before coding, identify:
- Modality: bulk vs single-cell AIRR.
- Input format: TSV/CSV/Parquet, gzipped or not, one file vs many.
- Schema intent:
  - chain-agnostic,
  - single-chain (e.g., TRA only),
  - paired-chain (e.g., TRA+TRB, IGH + IGK|IGL).
- Grouping variables: repertoire schema (sample/cluster/tissue/therapy).
- Target analyses: stats, gene usage, clonality, diversity, overlap, annotation transfer.
- Scale/performance: whether a snapshot/materialization strategy is needed.
If unknown, default to conservative, reproducible choices and print intermediate summaries.
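The intake questions above can be recorded as an explicit list so that every assumed default is visible in the analysis log. The following is a minimal base-R sketch; `intake_defaults` and `resolve_intake()` are hypothetical helpers written for illustration, not part of immunarch or immundata, and the default values are only one conservative choice.

```r
# Hypothetical helper (not an immunarch/immundata function): fill unanswered
# intake questions with conservative defaults and record what was assumed.
intake_defaults <- list(
  modality          = "bulk",
  input_format      = "tsv",
  schema            = c("cdr3_aa", "v_call"),  # chain-agnostic receptor definition
  repertoire_schema = "Sample",
  analyses          = c("stats", "clonality", "diversity", "overlap"),
  snapshot          = FALSE
)

resolve_intake <- function(answers = list(), defaults = intake_defaults) {
  unknown <- setdiff(names(answers), names(defaults))
  if (length(unknown) > 0) {
    stop("Unknown intake field(s): ", paste(unknown, collapse = ", "))
  }
  # Fields the user did not answer fall back to the defaults
  assumed <- setdiff(names(defaults), names(answers))
  resolved <- modifyList(defaults, answers)
  attr(resolved, "assumed") <- assumed
  resolved
}

intake <- resolve_intake(list(
  modality          = "single-cell",
  repertoire_schema = c("Sample", "Cluster")
))
message("Assumed defaults for: ", paste(attr(intake, "assumed"), collapse = ", "))
```

Printing the assumed fields before any ingestion call makes it easy to spot a default that does not fit the dataset.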
Standard Analysis Workflow
1) Ingest data and metadata
Use read_metadata() + read_repertoires().
    library(immundata)

    inp_files <- "path/to/airr/*.tsv.gz"
    md_table <- read_metadata("path/to/metadata.tsv")

    idata <- read_repertoires(
      path = inp_files,
      schema = c("cdr3_aa", "v_call"),
      metadata = md_table,
      repertoire_schema = "Sample"
    )
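Before ingestion it can help to confirm that the glob pattern and the metadata table describe the same set of files. The sketch below uses only base R on temporary placeholder files. The column names `File` and `Sample` are assumptions made for illustration, so check the immundata ingestion guide for the names your version expects.

```r
# Assumption: the metadata table has one row per input file, with a file-path
# column ("File") and a sample label ("Sample"). Both names are illustrative.
demo_dir <- file.path(tempdir(), "airr_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- file.path(demo_dir, c("s1.tsv.gz", "s2.tsv.gz"))
invisible(file.create(demo_files))  # empty placeholders standing in for real AIRR files

# Expand the glob the same way a user would before passing it on
found <- sort(Sys.glob(file.path(demo_dir, "*.tsv.gz")))

md <- data.frame(File = demo_files, Sample = c("S1", "S2"))

# Files on disk without a metadata row, and metadata rows without a file
missing_md   <- setdiff(basename(found), basename(md$File))
missing_file <- md$File[!file.exists(md$File)]

if (length(missing_md) > 0 || length(missing_file) > 0) {
  warning("Metadata and input files do not match one-to-one")
}
```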
2) Choose receptor schema explicitly
Chain-agnostic (bulk / pre-filtered)
    idata <- read_repertoires(
      path = inp_files,
      schema = c("cdr3_aa", "v_call")
    )
Single-chain (e.g., TRA only)
    schema <- make_receptor_schema(
      features = c("cdr3", "v_call"),
      chains = "TRA"
    )

    idata <- read_repertoires(
      path = "path/to/single_cell.csv.gz",
      schema = schema,
      barcode_col = "barcode",
      locus_col = "locus"
    )
Paired-chain (e.g., TRA + TRB)
    schema <- make_receptor_schema(
      features = c("cdr3", "v_call"),
      chains = c("TRA", "TRB")
    )

    idata <- read_repertoires(
      path = "path/to/single_cell.csv.gz",
      schema = schema,
      barcode_col = "barcode",
      locus_col = "locus",
      umi_col = "umis"
    )
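To see why paired mode needs a barcode, a locus, and usually a UMI column, the toy example below pairs chains by hand in base R. It assumes one plausible rule, keeping the highest-UMI contig per barcode and locus and then joining TRA to TRB on the barcode. This is an illustration of the idea only; immundata's actual pairing rules may differ.

```r
# Toy contig table: cell c1 has two TRA contigs, cell c3 has no TRA at all
contigs <- data.frame(
  barcode = c("c1", "c1", "c1", "c2", "c2", "c3"),
  locus   = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRB"),
  cdr3    = c("CAVA", "CAVB", "CASSA", "CAVC", "CASSB", "CASSC"),
  umis    = c(5, 12, 9, 3, 7, 4),
  stringsAsFactors = FALSE
)

# Assumed rule: keep the contig with the most UMIs per (barcode, locus)
ord <- contigs[order(contigs$barcode, contigs$locus, -contigs$umis), ]
top <- ord[!duplicated(ord[c("barcode", "locus")]), ]

# Join the two chains on the cell barcode; cells missing a chain drop out
tra <- top[top$locus == "TRA", c("barcode", "cdr3")]
trb <- top[top$locus == "TRB", c("barcode", "cdr3")]
paired <- merge(tra, trb, by = "barcode", suffixes = c("_tra", "_trb"))
paired
```

Without the UMI column there is no principled way to choose between the two TRA contigs of cell c1, and without the barcode there is nothing to join the chains on.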
Paired with alternative light chain (e.g., IGH + IGK|IGL)
    schema <- make_receptor_schema(
      features = c("cdr3", "v_call"),
      chains = c("IGH", "IGK|IGL")
    )
3) Transform immutably (filter / annotate / mutate)
    # Filtering
    idata_f <- idata |>
      filter(v_call == "TRBV2") |>
      filter(imd_proportion >= 0.0002)

    # Sequence-distance filter
    idata_seq <- idata |>
      filter(seq_options = make_seq_options(
        patterns = "CASSELAGYRGEQYF",
        query_col = "cdr3",
        method = "lev",
        max_dist = 3
      ))

    # Annotation join
    idata_ann <- annotate(
      idata = idata,
      annotations = cells[c("barcode", "ident")],
      by = c(imd_barcode = "barcode"),
      keep_repertoires = FALSE
    )

    # Mutations
    idata_mut <- idata |>
      mutate(big_chains = umis >= 10) |>
      mutate(dist_to_pattern = dd$levenshtein(cdr3, "CASSSVSGNSPLHF"))
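To check what a Levenshtein filter with max_dist = 3 keeps, the same idea can be reproduced on a handful of sequences in base R. Here `utils::adist()`, which computes generalized Levenshtein distance, stands in for the `method = "lev"` option. The example sequences other than the pattern are made up, and whether immundata computes the distance in exactly the same way is an assumption.

```r
# Toy CDR3 sequences; only the first is taken from the filter example above
cdr3 <- c(
  "CASSELAGYRGEQYF",  # identical to the pattern
  "CASSELAGYRGEQFF",  # one substitution
  "CASSLGQGAYEQYF",   # several edits in the middle
  "CAVRDSNYQLIW"      # unrelated sequence
)
pattern  <- "CASSELAGYRGEQYF"
max_dist <- 3

# adist() returns a matrix (sequences x patterns); flatten it for one pattern
d <- as.vector(utils::adist(cdr3, pattern))

# Keep sequences within the edit-distance threshold
keep <- cdr3[d <= max_dist]
data.frame(cdr3 = cdr3, dist = d, kept = d <= max_dist)
```

Running such a check on a small subset before filtering the full dataset is a cheap way to confirm that the threshold matches the intended notion of "similar".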
4) Aggregate repertoires for reporting strata
    idata_grp <- idata |>
      agg_repertoires("Tissue")
5) Compute immune repertoire statistics
    # Chain and gene statistics
    chains <- airr_stats_chains(idata_grp)
    genes <- airr_stats_genes(idata_grp, gene_col = "v_call", level = "receptor")

    chains |> vis()
    genes |> vis()
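Gene usage at the receptor level amounts to counting receptors per gene within each repertoire and normalizing the counts. The base-R sketch below shows that calculation on a toy table. It illustrates the quantity only and is not the implementation of airr_stats_genes(); the column names are made up for the example.

```r
# Toy receptor table: one row per receptor, labeled with its repertoire
receptors <- data.frame(
  repertoire = c("A", "A", "A", "B", "B"),
  v_call     = c("TRBV2", "TRBV2", "TRBV5", "TRBV2", "TRBV7"),
  stringsAsFactors = FALSE
)

# Receptor counts per repertoire (rows) and V gene (columns)
counts <- table(receptors$repertoire, receptors$v_call)

# Row-normalize so each repertoire's usage sums to 1
usage <- prop.table(counts, margin = 1)
round(usage, 3)
```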
6) Clonality analysis
    cl_line <- airr_clonality_line(idata_grp)
    cl_prop <- airr_clonality_prop(idata_grp)
    cl_rank <- airr_clonality_rank(idata_grp, bins = c(10, 100, 1000))

    cl_prop |> vis()
    cl_rank |> vis()
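Rank-based clonality asks how much of a repertoire is occupied by its top-N clonotypes. The sketch below computes that share by hand for a toy count vector in base R. It is meant to show what the rank bins measure, under the assumption that bins are cumulative; it does not reproduce airr_clonality_rank() output.

```r
# Toy clonotype counts for one repertoire (total = 1000)
counts <- c(500, 200, 100, 50, 50, rep(10, 10))

# Proportions sorted from the largest clonotype to the smallest
p <- sort(counts, decreasing = TRUE) / sum(counts)

# Share of the repertoire occupied by the top 1, top 5 and top 15 clonotypes
bins <- c(1, 5, 15)
occupied <- sapply(bins, function(k) sum(p[seq_len(min(k, length(p)))]))
names(occupied) <- paste0("top", bins)
occupied
```

A repertoire where the top clonotype alone takes half of the space, as here, is strongly expanded; a flat repertoire would show small values until the bin size approaches the number of clonotypes.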
7) Diversity analysis
    d50 <- airr_diversity_dxx(idata_grp, perc = 50)
    chao1 <- airr_diversity_chao1(idata_grp)
    shannon <- airr_diversity_shannon(idata_grp)
    pielou <- airr_diversity_pielou(idata_grp)
    hill1 <- airr_diversity_index(idata_grp)
    hillprof <- airr_diversity_hill(idata_grp, q = c(0, 1, 2))

    pielou |> vis()
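For reference, the textbook definitions behind these metrics are short enough to write out in base R. The functions below are a sketch of the standard formulas (Shannon entropy, Pielou evenness, Hill numbers, bias-corrected Chao1, and Dxx as the number of top clonotypes covering xx% of the repertoire). The exact estimators and conventions used by the airr_diversity_* functions may differ, so treat these as a way to sanity-check magnitudes rather than to reproduce package output.

```r
# Shannon entropy (natural log) from clonotype counts
shannon_entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Pielou evenness: entropy relative to its maximum for the observed richness
pielou_evenness <- function(counts) {
  shannon_entropy(counts) / log(sum(counts > 0))
}

# Hill number of order q (q = 1 is defined as the limit, exp of Shannon entropy)
hill_number <- function(counts, q) {
  p <- counts[counts > 0] / sum(counts)
  if (q == 1) exp(-sum(p * log(p))) else sum(p^q)^(1 / (1 - q))
}

# Bias-corrected Chao1 richness estimate (defined even without doubletons)
chao1_estimate <- function(counts) {
  s_obs <- sum(counts > 0)
  f1 <- sum(counts == 1)  # singletons
  f2 <- sum(counts == 2)  # doubletons
  s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
}

# Dxx: smallest number of top clonotypes covering perc% of the repertoire
dxx <- function(counts, perc = 50) {
  p <- sort(counts, decreasing = TRUE) / sum(counts)
  which(cumsum(p) >= perc / 100)[1]
}

even <- rep(10, 4)  # four equally abundant clonotypes
sapply(c(0, 1, 2), function(q) hill_number(even, q))
```

For a perfectly even repertoire all Hill numbers equal the richness and Pielou evenness is 1, which makes an even toy vector a convenient first test.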
8) Public overlap
    m_pub <- airr_public_intersection(idata_grp)
    m_jac <- airr_public_jaccard(idata_grp)

    m_pub |> vis()
    m_jac |> vis()
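The two overlap measures can be illustrated on toy receptor sets: the intersection counts shared receptors, and the Jaccard index divides that count by the size of the union. The base-R sketch below builds both matrices by hand. It shows the definitions only, with made-up receptor labels, and is not the implementation of the airr_public_* functions.

```r
# Toy repertoires, each reduced to its set of distinct receptors
repertoires <- list(
  A = c("x", "y", "z"),
  B = c("y", "z", "w"),
  C = c("q")
)

n_shared <- function(a, b) length(intersect(a, b))
jaccard  <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Apply a pairwise function to every pair of repertoires
pairwise <- function(sets, f) {
  idx <- seq_along(sets)
  m <- outer(idx, idx, Vectorize(function(i, j) f(sets[[i]], sets[[j]])))
  dimnames(m) <- list(names(sets), names(sets))
  m
}

m_shared  <- pairwise(repertoires, n_shared)
m_jaccard <- pairwise(repertoires, jaccard)
m_jaccard
```

Because the Jaccard index normalizes by the union, it is less sensitive to repertoire size than the raw intersection count, which tends to grow with sequencing depth.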
9) Snapshot expensive steps
    # Save intermediate immutable snapshot to avoid recomputing expensive transforms
    idata_cached <- immundata::write_immundata(idata_mut, "path/to/snapshot_folder")
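The general pattern behind snapshotting is "compute once, then reload". The sketch below shows that pattern in base R with saveRDS()/readRDS() as a generic stand-in. `cached()` is a hypothetical helper, and an RDS file is not the on-disk format written by write_immundata(); the point is only the control flow.

```r
# Hypothetical helper: run compute() only if no snapshot exists at `path`
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the stored result
  }
  result <- compute()
  saveRDS(result, path)    # persist for later runs
  result
}

# Stand-in for an expensive transform; counts how often it really runs
calls <- 0
expensive <- function() {
  calls <<- calls + 1
  sum(seq_len(1000))
}

snap <- file.path(tempdir(), "demo_snapshot.rds")
unlink(snap)  # start from a clean state for the demo

first  <- cached(snap, expensive)  # computes and writes the snapshot
second <- cached(snap, expensive)  # reads the snapshot, no recomputation
```

Deleting the snapshot file is the explicit way to force recomputation after an upstream step changes.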
Common Pitfalls and Best Practices
- Changing receptor definition mid-analysis
  - Receptor schema is foundational. If receptor definition changes, re-read data into a new ImmunData.
- Using wrong mode for single-cell
  - Paired analysis requires barcode_col, locus_col, and typically umi_col.
  - Single-chain mode does not necessarily collapse multiple chains per barcode.
- Skipping metadata strategy
  - Prefer metadata-driven ingestion (path = "<metadata>") when file provenance and sample mapping matter.
- Direct internal-table edits
  - Avoid low-level manual edits. Use high-level verbs (filter, annotate, mutate, agg_repertoires).
- Ignoring immutable pipeline behavior
  - Each step returns a new object; persist expensive steps with snapshots.
- Non-canonical columns
  - In scripts, rely on canonical imd_* columns.
  - In package code, prefer schema keys/aliases (e.g., via imd_schema()).
- Distance-heavy transforms on large data
  - Pattern/Levenshtein operations can be expensive; run once, snapshot, then reuse.
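The first pitfall is easy to demonstrate: the same chains yield a different number of receptors depending on which columns define a receptor, so counts and proportions computed under one schema cannot be compared with results from another. A minimal base-R illustration on a toy table:

```r
# Four chains; two of them share a CDR3 but use different V genes
chains <- data.frame(
  cdr3_aa = c("CASSA", "CASSA", "CASSB", "CASSB"),
  v_call  = c("TRBV2", "TRBV5", "TRBV2", "TRBV2"),
  stringsAsFactors = FALSE
)

# Receptor = CDR3 amino-acid sequence only
n_by_cdr3 <- nrow(unique(chains["cdr3_aa"]))

# Receptor = CDR3 amino-acid sequence + V gene
n_by_cdr3_v <- nrow(unique(chains[c("cdr3_aa", "v_call")]))

c(cdr3_only = n_by_cdr3, cdr3_plus_v = n_by_cdr3_v)
```

Every downstream statistic (clonality, diversity, overlap) inherits this difference, which is why a schema change calls for re-reading the data rather than patching an existing object.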
Bundled Resources
TBD
Additional Resources
- immunomind docs home: https://immunomind.github.io/docs/
- Quick Start: https://immunomind.github.io/docs/intro/quick_start/
- Reading repertoire files: https://immunomind.github.io/docs/guides/io/ingesting/
- Reading single-/paired-chain data: https://immunomind.github.io/docs/guides/io/modes/
- Filter: https://immunomind.github.io/docs/guides/transform/filter/
- Annotate: https://immunomind.github.io/docs/guides/transform/annotate/
- Mutate: https://immunomind.github.io/docs/guides/transform/mutate/
- Workflow phase 1 (ingestion): https://immunomind.github.io/docs/concepts/workflow/phase_ingestion/
- Workflow phase 2 (transformation): https://immunomind.github.io/docs/concepts/workflow/phase_transformation/
- ImmunData structure: https://immunomind.github.io/docs/guides/data_schema/
- Single-cell end-to-end tutorial: https://immunomind.github.io/docs/tutorials/single_cell/
Tips for Effective Analysis
- Start with a small subset and verify schema + grouping before scaling up.
- Print object summaries after ingestion and after major transforms.
- Use explicit variable names for stages (idata_raw, idata_qc, idata_ann, idata_stats).
- Prefer pipelines that can be re-executed end-to-end from raw inputs.
- Keep biologically meaningful grouping variables in repertoire schema early.
- Use vis() early and often for sanity checks before formal interpretation.