airr
Adaptive Immune Receptor Repertoire analysis agentic tooling
Overview
This skill provides a practical workflow for Adaptive Immune Receptor Repertoire (AIRR / VDJ-seq) data analysis with immunarch (analysis + visualization) and immundata (data ingestion + transformation + schema handling).
Use it to:
- Load AIRR data from one file, many files, glob patterns, or metadata tables.
- Define receptor semantics (chain-agnostic, single-chain, paired-chain).
- Build reproducible immutable pipelines.
- Compute repertoire statistics, clonality, diversity, and public overlap.
- Move annotations between ImmunData and single-cell objects (e.g., Seurat/AnnData).
When to Use This Skill
Use this skill when the user asks to:
- Analyze bulk or single-cell AIRR/TCR/BCR data.
- Compare repertoires across sample groups (tissue, therapy, cluster, donor, timepoint).
- Compute clonality/diversity/publicity metrics.
- Define or change receptor schema (e.g., `cdr3_aa + v_call`, TRA-only, TRA+TRB).
- Filter receptors by patterns or sequence distance.
- Add/propagate labels between repertoire data and scRNA metadata.
- Convert old immunarch objects to the newer ImmunData pipeline.
Quick Start
Basic import and first look
```r
library(immunarch)

# Demo data + basic grouping
idata <- get_test_idata() |> agg_repertoires("Therapy")
idata

# Core analyses
airr_stats_genes(idata, gene_col = "v_call") |> vis()
airr_public_jaccard(idata) |> vis()
airr_clonality_prop(idata)
airr_diversity_pielou(idata) |> vis()
```
Optional: add clonality labels to Seurat metadata
```r
# `sdata` is an existing Seurat object for the same cells as `idata`.
idata <- annotate_clonality_prop(idata)
sdata <- annotate_seurat(idata, sdata, cols = "clonal_prop_bin")
Seurat::DimPlot(sdata, reduction = "umap", group.by = "clonal_prop_bin", shuffle = TRUE)
```
Ingest AIRR files via immundata directly
```r
library(immundata)

md_path <- system.file("extdata/tsv", "metadata.tsv", package = "immundata")
files <- c(
  system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata"),
  system.file("extdata/tsv", "sample_1k_2k.tsv", package = "immundata")
)

md <- read_metadata(md_path)
idata <- read_repertoires(
  path = files,
  schema = c("cdr3_aa", "v_call"),
  metadata = md
)
```
Typical User Intake (what to ask/assume)
Before coding, identify:
- Modality: bulk vs single-cell AIRR.
- Input format: TSV/CSV/Parquet, gzipped or not, one file vs many.
- Schema intent:
  - chain-agnostic,
  - single-chain (e.g., TRA only),
  - paired-chain (e.g., TRA+TRB, IGH + IGK|IGL).
- Grouping variables: repertoire schema (sample/cluster/tissue/therapy).
- Target analyses: stats, gene usage, clonality, diversity, overlap, annotation transfer.
- Scale/performance: whether a snapshot/materialization strategy is needed.
If unknown, default to conservative, reproducible choices and print intermediate summaries.
Standard Analysis Workflow
1) Ingest data and metadata
Use `read_metadata()` + `read_repertoires()`.
```r
library(immundata)

inp_files <- "path/to/airr/*.tsv.gz"
md_table <- read_metadata("path/to/metadata.tsv")

idata <- read_repertoires(
  path = inp_files,
  schema = c("cdr3_aa", "v_call"),
  metadata = md_table,
  repertoire_schema = "Sample"
)
```
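Before handing a glob pattern to `read_repertoires()`, it can help to check which files the pattern actually matches. The following is a minimal base-R sketch, assuming the files sit on local disk; `list_airr_files()` is a hypothetical helper introduced here for illustration and is not part of immundata.

```r
# Hypothetical helper (not part of immundata): expand a glob pattern and
# fail early if nothing matches, so path typos surface before ingestion.
list_airr_files <- function(pattern) {
  files <- Sys.glob(pattern)
  if (length(files) == 0) {
    stop("No files match pattern: ", pattern)
  }
  sort(files)
}

# Demo with throwaway empty files in a temporary directory.
demo_dir <- file.path(tempdir(), "airr_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("s1.tsv.gz", "s2.tsv.gz", "notes.txt")))

matched <- list_airr_files(file.path(demo_dir, "*.tsv.gz"))
basename(matched)
```

Printing the matched file names next to the metadata table makes it easier to spot samples that are listed in the metadata but missing on disk.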
2) Choose receptor schema explicitly
Chain-agnostic (bulk / pre-filtered)
```r
idata <- read_repertoires(
  path = inp_files,
  schema = c("cdr3_aa", "v_call")
)
```
Single-chain (e.g., TRA only)
```r
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains = "TRA"
)

idata <- read_repertoires(
  path = "path/to/single_cell.csv.gz",
  schema = schema,
  barcode_col = "barcode",
  locus_col = "locus"
)
```
Paired-chain (e.g., TRA + TRB)
```r
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains = c("TRA", "TRB")
)

idata <- read_repertoires(
  path = "path/to/single_cell.csv.gz",
  schema = schema,
  barcode_col = "barcode",
  locus_col = "locus",
  umi_col = "umis"
)
```
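Conceptually, paired-chain mode groups chains by cell barcode and keeps one chain per locus. The base-R sketch below illustrates that idea on a toy table. Choosing the chain with the most UMIs per barcode and locus is an assumption made for this illustration (it is one reason a UMI column is useful), and the code is not immundata's implementation.

```r
# Toy single-cell chain table; column names mirror the arguments used above.
chains <- data.frame(
  barcode = c("c1", "c1", "c1", "c2", "c2", "c3"),
  locus   = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRB"),
  cdr3    = c("CAVRDF", "CAASGF", "CASSLGF", "CAVNTF", "CASSPGF", "CASSQEF"),
  umis    = c(5, 2, 9, 4, 7, 3),
  stringsAsFactors = FALSE
)

# Assumption: keep the highest-UMI chain for each barcode + locus combination.
ord <- order(chains$barcode, chains$locus, -chains$umis)
top <- chains[ord, ]
top <- top[!duplicated(top[c("barcode", "locus")]), ]

# Pair TRA with TRB by barcode; cells missing either chain drop out.
tra <- top[top$locus == "TRA", c("barcode", "cdr3")]
trb <- top[top$locus == "TRB", c("barcode", "cdr3")]
paired <- merge(tra, trb, by = "barcode", suffixes = c("_tra", "_trb"))
paired
```

In this toy table, cell `c3` has no TRA chain and therefore does not appear in `paired`, which is the kind of cell loss worth checking when switching from single-chain to paired-chain analysis.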
Paired with alternative light chain (e.g., IGH + IGK|IGL)
```r
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains = c("IGH", "IGK|IGL")
)
```
3) Transform immutably (filter / annotate / mutate)
```r
# Filtering
idata_f <- idata |>
  filter(v_call == "TRBV2") |>
  filter(imd_proportion >= 0.0002)

# Sequence-distance filter
idata_seq <- idata |>
  filter(seq_options = make_seq_options(
    patterns = "CASSELAGYRGEQYF",
    query_col = "cdr3",
    method = "lev",
    max_dist = 3
  ))

# Annotation join
idata_ann <- annotate(
  idata = idata,
  annotations = cells[c("barcode", "ident")],
  by = c(imd_barcode = "barcode"),
  keep_repertoires = FALSE
)

# Mutations
idata_mut <- idata |>
  mutate(big_chains = umis >= 10) |>
  mutate(dist_to_pattern = dd$levenshtein(cdr3, "CASSSVSGNSPLHF"))
```
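To make the sequence-distance filter concrete: a Levenshtein filter keeps sequences that are within a maximum number of single-character edits (substitutions, insertions, deletions) of a pattern. The sketch below reproduces that logic on a toy vector with base R's `adist()`. The toy sequences are made up for illustration, and this is not how immundata evaluates the filter internally.

```r
# Toy CDR3 sequences; the pattern is the one used in the filter above.
cdr3 <- c("CASSELAGYRGEQYF", "CASSELAGYRGEQFF", "CASSLLAGYRGEAYF",
          "CASSPGTGGYEQYF", "CAWSVGQGNTEAFF")
pattern <- "CASSELAGYRGEQYF"

# adist() returns a matrix of edit distances; drop() flattens the single row.
dist_to_pattern <- drop(adist(pattern, cdr3))

# Keep sequences within 3 edits of the pattern (mirrors max_dist = 3).
hits <- cdr3[dist_to_pattern <= 3]
data.frame(cdr3, dist_to_pattern, keep = dist_to_pattern <= 3)
```

Running the same pattern and threshold on a small sample like this is a cheap way to sanity-check `max_dist` before applying the filter to a full dataset.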
4) Aggregate repertoires for reporting strata
```r
idata_grp <- idata |>
  agg_repertoires("Tissue")
```
5) Compute immune repertoire statistics
```r
# Chain and gene statistics
chains <- airr_stats_chains(idata_grp)
genes <- airr_stats_genes(idata_grp, gene_col = "v_call", level = "receptor")

chains |> vis()
genes |> vis()
```
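As a rough picture of what a gene-usage table contains, the base-R sketch below counts V genes per group on a toy receptor table and converts the counts to within-group proportions. Counting each receptor once is an assumption of this sketch; how `airr_stats_genes()` weights receptors is not shown here.

```r
# Toy receptor table: one row per receptor, with its V gene and sample group.
receptors <- data.frame(
  tissue = c("Blood", "Blood", "Blood", "Tumor", "Tumor", "Tumor", "Tumor"),
  v_call = c("TRBV2", "TRBV2", "TRBV7-9", "TRBV2", "TRBV7-9", "TRBV7-9", "TRBV20-1")
)

# Count receptors per gene within each group, then convert to proportions
# so that groups of different size are comparable.
counts <- table(receptors$v_call, receptors$tissue)
usage  <- prop.table(counts, margin = 2)
round(usage, 3)
```

Each column of `usage` sums to 1, so differences between groups reflect gene preference rather than repertoire size.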
6) Clonality analysis
```r
cl_line <- airr_clonality_line(idata_grp)
cl_prop <- airr_clonality_prop(idata_grp)
cl_rank <- airr_clonality_rank(idata_grp, bins = c(10, 100, 1000))

cl_prop |> vis()
cl_rank |> vis()
```
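Proportion-based clonality can be pictured as binning each receptor by the fraction of the repertoire it occupies, then summing the space taken by each bin. The base-R sketch below shows this on toy counts. The bin thresholds and labels are assumptions chosen for illustration and are not necessarily the defaults of `airr_clonality_prop()`.

```r
# Toy clone counts for one repertoire (cells or reads per receptor).
counts <- c(500, 200, 120, 80, 40, 30, 20, 5, 3, 2)
prop   <- counts / sum(counts)

# Illustrative proportion bins; these thresholds are assumptions for the
# sketch, not necessarily the defaults used by airr_clonality_prop().
breaks <- c(0, 0.001, 0.01, 0.1, 1)
labels <- c("Small", "Medium", "Large", "Hyperexpanded")
bin    <- cut(prop, breaks = breaks, labels = labels, include.lowest = TRUE)

# Share of the repertoire occupied by each clonality bin (empty bins -> 0).
occupied <- tapply(prop, bin, sum)
occupied[is.na(occupied)] <- 0
occupied
```

In this toy repertoire, three clones above 10% each account for most of the repertoire, which is the pattern a clonality plot is meant to expose.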
7) Diversity analysis
```r
d50 <- airr_diversity_dxx(idata_grp, perc = 50)
chao1 <- airr_diversity_chao1(idata_grp)
shannon <- airr_diversity_shannon(idata_grp)
pielou <- airr_diversity_pielou(idata_grp)
hill1 <- airr_diversity_index(idata_grp)
hillprof <- airr_diversity_hill(idata_grp, q = c(0, 1, 2))

pielou |> vis()
```
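The metrics named above have standard textbook definitions, and computing them by hand on a toy count vector can help when interpreting results. The base-R sketch below uses natural logarithms and the bias-corrected Chao1 form; whether immunarch uses the same conventions is an assumption, so treat the numbers as illustrative rather than as a reference for the package output.

```r
# Toy clone counts for one repertoire.
counts <- c(50, 30, 10, 5, 2, 1, 1, 1)
p <- counts / sum(counts)

# Shannon entropy (natural log) and Pielou evenness J = H / ln(S).
shannon <- -sum(p * log(p))
pielou  <- shannon / log(length(counts))

# Hill number of order q; q = 1 is defined as the limit exp(H).
hill <- function(p, q) {
  if (q == 1) exp(-sum(p * log(p))) else sum(p^q)^(1 / (1 - q))
}
hill_profile <- sapply(c(0, 1, 2), function(q) hill(p, q))

# D50: smallest number of top clones covering at least 50% of the repertoire.
d50 <- which(cumsum(sort(p, decreasing = TRUE)) >= 0.5)[1]

# Chao1 richness estimate from singletons (f1) and doubletons (f2),
# using the bias-corrected form so that f2 = 0 does not divide by zero.
f1 <- sum(counts == 1)
f2 <- sum(counts == 2)
chao1 <- length(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))
```

The Hill profile decreases as `q` grows because higher orders give more weight to dominant clones; a steep drop from `q = 0` to `q = 2` indicates an uneven repertoire.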
8) Public overlap
```r
m_pub <- airr_public_intersection(idata_grp)
m_jac <- airr_public_jaccard(idata_grp)

m_pub |> vis()
m_jac |> vis()
```
9) Snapshot expensive steps
```r
# Save intermediate immutable snapshot to avoid recomputing expensive transforms
idata_cached <- immundata::write_immundata(idata_mut, "path/to/snapshot_folder")
```
---

Common Pitfalls and Best Practices
- Changing receptor definition mid-analysis
  - Receptor schema is foundational. If the receptor definition changes, re-read the data into a new `ImmunData`.
- Using wrong mode for single-cell
  - Paired analysis requires `barcode_col`, `locus_col`, and typically `umi_col`.
  - Single-chain mode does not necessarily collapse multiple chains per barcode.
- Skipping metadata strategy
  - Prefer metadata-driven ingestion (`path = "<metadata>"`) when file provenance and sample mapping matter.
- Direct internal-table edits
  - Avoid low-level manual edits. Use high-level verbs (`filter`, `annotate`, `mutate`, `agg_repertoires`).
- Ignoring immutable pipeline behavior
  - Each step returns a new object; persist expensive steps with snapshots.
- Non-canonical columns
  - In scripts, rely on canonical `imd_*` columns.
  - In package code, prefer schema keys/aliases (e.g., via `imd_schema()`).
- Distance-heavy transforms on large data
  - Pattern/Levenshtein operations can be expensive; run once, snapshot, then reuse.
Bundled Resources
TBD
Additional Resources
- immunomind docs home: https://immunomind.github.io/docs/
- Quick Start: https://immunomind.github.io/docs/intro/quick_start/
- Reading repertoire files: https://immunomind.github.io/docs/guides/io/ingesting/
- Reading single-/paired-chain data: https://immunomind.github.io/docs/guides/io/modes/
- Filter: https://immunomind.github.io/docs/guides/transform/filter/
- Annotate: https://immunomind.github.io/docs/guides/transform/annotate/
- Mutate: https://immunomind.github.io/docs/guides/transform/mutate/
- Workflow phase 1 (ingestion): https://immunomind.github.io/docs/concepts/workflow/phase_ingestion/
- Workflow phase 2 (transformation): https://immunomind.github.io/docs/concepts/workflow/phase_transformation/
- ImmunData structure: https://immunomind.github.io/docs/guides/data_schema/
- Single-cell end-to-end tutorial: https://immunomind.github.io/docs/tutorials/single_cell/
Tips for Effective Analysis
- Start with a small subset and verify schema + grouping before scaling up.
- Print object summaries after ingestion and after major transforms.
- Use explicit variable names for stages (`idata_raw`, `idata_qc`, `idata_ann`, `idata_stats`).
- Prefer pipelines that can be re-executed end-to-end from raw inputs.
- Keep biologically meaningful grouping variables in repertoire schema early.
- Use `vis()` early and often for sanity checks before formal interpretation.