immunarch

AIRR data analysis with immunarch R package

Overview

This skill provides a practical workflow for Adaptive Immune Receptor Repertoire (AIRR / VDJ-seq) data analysis with immunarch (analysis + visualization) and immundata (data ingestion + transformation + schema handling).
Use it to:
  • Load AIRR data from one file, many files, glob patterns, or metadata tables.
  • Define receptor semantics (chain-agnostic, single-chain, paired-chain).
  • Build reproducible immutable pipelines.
  • Compute repertoire statistics, clonality, diversity, and public overlap.
  • Move annotations between ImmunData and single-cell objects (e.g., Seurat/AnnData).

When to Use This Skill

Use this skill when the user asks to:
  • Analyze bulk or single-cell AIRR/TCR/BCR data.
  • Compare repertoires across sample groups (tissue, therapy, cluster, donor, timepoint).
  • Compute clonality/diversity/publicity metrics.
  • Define or change receptor schema (e.g., cdr3_aa + v_call, TRA-only, TRA+TRB).
  • Filter receptors by patterns or sequence distance.
  • Add/propagate labels between repertoire data and scRNA metadata.
  • Convert old immunarch objects to the newer ImmunData pipeline.

Quick Start

Basic import and first look

r
library(immunarch)

# Demo data + basic grouping
idata <- get_test_idata() |> agg_repertoires("Therapy")
idata

# Core analyses
airr_stats_genes(idata, gene_col = "v_call") |> vis()
airr_public_jaccard(idata) |> vis()
airr_clonality_prop(idata)
airr_diversity_pielou(idata) |> vis()

Optional: add clonality labels to Seurat metadata

r
idata <- annotate_clonality_prop(idata)
sdata <- annotate_seurat(idata, sdata, cols = "clonal_prop_bin")
Seurat::DimPlot(sdata, reduction = "umap", group.by = "clonal_prop_bin", shuffle = TRUE)

Ingest AIRR files via immundata directly

r
library(immundata)

md_path <- system.file("extdata/tsv", "metadata.tsv", package = "immundata")
files <- c(
  system.file("extdata/tsv", "sample_0_1k.tsv", package = "immundata"),
  system.file("extdata/tsv", "sample_1k_2k.tsv", package = "immundata")
)
md <- read_metadata(md_path)

idata <- read_repertoires(
  path     = files,
  schema   = c("cdr3_aa", "v_call"),
  metadata = md
)

Typical User Intake (what to ask/assume)

Before coding, identify:
  1. Modality: bulk vs single-cell AIRR.
  2. Input format: TSV/CSV/Parquet, gzipped or not, one file vs many.
  3. Schema intent:
    • chain-agnostic,
    • single-chain (e.g., TRA only),
    • paired-chain (e.g., TRA+TRB, IGH + IGK|IGL).
  4. Grouping variables: repertoire schema (sample/cluster/tissue/therapy).
  5. Target analyses: stats, gene usage, clonality, diversity, overlap, annotation transfer.
  6. Scale/performance: whether snapshots/materialization strategy is needed.
If unknown, default to conservative, reproducible choices and print intermediate summaries.
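
When the modality is not stated, the column names of the input file are often enough for a first guess. The helper below is a hypothetical sketch, not part of immunarch or immundata; it assumes that single-cell exports carry a per-cell identifier column named "barcode" or "cell_id", which may not hold for every pipeline.

```r
# Hypothetical helper: guess modality from column names only.
# Assumption: single-cell exports include a per-cell identifier column
# named "barcode" or "cell_id"; bulk exports do not.
guess_modality <- function(col_names) {
  if (any(c("barcode", "cell_id") %in% col_names)) "single-cell" else "bulk"
}

# Made-up headers for illustration.
print(guess_modality(c("barcode", "locus", "cdr3", "v_call", "umis")))
print(guess_modality(c("cdr3_aa", "v_call", "duplicate_count")))
```

Treat the result as a prompt for a follow-up question to the user rather than a final decision.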

Standard Analysis Workflow

1) Ingest data and metadata

Use read_metadata() + read_repertoires().
r
library(immundata)

inp_files <- "path/to/airr/*.tsv.gz"
md_table  <- read_metadata("path/to/metadata.tsv")

idata <- read_repertoires(
  path              = inp_files,
  schema            = c("cdr3_aa", "v_call"),
  metadata          = md_table,
  repertoire_schema = "Sample"
)
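
Before handing a glob pattern to the reader, it can help to confirm what the pattern actually matches. The following is a minimal base-R sketch that uses a throwaway temp directory; the directory and file names are made up for illustration.

```r
# Hypothetical demo directory with two AIRR-like files and one stray file.
demo_dir <- file.path(tempdir(), "airr_glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("s1.tsv.gz", "s2.tsv.gz", "notes.txt"))))

# Expand the wildcard pattern up front, before any ingestion happens.
pattern <- file.path(demo_dir, "*.tsv.gz")
matched <- Sys.glob(pattern)

# Fail early when nothing matches instead of discovering it downstream.
stopifnot(length(matched) > 0)
print(basename(matched))
```

Comparing the matched file names against the file column of the metadata table is a cheap way to catch missing or misnamed samples.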

2) Choose receptor schema explicitly

Chain-agnostic (bulk / pre-filtered)

r
idata <- read_repertoires(
  path   = inp_files,
  schema = c("cdr3_aa", "v_call")
)

Single-chain (e.g., TRA only)

r
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains   = "TRA"
)

idata <- read_repertoires(
  path        = "path/to/single_cell.csv.gz",
  schema      = schema,
  barcode_col = "barcode",
  locus_col   = "locus"
)

Paired-chain (e.g., TRA + TRB)

r
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains   = c("TRA", "TRB")
)

idata <- read_repertoires(
  path        = "path/to/single_cell.csv.gz",
  schema      = schema,
  barcode_col = "barcode",
  locus_col   = "locus",
  umi_col     = "umis"
)

Paired with alternative light chain (e.g., IGH + IGK|IGL)

r
schema <- make_receptor_schema(
  features = c("cdr3", "v_call"),
  chains   = c("IGH", "IGK|IGL")
)
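
To make the idea of a paired-chain receptor concrete, here is a base-R sketch on a toy contig table. It shows one plausible pairing rule (keep the highest-UMI chain per barcode and locus, then join TRA with TRB by barcode). This rule is an assumption for illustration; the exact logic immundata applies may differ.

```r
# Toy contig table (made-up values): one row per chain per cell barcode.
contigs <- data.frame(
  barcode = c("c1", "c1", "c1", "c2", "c2", "c3"),
  locus   = c("TRA", "TRA", "TRB", "TRA", "TRB", "TRB"),
  cdr3    = c("CAVA", "CAVB", "CASSA", "CAVC", "CASSB", "CASSC"),
  umis    = c(5, 9, 7, 3, 4, 8),
  stringsAsFactors = FALSE
)

# Keep the highest-UMI chain per (barcode, locus).
ord  <- contigs[order(contigs$barcode, contigs$locus, -contigs$umis), ]
best <- ord[!duplicated(ord[c("barcode", "locus")]), ]

# Pair TRA with TRB by barcode; cells missing either chain drop out.
tra <- best[best$locus == "TRA", c("barcode", "cdr3")]
trb <- best[best$locus == "TRB", c("barcode", "cdr3")]
paired <- merge(tra, trb, by = "barcode", suffixes = c("_tra", "_trb"))
print(paired)
```

In this toy table, cell c3 has no TRA chain and is therefore absent from the paired result, which is the kind of row loss worth checking after switching from single-chain to paired-chain mode.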

3) Transform immutably (filter / annotate / mutate)

r
# Filtering
idata_f <- idata |>
  filter(v_call == "TRBV2") |>
  filter(imd_proportion >= 0.0002)

# Sequence-distance filter
idata_seq <- idata |>
  filter(seq_options = make_seq_options(
    patterns  = "CASSELAGYRGEQYF",
    query_col = "cdr3",
    method    = "lev",
    max_dist  = 3
  ))

# Annotation join
idata_ann <- annotate(
  idata            = idata,
  annotations      = cells[c("barcode", "ident")],
  by               = c(imd_barcode = "barcode"),
  keep_repertoires = FALSE
)

# Mutations
idata_mut <- idata |>
  mutate(big_chains = umis >= 10) |>
  mutate(dist_to_pattern = dd$levenshtein(cdr3, "CASSSVSGNSPLHF"))
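
For intuition about what a Levenshtein filter with a maximum distance of 3 keeps, the same idea can be sketched in base R with utils::adist() on a toy table. This is an illustration of the distance criterion only, not the code path immundata uses; the sequences other than the query pattern are made up.

```r
# Toy CDR3 table (made-up sequences apart from the query pattern).
cdr3_tbl <- data.frame(
  cdr3 = c("CASSELAGYRGEQYF", "CASSDLAGYRGEQYF", "CASSLGQGNTEAFF"),
  stringsAsFactors = FALSE
)
pattern  <- "CASSELAGYRGEQYF"
max_dist <- 3

# utils::adist() returns a matrix of generalized Levenshtein distances
# (rows: sequences, columns: patterns); flatten it for a single pattern.
cdr3_tbl$dist <- as.vector(utils::adist(cdr3_tbl$cdr3, pattern))

# Keep only sequences within max_dist edits of the pattern.
near <- cdr3_tbl[cdr3_tbl$dist <= max_dist, ]
print(near)
```

The exact match and the single-substitution variant pass the filter, while the unrelated third sequence does not.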

4) Aggregate repertoires for reporting strata

r
idata_grp <- idata |>
  agg_repertoires("Tissue")

5) Compute immune repertoire statistics

r
# Chain and gene statistics
chains <- airr_stats_chains(idata_grp)
genes  <- airr_stats_genes(idata_grp, gene_col = "v_call", level = "receptor")

chains |> vis()
genes |> vis()

6) Clonality analysis

r
cl_line <- airr_clonality_line(idata_grp)
cl_prop <- airr_clonality_prop(idata_grp)
cl_rank <- airr_clonality_rank(idata_grp, bins = c(10, 100, 1000))

cl_prop |> vis()
cl_rank |> vis()
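
As a rough picture of what a rank-binned clonality summary measures, the base-R sketch below computes the fraction of a toy repertoire occupied by clones in each rank bin. The counts and bin edges are made up, and the library's own binning and output format may differ.

```r
# Toy clone counts for one repertoire (made-up numbers), sorted descending.
counts <- sort(c(500, 200, 100, rep(10, 15), rep(1, 50)), decreasing = TRUE)
props  <- counts / sum(counts)

# Assign each clone to a rank bin: ranks 1-3, 4-10, and everything after.
rank_bin <- cut(seq_along(props), breaks = c(0, 3, 10, Inf),
                labels = c("1-3", "4-10", "11+"))

# Fraction of the repertoire occupied by each rank bin.
occupied <- tapply(props, rank_bin, sum)
print(round(occupied, 3))
```

A repertoire where the first bin holds most of the mass, as in this toy case, is dominated by a few expanded clones.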

7) Diversity analysis

r
d50      <- airr_diversity_dxx(idata_grp, perc = 50)
chao1    <- airr_diversity_chao1(idata_grp)
shannon  <- airr_diversity_shannon(idata_grp)
pielou   <- airr_diversity_pielou(idata_grp)
hill1    <- airr_diversity_index(idata_grp)
hillprof <- airr_diversity_hill(idata_grp, q = c(0, 1, 2))

pielou |> vis()
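
The textbook definitions behind several of these indices are short enough to check by hand. The base-R sketch below computes them for a toy repertoire of four equally abundant clones; conventions such as log base or the exact D50 definition are assumptions here and may not match the airr_diversity_* functions.

```r
# Toy clone counts (made-up): four equally abundant clones.
counts <- c(25, 25, 25, 25)
p <- counts / sum(counts)

shannon <- -sum(p * log(p))          # Shannon entropy, natural log
pielou  <- shannon / log(length(p))  # Pielou evenness: entropy / max entropy
hill_q2 <- 1 / sum(p^2)              # Hill number of order 2 (inverse Simpson)

# D50 (one common definition): smallest number of top clones
# that together cover at least 50% of the repertoire.
d50 <- which(cumsum(sort(p, decreasing = TRUE)) >= 0.5)[1]

print(c(shannon = shannon, pielou = pielou, hill_q2 = hill_q2, d50 = d50))
```

For a perfectly even repertoire, Pielou evenness is 1 and the order-2 Hill number equals the number of clones, which makes this a convenient sanity check before interpreting real data.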

8) Public overlap

r
m_pub <- airr_public_intersection(idata_grp)
m_jac <- airr_public_jaccard(idata_grp)

m_pub |> vis()
m_jac |> vis()
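
For a single pair of repertoires, the two quantities above reduce to set operations. A minimal base-R sketch with made-up receptor sequences, assuming receptors are compared as exact strings:

```r
# Toy receptor sets for two repertoires (made-up sequences).
rep_a <- c("CASSA", "CASSB", "CASSC")
rep_b <- c("CASSB", "CASSC", "CASSD", "CASSE")

# Public (shared) receptors and the Jaccard index |A ∩ B| / |A ∪ B|.
shared  <- intersect(rep_a, rep_b)
jaccard <- length(shared) / length(union(rep_a, rep_b))

print(length(shared))
print(jaccard)
```

The intersection count depends on repertoire size, while the Jaccard index normalizes by the union, so the two matrices can rank sample pairs differently.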

9) Snapshot expensive steps

r
# Save intermediate immutable snapshot to avoid recomputing expensive transforms
idata_cached <- immundata::write_immundata(idata_mut, "path/to/snapshot_folder")

---

Common Pitfalls and Best Practices

  1. Changing receptor definition mid-analysis
    • Receptor schema is foundational. If receptor definition changes, re-read data into a new ImmunData.
  2. Using wrong mode for single-cell
    • Paired analysis requires barcode_col, locus_col, and typically umi_col.
    • Single-chain mode does not necessarily collapse multiple chains per barcode.
  3. Skipping metadata strategy
    • Prefer metadata-driven ingestion (path = "<metadata>") when file provenance and sample mapping matter.
  4. Direct internal-table edits
    • Avoid low-level manual edits. Use high-level verbs (filter, annotate, mutate, agg_repertoires).
  5. Ignoring immutable pipeline behavior
    • Each step returns a new object; persist expensive steps with snapshots.
  6. Non-canonical columns
    • In scripts, rely on canonical imd_* columns.
    • In package code, prefer schema keys/aliases (e.g., via imd_schema()).
  7. Distance-heavy transforms on large data
    • Pattern/Levenshtein operations can be expensive; run once, snapshot, then reuse.

Bundled Resources

TBD

Additional Resources

Tips for Effective Analysis

  • Start with a small subset and verify schema + grouping before scaling up.
  • Print object summaries after ingestion and after major transforms.
  • Use explicit variable names for stages (idata_raw, idata_qc, idata_ann, idata_stats).
  • Prefer pipelines that can be re-executed end-to-end from raw inputs.
  • Keep biologically meaningful grouping variables in repertoire schema early.
  • Use vis() early and often for sanity checks before formal interpretation.