tooluniverse-multi-omics-integration

Multi-Omics Integration

Coordinate and integrate multiple omics datasets for comprehensive systems biology analysis. This skill orchestrates specialized ToolUniverse skills to perform cross-omics correlation, multi-omics clustering, pathway-level integration, and unified interpretation across molecular layers.

When to Use This Skill

Triggers:

User has multiple omics datasets (RNA-seq + proteomics, methylation + expression, etc.)
Requests for integrative multi-omics analysis
Cross-omics correlation queries (e.g., "How does methylation affect expression?")
Multi-omics biomarker discovery
Systems biology questions requiring multiple molecular layers
Precision medicine applications with multi-omics patient data
Questions about molecular mechanisms across omics types

Example Questions This Skill Solves:

"Integrate RNA-seq and proteomics data to find genes with concordant changes"
"How does promoter methylation correlate with gene expression?"
"Perform multi-omics clustering to identify patient subtypes"
"Which pathways are dysregulated across transcriptome, proteome, and metabolome?"
"Find multi-omics biomarkers for disease classification"
"Correlate CNV with gene expression to identify dosage effects"
"Integrate GWAS variants, eQTLs, and expression data"
"Perform MOFA+ analysis on multi-omics cancer data"

Core Capabilities

| Capability | Description |
| --- | --- |
| Data Integration | Match samples across omics, handle missing data, normalize scales |
| Cross-Omics Correlation | Correlate features across molecular layers (gene expression vs protein, methylation vs expression) |
| Multi-Omics Clustering | MOFA+, NMF, joint clustering to identify omics-driven subtypes |
| Pathway Integration | Combine omics evidence at pathway level for unified biological interpretation |
| Biomarker Discovery | Identify multi-omics signatures with improved predictive power |
| Skill Coordination | Orchestrate RNA-seq, epigenomics, variant-analysis, protein-interactions, gene-enrichment skills |
| Visualization | Circos plots, integrated heatmaps, network visualizations |
| Reporting | Unified multi-omics reports with cross-layer insights |

Workflow Overview

Input: Multiple Omics Datasets
    |
    v
Phase 1: Data Loading & QC
    |-- Load RNA-seq (expression matrix)
    |-- Load proteomics (protein abundance)
    |-- Load methylation (beta values or M-values)
    |-- Load variants (CNV, SNV from VCF)
    |-- Load metabolomics (metabolite abundance)
    |-- Quality control per omics type
    |
    v
Phase 2: Sample Matching
    |-- Match samples across omics by ID
    |-- Identify common samples
    |-- Handle batch effects
    |-- Normalize sample identifiers
    |
    v
Phase 3: Feature Mapping
    |-- Map features to common identifier space (genes, proteins, metabolites)
    |-- Link CpG sites to genes (promoter, gene body)
    |-- Map variants to genes
    |-- Create unified feature matrix
    |
    v
Phase 4: Cross-Omics Correlation
    |-- Gene expression vs protein abundance (translation efficiency)
    |-- Promoter methylation vs expression (epigenetic regulation)
    |-- CNV vs expression (dosage effect)
    |-- eQTL variants vs expression (genetic regulation)
    |-- Metabolite vs enzyme expression (metabolic flux)
    |
    v
Phase 5: Multi-Omics Clustering
    |-- MOFA+ (Multi-Omics Factor Analysis) for latent factors
    |-- NMF (Non-negative Matrix Factorization) for patient subtypes
    |-- Joint clustering across omics
    |-- Identify omics-specific vs shared variation
    |
    v
Phase 6: Pathway-Level Integration
    |-- Aggregate omics to pathway level
    |-- Score pathway dysregulation (combined evidence)
    |-- Use ToolUniverse enrichment tools (Reactome, KEGG, GO)
    |-- Identify driver pathways across omics
    |
    v
Phase 7: Biomarker Discovery
    |-- Feature selection across omics
    |-- Multi-omics signatures for classification
    |-- Cross-validation and performance
    |-- Interpretation and biological validation
    |
    v
Phase 8: Generate Integrated Report
    |-- Summary statistics per omics
    |-- Cross-omics correlation results
    |-- Multi-omics clusters and subtypes
    |-- Top dysregulated pathways
    |-- Multi-omics biomarkers
    |-- Biological interpretation

Phase Details

Phase 1: Data Loading & Quality Control

Objective: Load multiple omics datasets and perform quality control.

Supported omics types:

Transcriptomics: RNA-seq count matrices, microarray
Proteomics: Protein abundance (MS-based)
Epigenomics: Methylation (450K, EPIC arrays, WGBS), ChIP-seq peaks
Genomics: CNV, SNV, structural variants
Metabolomics: Metabolite abundance (targeted, untargeted)

Data formats:

Expression: CSV/TSV matrices, HDF5, AnnData (.h5ad)
Proteomics: MaxQuant output, Spectronaut, DIA-NN
Methylation: IDAT files, beta value matrices
Variants: VCF, SEG files (CNV)
Metabolomics: Peak tables, identified metabolites

Quality control per omics:

RNA-seq QC

Filter low-count genes (mean counts < threshold)
Normalize (TPM, FPKM, or DESeq2)
Log-transform for correlation

Proteomics QC

Filter proteins with high missing values
Impute missing values (minimum, KNN)
Normalize (median, quantile)

Methylation QC

Remove failed probes
Correct for batch effects (ComBat)
Filter cross-reactive probes

Variants QC

Use variant-analysis skill for VCF QC
CNV segmentation validation
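
The RNA-seq and proteomics QC steps listed above can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the function names, the default thresholds, and the use of counts-per-million scaling (swapped in for the TPM, FPKM, or DESeq2 normalization named above, because it needs no gene lengths or size-factor model) are all assumptions.

```python
import numpy as np
import pandas as pd


def qc_rnaseq_counts(counts, min_mean_count=10.0):
    """Filter low-count genes, scale to counts per million, log2-transform.

    counts: DataFrame (genes x samples) of raw counts.
    min_mean_count: hypothetical threshold on the mean count per gene.
    """
    # Keep genes whose mean count across samples reaches the threshold
    kept = counts.loc[counts.mean(axis=1) >= min_mean_count]
    # Counts per million: divide each sample column by its library size
    cpm = kept / kept.sum(axis=0) * 1e6
    # The pseudocount of 1 keeps zero counts finite after the log
    return np.log2(cpm + 1.0)


def filter_missing_proteins(abundance, max_missing_fraction=0.5):
    """Drop proteins whose fraction of missing values exceeds the cutoff.

    abundance: DataFrame (proteins x samples) with NaN for missing values.
    """
    missing_fraction = abundance.isna().mean(axis=1)
    return abundance.loc[missing_fraction <= max_missing_fraction]
```

Imputation (minimum or KNN) and median or quantile normalization would follow the missing-value filter; they are left out here to keep the sketch short.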

Phase 2: Sample Matching

Objective: Identify common samples across omics datasets.

Sample ID harmonization:

```python
def match_samples_across_omics(omics_data_dict):
    """
    Match samples across multiple omics datasets.

    Parameters:
    omics_data_dict: {
        'rnaseq': DataFrame (genes x samples),
        'proteomics': DataFrame (proteins x samples),
        'methylation': DataFrame (CpGs x samples),
        'cnv': DataFrame (genes x samples)
    }

    Returns:
    - common_samples: List of sample IDs present in all omics
    - matched_data: Dict of DataFrames with common samples only
    """
    # Extract sample IDs from each omics
    sample_ids = {
        omics_type: set(df.columns)
        for omics_type, df in omics_data_dict.items()
    }

    # Find common samples (intersection)
    common_samples = set.intersection(*sample_ids.values())

    # Subset each omics to common samples
    matched_data = {
        omics_type: df[sorted(common_samples)]
        for omics_type, df in omics_data_dict.items()
    }

    return sorted(common_samples), matched_data
```

Handling missing omics:

Pairwise integration if not all samples have all omics
Document sample availability matrix
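
A sample availability matrix, and the pairwise fallback mentioned above, might look like the sketch below. The function names `sample_availability_matrix` and `pairwise_common_samples` are hypothetical; the input is assumed to be the same dictionary of features x samples DataFrames used by `match_samples_across_omics`.

```python
import pandas as pd


def sample_availability_matrix(omics_data_dict):
    """Return a boolean DataFrame (samples x omics): True where data exists."""
    # Sample IDs per omics layer, taken from the DataFrame columns
    sample_sets = {name: set(df.columns) for name, df in omics_data_dict.items()}
    # Union of all sample IDs seen in any layer
    all_samples = sorted(set().union(*sample_sets.values()))
    return pd.DataFrame(
        {name: [s in ids for s in all_samples] for name, ids in sample_sets.items()},
        index=all_samples,
    )


def pairwise_common_samples(availability, omics_a, omics_b):
    """List the samples that have data in both of the two named layers."""
    both = availability[omics_a] & availability[omics_b]
    return availability.index[both].tolist()
```

Summing the matrix by column gives the sample count per layer, which is a convenient line to include in the final report.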

Phase 3: Feature Mapping

Objective: Map features from different omics to common gene-level identifiers.

Gene-centric integration:

```python
# Map all features to genes
feature_mapping = {
    'rnaseq': 'gene_symbol',        # Already gene-level
    'proteomics': 'gene_symbol',    # Map protein to gene
    'methylation': 'gene_symbol',   # Map CpG to gene (promoter)
    'cnv': 'gene_symbol',           # CNV regions to overlapping genes
    'metabolomics': 'enzyme_gene'   # Metabolite to enzyme gene
}
```


**CpG to gene mapping**:
- **Promoter methylation**: CpGs within TSS ± 2kb
- **Gene body methylation**: CpGs within gene boundaries
- Average methylation per gene (weighted by probe coverage)

**CNV to gene mapping**:
- Use variant-analysis skill to identify genes in CNV regions
- Calculate copy number per gene (log2 ratio)
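
The promoter rule above (CpGs within TSS ± 2kb, averaged per gene) can be sketched as below. The annotation layout is an assumption: `cpg_pos` and `tss` are hypothetical DataFrames with the columns shown in the docstring. For brevity this uses a plain mean over probes rather than the coverage-weighted mean described above.

```python
import pandas as pd


def promoter_methylation_per_gene(beta, cpg_pos, tss, window=2000):
    """Average promoter beta values per gene.

    beta:    DataFrame (CpGs x samples) of beta values.
    cpg_pos: DataFrame indexed by CpG ID with columns 'chrom' and 'pos'.
    tss:     DataFrame indexed by gene with columns 'chrom' and 'tss'.
    window:  half-width of the promoter window in base pairs.
    """
    rows = {}
    for gene, info in tss.iterrows():
        # CpGs on the same chromosome and within the window around the TSS
        same_chrom = cpg_pos['chrom'] == info['chrom']
        near_tss = (cpg_pos['pos'] - info['tss']).abs() <= window
        probes = cpg_pos.index[same_chrom & near_tss].intersection(beta.index)
        if len(probes) > 0:
            rows[gene] = beta.loc[probes].mean(axis=0)
    # Dict of per-gene Series becomes a genes x samples DataFrame
    return pd.DataFrame(rows).T
```

Genes with no probe in the window are omitted from the result, so downstream correlation code should intersect gene indices before comparing layers.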


Phase 4: Cross-Omics Correlation

Objective: Correlate features across molecular layers to understand regulation.

Example analyses:

4.1: Expression vs Protein (Translation Efficiency)

```python
from scipy.stats import spearmanr


def correlate_rna_protein(rnaseq_data, proteomics_data):
    """
    Correlate mRNA and protein levels for each gene.

    Both inputs are DataFrames (genes x samples) with samples in the same
    column order (see Phase 2).

    Expected: Positive correlation (r ~ 0.4-0.6 typical)
    Discordance indicates post-transcriptional regulation
    """
    # Find common genes (sorted so the output order is reproducible)
    common_genes = sorted(set(rnaseq_data.index) & set(proteomics_data.index))

    correlations = {}
    for gene in common_genes:
        rna = rnaseq_data.loc[gene]
        protein = proteomics_data.loc[gene]

        # Spearman correlation (robust to outliers)
        r, p = spearmanr(rna, protein)
        correlations[gene] = {'r': r, 'p': p}

    # Identify discordant genes (low RNA-protein correlation)
    discordant = {g: v for g, v in correlations.items() if abs(v['r']) < 0.2}

    return correlations, discordant
```

4.2: Methylation vs Expression (Epigenetic Regulation)

```python
from scipy.stats import spearmanr


def correlate_methylation_expression(methylation_data, rnaseq_data):
    """
    Correlate promoter methylation with gene expression.

    Both inputs are DataFrames (genes x samples) with samples in the same
    column order (see Phase 2).

    Expected: Negative correlation (increased methylation → decreased expression)
    """
    # For each gene with promoter methylation
    results = {}
    for gene in methylation_data.index:
        if gene in rnaseq_data.index:
            meth = methylation_data.loc[gene]  # Average promoter beta
            expr = rnaseq_data.loc[gene]

            r, p = spearmanr(meth, expr)
            results[gene] = {'r': r, 'p': p, 'direction': 'repressive' if r < 0 else 'activating'}

    # Identify genes with strong methylation-expression anticorrelation
    regulated = {g: v for g, v in results.items() if v['r'] < -0.5 and v['p'] < 0.01}

    return results, regulated
```

4.3: CNV vs Expression (Dosage Effect)

```python
from scipy.stats import pearsonr


def correlate_cnv_expression(cnv_data, rnaseq_data):
    """
    Correlate copy number with gene expression.

    Both inputs are DataFrames (genes x samples) with samples in the same
    column order (see Phase 2).

    Expected: Positive correlation (gene dosage effect)
    """
    results = {}
    for gene in cnv_data.index:
        if gene in rnaseq_data.index:
            cnv = cnv_data.loc[gene]  # log2 ratio
            expr = rnaseq_data.loc[gene]

            r, p = pearsonr(cnv, expr)
            results[gene] = {'r': r, 'p': p}

    # Genes with dosage effect (CNV drives expression)
    dosage_genes = {g: v for g, v in results.items() if v['r'] > 0.5 and v['p'] < 0.01}

    return results, dosage_genes
```
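
The correlation functions in this phase threshold raw p-values, yet they test thousands of genes at once. One common way to account for that is a Benjamini-Hochberg adjustment, sketched below. This step is not part of the original workflow; `benjamini_hochberg` and `add_fdr` are hypothetical helper names, and the `results` argument is assumed to have the `{'gene': {'r': ..., 'p': ...}}` shape returned by the functions above.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def add_fdr(results):
    """Attach an adjusted p-value 'q' to every gene in a results dict."""
    genes = list(results)
    q_values = benjamini_hochberg([results[g]['p'] for g in genes])
    return {g: {**results[g], 'q': float(q)} for g, q in zip(genes, q_values)}
```

With this in place, a filter such as `v['q'] < 0.05` could stand in for the raw `v['p'] < 0.01` cutoff; which threshold is appropriate depends on the study.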

Phase 5: Multi-Omics Clustering

Objective: Identify patient subtypes using integrated omics data.

**Method 1: MOFA+ (Multi-Omics Factor Analysis)**

MOFA+ identifies latent factors that explain variation across omics.

Conceptual workflow (uses R's MOFA2 package or its Python implementation):

1. Prepare multi-omics data as a list of matrices
2. Run MOFA+ to identify factors
3. Inspect factor variance explained per omics
4. Cluster samples based on factor scores

Example interpretation:

- Factor 1: Explains 40% of variance in RNA-seq and 30% in proteomics → cell proliferation
- Factor 2: Explains 50% of variance in methylation → epigenetic subtype
- Factor 3: Explains 20% of variance in CNV → genomic instability
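
Steps 3 and 4 of the conceptual workflow can be illustrated without calling MOFA+ itself. The sketch below assumes the factor scores `Z` (samples x factors) and the loadings `W` for one omics layer (features x factors) have already been exported from a fitted model; it does not use the MOFA2 API. The function names are hypothetical, and the variance-explained formula (one minus residual over total sum of squares on feature-centered data) is a standard R² definition used here as an assumption about what "variance explained" means.

```python
import numpy as np
from sklearn.cluster import KMeans


def variance_explained_per_factor(Y, Z, W):
    """Fraction of variance in one omics layer explained by each factor.

    Y: array (samples x features), centered per feature.
    Z: array (samples x factors) of factor scores.
    W: array (features x factors) of loadings for this layer.
    """
    total = np.sum(Y ** 2)
    r2 = []
    for k in range(Z.shape[1]):
        # Reconstruction of Y using factor k alone
        residual = Y - np.outer(Z[:, k], W[:, k])
        r2.append(1.0 - np.sum(residual ** 2) / total)
    return np.array(r2)


def cluster_on_factors(Z, n_clusters=3, seed=0):
    """Assign samples to clusters using k-means on the factor scores."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(Z)
```

Running `variance_explained_per_factor` once per omics layer yields the kind of per-layer percentages shown in the example interpretation.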


**Method 2: Joint NMF (Non-negative Matrix Factorization)**

Decompose multi-omics matrices into shared latent components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF


def joint_nmf_clustering(omics_data_dict, n_clusters=3):
    """
    Perform joint NMF across omics for clustering.

    Each matrix is features x samples with samples in the same column order,
    and all values must be non-negative (NMF cannot factor negative entries).

    Returns patient cluster assignments based on shared factors.
    """
    # Concatenate omics matrices along the feature axis (after normalization)
    combined_matrix = np.vstack([
        omics_data_dict['rnaseq'].values,
        omics_data_dict['proteomics'].values,
        omics_data_dict['methylation'].values
    ])

    # Run NMF
    model = NMF(n_components=n_clusters, init='nndsvd', random_state=42)
    W = model.fit_transform(combined_matrix)  # Feature loadings (features x components)
    H = model.components_  # Sample coefficients (components x samples)

    # Cluster samples based on H (one row of H.T per sample)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(H.T)

    return clusters, W, H
```

**Method 3: Similarity Network Fusion (SNF)**

Integrate omics through patient similarity networks.
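
The idea of clustering on patient similarity networks can be illustrated with a deliberately simplified stand-in: build one RBF affinity matrix per omics layer, average them, and run spectral clustering on the result. This is not the full SNF algorithm, which fuses the networks through iterative cross-diffusion; plain averaging is an assumption made to keep the example short, and the function name and kernel settings are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel


def averaged_similarity_clustering(views, n_clusters=2, seed=0):
    """Cluster patients on the mean of per-omics similarity matrices.

    views: list of arrays (samples x features), rows in the same sample order.
    Returns (labels, fused) where fused is the averaged affinity matrix.
    """
    affinities = []
    for X in views:
        X = np.asarray(X, dtype=float)
        # Z-score each feature so layers on different scales are comparable
        std = X.std(axis=0)
        std[std == 0] = 1.0
        Xz = (X - X.mean(axis=0)) / std
        # Patient-by-patient RBF similarity for this layer
        affinities.append(rbf_kernel(Xz, gamma=1.0 / Xz.shape[1]))
    fused = np.mean(affinities, axis=0)
    model = SpectralClustering(
        n_clusters=n_clusters, affinity='precomputed', random_state=seed
    )
    return model.fit_predict(fused), fused
```

Because each layer contributes its own similarity matrix, layers with very different feature counts carry equal weight, which is one motivation for network-based integration over simple concatenation.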


Phase 6: Pathway-Level Integration

Objective: Aggregate multi-omics evidence at the pathway level.

Approach: Score pathway dysregulation using combined evidence from multiple omics.

```python
import numpy as np


def integrate_pathway_evidence(omics_results, pathway_genes):
    """
    Score pathway dysregulation across omics.

    omics_results: {
        'rnaseq': {'gene': fold_change},
        'proteomics': {'gene': fold_change},
        'methylation': {'gene': methylation_diff},
        'cnv': {'gene': copy_number}
    }

    pathway_genes: List of genes in pathway
    """
    layers = ('rnaseq', 'proteomics', 'methylation', 'cnv')
    pathway_scores = []
    layers_with_evidence = set()

    # For each gene in pathway
    for gene in pathway_genes:
        gene_score = 0
        evidence_count = 0

        # Add the absolute effect from every layer that measured this gene.
        # abs() discards direction, so methylation (expected to correlate
        # negatively with expression) counts the same as the other layers.
        for layer in layers:
            layer_results = omics_results.get(layer, {})
            if gene in layer_results:
                gene_score += abs(layer_results[gene])
                evidence_count += 1
                layers_with_evidence.add(layer)

        if evidence_count > 0:
            pathway_scores.append(gene_score / evidence_count)

    # Aggregate pathway score (mean of gene scores)
    pathway_score = np.mean(pathway_scores) if pathway_scores else 0

    return {
        'pathway_score': pathway_score,
        'n_genes_with_evidence': len(pathway_scores),
        # Distinct layers that contributed evidence for any pathway gene
        'n_omics_types': len(layers_with_evidence)
    }
```
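
The pathway score averages fold changes, methylation differences, and copy numbers directly, although those quantities live on different scales. One possible pre-processing step, sketched here as an assumption rather than as part of the original method, is to z-score each layer's values before scoring so that no single layer dominates. The name `standardize_evidence` is hypothetical; the input has the same shape as `omics_results` above.

```python
import numpy as np


def standardize_evidence(omics_results):
    """Z-score the per-gene values within each omics layer.

    omics_results: {'layer': {'gene': value}}
    Returns a dict of the same shape with standardized values.
    """
    scaled = {}
    for layer, values in omics_results.items():
        genes = list(values)
        x = np.array([values[g] for g in genes], dtype=float)
        sd = x.std()
        # A layer with zero spread carries no ranking information
        z = (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)
        scaled[layer] = dict(zip(genes, z.tolist()))
    return scaled
```

The standardized dictionary can then be passed to the scoring function in place of the raw values. Note that centering changes the meaning of zero, so this suits ranking pathways against each other more than interpreting absolute scores.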

Use ToolUniverse enrichment tools:

```python
# Get pathways for gene set
from tooluniverse import ToolUniverse
tu = ToolUniverse()

# Enrichment for genes dysregulated in ANY omics
all_dysregulated_genes = set()
all_dysregulated_genes.update(rnaseq_degs)
all_dysregulated_genes.update(diff_proteins)
all_dysregulated_genes.update(methylation_dmgs)

# Run enrichment
enrichment = tu.run_one_function({
    "name": "enrichr_enrich",
    "arguments": {
        "gene_list": ",".join(all_dysregulated_genes),
        "library": "KEGG_2021_Human"
    }
})

# Score each pathway with multi-omics evidence
for pathway in enrichment['data']['results']:
    pathway_genes = pathway['genes']
    pathway['multi_omics_score'] = integrate_pathway_evidence(
        omics_results, pathway_genes
    )
```

Phase 7: Biomarker Discovery

Objective: Identify multi-omics signatures for disease classification.

Feature selection across omics:

```python
def select_multiomics_features(X_dict, y, n_features=50):
    """
    Select top features across omics for classification.

    X_dict: {
        'rnaseq': DataFrame (samples x genes),
        'proteomics': DataFrame (samples x proteins),
        'methylation': DataFrame (samples x CpGs)
    }
    y: Target labels (disease vs control)

    Returns: Selected features per omics
    """
    from sklearn.feature_selection import SelectKBest, f_classif

    selected_features = {}
    for omics_type, X in X_dict.items():
        selector = SelectKBest(f_classif, k=min(n_features, X.shape[1]))
        selector.fit(X, y)

        # Get selected feature names
        selected_idx = selector.get_support()
        selected_features[omics_type] = X.columns[selected_idx].tolist()

    return selected_features
```

Multi-omics classification:

```python
import pandas as pd


def multiomics_classification(X_dict, y, selected_features):
    """
    Train classifier using multi-omics features.
    """
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Concatenate selected features from each omics
    X_combined = []
    for omics_type, features in selected_features.items():
        X_combined.append(X_dict[omics_type][features])

    X_combined = pd.concat(X_combined, axis=1)

    # Train classifier
    # Note: if selected_features was chosen using all samples, this
    # cross-validated AUC is optimistically biased.
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_combined, y, cv=5, scoring='roc_auc')

    return {
        'mean_auc': scores.mean(),
        'std_auc': scores.std(),
        'n_features': X_combined.shape[1],
        'features_per_omics': {k: len(v) for k, v in selected_features.items()}
    }
```
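
When features are selected on the full dataset and the classifier is then cross-validated on the same samples, the held-out folds have already influenced the selection, which tends to inflate the AUC. A common remedy is to put the selection step inside the cross-validation loop. The sketch below shows one way to do that with a scikit-learn `Pipeline`; the function name `nested_selection_auc` and its defaults are assumptions, and it selects from a single combined matrix rather than per omics layer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def nested_selection_auc(X_combined, y, n_features=50, n_splits=5, seed=42):
    """Cross-validated AUC with feature selection refit inside each fold.

    X_combined: array or DataFrame (samples x features) across all omics.
    y: binary target labels.
    """
    k = min(n_features, X_combined.shape[1])
    pipeline = Pipeline([
        # Selection is fit on the training part of each fold only
        ('select', SelectKBest(f_classif, k=k)),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=seed)),
    ])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(pipeline, X_combined, y, cv=cv, scoring='roc_auc')
    return scores.mean(), scores.std()
```

To keep a fixed number of features per omics layer, the single `SelectKBest` step could be replaced by a `ColumnTransformer` with one selector per layer; that variant is not shown here.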

Phase 8: Integrated Reporting

Generate comprehensive multi-omics report:

Multi-Omics Integration Report

Dataset Summary

Omics Types: RNA-seq, Proteomics, Methylation, CNV
Common Samples: 45 patients (30 disease, 15 control)
Features: 15,000 genes, 5,000 proteins, 450K CpGs, 20K CNV regions

Cross-Omics Correlation

RNA-Protein Correlation

Overall correlation: r = 0.52 (expected: 0.4-0.6)
Highly correlated: 3,245 genes (45%)
Discordant genes: 890 genes (post-transcriptional regulation)

Methylation-Expression

Promoter methylation: Anticorrelation r = -0.41
Epigenetically regulated genes: 1,256 genes (p < 0.01)
Example: BRCA1 promoter hypermethylation → 3-fold reduced expression

CNV-Expression Dosage Effect

Genes with dosage effect: 445 genes (r > 0.5, p < 0.01)
Example: MYC amplification (3 copies) → 2.8-fold increased expression

Multi-Omics Clustering

MOFA+ Analysis

Factor 1 (25% variance): Cell cycle genes (RNA + protein)
Factor 2 (18% variance): Immune signature (RNA + methylation)
Factor 3 (15% variance): Metabolic reprogramming (RNA + metabolites)

Patient Subtypes

Subtype 1 (n=18): High proliferation, MYC amplification
Subtype 2 (n=15): Immune-enriched, hypomethylation
Subtype 3 (n=12): Metabolic dysregulation, mitochondrial dysfunction

Pathway Integration

Top Dysregulated Pathways (Multi-Omics Score)

Cell Cycle (score: 8.5) - RNA (↑), Protein (↑), CNV (amplification)
Immune Response (score: 7.2) - RNA (↑), Methylation (hypo)
Glycolysis (score: 6.8) - RNA (↑), Metabolites (↑)

Multi-Omics Biomarkers

Classification Performance

AUC: 0.92 ± 0.04 (5-fold CV)
Features: 50 total (20 RNA, 15 protein, 10 methylation, 5 CNV)
Top biomarkers:
- MYC expression (RNA)
- CDK1 protein abundance
- BRCA1 promoter methylation
- TP53 CNV status

Biological Interpretation

The multi-omics analysis reveals three distinct disease subtypes driven by different molecular mechanisms:

Proliferative subtype: Characterized by MYC amplification driving coordinated upregulation of cell cycle genes at both RNA and protein levels.
Immune subtype: Hypomethylation of immune genes leading to increased expression and T-cell infiltration.
Metabolic subtype: Shift from oxidative phosphorylation to glycolysis, with concordant changes in enzyme expression and metabolite levels.

These subtypes may respond differently to targeted therapies.

---

ToolUniverse Skills Coordination

This skill orchestrates multiple specialized skills:

Skill	Used For	Phase
`tooluniverse-rnaseq-deseq2`	Load and analyze RNA-seq data	Phase 1, 4
`tooluniverse-epigenomics`	Methylation analysis, ChIP-seq peaks	Phase 1, 4
`tooluniverse-variant-analysis`	CNV and SNV processing	Phase 1, 3, 4
`tooluniverse-protein-interactions`	Protein network context	Phase 6
`tooluniverse-gene-enrichment`	Pathway enrichment	Phase 6
`tooluniverse-expression-data-retrieval`	Public omics data retrieval	Phase 1
`tooluniverse-target-research`	Gene/protein annotation	Phase 3, 8

Example Use Cases

Use Case 1: Cancer Multi-Omics

Question: "Integrate TCGA breast cancer RNA-seq, proteomics, methylation, and CNV data"

Workflow:

Load 4 omics types for 500 patients
Match samples (450 common across all omics)
Correlate RNA-protein (identify translation-regulated genes)
Correlate methylation-expression (find epigenetically silenced genes)
Correlate CNV-expression (identify dosage-sensitive genes)
Run MOFA+ to find latent factors
Identify 4 subtypes with distinct multi-omics profiles
Perform pathway enrichment per subtype
Select multi-omics biomarkers (AUC=0.94)
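The sample-matching and RNA-protein correlation steps above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the nested-dict layout (`sample_id -> {gene: value}`), the helper names `common_samples` and `rna_protein_correlation`, and the toy values are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / sqrt(vx * vy)


def common_samples(*layers):
    """Sample IDs present in every omics layer, in sorted order."""
    shared = set(layers[0])
    for layer in layers[1:]:
        shared &= set(layer)
    return sorted(shared)


def rna_protein_correlation(rna, protein, gene):
    """Correlate one gene's RNA and protein levels over the shared samples."""
    samples = common_samples(rna, protein)
    x = [rna[s][gene] for s in samples]
    y = [protein[s][gene] for s in samples]
    return pearson(x, y)


# Hypothetical toy data: each layer maps sample_id -> {gene: value}.
rna = {
    "P1": {"MYC": 1.0}, "P2": {"MYC": 2.0},
    "P3": {"MYC": 3.0}, "P4": {"MYC": 4.0},
}
protein = {
    "P2": {"MYC": 2.1}, "P3": {"MYC": 2.9},
    "P4": {"MYC": 4.2}, "P5": {"MYC": 0.5},
}

shared = common_samples(rna, protein)            # only P2, P3, P4 overlap
r = rna_protein_correlation(rna, protein, "MYC")
print(shared, round(r, 3))
```

Under this sketch, genes whose RNA and protein levels correlate weakly across many samples would be the candidates for translational or post-translational regulation. In practice the correlation would be computed per gene over the full matched cohort, with multiple-testing correction.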

Use Case 2: eQTL + Expression

Question: "How do GWAS variants affect gene expression through methylation?"

Workflow:

Load genotype data (SNPs from GWAS)
Load expression data (RNA-seq)
Load methylation data (450K array)
For each GWAS SNP:
- Test association with nearby gene expression (eQTL)
- Test association with nearby CpG methylation (meQTL)
- Test CpG-gene correlation
Identify SNP → methylation → expression regulatory chains
Annotate with ToolUniverse (GWAS traits, gene function)
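As a rough sketch of the per-SNP tests above, the three associations can be computed for a single SNP, CpG, and gene triplet. Plain Pearson correlation stands in here for a proper eQTL/meQTL regression model, and the function name `regulatory_chain`, the `min_abs_r` cutoff, and the toy values are assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / sqrt(vx * vy)


def regulatory_chain(genotype, methylation, expression, min_abs_r=0.5):
    """Test the three pairwise links for one SNP / CpG / gene triplet.

    genotype: allele dosages (0, 1, 2) per sample
    methylation: CpG beta values per sample
    expression: gene expression values per sample
    min_abs_r: hypothetical cutoff; a real analysis would use p-values
    """
    links = {
        "snp_expression": pearson(genotype, expression),    # eQTL-like test
        "snp_methylation": pearson(genotype, methylation),  # meQTL-like test
        "cpg_expression": pearson(methylation, expression),
    }
    supported = all(abs(r) >= min_abs_r for r in links.values())
    return {**links, "candidate_chain": supported}


# Hypothetical toy data for six samples.
genotype = [0, 0, 1, 1, 2, 2]
methylation = [0.80, 0.75, 0.55, 0.60, 0.30, 0.35]
expression = [2.0, 2.2, 3.1, 2.9, 4.0, 4.3]

result = regulatory_chain(genotype, methylation, expression)
print(result)
```

A triplet flagged as `candidate_chain` is only consistent with a SNP → methylation → expression relationship. Establishing mediation would need covariate-adjusted models and a formal mediation or colocalization analysis, which this sketch does not attempt.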

Use Case 3: Drug Response Multi-Omics

Question: "Predict drug response using multi-omics profiles"

Workflow:

Load baseline multi-omics (pre-treatment)
Load drug response data (IC50 or clinical response)
Correlate each omics with response
Select multi-omics features predictive of response
Train multi-omics classifier
Identify pathways associated with resistance/sensitivity
Use the ToolUniverse drug-repurposing skill to find alternative treatment options
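The "correlate each omics with response" and feature-selection steps above might look like the sketch below, which ranks features from any layer by the absolute Spearman correlation with a continuous response such as IC50. The `LAYER:GENE` feature naming, the helper `rank_features`, and all values are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_features(features, response, top_n=None):
    """Sort (feature, rho) pairs by decreasing |rho| against the response."""
    scored = [(name, spearman(vals, response)) for name, vals in features.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_n]


# Hypothetical toy data: five samples, features pooled across omics layers.
ic50 = [0.1, 0.5, 1.2, 2.0, 3.5]
features = {
    "RNA:MYC": [1.0, 1.5, 2.2, 2.8, 4.0],
    "PROT:CDK1": [3.0, 2.5, 2.6, 1.0, 0.8],
    "METH:BRCA1": [0.5, 0.2, 0.6, 0.3, 0.4],
}

ranked = rank_features(features, ic50)
for name, rho in ranked:
    print(f"{name}\t{rho:+.2f}")
```

If the selected features feed a classifier, the ranking should be recomputed inside each cross-validation fold. Selecting features on the full dataset first leaks information and inflates the reported performance.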

Advanced Analysis Patterns

Pattern 1: Omics-Driven Patient Stratification

For precision medicine applications where patient stratification is the goal.

Pattern 2: Multi-Omics Network Analysis

Build integrated networks that combine protein-protein interactions (PPI), co-expression, and regulatory interactions.

Pattern 3: Temporal Multi-Omics

For longitudinal multi-omics data (time series or treatment response).

Pattern 4: Spatial Multi-Omics

Combine spatial transcriptomics with proteomics to study tissue architecture.

Quantified Minimums

Component	Requirement
Omics types	At least 2 omics datasets
Common samples	At least 10 samples shared across all omics layers
Cross-correlation	Pearson/Spearman correlation computed
Clustering	At least one method (MOFA+, NMF, or SNF)
Pathway integration	Enrichment with multi-omics evidence scores
Report	Summary, correlations, clusters, pathways, biomarkers
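The countable rows of this table can be checked before an analysis starts. The sketch below covers only the omics-type, common-sample, and clustering-method requirements; the function name `check_minimums` and its input layout are assumptions, and the remaining rows (correlation, pathway integration, report) are not checked here.

```python
ALLOWED_CLUSTERING = {"MOFA+", "NMF", "SNF"}  # methods named in the table
MIN_OMICS_TYPES = 2
MIN_COMMON_SAMPLES = 10


def check_minimums(layers, clustering_methods):
    """Check the countable minimum requirements.

    layers: {omics_name: iterable of sample IDs}
    clustering_methods: iterable of method names planned for the analysis
    Returns {requirement: bool}.
    """
    sample_sets = [set(samples) for samples in layers.values()]
    shared = set.intersection(*sample_sets) if sample_sets else set()
    return {
        "omics_types": len(layers) >= MIN_OMICS_TYPES,
        "common_samples": len(shared) >= MIN_COMMON_SAMPLES,
        "clustering": any(m in ALLOWED_CLUSTERING for m in clustering_methods),
    }


# Hypothetical cohort: 15 RNA-seq samples, 12 of which also have proteomics.
layers = {
    "rna": [f"S{i}" for i in range(15)],
    "protein": [f"S{i}" for i in range(3, 15)],
}
status = check_minimums(layers, ["MOFA+"])
print(status)
```

A failed check is best reported back to the user together with the observed counts, since 10 shared samples is only the floor and the Limitations section recommends n≥20.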

Limitations

Sample size: Multi-omics integration requires sufficient samples (n≥20 recommended)
Missing data: Some patients may not have all omics types
Batch effects: Different omics platforms/batches require careful normalization
Computational: Large multi-omics datasets may require significant memory/compute
Interpretation: Multi-omics results require domain expertise for biological validation

References

Methods:

MOFA+: https://doi.org/10.1186/s13059-020-02015-1
Similarity Network Fusion: https://doi.org/10.1038/nmeth.2810
Multi-omics review: https://doi.org/10.1038/s41576-019-0093-7

ToolUniverse Skills:

See individual skill documentation for omics-specific methods
